Data Management
Before beginning any research project, it is essential to identify and address key ethical considerations related to data. These include determining appropriate methods for data collection, clarifying data ownership, and establishing responsible practices for analyzing, interpreting, and sharing results. Effective oversight of data management requires a significant commitment from the Principal Investigator (PI), who must not only understand the fundamental principles of data management but also ensure that all team members are actively engaged in developing, implementing, and maintaining clear policies and procedures throughout the research process.
Plan and Design
What is a Data Management Plan?
A Data Management Plan (DMP) is a formal document that describes how research data will be collected, organized, stored, shared, and preserved throughout and after a project. Many funding agencies, such as NIH and NSF, require a DMP to be submitted with grant proposals and annual reports. Even when funding is not being pursued, creating a DMP is considered best practice. Planning and organizing your data from the outset can save significant time, money, and effort over the course of your research.
Plan processes and data resources for the whole project, from onboarding to project closure:
- Review applicable Data Policies.
- Develop standardized file naming conventions.
- Assign roles & responsibilities for oversight of data management and sharing.
- Consider active and short-term projects.
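A standardized file naming convention is easier to follow when it can be generated and checked automatically. The sketch below is one possible approach, not an FAU requirement: the pattern (project, description, date as YYYYMMDD, two-digit version) and the function names build_filename and is_valid_filename are illustrative assumptions that a team would adapt to its own policy.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYYMMDD>_v<NN>.<ext>
# The field order and separators are illustrative assumptions, not a mandated standard.
FILENAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{8})_"
    r"v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of other characters into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, description: str, day: date, version: int, ext: str) -> str:
    """Assemble a file name that follows the hypothetical convention above."""
    project_slug = slugify(project).replace("-", "")
    extension = ext.lower().lstrip(".")
    return f"{project_slug}_{slugify(description)}_{day:%Y%m%d}_v{version:02d}.{extension}"


def is_valid_filename(name: str) -> bool:
    """Return True if the name matches the hypothetical convention."""
    return FILENAME_PATTERN.match(name) is not None


example = build_filename("Sleep Study", "Baseline Survey", date(2024, 3, 5), 1, "CSV")
print(example)  # sleepstudy_baseline-survey_20240305_v01.csv
print(is_valid_filename(example))                # True
print(is_valid_filename("Final data (2).xlsx"))  # False
```

Writing the date as YYYYMMDD keeps files in chronological order when sorted by name, and a zero-padded version number avoids ambiguous labels such as "final" or "new".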
Collect and Analyze
Effective data and document management within a team or project requires implementing processes that are practical and accessible for all members. Clear and consistent data documentation is essential—it provides the context and details needed to accurately understand and interpret the data, both during the project and long after it concludes.
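One common form of data documentation is a data dictionary stored alongside each dataset. The following is a minimal sketch of how such a record could be generated from a CSV file. The function name build_data_dictionary, the field names in the output, and the sample data are all assumptions made for illustration, not a prescribed metadata standard.

```python
import csv
import io
import json


def build_data_dictionary(csv_text: str, descriptions: dict, title: str) -> dict:
    """Summarize a CSV dataset: row count plus one documented entry per column.

    Columns without a supplied description are flagged so the gap is visible.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "title": title,
        "row_count": len(rows),
        "variables": [
            {
                "name": column,
                "description": descriptions.get(column, "UNDOCUMENTED"),
                # Count empty or absent cells so readers know how complete the column is.
                "missing_values": sum(
                    1 for row in rows if not (row[column] or "").strip()
                ),
            }
            for column in columns
        ],
    }


# Made-up sample data for illustration only.
sample_csv = "participant_code,age,score\nP001,34,12\nP002,,15\n"
record = build_data_dictionary(
    sample_csv,
    descriptions={
        "participant_code": "Study-assigned code (no direct identifiers)",
        "age": "Age in years at enrollment",
    },
    title="Baseline survey (example)",
)
print(json.dumps(record, indent=2))
```

Saving the resulting JSON next to the data file gives later users the context the dataset itself does not carry, and the "UNDOCUMENTED" flag makes missing descriptions easy to find.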
Store and Evaluate
At every stage of the Data Lifecycle, effective data storage management is essential. Maintaining proper storage practices ensures that data remains secure and compliant with established safety standards. As research grows more complex, so do the requirements for storing and transmitting data. Strong data privacy and security measures are necessary to protect research subjects and safeguard sensitive, personally identifiable information.
When a project concludes, additional planning is required to prepare data for sharing and publication. This includes decisions about long-term storage—such as where data will be housed, how long it will be preserved, what should be retained, and to what extent data should be made available to support transparency and reproducibility.
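Before sharing, it can help to check whether any combination of indirect identifiers is rare enough to single out a participant (see Deductive Disclosure under DEFINITIONS). The sketch below counts how many records share each combination and reports those below a threshold. The function name small_cells, the field names, and the threshold k are assumptions for illustration; the appropriate threshold depends on the applicable policy and the dataset.

```python
from collections import Counter


def small_cells(records: list, indirect_identifiers: list, k: int = 5) -> dict:
    """Return each combination of indirect identifiers shared by fewer than k records."""
    counts = Counter(
        tuple(record.get(field) for field in indirect_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up records for illustration only.
records = [
    {"age_band": "30-39", "county": "A", "score": 12},
    {"age_band": "30-39", "county": "A", "score": 15},
    {"age_band": "80-89", "county": "B", "score": 9},
]
print(small_cells(records, ["age_band", "county"], k=2))
# {('80-89', 'B'): 1}
```

A flagged combination does not by itself mean the data cannot be shared. It marks records where coarser categories, suppression, or restricted access may need to be considered.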
Share and Publish
Why Share Data?
Sharing research data is crucial for turning scientific findings into actionable knowledge, innovations, and practices that benefit human health. The push for greater data sharing has accelerated, largely driven by policies from funding agencies and academic journals.
Reasons to share your data:
- Meet the requirements of research funders who mandate data management plans and open access to data
- Comply with journal policies that require submission of supporting datasets alongside manuscripts
- Easily locate and access your own data long after a project has ended
- Allow others to replicate your research and validate your findings
- Support new research by enabling others to perform secondary analyses with your data
- Receive recognition through data citation, which is becoming standard practice among publishers; many data repositories now provide citable references when data are deposited
Open Access
Open access refers to free, unrestricted online availability of scientific and scholarly research. There are two main ways to provide open access: through open access journals, which publish research articles freely available to readers, and open access repositories, which archive and share research outputs without paywalls.
Open data is data that can be freely accessed, used, modified, and shared by anyone — typically with minimal restrictions, such as attribution or share-alike requirements (Open Data Handbook). In scientific contexts, open scientific data specifically refers to the primary research data that are made available within or alongside scholarly publications.
Preprints & Publishing
Preprint repositories provide an opportunity to share manuscripts and working papers prior to journal publication.
- Preprint repositories allow researchers to deposit, discover and disseminate scholarship in the early stages of the research process.
- Preprint manuscripts have not yet gone through the traditional publisher-based peer-review system.
- Major preprint servers include a feedback forum where scholars can comment on, review, and transparently evaluate preprint manuscripts collectively online.
Selected Preprint Repositories for Health Science, Biomedical & Social Science research:
- bioRxiv: A free online archive and distribution service for preprints in the life sciences, operated by Cold Spring Harbor Laboratory.
- medRxiv: A free online archive and distribution server for complete but unpublished manuscripts in medical, clinical, and related health sciences.
- arXiv: An open-access archive of scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
- SSRN: Preprint repository representing disciplines across the full research spectrum, including the applied sciences, health sciences, humanities, life sciences, physical sciences, and social sciences.
Data Use Agreements (DUAs)
A Data Use Agreement (DUA) is a binding contract between organizations governing the transfer and use of data. The specific terms depend on applicable laws, regulations, and the policies of the data-providing organization. A DUA must be signed by an Institutional Authorized Signatory to be valid.
Images and information developed by Harvard Longwood Medical Area Research Data Management Working Group. Cioffi, M., Goldman, J., & Marchese, S. (2023). Harvard Biomedical Research Data Lifecycle (Version 5). Zenodo. https://doi.org/10.5281/zenodo.8076168
DEFINITIONS
Anonymous Data: Data to which no code or other identifier was ever assigned and which cannot be traced back to an individual. Florida Atlantic and some international standards consider IP addresses to be identifiable even though the address is linked to the computer and not specifically to the individual.
Archive: The transfer of material or data to a facility or electronic repository authorized to appraise, preserve, and provide access to those records.
Coded Data: Data collected with identifying information that has since been replaced with a code. The code enables linkage of the identifying information to the private information or specimens of a participant, so the researcher could readily ascertain the identity of the individual.
Confidential Data: Information disclosed with the expectation that it will not be divulged without permission to others in ways inconsistent with the understanding of the original disclosure.
Data Management Plan (DMP): Determines how data should be collected, normalized, processed, analyzed, preserved, used, and re-used over its lifetime. A data management plan associated with a research study can include comprehensive information such as the types of data, the metadata standards used, policies for access and sharing, and plans for archiving and preserving data so that it remains accessible over time. DMPs ensure data will be properly documented and available for use by researchers in the future and are often required by grant funding agencies such as the National Science Foundation.
Data Repository: A place to hold data, make data available for use, and organize data in a logical manner. An appropriate, subject-specific location where researchers can submit their data. Data repositories may have specific requirements concerning subject or research domain; data re-use and access; file format and data structure; and the types of metadata that can be used.
Data Security: Data security refers to ways data is kept safe from harm, alteration, or unauthorized access during gathering, analysis, storage, and transmission. Computer systems used to store data should have security measures such as firewalls, virus protection, and strong password protection.
Data Sharing: Data sharing makes scholarly research data available to other investigators. Many funding agencies, institutions, and publication venues have policies regarding data sharing because transparency and openness are considered important parts of scientific discovery. Currently, in the biomedical field, the National Science Foundation and the National Institutes of Health have implemented data sharing policies that either expect or require scientific researchers to share their data.
Deductive Disclosure: Identification of an individual using known characteristics of that individual. Even though direct identifiers (e.g., names, addresses) are removed from the data, it may be possible to identify respondents with unique characteristics.
De-Identified Data: Data in which the identity of an individual cannot be readily ascertained via direct or indirect identifiers.
Electronic Laboratory Notebook: A software tool that in its most basic form replicates an interface much like a page in a paper lab notebook. In an electronic notebook, one can enter protocols, observations, notes, and other data using a computer or mobile device.
Individually Identifiable Data: Information that can be used to identify a specific individual. This can include direct identifiers like name or social security number, or indirect identifiers like date of birth or address, especially when combined with other information.
Metadata: Structured information about a resource that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage that resource. It ensures that the context for how data was created, analyzed, or stored is clear, detailed, and therefore, reproducible.
Protected Health Information (PHI): Individually identifiable health information collected from an individual. PHI encompasses information that identifies an individual, or might reasonably be used to identify an individual, and that relates to the individual's past, present, or future physical or mental health or condition; the provision of health care to the individual; or the past, present, or future payment for health care provided to the individual.
Personally Identifiable Information (PII): Any name or number that may be used, alone or in conjunction with other information, to identify a specific individual. PII includes:
- Name, postal or electronic mailing address, telephone number, Social Security number, date of birth, mother's maiden name, official state-issued or US-issued driver's license or identification number, alien registration number, government passport number, employer or taxpayer identification number, Medicaid or food stamp account number, bank account number, credit or debit card number, or personal identification number or code assigned to the holder of a debit card by the issuer to permit authorized electronic use of such card;
- Unique biometric data, such as fingerprint, voice print, retina or iris image, or other unique physical representation;
- Unique electronic identification number, address, or routing code;
- Medical records;
- Telecommunication identifying information or access device;
- Other number or information that can be used to access a person's financial resources.
Private Information: Information about a behavior that occurs in a context in which an individual can reasonably expect that no observation or recording is taking place, and information that has been provided for specific purposes by an individual and that the individual can reasonably expect will not be made public (for example, a medical record).
Research Data Management (RDM): The managing, sharing, and archiving of research data to make it more accessible to the broader research community. Research data management provides an opportunity for researchers to create a plan ensuring data will be organized and shared with other researchers, or archived for long-term preservation.
Restricted Data: Data made available under stringent, secure conditions. Typically confidential or sensitive data.
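To illustrate the Coded Data definition above, the sketch below replaces direct identifiers with a random code and keeps the linkage key as a separate object. It is a simplified example under stated assumptions: the function name code_records, the "P-" code format, and the sample records are hypothetical, and each input record is assumed to belong to a different participant.

```python
import secrets


def code_records(records: list, identifier_fields: list) -> tuple:
    """Split records into coded data and a linkage key.

    Returns (coded, linkage_key): coded records carry a random participant_code
    in place of the identifier fields; linkage_key maps each code back to the
    removed identifiers and should be stored separately with restricted access.
    """
    linkage_key = {}
    coded = []
    for record in records:
        code = "P-" + secrets.token_hex(4)
        while code in linkage_key:  # regenerate in the unlikely event of a collision
            code = "P-" + secrets.token_hex(4)
        linkage_key[code] = {
            field: record[field] for field in identifier_fields if field in record
        }
        coded_record = {
            key: value for key, value in record.items() if key not in identifier_fields
        }
        coded_record["participant_code"] = code
        coded.append(coded_record)
    return coded, linkage_key


# Made-up records for illustration only.
raw = [
    {"name": "Example Person A", "email": "a@example.org", "score": 12},
    {"name": "Example Person B", "email": "b@example.org", "score": 15},
]
coded, linkage_key = code_records(raw, ["name", "email"])
print(coded)  # identifiers removed; codes differ on every run
```

Because the linkage key still exists, the result is coded data rather than anonymous data. Removing direct identifiers also does not rule out deductive disclosure through the remaining indirect identifiers.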
RESOURCES
FAU Libraries: Data Management Resources and Services
FAU Research Data Policy 10.1.6