Data Protection at the Health Data Lab

Privacy and Security

The privacy and security of citizens are the top priority for the Health Data Lab (HDL). Data at the HDL is therefore protected by technical and organizational measures that follow the latest technological standards. These measures are continuously developed and reviewed in consultation with the Federal Commissioner for Data Protection and Freedom of Information (BfDI) and the Federal Office for Information Security (BSI).

Examples of security measures include:

  • Certification of the storage location: Before data is transferred, the HDL’s technical infrastructure is established in close coordination with the BSI and regularly monitored to ensure compliance with the highest security standards.
  • No direct data transmission to researchers: Personally identifiable data is never transmitted directly to researchers. Instead, only the necessary data is provided in a secure processing environment. These data subsets may only be processed within this secure environment.
  • Review of research results: All research results are carefully examined by HDL staff to keep the risk of re-identification as low as possible.
  • Continuous adaptation of security measures: Technical and organizational security precautions are regularly updated to keep pace with evolving requirements and technologies.

These and other measures at the HDL ensure that data is protected at the highest level and that the privacy of affected individuals is always maintained.

Pseudonymization

Pseudonymization is essential for protecting the identity of citizens. It means that personal data is altered in such a way that it can no longer be attributed to a specific individual without additional information. This additional information is stored separately and safeguarded by technical and organizational measures, so that nobody without access to it can link the data back to a specific person.

In concrete terms, this means that clearly identifying characteristics, such as names, addresses, or insurance numbers, are not transmitted to the HDL. Instead, each person in the dataset is assigned a pseudonym, a type of code that prevents direct identification of the individual. Pseudonymization takes place at several levels, the first of which is applied before the data even leaves the health insurance provider. A trusted third party, in this case the Robert Koch Institute, is responsible for performing the pseudonymization procedure.

This process ensures that at no point can identifying characteristics be linked to health data within the HDL.
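
The general idea can be illustrated with a short sketch. It is not the procedure used by the HDL or the Robert Koch Institute, which is not described here; it assumes, purely for illustration, that a pseudonym is derived from an insurance number with a keyed hash (HMAC-SHA256). The function name, the key, and the sample record are hypothetical. The secret key plays the role of the separately stored additional information.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier with a keyed hash.

    Illustrative assumption only: the same identifier and key always
    yield the same pseudonym, and the pseudonym cannot be recomputed
    without the key.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key, held only by the party performing the pseudonymization.
SECRET_KEY = b"example-key-held-by-the-trusted-party"

# Hypothetical source record containing an identifying characteristic.
record = {"insurance_number": "A123456789", "diagnosis": "cold", "year": 2019}

# The record that is passed on carries the pseudonym, not the identifier.
pseudonymized_record = {
    "pseudonym": pseudonymize(record["insurance_number"], SECRET_KEY),
    "diagnosis": record["diagnosis"],
    "year": record["year"],
}
```

In this sketch the insurance number never appears in the record that is passed on, while the same person always receives the same pseudonym as long as the key stays the same.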

Anonymization vs. Pseudonymization

Anonymization means that information is altered in such a way that it can no longer be attributed to a specific person, even with additional information.

Anonymized data is often less useful in research because it does not allow for a clear reference to individual persons. In contrast, pseudonymized data retains an important personal reference without revealing a person’s identity. Instead of a name, each person in the dataset is assigned a pseudonym that does not allow conclusions about their real identity.

The ability to link data to unknown but distinct individuals is crucial for many scientific studies. Take cancer risk research as an example: researchers use pseudonymized data to examine whether and how certain pre-existing conditions influence the risk of developing cancer. Pseudonymized data allows them to link earlier diagnoses with diseases that appear later in a person’s life. This is because each person in the HDL dataset retains the same pseudonym across all years. For instance, researchers can see that Person X had a cold in 2019 and was diagnosed with cancer in 2024. Such connections are often not possible with anonymized data. The anonymization process severs the link between 2019 and 2024, making it impossible to trace which pre-existing conditions Person X had before their cancer diagnosis in 2024. However, such questions are essential for research.
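
The Person X example can be sketched in a few lines. The records, pseudonyms, and function names below are invented for illustration and do not reflect the HDL's actual data model; anonymization is simplified here to removing the person-level key.

```python
from collections import defaultdict

# Hypothetical pseudonymized records: one row per diagnosis and year.
records = [
    {"pseudonym": "p-001", "year": 2019, "diagnosis": "cold"},
    {"pseudonym": "p-002", "year": 2020, "diagnosis": "asthma"},
    {"pseudonym": "p-001", "year": 2024, "diagnosis": "cancer"},
]


def histories(rows):
    """Group diagnoses by pseudonym, ordered by year."""
    by_person = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["year"]):
        by_person[row["pseudonym"]].append((row["year"], row["diagnosis"]))
    return dict(by_person)


def prior_diagnoses(rows, target):
    """For each person with the target diagnosis, list earlier diagnoses."""
    result = {}
    for pseudonym, events in histories(rows).items():
        target_years = [year for year, diagnosis in events if diagnosis == target]
        if target_years:
            first = min(target_years)
            result[pseudonym] = [d for year, d in events if year < first]
    return result


# The stable pseudonym links the 2019 and 2024 records of the same person.
linked = prior_diagnoses(records, "cancer")

# Simplified anonymization: without the pseudonym there is no key left
# to group the rows by, so the 2019 and 2024 rows can no longer be linked.
anonymized = [{"year": r["year"], "diagnosis": r["diagnosis"]} for r in records]
```

Here `linked` maps `p-001` to the earlier diagnosis `cold`, whereas the rows in `anonymized` no longer indicate which diagnoses belong to the same person.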

The General Data Protection Regulation (GDPR) permits data to be anonymized or pseudonymized as long as this does not compromise the purpose of the research. The Health Data Lab (HDL) follows these regulations to ensure that important research questions can be answered without compromising individuals' privacy.