Data-driven businesses are increasingly looking toward synthetic data – ie, data that is artificially generated but may be derived from real-world datasets – as an option to unlock larger volumes of data for AI and similar purposes, while minimising tension with expanding data protection laws. To learn more about synthetic data, check out our previous blog post. Given that synthetic data is artificially generated, some argue (especially in areas such as healthcare, image generation or face recognition) that synthetic data should help to minimise data protection issues.
However, companies should not assume that synthetic data is anonymous, and thus exempt from data protection laws. Public authorities, such as the Information Commissioner’s Office (UK), the Spanish Data Protection Agency and the European Data Protection Supervisor, are already probing the legal impact and potential consequences of synthetic data, including the risk that synthetic data may reveal personal data. In light of these developments, it is important for companies to understand and assess potential data protection implications when creating or using synthetic data.
When may synthetic data be personal data?
If synthetic data relates to an identified or identifiable individual, it likely constitutes personal data, even if that synthetic data is made up or inferred from other sources. For example, the EU General Data Protection Regulation (GDPR) defines ‘personal data’ as any information ‘relating to an identified or identifiable individual,’ and an individual is identifiable if they can be identified either ‘directly or indirectly,’ such as by reference to any type of identifier. In the U.S., the California Consumer Privacy Act (CCPA) defines ‘personal information’ as any information that ‘identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.’ The CCPA also expressly provides that ‘personal information’ includes inferences drawn about an individual’s preferences, characteristics, etc. Thus, as a threshold matter, if synthetic data contains information about identified or identifiable individuals, it will likely be viewed as containing personal data.
In other cases, the synthetic data may not contain individual-level data, or may otherwise be sufficiently anonymised such that it does not reveal any personal data. The basic idea is that if data does not relate to a particular individual, or no individual can be identified from it, it falls outside the scope of these data protection laws. For example, under the EU GDPR, anonymised data is understood as information ‘which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.’ To assess whether an individual might be identified from the data, the GDPR refers to the ‘means reasonably likely to be used.’ Similarly, the CCPA does not apply to data qualifying as ‘aggregate consumer information’ (ie data relating to ‘a group or category of consumers, from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device’) or as ‘deidentified’ information (ie ‘information that cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer’). The Utah Consumer Privacy Act (UCPA) provides more specifically that ‘synthetic data’ (defined as ‘data that has been generated by computer algorithms or statistical models and does not contain personal data’) falls within the category of ‘deidentified data,’ provided that the data ‘cannot be reasonably linked’ to an individual and that the company holding the data implements certain safeguards to prevent reidentification, including publicly committing not to reidentify the data and contractually imposing this obligation on data recipients.
Thus, whether data is considered personal information within the scope of these privacy laws turns on whether the information could be used to relate or link back to an individual. For example, the FTC recently issued guidance asserting that hashed data is not anonymised: while hashing may ‘obscure’ data so that it does not reveal a person’s identity on its face, it still produces a persistent identifier that can track an individual over time (a minimal illustration of this appears after the list below). Under the GDPR, regulators have clarified that whether data is anonymous or not depends on the following:
- Singling out: the ability to isolate some or all of the records that identify a person in the data set.
- Linkability: the ability to link two or more records concerning the same individual, whether within one dataset or across different datasets.
- Inference: the ability to deduce, with significant probability, the value of an attribute from the values of other attributes.
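To make the hashing point above concrete, here is a minimal Python sketch (using an invented email address and toy datasets, not any example from the cited guidance) showing why a hashed identifier still enables singling out and linkability: the hash is deterministic, so records about the same person can be joined across datasets even though no name or email address appears in them.

```python
import hashlib

def pseudonymise(email: str) -> str:
    """Return a SHA-256 hash of an email address: a stable pseudonym, not anonymous data."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# Two 'separate' datasets collected at different times (hypothetical data)
march_visits = {pseudonymise("jane.doe@example.com"): {"pages_viewed": 12}}
june_purchases = {pseudonymise("jane.doe@example.com"): {"basket_value": 80.0}}

# Linkability: the same hash appears in both datasets, so the records can be
# joined into a richer profile of a single individual over time.
for token, visit in march_visits.items():
    if token in june_purchases:
        profile = {**visit, **june_purchases[token]}
        print(f"Linked profile for token {token[:12]}...: {profile}")
```

The hash does not reveal the email address on its face, but it does nothing to stop anyone who holds both datasets (or who can hash a known email address themselves) from singling the individual out.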
Courts in Europe have also clarified considerations when assessing the likelihood of re-identification. The General Court of the European Union recently upheld a risk-based approach, under which data may still be regarded as anonymised despite a residual likelihood of re-identification, so long as re-identification is not reasonably probable or foreseeable. As that decision has been appealed by the European Data Protection Supervisor, it will ultimately fall to the CJEU to decide the matter. Pending this judicial clarification, the European Data Protection Board is expected to issue guidance on anonymisation later this year.
So when can synthetic data be ‘anonymous’?
Ensuring the level of anonymisation necessary for synthetic data to be deemed truly anonymous is a challenging task. Even if complete anonymisation were possible in theory, in practice, data that is anonymous today might not stay that way tomorrow, and lessons from non-synthetic data cases show that common privacy protection methods may not always be enough: for example, two experiments showed that the weakly encrypted Resident Registration Numbers (RRNs) of South Korean decedents pose a risk of de-anonymisation, with a success rate of 100% on 23,163 encrypted RRNs. In another landmark study, researchers estimated that 99.98% of Americans could be correctly reidentified in any dataset using 15 demographic attributes. Given that only 15 attributes were required for re-identification, the exponential growth of data, and the many connections between people’s data points as they browse the internet and use IoT devices and other connected technology, the risk of re-identification is increasing significantly. In addition, the risk of re-identification remains high even in large-scale datasets containing tens of millions of people: researchers have estimated that 93% of people in a dataset of 60 million can be uniquely identified using only four additional data points.
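The mechanism behind these studies is straightforward to demonstrate. The following sketch (with an entirely invented toy dataset, not the data from the cited research) counts how many records are pinned down by just three quasi-identifiers; in real datasets, the share of unique combinations grows rapidly as more attributes are added.

```python
from collections import Counter

# Hypothetical toy dataset: a few quasi-identifiers (postcode, birth year,
# gender) are enough to make most records unique.
records = [
    {"postcode": "10115", "birth_year": 1984, "gender": "F", "diagnosis": "asthma"},
    {"postcode": "10115", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"postcode": "80331", "birth_year": 1984, "gender": "F", "diagnosis": "flu"},
    {"postcode": "80331", "birth_year": 1975, "gender": "M", "diagnosis": "flu"},
    {"postcode": "10115", "birth_year": 1990, "gender": "M", "diagnosis": "asthma"},
]

quasi_identifiers = ("postcode", "birth_year", "gender")

# Count how often each combination of quasi-identifier values occurs.
combo_counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
unique = sum(1 for count in combo_counts.values() if count == 1)

print(f"{unique} of {len(combo_counts)} combinations point to exactly one record")
# Anyone who already knows a target's postcode, birth year and gender can
# single out that record, including the sensitive 'diagnosis' value.
```

The same uniqueness check is a useful first test of a synthetic dataset: if combinations of attributes an outsider could plausibly know still map to single records, the re-identification risk described above has not been removed.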
These examples show that anonymisation is not a ‘one-off’ exercise for traditional data, and the same principle applies to synthetic data. Data that has been classified as anonymous in the past may become personally identifiable information in the future. Synthetic data offers a significant upside: it can reduce the exposure of personal data while preserving the statistical properties that make a dataset useful. But synthetic data is not a silver bullet against all privacy risks. The likelihood of re-identification in synthetic data can differ depending on how the synthetic data is generated, ie whether the synthetic data is based on a real dataset or only artificially mimics a real dataset without any use of real data. Thus, entities that use or provide synthetic data will have to strike a careful balance while remaining vigilant about emerging risks of re-identification.
Exploiting the weakness
When synthetic data is generated based on real data, the synthetic data may still reflect the original data’s distribution and thus leak some personal information. Attacks on synthetic data with the aim of re-identification are often carried out by the following methods:
- Linkage attacks: this is when an attacker tries to connect two or more records that belong to the same data subject, within one dataset or across different datasets, revealing the identity of an individual by combining the available information. An attacker can do this by finding common attributes that both datasets share. A significant challenge in this regard is the presence of exceptional characteristics in a dataset, which is particularly relevant in the health sector (eg rare diseases). A minimal sketch of such an attack appears after this list.
- Attribute inference attacks: in these attacks, attackers attempt to infer the values of certain attributes by evaluating the patterns and correlations present in the underlying data. This can also happen when attackers use aggregate statistics (statistics that describe a whole dataset) to try to recover the original data, allowing them to infer the values of specific attributes for individual records.
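As a concrete illustration of the linkage attack described above, the sketch below joins two invented, hypothetical datasets: a released dataset (for example, an overly faithful synthetic copy of real records) that retains quasi-identifiers, and a public source that pairs those same attributes with names.

```python
# Hypothetical released dataset, eg a synthetic dataset that closely mirrors
# the original records, including their quasi-identifiers.
released = [
    {"postcode": "EC1A", "birth_year": 1962, "condition": "rare disease X"},
    {"postcode": "SW1A", "birth_year": 1985, "condition": "hypertension"},
]

# Hypothetical public information, eg scraped from a register or social media.
public = [
    {"name": "Alice Example", "postcode": "EC1A", "birth_year": 1962},
    {"name": "Bob Example", "postcode": "N1", "birth_year": 1990},
]

shared_attributes = ("postcode", "birth_year")

# Join the datasets on the attributes they share: a match attaches a name to a
# sensitive attribute that the released dataset alone never disclosed.
for person in public:
    key = tuple(person[a] for a in shared_attributes)
    for row in released:
        if tuple(row[a] for a in shared_attributes) == key:
            print(f"{person['name']} is likely the individual with {row['condition']!r}")
```

The rare-disease record also illustrates why exceptional characteristics are especially risky: the fewer people who share a combination of attributes, the easier the join.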
What measures should companies consider in generating or using synthetic data?
Although it does not eliminate all inherent privacy risks, synthetic data stands as a game-changer for a number of reasons. It allows for the generation of vast amounts of data with potentially less impact on individual privacy, making it a powerful tool for training machine learning models and conducting research in fields where data is scarce or sensitive.
But companies must examine this type of data closely to balance the benefits and risks for their business model. While synthetic data may reduce privacy risks in certain respects, it does not necessarily eliminate all privacy-related risks. Businesses are advised to:
- define the data need of the business case;
- determine the requirements for the data (such as scale, quality, quantity, real-world accuracy);
- analyse the legal implications and risks of these data types;
- select a suitable method to generate the synthetic data as required;
- maintain documentation regarding both the generation and the use of the synthetic data;
- regularly check whether technical innovations have had any implications on the re-identifiability of the synthetic data set; and
- if purchasing synthetic data, agree on appropriate representations and indemnities with the synthetic data provider.
Finally, companies that want to leverage synthetic data should continue to consider the potential privacy compliance implications and risks of these data types, as regulators in the EU and US continue to develop their approaches to how privacy regimes may apply to a broad array of data.