In our digital age, data stands as the undisputed ruler, coursing through the veins of our ever-more interconnected society. Amidst this data-driven upheaval, an unseen champion emerges, subtly but profoundly reshaping the landscape: synthetic data. This formidable ally operates in the shadows, quietly revolutionising our economic terrain. It provides a beacon of hope, addressing the hurdles posed by rigorous data protection laws and the intricate dance of data collection and manipulation.
More than a mere buzzword, synthetic data is blazing a trail in our digital landscape, and regulators aren’t just sitting on the sidelines. The European Commission has acknowledged the immense potential of synthetic data and is actively probing its applications across a spectrum of sectors. The EU Digital Finance Platform has taken a leap forward, constructing a Data Hub that harnesses synthetic data for seamless data exchange among national supervisors and financial firms. Meanwhile, EU data protection authorities are assessing the privacy risks of creating and using synthetic data.
In the United States, the Federal Chief Data Officers Council is not only investigating synthetic data but also inviting public input on synthetic data generation. This underscores its escalating importance in government decision-making and operations. The anticipation is building for the resulting position paper, expected to be published within the year. Adding to the momentum, the Department of Homeland Security (DHS) Science and Technology Directorate (S&T) recently announced a new solicitation seeking solutions to generate synthetic data that models and replicates the shape and patterns of real data, while safeguarding privacy and mitigating security harms.
A report by Fortune Business Insights reveals that the synthetic data generation market size was valued at $288.5m in 2022 and is projected to skyrocket to $2,339.8m by 2030. This rapid growth underscores the escalating demand and the dynamic developments in the field of synthetic data. Moreover, analysts have discovered that in 2022, a significant 18% of businesses were already using synthetic data to comply with privacy regulations and facilitate secure data exchange. Looking ahead, forecasts suggest that by 2027, synthetic data will be woven into 40% of systems employed by insurance companies.
At its core, synthetic data is far more than a mere trending topic. It’s an innovative power, a game-changer, and a vital field of knowledge in the digital age. Synthetic data in particular enables scenarios where there is not enough real data or where real data cannot be used because of privacy limitations. The world is buzzing with developments in synthetic data, which makes it essential to examine this transformative hero in detail.
Demystifying the hero: What exactly is synthetic data?
Synthetic data might sound like a fancy term to some, but it is an established concept in the world of data science. Generated artificially rather than collected directly from real-world occurrences, it enables the rapid and cost-effective provision of substantial data across diverse scenarios. This capability is transforming data production and utilisation on a grand scale.
Real data mirrors actual events, akin to a photograph capturing reality. In contrast, synthetic data is like an unconstrained painting. It exists in a separate realm, not bound by real-world limitations. Its versatility lies in its ability to be shaped for various applications, offering unique insights. Whether entirely artificial or derived from real data, synthetic data enhances utility while having the potential to avoid direct mirroring of the real world.
In the expansive realm of data science and artificial intelligence, synthetic data plays a role akin to the wind beneath the wings. It empowers these systems to ascend to greater heights, honing their skills on extensive datasets, resulting in enhanced precision and efficiency when tackling complex problems. The quality of synthetic data, particularly its ability to faithfully mirror reality, stands as a critical factor.
Synthetic data is the outcome of a process that captures the essential statistical features of real data and reproduces them in artificial datasets. The aim of this process is to reduce the chance of exposing sensitive information.
The generation of synthetic data, driven by growing data demands, has become a lucrative business. Synthetic data can be generated through various methods, and the quality and diversity of the resulting data depend on the method chosen. Statistical methods such as random sampling and sampling from a multivariate normal distribution, as well as interpolation (a mathematical technique for estimating unknown values between known data points), have long been the key methods for creating data that resembles the distribution of real data and for enhancing synthetic datasets. Nowadays, deep learning techniques such as variational autoencoders and generative adversarial networks are also used to create new synthetic data based on original datasets.
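The two classical approaches named above, sampling from a fitted multivariate normal distribution and interpolation, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production generator: the toy records (age, annual income), the function names and the choice of a two-column normal model are all hypothetical and chosen only for demonstration.

```python
import math
import random
import statistics

# Hypothetical "real" records: (age in years, annual income). Illustrative only.
real_rows = [(25, 31000), (32, 42000), (41, 50000),
             (48, 61000), (55, 58000), (63, 67000)]


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, computed by hand to stay within older stdlib versions.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = cov / (sd_x * sd_y)
    return mean_x, mean_y, sd_x, sd_y, rho


def sample_bivariate_normal(params, n, rng):
    """Draw n synthetic (x, y) pairs that follow the fitted normal model."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y reproduces the correlation between the columns.
        y = mean_y + sd_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


def interpolate(a, b, t):
    """Linear interpolation: a new record at fraction t between a and b."""
    return tuple(u + t * (v - u) for u, v in zip(a, b))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
params = fit_bivariate_normal(real_rows)
synthetic_rows = sample_bivariate_normal(params, 5000, rng)
midpoint = interpolate(real_rows[0], real_rows[1], 0.5)
```

None of the sampled rows is a copy of a real record, yet the synthetic set as a whole follows the means, spread and correlation estimated from the originals. The interpolated record, by contrast, sits directly between two real records, which is one reason why the choice of method matters for both quality and privacy.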
Synthetic data is being used across all sectors, from finance to healthcare and manufacturing. Given the wide range of possible applications, synthetic data has emerged as a formidable tool.
The unsung champion: Three superpowers of synthetic data
Synthetic data serves as an effective solution to bridge data gaps, while also having the benefit of being scalable and easy to use.
Scalability: Synthetic data is a boon for all use cases that rely on a vast amount of data to achieve a sufficiently high level of accuracy and precision. Specially developed algorithms can define how the generated data should look and behave. Custom requirements for the data, such as the distribution of gender, ethnicity or age, can also be easily accommodated, facilitating the creation of balanced datasets.
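As a rough illustration of how such custom requirements can be built into the generation step, the following Python sketch produces a dataset with exactly the same number of records per group. The function name, the attribute names and the group labels are hypothetical; a real generator would model far more attributes and their relationships.

```python
import random


def generate_balanced_records(n_per_group, groups, age_range, rng):
    """Create the same number of artificial records for every group."""
    records = []
    for group in groups:
        for _ in range(n_per_group):
            records.append({
                "gender": group,
                # randint is inclusive on both ends of the range.
                "age": rng.randint(age_range[0], age_range[1]),
            })
    rng.shuffle(records)  # avoid an ordering that reveals the group blocks
    return records


rng = random.Random(0)  # fixed seed so the sketch is reproducible
groups = ["female", "male", "non-binary"]
records = generate_balanced_records(100, groups, (18, 90), rng)
counts = {g: sum(r["gender"] == g for r in records) for g in groups}
```

Scaling up is then a matter of changing a single parameter: raising `n_per_group` yields a larger dataset with the same balance.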
Simplicity: Preparing real-world data is a crucial prerequisite for a well-functioning system, yet it is time-consuming. Key steps include data cleansing and data pre-processing, such as removing outliers, filling in missing values or standardising patterns. The subsequent steps involve transforming the data into a suitable format and often enriching it with additional related information to provide deeper insights. While these steps are essential for working with real-world data, most of them can be skipped or done less thoroughly when using synthetic data, as these features of the dataset can already be taken into account in the data creation process. Synthetic data avoids common data issues such as inaccuracies and duplicates, which enhances its value, e.g. for machine learning. In addition, artificial data creation ensures consistent patterns, formats and labels.
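To make concrete what the preparation steps named above involve for real-world data, here is a minimal Python sketch covering duplicate removal, standardisation, outlier removal and filling in missing values. The toy records, the `clean` function and the income threshold are assumptions made for illustration only.

```python
import statistics

# Hypothetical raw records with typical real-world defects.
raw_rows = [
    {"id": 1, "country": " de ", "income": 42000.0},
    {"id": 2, "country": "DE", "income": None},       # missing value
    {"id": 2, "country": "DE", "income": None},       # duplicate record
    {"id": 3, "country": "Fr", "income": 39000.0},
    {"id": 4, "country": "FR", "income": 45000.0},
    {"id": 5, "country": "de", "income": 9900000.0},  # implausible outlier
]


def clean(rows, max_income):
    """Apply de-duplication, standardisation, outlier removal and imputation."""
    seen_ids = set()
    deduped = []
    for row in rows:
        if row["id"] in seen_ids:
            continue  # drop repeated records
        seen_ids.add(row["id"])
        deduped.append(dict(row))  # copy so the input stays untouched

    for row in deduped:
        # Standardise the country code to one consistent pattern.
        row["country"] = row["country"].strip().upper()

    # Remove records whose income exceeds the plausibility threshold.
    kept = [r for r in deduped
            if r["income"] is None or r["income"] <= max_income]

    # Fill missing incomes with the median of the remaining known values.
    known = [r["income"] for r in kept if r["income"] is not None]
    median_income = statistics.median(known)
    for row in kept:
        if row["income"] is None:
            row["income"] = median_income
    return kept


cleaned_rows = clean(raw_rows, max_income=1_000_000)
```

With synthetic data, the generator can be set up to emit complete, consistently formatted records from the start, which is why much of this work can be reduced.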
Bridging data gaps: To address data gaps, especially when data is scarce or missing (as in climate research or medical research, or with uncollected facts in statistics), synthetic data is created to ensure a solid dataset for the purpose at hand. This includes, for example, integrating underrepresented groups into datasets. Oversampling duplicates minority class samples, while undersampling removes majority class samples to balance the data representation. These methods help to achieve fair and equal class representation in datasets.
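The oversampling and undersampling described above can be sketched as follows. This is a minimal Python illustration using simple random resampling; the function names and the toy dataset with a 90/10 class split are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_label(rows, label_key):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    return groups


def random_oversample(rows, label_key, rng):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    groups = group_by_label(rows, label_key)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)  # keep every original row
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def random_undersample(rows, label_key, rng):
    """Randomly drop majority rows until all classes match the smallest."""
    groups = group_by_label(rows, label_key)
    target = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, target))  # sample without replacement
    return balanced


rng = random.Random(7)  # fixed seed so the sketch is reproducible
rows = [{"id": i, "label": "majority" if i < 90 else "minority"}
        for i in range(100)]
oversampled = random_oversample(rows, "label", rng)
undersampled = random_undersample(rows, "label", rng)
over_counts = Counter(r["label"] for r in oversampled)
under_counts = Counter(r["label"] for r in undersampled)
```

Both results are balanced, but at a cost: oversampling by duplication adds no new information about the minority class, and undersampling discards real observations from the majority class.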
Unmasking the hero: The Kryptonite of synthetic data
Synthetic data is a useful tool, but it does not exactly match real-world data. It may miss rare cases or outliers that real data would have. This means that small differences in the real data, which could impact the system's performance, are ignored. There’s always a risk that synthetic data may no longer accurately represent the real world, leading to incorrect predictions.
Synthetic data can also perpetuate or even amplify an existing bias. If real-world data contains a bias and synthetic data is generated from it, this inherent bias can be transferred to the synthetic data, because the related pattern in the data is duplicated. This happens because the synthetic data is created by copying the statistical features of the original data, so it reflects the original data and likely also its bias. An example of amplification can be found in predictive policing tools: if the real-world data shows a higher percentage of crime in a particular area, the synthetic data could exaggerate the association between crime and that area. This could lead the police to target the area more, increasing police presence there and uncovering more actual crime there (unlike in other areas, where crime happens but goes undetected because fewer police are present), making the bias worse and deepening existing prejudices.
Likewise, a lack of diversity in the initial data could result in the continuation of a bias. For example, in the field of medical research, middle-aged white males are over-represented in most historical research data, while women and children are significantly under-represented. Using such an unbalanced dataset in a clinical trial reinforces the existing bias. The data would not support a treatment for women or children, because their medical needs differ from those of middle-aged men.
Of course, there are ways to address these situations when creating synthetic data.
The hero's journey
Synthetic data is not real, but made up by machines. This raises several legal questions, especially from a data protection point of view. In this situation, the matter of anonymisation is crucial, as synthetic data should not have any personal information. For instance, health data is a very sensitive kind of data that medical research depends on, and using synthetic data instead of it would simplify the privacy aspect of doing research.
If synthetic data is created correctly, businesses have far fewer privacy issues to worry about. This is a big deal, because privacy laws can make it hard to use real people's data. Synthetic data is like a light in the dark, showing us how to get more data and at the same time keep privacy safe.
Synthetic data can also help businesses with upcoming technological and digital regulations, because it makes it easier to show and explain how the data is made, what it means, and any bias or imbalance in it. Despite the advantages of synthetic data, there are still legal challenges that have not yet been addressed, as well as inherent limitations to its effectiveness as a technology for privacy preservation and bias mitigation. Stay tuned for more, as we embark on this hero’s journey in our future blog posts.