2024 has been the biggest year in the history of made-up data - and, no, we’re not making that up
(We welcome reader feedback and will update this article to showcase more examples.)
Sorry, xkcd - we disagree!
A new report from Allied Market Research this month indicates that the global synthetic data market is estimated to reach $3.1 billion by 2031. Several key factors have affected the larger landscape of synthetic data:
- The growing demand for artificial intelligence (AI) applications that require large and diverse datasets for training
- The recent development of technologies and algorithms capable of generating realistic, high-quality synthetic data
- Increased awareness that using synthetic data as a stand-in for real data can dramatically improve data access, and thus developer productivity
For many years synthetic data was viewed as a substitute or backup for real data, but now it can often match - and sometimes even surpass - the quality of real data.
It’s wild to think that we are now just five years away from 2030, the year when Gartner famously predicted that synthetic data would surpass real data in training AI models. While that might seem ambitiously close to the present day, it’s worth stepping back and appreciating how far generative AI and related technologies have come in the last couple of years: the first public version of ChatGPT (GPT-3.5) went live barely two years ago, in November 2022. Since then, entire industries have been fundamentally transformed.
Here are some of the biggest stories to come out of the synthetic data space this year.
Synthetic data is not just a "stand-in" for real data to overcome privacy concerns
Synthetic data continues to be widely successful in making AI models better. This spans classical predictive models, training AI agents, and improving language models themselves (more on this in the next section). Here are just a few examples:
- Researchers at Scotiabank developed methods for generating synthetic tabular data using generative adversarial networks (GANs).
- A team at UCLA showed how synthetic data can dramatically improve banks’ abilities to detect credit card fraud - a crime that happened in the U.S. more than 400,000 times last year alone.
- Georgia Tech computer scientists found that synthetic data may be able to augment existing data in improving the ability of AI systems to detect hate speech.
- Apple scientists presented a paper about a data generation framework they developed for a form of natural language processing called “dialogue state tracking,” finding that they could start from a simple dialogue schema and a few hand-crafted templates to synthesize natural, free-flowing dialogues.
- In July, Microsoft bolstered their efforts to combat human trafficking using privacy-preserving synthetic data about victims and perpetrators. Their work helped inspire the Counter Trafficking Data Collaborative (CTDC), the first and largest data hub of its kind, with over 206,000 individual cases of human trafficking visualized throughout the site.
Even governments are starting to notice the field’s impact: in January the Department of Homeland Security issued a call for solutions for generating synthetic data, so that it can better train machine learning models when real-world data isn’t available or would pose privacy risks.
In big tech, synthetic language data reigns supreme
The world’s biggest tech companies continue to show the value of synthetic data by using it to power some of their most impactful products, and even occasionally contributing their own open-source tools to the larger synthetic-data ecosystem. Almost all of these efforts focused on creating language or text data using LLMs, in order to better train LLMs or agents.
- In March OpenAI unveiled their Voice Engine tool, which uses synthetic voices to assist individuals with learning disabilities and speech impairments.
- Google has also used synthetic data to train tools like its Gemma large language models (LLMs) and its AI-driven math solver AlphaGeometry.
- In June, NVIDIA released Nemotron-4 340B, a family of open-access models for synthetic data generation that includes base, instruct and reward models, available through the Hugging Face model repository.
- In July Meta released the Llama 3.1 family of LLMs. Importantly, they not only used synthetic data in the training process, but updated the license to allow developers to use Llama 3.1’s outputs to train smaller models. Such a move could lower the barrier to entry for smaller developers and startups, “enabling them to create competitive models without requiring vast amounts of real-world data.”
- In December, Microsoft launched its Phi-4 language model, which was trained mostly on synthetic data rather than web content (as has most commonly been the practice). The company trained the model using more than 50 synthetic datasets that collectively contained about 400 billion tokens. In recent months Microsoft has also developed frameworks for improving the quality and diversity of synthetic data.
As generative models are used to create synthetic data to overcome the shortage of data (or annotated data, for that matter) for language models, there has been an ongoing debate about whether this will result in biased data, and how far we can really take it. (The promise and perils of synthetic data - Kyle Wiggers)
Synthetic data for training AI Agents
- In December, an MIT team set out to build an agent that narrates explanations of predictions from a classical ML model (a task previously handled with data visualizations). The team hand-wrote a small set of narratives and generated additional ones using DSPy’s BootstrapFewShot, finding that "Bootstrapped exemplars contribute more to generating accurate and complete narratives" (see the sketch after this list).
- In December, Databricks released a new synthetic data generation API to help developers of so-called “compound AI agents” more quickly and iteratively build evaluation datasets.
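To make the bootstrapping step concrete, here is a minimal sketch of how a few hand-written narratives can be expanded with DSPy’s BootstrapFewShot. The signature, training example, metric, and model choice below are illustrative assumptions, not the MIT team’s actual code.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumed model choice for illustration; any LM supported by DSPy works here
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class NarrateExplanation(dspy.Signature):
    """Turn a model explanation into a short natural-language narrative."""
    explanation = dspy.InputField(desc="feature attributions for one prediction")
    narrative = dspy.OutputField(desc="plain-language summary of the explanation")

narrator = dspy.ChainOfThought(NarrateExplanation)

# A handful of hand-written (explanation, narrative) pairs; illustrative only
trainset = [
    dspy.Example(
        explanation="income=+0.42, debt_ratio=-0.31, age=+0.05",
        narrative=(
            "The applicant's income raised the approval score the most, while a "
            "high debt ratio pulled it down; age had little effect."
        ),
    ).with_inputs("explanation"),
    # ... more hand-written examples ...
]

def narrative_metric(example, prediction, trace=None):
    # Placeholder check; the MIT team used their own accuracy/completeness criteria
    return len(prediction.narrative.split()) > 10

# Bootstrap additional exemplars from the hand-written ones and compile the program
optimizer = BootstrapFewShot(metric=narrative_metric, max_bootstrapped_demos=4)
compiled_narrator = optimizer.compile(narrator, trainset=trainset)
```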
LLMs for Synthetic Tabular Data?
One common question is whether it's possible to use LLMs to create synthetic tabular data. In 2024, this was explored in both the academic and tech worlds.
- In June, Google formally released BigQuery DataFrames. They also published a blog post, “Exploring synthetic data generation with BigQuery DataFrames and LLMs,” in which they demonstrate how to prompt a GeminiTextGenerator to generate Python code capable of creating synthetic data using the popular Faker library. The focus is on creating Python wrapper code to generate fake data (see the sketch after this list).
- In December, Snowflake announced a functionality for generating synthetic data. Kudos to the Snowflake product team for clearly delineating various types of artificially generated data, ranging from fake data made from scratch to synthetic data that looks real, as well as detailing where LLMs can currently be helpful. They point out that LLMs can create generic random data from scratch (much like SDV's DayZSynthesizer) based on a schema. For generating synthetic data that has the same statistical properties as real data, they leverage copulas to capture and recreate data distributions. Currently, this functionality is limited to 5 input tables and 100 columns per table.
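To make this concrete, here is a minimal sketch of the kind of Faker-based wrapper an LLM might be prompted to produce in the BigQuery DataFrames example above; the table schema and columns are illustrative assumptions, not Google’s actual output.

```python
import pandas as pd
from faker import Faker

fake = Faker()

def generate_fake_customers(num_rows: int = 100) -> pd.DataFrame:
    """Generate schema-conforming fake rows; values are random, not learned from real data."""
    rows = [
        {
            "customer_id": fake.uuid4(),
            "name": fake.name(),
            "email": fake.email(),
            "signup_date": fake.date_between(start_date="-2y", end_date="today"),
            "country": fake.country_code(),
        }
        for _ in range(num_rows)
    ]
    return pd.DataFrame(rows)

print(generate_fake_customers(5))
```

Fake data like this conforms to a schema but carries none of the statistical structure of real data, which is exactly the gap that copula-based approaches (such as the Snowflake functionality above, or the SDV example in the next section) are designed to fill.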
At DataCebo, we continue to push the envelope for tabular synthetic data
At DataCebo, we continue to push the envelope on synthetic data for tabular, relational and time series enterprise datasets. Starting last December and continuing into early this year, we launched SDV Enterprise, a commercial version of our most popular publicly available library, SDV.
We have also continued to roll out new iterations of our products. We recently launched constraint-augmented data generation (CAG for short), which enables enterprise users to provide additional context during the data generation process and is aimed at helping software developers create premium-quality tabular synthetic data.
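For readers new to SDV, here is a minimal sketch of the public library’s single-table workflow, using an illustrative DataFrame; CAG and the other SDV Enterprise capabilities described above build on this same basic pattern.

```python
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A small illustrative table standing in for a real enterprise dataset
rng = np.random.default_rng(0)
real_data = pd.DataFrame({
    "transaction_id": np.arange(1_000),
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=1_000).round(2),
    "channel": rng.choice(["web", "mobile", "branch"], size=1_000),
})

# Detect column types, then learn the joint distribution with a Gaussian copula
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column("transaction_id", sdtype="id")

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Sample new rows that mimic the statistical properties of the original table
synthetic_data = synthesizer.sample(num_rows=500)
print(synthetic_data.head())
```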
We’ve also collaborated with companies to show them the value-add of synthetic data. To give one example, ING Belgium found that using SDV Enterprise for software development allowed them to achieve 100x the test coverage in one-tenth the time, leading to a safer and more robust payment processing experience for millions of ING customers.
Our predictions for 2025
1. The rise of generative AI will result in a number of LLM-based synthetic data generation tools for tabular data. None will deliver on the promise, but this process will help enterprises define requirements.
The explosive growth of large language models has made everyone understandably gung-ho to go all in on generative AI. Meanwhile, a number of academics and researchers have begun exploring the use of LLMs to generate synthetic tabular data. We predict that such efforts will proliferate, and will show promise on simple toy datasets or single-table datasets — but that these tools will fall short for complex, enterprise-grade, multi-table databases with additional context that is not clearly delineated within the database schema. It will get worse before it gets better: that is, a number of these tools will be tested, and will fail to deliver… but this will lead to the development of much more concrete requirements for tabular synthetic data generators.
2. Companies will face a freeze in data asset availability due to regulations and declining customer consent.
Privacy and security regulations continue to get stricter, and many countries now have some kind of personal data protection policy in place. Using customer data is getting more difficult for other reasons, too — people are more privacy-conscious, and are increasingly likely to withhold consent for companies to use their data for analytics purposes. We predict that this will lead to companies effectively running out of relevant and usable data assets, and turning to synthetic data as a viable solution.
3. Every company will at the very least experiment with synthetic data in 2025 as part of their broader AI data strategy.
Synthetic data is often better than real data when it comes to AI training, and it has the added benefit of being more freely shareable across teams and organizations. AI and machine learning algorithms simply perform better when trained with upsampled, augmented and bias-corrected synthetic data, as they can pick up on patterns more efficiently without overfitting. We are already seeing this — the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.
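To illustrate the upsampling point, the public SDV library supports conditional sampling, which can generate additional rows for an under-represented class; the fraud-detection framing and columns below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sampling import Condition
from sdv.single_table import GaussianCopulaSynthesizer

# Illustrative, imbalanced dataset: only ~2% of transactions are fraudulent
rng = np.random.default_rng(1)
real_data = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=5_000).round(2),
    "hour_of_day": rng.integers(0, 24, size=5_000),
    "is_fraud": rng.random(5_000) < 0.02,
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Upsample the minority class: ask for 1,000 synthetic fraudulent transactions
fraud_condition = Condition(column_values={"is_fraud": True}, num_rows=1_000)
synthetic_fraud = synthesizer.sample_from_conditions(conditions=[fraud_condition])
print(synthetic_fraud["is_fraud"].value_counts())
```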
4. Synthetic data for training AI agents will become a more pressing need.
As enterprises adopt more AI agents to perform tasks, we predict they will turn to synthetic data in order to train more robust agents. With only a few input-output pairs at hand to train an agent, synthetic data will be needed to fill the gap.
5. Enterprises will gain big from synthetic tabular data and from synthetic data for training AI agents.
We predict that while big tech focuses on creating better language models, enterprises will gain more value from synthetic tabular data (whether it's used to improve data access or to train more robust predictive ML models), along with synthetic data for training better AI agents. While AI agents will likely be in an experimental phase this year, a number of proofs-of-concept will emerge that require synthetic "exemplar" data.


