Synthetic Data in 2024: The Year In Review

by the DataCebo Team · December 26, 2024

2024 has been the biggest year in the history of made-up data - and, no, we’re not making that up.

(We welcome reader feedback, and this article will be updated to showcase more examples.)

Sorry, xkcd - we disagree!

A new report from Allied Market Research this month indicates that the global synthetic data market is estimated to reach $3.1 billion by 2031. Several key factors have affected the larger landscape of synthetic data:

  1. The growing demand for artificial intelligence (AI) applications that require large and diverse datasets for training
  2. The recent development of technologies and algorithms capable of generating realistic, high-quality synthetic data
  3. Increased awareness that using synthetic data as a stand-in for real data can dramatically improve data access, and thus developer productivity

For many years, synthetic data was viewed as a substitute or backup for real data, but now it can often match - and sometimes even surpass - the quality of real data.

It’s wild to think that we are now at just the five-year horizon of 2030, the year when Gartner famously predicted that synthetic data would surpass real data in training AI models. While that might seem ambitiously close to present-day, it’s worth stepping back and appreciating how far generative AI and related technologies have come in the last couple years: for example, the first public version of ChatGPT (GPT-3.5) went live barely 2 years ago, in November 2022. Since then, entire industries have been fundamentally transformed.

Here are some of the biggest stories to come out of the synthetic data space this year.

Synthetic data is not just a "stand-in" for real data to overcome privacy concerns

Synthetic data continues to be widely successful in making AI models better. This spans classical predictive models, training AI agents, and improving language models themselves (more on this in the next section). Here are just a few examples:

  • Researchers at Scotiabank developed methods for generating synthetic tabular data using generative adversarial networks (GANs). 
  • A team at UCLA showed how synthetic data can dramatically improve banks’ abilities to detect credit card fraud - a crime that happened in the U.S. more than 400,000 times last year alone.
  • Georgia Tech computer scientists found that synthetic data may be able to augment existing data in improving the ability of AI systems to detect hate speech.
  • Apple scientists presented a paper about a data generation framework they developed for a form of natural language processing called “dialogue state tracking,” finding that they could take a simple dialogue schema and a few hand-crafted templates and synthesize natural, free-flowing dialogues.
  • In July, Microsoft bolstered its efforts to combat human trafficking using privacy-preserving synthetic data about victims and perpetrators. That work helped inspire the Counter Trafficking Data Collaborative (CTDC), the first and largest data hub of its kind, with over 206,000 individual cases of human trafficking visualized throughout the site.

Even governments are starting to notice the field’s impact: in January the Department of Homeland Security issued a call for solutions for generating synthetic data, so that it can better train machine learning models when real-world data isn’t available or would pose privacy risks.

In big tech, synthetic language data reigns supreme

The world’s biggest tech companies continue to show the value of synthetic data by using it to power some of their most impactful products, and even occasionally contributing their own open-source tools to the larger synthetic-data ecosystem. Almost all of these innovative efforts focused on creating language or text data with LLMs, in order to better train LLMs or AI agents. A minimal sketch of the pattern follows below.
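To make the pattern concrete, here is a minimal, illustrative sketch of generating synthetic labeled text by prompting an LLM. It assumes the openai Python package, an OPENAI_API_KEY in the environment, and a hypothetical sentiment-labeling task - none of which come from the companies mentioned above - and any production pipeline would add deduplication, filtering, and quality checks:

```python
# Minimal sketch: prompting an LLM for synthetic labeled text.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment;
# the task and prompt are illustrative, not from the article.
import json

from openai import OpenAI

client = OpenAI()

prompt = (
    "Generate 5 short product reviews as a JSON list. Each item must have "
    "'text' (the review) and 'label' ('positive' or 'negative'). "
    "Vary the tone, length, and product type. Return only the JSON."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

# Parse the model's output into training examples.
synthetic_examples = json.loads(response.choices[0].message.content)
for example in synthetic_examples:
    print(example["label"], "-", example["text"])
```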

As generative models are used to create synthetic data to overcome the shortage of data (or annotated data, for that matter) for language models, there has been a constant debate over whether this will result in biased data, and how far we can really take it (see “The promise and perils of synthetic data” by Kyle Wiggers).

Synthetic data for training AI agents

LLMs for synthetic tabular data?

One common question is whether it's possible to use LLMs to create synthetic tabular data. In 2024, this was explored in both the academic and tech worlds.

  • In June, Google formally released BigQuery DataFrames. They also published a blog post, "Exploring synthetic data generation with BigQuery DataFrames and LLMs," in which they demonstrate how to prompt a GeminiTextGenerator to generate Python code capable of creating synthetic data using the popular Faker library. The focus is on creating Python wrapper code that generates fake data.
  • In December, Snowflake announced a functionality for generating synthetic data. Kudos to the Snowflake product team for clearly delineating the various types of artificially generated data, ranging from fake data made from scratch to synthetic data that looks real, as well as for detailing where LLMs can currently be helpful. They point out that LLMs can create generic random data from scratch (much like SDV's DayZSynthesizer) based on a schema. For generating synthetic data that has the same statistical properties as real data, they leverage copulas to capture and recreate data distributions - both flavors are sketched below. Currently, this functionality is limited to 5 input tables and 100 columns per table.
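To make that distinction concrete, here is a minimal, illustrative sketch of the two flavors: fake data created from scratch with the Faker library, versus statistically faithful synthetic data fit with a copula-based model, shown here via the open-source SDV library's GaussianCopulaSynthesizer. The toy DataFrame is our own, not from either vendor's product.

```python
# Flavor 1: fake data from scratch -- realistic-looking values,
# but no statistical relationship to any real dataset.
import pandas as pd
from faker import Faker

fake = Faker()
fake_df = pd.DataFrame({
    'name': [fake.name() for _ in range(5)],
    'balance': [fake.pyfloat(min_value=0, max_value=10_000) for _ in range(5)],
})

# Flavor 2: copula-based synthetic data -- learns the distributions and
# correlations of a real table, then samples brand-new rows from them.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_df = pd.DataFrame({  # illustrative stand-in for a real table
    'age': [23, 35, 41, 52, 29, 37, 60, 45],
    'balance': [1200.0, 5400.5, 3300.0, 8800.2, 1500.7, 4200.3, 9100.9, 6100.1],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)
synthetic_df = synthesizer.sample(num_rows=100)  # new rows, same statistics
```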

At DataCebo, we continue to push the envelope for tabular synthetic data 

At DataCebo, we continue to push the envelope on synthetic data for tabular, relational and time series enterprise datasets. Starting last December and continuing into early this year, we launched SDV Enterprise, a commercial version of our most popular publicly available library, SDV. 

We have also continued to roll out new iterations of our products. We recently launched constraint-augmented data generation (CAG for short), which enables enterprise users to provide additional context during the data generation process and is aimed at helping software developers create premium-quality tabular synthetic data.
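While CAG itself is an SDV Enterprise feature, the open-source SDV library offers a related, simpler mechanism: predefined constraints attached to a synthesizer. Here is a minimal sketch, assuming SDV 1.x's dictionary-based constraint API and the library's built-in demo dataset:

```python
# Minimal sketch of constraint-aware generation with open-source SDV,
# using the library's built-in 'fake_hotel_guests' demo dataset.
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests',
)

synthesizer = GaussianCopulaSynthesizer(metadata)

# Enforce that every synthetic check-in date precedes its checkout date.
synthesizer.add_constraints(constraints=[{
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'checkin_date',
        'high_column_name': 'checkout_date',
    },
}])

synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=500)
```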

We’ve also collaborated with companies to show them the value-add of synthetic data. To give one example, ING Belgium found that using SDV Enterprise for software development allowed them to achieve 100x the test coverage in one-tenth the time, leading to a safer and more robust payment processing experience for millions of ING customers.  

Our predictions for 2025

1. The rise of generative AI will result in a number of LLM-based synthetic data generation tools for tabular data. None will deliver on the promise, but this process will help enterprises define requirements.

The explosive growth of large language models has made everyone understandably gung-ho to go all in on generative AI. Meanwhile, a number of academics and researchers have begun exploring using LLMs to generate synthetic tabular data. We predict that such efforts will proliferate, and will show promise on simple toy datasets or single-table datasets - but that these tools will fall short for complex, enterprise-grade, multi-table databases with additional context that is not clearly delineated within the database schema. It will get worse before it gets better: a number of these tools will be tested and will fail to deliver… but this will lead to the development of much more concrete requirements for tabular synthetic data generators.

2. Companies will face a freeze in data asset availability due to regulations and declining customer consent.

Privacy and security regulations continue to get stricter, and many countries now have some kind of personal data protection policy in place. Using customer data is getting more difficult for other reasons, too - people are more privacy-conscious, and are increasingly likely not to consent to companies using their data for analytics purposes. We predict that this will lead companies to effectively run out of relevant and usable data assets - and to bring in synthetic data as a viable solution.

3. Every company will at the very least experiment with synthetic data in 2025 as part of their broader AI data strategy.

Synthetic data is often better than real data when it comes to AI training, and it has the added benefit of being more freely shareable across teams and organizations. AI and machine learning algorithms simply perform better when trained with upsampled, augmented and bias-corrected synthetic data, as they can pick up on patterns more efficiently without overfitting. We are already seeing this - the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.
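On the upsampling point: one common pattern is conditional sampling, where a fitted synthesizer is asked for extra rows of an underrepresented class. Here is a minimal sketch with open-source SDV; the tiny transactions table and its 'fraud' column are illustrative, not real data:

```python
# Minimal sketch: rebalancing a rare class via conditional sampling in SDV.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sampling import Condition
from sdv.single_table import GaussianCopulaSynthesizer

# Toy, imbalanced transactions table (illustrative, not real data).
transactions = pd.DataFrame({
    'amount': [12.0, 80.5, 23.9, 950.0, 44.2, 61.7, 18.3, 1200.0],
    'fraud': [0, 0, 0, 1, 0, 0, 0, 1],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(transactions)
metadata.update_column(column_name='fraud', sdtype='categorical')

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(transactions)

# Request extra synthetic rows of the rare class to rebalance training data.
minority = Condition(num_rows=1000, column_values={'fraud': 1})
upsampled = synthesizer.sample_from_conditions(conditions=[minority])
```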

4. Synthetic data for training AI agents will become a more pressing need.

As enterprises adopt more AI agents to perform tasks, we predict they will turn to synthetic data in order to train more robust agents. With only a few input-output pairs at hand to train an agent, more synthetic data will be needed.

5. Enterprises will gain big from synthetic tabular data and synthetic data to train AI agents

We predict that while big tech focuses on creating better language models, enterprises will gain more value from synthetic tabular data (whether it's used to improve data access or to train more robust predictive ML models), along with synthetic data for training better AI agents. While AI agents will likely be in an experimental phase this year, a number of proofs-of-concept will emerge that require synthetic "exemplar" data.
