
Deploying AI-Generated Synthetic Data: 6 Lessons Learned from Scaling Enterprise Adoption

January 6, 2026

This article, written collaboratively by Wim Blommaert, Head of Test Data Management at ING, and Kalyan Veeramachaneni, CEO of DataCebo, explores ING’s journey in adopting synthetic data across business applications.

Synthetic data has been around for several decades. A particularly powerful example comes from 1993, when statistician Donald Rubin created synthetic census data and showed that it could be used to preserve the anonymity of individual households. But it's only recently that AI algorithms, driven by hugely increased computational power, have enabled us to build generative models that emulate the complexity of real-world data. A decade ago, at MIT, we introduced algorithms that can train generative AI models on complex enterprise data and create synthetic data. DataCebo now maintains a thriving open-core software system called The Synthetic Data Vault (SDV), which many enterprises use to build generative models for tabular, multi-table and time series datasets.
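To make that workflow concrete, here is a minimal sketch of the basic SDV pattern: describe a table with metadata, fit a synthesizer on the real data, then sample synthetic rows. The table and column names are purely illustrative, and the exact class names may differ between SDV versions.

```python
# Minimal sketch of the basic SDV workflow (illustrative data;
# class names may differ between SDV versions).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A small, purely illustrative table of real records.
real_data = pd.DataFrame({
    "age": [34, 51, 27, 45, 62],
    "state": ["OH", "TX", "OH", "NY", "TX"],
    "balance": [1200.0, 5300.5, 250.0, 980.0, 7400.25],
})

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Train a generative model on the real data, then sample synthetic rows.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)
```

SDV's multi-table and sequential synthesizers follow the same fit-and-sample pattern.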

More powerful than these technological advancements is the increased appetite for this type of data. As businesses increase their data collection by many orders of magnitude, their willingness to unlock its value by mastering the use of “non-real data” has grown. Meanwhile, generative AI models trained on language, images, videos, audio and even software have transformed how we think about AI-generated data and its applications. In 2025 alone, SDV was downloaded 7 million times, surpassing all previous years combined.

Despite this surge in interest, the wide range of use cases for AI-generated synthetic data—overcoming privacy concerns, addressing training data shortages, and enabling simulations—can be a double-edged sword. In our work with enterprises, we hear questions like: Where should we get started? How can we bring in enterprise data—which may be siloed across the enterprise, and may lack provenance—so it can fuel these models? How do we categorize use cases from an effort-impact perspective? To realize the value of this technology, we must answer these questions.

In 2021, we set out to deploy this technology at ING Belgium, one of 29 Global Systemically Important Banks, often referred to as “too big to fail.” Today, AI-generated synthetic data is deployed in more than 20 business applications at ING, and a platform is being rolled out that will allow it to scale well beyond that. Current applications range from testing payment systems, to sharing synthetic data with an external party for performance testing, to building AI models for predicting IT incidents. Many lessons were learned along the way, and we share some here in the hope that they will help drive broader adoption of this technology.

Lessons from incorporating synthetic data into our Data and AI strategy.

Lesson 1: The true value of synthetic data goes far beyond privacy preservation

While many associate synthetic data with privacy preservation, generative AI provides an even more valuable opportunity: the ability to generate specific, realistic data that does not exist in your real dataset, simply by setting particular conditions (known as prompts in the world of large language models). For example, given a small amount of real customer data from Ohio, a model could create “synthetic transactions for 1,000 customers from Ohio who are older than 35 and college-educated.”
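As a rough illustration, the sketch below shows how such a condition might be expressed with SDV's conditional sampling API. The file and column names are hypothetical and the API may vary by version; exact-value conditions are passed directly, while a range condition such as “older than 35” is approximated here by filtering the sampled rows.

```python
# Hypothetical sketch of conditional sampling with SDV.
# File and column names are illustrative; Condition takes exact
# column values, so the age threshold is applied by post-filtering.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition

real_transactions = pd.read_csv("ohio_transactions.csv")  # hypothetical file

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_transactions)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_transactions)

# "Synthetic transactions for 1,000 college-educated customers from
# Ohio", expressed as an exact-value condition.
condition = Condition(
    num_rows=1000,
    column_values={"state": "Ohio", "education": "college"},
)
synthetic = synthesizer.sample_from_conditions(conditions=[condition])

# Approximate the "older than 35" part of the prompt by filtering.
synthetic = synthetic[synthetic["age"] > 35]
```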

The gap-filling abilities of generative models can provide necessary training data for machine learning models that can predict rare events. In one case, we wanted to predict whether IT changes would lead to software failures. In reality, only 2.5% of such changes result in failures—not enough to train a prediction model. So we created a dataset of synthetic failures, and used it to train a model that can now perform these predictions.
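Below is a hypothetical sketch of this kind of rare-event augmentation (not ING's actual pipeline): fit a synthesizer on historical change records, conditionally sample extra failure rows, and train an off-the-shelf classifier on the augmented table. The file name, column names, and choice of classifier are all assumptions.

```python
# Hypothetical sketch of augmenting a rare "failure" class with
# synthetic rows before training a classifier. Not the actual ING
# pipeline; file and column names are illustrative.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition
from sklearn.ensemble import RandomForestClassifier

changes = pd.read_csv("it_changes.csv")  # historical IT change records

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(changes)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(changes)

# Only ~2.5% of rows are failures; sample enough synthetic failures
# to balance the classes.
n_extra = int((changes["failure"] == 0).sum() - (changes["failure"] == 1).sum())
synthetic_failures = synthesizer.sample_from_conditions(
    conditions=[Condition(num_rows=n_extra, column_values={"failure": 1})]
)

# Train a classifier on the combined real + synthetic data.
train = pd.concat([changes, synthetic_failures], ignore_index=True)
X = pd.get_dummies(train.drop(columns=["failure"]))
y = train["failure"]
model = RandomForestClassifier().fit(X, y)
```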

Similarly, test data for software applications could be provided while mitigating privacy concerns via masking and anonymization, but the effectiveness of such techniques is limited by the amount and variety of real data available (you cannot mask what you don’t have). Software applications often fail because of edge cases that may not be present in the real data. Generating realistic data for these cases enables more exhaustive and effective testing.

Lesson 2: Adoption should be driven by the business application, not by the data itself.

We frequently see enterprises propose building a generative AI model for a large set of tables (perhaps hundreds) and worrying about use cases later. In our experience, it's much more effective to work in the other direction—to identify a use case and an application, and only then to pinpoint the relevant data asset. This order of operations brings a number of advantages, including built-in stakeholder buy-in, immediate assessment of value, better governance, and (most importantly) a measurable unit for the scale of adoption: the number of applications or use cases. For example, at ING, we started by building a solution for one software application with 11 tables, before slowly scaling up to 5 applications with 20 tables. Similarly, MAPFRE insurance started with a clear business application, using synthetic data to improve motor insurance fraud prediction. Once they succeeded there, they started looking at other questions, such as home insurance fraud prediction, where synthetic data could alleviate data shortages.

Lesson 3: AI is just like any other technology; enterprise-wide adoption will take time, so don’t give up

One challenging aspect of AI is that the public discourse around it has been overly optimistic. Public demos and consumer-oriented applications like ChatGPT have created a perception that AI tools can be adopted overnight. As a result, while enterprises will spend several years on major projects like cloud migration, expectations around AI projects are often quite unrealistic. Factoring in time to create proofs of concept, build a team and a knowledge base, complete pilots, and (most importantly) show value at each step has proven critical for the successful adoption of synthetic data. 

At ING, we approached this methodically, starting by generating synthetic test data for a proof of concept and following up with 3 pilots. By our second year using these tools, we had adopted them across 10+ applications. By our third year, we had expanded to 20+ applications and started to deploy a platform that will enable further scaling. We are now starting a new use case: addressing training data shortages by generating synthetic data for AI models. Because this project has much in common with our original software testing use case, it builds naturally on that foundation.

Lesson 4: Focus on scaling one use case; it helps build a more robust, bigger vision while generating value.

Due to the broad applicability of synthetic data, one common issue we see is attempting to develop a roadmap and execution strategy for all the use cases at once. This is detrimental in many ways. Different use cases will have different technology requirements, value assessments and stakeholders. If you're using synthetic data to provide test data for software applications, the synthetic data must match the formats, and data generation must happen quickly. In contrast, if you'll be using it to provide training data for machine learning, some format requirements can be relaxed—but context and semantics in the data should be preserved. More importantly, planning for everything at once delays value generation from this technology. Instead, we focused on scaling one use case (providing data for testing software applications) before moving on to a second (alleviating a lack of training data). Even within the first use case, it is beneficial to do an effort-impact analysis so that the highest-value applications can be captured right away. Being deliberate about this strategy also helps in achieving the bigger goal: enterprise-wide adoption. For instance, recognizing that different use cases come with varying requirements for synthetic data helped us identify which requirements they had in common, and in turn build a platform that could support many use cases.

Lesson 5: Focus less on sophisticated new AI methods and benchmarks, and more on handling real data complexity and human integration.

The AI community has exploded with researchers publishing techniques that promise a few percentage points' improvement on benchmark datasets. In complex, real-world enterprise scenarios, though, such improvements generally provide very little actual value. We have seen this play out in predictive AI, generative AI and so many other AI technologies. In our experience, real success depends on an AI technique’s ability to model the complexity of enterprise data—the sprawl of data types, tables, and hidden context that metadata does not capture—along with how well people are able to embrace the changes that come with this new approach. 

Lesson 6: Three common technological hurdles that come up during implementation 

Many consumer and enterprise applications use software as a service (SaaS). However, because training a synthetic data model requires access to sensitive data, the model must be developed within an enterprise's firewall. This puts the SaaS model at odds with this technology’s adoption. A synthetic data generator must be created on premises or within a private cloud, which can be difficult for vendors to grapple with.

Another issue is that much of the data that this AI technique needs in order to mature is behind enterprise walls. (This is a different situation than with images, videos, and language, for which an immense amount of data is publicly available.) To be enterprise-ready, AI techniques need to see complex, real data, but for enterprises to allow a technique to be adopted on their data, it needs to be ready… a true catch-22. 

To mitigate this, we have found it helpful to use a tool with an open-source or open-core component. An open-core component allows enterprises to freely use the tool and report any issues, without releasing private data. This pushes the tool to improve and mature until it is enterprise-ready and trustworthy.

Another significant hurdle in successful deployment is the lack of “enough” data to train the AI model. In some cases, like market research, there may be only 1,000 rows of data. In other cases, it's not possible to provide the model with all the available private data. While some AI techniques are data-hungry, others surprisingly are not. Scalable adoption will require AI techniques that can train on minimal data and still produce useful synthetic data.

Conclusion: This new generative AI technology is transformative; you can derive value today and grow with it.

As has been widely reported, analysts at Gartner have estimated that synthetic data will completely overshadow real data in AI models by 2030, and that the technology will mature in the next 3 to 5 years. Its potential is so large that, even though it has not yet reached peak maturity, it can already create substantial value. Many enterprises can start by focusing on low-hanging fruit, such as testing software applications with a large volume of realistic synthetic data. Early adoption will put you in a great position to shape these tools with your specific context, inspire the creation of necessary modules, or otherwise inform their development, pre-empting integration challenges down the road. Starting early also allows you to begin building knowledge and an in-house playbook, and to get a head start on the experimentation necessary to maximize this technology's use. And as the technology matures, you are already adding value—finally able to unlock the potential of all of your data.
