What Is AI Model-Generated Synthetic Data?

Kalyan: Wim, over the years we've discussed how the term "synthetic data" is often conflated with other concepts. Technically, synthetic data is data generated by computer software or algorithms. Would you say that what is new is being able to use generative AI software to create synthetic data?

Wim: Yeah — for all our discussions, when I say synthetic data, I'm already assuming that an AI model is trained and the data is generated from the model.

Kalyan: That is the key. An AI model is trained on real data, and synthetic data is generated from the model.

Broadly speaking, the definition of synthetic data is "data generated by an algorithm." That could be any algorithm, so I can see why the concepts get conflated.

So if you use generative AI to model the real data and sample from it, well, that's an algorithm. Or you can design an algorithm to create a certain type of data. This algorithm will detect the data type for a column. Once it finds that the data type is phone numbers, maybe it checks whether the phone numbers are from the US. If the answer is yes, then it will generate more phone numbers from the US and distribute them equally across various regions. That's basically a simple transformer or data creator, if you will. It looks at a column and its type, determines some properties of that data, and then generates more of it, right?

Wim: Yeah. It's not learned from the real data, it's more rule-based. But in a very generalized way. All phone numbers look the same regardless of the dataset. We don't specify very precisely how to generate the data.

Kalyan: And then the second level is, you can look at a column and say, "I'm going to write specific rules for this particular column." Like: I want 30% of my numbers to be from Boston with a 617 area code, and I want 30% to be from New York, and then the remaining 40% to be created equally from all possible area codes. You can write a specific rule like that. But once you write that rule, data generated from that rule is technically still synthetic. It's not AI-generated synthetic data, but it is synthetic data. Some rules are very general, and some rules have very fine-grained specifications. I guess, Wim, the obvious question would be, why wouldn't those rules work well enough? Why do we need AI-generated synthetic data?

Wim: It's the learning versus teaching kind of thing. In one case, you're teaching the machine by explicitly writing rules and algorithms to generate data. An AI model is able to learn most of the patterns from the data automatically. I think the teaching is not scalable. There is no way a user can specify with rules all the data patterns they need, because there are just too many rules. Not only are there too many rules, we humans are not aware of a lot of them, or the experts who might know them are not available to us. We could do a little demonstration of this by letting people come up with all the rules. We would probably see that the number of rules very quickly gets out of control. The rules themselves might also get very complicated. And even so, we would probably miss a lot of important ones. So that, for me, is the advantage of learning directly from the data: learning versus teaching, let's say.
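
To make the "teaching" side concrete, here is a minimal sketch of the fine-grained rule Kalyan describes: 30% Boston, 30% New York, and the remaining 40% spread evenly over all area codes. It is an illustration only, not code from the discussion. The function name, the shortened area code list, and the use of 212 for New York are all assumptions.

```python
import random

# Hypothetical, shortened pool of area codes; a real generator would use a full list.
ALL_AREA_CODES = ["212", "305", "312", "415", "617", "713"]

def generate_phone_numbers(n, seed=None):
    """Generate n US-style phone numbers from one hand-written rule."""
    rng = random.Random(seed)
    numbers = []
    for _ in range(n):
        r = rng.random()
        if r < 0.30:
            area = "617"   # 30% Boston
        elif r < 0.60:
            area = "212"   # 30% New York (212 is an assumed choice)
        else:
            area = rng.choice(ALL_AREA_CODES)  # 40% uniform over all codes
        exchange = rng.randint(200, 999)
        line = rng.randint(0, 9999)
        numbers.append(f"({area}) {exchange}-{line:04d}")
    return numbers

sample_numbers = generate_phone_numbers(5, seed=42)
```

Every pattern in the output had to be written down by a person. Nothing here relates the phone number to any other column, which is the kind of rule that multiplies quickly in Wim's argument.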

What can AI model-generated synthetic data achieve?

Automatically learn millions or billions of rules from your data. When you train a generative model on your real data, it can automatically learn patterns, connections, correlations and so much more. It's not possible for humans to write all these rules, or as Wim elegantly put it, to “teach the machine” to emulate all these patterns. We've found that trying quickly becomes impossible when you have hundreds (or even tens) of tables and thousands of columns across those tables.

Generating synthetic data that emulates the patterns of your real data is a must for software testing and quality assurance. With a generative model, you can automatically sample synthetic data that emulates the distribution of your real data. This is a must-have for providing accurate test data and for replacing data subsetting and anonymization-based solutions.
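
The "learning" side can be sketched in the same spirit. The toy example below is a deliberately simple stand-in for a generative model: it estimates how often each city appears and how plans are distributed within each city, then samples new rows from those estimates. The column names, the data, and the function names are hypothetical, and a real generative model would capture far richer structure than these frequency tables.

```python
import random
from collections import Counter, defaultdict

def fit(rows):
    """Estimate P(city) and P(plan | city) from observed (city, plan) rows."""
    city_counts = Counter(city for city, _ in rows)
    plan_counts = defaultdict(Counter)
    for city, plan in rows:
        plan_counts[city][plan] += 1
    return city_counts, plan_counts

def sample(model, n, seed=None):
    """Draw n synthetic (city, plan) rows from the fitted frequencies."""
    city_counts, plan_counts = model
    rng = random.Random(seed)
    cities = list(city_counts)
    city_weights = [city_counts[c] for c in cities]
    rows = []
    for _ in range(n):
        city = rng.choices(cities, weights=city_weights)[0]
        plans = list(plan_counts[city])
        plan_weights = [plan_counts[city][p] for p in plans]
        plan = rng.choices(plans, weights=plan_weights)[0]
        rows.append((city, plan))
    return rows

# Toy "real" data (made up for illustration).
real_rows = (
    [("Boston", "premium")] * 60
    + [("Boston", "basic")] * 20
    + [("New York", "basic")] * 20
)
model = fit(real_rows)
synthetic_rows = sample(model, 1000, seed=0)
```

No one wrote a rule saying that premium plans appear only in Boston; the sampler reproduces that pattern because it was present in the data it was fitted on.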

In closing

In this first installment of our new series, I wanted to share the definition of AI model-based generation of synthetic data. In subsequent discussions, we'll cover how this type of data can help transform your test data generation workflows while working hand in hand with your traditional test data management solutions.

You may also wonder — if AI model-based synthetic data generation is so powerful, why isn't it everywhere? Some vendors who initially attempted to use AI-generated synthetic data for testing quickly stalled out. They couldn’t achieve advanced enough generative modeling (and generative AI-based solutions) to produce high-quality, constraint-compliant data for complex scenarios like multi-table structures. Instead of continuing to innovate—much like how language models evolved over time—they reverted to traditional methods, such as subsetting, copying, masking, or manually crafting data generators for each data type (e.g., phone numbers, URLs) and labeling that as “synthetic data.”

Some of these vendors now claim that AI-generated data is only suitable for model training, not testing. But that’s not true — properly developed generative models can produce robust test data that overcomes the limitations of classical test data management. We aim to dispel this misconception.
