Welcome to BaseTransformer

October 15, 2025

Hello readers! I'm excited to introduce our new blog, BaseTransformer. Here we'll take you behind the scenes and show you how we develop our reliable, customer-centric, and extremely popular synthetic data platform - the Synthetic Data Vault (SDV). We'll share what drives our thinking, the challenges we've overcome, and the core values we stick to in an industry that's full of ups and downs.

The backstory that helped us formulate our core values

In late 2020, we launched the Synthetic Data Vault (or SDV), an AI software platform people can use to train a generative model and create realistic but private synthetic data. Soon after that, our core team formed the startup DataCebo to continue this work.

SDV was one of the only synthetic data generation platforms focused on relational, time series, and tabular data. We launched to tremendous interest from open source users and enterprises. We formed commercial relationships, supporting our customers as they created synthetic data to test software applications and train machine learning models. But by August 2021, we realized that the research code we had been working with would not get us to our goal of becoming a commercial-grade, enterprise-ready platform.

So early in 2022, Neha Patki, my co-founder and DataCebo's Head of Product, called a meeting with me and Andrew Montanez, our Head of Engineering. When Neha was at MIT, she invented the SDV's core algorithms; now, she presented us with a proposal for updating our backbone library, which we call “Reversible Data Transforms,” or RDT.

This proposal was simple, elegant, and yet transformative. Reversible Data Transforms use metadata to transform the input data into a different space, so that generative models can learn from the transformed data. When you sample, the model's output is reverse-transformed back to the original space to create realistic synthetic data. Neha's proposal focused on basic principles and the essential but highly undervalued craft of easy-to-use API design. One such principle: "we should not have conditional arguments," that is, situations where a function has an argument whose assigned value determines whether or not other arguments exist. Conditional arguments had made every function feel like a puzzle.
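To make these two ideas concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and written only for illustration: `ToyCategoryTransformer`, `fill_missing`, and the two `fill_missing_with_*` helpers are assumed names, not the RDT or SDV API. The first part shows the fit, transform, and reverse-transform round trip on a single categorical column. The second part contrasts a function with a conditional argument against an alternative where every argument always applies.

```python
# Toy illustration only: these classes and functions are hypothetical
# and are not the RDT or SDV API.


class ToyCategoryTransformer:
    """Reversibly maps category labels to integer codes."""

    def __init__(self):
        self._to_code = {}
        self._to_label = {}

    def fit(self, values):
        # Learn a stable label <-> code mapping from the real data.
        for code, label in enumerate(sorted(set(values))):
            self._to_code[label] = code
            self._to_label[code] = label
        return self

    def transform(self, values):
        # Move labels into a numeric space that a model can work with.
        return [self._to_code[v] for v in values]

    def reverse_transform(self, codes):
        # Map numeric values back to the original label space.
        return [self._to_label[c] for c in codes]


real = ["gold", "silver", "gold", "bronze"]
transformer = ToyCategoryTransformer().fit(real)
encoded = transformer.transform(real)
sampled = [2, 0, 1]  # stand-in for values a trained model might sample
synthetic = transformer.reverse_transform(sampled)


# Conditional-argument style: `fill_value` only matters when
# strategy == "constant", so the caller has to work out which arguments apply.
def fill_missing(values, strategy="mean", fill_value=None):
    if strategy == "constant":
        return fill_missing_with_constant(values, fill_value)
    if strategy == "mean":
        return fill_missing_with_mean(values)
    raise ValueError(f"unknown strategy: {strategy}")


# Alternative: one function per behavior, so every argument always applies.
def fill_missing_with_mean(values):
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def fill_missing_with_constant(values, fill_value):
    return [fill_value if v is None else v for v in values]
```

In the first part, reversing the encoded column recovers the original labels exactly, which is the property that lets a model train in the transformed space. In the second part, the two single-purpose functions have signatures that tell the caller everything they need, while `fill_missing` accepts a `fill_value` that is silently ignored unless `strategy` is `"constant"`.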

We all agreed that this was the right move. Our mission became: “Making the process of building generative AI models for tabular data more transparent, adaptable and controllable for enterprise.”

It was a massive bet. We spent almost a year rearchitecting RDT (and SDV), redoing the APIs, and fundamentally rethinking how synthetic data is generated. We did all this while serving our open source community, running our business and helping our customers, and worrying about whether our competitors would leapfrog us. In November 2022, after months of incorporating Neha and Andrew’s ideas, we released SDV 1.0.

After the release, we were anxious. We knew deep down that this would be transformative. But how and when would the effects show up?

For our customers who were on the previous version of the platform, we were ready to give whatever help they needed to transition their projects and applications to SDV 1.0. We prepared a roadmap and estimated that the transition process for a customer with six applications would take one quarter.

Early evidence that this massive gamble paid off!

In early 2023, at a talk at UCLA, a team from Amazon approached me. They mentioned that prior to SDV 1.0, they had been writing scripts around SDV to fill in gaps or find workarounds. “With your recent release of SDV 1.0, all those scripts are now redundant,” they said.

Soon after, they sent me an email:

“We had previously experimented with SDV and have developed a number of extensions to make it work better for our data and use cases. We are happy to see that some of the very needed changes were introduced in the recently released version 1.x. Would you or someone from the team have time and bandwidth to meet with us over Chime and introduce all the improvements and benefits of SDV 1.x?”

This was a great moment - an early indicator that our team's major gamble had paid off, probably more than we had imagined.

But what about our customers who needed to transition to SDV 1.0? It turned out we didn't have to worry. One day, we learned that an engineer at one of our customers had built a new project on SDV 1.0 and found it so easy to understand that he decided to migrate a previous project on his own. That was easy enough that he migrated all of the customer’s projects within a week. What we thought would take three months with our help took less than a week and required no involvement from us!

Today, tons of synthetic data platforms depend on Reversible Data Transforms — including NVIDIA's

Today, RDT is the backbone of many synthetic data generation platforms, both ours and others'. It has become one of the most used libraries in this space, with roughly 9 million total downloads and an average of 300,000 new ones every month.

NVIDIA, one of the most influential companies in the world, uses a tabular synthetic data generation platform that depends on RDT and SDV. This gamble we took is benefitting the larger ecosystem of synthetic data generation, and we are proud to be leading it.

Launching BaseTransformer

This post is the first in a new blog called “BaseTransformer” (a play on the name of the base class in RDT). In this blog, we will share our engineering practices, our culture, our core values, and how we built the tools that have become the backbone of the industry. We maintain the most comprehensive synthetic data platform, make consistent releases, and count several Fortune 500 companies among our clients. We build reliability bots and benchmarking systems to support our platform, and we still follow the API principles laid out by Neha and Andrew.

We'll cover all of this here. But we'll start by saying that we hope many of the teams building AI systems who find themselves in the same predicament we faced in 2021 will take the gamble and build reliable, user-friendly systems.

On a personal note — I could not be prouder of my team for teaching me the discipline and craftsmanship that it takes to build something the world can really use. And most of all, for helping me understand what the heck Steve Jobs meant when he said this:

“There’s just a tremendous amount of craftsmanship in between a great idea and a great product. Designing a product is keeping 5000 things in your brain… continuing to push them together in new and different ways to get what you want. And every day you discover something new — that is, a new problem, or a new opportunity to fit these things together a little differently. It's that process that is the magic.” Steve Jobs
