Interpreting the Progress of CTGAN

20 December, 2022

Santiago Gomez Paz

This article was researched by Santiago Gomez Paz, a DataCebo intern. Santiago is a Sophomore at BYU and an aspiring entrepreneur who spent his summer learning and experimenting with CTGAN.

The open source SDV library offers many options for creating synthetic data tables. Some of the library's models use tried-and-true methods from classical statistics, while others use newer innovations like deep learning. One of the newest and most popular models is CTGAN, which uses a type of neural network called a Generative Adversarial Network (GAN).

Generative models are a popular choice for creating all kinds of synthetic data – for example, you may have heard of OpenAI's DALL-E or ChatGPT tools, which use trained models to create synthetic images and text respectively. A large driver behind their popularity is that they work well — they create synthetic data that closely resembles the real deal. But this high quality often comes at a cost.

Generative models can be resource-intensive. It can take a lot of time to properly train one, and it's not always clear whether the model is improving much during the training process.

In this article, we'll unpack this complexity by performing experiments on CTGAN. We'll cover –

  • A high-level explanation of how GANs work
  • How to measure and interpret the progress of CTGAN
  • How to confirm this progress with more interpretable, user-centric metrics

Since the library is open source, you can see and run the code yourself with this Colab Notebook.

How do GANs work?

Before we begin, it's important to understand how GANs work. At a high level, a GAN is an algorithm that makes two neural networks compete against each other (thus the label “Adversarial”). These neural networks are known as the generator and the discriminator, and they each have competing goals:

  • The discriminator's goal is to tell real data apart from synthetic data
  • The generator's goal is to create synthetic data that fools the discriminator

The setup is illustrated below.

The generator is a neural network that creates synthetic data. In this case, it creates a table describing the names of different people, along with their heights and ages. The discriminator is an adversarial network that tries to tell these synthetic people apart from the real ones.
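To make this concrete, here's a toy sketch of the two networks in PyTorch. It's an illustration of the competing roles only – not CTGAN's actual architecture, which adds conditional vectors, mode-specific normalization and other machinery – and the layer sizes here are arbitrary.

import torch
from torch import nn

LATENT_DIM = 16  # size of the random noise vector the generator starts from
DATA_DIM = 3     # e.g. an encoded name, a height and an age

# The generator turns random noise into a synthetic data row
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, DATA_DIM),
)

# The discriminator turns a data row (real or synthetic) into a single
# score: higher means "this looks real"
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

noise = torch.randn(1, LATENT_DIM)
fake_row = generator(noise)      # one synthetic "person"
score = discriminator(fake_row)  # how real does it look?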

This setup allows us to measure – and improve – both neural networks over many iterations by telling them what they got wrong. Each of these iterations is called an epoch, and CTGAN tracks inaccuracies as loss values. Each neural network tries to minimize its loss value in every epoch.

The CTGAN algorithm calculates loss values using a specific formula that can be found in this discussion. The intuition behind it is shown below.

As shown by the table, lower loss values – even if they are negative – mean that the neural networks are doing well.
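To get a feel for the signs, here is the core of the WGAN-style loss that CTGAN builds on, computed on hypothetical discriminator scores. (The gradient penalty and CTGAN's extra conditional loss term are left out for clarity – see the linked discussion for the full formula.)

import numpy as np

# Hypothetical discriminator scores; higher means "looks more real"
scores_real = np.array([0.9, 0.8, 0.85])  # scores given to real rows
scores_fake = np.array([0.1, 0.2, 0.15])  # scores given to synthetic rows

# The discriminator wants real rows to score high and synthetic rows to
# score low, so its loss becomes negative as it separates the two
loss_d = scores_fake.mean() - scores_real.mean()  # -0.70

# The generator wants its synthetic rows to score high, so its loss
# becomes negative as the discriminator is fooled
loss_g = -scores_fake.mean()  # -0.15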

As the epochs progress, we expect both neural networks to improve at their respective goals – but each epoch is resource-intensive and takes time to run. A common goal is to find the right tradeoff between the improvement achieved and the resources used.

Measuring progress using CTGAN

The open source SDV library makes it easy to train a CTGAN model and inspect its progress. The code below shows the steps. We train CTGAN using a publicly available SDV demo dataset named RacketSports, which stores various measurements of the strokes that badminton and squash players make over the course of a game.

from sdv.demo import load_tabular_demo
from sdv.tabular import CTGAN

# Load the demo data along with its metadata
metadata, real_data = load_tabular_demo('RacketSports', metadata=True)
table_metadata = metadata.to_dict()

# Train for 800 epochs, printing the loss values as we go
model = CTGAN(table_metadata=table_metadata, verbose=True, epochs=800)
model.fit(real_data)

As part of the fitting process, CTGAN trains the neural networks for multiple epochs. After each epoch, it prints out the epoch number, the generator loss (G) and the discriminator loss (D). Keep in mind that lower numbers are better – even if they are negative. An example is shown below.

Epoch 1, Loss G:  1.0435,Loss D: -0.1401
Epoch 2, Loss G:  0.4489,Loss D: -0.1455
Epoch 3, Loss G:  0.4756,Loss D: -0.0956
Epoch 4, Loss G:  0.3902,Loss D:  0.0344
Epoch 5, Loss G:  0.0912,Loss D:  0.3030
...

To see how the neural networks are improving, we plot the loss values for every epoch. The results from our experiment are shown in the graph below.

A graph of the GAN's progress over time. The generator loss is shown in blue, while the discriminator loss for the same epoch is shown in red.
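If you'd like to reproduce a plot like this, note that the SDV version used here prints the losses rather than returning them. One option is to capture and parse the training log while re-running the fit – a minimal sketch under that assumption, where the regex simply matches the log format shown above:

import io
import re
from contextlib import redirect_stdout

import matplotlib.pyplot as plt

# Capture the verbose training log that model.fit() prints to stdout
log = io.StringIO()
with redirect_stdout(log):
    model.fit(real_data)

# Pull the epoch number and both loss values out of each log line
pattern = r'Epoch (\d+), Loss G:\s*(-?[\d.]+),\s*Loss D:\s*(-?[\d.]+)'
rows = re.findall(pattern, log.getvalue())
epochs = [int(epoch) for epoch, _, _ in rows]
loss_g = [float(g) for _, g, _ in rows]
loss_d = [float(d) for _, _, d in rows]

plt.plot(epochs, loss_g, color='blue', label='Generator loss')
plt.plot(epochs, loss_d, color='red', label='Discriminator loss')
plt.xlabel('Epoch')
plt.ylabel('Loss value')
plt.legend()
plt.show()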

Based on the characteristics of this graph, it's possible to deduce how the GAN is progressing.

Interpreting the loss values

The graph above may seem confusing at first glance: Why is the discriminator's loss value oscillating around 0 if it is supposed to improve (minimize and become negative) over time? The key to interpreting the loss values is to remember that the neural networks are adversaries. As one improves, the other must also improve just to keep its score consistent. Here are three scenarios that we frequently see:

  1. Generator loss is slightly positive while discriminator loss hovers around 0. This means that the generator is producing poor-quality synthetic data while the discriminator is blindly guessing what is real vs. synthetic. This is a common starting point, where neither neural network has optimized for its goal.
  2. Generator loss is becoming negative while the discriminator loss remains around 0. This means that the generator is producing better and better synthetic data. The discriminator is improving too, but because the synthetic data quality has increased, it is still unable to clearly differentiate real vs. synthetic data.
  3. Generator loss has stabilized at a negative value while the discriminator loss remains around 0. This means that the generator has optimized, creating synthetic data that looks so real, the discriminator cannot tell it apart.

It is encouraging to see that the general pattern for the RacketSports dataset is similar to a variety of other datasets. These are shown below.

The generator and discriminator loss values for a variety of other datasets all follow the same learning pattern. The dataset names are shown in bold. They can be downloaded from the SDV demo module.

Of course, other patterns may be possible for different datasets. But if the loss values are not stabilizing, watch out! It would indicate that the neural networks are not able to effectively learn the patterns in the real data.

Metrics-Powered Analysis

You may be wondering whether to trust the loss values. Do they indicate a meaningful difference in synthetic data quality? To answer this question, it's helpful to create synthetic data sets after training the model for different numbers of epochs, and assess the quality of the data sets.

# Create as many synthetic rows as there are real rows
NUM_SYNTHETIC_ROWS = len(real_data)

synthetic_data = model.sample(num_rows=NUM_SYNTHETIC_ROWS)

It is important to select a few key metrics to get a quantifiable quality measure. For our experiments, we chose 4 metrics from the open source SDMetrics library, among them KSComplement, CorrelationSimilarity and CategoricalCoverage.

Each metric produces a score ranging from 0 (worst quality) to 1 (best quality). In the example below, we use the KSComplement metric on a numerical column in the RacketSports dataset.

from sdmetrics.single_column import KSComplement

NUMERICAL_COLUMN_NAME = 'dim_2'

# Compare the shapes of the real and synthetic distributions for one column
score = KSComplement.compute(
    real_data[NUMERICAL_COLUMN_NAME],
    synthetic_data[NUMERICAL_COLUMN_NAME])
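To see how quality changes with training time, one straightforward (if compute-heavy) approach is to train a fresh model for each epoch count and score its synthetic output. Here's a sketch of that experiment; the checkpoints of 10, 100 and 500 epochs mirror the ones we use later in this article.

from sdv.tabular import CTGAN
from sdmetrics.single_column import KSComplement

scores = {}
for num_epochs in [10, 100, 500]:
    # Train a fresh model at each checkpoint, then sample as many
    # synthetic rows as there are real rows
    model = CTGAN(table_metadata=table_metadata, epochs=num_epochs)
    model.fit(real_data)
    synthetic_data = model.sample(num_rows=len(real_data))

    scores[num_epochs] = KSComplement.compute(
        real_data[NUMERICAL_COLUMN_NAME],
        synthetic_data[NUMERICAL_COLUMN_NAME])

print(scores)  # in our experiment: roughly {10: 0.74, 100: 0.89, 500: 0.91}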

Our results validate that the scores do, indeed, correlate with the loss value from the generator: The quality improves as the loss is minimized. Some of the metrics – such as CorrelationSimilarity and CategoricalCoverage – are high to begin with, so there is not much room to improve. But other metrics, like KSComplement, show significant improvement. This is shown in the graph below.

A comparison of loss values and the KSComplement metric. The two are linked: Lower generator loss (blue) corresponds to higher quality scores (green).

It's also possible to visualize the synthetic data that corresponds to a specific metric. For example, KSComplement compares the overall shape of a real and a synthetic data column, so we can visualize it using histograms.

from sdmetrics.reports import utils

# Plot the real column's distribution against the synthetic column's
fig = utils.get_column_plot(
    real_data,
    synthetic_data,
    column_name=NUMERICAL_COLUMN_NAME,
    metadata=table_metadata)
fig.show()

Three histograms were created after training CTGAN for 10, 100 and 500 epochs on the RacketSports dataset. We plotted the dim_2 column. The real data (gray) doesn't change, but the synthetic data (green) improves with more epochs. The KSComplement metric measures the similarity: 0.74, 0.89 and 0.91 (left to right).

Overall, we can conclude that the generator and discriminator losses correspond to the quality metrics that we measured – which means we can trust the loss values, as well as the synthetic data that our CTGAN created!

Conclusion

In this article, we explored the improvements that the CTGAN model makes as it iterates over many epochs. We started by interpreting the loss values that each of the neural networks – the generator and the discriminator – reports over time. This helped us reason about how they were progressing. But to fully trust the progress of our model, we then turned to the SDMetrics library, which provides metrics that are easier to interpret. Using this library, we could verify whether the reported loss values truly resulted in synthetic data quality improvements.

This may lead us to a new, potential feature: What if we integrated these easily interpretable, user-centric metrics into CTGAN's training process? This feature would allow you to specify the exact metrics you'd like to optimize upfront – for example, KSComplement. In addition to the generator and discriminator loss, CTGAN could then report a snapshot of this metric after every epoch. A hypothetical example is shown below.

model = CTGAN(
    table_metadata=table_metadata,
    verbose=True,
    epochs=800,
    optimization_metric='KSComplement',
    optimization_column='dim_2')

model.fit(real_data)

Epoch 1, Loss G: 1.0435, Loss D: -0.1401, KSComplement: 0.7832
Epoch 2, Loss G: 0.4489, Loss D: -0.1455, KSComplement: 0.7671
Epoch 3, Loss G: 0.4756, Loss D: -0.0956, KSComplement: 0.7664
…
Epoch 200, Loss G: -2.542, Loss D: 0.0002911, KSComplement: 0.92391

Such a feature would provide more transparency into CTGAN's learning process, and would let you stop training your models as soon as the metrics are high enough.

What do you think? If you're interested in exploring the inner workings of CTGAN and optimizing your synthetic data, drop us a comment below!
