There are few types of data more valuable to hackers than credit card transactions. In the U.S. last year, there were more than 400,000 instances of credit card fraud, more than any other kind of identity theft.
Machine learning algorithms and artificial neural networks have shown major promise at reliably detecting instances of fraud. But the reality is that credit card fraud is statistically so rare that it can be difficult to gather enough data to build robust protections against it.
Using a model developed by the Synthetic Data Vault (SDV) team, researchers from UCLA have quantitatively demonstrated the potential of a transformative tool in the fight against fraud: synthetic data.
In a paper presented this fall at the ACM International Conference on AI in Finance in New York City, a team led by UCLA professor Guang Cheng showed that fraud detection could be dramatically improved by generating additional anonymized case data consistent with past examples of fraud, alongside consumers’ genuine transactions. When the training data was augmented so that 20 percent of all transactions were fraudulent, the researchers demonstrated that they could significantly reduce their fraud predictor’s false-positive and false-negative rates.
Credit card transactions are a form of “sequential data,” meaning that data points occur in a particular order, like a time series. Sequential data is all around us, in areas ranging from website clicks to Fitbit steps, but it can be hard to model for certain applications. Credit card fraud is particularly difficult both because it is statistically rare and because transactions occur at irregular time intervals.
With that in mind, Cheng’s Trustworthy AI Lab chose a generative model developed by the SDV team: the Conditional Probabilistic Auto-Regressive (CPAR) model, which can handle both regular and irregular multi-sequence data. The researchers augmented CPAR with techniques for preprocessing the raw data, significantly improving the synthesized transactions in both their statistical faithfulness and their usefulness for downstream tasks (referred to in the paper as data fidelity and data utility).
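CPAR is implemented in the open-source SDV library as PARSynthesizer. As a rough illustration of how such a model is trained, here is a minimal sketch; the file name, the column names (card_id, timestamp, is_fraud), and the choice to fit only on fraud-containing sequences are illustrative assumptions, not the paper’s exact pipeline.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer  # SDV's implementation of CPAR

# Hypothetical transaction table: one row per charge, many rows per card.
transactions = pd.read_csv("transactions.csv")
transactions["timestamp"] = pd.to_datetime(transactions["timestamp"])

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(transactions)
metadata.update_column(column_name="card_id", sdtype="id")
metadata.set_sequence_key(column_name="card_id")      # groups rows into per-card sequences
metadata.set_sequence_index(column_name="timestamp")  # orders events within each sequence

# One simple way to obtain synthetic fraud (an assumption, not necessarily
# the paper's procedure): fit the model only on sequences that contain fraud.
fraud_cards = transactions.loc[transactions["is_fraud"] == 1, "card_id"].unique()
fraud_sequences = transactions[transactions["card_id"].isin(fraud_cards)]

synthesizer = PARSynthesizer(metadata, epochs=128)
synthesizer.fit(fraud_sequences)
synthetic_fraud = synthesizer.sample(num_sequences=500)
```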
How generative models benefit from data-centric methods
Cheng laments that many data scientists focus far more on developing better predictive models than on data cleaning and the associated efforts that ensure good “data in” leads to good “data out.”
“Our work emphasizes the importance of thorough data preparation and pre-processing to elevate the efficacy of generative models, especially when handling intricate datasets like credit card transactions,” said Cheng.
The researchers created five preprocessing methods, each designed to iteratively enhance the fidelity of the data generated using CPAR. They consolidated the time-related fields (year, month, day, and hour), and merged merchant information (name and address) into a single categorical variable so that data is never synthesized for inconsistent or non-existent merchant locations, for example a record where the state is labeled as New York but the city as Boston.
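To make this kind of consolidation concrete, here is a minimal pandas sketch; the column names and values are hypothetical, since the paper does not publish its exact code.

```python
import pandas as pd

# Hypothetical raw columns for two transactions.
df = pd.DataFrame({
    "year": [2023, 2023], "month": [1, 1], "day": [5, 6], "hour": [14, 9],
    "merchant_name": ["Acme", "Acme"],
    "merchant_city": ["New York", "New York"],
    "merchant_state": ["NY", "NY"],
})

# Consolidate the separate time fields into a single timestamp column.
df["timestamp"] = pd.to_datetime(df[["year", "month", "day", "hour"]])

# Merge merchant name and location into one categorical value, so the
# generator can only emit (name, city, state) combinations seen in the data.
df["merchant"] = (
    df["merchant_name"] + "|" + df["merchant_city"] + "|" + df["merchant_state"]
)
df = df.drop(columns=["year", "month", "day", "hour",
                      "merchant_name", "merchant_city", "merchant_state"])
```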
Transaction amounts are also notoriously skewed: many of us use credit cards for frequent small purchases and only rarely for large ones. To address the non-Gaussianity of the transaction-amount column and generate realistic data, the team applied logarithmic and normalization transforms, which they found significantly improved the quality of the synthetic data.
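As an illustration, here is a minimal NumPy sketch of a log-plus-standardization transform of this kind; the amounts are made up, and the paper’s exact transforms may differ.

```python
import numpy as np

# Hypothetical transaction amounts in dollars (heavily right-skewed).
amounts = np.array([3.50, 12.99, 8.25, 45.00, 2200.00])

# Log transform compresses the long right tail; log1p handles zero amounts.
log_amounts = np.log1p(amounts)

# Standardize to zero mean and unit variance before fitting the generator.
normalized = (log_amounts - log_amounts.mean()) / log_amounts.std()

# After sampling, invert both transforms to recover dollar amounts.
recovered = np.expm1(normalized * log_amounts.std() + log_amounts.mean())
```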
Synthetic data generated with the CPAR model cuts the false negative rate by a factor of nearly 20
The original dataset contained thousands of transactions, of which only 0.4206 percent were fraudulent. Using CPAR, the authors generated 15,000 synthetic fraud transactions. To train machine learning models that predict fraud, the team then built five training sets of 10,000 transactions each, with fraud-to-non-fraud ratios of 1, 5, 10, 20, and 50 percent.
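A minimal sketch of how such mixed training sets might be assembled; the helper name, variable names, and sampling strategy are illustrative assumptions, not the paper’s code.

```python
import pandas as pd

def build_training_set(real_legit, synthetic_fraud, fraud_pct, size=10_000, seed=0):
    """Assemble `size` rows with `fraud_pct` percent fraud, drawing fraud rows
    from the synthetic pool and the remainder from real legitimate data."""
    n_fraud = int(size * fraud_pct / 100)
    fraud = synthetic_fraud.sample(n=n_fraud, random_state=seed)
    legit = real_legit.sample(n=size - n_fraud, random_state=seed)
    return pd.concat([fraud, legit]).sample(frac=1, random_state=seed)  # shuffle

# Hypothetical usage, mirroring the paper's five ratios:
# for pct in (1, 5, 10, 20, 50):
#     train = build_training_set(real_legit, synthetic_fraud, pct)
```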
When the fraud-to-non-fraud ratio was around 1 percent, the false negative rate started at 50 percent. By using synthetic fraud data, the team was able to increase the proportion of fraud cases in its training set, which the authors report led to near-zero false positive and false negative rates. Specifically, by increasing the fraud-to-non-fraud ratio, the team cut the false negative rate to about 3 percent, missing only about 1 in every 33 fraudulent transactions.
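For reference, both rates can be read directly off a standard confusion matrix. A short scikit-learn sketch with made-up labels (1 = fraud, 0 = legitimate):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # hypothetical ground truth
y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 0])  # hypothetical model output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fnr = fn / (fn + tp)  # fraction of real fraud the model misses
fpr = fp / (fp + tn)  # fraction of legitimate charges wrongly flagged
print(f"FNR={fnr:.2f}, FPR={fpr:.2f}")  # here: FNR=0.25, FPR=0.00
```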
Looking ahead, Cheng’s team plans to do more to test fraud-detection systems’ “adversarial robustness,” that is, their ability to counteract efforts by malicious actors to get around fraud protections. Such efforts include strategies like card testing, where fraudsters first make an extremely small transaction before moving on to larger purchases.
The project is part of a larger line of work from Cheng’s group focused on trustworthy AI, particularly the use of synthetic data to protect user privacy. Other applications include the group’s work with healthcare partners on topics such as ICU admission prediction.
“From this study, we believe synthetic data can indeed serve as a high-fidelity copy of the original data, enhancing the performance of fraud detection by generating additional fraud case data,” said Cheng.
The team presented their paper at the 2nd Workshop on Synthetic Data held as part of ACM’s conference.

