What role can synthetic data play in high-stakes situations like banking fraud? In a paper in Scientific Data, a Nature Portfolio journal, a team from Spar Nord Bank and Aarhus University describes how they used the Synthetic Data Vault (SDV) to develop the world's first synthetic dataset that can help banks fight money laundering without compromising customer privacy. What's more, they show how they proved performance transferability, a key indicator that synthetic data is up to the task. Read on to find out how.
The problem: Money laundering is rampant — but bank data is private
Money laundering is a widespread and expensive issue: according to FBI estimates, cyber-enabled investment fraud cost the U.S. $3.3 billion in 2022. To counteract it, banks do their best to monitor transactions and report suspicious ones to the authorities, a practice called anti-money laundering, or AML. They usually do this using relatively simple rule-based techniques, often with oversight from human officers.
Meanwhile, fraud detection tools are advancing at light speed. Advanced machine learning, graph analytics, and deep neural networks provide stakeholders across sectors with new ways to identify suspicious patterns. If banks could work with the researchers who specialize in these techniques, AML could become vastly more sophisticated and effective. Even better, it could continually improve — meaning banks could keep ahead of criminals, who are always finding new ways to evade detection.
But there's one problem: Bank data is highly confidential. Banks are not able to share their data with research teams at all, which makes the task of developing better AML tools quite difficult.
The solution: Create a synthetic dataset that can be shared with experts
This challenge intrigued a team in Denmark, made up of experts from Aarhus University and Spar Nord Bank, one of the country's largest banks. Because privacy issues prevent outside researchers from using real bank data to improve AML tools, the group decided to see whether they could use synthetic data for this purpose instead.
The team used the Synthetic Data Vault (SDV) to build a synthetic dataset, called SynthAML, designed specifically for testing new AML tools. They trained their model on 20,000 real AML alerts and associated data from Spar Nord. This was possible because some team members are employees of the bank, and because they promised never to release the real data. Then, with the help of some pre- and post-processing, they used the model to sample large numbers of synthetic alerts and transactions.
This was the first attempt to create truly synthetic data for use in AML — that is, data with the same statistical makeup as the real data. Previously, a group from IBM released a data simulator specifically for AML (called AMLSim), and another group put out a financial simulator called PaySim for the same purpose, but both of these were made with rule-based engines rather than more complex modeling software like the SDV.
The test: Can synthetic data effectively test real AML tools?
Now the team had what they wanted: on-demand synthetic transaction data that could be shared with researchers working on AML techniques, without the risk of compromising privacy. But would the synthetic data actually help these tools get better? To find out, the team needed to test performance transferability. That is, they needed to know whether AML tools that perform well when tested on the synthetic data also perform well when tested on real data.
To figure this out, the researchers went through a technical validation process. For a baseline, they both trained and tested different models on the real data. Then they trained models on the synthetic data and tested them on the real data. After a battery of tests, they had their conclusion: "the better a given model performs on the synthetic test data, the better it also performs on the real test data," they write, showing effective performance transferability. Success!
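To make the idea of performance transferability concrete, here is a minimal Python sketch of one simple way to check it: compare how a set of models ranks on synthetic test data with how the same models rank on real test data, using a Spearman rank correlation. The model names and scores below are invented purely for illustration and are not results from the paper, and the paper's own validation procedure may differ from this sketch.

```python
# Hypothetical evaluation scores, invented for illustration only.
# They are NOT results from the SynthAML paper.
synthetic_scores = {
    "logistic_regression": 0.61,
    "random_forest": 0.70,
    "gradient_boosting": 0.74,
    "neural_net": 0.67,
}
real_scores = {
    "logistic_regression": 0.58,
    "random_forest": 0.66,
    "gradient_boosting": 0.71,
    "neural_net": 0.69,
}


def ranks(scores):
    """Map each model name to its rank (1 = best score). Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score dicts with the same keys and no ties."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * squared_diffs / (n * (n**2 - 1))


rho = spearman(synthetic_scores, real_scores)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value close to 1 would mean that models ranking well on the synthetic data also rank well on the real data, which is the pattern the quote above describes. A value near 0 would suggest that results on the synthetic data say little about real-world performance.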
These results indicate that SynthAML "will be a tool that can actually improve real-world environments," says author Rasmus Ingemann Tuffveson Jensen. "There's great potential."
Takeaway: Creating sandboxes to accelerate improvement
We run into this again and again: scenarios where data is powerful, but limited. In the case of AML, the limitation is customer privacy. In others, the data may be inaccessible for other reasons: because it's proprietary, difficult to collect, or simply represents rare events. In all these cases, synthetic data can help make up for these limitations, creating an accessible space where research is empowered and progress can continue. To create a synthetic dataset from your own data, you can follow what Spar Nord Bank did and try out our freely available, battle-tested software: SDV Community.
RELATED LINKS
Paper: “A synthetic data set to benchmark anti-money laundering methods”
https://www.nature.com/articles/s41597-023-02569-2
Rasmus Ingemann Tuffveson Jensen https://dl.acm.org/profile/99660720246
