Busted! Engineers Revolutionize Fraud Detection with Machine Learning
In the U.S., credit card fraud costs $5 billion annually, identity theft adds $16.4 billion, and Medicare fraud drains $60 billion each year.
Fraud is widespread in the United States and increasingly driven by technology. For example, 93% of credit card fraud now involves remote account access, not physical theft. In 2023, fraud losses surpassed $10 billion for the first time. The financial toll is staggering: credit card fraud costs $5 billion annually, affecting 60% of U.S. cardholders, while identity theft resulted in $16.4 billion in losses in 2021. Medicare fraud costs $60 billion each year, and government losses range from $233 billion to $521 billion annually, with improper payments totaling $2.7 trillion since 2003.
Machine learning plays a critical role in fraud detection by identifying patterns and anomalies in real time. It analyzes large datasets to learn what normal behavior looks like and flags significant deviations, such as unusual transactions or account access. However, fraud detection is challenging because fraud cases are far rarer than normal ones, and the data is often messy or unlabeled.
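As a rough illustration of flagging deviations from normal behavior, the sketch below scores a handful of made-up transaction amounts by how far each sits from the median, measured in units of the median absolute deviation (MAD). The function names, the 3.5 cutoff and the sample amounts are all assumptions chosen for the example. This is a generic outlier check, not the method described in the study.

```python
from statistics import median


def robust_z_scores(values):
    """Score each value by its distance from the median, in scaled MAD units."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; this sketch then
        # treats nothing as anomalous rather than dividing by zero.
        return [0.0 for _ in values]
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the underlying data is roughly normal.
    return [abs(v - med) / (1.4826 * mad) for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return the indices of values whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if z > threshold]


# Hypothetical transaction amounts; one is far outside the usual range.
amounts = [12.5, 40.0, 23.9, 18.2, 31.0, 27.4, 15.8, 2950.0, 22.1, 35.6]
print(flag_anomalies(amounts))  # → [7]
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the outlier cannot hide itself by inflating the scale.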
To address these challenges, researchers from the College of Engineering and Computer Science at 91社区 have developed a novel method for generating binary class labels in highly imbalanced datasets, offering a promising solution for fraud detection in industries like health care and finance. This approach works without relying on labeled data, a key advantage in sectors where privacy concerns and the cost of labeling are significant obstacles.
The team tested their method on two large-scale, real-world datasets with severe class imbalance (fraud makes up less than 0.2% of instances): more than 280,000 European credit card transactions from September 2013 and more than 5 million Medicare Part D claims from 2013 to 2019. Both datasets are labeled as fraudulent or genuine. With non-fraud cases vastly outnumbering fraud cases, they provide a realistic and demanding test for fraud detection methods.
Results of the study, published in the , show that this new labeling method effectively addresses the challenge of labeling severely imbalanced data in an unsupervised framework. Unlike traditional methods, the approach also evaluates the newly generated fraud and non-fraud labels directly, without relying on a supervised classifier.
“The use of machine learning in fraud detection brings many advantages,” said Taghi Khoshgoftaar, Ph.D., senior author and Motorola Professor in the 91社区 Department of Electrical Engineering and Computer Science. “Machine learning algorithms can label data much faster than human annotation, significantly improving efficiency. Our method represents a major advancement in fraud detection, especially in highly imbalanced datasets. It reduces the workload by minimizing cases that require further inspection, which is crucial in sectors like Medicare and credit card fraud, where fast data processing is vital to prevent financial losses and enhance operational efficiency.”
The study shows the new method outperformed the widely used Isolation Forest algorithm, providing a more efficient way to identify fraud while minimizing the need for further investigation. This confirms the method’s ability to generate reliable binary class labels for fraud detection, even in challenging datasets. It offers a scalable way to detect fraud without relying on labeled data, which is costly and time-consuming to produce because it requires significant manual input from experts, especially for large datasets.
“Our method generates labels for both fraud or positive and non-fraud or negative instances, which are then refined to minimize the number of fraud labels,” said Mary Anne Walauskis, first author and a Ph.D. candidate in the 91社区 Department of Electrical Engineering and Computer Science. “By applying our method, we minimize false positives, or in other words, genuine instances marked as fraud, which is key to improving fraud detection.
This approach ensures that only the most confidently identified fraud cases are retained, enhancing accuracy and reducing unnecessary alarms, making fraud detection more efficient.”
The method combines two strategies: an ensemble of three unsupervised learning techniques from the scikit-learn library and a percentile-gradient approach. The goal is to minimize false positives by keeping only the most confidently identified fraud cases. This is achieved by refining the labels and reducing errors in both the ensemble of unsupervised methods (EUM) and the percentile-gradient method (PGM).
The refined labels create a subset of confident labels that are highly likely to be accurate. These labels are then used to create confidence intervals and finalize the labeling, requiring minimal domain knowledge to select the number of positive instances.
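The general idea of keeping only the cases that several unsupervised detectors agree on can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' algorithm: three toy scorers written with the Python standard library stand in for the scikit-learn techniques, a fixed top-fraction cutoff stands in for the percentile-gradient step, and the data, function names and parameters are hypothetical.

```python
import math
import random
from statistics import median


def robust_z(points):
    """Largest per-feature distance from the median, in MAD units."""
    dims = range(len(points[0]))
    meds = [median(p[d] for p in points) for d in dims]
    # Fall back to 1.0 if a feature has zero MAD, to avoid dividing by zero.
    mads = [median(abs(p[d] - meds[d]) for p in points) or 1.0 for d in dims]
    return [max(abs(p[d] - meds[d]) / mads[d] for d in dims) for p in points]


def centre_distance(points):
    """Euclidean distance from the coordinate-wise median."""
    dims = range(len(points[0]))
    centre = tuple(median(p[d] for p in points) for d in dims)
    return [math.dist(p, centre) for p in points]


def knn_distance(points, k=5):
    """Distance to the k-th nearest other point (brute force, O(n^2))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_fraction(scores, fraction):
    """Indices of the highest-scoring `fraction` of points (at least one)."""
    n_flag = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_flag])


def confident_labels(points, fraction=0.02):
    """Label a point 1 only if every scorer ranks it in its top fraction."""
    scorers = (robust_z, centre_distance, knn_distance)
    votes = [top_fraction(scorer(points), fraction) for scorer in scorers]
    agreed = set.intersection(*votes)
    return [1 if i in agreed else 0 for i in range(len(points))]


# Hypothetical data: 200 ordinary points plus 3 injected outliers at the end.
rng = random.Random(42)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outliers = [(9.0, 9.0), (-8.0, 10.0), (10.0, -9.0)]
points = normal + outliers

labels = confident_labels(points)
print(sum(labels), "points labeled as confident positives")
```

Requiring agreement from every scorer trades recall for precision: a point flagged by only one detector is left in the negative class, which keeps the positive set small and reduces false alarms.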
“This innovative approach holds great promise for industries plagued by fraud, offering a more accessible and effective way to identify fraudulent activity and safeguard both financial and health care systems,” said Stella Batalama, Ph.D., dean of the College of Engineering and Computer Science. “Fraud’s impact goes beyond financial losses, including emotional distress, reputational damage and reduced trust in organizations. Health care fraud, in particular, undermines care quality and cost, while identity theft can cause severe stress. Addressing fraud is key to mitigating its broad societal impact.”
Looking ahead, the research team plans to enhance the method by automating the determination of the optimal number of positive instances, further improving efficiency and scalability for large-scale applications.
The current journal article, , is an updated version of the researchers’ previous work, Confident Labels: A Novel Approach to New Class Labeling and Evaluation on Highly Imbalanced Data. The original paper was presented and published at the IEEE 36th International Conference on Tools with Artificial Intelligence (ICTAI) in November 2024, where it won the Best Student Paper Award. ICTAI, with an acceptance rate of about 25% from more than 400 submissions, is a prestigious conference.