World of Data Series: Fraud Analytics – A Case Study in Analytic Curiosity
Albert Einstein is credited with saying, “I have no special talent. I am only passionately curious.” While most of us are not likely to make many groundbreaking or monumental discoveries that benefit all of humankind, maintaining a degree of the same passion and curiosity that fueled Einstein’s efforts is critical to success. Recently, the Analytics Team at FORVIS worked on two separate but similar projects that highlight how we used analytic curiosity to drive the methods and results of those engagements.
During the COVID-19 pandemic, both federal and state agencies attempted to provide financial relief to constituents in the form of enhanced unemployment insurance programs. While these programs offered welcome relief to those in need, they also provided an opportunity for fraudulent activity. Not all relief agencies were prepared for the influx of relief applications, and insufficient security protocols at those agencies made unemployment insurance fraud an attractive option for bad actors.
Our Analytics Team worked with two state agencies to help identify claimants who likely committed unemployment insurance fraud by identifying patterns present in the most common fraudulent claim submissions. Identifying these fraudulent patterns is still a relevant exercise, as unemployment insurance, along with insurance fraud, did not end with the pandemic.
This article’s goal is not to provide an exhaustive listing of questions we asked regarding unemployment insurance fraud, but rather to give a flavor of the approach we took while analyzing the data. Approaching the project with curiosity was key to helping provide in-depth data analysis.
We began our investigation as we do most engagements—by asking questions of increasing complexity.
- Does the provided Social Security number meet the federal government’s requirements for a valid Social Security number?
- Does the age of the claimant make sense for a worker within the workforce, or is the age above or below a given threshold?
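The two single-claimant checks above can be sketched in Python. This is a minimal illustration under stated assumptions, not the team's actual implementation. The SSN test applies only the publicly documented structural rules (the area number cannot be 000, 666, or 900-999; the group number cannot be 00; the serial number cannot be 0000). The age bounds of 16 and 80 are placeholder assumptions, since the article does not state the thresholds used, and all function names are hypothetical.

```python
import re
from datetime import date

# Accepts "123-45-6789" or "123456789"; anything else is rejected outright.
_SSN_PATTERN = re.compile(r"([0-9]{3})-?([0-9]{2})-?([0-9]{4})")


def is_structurally_valid_ssn(ssn: str) -> bool:
    """Check the structural rules only; this does not prove the SSN was issued."""
    match = _SSN_PATTERN.fullmatch(ssn.strip())
    if match is None:
        return False
    area, group, serial = match.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False
    if group == "00" or serial == "0000":
        return False
    return True


def age_on(birth_date: date, as_of: date) -> int:
    """Age in whole years on the given date."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_out_of_range(birth_date: date, as_of: date,
                     min_age: int = 16, max_age: int = 80) -> bool:
    """Flag claimants younger than min_age or older than max_age (assumed bounds)."""
    age = age_on(birth_date, as_of)
    return age < min_age or age > max_age
```

A claim that fails either check is not necessarily fraudulent; in this sketch the result is simply one flag that can feed into later scoring.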
Questions Involving Analysis Across Claimants & Expectation Setting
- Were there claims submitted by different claimants who used the same Social Security number?
- How many different claimants submitted claims that used the same email address?
- How many different claimants submitted claims that used the same routing and bank account number?
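The cross-claimant questions above all reduce to the same operation: count the distinct claimants that share a value. The sketch below shows one way to do that with the Python standard library. The record layout (a `claimant_id` key plus fields such as `email`, `routing`, and `account`) is a hypothetical assumption for illustration, not the agencies' actual schema.

```python
from collections import defaultdict


def claimants_per_value(claims, key_fields):
    """Map each observed key (a tuple of field values) to its distinct-claimant count.

    `claims` is an iterable of dicts; `key_fields` lists the fields that together
    form the shared identifier, e.g. ["email"] or ["routing", "account"].
    """
    claimant_ids = defaultdict(set)
    for claim in claims:
        key = tuple(claim.get(field) for field in key_fields)
        # Skip records where any part of the key is missing or blank.
        if any(value is None or value == "" for value in key):
            continue
        claimant_ids[key].add(claim["claimant_id"])
    return {key: len(ids) for key, ids in claimant_ids.items()}


def shared_values(claims, key_fields, min_claimants=2):
    """Keep only the keys used by at least `min_claimants` distinct claimants."""
    counts = claimants_per_value(claims, key_fields)
    return {key: n for key, n in counts.items() if n >= min_claimants}
```

For example, `shared_values(claims, ["routing", "account"])` would return every routing and account number pair used by two or more claimants, along with the size of each cohort. Counting distinct claimant IDs rather than rows keeps a single claimant with several claims from inflating the count.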
Questions Involving Data Sets Outside the Primary Data Sets
- Does the data submission method tell us anything about the legitimacy of the claim?
- Can we enrich this data using third-party data sets to tell a more comprehensive story?
- Did the email address contain or adhere to a suspicious pattern?
- Was the provided address a valid address?
- Was the residential address a short distance from the employer address—perhaps even across the street or next door?
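As one illustration of the last question, the distance between a residential address and an employer address can be estimated with the haversine formula once both addresses have been geocoded. The sketch below assumes latitude and longitude are already available (for example, from a third-party data set) and uses a 0.1 km cutoff that is purely a placeholder; the article does not say how distance was computed or what counted as "a short distance."

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def is_suspiciously_close(home, employer, threshold_km=0.1):
    """True when two (lat, lon) pairs are within the assumed threshold."""
    return haversine_km(home[0], home[1], employer[0], employer[1]) <= threshold_km
```

Living next door to an employer is not evidence of fraud on its own, so in this sketch the result is only a flag to be weighed alongside other indicators.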
A claimant who matched only one of these criteria might well have filed a valid claim, but the likelihood of fraud appeared to increase when a claimant fell into more than one of these groupings. To help quantify that assumption, we scored each claimant against the different risk categories. The score for a given risk category was either a single value (the claimant either met the criteria or did not) or a range of values based on the number of claimants in the cohort who met the criteria.
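The two kinds of category score described above can be sketched as follows. The point values and cohort-size bands are hypothetical, chosen only to show the mechanics; the article does not disclose the actual weights.

```python
def binary_score(met_criteria: bool, points: int = 1) -> int:
    """Single-value category: full points if the criteria are met, otherwise zero."""
    return points if met_criteria else 0


def cohort_score(cohort_size: int, bands) -> int:
    """Ranged category: points grow with the number of claimants sharing a value.

    `bands` is a list of (minimum_cohort_size, points) pairs in ascending order.
    The score is the points of the highest band whose minimum is reached.
    """
    score = 0
    for minimum_size, points in bands:
        if cohort_size >= minimum_size:
            score = points
    return score


# Hypothetical bands: 2-4 claimants sharing a value -> 1 point,
# 5-19 claimants -> 2 points, 20 or more -> 3 points.
EXAMPLE_BANDS = [(2, 1), (5, 2), (20, 3)]
```

With these assumed bands, a bank account used by a single claimant scores 0, while one shared by 25 claimants scores 3.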
We were fortunate to have self-reported fraud data: claimants had reported to the state agency that someone had fraudulently submitted a claim using their information. That data helped inform our methodology and increased our confidence in the analysis. To increase the probability of identifying potential fraud, we employed the following three detection methods:
We identified specific flags that, by themselves, appeared to indicate a high probability that the claim was fraudulent, and classified those claims as egregious.
We used machine learning to help identify claimants with combinations of potential fraud flags that were consistent with those associated with self-reported fraud claims.
Cumulative Risk Scoring
Aggregating the risk category scores for each claim allowed us to group the claims by cumulative score. We then applied a threshold to those scores to help identify potentially fraudulent claims.
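A minimal sketch of cumulative risk scoring, assuming each claimant already has a score per risk category. The category names, scores, and threshold in the usage note are hypothetical; in practice the threshold would be tuned against known outcomes such as the self-reported fraud data.

```python
from collections import defaultdict


def cumulative_scores(category_scores):
    """Sum each claimant's category scores.

    `category_scores` maps claimant_id -> {category_name: score}.
    """
    return {claimant_id: sum(scores.values())
            for claimant_id, scores in category_scores.items()}


def group_by_total(totals):
    """Group claimant IDs by their cumulative score."""
    groups = defaultdict(list)
    for claimant_id, total in totals.items():
        groups[total].append(claimant_id)
    return {total: sorted(ids) for total, ids in groups.items()}


def flag_by_threshold(totals, threshold):
    """Return the claimant IDs whose cumulative score meets or exceeds the threshold."""
    return sorted(cid for cid, total in totals.items() if total >= threshold)
```

For instance, given `{"C1": {"ssn": 1, "bank": 3}, "C2": {"ssn": 0, "bank": 1}}` and a threshold of 3, only `C1` (total 4) would be flagged for review.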
Combining these three methods enabled us to identify fraudulent claims more accurately while keeping the number of legitimate claims misidentified as fraudulent to a minimum. Any one of the above methods would produce a reasonably accurate listing of fraudulent claims, but by combining all three, we were able to adapt our assumptions and refine our definitions. Throughout the process, we attempted to maintain our analytic curiosity by continuing to ask questions:
- Does it make sense for the same bank account number to be used by 20 people?
- Does it make sense for the same residential address to be used by 25 people?
- Does the data seem to support our conclusions?
- Do our conclusions make sense?
Continually asking ourselves these questions and others helped keep us on track and provided additional value for our clients. Having a seasoned team that is passionately inquisitive about the message in the data, unwilling to be satisfied with the initial answer the data presents, and committed to delivering results that make sense helps produce better products and more accurate conclusions. That is the value of analytic curiosity.
If you have questions or need assistance, please reach out to a professional at FORVIS.