Mini Project: Detecting Fraud
The Goal
In this mini project we explore how credit approval data from a financial institution can give us insight into potential fraud in credit applications. We use a simulated dataset to see how an unsupervised neural network can find similarities among credit applications, so that we can focus on groups of customers that might require our attention. Specifically, our goal is to use historical data from previous credit applications to infer which entries might have been fraudulent, and to use this information to detect potential future fraud.
The Data
This is a fairly small dataset with 690 observations of customers from a financial institution and 14 obfuscated features, which may include information on credit score, services used, income, tenancy, and other demographics. The dataset already has historical information on credit approvals encoded in the Class column (1 for Approved and 0 for Denied). This dataset was taken from the course materials offered by Super Data Science.
CustomerID | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15776156 | 1 | 22.08 | 11.46 | 2 | 4 | 4 | 1.585 | 0 | 0 | 0 | 1 | 2 | 100 | 1213 | 0 |
15739548 | 0 | 22.67 | 7.00 | 2 | 8 | 4 | 0.165 | 0 | 0 | 0 | 0 | 2 | 160 | 1 | 0 |
15662854 | 0 | 29.58 | 1.75 | 1 | 4 | 4 | 1.250 | 0 | 0 | 0 | 1 | 2 | 280 | 1 | 0 |
15687688 | 0 | 21.67 | 11.50 | 1 | 5 | 3 | 0.000 | 1 | 1 | 11 | 1 | 2 | 0 | 1 | 1 |
15715750 | 1 | 20.17 | 8.17 | 2 | 6 | 4 | 1.960 | 1 | 1 | 14 | 0 | 2 | 60 | 159 | 1 |
Approach
To start exploring our data, we will use a Self-Organizing Map (SOM). This unsupervised neural network approach clusters the data so that we can find similarities among our customers. The map encodes the 14 characteristics of each customer and projects them onto a 2D representation. The coloring in the map represents the distance between each neuron or node (in our case, a hexagon) and its surrounding neurons. The bigger the distance (closer to 1, or red), the more dissimilar that neuron is from its neighbors. Each customer has been mapped to one of these neurons, and we have overlaid each customer's approval status: a green square for approved credits and a red circle for rejected applications.
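To make the distance coloring concrete, here is a minimal pure-Python sketch of the two ideas involved: finding the neuron closest to a customer (its best matching unit) and scoring each neuron by its average distance to the adjacent neurons. This is not the implementation used to produce the map in this project. The toy 2x2 rectangular grid, the two-feature weight vectors, and the helper names `best_matching_unit` and `neighbor_distance_map` are all assumptions made for illustration, whereas the actual map is hexagonal and encodes 14 features.

```python
import math

def best_matching_unit(weights, sample):
    """Return the (row, col) of the neuron whose weights are closest to sample."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, sample)
            if d < best_dist:
                best, best_dist = (r, c), d
    return best

def neighbor_distance_map(weights):
    """Mean distance from each neuron to its up/down/left/right neighbors,
    scaled so the most dissimilar neuron gets the value 1.0."""
    rows, cols = len(weights), len(weights[0])
    raw = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            adjacent = [
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            if adjacent:
                total = sum(math.dist(weights[r][c], weights[nr][nc])
                            for nr, nc in adjacent)
                raw[r][c] = total / len(adjacent)
    peak = max(max(row) for row in raw)
    return [[v / peak if peak else 0.0 for v in row] for row in raw]

# Hypothetical trained weights: three similar neurons and one outlier at (1, 1).
toy_weights = [
    [[0.1, 0.1], [0.2, 0.1]],
    [[0.1, 0.2], [0.9, 0.9]],
]

umatrix = neighbor_distance_map(toy_weights)
bmu = best_matching_unit(toy_weights, [0.85, 0.95])  # lands on the outlier neuron
```

In this toy grid the neuron at `(1, 1)` receives the value 1.0, which corresponds to the dark red end of the color scale described above.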
Observations
The first thing to notice is that one neuron, or section of the map, has a distance value close to 1 (dark red). This indicates that the customers mapped to this region are quite distinct from their neighbors. There are 18 customers in that section of the map, of which 16 were rejected. This suggests that the remaining 2 customers should be evaluated further to understand why their credit was approved (CustomerIDs 15692408 and 15736510).
CustomerID | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15726466 | 1 | 17.42 | 6.5 | 2 | 3 | 4 | 0.125 | 0 | 0 | 0 | 0 | 2 | 60 | 101 | 0 |
15692408 | 1 | 48.08 | 6.040 | 2 | 4 | 4 | 0.040 | 0 | 0 | 0 | 0 | 2 | 0 | 2691 | 1 |
15667451 | 1 | 39.92 | 5 | 2 | 3 | 5 | 0.210 | 0 | 0 | 0 | 0 | 2 | 550 | 1 | 0 |
15763108 | 1 | 28.67 | 14.5 | 2 | 2 | 4 | 0.125 | 0 | 0 | 0 | 0 | 2 | 0 | 287 | 0 |
15723989 | 1 | 22.25 | 9 | 2 | 6 | 4 | 0.085 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
15704509 | 1 | 24.58 | 13.5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 184 | 1 | 0 |
15730287 | 1 | 36.17 | 5.5 | 2 | 3 | 5 | 5.000 | 0 | 0 | 0 | 0 | 2 | 210 | 688 | 0 |
15720353 | 1 | 19.50 | 0.290 | 2 | 4 | 4 | 0.290 | 0 | 0 | 0 | 0 | 2 | 280 | 365 | 0 |
15728906 | 1 | 29.25 | 13.00 | 2 | 2 | 8 | 0.500 | 0 | 0 | 0 | 0 | 2 | 228 | 1 | 0 |
15736510 | 1 | 41.50 | 1.540 | 2 | 3 | 5 | 3.500 | 0 | 0 | 0 | 0 | 2 | 216 | 1 | 1 |
15708236 | 1 | 18.58 | 5.710 | 2 | 2 | 4 | 0.540 | 0 | 0 | 0 | 0 | 2 | 120 | 1 | 0 |
15736420 | 1 | 22.08 | 2.335 | 2 | 4 | 4 | 0.750 | 0 | 0 | 0 | 0 | 2 | 180 | 1 | 0 |
15765093 | 1 | 31.57 | 0.625 | 2 | 4 | 4 | 0.250 | 0 | 0 | 0 | 0 | 2 | 380 | 2011 | 0 |
15632275 | 1 | 22.67 | 11.50 | 2 | 3 | 4 | 0.415 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
15737542 | 1 | 33.75 | 2.750 | 2 | 3 | 5 | 0.000 | 0 | 0 | 0 | 0 | 2 | 180 | 1 | 0 |
15748691 | 1 | 36.08 | 2.540 | 2 | 1 | 1 | 0.000 | 0 | 0 | 0 | 0 | 2 | 0 | 1001 | 0 |
15748986 | 1 | 48.08 | 3.750 | 2 | 3 | 5 | 1.000 | 0 | 0 | 0 | 0 | 2 | 100 | 3 | 0 |
15727811 | 1 | 18.58 | 10.290 | 2 | 1 | 1 | 0.415 | 0 | 0 | 0 | 0 | 2 | 80 | 1 | 0 |
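The review step described above can be sketched as a small filter: group customers by the neuron they were mapped to, keep only neurons with a high distance value, and flag the customers whose approval status disagrees with the majority in that neuron. The sketch below is an assumption about how such a filter could look, not the code used in this project. The Class values are taken from the tables in this write-up, but the neuron coordinates, the distance values, the 0.9 threshold, and the function name `flag_minority_customers` are hypothetical.

```python
from collections import Counter, defaultdict

def flag_minority_customers(assignments, distances, threshold=0.9):
    """Return customer IDs that disagree with the majority class of a
    high-distance neuron.

    assignments: list of (customer_id, neuron, approved_class) tuples
    distances:   dict mapping neuron -> normalized distance in [0, 1]
    """
    by_neuron = defaultdict(list)
    for customer_id, neuron, label in assignments:
        by_neuron[neuron].append((customer_id, label))

    flagged = []
    for neuron, members in by_neuron.items():
        if distances.get(neuron, 0.0) < threshold:
            continue  # neuron is similar to its neighbors, nothing to review
        majority, _ = Counter(label for _, label in members).most_common(1)[0]
        flagged.extend(cid for cid, label in members if label != majority)
    return sorted(flagged)

# Neuron positions and distance values below are made up for illustration.
assignments = [
    (15726466, (7, 3), 0),
    (15692408, (7, 3), 1),
    (15667451, (7, 3), 0),
    (15763108, (7, 3), 0),
    (15736510, (7, 3), 1),
    (15776156, (2, 5), 0),
    (15687688, (4, 1), 1),
]
distances = {(7, 3): 0.98, (2, 5): 0.35, (4, 1): 0.20}

to_review = flag_minority_customers(assignments, distances)
```

With these illustrative inputs, `to_review` contains the two approved customers in the outlier neuron, 15692408 and 15736510.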
Additional Exploration
So how can we take this experiment further? We can use our map to build a model that predicts the approval of future credit applications and plots them on the map to assess the risk of those applications.
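One common way to turn a trained map into a classifier is to label every neuron with the majority class of the training customers mapped to it, then predict a new application by looking up the label of its closest neuron. The source does not show how its classifier was built, so the following pure-Python sketch is an assumption about the general technique. The weights, the training samples, and the names `nearest_neuron`, `fit_neuron_labels`, and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def nearest_neuron(weights, sample):
    """weights: dict mapping neuron coordinates -> weight vector."""
    return min(weights, key=lambda n: math.dist(weights[n], sample))

def fit_neuron_labels(weights, samples, labels):
    """Label each neuron with the majority class of the samples it wins."""
    votes = defaultdict(Counter)
    for sample, label in zip(samples, labels):
        votes[nearest_neuron(weights, sample)][label] += 1
    return {n: counts.most_common(1)[0][0] for n, counts in votes.items()}

def predict(weights, neuron_labels, sample, default_label):
    """Use the label of the closest neuron; fall back to default_label when
    no training sample was mapped to that neuron."""
    return neuron_labels.get(nearest_neuron(weights, sample), default_label)

# Hypothetical trained weights for three neurons and two scaled features.
weights = {(0, 0): [0.1, 0.1], (0, 1): [0.9, 0.9], (1, 0): [0.5, 0.5]}
train_samples = [[0.05, 0.2], [0.15, 0.1], [0.1, 0.05], [0.8, 0.95], [0.95, 0.85]]
train_labels = [0, 0, 0, 1, 1]

neuron_labels = fit_neuron_labels(weights, train_samples, train_labels)
default_label = Counter(train_labels).most_common(1)[0][0]

approved = predict(weights, neuron_labels, [0.85, 0.9], default_label)
unseen = predict(weights, neuron_labels, [0.5, 0.5], default_label)
```

The fallback matters because some neurons may never win a training sample. Here the neuron at `(1, 0)` has no label, so a new application mapped to it receives the overall majority class.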
Below is the model performance of our classifier that uses the MiniSom library.
SOM Classifier
Map Lattice: 10x10
Sigma: 1.5
Learning Rate: 0.7
Topology: Hexagonal
Iterations: 500
Class | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.82 | 0.82 | 0.82 | 122 |
1 | 0.74 | 0.74 | 0.74 | 85 |
accuracy | | | 0.79 | 207 |
macro avg | 0.78 | 0.78 | 0.78 | 207 |
weighted avg | 0.79 | 0.79 | 0.79 | 207 |
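As a quick consistency check on the report above, the two averages can be recomputed from the per-class rows. The macro average is the plain mean of the class scores, while the weighted average weights each class by its support. The sketch uses only the numbers in the table. The remark about a 30% hold-out is an inference from 207 being 30% of the 690 observations, not something the source states.

```python
# Per-class precision and support, copied from the report above.
precision = {0: 0.82, 1: 0.74}
support = {0: 122, 1: 85}

total = sum(support.values())

# Macro average: every class counts equally.
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

# Share of the 690 observations that ended up in the evaluation set.
test_fraction = total / 690

print(round(macro_avg, 2), round(weighted_avg, 2), round(test_fraction, 2))
```

Because recall and f1-score have the same per-class values as precision in this report, the same arithmetic reproduces their averages as well.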