Detecting Fraud

Mini Project: Detecting Fraud

The Goal

In this mini project we explore how credit approval data from a financial institution can give us insight into potential fraud in credit applications. We use a simulated dataset to see how unsupervised neural network algorithms can find similarities among credit applications, so we can focus on groups of customers that might require our attention. Specifically, our goal is to use historical data from previous credit applications to infer which entries might have been fraudulent, and to use this information to detect potential future fraud.

The Data

This is a fairly small dataset with 690 observations of customers from a financial institution and 14 obfuscated features, which may include information on credit score, services used, income, tenancy, and other demographic information. The dataset already has historical information on credit approvals encoded in the Class column [1 for Approved and 0 for Denied]. The dataset was taken from the course materials offered by Super Data Science.

CustomerID	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	A14	Class
15776156	1	22.08	11.46	2	4	4	1.585	0	0	0	1	2	100	1213	0
15739548	0	22.67	7.00	2	8	4	0.165	0	0	0	0	2	160	1	0
15662854	0	29.58	1.75	1	4	4	1.250	0	0	0	1	2	280	1	0
15687688	0	21.67	11.50	1	5	3	0.000	1	1	11	1	2	0	1	1
15715750	1	20.17	8.17	2	6	4	1.960	1	1	14	0	2	60	159	1
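
As a rough illustration of how rows like these could be prepared for a distance-based model, the sketch below parses the five sample rows shown above, separates the CustomerID and Class columns from the 14 features, and rescales each feature to the 0 to 1 range. The scaling step is an assumption (the original write-up does not say how the data was preprocessed), and the helper names `load` and `min_max_scale` are hypothetical.

```python
# The five sample rows from the table above (whitespace-separated here).
SAMPLE = """\
CustomerID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 Class
15776156 1 22.08 11.46 2 4 4 1.585 0 0 0 1 2 100 1213 0
15739548 0 22.67 7.00 2 8 4 0.165 0 0 0 0 2 160 1 0
15662854 0 29.58 1.75 1 4 4 1.250 0 0 0 1 2 280 1 0
15687688 0 21.67 11.50 1 5 3 0.000 1 1 11 1 2 0 1 1
15715750 1 20.17 8.17 2 6 4 1.960 1 1 14 0 2 60 159 1
"""


def load(text):
    """Split the table into customer IDs, feature rows (A1..A14) and Class labels."""
    lines = text.strip().splitlines()
    ids, features, labels = [], [], []
    for line in lines[1:]:  # skip the header row
        values = line.split()
        ids.append(int(values[0]))
        features.append([float(v) for v in values[1:-1]])
        labels.append(int(values[-1]))
    return ids, features, labels


def min_max_scale(rows):
    """Rescale every column to [0, 1]; constant columns are mapped to 0.

    Assumption: min-max scaling is one common choice, not necessarily
    the preprocessing used in the original project.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


ids, X, y = load(SAMPLE)
X_scaled = min_max_scale(X)
print(len(X[0]), "features per customer")  # 14 features per customer
print(y)  # [0, 0, 0, 1, 1]
```

Scaling matters for this kind of model because a column such as A14, with values in the thousands, would otherwise dominate any Euclidean distance between customers.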

Approach

To start exploring our data we will make use of Self-Organizing Maps. This unsupervised neural network technique clusters the data so that we can find similarities in our list of customers. The map essentially encodes the 14 characteristics of each customer and plots them in a 2D representation. The coloring in the map represents the distance between each neuron or node (in our case a hexagon) and its surrounding neurons. The bigger the distance (closer to 1, or red), the more dissimilar that neuron is from its neighbors. Each customer has been mapped to one of these neurons, and we have overlaid the approval status of each customer: a green square for approved credits and a red circle for rejected applications.
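
To make the coloring concrete, here is a minimal sketch of how such a distance map can be computed from a grid of node weight vectors. It is not the code behind the map described above: it assumes a square grid with four neighbors per node (the project uses a hexagonal topology), and the toy weights and the `distance_map` helper are made up for illustration. Libraries such as MiniSom provide this computation out of the box.

```python
import math


def distance_map(weights):
    """Mean Euclidean distance from each node's weight vector to its grid
    neighbors, rescaled so that the largest value on the map is 1."""
    rows, cols = len(weights), len(weights[0])
    raw = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            # Assumption: square lattice, so up/down/left/right neighbors only.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            raw[r][c] = sum(dists) / len(dists) if dists else 0.0
    peak = max(max(row) for row in raw)
    return [[v / peak if peak else 0.0 for v in row] for row in raw]


# Toy 3x3 map with 2-dimensional weights; the center node is the odd one out.
weights = [
    [[0.1, 0.1], [0.1, 0.2], [0.2, 0.1]],
    [[0.1, 0.2], [0.9, 0.9], [0.2, 0.2]],
    [[0.2, 0.1], [0.2, 0.2], [0.1, 0.1]],
]
u_matrix = distance_map(weights)
for row in u_matrix:
    print([round(v, 2) for v in row])
```

In this toy map the center node gets the value 1 because its weights are far from all four neighbors, which is the same situation as a dark red node on the real map.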

Observations

The first thing to notice is that one neuron, or section of the map, has a distance value close to 1 (dark red). This indicates that the customers mapped to this region are quite distinct from their neighbors. There are 18 customers in that section of the map, of which 16 were rejected. This gives us an indication that the 2 remaining customers should be evaluated further to understand why their credits were approved [CustomerIDs 15692408 and 15736510].

CustomerID	A1	A2	A3	A4	A5	A6	A7	A12	A13	A14	Class
15726466	1	17.42	6.5	2	3	4	0.125	2	60	101	0
15692408	1	48.08	6.040	2	4	4	0.040	2	0	2691	1
15667451	1	39.92	5	2	3	5	0.210	2	550	1	0
15763108	1	28.67	14.5	2	2	4	0.125	2	0	287	0
15723989	1	22.25	9	2	6	4	0.085	2	0	1	0
15704509	1	24.58	13.5	1	1	1	0	2	184	1	0
15730287	1	36.17	5.5	2	3	5	5.000	2	210	688	0
15720353	1	19.50	0.290	2	4	4	0.290	2	280	365	0
15728906	1	29.25	13.00	2	2	8	0.500	2	228	1	0
15736510	1	41.50	1.540	2	3	5	3.500	2	216	1	1
15708236	1	18.58	5.710	2	2	4	0.540	2	120	1	0
15736420	1	22.08	2.335	2	4	4	0.750	2	180	1	0
15765093	1	31.57	0.625	2	4	4	0.250	2	380	2011	0
15632275	1	22.67	11.50	2	3	4	0.415	2	0	1	0
15737542	1	33.75	2.750	2	3	5	0.000	2	180	1	0
15748691	1	36.08	2.540	2	1	1	0.000	2	0	1001	0
15748986	1	48.08	3.750	2	3	5	1.000	2	100	3	0
15727811	1	18.58	10.290	2	1	1	0.415	2	80	1	0
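
The review step described above can be sketched in a few lines. The CustomerID and Class values are copied from the table; the rule used here, flagging customers whose outcome differs from the majority outcome in the same node, is an assumed formalization of the reasoning, and `flag_for_review` is a hypothetical helper name.

```python
# (CustomerID, Class) pairs for the 18 customers mapped to the outlier node.
outlier_node_customers = [
    (15726466, 0), (15692408, 1), (15667451, 0), (15763108, 0),
    (15723989, 0), (15704509, 0), (15730287, 0), (15720353, 0),
    (15728906, 0), (15736510, 1), (15708236, 0), (15736420, 0),
    (15765093, 0), (15632275, 0), (15737542, 0), (15748691, 0),
    (15748986, 0), (15727811, 0),
]


def flag_for_review(customers):
    """Return the IDs whose outcome differs from the node's majority outcome."""
    classes = [cls for _, cls in customers]
    majority = max(set(classes), key=classes.count)
    return [cid for cid, cls in customers if cls != majority]


rejected = sum(1 for _, cls in outlier_node_customers if cls == 0)
flagged = flag_for_review(outlier_node_customers)
print(len(outlier_node_customers), rejected)  # 18 16
print(flagged)  # [15692408, 15736510]
```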

Additional Exploration

So how can we take this experiment further? We can use our map to create a model that predicts the approval of future credit applications, and plot those applications on the map to assess their risk.
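
One common way to turn a trained map into a classifier is to label each node with the majority class of the training customers mapped to it, then predict a new application from the label of its closest node. The sketch below shows that idea on a toy 2x2 map. It is an assumption about the general approach, not the project's actual code: the weights, data, and helper names are invented, and the map is treated as already trained.

```python
import math
from collections import Counter, defaultdict


def winner(weights, x):
    """Grid coordinates of the node whose weight vector is closest to x."""
    nodes = ((r, c) for r in range(len(weights)) for c in range(len(weights[0])))
    return min(nodes, key=lambda rc: math.dist(weights[rc[0]][rc[1]], x))


def label_nodes(weights, X, y):
    """Assign each node the majority class of the training samples it wins."""
    votes = defaultdict(Counter)
    for x, label in zip(X, y):
        votes[winner(weights, x)][label] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}


def predict(weights, node_labels, X, default):
    """Predict via the winning node's label; unlabeled nodes fall back to default."""
    return [node_labels.get(winner(weights, x), default) for x in X]


# Toy 2x2 map with 2-dimensional weights, assumed to be already trained.
weights = [
    [[0.1, 0.1], [0.1, 0.9]],
    [[0.9, 0.1], [0.9, 0.9]],
]
X_train = [[0.0, 0.2], [0.2, 0.0], [0.8, 1.0], [1.0, 0.8], [0.15, 0.85]]
y_train = [0, 0, 1, 1, 0]  # 1 = approved, 0 = denied

node_labels = label_nodes(weights, X_train, y_train)
default = Counter(y_train).most_common(1)[0][0]  # overall majority class
new_applications = [[0.05, 0.05], [0.95, 0.95], [0.9, 0.0]]
predictions = predict(weights, node_labels, new_applications, default)
print(predictions)  # [0, 1, 0]
```

The third application lands on a node that won no training samples, so it receives the overall majority class. Other fallbacks are possible, and which one suits a fraud setting is a design choice.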

Below are the configuration and performance of our classifier, which uses the MiniSom library.

SOM Classifier
Map Lattice: 10x10
Sigma: 1.5
Learning Rate: 0.7
Topology: Hexagonal
Iterations: 500

	precision	recall	f1-score	support
0	0.82	0.82	0.82	122
1	0.74	0.74	0.74	85

accuracy			0.79	207
macro avg	0.78	0.78	0.78	207
weighted avg	0.79	0.79	0.79	207
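
For readers less familiar with this report layout, the short sketch below shows how the two average rows follow from the per-class rows. It only re-derives numbers already present in the table; the per-class inputs are the rounded values shown, so the results are approximate to the same two decimals.

```python
# Per-class rows copied from the report above.
report = {
    0: {"precision": 0.82, "recall": 0.82, "f1": 0.82, "support": 122},
    1: {"precision": 0.74, "recall": 0.74, "f1": 0.74, "support": 85},
}
total = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes weighted by support (number of true samples)."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


print(total)  # 207
print(round(macro_avg("f1"), 2))  # 0.78
print(round(weighted_avg("f1"), 2))  # 0.79
print(round(weighted_avg("recall"), 2))  # 0.79
```

The support-weighted recall is the same quantity as overall accuracy, which is why both appear as 0.79. The macro average is slightly lower because it gives the smaller, harder class 1 the same weight as class 0.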