Siwei Jiang: Customer Churn Prediction

I. About the data set

I obtained this data set from Kaggle ¹, and Sakshi Goyal uploaded it in 2020. This data set is about a 🏦bank’s customer churn issue. The manager💼 of the bank is interested in predicting which customer will leave this bank💸. By doing so, the bank can target those customers with special products and services to increase their satisfaction and customer retention🌞.

It contains 10,127 observations and 23 variables. Because I had some issues with Income Category values, so I add another column that used a scale of 1 through 5 to represent each income category in Excel before importing it to the Rstudio. My data analyses are based on 20 variables, which excluded the Client Number and two Naive Bayes Classifiers😶.

The purpose of this project is to predict churned customers, so I use recall as an important model fit measurement📐. I also include accuracy and kappa as measurements of model fitness. I choose kappa because customer churn is only around 16% of total customers. Although some studies state kappa is not a good measure for classification model ², I think it is good enough for this project😎 I include a comparison table📊 in the last section, Result for each models’ performance.

II. Descriptive Stats

This data set has 9,664 missing values, which is 4.77% of the 202,540 total values. Most of those missing values are belong to categorical variables.

After using kNN to replace all the missing data by calculating their nearest 10 neighbors 🏠🏡 values, the new data set contains 0️⃣ missing values. The distribution of the target variable is not balanced⚖️, as attrited customers only represent 16% of the total customers. Because of the potential overfitting issue, I used caret’s built-in function SMOTE oversampling to overcome this issue, and it did improved the model performance. Below are some visualizations that are used for gaining some insights of the underlying data.

Show code

BankChurners1 <- unknownToNA(BankChurners, unknown =  c("", "Unknown"))
# summary(BankChurners1)
NA_cnt <-  table( is.na(BankChurners1))
NA_pct <-  prop.table(NA_cnt)
 cbind(NA_cnt, NA_pct)

      NA_cnt     NA_pct
FALSE 192876 0.95228597
TRUE    9664 0.04771403

Income_Level	Income_Category
1	Less than $40K
2	$40K - $60K
3	$60K - $80K
4	$80K - $120K
5	$120K +

	Accuracy	Kappa	Recall
5:5	0.9599	0.8437	0.8155
6:4	0.9590	0.8406	0.8154
7:3	0.9661	0.8683	0.8381
8:2	0.9580	0.8359	0.8062

	Overall
Total_Trans_Amt	100.00000
Total_Trans_Ct	93.11644
Total_Ct_Chng_Q4_Q1	63.25446
Total_Relationship_Count	41.57094
Total_Amt_Chng_Q4_Q1	36.03208
Total_Revolving_Bal	32.73355

	Overall
Total_Trans_Ct	100.000000
Total_Trans_Amt	30.220501
Total_Ct_Chng_Q4_Q1	22.463371
Total_Relationship_Count	16.367709
Months_Inactive_12_mon	15.729941
Total_Amt_Chng_Q4_Q1	6.665217

	Accuracy	Kappa	Recall
i.RF TOTAL NA	0.9661	0.8683	0.8381
ii.RF TOTAL NA & VAR	0.9585	0.8409	0.8308
iii.RF TRAIN NA	0.9575	0.8341	0.8062
iv.RF TOTAL NA & SMOTE	0.9549	0.8423	0.9344
v.GBT TOTAL NA	0.9681	0.8781	0.8648
vi.GBT TRAIN NA	0.9691	0.8805	0.8545
vii.GBT TOTAL NA & SMOTE	0.9635	0.8668	0.9078
viii.NNET TOTAL NA	0.9351	0.7446	0.7277
ix.NNET TOTAL NA & SMOTE	0.0000	0.0000	0.0000

Customer Churn Prediction

I. About the data set

II. Descriptive Stats

III. The Models

IV. The Process

V. Code

i. 🎄Random Forest Total NA⭐

ii. 🌴Random Forest NA & Var🍃

iii. Random Forest Train NA✨

iv. Random Forest Total NA & SMOTE❄️

v. Gradient Boosting Tree Total NA

vi. Gradient Boosting Tree Train NA✨

vii. Gradient Boosting Tree Total NA & SMOTE

viii. 🌜Neural Network Total NA🕊️

ix. Neural Network Total NA & SMOTE

VI. Result