Customer Segmentation for Arvato Financial Solutions
Arvato Financial Solutions is looking for a data scientist to solve their business problem. In this project, we’ll be looking at what exactly that problem is, and all the steps we could take to get there.
Arvato is a financial services company in Germany. They are hoping to expand their customer base. However, they need insight into which customers they could target. They have provided demographic information about their customers, and would like someone to figure out the characteristics of their customers.
In order to expand their customer base, they would like to know which individuals are most likely to buy their product from the general population. They hope someone is able to build a model to identify these customers. There is a lot of data to deal with, and their hope is that someone can simplify the process and extract as much value from the data as possible.
Arvato has provided 6 data files — 4 csv datasets and 2 Excel files. These files include:
- Customer.csv — Demographic information on the customers of the mail-order products. It contains 191652 rows x 369 columns.
- Azdias.csv — Demographic information on the general population in Germany. It contains 891211 rows x 366 columns.
- Mailout_train.csv — Demographic information on the customers of the mail-order products who made a purchase. It contains 42982 rows x 367 columns. The extra column contained the purchase response of the customers.
- Mailout_test.csv — Demographic information on customers of the mail-order products, with the purchase responses removed. It contains 42833 rows x 366 columns. The predictions on this dataset will be uploaded to Kaggle, and the results will be returned.
- DIAS Attributes — Values 2017.xlsx — Contains the values that represent different attributes for all the demographic information.
- DIAS Information Levels — Attributes 2017.xlsx — Describes what each keyword representation means, and describes the attributes of the demographic information.
Strategy to Solve the Problem
As with any data modelling project, cleaning the data is very important, so this is where we’ll begin. There are over 360 columns of data in both the Customer and General Population datasets. For these datasets, we will be using unsupervised learning techniques to make sense of the data.
Due to the size of the dataset, we will have to be clever about managing the data. Depending on the available hardware, the Random Access Memory (RAM) might be overloaded. This resulted in kernel failure on multiple occasions on my computer. To make the large datasets more manageable, we can perform a Principal Component Analysis (PCA).
The PCA will allow us to reduce the dimensionality of the dataset. A PCA can preserve most of the variance in the data without too much information loss, although the resulting components are harder to interpret than the original features. This will allow us to parse the data without overloading computational resources.
Once the PCA is complete, we can perform clustering on the respective datasets. Here the K-means clustering algorithm can be implemented to segment customers into various groups. Each distinct cluster will consist of individuals with similar characteristics.
The next step is to model the data to identify the customers that are likely to purchase the financial service product. This data is unseen, and can be found on Kaggle. Here, supervised learning models will try to predict the purchase habits of individuals in the Arvato test dataset.
There are various classification algorithms such as Random Forest Classifiers and XGBoost that we could use. For the sake of brevity, we will focus on the algorithms that will yield the best results on the Kaggle dataset.
Metrics to be Used
The Kaggle competition scores submissions using the area under the Receiver Operating Characteristic (ROC) curve, known as the AUC, to determine the best performing models. The reason for this is that accuracy alone cannot assess how effective a model is on an imbalanced binary classification problem.
Here’s why — say you’ve got a hundred apples. You need to build a model that would classify the apples into green and red apples. After training the model, you get an accuracy of 99%! The problem is that there are 99 green apples and 1 red apple. The model has only learnt to classify green apples!
This is caused by an imbalanced dataset, and the same is true of the customer training data. Only 1.25% of the customers actually bought the financial product.
This makes solving the problem much harder. The AUC score will tell us how well our model is doing on the imbalanced binary classification data. When we build our model, we will use this metric for scoring the model performance.
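To make the apple example concrete, here is a minimal sketch using scikit-learn’s metric functions. The "model" is a made-up stand-in that always predicts green and gives every apple the same score; it is not one of the project’s models.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 99 green apples (class 0) and 1 red apple (class 1)
y_true = np.array([0] * 99 + [1])

# Hypothetical lazy model: always predicts green, same score for every apple
y_pred = np.zeros(100, dtype=int)
y_score = np.zeros(100)

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print(f"accuracy: {accuracy:.2f}")  # looks impressive
print(f"ROC AUC:  {auc:.2f}")       # no better than random guessing
```

The accuracy comes out at 0.99 while the AUC sits at 0.5, which is exactly the score a random guess would earn.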
To accompany this scoring method, the classification report will also give us the recall, precision and F1 score for each model. However, the AUC performance on the Kaggle dataset will determine which models will be final, and will ultimately take precedence.
A Brief Project Outline
- Data Processing — Data will be cleaned of duplicates and missing values, and unnecessary columns will be dropped.
- Unsupervised Machine Learning — Principal Component Analysis and K-means clustering will be used to allocate different customers to various groups.
- Supervised Machine Learning — Prediction of customer purchase patterns will be made using various classification algorithms such as Random Forest Classifier and XGBoost.
- Project Evaluation — An overview of the results, challenges and suggested improvements of the project.
1. Data Processing
1.1. An observation of the Customer dataset
From the first observation, we see that there are 369 feature columns. There are also quite a few NaN values — or missing values. To the left of the dataframe, we can see an unnecessary column called ‘LNR.’ This is the identification number of each customer.
It can be seen that imputation will be needed, and that it will be best to convert the LNR column into the index column for the dataframe. It would be a good idea to assess the number of missing values per column in the dataset.
In the small snapshot above, we can see that some columns, such as ALTER_KIND1 and ALTER_KIND2, contain far too many missing values. This implies that it would be best to drop these columns. They provide very little information, and imputed values would dominate them.
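A rough sketch of how the share of missing values per column could be checked with pandas. The toy dataframe, the SOME_FEATURE column and the 50% cut-off are assumptions for illustration; the post does not state the threshold that was actually used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Customer dataframe (SOME_FEATURE is a made-up column)
customers = pd.DataFrame({
    "LNR": [1, 2, 3, 4],
    "ALTER_KIND1": [np.nan, np.nan, np.nan, 12.0],
    "SOME_FEATURE": [2.0, 1.0, np.nan, 3.0],
})

# Fraction of missing values in each column, worst offenders first
missing_share = customers.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: drop columns that are more than half empty
MAX_MISSING = 0.5
cols_to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()
customers_reduced = customers.drop(columns=cols_to_drop)
```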
Next, we can observe the different data types in the dataframe.
Ideally, the data types should all be numerical. However, some columns have the data type ‘O’, which stands for ‘object.’ This may indicate that a column holds a mix of data types. We can pinpoint which columns these are, and observe the unique values that are present in them.
It’s clear there are 6 columns of data that require special attention. As we parse through the columns, it can clearly be seen that ‘CAMEO_DEUG_2015’ contains float values, string values, unknown values and missing values. These columns will require special processing functions.
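One possible way to find the mixed-type columns and tidy one of them up is sketched below. The toy values, including the "X" marker for unknown entries and the SOME_NUMERIC column, are assumptions for illustration and not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy column mixing floats, strings, an assumed unknown marker and a missing value
customers = pd.DataFrame({
    "CAMEO_DEUG_2015": [1.0, "4", "X", np.nan],
    "SOME_NUMERIC": [1, 2, 3, 4],
})

# Columns stored with the generic 'object' dtype need a closer look
object_cols = customers.select_dtypes(include="object").columns.tolist()
for col in object_cols:
    print(col, customers[col].unique())

# Coerce to numbers; anything that cannot be parsed becomes NaN
cleaned = pd.to_numeric(customers["CAMEO_DEUG_2015"], errors="coerce")
```

Coercing unparseable entries to NaN means they can later be handled by the same imputation step as ordinary missing values.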
1.2. Processing the datasets
Now we will drop any duplicates within the dataset. This will help remove data that could overrepresent certain data points.
For the unsupervised learning portion, we will convert the datasets into a cleaned version using a common processing function. We would also have to consider how we will impute the missing values.
Due to the nature of the data, the best method was imputing missing values by using the mode. The mode is used since most of our data is categorical. The mean and median values of the dataframe columns will not be appropriate in this case.
Here, the LNR column will be turned into the index of the dataframe. We will also get rid of missing values, as well as unknown values, which are represented as -1’s and 0’s. All such values will be replaced with the mode of the column.
Once all the columns have been cleaned, we can now convert the remaining non-numerical categorical data into ordinal data. Finally, all columns which had an excess number of missing values will be dropped.
Our final dataframe will look something like this.
Once the Customer dataset is processed, the Azdias dataset will go through the same preprocessing steps.
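The processing steps described above could be bundled into one function along these lines. This is only a sketch: the function name, the 50% threshold and the toy columns are hypothetical, and sparse columns are dropped before imputation here (rather than at the very end) so that the missing share is measured on the raw values.

```python
import numpy as np
import pandas as pd

def clean_demographics(df, max_missing=0.5, unknown_codes=(-1, 0)):
    """Hypothetical cleaning helper following the steps in this section."""
    # Remove exact duplicate rows and use the customer ID as the index
    df = df.drop_duplicates().set_index("LNR")
    # Treat the codes for unknown values as missing (assumes they apply to every column)
    df = df.replace(list(unknown_codes), np.nan)
    # Drop columns where too large a share of values is missing
    df = df.loc[:, df.isna().mean() <= max_missing]
    # Mode imputation: fill each column with its most frequent value
    df = df.fillna(df.mode().iloc[0])
    # Turn remaining non-numeric columns into ordinal integer codes
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = df[col].astype("category").cat.codes
    return df

# Toy data: FEATURE_A and FEATURE_B are made-up columns
raw = pd.DataFrame({
    "LNR": [10, 11, 12, 12, 13],
    "ALTER_KIND1": [np.nan, np.nan, np.nan, np.nan, 9.0],
    "FEATURE_A": [1, -1, 2, 2, 2],
    "FEATURE_B": ["W", "O", "W", "W", np.nan],
})
clean = clean_demographics(raw)
print(clean)
```

Because the same function can be called on both the Customer and Azdias dataframes, the two datasets end up with identical treatment.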
2. Unsupervised Machine Learning
2.1. The Exploration of Data
The Arvato dataset is relatively large. This imposes some restrictions, as exploring the data is computationally intensive. During the data cleaning stage, this made it difficult to use the best method for imputation, K-Nearest Neighbors (KNN) imputation.
After processing the data, we’ll have to use a Principal Component Analysis. The Principal Component Analysis (PCA) is a useful technique to reduce the dimensionality of the data. In other words, we can take the data containing 357 columns and reduce it to just a few components.
Firstly, we have to find the number of principal components that can explain most of the variance in the data. We’ll fit the PCA with an arbitrary number of components, in this case 20, and see the results.
What we can see is that the explained variance drops off after two components. These two components can explain 99.6% of the variance in the data!
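Here is a small sketch of that check with scikit-learn’s PCA. The data is synthetic (driven by two hidden factors plus a little noise) and the 99% target is an arbitrary choice, so the numbers will not match the project’s.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 rows, 30 features driven mostly by two hidden factors
latent = rng.normal(size=(500, 2)) * [10.0, 5.0]
X = latent @ rng.normal(size=(2, 30)) + rng.normal(scale=0.1, size=(500, 30))

# Fit with an arbitrary 20 components and inspect the explained variance
pca = PCA(n_components=20).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative[:5], 4))

# Smallest number of components reaching the (assumed) 99% target
n_keep = int(np.argmax(cumulative >= 0.99) + 1)
X_reduced = PCA(n_components=n_keep).fit_transform(X)
```

Plotting `pca.explained_variance_ratio_` against the component number gives the drop-off curve described above.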
Now that the customer data has been reduced, we can find the ideal number of clusters in the data. The goal here is to find the number of different customer groups we can allocate people to.
Using the Elbow method on the PCA-reduced data, we find that we can get 3 different clusters and therefore, 3 customer groups. Now all that’s left to do is take the reduced dataset, and find the 3 clusters. Luckily, we were able to verify this result by repeating the Elbow method on the cleaned customer dataset without PCA, and we also got 3 clusters.
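The Elbow method boils down to fitting K-means for a range of cluster counts and watching where the inertia stops falling quickly. A minimal sketch on synthetic blobs (not the Arvato data) follows; the range of 1 to 8 clusters is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well separated groups
X_reduced, _ = make_blobs(n_samples=600, centers=3, n_features=2,
                          cluster_std=0.8, random_state=42)

# Inertia = sum of squared distances from each point to its cluster centre
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_reduced)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The "elbow" is the value of k after which adding another cluster barely reduces the inertia.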
After running the algorithm multiple times, it seems the results are quite consistent. The next goal is to add individual customers and allocate them to a group. It’ll be a good idea to see how many customers belong to the various groups as a percentage of the total.
It’s also a good idea to see how the K-means algorithm allocates individuals on the general population as well. Here is what that looks like. This gives us a rough sense as to how the two groups of datasets compare.
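A sketch of how the cluster shares of the two datasets could be put side by side. It assumes K-means is fitted on the reduced customer data and then applied to the general population, which would need to go through the same cleaning and PCA first. The data below is random noise, purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Random stand-ins for the PCA-reduced customer and population data
customers_pca = rng.normal(size=(300, 2))
population_pca = rng.normal(size=(1000, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers_pca)

def cluster_share(labels):
    """Percentage of rows that fall in each cluster."""
    return pd.Series(labels).value_counts(normalize=True).sort_index() * 100

comparison = pd.DataFrame({
    "customers_pct": cluster_share(kmeans.labels_),
    "population_pct": cluster_share(kmeans.predict(population_pca)),
}).fillna(0.0)
print(comparison.round(1))
```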
The next question — what are the most important features of the customers?
2.2. Understanding Feature Importance
One of the best ways to get the importance of different customer characteristics is by using a Random Forest Classifier. We take the feature columns and the target feature which was created by the K-means clustering algorithm, and use them as input into the Random Forest Classifier.
Now that we have allocated each customer to their respective group, we can use the algorithm to try and understand which features were the most influential in allocating customers. And so we get the following graph.
The algorithm seemed to place a large emphasis on the number of cars per postal code area — this is evident in the top two features. This seems a bit strange but let’s inspect further.
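The idea of using cluster labels as the target for a Random Forest can be sketched as follows. The three features are synthetic, and only FEATURE_A actually separates the groups, so it should come out on top.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up features: only FEATURE_A differs between the two hidden groups
features = pd.DataFrame({
    "FEATURE_A": np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 200)]),
    "FEATURE_B": rng.normal(0, 1, 400),
    "FEATURE_C": rng.normal(0, 1, 400),
})

# Cluster labels from K-means become the target for the classifier
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, clusters)

importances = pd.Series(forest.feature_importances_,
                        index=features.columns).sort_values(ascending=False)
print(importances)
```

Keep in mind that these importances describe what drove the cluster assignment, not what drives purchases.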
It’s important to note that it would have been best to use KNN imputation for the dataset, since it gives us a much better representation of individuals in the dataset. However, it was too computationally expensive for data of this size, so the choice was made to use mode imputation. It’s a lot lighter on computation! The problem is that we get some data points that are overrepresented, but at least it gives us a general sense of the data. Take it with a grain of salt.
2.3. Observing Cluster Characteristics
There are many graphs that we could display but we’ll only show the number of cars, age classes as well as the building density of each cluster. Note that building density is arranged into classes where the higher the class number, the higher the building density.
Overall, the data provided isn’t very clear on what exactly all the features mean. There are also some descriptions that are not available for the data. However, we can do our best to interpret them with the associated information we were given. Welcome to real world data!
3. Supervised Machine Learning
3.1. Preparing data for model input
The Mailout_train.csv file was used as training data for the supervised model where customer purchases were recorded. The dataset went through the same processing steps except for a slight alteration to improve model performance.
For the data preparation stage, performance on the Kaggle dataset improved when dummy columns were used instead of ordinal data. For this reason, the original preprocessing function was slightly altered. All other preprocessing steps were kept constant.
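As a rough illustration of the switch from ordinal codes to dummy columns, here is a sketch with pandas. The toy values and FEATURE_A are made up, and the `reindex` step is my own suggestion for keeping the test columns lined up with the training columns.

```python
import pandas as pd

# Toy training data: one categorical code column and one numeric column
train = pd.DataFrame({
    "CAMEO_DEUG_2015": [1, 4, 4, 9],
    "FEATURE_A": [0.5, 1.5, 2.5, 3.5],
})

# One 0/1 column per category instead of a single ordinal column
dummies = pd.get_dummies(train, columns=["CAMEO_DEUG_2015"], dtype=int)
print(dummies)

# The test set may lack some categories, so align it to the training columns
test = pd.DataFrame({"CAMEO_DEUG_2015": [1, 9], "FEATURE_A": [0.1, 0.2]})
test_dummies = pd.get_dummies(test, columns=["CAMEO_DEUG_2015"], dtype=int)
test_dummies = test_dummies.reindex(columns=dummies.columns, fill_value=0)
```

Dummy columns avoid implying an order between category codes, at the cost of a much wider dataframe.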
The next stage of data preprocessing involved scaling the data. For the supervised machine learning models, the data is to be normalized so that certain features are not overrepresented in the model.
It would’ve been better to do a PCA on the dataset before passing the information into the machine learning models to reduce the dimensionality of the data. However, there seemed to be significant information loss and the models performed worse when a PCA was used. The step was skipped, and the models were trained on the scaled training data.
The data was then split into training and testing data. The validation data would be the Kaggle dataset. However, it’s standard practice to hold out part of the training data for testing, as seen below.
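A minimal sketch of the scaling and splitting steps on synthetic data. StandardScaler, the 80/20 split and the stratification are assumptions, since the post does not name the scaler or split settings. The scaler is also fitted on the training portion only here, which is a common way to keep information from the test portion out of training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 1000 rows, 20 of them positive
X = rng.normal(loc=50, scale=10, size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# Stratify so both parts keep the same share of rare positives
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training part, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```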
3.2. Training the Model and Hyperparameter Tuning
So, this is the hard part. How can we build a model that would try to predict which customers are more likely to buy the product without ever meeting them?
Given the information that Arvato has provided us, we can know something about our customers but not know exactly what that is. That’s where machine learning comes in.
After cleaning and normalizing the data, we can focus on building the model. We are going with two options — XGBoost and Random Forest Classifiers. These models work well with classification problems such as this.
Here we will use a grid search for each model. This allows us to try multiple parameters for each model and see which version of a model will be best.
Since the dataset is imbalanced, I tried to use SMOTE oversampling to improve the results of the model. This is available in the Imbalanced Learn library. The problem is that although the performance of the model on my test set was great, with a 0.95 AUC score, it yielded an AUC score of only 0.5 on the Kaggle dataset! The model may have been overfitting on the available training data.
The decision was made to drop the oversampling technique, and the performance of the models rose drastically. The two candidate models were still the Random Forest and XGBoost classifiers. Without oversampling, their AUC scores rose from 0.5 to 0.6.
Once the models reported an increase in performance, the decision was made to tune the hyperparameters. I then applied a grid search to the models to train and test multiple hyperparameter combinations. This once again increased model performance, to above 0.7. The process of optimizing the hyperparameters required 300 and 3000 fits on the two classifiers respectively.
For XGBoost, the hyperparameters can be seen below as well as the model training.
The next model is the Random Forest Classifier. The parameters and model training can be seen below.
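For readers who want to try this themselves, here is a small grid search sketch with AUC as the scoring metric. The data is synthetic and the parameter grid is made up and far smaller than the ones used in the project. XGBoost’s scikit-learn style `XGBClassifier` could be swapped in as the estimator in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced classification data (about 10% positives)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate models
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 6]}
n_splits = 3

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # same metric as the competition
    cv=StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0),
)
search.fit(X, y)

# Number of fits = candidate models x cross-validation folds
n_fits = len(search.cv_results_["params"]) * n_splits
print(search.best_params_, round(search.best_score_, 3), n_fits)
```

The number of fits grows quickly with every extra parameter value, which is how a search ends up needing hundreds or thousands of fits.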
4. Project Evaluation
Next, predictions were made on the test dataset with the best models and uploaded to Kaggle to obtain the results. I considered the Kaggle dataset to be the validation dataset. In other words, as per the project requirements, the models are judged by how well they perform on Kaggle. On that basis, the Random Forest Classifier returned an overall score of 73%, and XGBoost returned 74%.
To improve the results even further, an ensemble model was incorporated. We take the outputs of the two different models and combine the results. Oftentimes the combined results are more accurate, as was the case here.
To top it all off, the results on the Kaggle dataset can be seen below. The model improved to roughly 77%!
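One simple way to combine two models is to average their predicted probabilities, sketched below. This is an assumption about how the outputs might be combined, not necessarily what was done in the project. The data is synthetic, and scikit-learn’s GradientBoostingClassifier stands in for XGBoost so the example has no extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
# Stand-in for XGBoost in this sketch
boosted = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Probability of the positive class from each model
p_forest = forest.predict_proba(X_test)[:, 1]
p_boosted = boosted.predict_proba(X_test)[:, 1]

# Simple ensemble: unweighted average of the two probabilities
p_ensemble = (p_forest + p_boosted) / 2

for name, p in [("forest", p_forest), ("boosted", p_boosted), ("ensemble", p_ensemble)]:
    print(name, round(roc_auc_score(y_test, p), 3))
```

Averaging is not guaranteed to beat both individual models, so it is worth comparing all three scores before settling on the ensemble.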
4.2. Conclusion
From the data we can conclude that there are actually three main groups of people in the customer population. The first and third clusters are relatively similar. The main difference between them is that the third cluster has areas with higher building density and lower unemployment.
The second cluster, however, is made up of a generally younger population with a mostly upper middle class background. They tend to originate from area codes where there are more cars owned as compared to the other clusters. They also tend to have a higher rate of unemployment.
The other two clusters tend to be older, and have a more even distribution of income classes. In the general population, the majority of individuals tend to be older and have greater rates of employment.
In the second part of the project, the supervised learning model scored roughly 77% (an AUC of about 0.77) on the Kaggle dataset. This is a fairly average performance, partly because the data contains a lot of noise. Noise is unnecessary information that makes it difficult for the model to make accurate predictions. The highest score on the leaderboard was 88%, so the results are not too far off. Even so, the model would likely struggle to accurately identify which individuals are most likely to purchase products, and even the highest performing model will likely underperform.
Now that the project is completed, it’s great to think of ways we can improve the results. For one, the data is quite noisy. There were over 360 features that were used to predict responses of customers. It’s possible to clean the data up in a more sophisticated way. Since PCA cannot handle missing values, the choice of imputation method matters, and KNN imputation would likely give better results than mode imputation. Stronger compute power would also be great to train and test models more rapidly. Some model searches took hours to complete!
Given what we know about the competition leaderboard on Kaggle, the highest score achieved was 88%. It just goes to show how hard it is to build a great model for the data. Our model wasn’t too far off with 77%! If we stay positive, we will always find a better way to do things.
In the end, we got to understand who Arvato’s customers are and which individuals are most likely to purchase their products. It was a great project and a great experience to undertake. I hope you enjoyed going through it as much as I enjoyed doing it!
If you liked this post, make sure to check the full project breakdown on my GitHub page at https://github.com/Danieldacruz7/Customer-Segmentation-Modelling