Exploring the Findings of the Udacity Data Scientist Nanodegree Final Project

Project Introduction

In this Udacity project, demographics data for customers of a mail-order sales company in Germany is analyzed and compared against demographics information for the general population. Unsupervised learning techniques are applied to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, a supervised model is used to predict which individuals are most likely to convert into customers. The data used here has been provided by Udacity's partners at Bertelsmann Arvato Analytics and represents a real-life data science task.

Motivation for this blog

Have fun! This is a real, meaningful task. Take it as a wonderful journey and enjoy exploring the secrets hidden within the data. Such a pleasure, right?

Contents

  1. Business and data understanding
  2. Data Preprocessing
  3. Result evaluation
  4. What is next?


1. Questions for better business and data understanding

At the beginning, a good understanding of the data is always valuable. By asking questions, here is what I found.

For business understanding:

Question 1.1: Before the model is applied, how can an online experiment be designed to show its effectiveness?

Answer: An online experiment is mainly used to verify what the model predicts. Thus, the design of the online experiment relies on the model predictions or prediction requirements. By following those constraints and the usual design procedures, an online experiment can be set up. Then, compare the experiment results with the model predictions to make a final decision on deploying the model.

Question 1.2: If the model is applied and operates normally, how can the net income from the model be quantified and evaluated?

Answer: Directly, the revenue from the increase in purchases in the areas where the model is deployed should be counted as earnings. Meanwhile, deployment costs such as compute power, maintenance, and new employees should be treated as costs. Combining the two, we obtain the net income.

Question 1.3: After the model has operated for a long time, how should the differences between the model predictions and customer feedback be handled?

Answer: After deployment, since no model achieves 100% accuracy, there will always be some cases or reviews that disagree with the predictions. These cases could be added directly to rebalance the original dataset. Meanwhile, if compute power allows, dynamic programming could be applied to correct the model bias automatically, which would save workload.


For data understanding:

Question 2.1: How “big” are the datasets?

Fig1: scale and memory usage of two datasets

Answer: As shown in Fig1, the two datasets contain about one million rows and take up around 600 MB of memory in total, which does not match the common definition of “big data”.
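
For reference, here is a minimal sketch of how I check this kind of scale with pandas; the file names and separator are placeholders, not the exact project files:

```python
import pandas as pd

# Placeholder file names; adjust to the actual project CSVs and separator.
azdias = pd.read_csv("azdias.csv", sep=";")
customers = pd.read_csv("customers.csv", sep=";")

for name, df in [("azdias", azdias), ("customers", customers)]:
    mem_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"{name}: {df.shape[0]:,} rows x {df.shape[1]} columns, ~{mem_mb:.0f} MB")
```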

Question 2.2: How imbalanced are the datasets?

Fig2: a sample feature distribution of two datasets

Answer: As shown in Fig2, the comparison plots of a sample feature show that the ratios between value frequencies of each feature in the azdias and customers data are relatively small (commonly less than 6:1), which means the imbalance problem can be safely ignored. (More details can be found in the first part of Step 0.3.1: Visualize Checkings in this notebook here.)
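
A quick way to check this kind of per-feature imbalance is to compare value frequencies directly. A small sketch, assuming the azdias DataFrame from above:

```python
import pandas as pd

def max_class_ratio(series: pd.Series) -> float:
    """Ratio between the most and least frequent non-missing value of a feature."""
    counts = series.value_counts(dropna=True)
    if len(counts) < 2:
        return float("inf")  # constant column: ratio is not meaningful
    return counts.iloc[0] / counts.iloc[-1]

# Flag features whose value-frequency ratio exceeds the rough 6:1 threshold above;
# `azdias` is assumed to be the general-population DataFrame loaded earlier.
ratios = {col: max_class_ratio(azdias[col]) for col in azdias.columns}
skewed = {col: r for col, r in ratios.items() if r > 6}
print(f"{len(skewed)} of {azdias.shape[1]} features exceed a 6:1 ratio")
```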

Question 2.3: How should small-sample data be handled?

Answer: Generally, several methods such as over-sampling, under-sampling, and synthetic data generation can be chosen. For this project, small-sample data is checked in Step 0.3.2: Assess Missing Data of the same notebook noted above to determine whether to keep it or not.
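
To illustrate the over-sampling option only (it is not the route taken in the notebook, which simply assesses whether to keep the small-sample rows), here is a naive sketch assuming a labelled DataFrame with a hypothetical label column:

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(df: pd.DataFrame, label_col: str, random_state: int = 42) -> pd.DataFrame:
    """Naively over-sample the minority class until it matches the majority class size."""
    counts = df[label_col].value_counts()
    majority_label, minority_label = counts.index[0], counts.index[-1]
    minority_up = resample(
        df[df[label_col] == minority_label],
        replace=True,                      # sample with replacement
        n_samples=int(counts.iloc[0]),     # match the majority class size
        random_state=random_state,
    )
    return pd.concat([df[df[label_col] == majority_label], minority_up], ignore_index=True)
```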


2. Data Preprocessing

Though different algorithms have different data requirements, clean data without missing values is always helpful, not only for the further analysis but also for the convenience of transfer learning. In this project, several techniques are used, and two of them are chosen to be introduced here.

2.1 Evaluation of missing data

The evaluation mainly includes three steps:

  • First, convert unknown values to NaNs. A parsing step uses the attribute-to-value mapping to identify unknown values. In detail, any value that matches an ‘unknown’ code in DIAS Attributes - Values 2017.xlsx is converted into a numpy NaN value.

  • Then, missing data in each column is assessed. For the azdias data, more than 100 of the roughly 360 features have no missing data. The most common missing ratio among azdias columns is near 15%, while the least common is near 95%. The customers data is similar in that more than 100 of the roughly 360 features have no missing data; however, its most common missing ratio is near 25%, while the least common is near 95%. Thus, features whose missing ratio is greater than 0.4 in both datasets are dropped.

  • Finally, missing data in each row is assessed. For the azdias data, rows with many missing values are qualitatively different from rows with few or no missing values. The same finding holds for the customers data. As a result, no rows are dropped, and a special note is kept for those records. (A sketch of the unknown-value conversion and the column-dropping rule follows this list.)
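
Here is a minimal sketch of the first two steps; the structure of the attribute spreadsheet (column names, comma-separated codes) is an assumption and may differ slightly from the real file:

```python
import numpy as np
import pandas as pd

# `attr_values` is assumed to be read from "DIAS Attributes - Values 2017.xlsx"
# with columns "Attribute", "Value" and "Meaning".

def build_unknown_map(attr_values: pd.DataFrame) -> dict:
    """Map each attribute to the list of codes whose meaning marks them as unknown."""
    unknown = attr_values[attr_values["Meaning"].str.contains("unknown", case=False, na=False)]
    return {
        row["Attribute"]: [int(v) for v in str(row["Value"]).split(",")]
        for _, row in unknown.iterrows()
    }

def convert_unknown_to_nan(df: pd.DataFrame, unknown_map: dict) -> pd.DataFrame:
    """Replace every 'unknown' code with a numpy NaN."""
    df = df.copy()
    for col, codes in unknown_map.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
    """Drop columns whose missing ratio exceeds the threshold (0.4, as chosen above)."""
    missing_ratio = df.isna().mean()
    return df.drop(columns=missing_ratio[missing_ratio > threshold].index)
```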

2.2 Feature engineering

Feature engineering in this project mainly includes selecting and re-encoding binary and multi-class features, and engineering mixed-type features.

  • For the first part, binary feature values are mapped into {0, 1}. The multi-class feature “CAMEO_DEUG_2015” is handled individually to extract the century information it stores. Apart from it, the other multi-class features are encoded with the pandas function pd.get_dummies, with the first dummy dropped to avoid redundant information.

  • For the second part, a simple algorithm to detect mixed-type features is applied by searching for keywords like “mix” and “+” in the “Description” column of the DIAS Information Levels - Attributes 2017.xlsx file. It should be mentioned that not all features can be found in this Excel file, so prior knowledge from the unsupervised learning project finished before is used here. The features “PRAEGENDE_JUGENDJAHRE” and “CAMEO_INTL_2015” are regarded as mixed features and processed using the descriptions in the same Excel file. (A rough sketch of this re-encoding is shown right after this list.)
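
A minimal sketch of these re-encoding steps is shown below. The binary mapping of “OST_WEST_KZ” and the first-digit-wealth / second-digit-life-stage split of “CAMEO_INTL_2015” are assumptions based on the attribute descriptions; the decade/movement split of “PRAEGENDE_JUGENDJAHRE” is analogous and omitted for brevity:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Rough sketch of the re-encoding described above (not the full notebook logic)."""
    df = df.copy()

    # Binary feature example: map the two raw codes to {0, 1}.
    if "OST_WEST_KZ" in df.columns:
        df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"O": 0, "W": 1})

    # Mixed-type feature CAMEO_INTL_2015: split into a wealth digit and a
    # life-stage digit, following the description spreadsheet.
    if "CAMEO_INTL_2015" in df.columns:
        codes = pd.to_numeric(df["CAMEO_INTL_2015"], errors="coerce")
        df["CAMEO_INTL_2015_WEALTH"] = codes // 10
        df["CAMEO_INTL_2015_LIFESTAGE"] = codes % 10
        df = df.drop(columns=["CAMEO_INTL_2015"])

    # Remaining multi-class categoricals: one-hot encode, dropping the first level.
    categorical = df.select_dtypes(include="object").columns
    return pd.get_dummies(df, columns=list(categorical), drop_first=True)
```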

Besides the two techniques mentioned above, feature scaling is applied to obtain a complete, clean dataset. More details can be found in “Step 0.3.3: Feature Engineerings” in this notebook here.
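
For the scaling step, here is a short sketch of one possible setup; the imputation strategy and variable names are assumptions, not necessarily what the notebook uses:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute the remaining NaNs, then standardize every feature.
# `azdias_clean` and `customers_clean` stand for the cleaned DataFrames from above.
scaling_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
azdias_scaled = scaling_pipeline.fit_transform(azdias_clean)
customers_scaled = scaling_pipeline.transform(customers_clean)  # reuse the azdias fit
```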


3. Result evaluation

This section mainly includes two parts: discussion of findings by unsupervised learning and supervised learning.

Unsupervised learning results

Fig3: feature weights of the first principal components for the azdias dataset

Fig4: feature weights of the first principal components for the customers dataset

For the azdias data shown in Fig3, all positively contributing features of the first three components come from the feature “CAMEO_DEU_2015”. Furthermore, referring to the description Excel file, the first component corresponds to the Fine Management group. For the customers data shown in Fig4, all positively contributing features of the first three components also come from “CAMEO_DEU_2015”, and referring to the description Excel file, the first component again corresponds to the Fine Management group. Comparing the two datasets, both analyses of the first components include the Fine Management and City Nobility groups, which indicates the considerable importance of the feature “CAMEO_DEU_2015” and its derived features.
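
A sketch of how such component weights can be read off a fitted PCA follows; the number of components and the variable names are assumptions:

```python
import pandas as pd
from sklearn.decomposition import PCA

# `azdias_scaled` is the scaled matrix from above and `feature_names` its column names.
pca = PCA(n_components=30)
pca.fit(azdias_scaled)

def top_weights(pca: PCA, feature_names, component: int, n: int = 10) -> pd.Series:
    """Return the n largest positive feature weights of one principal component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    return weights.sort_values(ascending=False).head(n)

print(top_weights(pca, feature_names, component=0))  # e.g. CAMEO_DEU_2015_* dummies
```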


Fig5: overrepresented and underrepresented cluster distribution differences in two datasets

Fig6: people types of the overrepresented clusters

Fig7: people types of the underrepresented clusters

As shown in Fig5, we identify which clusters are overrepresented in the customer dataset compared to the general population and which are underrepresented. The threshold used here is 4%: if the proportion of a cluster in the azdias data exceeds its proportion in the customers data by more than 4%, the cluster is underrepresented; conversely, if its proportion in the customers data exceeds that in the azdias data by more than 4%, it is overrepresented. Following this rule, the underrepresented clusters are cluster 13 and cluster 20, and the overrepresented clusters are cluster 2 and cluster 5.
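
A small sketch of this 4% rule, assuming the k-means cluster assignments of both populations are available as arrays:

```python
import pandas as pd

def cluster_share_diff(azdias_labels, customer_labels) -> pd.Series:
    """Difference in cluster proportions: customers minus general population."""
    azdias_share = pd.Series(azdias_labels).value_counts(normalize=True)
    customer_share = pd.Series(customer_labels).value_counts(normalize=True)
    all_clusters = sorted(set(azdias_share.index) | set(customer_share.index))
    diff = (customer_share.reindex(all_clusters, fill_value=0)
            - azdias_share.reindex(all_clusters, fill_value=0))
    return diff.sort_values()

# `azdias_labels` and `customer_labels` are assumed to be the predicted cluster labels.
diff = cluster_share_diff(azdias_labels, customer_labels)
overrepresented = diff[diff > 0.04].index.tolist()
underrepresented = diff[diff < -0.04].index.tolist()
print("overrepresented:", overrepresented, "underrepresented:", underrepresented)
```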

To investigate more deeply, the people types of the two kinds of clusters are shown in Fig6 and Fig7 respectively. For the overrepresented clusters shown in Fig6, clusters 2 and 5 show that people in the Active Retirement (“CAMEO_DEU_2015_5F”), Long-established (“CAMEO_DEU_2015_6C”), Frugal Aging (“CAMEO_DEU_2015_6F”), and First Shared Apartment (“CAMEO_DEU_2015_9A”) groups, or with high transactional activity in the product group DIGITAL SERVICES (“D19_DIGIT_SERV_RZ”), BANKS in the last 24 months (“D19_BANKEN_ANZ_24”), and INSURANCE in the last 12 months (“D19_VERSI_ANZ_12”), are more likely to become customers. In contrast, for the underrepresented clusters shown in Fig7, clusters 13 and 20 show that people with high transactional activity on the last transaction for the segment telecommunication ONLINE (“D19_TELKO_ONLINE_DATUM”), BANKS in the last 12 months (“D19_BANKEN_ANZ_12”), and the product group GARDENING PRODUCTS (“D19_GARTEN”) are less likely to become customers.

Supervised learning results

Fig8: feature importance

Taking a closer look at Fig8, “D19_SOZIALES” contributes the most to the model prediction, while “W_KEIT_KIND_HH” contributes the least. Checking the features found in Fig3, 4 and Fig6, 7, these features do not contribute as much as expected, which means there is much room for improvement in the unsupervised learning part.
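
The importances behind Fig8 can be read from the fitted model roughly like this; the variable names are assumptions:

```python
import pandas as pd

# `clf` is assumed to be the fitted gradient-boosting classifier and
# `feature_names` the list of training columns.
importances = pd.Series(clf.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances.head(10))  # most influential features (e.g. D19_SOZIALES at the top)
print(importances.tail(5))   # least influential features
```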

Meanwhile, model selection is improved using grid search and pipeline tricks; for more details, please refer to Step 2.3: improvement - creating a machine learning pipeline in this notebook here.
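
A compact sketch of that pipeline-plus-grid-search idea; the parameter grid, scorer, and data variables are illustrative, not the exact settings of the notebook:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Wrap the classifier in a pipeline so the grid search tunes it end to end.
pipeline = Pipeline([
    ("clf", LGBMClassifier(random_state=42)),
])
param_grid = {
    "clf__n_estimators": [200, 500],
    "clf__learning_rate": [0.01, 0.05],
    "clf__num_leaves": [31, 63],
}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train, y_train)  # `X_train`, `y_train` are the prepared training data
print(search.best_params_, search.best_score_)
```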


4. What’s next?

Fig9: competition rank up to now

Up to now, we have briefly gone through the main parts of the data analysis procedure. Fig9 shows the recent rank I have reached. For me, selecting features by considering better clustering algorithms is more appropriate than adjusting the hyperparameters of the supervised model to obtain a higher score. Also, it is advisable for you to start a Kaggle journey by following existing great kernels. By the way, there are a few questions that I want to share with you, which I think may be worth devoting time to:

  • How many features are suitable? Generally, more features may not guarantee the higher submission score you expect, while they do require more compute power.

  • Missing values do not lead to the failure of LightGBM. If we keep using LightGBM, how should missing values be handled? What about switching to a different supervised model?

Hope everyone has a magical data-mining time! :)

Interested? Just subscribe to my blog by scanning my public WeChat account :)