
Reading Notes -- 《蒋勋说宋词》 (Jiang Xun on Song Ci)

Posted on 2020-01-14 | Updated: 2020-01-14 | Category: dairy
Word count: 506 | Reading time ≈ 2 min

Lecture 1: Li Yu (continued)

Having read about Li Yu's life, I realize my earlier thoughts were only an outsider's sighs. Reading the "on the lyrics" sections again and savoring the dialogue between Li Yu and Jiang Xun, for a moment there really did seem to be a bond, flowing in the bloodstream of our culture, reaching from antiquity to today; I write this down to remember it.

The lyrics in question are, of course, Li Yu's Five Dynasties ci, the ones that felt like pop songs in the earlier chapters. Those slightly playful lyrics were once popular tunes in the mouths of court musicians; today the words survive, but most of the music has been lost. The people who read poetry and the people who listen to pop music are now probably two different crowds. Some have tried to set Yu Guangzhong's poems to guitar in the hope that they would catch on, but they never truly became popular. Back in the age when ci flourished, though, talking about the lyrics over a little music must have been a natural, easy pleasure.

What impressed me most is the exchange around "no place in this world to settle it" (人间没个安排处) and "helpless, the night is long and I cannot sleep" (无奈夜长人不寐), and the reading of 《浪淘沙》 that rises from the loss of a kingdom to a meditation on life itself is even more worth savoring. "No place in this world to settle it" comes from Li Yu's 《蝶恋花》. Unlike the flirtatious, playful tune title, this lyric carries a lament for the passing of spring. A literatus's heart is sensitive and delicate, and once spring has run its course the sorrow deepens another layer: from looking about on an idle stroll by the pavilion to the feelings released in the second stanza, it is natural and finely drawn. The lyric reads: "遥夜亭皋闲信步,乍过清明,早觉伤春暮。数点雨声风约住,朦胧淡月云来去。桃李依依春暗度,谁在秋千,笑里低低语。一片芳心千万绪,人间没个安排处". With thousands of tangled threads of worry that can neither be cut nor combed, how is one to settle oneself in this world? Here Jiang Xun teases Li Yu: daring to bring a colloquial word like 安排 ("arrange", "settle") into a lyric is just like putting everyday speech into songs today, and what was popular became classic.

Reading Notes -- 《量子纠缠》 (Quantum Entanglement)

Posted on 2020-01-13 | Updated: 2020-01-13 | Category: dairy
Word count: 793 | Reading time ≈ 3 min

01 The Beginning of Entanglement (continued)

Once we appreciate the extraordinary significance of "quantum entanglement", a deeper understanding of both "quantum" and "entanglement" requires tracing the development of quantum theory back to its source.

Start with "quantum". The quantum we study is defined as the tiny unit of energy and matter that makes up real things, whether photons, atoms, or electrons. At first, the quantum was merely Max Planck's way out of an absurd prediction about black-body radiation: the prediction implied that a black body should emit ever more high-frequency light and hence radiate infinite energy. After Planck introduced the quantum as the unit in which energy is exchanged, the theoretical calculation of black-body radiation matched experiment beautifully, and the absurdity dissolved. To Planck the quantum did not really exist; it was only a device for finding a workable explanation and did not have to correspond to reality. He could not have imagined what sparks this seemingly fictitious concept would strike between Einstein and Bohr.
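A standard way to see the problem and Planck's fix, summarized in my own words rather than quoted from the book: the classical Rayleigh-Jeans formula grows without bound at high frequency, so integrating it over all frequencies gives infinite energy (the "ultraviolet catastrophe"), while Planck's formula, obtained by allowing energy to be exchanged only in multiples of $h\nu$, is exponentially suppressed at high frequency and agrees with experiment:

$$
B^{\mathrm{RJ}}_{\nu}(T) = \frac{2\nu^{2} k_{B} T}{c^{2}},
\qquad
B^{\mathrm{Planck}}_{\nu}(T) = \frac{2h\nu^{3}}{c^{2}}\,\frac{1}{e^{h\nu/k_{B}T}-1},
\qquad
E = n h \nu,\ n = 0, 1, 2, \dots
$$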

The first breakthrough of quantum theory came with a 1905 paper in which Einstein argued that light itself is made of quanta. This "particle theory" of light not only gave the quantum reality; it went further and shook the dominance of the long-established "wave theory", the view that light is a wave.

At the time, however, the wave theory and Einstein's particle theory each had their staunch backing, namely Thomas Young's double-slit experiment of 1801 and the photoelectric effect. To settle these endless disputes, de Broglie, Heisenberg, Schrödinger and Dirac, the scientists who rebuilt and upgraded quantum theory, set off its second breakthrough. The greatest contribution of this breakthrough was to announce the formal establishment of quantum theory; in addition, the wave equation describing how waves evolve in time and the matrix equations describing the motion of electrons led to the formal establishment of wave mechanics and matrix mechanics.

Profound as these results were, each one was a failed attempt to combine the wave picture with Einstein's particle picture. Even the celebrated Schrödinger wave equation was no exception: once it was used to describe the properties of quantum particles, unavoidable extra degrees of freedom appeared and the equation became so complicated that it could not be solved.

At this point Max Born brought probability into the quantum world and found a way to make the Schrödinger wave equation usable, a glimmer of light before dawn that opened the third breakthrough of quantum theory. At the same time Niels Bohr, quantum theory's most steadfast champion and Einstein's chief sparring partner and friend on quantum entanglement, became active on the quantum stage. Better still, it was in the fiercest of the Einstein-Bohr debates that "entanglement" gradually surfaced, revealing its almost magical power and color.

What's Next -- Can Artificial Intelligence Do It?

Posted on 2020-01-10 | Updated: 2020-01-10 | Category: dairy
Word count: 274 | Reading time ≈ 1 min

Personal level

  • Combine learning-interest curves and memory curves to build a personalized recommendation system for how different groups of people learn?
  • Estimate the "dimension" of the brain from its information storage capacity: does an intelligent brain arise from simple rules (as in the Game of Life)?
  • In always-on note taking, how can NLP techniques learn in real time?

Societal level

  • A solution to the food crisis?
  • By evaluating the likely impact and payoff of policy options, could AI be used to predict and optimize social development?
  • In the buyer-seller relationship, we have gone from cautious buyers to, in the internet era, cautious sellers; the information flows differ between the two cases. Can this shift be captured and simulated?

Human civilization

  • Can entropy theory (from information science) help predict the development of human civilization?
  • From birth through growth, is there "consciousness" encoded in our genes, or only "order"?
  • Remodel the brain's neurons under different types of incentive mechanisms to look for patterns of human evolution, for example the known sequences of experience before models, and tools before the self?

Exploring findings from the Udacity Final Project for the Data Scientist Nanodegree

Posted on 2019-06-17 | Updated: 2019-06-20 | Category: data analysis
Word count: 1,687 | Reading time ≈ 11 min

Project Introduction

In this Udacity project, demographic data for customers of a mail-order sales company in Germany is analyzed and compared against demographic information for the general population. Unsupervised learning techniques are applied to perform customer segmentation, identifying the parts of the population that best describe the company's core customer base. A supervised model is then used to predict which individuals are most likely to convert into customers. The data used here was provided by Udacity's partners at Bertelsmann Arvato Analytics and represents a real-life data science task.

Motivation of this blog

Have fun! This is a real, meaningful task. Take it as a wonderful journey and enjoy exploring the secrets hidden in the data. Such a pleasure, right?

Contents

  1. Business and data understanding
  2. Data Preprocessing
  3. Result evaluation
  4. What is next?


1. Questions for better business understanding

At the beginning, a good understanding of the data is always valuable. By asking questions, here is what I found.

For business understanding:

Question 1.1: Before the model is applied, how do we design an online experiment to show its effectiveness?

Answer: An online experiment is mainly used to test what the model predicts, so its design depends on the model's predictions or on the prediction requirements. Following those constraints and standard design procedures, the experiment can be run; comparing its results with the model's predictions then supports a final decision on whether to deploy the model.

Question 1.2: Once the model is deployed and operating normally, how do we quantify and evaluate its net income?

Answer: Directly, the increase in purchases in the areas where the model is deployed should be counted as earnings, while deployment costs such as compute, maintenance, and new employees should be counted as costs. Combining the two gives the net income.

Question 1.3: After the model has been running for a long time, how do we deal with differences between the model's predictions and customer reviews?

Answer: Since no model is 100% accurate, there will always be cases or reviews that disagree with the predictions. These cases can be added back directly to rebalance the original dataset. If compute power is sufficient, dynamic programming could be applied to correct model bias automatically, which would save workload.


For data understanding:

Question 2.1: How "big" are the datasets?

Fig1: scale and memory usage of two datasets

Answer: As shown in Fig1, the two datasets contain about one million cases and use around 600 MB of memory in total, which does not match the common definition of "big data".
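For reference, this kind of size check can be reproduced with pandas roughly as follows; `azdias` and `customers` are placeholder dataframe names assumed to be already loaded:

```python
import pandas as pd

def describe_size(name: str, df: pd.DataFrame) -> None:
    """Print the row/column count and in-memory size of a dataframe."""
    mem_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"{name}: {df.shape[0]:,} rows x {df.shape[1]} cols, ~{mem_mb:.0f} MB")

# assuming azdias and customers have been loaded, e.g. with pd.read_csv(...)
describe_size("azdias", azdias)
describe_size("customers", customers)
```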

Question 2.2: How imbalanced are the datasets?

Fig2: a sample feature distribution of two datasets

Answer: As shown in Fig2, the comparison plots of a sample feature show that the distribution ratios of each feature in the azdias and customers data are relatively small (commonly below 6:1), which means the imbalance problem can be safely ignored. (More details can be found in the first part of Step 0.3.1: Visualize Checkings in the notebook linked here.)

Question 2.3: How do we deal with small-sample data?

Answer: Generally, several methods can be chosen, such as over-sampling, under-sampling, and synthetic data. For this project, the small-sample data is checked in Step 0.3.2: Assess Missing Data of the same notebook to decide whether to keep it; a hedged over-sampling sketch is shown below.
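As an illustration of the over-sampling option (not the notebook's exact code), `sklearn.utils.resample` can duplicate minority rows until the classes are balanced; `train_df` and the label column name `RESPONSE` below are placeholders:

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(df: pd.DataFrame, label: str, random_state: int = 42) -> pd.DataFrame:
    """Naive over-sampling: duplicate minority-class rows until both classes have equal size."""
    counts = df[label].value_counts()
    majority_class, minority_class = counts.idxmax(), counts.idxmin()
    minority = df[df[label] == minority_class]
    upsampled = resample(minority, replace=True,
                         n_samples=counts.max(), random_state=random_state)
    return pd.concat([df[df[label] == majority_class], upsampled])

# balanced = oversample_minority(train_df, label="RESPONSE")  # placeholder names
```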


2. Data Preprocessing

Though different algorithms have different data requirements, clean data without missing values is always helpful, both for further analysis and for the convenience of transfer learning. Several techniques are used in this project; two of them are introduced here.

2.1 Evaluation of missing data

The evaluation consists of three steps (a short pandas sketch follows the list):

  • First, unknown values are converted to NaNs. A mapping is parsed and applied to identify unknown values: data that matches an "unknown" value in DIAS Attributes - Values 2017.xlsx is converted into a numpy NaN.

  • Then, missing data is assessed per column. For the azdias data, more than 100 of the 360 features have no missing data; the most common missing ratio among azdias columns is around 15%, while the rarest is around 95%. The customers data behaves similarly, with more than 100 of 360 features having no missing data, but its most common missing ratio is around 25%, again with the rarest near 95%. Features whose missing ratio exceeds 0.4 in both datasets are therefore dropped.

  • Finally, missing data is assessed per row. For the azdias data, rows with many missing values are qualitatively different from rows with few or none, and the same holds for the customers data. As a result, these rows are not dropped, but they are flagged for special attention.
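A minimal pandas sketch of the three steps above, assuming `unknown_map` is a dict built from DIAS Attributes - Values 2017.xlsx that maps each column to its "unknown" codes (function and variable names are illustrative, not the notebook's exact ones):

```python
import numpy as np
import pandas as pd

def convert_unknowns(df: pd.DataFrame, unknown_map: dict) -> pd.DataFrame:
    """Step 1: replace the codes documented as 'unknown' with NaN, column by column."""
    df = df.copy()
    for col, codes in unknown_map.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
    """Step 2: drop columns whose missing ratio exceeds the threshold (0.4 in this project)."""
    missing_ratio = df.isna().mean()
    return df.drop(columns=missing_ratio[missing_ratio > threshold].index)

def row_missing_counts(df: pd.DataFrame) -> pd.Series:
    """Step 3: count missing values per row to compare 'many missing' vs 'few missing' rows."""
    return df.isna().sum(axis=1)
```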

2.2 Feature engineering

Feature engineering in this project mainly consists of selecting and re-encoding binary and multi-class features, and engineering mixed-type features (a sketch follows the list below).

  • For the first part, binary feature values are mapped into {0, 1}. The multi-class feature "CAMEO_DEUG_2015" is handled individually to extract the century-scale time information it stores. The other multi-class features are one-hot encoded with the pandas get_dummies function, dropping the first dummy to avoid redundant information.

  • For the second part, a simple algorithm detects mixed-type features by searching for keywords such as "mix" and "+" in the "Description" column of the DIAS Information Levels - Attributes 2017.xlsx file. Note that not every feature appears in this spreadsheet, so prior knowledge from the unsupervised-learning project finished earlier is used as well. The features "PRAEGENDE_JUGENDJAHRE" and "CAMEO_INTL_2015" are treated as mixed features and processed using the descriptions in the same spreadsheet.
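As a rough illustration of the re-encoding step (binary mapping plus one-hot encoding with the first dummy dropped), the sketch below uses placeholder column names rather than the project's full lists:

```python
import pandas as pd

def reencode_features(df: pd.DataFrame, binary_cols, multiclass_cols) -> pd.DataFrame:
    """Map binary features to {0, 1} and one-hot encode multi-class features."""
    df = df.copy()
    for col in binary_cols:
        # map the two observed values to 0/1 (order is arbitrary but consistent)
        values = sorted(df[col].dropna().unique())
        df[col] = df[col].map({values[0]: 0, values[1]: 1})
    # drop the first dummy of each multi-class feature to avoid redundant columns
    return pd.get_dummies(df, columns=list(multiclass_cols), drop_first=True)

# example with placeholder column names:
# clean = reencode_features(azdias, binary_cols=["OST_WEST_KZ"], multiclass_cols=["CAMEO_DEU_2015"])
```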

Besides the two techniques above, feature scaling is applied to obtain a complete, clean dataset. More details can be found in "Step 0.3.3: Feature Engineerings" in the notebook linked here.


3. Result evaluation

This section has two parts: a discussion of the findings from unsupervised learning and from supervised learning.

unsupervised learning result

Fig3: feature weights of the first principal components for the azdias dataset

Fig4: feature weights of the first principal components for the customers dataset

For the azdias data shown in Fig3, all the positively contributing features of the first three components come from the feature "CAMEO_DEU_2015". Referring to the description spreadsheet, the first component corresponds to the Fine Management group. For the customers data shown in Fig4, the positively contributing features of the first three components also all come from "CAMEO_DEU_2015", and the first component again corresponds to the Fine Management group. In both datasets the analysis of the first component involves the Fine Management and City Nobility groups, which points to the considerable importance of "CAMEO_DEU_2015" and its derived features.
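For reference, the per-component feature weights in Fig3 and Fig4 can be read off a fitted scikit-learn PCA roughly as follows (assuming `pca` has been fit on the scaled data and `feature_names` holds the corresponding column names):

```python
import pandas as pd
from sklearn.decomposition import PCA

def component_weights(pca: PCA, feature_names, component: int = 0, top_n: int = 10) -> pd.Series:
    """Return the strongest feature weights of one principal component, sorted by magnitude."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    return weights.reindex(weights.abs().sort_values(ascending=False).index).head(top_n)

# example: component_weights(pca, azdias_clean.columns, component=0)
```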


Fig5: overrepresented and underrepresented cluster distribution differences in two datasets

Fig6: people types in the overrepresented clusters

Fig7: people types in the underrepresented clusters

As shown in Fig5, we can determine which clusters are overrepresented in the customer dataset compared to the general population and which are underrepresented. The threshold used here is 4%: if a cluster's share in the azdias data exceeds its share in the customer data by more than 4%, it is underrepresented; conversely, if its share in the customer data exceeds its share in the azdias data by more than 4%, it is overrepresented. Following this rule, the underrepresented clusters are 13 and 20, and the overrepresented clusters are 2 and 5.
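The 4% rule can be written down directly; the sketch below assumes `azdias_labels` and `customers_labels` are the cluster assignments predicted for the two datasets (placeholder names):

```python
import pandas as pd

def cluster_representation(azdias_labels, customers_labels, threshold: float = 0.04) -> pd.DataFrame:
    """Compare per-cluster shares and flag over-/under-represented clusters."""
    general = pd.Series(azdias_labels).value_counts(normalize=True)
    customer = pd.Series(customers_labels).value_counts(normalize=True)
    diff = (customer - general).fillna(0).sort_index()  # positive: more common among customers
    status = diff.apply(lambda d: "overrepresented" if d > threshold
                        else "underrepresented" if d < -threshold else "neutral")
    return pd.DataFrame({"share_diff": diff, "status": status})
```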

To investigate further, the people types of the two kinds of clusters are shown in Fig6 and Fig7 respectively. For the overrepresented clusters in Fig6 (clusters 2 and 5), people in the Active Retirement ("CAMEO_DEU_2015_5F"), Long-established ("CAMEO_DEU_2015_6C"), Frugal Aging ("CAMEO_DEU_2015_6F"), and First Shared Apartment ("CAMEO_DEU_2015_9A") groups, or with high transactional activity in the product group DIGITAL SERVICES ("D19_DIGIT_SERV_RZ"), BANKS in the last 24 months ("D19_BANKEN_ANZ_24"), and INSURANCE in the last 12 months ("D19_VERSI_ANZ_12"), are more likely to become customers. For the underrepresented clusters in Fig7 (clusters 13 and 20), people with high transactional activity on the last TELECOMMUNICATION ONLINE transaction ("D19_TELKO_ONLINE_DATUM"), BANKS in the last 12 months ("D19_BANKEN_ANZ_12"), and the product group GARDENING PRODUCTS ("D19_GARTEN") are less likely to become customers.

supervised learning result

Fig8: feature importance

Looking further at Fig8, "D19_SOZIALES" contributes most to the model's predictions while "W_KEIT_KIND_HH" contributes least. Checking the features found in Figs 3-4 and 6-7, these features do not contribute as much as expected, which suggests there is plenty of room for improvement in the unsupervised-learning part.

Meanwhile, model selection is improved using grid search and pipeline tricks; for more details, please refer to Step 2.3: improvement - creating a machine learning pipeline in the notebook linked here (a generic sketch is shown below).
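A generic scikit-learn sketch of the pipeline-plus-grid-search idea; the estimator, parameter grid, and scoring choice here are illustrative rather than the exact configuration used in the notebook:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", GradientBoostingClassifier(random_state=42)),
])

param_grid = {
    "clf__n_estimators": [100, 200],
    "clf__learning_rate": [0.05, 0.1],
}

# roc_auc is a common scoring choice for an imbalanced binary target
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
# search.fit(X_train, y_train)  # placeholder training data
```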


4. What’s next?

Fig9: competition rank up to now

So far we have briefly gone through the main parts of the data analysis procedure. Fig9 shows the rank I have reached so far. For me, selecting features via better clustering algorithms is more promising than tuning the hyperparameters of the supervised model to chase a higher score. It is also advisable to start a Kaggle journal by following existing good kernels. Finally, here are a few questions I would like to share that may be worth working on:

  • How many features are suitable? In general, more features do not guarantee a higher submission score, but they do demand more compute power.

  • Missing values do not make LightGBM fail. If we keep using LightGBM, how should missing values be handled? What about switching to a different supervised model?

Hope everyone has a great time mining data! :)

Learning notes sample

Posted on 2019-05-18 | Updated: 2019-05-27 | Category: dairy
Word count: 20 | Reading time ≈ 1 min

Content

  1. Brief introduction to book
    • Author
    • Genres
  2. Note
  3. Memos

```
Date: Year/Month/Day

Recorder: HZQ

Main Body
```

Deal with imbalance in an NLP classification task by extracting numerical features

Posted on 2019-03-14 | Updated: 2019-06-17 | Category: NLP
Word count: 854 | Reading time ≈ 5 min


A way to begin your first data analysis competition

Posted on 2019-03-11 | Updated: 2019-05-14 | Category: data analysis
Word count: 1,058 | Reading time ≈ 7 min

Before starting…

Now it is my turn to write my first blog post and share the thoughts stored in my time capsules. Each period, I think, is destined to be a dramatic and meaningful memory for me, and perhaps for you too.

Today's topic: how does a beginner get ready for a Kaggle competition?

Content

  1. Pre-requirements
    • Windows
    • Linux
  2. Main procedures
    • Question: ask, ask and ask
    • Train-Validate-Test: beyond a cycle
    • Display: what about telling a story?
  3. Find your push!

Pre-requirements

The requirements fall into two parts: platform requirements and terminology or prior-knowledge requirements. For the platform, Linux systems usually make getting started more convenient than Windows, though they also bring more potential conflicts and bugs. The terms you need depend heavily on the field and the problem you dive into. For simplicity, this section focuses on the first part: platform requirements.

Linux

Let's start with Linux. As we know, there are many Linux distributions, such as RedHat, Ubuntu, CentOS, and so on. Ubuntu is chosen here for its modest hardware requirements and its Windows-like desktop.

Since the main purpose of preparation is to cut unnecessary time costs, checking your system version is a good start, because the pre-installed software differs from one Linux system to another. Python, for example, comes pre-installed in different versions on different Ubuntu releases. The key points to check are listed below (a small check script follows the list):

  • Pre-installed hardware governing tools
  • Pre-installed compiler
  • The environment path
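A tiny Python script covering the interpreter and environment-path items above; the compiler and GPU-tool checks simply look up well-known binaries and will print None if they are absent:

```python
import os
import shutil
import sys

print("python:", sys.version.split()[0], "at", sys.executable)
print("gcc:", shutil.which("gcc"))                # pre-installed compiler, if any
print("nvidia-smi:", shutil.which("nvidia-smi"))  # GPU governing tool, if any
print("PATH entries:")
for p in os.environ.get("PATH", "").split(os.pathsep):
    print("  ", p)
```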

Tutorials to guide these checks on Ubuntu can be found in my other posts labeled "computer-toolbox". Here, the focus is a brief, end-to-end procedure.

Windows

Compared with Linux, Windows usually saves the time of installing hardware drivers and their governing tools, and installing a compiler, an IDE, and other developer helpers may be much easier. However, a correct environment path matters more on Windows because of the potentially annoying bugs it can cause. Next, find out the weaknesses of the programming language you choose and fix them; for Python, you can refer to the discussions on Reddit here. Finally, a file manager is a good helper for navigating messy cache files to find what you need; Everything is recommended and can be downloaded freely here.

Main procedures

Question: ask, ask and ask

Believe that your initial impulse and vision can carry you much more comfortably through a sustained effort. The first step in tackling a competition is asking questions. No matter how much time you have or how experienced you are, asking keeps your task and vision clear. Stop hunting for other people's experience and start hunting for questions; it will make getting ready for the competition much easier. The next question is: where should you ask? Platforms like Stack Overflow, Reddit, and Quora are good choices for veterans, while the kernels in each Kaggle competition and the issues in each GitHub repo are friendlier for novices. In short, find a platform that suits you and keep asking throughout the competition; then you are on your way to the championship ~

Train-Validate-Test: why not?

Besides the habit formed above, let's look at the common tricks in more detail. Typically, we split the data into three parts, train, validation, and test, to avoid over-fitting. Remember never to touch the test data during training? That is it.

The point is that different models call for different split schemes. For example, convolutional neural networks are generally more sensitive to the amount of training data than generative adversarial networks, so the former need more data for training. In short, there is no eternal truth; the truth keeps changing, and that is exactly what makes it worth exploring.
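A common way to obtain the three-way split is two calls to scikit-learn's `train_test_split`; `X` and `y` are placeholders and the 60/20/20 ratio is only an example:

```python
from sklearn.model_selection import train_test_split

# first carve out the held-back test set, then split the rest into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split
```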

Display: what about telling a story?

After completing the steps above, the most difficult programming parts should be done. Take a breath and congratulate yourself. Good job! But wait: how do you share your findings with your team members? On this question I have heard many people advocate the importance of colorful graphs and charts, and it genuinely puzzles me: why do we rely on them so much? Is a presentation only a success when it is full of beautiful, telling graphs? To me they can feel dogmatic, so why do so many people like them?

Presentations nowadays put too much attention on tools rather than on the speaker. Presenting your work is a lot like telling a story. Recall how you persuaded your mother or father to buy an attractive toy or piece of clothing when you were a child, then write it down in your own words to create a prototype of the story. Next, it is time to dress up your work. Thinking from the audience's point of view helps: ask yourself simple questions such as, what fields and professions do they come from? How many of them have the prior knowledge and how many do not? Imagine you are one of them; what you would want to hear from this story is exactly what your polishing should focus on. After that, the only thing left is the talk itself. Share your story with your team members and ask for advice; it will bring you fresh ideas and never let you down. Enjoy talking! Enjoy watching new stories gradually grow!

Find your push!

Many competitions provide a timeline and key milestones on their websites. During the long, long process, finding your push, especially your internal push, will give you endless courage. Like an honest, determined, energetic friend, it may whisper warmly: forget the fancy reward, savor the process, and witness one idea after another growing up. How amazing that is! For me, breaking out of my comfort zone is what pushes me away from stopping for a computer game. And you? Have you found your push yet?

Awesome! The main content of this blog ends here. Now that you have finished reading, why not give it a try? Everyone has a first time stepping out of their comfort zone; just don't let the coward in your heart cut short the infinite possibilities of the journey :smile::smile:

An initial exploration of Elo Merchant Category Recommendation, a Kaggle competition

Posted on 2019-02-13 | Updated: 2019-06-17 | Category: data analysis
Word count: 854 | Reading time ≈ 5 min

Competition Introduction

In the Elo Merchant Category Recommendation competition, Kagglers develop algorithms to identify and serve the most relevant opportunities to individuals by uncovering signal in customer loyalty. This will improve customers' lives and help Elo reduce unwanted campaigns while creating the right experience for customers. The competition entrance is here.

Motivation

Using a LightGBM model, this post dives into an initial exploration of the training and testing datasets. New features are built by extracting numerical features from "first_active_month" in order to recover some information from the time series (a sketch follows below).
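As a rough sketch of this idea, the snippet below derives a few numerical features from `first_active_month`; the exact feature names and reference date in my notebook may differ:

```python
import pandas as pd

def add_first_active_features(df: pd.DataFrame, today: str = "2019-02-01") -> pd.DataFrame:
    """Turn the year-month string 'first_active_month' into numeric features."""
    df = df.copy()
    fam = pd.to_datetime(df["first_active_month"])
    df["first_active_year"] = fam.dt.year
    df["first_active_month_num"] = fam.dt.month
    # days elapsed between first activity and an arbitrary reference date
    df["elapsed_time_today"] = (pd.Timestamp(today) - fam).dt.days
    return df
```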

Contents

  1. Questions for better business understanding
  2. Result evaluation: conclusion and discussion
  3. References


1. Questions for better business understanding

At the beginning, a good understanding of the data is always valuable. In this post only two datasets are considered: the training and testing datasets. By asking questions, here is what I found:

Question 1.1: What scale and what kind of data are necessary for a customer loyalty prediction model?

Fig1: scale and memory usage of datasets

A: This question is meant to give a proper estimate of how many resources the company needs to invest to deploy a customer loyalty prediction model. Because a MemoryError is raised when loading the transaction dataset locally, I chose to evaluate the memory usage of these data files on Google Colab.

The scale and memory usage of each dataset are shown in Fig1. As shown there, all the required data falls into three parts: card feature data, transaction data, and merchant data. The card feature and merchant data are both relatively small compared with the transaction data, which has nearly 30 million rows and takes more than 8 GB of memory in total in this competition.
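One way to size up a file that raises MemoryError when loaded in one go is to stream it in chunks with pandas; the file name below assumes the competition's transaction file is `historical_transactions.csv`:

```python
import pandas as pd

rows, mem_bytes = 0, 0
# read the large transaction file one million rows at a time
for chunk in pd.read_csv("historical_transactions.csv", chunksize=1_000_000):
    rows += len(chunk)
    mem_bytes += chunk.memory_usage(deep=True).sum()

print(f"{rows:,} rows, ~{mem_bytes / 1024**3:.1f} GB in memory")
```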

Question 1.2: In which ways would applying the customer loyalty prediction model impact different business units?

A: Digging into this question reveals the changes that would need to be made to help the company work better.

If the customer loyalty prediction model is applied, the finance unit may need to expand or restructure the existing database to record more features as well as missing values; the sales unit needs to use the model's conclusions for decisions and to collect customer reviews; and the technology unit is expected to keep the model stable and improve it using new features from finance and reviews from sales. This process can be visualized as a business cycle; more details on model deployment are provided in the blog post linked here.

Question 1.3: How do we evaluate the influence of deploying the customer loyalty prediction model? Is the modeling target really convincing?

Fig2: distribution of “target”

A: This question aims at making the application of the customer loyalty prediction model more controllable and scalable.

Looking at Fig2, the column "target", the modeling target, is stored as a numerical type, which means loss functions such as MSE can be used for evaluation. Checking its distribution reveals outliers around -30, which means not all of the target data is trustworthy.
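A quick, hedged way to verify the outlier group mentioned above, assuming the training file is `train.csv` with a `target` column as provided by the competition:

```python
import pandas as pd

train = pd.read_csv("train.csv")
print(train["target"].describe())

# the anomalous group sits far below the rest of the distribution
outliers = train[train["target"] < -30]
print(f"{len(outliers):,} rows ({len(outliers) / len(train):.2%}) have target < -30")
```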


2. Result evaluation: conclusion and discussion

As in the first section, some of the questions I raised are reviewed and answered in this step. Generally, the more good questions you review, the better your understanding of the model evaluation.

Fig3: feature importance

Fig4: part of feature importance extracted from “first_active_month”

Looking further at Fig3, "first_active_month_elapsed_time_today" contributes most to the model's predictions while "first_active_month_monthsession_January" contributes least. Among the features extracted from the date data shown in Fig4, "first_active_month_elapsed_time_today" again has the highest importance and "first_active_month_monthsession_January" the lowest, which shows that valuable information is hidden in "first_active_month". Isn't that an interesting finding?

When the distributions of the same feature in the training and testing datasets are compared, new findings appear. "first_active_month_elapsed_time_today" and "first_active_month_elapsed_time_specific" show measurably different distributions between the training and testing data, and the first of these is exactly the feature picked out by the two questions above. The remaining features show no considerable distribution differences between the two datasets, which is broadly consistent with the feature importance figure. More details can be found on my GitHub here.
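One hedged way to quantify such train/test distribution differences is a two-sample Kolmogorov-Smirnov test per feature; this is my illustration rather than the exact check used in the notebook:

```python
from scipy.stats import ks_2samp

def train_test_shift(train_df, test_df, features):
    """Report the KS statistic per feature; larger values mean a bigger distribution shift."""
    for feat in features:
        stat, p_value = ks_2samp(train_df[feat].dropna(), test_df[feat].dropna())
        print(f"{feat}: KS={stat:.3f}, p={p_value:.3g}")

# train_test_shift(train, test, ["first_active_month_elapsed_time_today"])  # placeholder call
```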

What’s next?

So far we have briefly gone through the whole data processing procedure. Next, I would like to add more features by considering the other files provided by the competition. It is also advisable to start a mining journal by following the existing kernels listed in the references below. Finally, here are a few questions I would like to share that may be worth working on:

  • How many features are suitable? In general, more features do not guarantee a higher submission score.

  • For this model, missing values do not make LightGBM fail. If we use LightGBM, how should missing values be handled? Impute them or just keep them?

Hope everyone has a great time mining data! :)


3. References

  1. My first kernel (3.699) by Chau Ngoc Huynh

  2. A Closer Look at Date Variables by Robin Denz

  3. LGB + FE (LB 3.707) by Konrad Banachewicz
