
EDA Example I (Springleaf competition)

This is the notebook used in the screencast video. Note that the data files are not present here on the Jupyter hub, so you will not be able to run it. But you can always download the notebook as well as the competition data to your local machine and make it interactive.

import os
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import seaborn
def autolabel(arrayA):
    ''' Label each colored square with the corresponding data value (in white text). '''
    arrayA = np.array(arrayA)
    for i in range(arrayA.shape[0]):
        for j in range(arrayA.shape[1]):
            plt.text(j, i, "%.2f" % arrayA[i, j], ha='center', va='bottom', color='w')

def hist_it(feat):
    # Class-conditional histograms of a feature, one integer-wide bins.
    # density=True replaces the `normed=True` of older matplotlib versions.
    plt.figure(figsize=(16, 4))
    feat[Y == 0].hist(bins=range(int(feat.min()), int(feat.max()) + 2), density=True, alpha=0.8)
    feat[Y == 1].hist(bins=range(int(feat.min()), int(feat.max()) + 2), density=True, alpha=0.5)
    plt.ylim((0, 1))

def gt_matrix(feats, sz=16):
    # For each pair of features, compute mean(feat1 > feat2) over rows where both are non-null
    a = []
    for i, c1 in enumerate(feats):
        b = []
        for j, c2 in enumerate(feats):
            mask = (~train[c1].isnull()) & (~train[c2].isnull())
            if i >= j:
                b.append((train.loc[mask, c1].values >= train.loc[mask, c2].values).mean())
            else:
                b.append((train.loc[mask, c1].values > train.loc[mask, c2].values).mean())
        a.append(b)

    plt.figure(figsize=(sz, sz))
    plt.imshow(a, interpolation='none')
    _ = plt.xticks(range(len(feats)), feats, rotation=90)
    _ = plt.yticks(range(len(feats)), feats, rotation=0)
    autolabel(a)
def hist_it1(feat):
    # Same as hist_it, but with 100 equal-width bins over the feature's full range
    plt.figure(figsize=(16, 4))
    feat[Y == 0].hist(bins=100, range=(feat.min(), feat.max()), density=True, alpha=0.5)
    feat[Y == 1].hist(bins=100, range=(feat.min(), feat.max()), density=True, alpha=0.5)
    plt.ylim((0, 1))

Read the data

train = pd.read_csv('train.csv.zip')
Y = train.target
test = pd.read_csv('test.csv.zip')
test_ID = test.ID

Data overview

Probably the first thing to check is the shape of the train and test matrices, and then to look inside them.

print('Train shape', train.shape)
print('Test shape', test.shape)
Train shape (145231, 1934)
Test shape (145232, 1933)
train.head()

[train.head() output truncated: 5 rows × 1934 columns — ID, VAR_0001 … VAR_1934, target]


test.head()

[test.head() output truncated: 5 rows × 1933 columns — ID, VAR_0001 … VAR_1934; no target column]


There are almost 2000 anonymized variables! Clearly, some of them are categorical and some look numeric. Some numeric features are integer-typed, so they are probably event counters or dates. Others are float-typed, but from the first few rows they look integer-typed too, since the fractional part is zero; pandas treats them as float because those features contain NaN values.
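
One way to verify this (a minimal sketch, not from the original notebook) is to check whether a float column is integer-valued wherever it is not NaN:

# If a float column has zero fractional part everywhere it is not NaN,
# it was probably integer data upcast to float by the missing values.
col = train['VAR_0006']
print((col.dropna() % 1 == 0).all())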

At first glance we see that train has one extra column, target, which we should not forget to drop before fitting a classifier. We also see that the ID column is shared between train and test, which can sometimes be successfully exploited to improve the score.

It is also useful to know whether there are any NaNs in the data. You should pay attention to columns with NaNs, and the number of NaNs in each row can serve as a nice feature later.

# Number of NaNs for each object
train.isnull().sum(axis=1).head(15)
0     25
1     19
2     24
3     24
4     24
5     24
6     24
7     24
8     16
9     24
10    22
11    24
12    17
13    24
14    24
dtype: int64
# Number of NaNs for each column
train.isnull().sum(axis=0).head(15)
ID           0
VAR_0001     0
VAR_0002     0
VAR_0003     0
VAR_0004     0
VAR_0005     0
VAR_0006    56
VAR_0007    56
VAR_0008    56
VAR_0009    56
VAR_0010    56
VAR_0011    56
VAR_0012    56
VAR_0013    56
VAR_0014    56
dtype: int64

Just by reviewing the heads of these lists we immediately see patterns: exactly 56 NaNs for one set of variables, and 24 NaNs for most objects.
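
Both observations translate directly into a few lines of bookkeeping; a minimal sketch (the names nan_count and cols_56_nans are just illustrative):

# Per-row NaN count as a candidate feature
train['nan_count'] = train.isnull().sum(axis=1)
test['nan_count'] = test.isnull().sum(axis=1)

# Columns sharing the 'exactly 56 NaNs' pattern -- likely generated together
nan_per_col = train.isnull().sum(axis=0)
cols_56_nans = nan_per_col[nan_per_col == 56].index.tolist()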

Dataset cleaning

Remove constant features

All 1932 columns are anonymized, which forces us to deduce the meaning of the features ourselves. We will now try to clean the dataset.

It is usually convenient to concatenate train and test into one dataframe and do all feature engineering using it.

traintest = pd.concat([train, test], axis = 0)
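
When the feature engineering is done, the combined frame can be split back by position; a minimal sketch (new_train and new_test are illustrative names):

# Split traintest back into train and test parts using the original lengths
new_train = traintest.iloc[:len(train)]
new_test = traintest.iloc[len(train):]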

First we should look for constant features; such features do not provide any information and only make our dataset larger.

# `dropna = False` makes nunique treat NaNs as a distinct value
feats_counts = train.nunique(dropna = False)
feats_counts.sort_values()[:10]
VAR_0213    1
VAR_0207    1
VAR_0840    1
VAR_0847    1
VAR_1428    1
VAR_1165    2
VAR_0438    2
VAR_1164    2
VAR_1163    2
VAR_1162    2
dtype: int64

We found 5 constant features. Let’s remove them.

constant_features = feats_counts.loc[feats_counts == 1].index.tolist()
print(constant_features)
traintest.drop(constant_features, axis=1, inplace=True)
['VAR_0207', 'VAR_0213', 'VAR_0840', 'VAR_0847', 'VAR_1428']

Remove duplicated features

Fill NaNs with something we can find later if needed.

traintest.fillna('NaN', inplace=True)

Now let’s encode each feature, as we discussed.

train_enc = pd.DataFrame(index=train.index)
for col in tqdm_notebook(traintest.columns):
    train_enc[col] = train[col].factorize()[0]

We could also do something like this:

# train_enc[col] = train[col].map(train[col].value_counts())

The resulting DataFrame is very large, so we cannot simply transpose it and use .duplicated. That is why we will use a simple loop instead.

dup_cols = {}
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

dup_cols
{'VAR_0009': 'VAR_0008',
 'VAR_0010': 'VAR_0008',
 'VAR_0011': 'VAR_0008',
 'VAR_0012': 'VAR_0008',
 'VAR_0013': 'VAR_0006',
 'VAR_0018': 'VAR_0008',
 'VAR_0019': 'VAR_0008',
 'VAR_0020': 'VAR_0008',
 'VAR_0021': 'VAR_0008',
 'VAR_0022': 'VAR_0008',
 'VAR_0023': 'VAR_0008',
 'VAR_0024': 'VAR_0008',
 'VAR_0025': 'VAR_0008',
 'VAR_0026': 'VAR_0008',
 'VAR_0027': 'VAR_0008',
 'VAR_0028': 'VAR_0008',
 'VAR_0029': 'VAR_0008',
 'VAR_0030': 'VAR_0008',
 'VAR_0031': 'VAR_0008',
 'VAR_0032': 'VAR_0008',
 'VAR_0038': 'VAR_0008',
 'VAR_0039': 'VAR_0008',
 'VAR_0040': 'VAR_0008',
 'VAR_0041': 'VAR_0008',
 'VAR_0042': 'VAR_0008',
 'VAR_0043': 'VAR_0008',
 'VAR_0044': 'VAR_0008',
 'VAR_0181': 'VAR_0180',
 'VAR_0182': 'VAR_0180',
 'VAR_0189': 'VAR_0188',
 'VAR_0190': 'VAR_0188',
 'VAR_0196': 'VAR_0008',
 'VAR_0197': 'VAR_0008',
 'VAR_0199': 'VAR_0008',
 'VAR_0201': 'VAR_0051',
 'VAR_0202': 'VAR_0008',
 'VAR_0203': 'VAR_0008',
 'VAR_0210': 'VAR_0208',
 'VAR_0211': 'VAR_0208',
 'VAR_0215': 'VAR_0008',
 'VAR_0216': 'VAR_0008',
 'VAR_0221': 'VAR_0008',
 'VAR_0222': 'VAR_0008',
 'VAR_0223': 'VAR_0008',
 'VAR_0228': 'VAR_0227',
 'VAR_0229': 'VAR_0008',
 'VAR_0238': 'VAR_0089',
 'VAR_0239': 'VAR_0008',
 'VAR_0357': 'VAR_0260',
 'VAR_0394': 'VAR_0246',
 'VAR_0438': 'VAR_0246',
 'VAR_0446': 'VAR_0246',
 'VAR_0512': 'VAR_0506',
 'VAR_0527': 'VAR_0246',
 'VAR_0528': 'VAR_0246',
 'VAR_0529': 'VAR_0526',
 'VAR_0530': 'VAR_0246',
 'VAR_0672': 'VAR_0670',
 'VAR_1036': 'VAR_0916'}
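
The pairwise loop above is quadratic in the number of columns. If it is too slow, a common speed-up (my addition, not from the original notebook) is to hash each encoded column first and run the exact comparison only within hash buckets:

import hashlib

# Bucket columns by a hash of their encoded values; only columns that share
# a bucket can possibly be duplicates.
buckets = {}
for c in train_enc.columns:
    h = hashlib.md5(train_enc[c].values.tobytes()).hexdigest()
    buckets.setdefault(h, []).append(c)

dup_cols_fast = {}
for cols in buckets.values():
    for c in cols[1:]:
        # hash collisions are extremely unlikely, but the final exact check is cheap
        if np.all(train_enc[c] == train_enc[cols[0]]):
            dup_cols_fast[c] = cols[0]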

Don't forget to save the result, as finding these duplicates takes a long time.

import pickle
# pickle files must be opened in binary mode ('wb'), not text mode
pickle.dump(dup_cols, open('dup_cols.p', 'wb'), protocol=pickle.HIGHEST_PROTOCOL)

Drop the duplicated columns from traintest.

traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)

Determine types

Let’s examine the number of unique values.

nunique = train.nunique(dropna=False)
nunique
ID                145231
VAR_0001               3
VAR_0002             820
VAR_0003             588
VAR_0004            7935
VAR_0005               4
VAR_0006              38
VAR_0007              36
VAR_0008               2
VAR_0009               2
VAR_0010               2
VAR_0011               2
VAR_0012               2
VAR_0013              38
VAR_0014              38
VAR_0015              27
VAR_0016              30
VAR_0017              26
VAR_0018               2
VAR_0019               2
VAR_0020               2
VAR_0021               2
VAR_0022               2
VAR_0023               2
VAR_0024               2
VAR_0025               2
VAR_0026               2
VAR_0027               2
VAR_0028               2
VAR_0029               2
                   ...  
VAR_1907              41
VAR_1908              37
VAR_1909              41
VAR_1910              37
VAR_1911             107
VAR_1912           16370
VAR_1913           25426
VAR_1914           14226
VAR_1915            1148
VAR_1916               8
VAR_1917              10
VAR_1918              86
VAR_1919             383
VAR_1920              22
VAR_1921              18
VAR_1922            6798
VAR_1923            2445
VAR_1924             573
VAR_1925              11
VAR_1926               6
VAR_1927              10
VAR_1928              30
VAR_1929             591
VAR_1930               8
VAR_1931              10
VAR_1932              74
VAR_1933             363
VAR_1934               5
target                 2
VAR_0004_mod50        50
Length: 1935, dtype: int64

and build a histogram of those values

plt.figure(figsize=(14,6))
_ = plt.hist(nunique.astype(float)/train.shape[0], bins=100)

[figure: histogram of nunique / number of rows]

Let’s take a look at the features with a huge number of unique values:

mask = (nunique.astype(float)/train.shape[0] > 0.8)
train.loc[:, mask]

            ID     VAR_0212     VAR_0227
0            2          NaN       311951
1            4  9.20713e+10  2.76949e+06
2            5  2.65477e+10       654127
3            7  7.75753e+10  3.01509e+06
4            8  6.04238e+10       118678
...        ...          ...          ...
145230  290463  9.14114e+10   2.6609e+06

145231 rows × 3 columns


The values are not float, they are integer, so these features are likely to be event counts. Let's look at another pack of features.

mask = (nunique.astype(float)/train.shape[0] < 0.8) & (nunique.astype(float)/train.shape[0] > 0.4)
train.loc[:25, mask]

[output truncated: 26 rows × 9 columns — VAR_0541, VAR_0543, VAR_0899, VAR_1081, VAR_1082, VAR_1087, VAR_1179, VAR_1180, VAR_1181; most values look like counts, and row 23 is filled with 999999999 / -99999 placeholder values]

These look like counts too. The first thing to notice is row 23: the 999999999 and -99999 values look like NaN placeholders, so we should probably build a related feature. Second, correlated columns are sometimes placed next to each other, so the columns were probably grouped together and we can try to disentangle those groups.

Our conclusion: there are no genuine floating-point variables, but there are some count-like variables, which we will treat as numeric.
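
The placeholder values spotted above can be turned into indicator features right away; a minimal sketch (the sentinel list and the chosen columns are illustrative):

# Flag rows where a column holds an obvious placeholder value
sentinels = [999999999, -99999]
for c in ['VAR_0541', 'VAR_0543']:  # columns from the table above, as an example
    train[c + '_is_sentinel'] = train[c].isin(sentinels).astype(int)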

And finally, let’s pick one variable (in this case ‘VAR_0015’) from the third group of features.

train['VAR_0015'].value_counts()
 0.0      102382
 1.0       28045
 2.0        8981
 3.0        3199
 4.0        1274
 5.0         588
 6.0         275
 7.0         166
 8.0          97
-999.0        56
 9.0          51
 10.0         39
 11.0         18
 12.0         16
 13.0          9
 14.0          8
 15.0          8
 16.0          6
 22.0          3
 21.0          3
 19.0          1
 35.0          1
 17.0          1
 29.0          1
 18.0          1
 32.0          1
 23.0          1
Name: VAR_0015, dtype: int64
cat_cols = list(train.select_dtypes(include=['object']).columns)
num_cols = list(train.select_dtypes(exclude=['object']).columns)

Go through

Let’s replace NaNs with something first.

train.replace('NaN', -999, inplace=True)

Let’s calculate how many times one feature is greater than another and build a cross table out of it.

# select first 42 numeric features
feats = num_cols[:42]
# build 'mean(feat1 > feat2)' plot
gt_matrix(feats,16)

[figure: gt_matrix heatmap — mean(feat1 > feat2) for the first 42 numeric features]

Indeed, we see interesting patterns here. There are blocks of features where one is strictly greater than the other. So we can hypothesize that each column corresponds to a cumulative count: e.g. feature number one is the count in the first month, the second is the total count in the first two months, and so on. We then immediately understand which features we should generate to make tree-based models more efficient: the differences between consecutive values.
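
A minimal sketch of that idea (taking consecutive columns of one block; the real block boundaries would come from the matrix above):

# Differences between consecutive (hypothesized) cumulative-count columns
block = feats[:5]  # illustrative: a run of consecutive columns from one block
for prev, cur in zip(block[:-1], block[1:]):
    train['diff_%s_%s' % (cur, prev)] = train[cur] - train[prev]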

VAR_0002, VAR_0003

hist_it(train['VAR_0002'])
plt.ylim((0,0.05))
plt.xlim((-10,1010))
hist_it(train['VAR_0003'])
plt.ylim((0,0.03))
plt.xlim((-10,1010))
(-10, 1010)

[figure: class-conditional histogram of VAR_0002]

[figure: class-conditional histogram of VAR_0003]

train['VAR_0002'].value_counts()
12     5264
24     4763
36     3499
60     2899
6      2657
13     2478
72     2243
48     2222
3      2171
4      1917
2      1835
84     1801
120    1786
1      1724
7      1671
26     1637
5      1624
14     1572
18     1555
8      1513
999    1510
25     1504
96     1445
30     1438
9      1306
144    1283
15     1221
27     1186
38     1146
37     1078
       ... 
877       1
785       1
750       1
653       1
784       1
764       1
751       1
797       1
926       1
691       1
808       1
774       1
902       1
755       1
656       1
814       1
813       1
685       1
739       1
935       1
906       1
807       1
550       1
933       1
804       1
675       1
674       1
745       1
778       1
851       1
Name: VAR_0002, Length: 820, dtype: int64
train['VAR_0003'].value_counts()
0      17436
24      3469
12      3271
60      3054
36      2498
72      2081
48      2048
6       1993
1       1797
3       1679
84      1553
2       1459
999     1428
4       1419
120     1411
7       1356
13      1297
18      1296
96      1253
14      1228
8       1216
5       1189
9       1182
30      1100
25      1100
144     1090
15      1047
61      1008
26       929
42       921
       ...  
560        1
552        1
550        1
804        1
543        1
668        1
794        1
537        1
531        1
664        1
632        1
709        1
597        1
965        1
852        1
648        1
596        1
466        1
592        1
521        1
533        1
636        1
975        1
973        1
587        1
523        1
584        1
759        1
583        1
570        1
Name: VAR_0003, Length: 588, dtype: int64

We see there is something special about 12, 24 and so on, so we can create another feature, x mod 12.
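
A minimal sketch (the new column names are illustrative):

# 12, 24, 36, ... look like months, so the remainder mod 12 may be informative
train['VAR_0002_mod12'] = train['VAR_0002'] % 12
train['VAR_0003_mod12'] = train['VAR_0003'] % 12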

VAR_0004

train['VAR_0004_mod50'] = train['VAR_0004'] % 50
hist_it(train['VAR_0004_mod50'])
plt.ylim((0,0.6))
(0, 0.6)

[figure: histogram of VAR_0004 mod 50]

Categorical features

Let’s take a look at the categorical features we have.

train.loc[:,cat_cols].head().T

                              0                    1                    2                    3                    4
VAR_0001                      H                    H                    H                    H                    R
VAR_0005                      C                    B                    C                    C                    N
VAR_0008                  False                False                False                False                False
VAR_0009                  False                False                False                False                False
VAR_0010                  False                False                False                False                False
VAR_0011                  False                False                False                False                False
VAR_0012                  False                False                False                False                False
VAR_0043                  False                False                False                False                False
VAR_0044                     []                   []                   []                   []                   []
VAR_0073                    NaT  2012-09-04 00:00:00                  NaT                  NaT                  NaT
VAR_0075    2011-11-08 00:00:00  2011-11-10 00:00:00  2011-12-13 00:00:00  2010-09-23 00:00:00  2011-10-15 00:00:00
VAR_0156                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0157                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0158                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0159                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0166                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0167                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0168                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0169                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0176                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0177                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0178                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0179                    NaT                  NaT                  NaT                  NaT                  NaT
VAR_0196                  False                False                False                False                False
VAR_0200          FT LAUDERDALE               SANTEE           REEDSVILLE              LIBERTY            FRANKFORT
VAR_0202           BatchInquiry         BatchInquiry         BatchInquiry         BatchInquiry         BatchInquiry
VAR_0204    2014-01-29 21:16:00  2014-02-01 00:11:00  2014-01-30 15:11:00  2014-02-01 00:07:00  2014-01-29 19:31:00
VAR_0214                    NaN                  NaN                  NaN                  NaN                  NaN
VAR_0216                     DS                   DS                   DS                   DS                   DS
VAR_0217    2011-11-08 02:00:00  2012-10-02 02:00:00  2011-12-13 02:00:00  2012-11-01 02:00:00  2011-10-15 02:00:00
VAR_0222                     C6                   C6                   C6                   C6                   C6
VAR_0226                  False                False                False                False                False
VAR_0229                  False                False                False                False                False
VAR_0230                  False                False                False                False                False
VAR_0232                   True                False                 True                False                 True
VAR_0236                   True                 True                 True                 True                 True
VAR_0237                     FL                   CA                   WV                   TX                   IL
VAR_0239                  False                False                False                False                False
VAR_0274                     FL                   MI                   WV                   TX                   IL
VAR_0283                      S                    S                    S                    S                    S
VAR_0305                      S                    S                    P                    P                    P
VAR_0325                     -1                    H                    R                    H                    S
VAR_0342                     CF                   EC                   UU                   -1                   -1
VAR_0352                      O                    O                    R                    R                    R
VAR_0353                      U                    R                    R                    R                    U
VAR_0354                      O                    R                   -1                   -1                    O
VAR_0404    CHIEF EXECUTIVE OFFICER              -1                   -1                   -1                   -1
VAR_0466                     -1                    I                   -1                   -1                   -1
VAR_0467                     -1           Discharged                   -1                   -1                   -1
VAR_0493    COMMUNITY ASSOCIATION MANAGER        -1                   -1                   -1                   -1
VAR_1934                   IAPS                 IAPS                 IAPS                  RCC               BRANCH

VAR_0200, VAR_0237, VAR_0274 look like some geographical data, so one could generate geography-related features from them; we will talk about this later in the course.

There are some features that are hard to identify, but notice that there are date columns: VAR_0073 … VAR_0179, VAR_0204, VAR_0217. It is useful to plot one date against another to find relationships.

date_cols = ['VAR_0073', 'VAR_0075',
             'VAR_0156', 'VAR_0157', 'VAR_0158', 'VAR_0159',
             'VAR_0166', 'VAR_0167', 'VAR_0168', 'VAR_0169',
             'VAR_0176', 'VAR_0177', 'VAR_0178', 'VAR_0179',
             'VAR_0204',
             'VAR_0217']
for c in date_cols:
    train[c] = pd.to_datetime(train[c], format='%d%b%y:%H:%M:%S')
    test[c] = pd.to_datetime(test[c], format='%d%b%y:%H:%M:%S')
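
The helper sc2 used below is not defined in this notebook; a minimal stand-in built on plt.scatter (my assumption about its behavior) could be:

def sc2(x, y, **kwargs):
    # Scatter one (date) array against another; kwargs pass through to plt.scatter
    plt.figure(figsize=(10, 10))
    plt.scatter(x, y, s=3, **kwargs)
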
c1 = 'VAR_0217'
c2 = 'VAR_0073'

# mask = (~test[c1].isnull()) & (~test[c2].isnull())
# sc2(test.loc[mask, c1].values, test.loc[mask, c2].values, alpha=0.7, c='black')

mask = (~train[c1].isnull()) & (~train[c2].isnull())
sc2(train.loc[mask, c1].values, train.loc[mask, c2].values, c=train.loc[mask, 'target'].values)

[figure: scatter plot of VAR_0217 vs VAR_0073, colored by target]

We see that one date is strictly greater than the other, so the difference between them can be a good feature. Also look at the horizontal line there: it also looks like a NaN placeholder, so I would rather create a new binary feature which will serve as an indicator that our time feature is NaN.
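
Both ideas fit in a couple of lines; a minimal sketch (the feature names are illustrative):

# Difference between the two dates in days, plus a NaN indicator for VAR_0073
train['VAR_0217_minus_0073_days'] = (train['VAR_0217'] - train['VAR_0073']).dt.days
train['VAR_0073_isnull'] = train['VAR_0073'].isnull().astype(int)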