
Processing Anonymized Features

IMPORTANT: You will not be able to run this notebook on the Coursera platform, as the dataset is not available there. The notebook is in read-only mode.

However, you can run the notebook locally: download the dataset using this link and explore the data interactively.

import pandas as pd

pd.set_option('display.max_columns', 100)  # show up to 100 columns when printing frames

Load the data

train = pd.read_csv('./train.csv')
train.head()

(Output: the first 5 rows of train — 62 anonymized columns x0–x61 plus the target y. Several columns hold hash-like string identifiers, the rest are scaled floats; the table is too wide to reproduce here.)

Build a quick baseline

from sklearn.ensemble import RandomForestClassifier

# Create a copy to work with
X = train.copy()

# Save and drop labels
y = train.y
X = X.drop('y', axis=1)

# Fill NaNs with an out-of-range constant
X = X.fillna(-999)

# Label-encode the object (string) columns
for c in train.columns[train.dtypes == 'object']:
    X[c] = X[c].factorize()[0]

rf = RandomForestClassifier()
rf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
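To get a rough sense of how good this baseline actually is, one could add a quick cross-validated score. This is a minimal sketch, not part of the original notebook; the number of trees and the 3-fold split are arbitrary illustrative choices:

from sklearn.model_selection import cross_val_score

# Quick sanity check of the baseline accuracy (parameters are illustrative only)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1), X, y, cv=3)
print(scores.mean())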
import numpy as np
import matplotlib.pyplot as plt

plt.plot(rf.feature_importances_)
plt.xticks(np.arange(X.shape[1]), X.columns.tolist(), rotation=90);

(Plot: random forest feature importances across x0–x61.)

There is something interesting about x8.
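One way to confirm this numerically, without relying on the plot, is to rank the importances of the fitted forest. A minimal sketch, reusing rf and X from above:

# Rank the random forest importances to see which column dominates
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp.head(10))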

# It looks like x8 was standard scaled: if we concatenate train and test,
# we will most likely get almost exactly mean=0 and std=1
print('Mean:', train.x8.mean())
print('std:', train.x8.std())
Mean: -0.000252352028622
std: 1.02328163601
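If the test set has been downloaded locally, this guess can be checked on the concatenation. A sketch under the assumption that a test.csv with the same x8 column sits next to train.csv:

# Hypothetical check: if the scaler was fit on train + test together,
# the combined statistics should be (almost) exactly 0 and 1
test = pd.read_csv('./test.csv')
x8_all = pd.concat([train.x8, test.x8])
print('Mean:', x8_all.mean())
print('std:', x8_all.std())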
# And we see that it has a lot of repeated values
train.x8.value_counts().head(15)
-2.984750    2770
 0.480977    2569
 0.610941    1828
 0.654263    1759
 0.567620    1746
 0.697585    1691
 0.524298    1639
 0.740906    1628
 0.394333    1610
 0.437655    1513
 0.351012    1450
 0.264369    1429
 0.307690    1401
 0.221047    1372
 0.784228    1293
Name: x8, dtype: int64
# It's hard to work with a scaled feature, so let's try to scale it back
# First, take a look at the differences between neighbouring unique values of x8
x8_unique = train.x8.unique()
x8_unique_sorted = np.sort(x8_unique)
np.diff(x8_unique_sorted)
array([ 43.27826527,  38.98942817,   0.21660793,   0.04332159,
         0.17328635,   0.21660793,   0.08664317,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.12996476,   0.04332159,
         0.04332159,   0.04332159,   0.04332159,   0.04332159,
         0.04332159,   0.04332159,   0.21660793,   1.16968285,
         0.04332159,   0.38989428,          nan])
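Instead of reading the constant off the printout, the grid step can also be estimated programmatically, for example as the median spacing between sorted unique values. A small sketch, not the author's original code:

# Estimate the grid step of x8 from the spacings of its sorted unique values
diffs = np.diff(np.sort(train.x8.dropna().unique()))
print(np.median(diffs))  # ~0.04332159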
# Most of the diffs are 0.04332159!
# The data was scaled, so we don't know what the diff was for the original feature,
# but let's assume it was 1.0
# Divide all the values by 0.04332159 to undo the scaling
# (note that the feature will still have roughly zero mean)
np.diff(x8_unique_sorted / 0.04332159)
array([ 998.99992752,  899.9999347 ,    4.99999964,    0.99999993,
          3.99999971,    4.99999964,    1.99999985,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    2.99999978,    0.99999993,
          0.99999993,    0.99999993,    0.99999993,    0.99999993,
          0.99999993,    0.99999993,    4.99999964,   26.99999804,
          0.99999993,    8.99999935,           nan])
(train.x8/0.04332159).head(10)
0   -15.897530
1    20.102468
2    10.102468
3     0.102469
4    11.102468
5   -68.897526
6    10.102468
7    15.102468
8     9.102468
9   -68.897526
Name: x8, dtype: float64
# Ok, now we see .102468 in every value
# This looks like the fractional part of the mean that was subtracted during standard scaling
# If we subtract it, the values become almost integers
(train.x8/0.04332159 - .102468).head(10)
0   -15.999998
1    20.000000
2    10.000000
3     0.000001
4    11.000000
5   -68.999994
6    10.000000
7    15.000000
8     9.000000
9   -68.999994
Name: x8, dtype: float64
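The offset .102468 can likewise be recovered without eyeballing, for example as the median fractional remainder after dividing by the step. Again a sketch rather than the original code:

# Estimate the leftover fractional part after dividing by the step
frac = (train.x8 / 0.04332159) % 1
print(frac.median())  # close to the .102468 seen above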
# let's round them
x8_int = (train.x8/0.04332159 - .102468).round()
x8_int.head(10)
0   -16.0
1    20.0
2    10.0
3     0.0
4    11.0
5   -69.0
6    10.0
7    15.0
8     9.0
9   -69.0
Name: x8, dtype: float64
# Ok, what's next? In fact it is not obvious how to find the shift parameter,
# or how to understand what data this feature actually stores.
# But ...
x8_int.value_counts()
-69.0      2770
 11.0      2569
 14.0      1828
 15.0      1759
 13.0      1746
 16.0      1691
 12.0      1639
 17.0      1628
 9.0       1610
 10.0      1513
 8.0       1450
 6.0       1429
 7.0       1401
 5.0       1372
 18.0      1293
 1.0       1290
 4.0       1276
 2.0       1250
 3.0       1213
-1.0       1085
 0.0       1080
-2.0       1006
-4.0        995
-3.0        976
-5.0        954
-8.0        923
-9.0        921
-6.0        906
 19.0       893
-7.0        881
           ... 
 26.0         3
-40.0         3
-41.0         3
 25.0         2
-59.0         2
 31.0         2
 34.0         2
-46.0         2
-49.0         2
 33.0         2
-42.0         2
 32.0         2
 37.0         2
 30.0         2
-45.0         2
-54.0         1
 36.0         1
-51.0         1
 27.0         1
 79.0         1
-47.0         1
 69.0         1
 70.0         1
-50.0         1
-1968.0       1
 42.0         1
-63.0         1
-48.0         1
-64.0         1
 35.0         1
Name: x8, Length: 99, dtype: int64
# Do you see that -1968? Doesn't it look like a year? My hypothesis is that this feature is a year of birth!
# Maybe it was a text box where users entered their year of birth, and someone entered 0000 instead.
# The hypothesis looks plausible, doesn't it?
(x8_int + 1968.0).value_counts().sort_index()
0.0          1
999.0        4
1899.0    2770
1904.0       1
1905.0       1
1909.0       2
1914.0       1
1916.0       3
1917.0       1
1918.0       1
1919.0       2
1920.0       1
1921.0       1
1922.0       2
1923.0       2
1924.0       4
1925.0       4
1926.0       2
1927.0       3
1928.0       3
1929.0       4
1930.0       4
1931.0      12
1932.0      10
1933.0       7
1934.0      13
1935.0      28
1936.0      35
1937.0      35
1938.0      45
          ... 
1978.0    1513
1979.0    2569
1980.0    1639
1981.0    1746
1982.0    1828
1983.0    1759
1984.0    1691
1985.0    1628
1986.0    1293
1987.0     893
1988.0     624
1989.0     434
1990.0     233
1991.0     110
1992.0      31
1993.0       2
1994.0       3
1995.0       1
1998.0       2
1999.0       2
2000.0       2
2001.0       2
2002.0       2
2003.0       1
2004.0       1
2005.0       2
2010.0       1
2037.0       1
2038.0       1
2047.0       1
Name: x8, Length: 99, dtype: int64
# After the competition ended, the organisers confirmed it was indeed a year of birth.
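Putting the steps together, the whole decoding of x8 can be wrapped in one small helper. This is a sketch based on the constants found above; the shift of 1968 comes from matching the suspicious -1968 value to a year entered as 0000:

def decode_x8(series, step=0.04332159, frac=0.102468, shift=1968.0):
    """Undo the presumed standard scaling of x8 and shift it back to a year of birth."""
    return (series / step - frac).round() + shift

year_of_birth = decode_x8(train.x8)
print(year_of_birth.value_counts().sort_index().head())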