Abracadabra

Do it yourself



EDA Example I (Springleaf competition)

Posted on 2018-10-16

This notebook accompanies the screencast video. The data files are not available on the Jupyter hub, so the notebook cannot be run there. You can still download the notebook and the competition data to your local machine and run it interactively.

import os
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import seaborn
def autolabel(arrayA):
    '''Label each colored square with the corresponding data value.
    The text is always drawn in white.
    '''
    arrayA = np.array(arrayA)
    for i in range(arrayA.shape[0]):
        for j in range(arrayA.shape[1]):
            plt.text(j, i, "%.2f" % arrayA[i, j], ha='center', va='bottom', color='w')

def hist_it(feat):
    # Overlay per-class histograms of an integer-valued feature (uses the global target Y).
    # density=True replaces the `normed` argument removed from recent matplotlib versions.
    plt.figure(figsize=(16, 4))
    feat[Y == 0].hist(bins=range(int(feat.min()), int(feat.max() + 2)), density=True, alpha=0.8)
    feat[Y == 1].hist(bins=range(int(feat.min()), int(feat.max() + 2)), density=True, alpha=0.5)
    plt.ylim((0, 1))

def gt_matrix(feats, sz=16):
    # For every pair of features, plot the share of rows where c1 >= c2
    # (on and below the diagonal) or c1 > c2 (above it), ignoring rows with NaN.
    a = []
    for i, c1 in enumerate(feats):
        b = []
        for j, c2 in enumerate(feats):
            mask = (~train[c1].isnull()) & (~train[c2].isnull())
            if i >= j:
                b.append((train.loc[mask, c1].values >= train.loc[mask, c2].values).mean())
            else:
                b.append((train.loc[mask, c1].values > train.loc[mask, c2].values).mean())
        a.append(b)
    plt.figure(figsize=(sz, sz))
    plt.imshow(a, interpolation='none')
    _ = plt.xticks(range(len(feats)), feats, rotation=90)
    _ = plt.yticks(range(len(feats)), feats, rotation=0)
    autolabel(a)
def hist_it1(feat):
    # Same as hist_it, but with 100 equal-width bins for continuous features.
    plt.figure(figsize=(16, 4))
    feat[Y == 0].hist(bins=100, range=(feat.min(), feat.max()), density=True, alpha=0.5)
    feat[Y == 1].hist(bins=100, range=(feat.min(), feat.max()), density=True, alpha=0.5)
    plt.ylim((0, 1))

Read the data

train = pd.read_csv('train.csv.zip')
Y = train.target
test = pd.read_csv('test.csv.zip')
test_ID = test.ID

EDA check list

Posted on 2018-10-16
  • Get domain knowledge
  • Check whether the data is plausible (anomaly detection)
    • add an is_incorrect feature that flags suspicious rows
  • Understand how the data was generated
    • It is crucial to understand the generation process to set up a proper validation scheme
  • Two things to do with anonymized features
    • Try to decode the features
      • Guess the true meaning of the feature
    • Guess the feature types
      • Each type needs its own preprocessing
  • Visualization
    • Tools for individual features exploration
      • Histograms plt.hist(x)
      • Plot (index versus value) plt.plot(x, '.')
      • Statistics df.describe() or x.mean() or x.var()
      • Other tools x.value_counts() or x.isnull()
    • Tools for feature relationships
      • Pairs
        • plt.scatter(x1, x2)
        • pd.plotting.scatter_matrix(df)
        • df.corr() or plt.matshow()
      • Groups:
        • Clustering
        • Plot (index vs feature statistics) df.mean().sort_values().plot()
  • Data cleaning
    • remove duplicated and constant features
      • constant features: traintest.nunique(axis=0) == 1
      • duplicated numeric features: traintest.T.drop_duplicates()
      • duplicated categorical features: for f in categorical_feats: traintest[f] = traintest[f].factorize()[0], then traintest.T.drop_duplicates()
    • check whether duplicated rows have the same label
    • check whether the dataset is shuffled
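
The cleaning steps in the checklist can be sketched roughly as follows. This is a minimal illustration on a made-up frame: the column names, the toy values, and the explicit categorical_feats list are assumptions, not part of the original course notebook.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated train + test frame.
traintest = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],          # constant feature
    "c": ["x", "y", "x", "z"],
    "d": ["p", "q", "p", "r"],  # same as "c" up to renaming of the levels
    "e": [1, 2, 3, 4],          # exact duplicate of "a"
})

# 1. Constant features: a single distinct value in the column.
nunique = traintest.nunique(dropna=False)
constant_cols = nunique[nunique == 1].index.tolist()
traintest = traintest.drop(columns=constant_cols)

# 2. Encode categoricals by order of appearance, so columns that differ
#    only in level names become identical integer columns.
categorical_feats = ["c", "d"]  # assumed to be known in advance
for f in categorical_feats:
    traintest[f] = traintest[f].factorize()[0]

# 3. Duplicated features: duplicated() on the transpose flags repeated columns.
dup_mask = traintest.T.duplicated()
duplicate_cols = dup_mask[dup_mask].index.tolist()
traintest = traintest.drop(columns=duplicate_cols)

print(constant_cols, duplicate_cols, list(traintest.columns))
```

Transposing a wide frame can be slow and memory hungry, so on real competition data it may be more practical to compare columns in batches.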

Processing Anonymized Features

Posted on 2018-10-16

IMPORTANT: You will not be able to run this notebook on the Coursera platform, because the dataset is not there. The notebook is in read-only mode.

However, you can download the dataset using this link and run the notebook locally to explore the data interactively.

import pandas as pd
pd.set_option('display.max_columns', 100)

Load the data

train = pd.read_csv('./train.csv')
train.head()

Exploratory Data Analysis

Posted on 2018-10-16

This is a detailed EDA of the data shown in the second video of the “Exploratory data analysis” lecture (week 2).

PLEASE NOTE: the dataset cannot be published, so this notebook is read-only.

Load data

In this competition hosted by solutions.se, the task was to predict the advertisement cost for a particular ad.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data_path = './data'
train = pd.read_csv('%s/train.csv.gz' % data_path, parse_dates=['Date'])
test = pd.read_csv('%s/test.csv.gz' % data_path, parse_dates=['Date'])

Let’s look at the data (notice that the table is transposed, so we can see all feature names).

train.head().T

Data preprocessing techniques

Posted on 2018-10-15

Numeric data (non-tree-based models)

  • Feature preprocessing
    • MinMaxScaler (does not change the shape of the distribution)
    • StandardScaler
    • scipy.stats.rankdata
    • log transform np.log(1+x)
    • raising to a power < 1, e.g. np.sqrt(x + 2/3)
    • limit outliers (winsorization: clip values to specified lower and upper bounds)

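Either train a single model on the combined features produced by the different preprocessing methods, or train one model per preprocessing variant and blend the models at the end.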
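
A minimal sketch of the numeric transforms listed above, applied to a made-up feature with one large outlier. The data and the 1st/99th percentile clipping bounds are illustrative assumptions, not values from the original post.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical skewed feature with one large outlier, shaped (n_samples, 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

x_minmax = MinMaxScaler().fit_transform(x)      # rescaled to [0, 1]
x_standard = StandardScaler().fit_transform(x)  # zero mean, unit variance
x_rank = rankdata(x.ravel())                    # 1-based ranks; the outlier is just the top rank
x_log = np.log1p(x)                             # log(1 + x), compresses large values
x_sqrt = np.sqrt(x + 2 / 3)                     # power < 1

# Winsorization: clip to chosen percentiles (the bounds here are arbitrary).
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)
```

Scaling and standardizing keep the outlier far from the rest of the values, while the rank, log, and clipping variants pull it closer, which is usually what linear models, KNN, and neural networks benefit from.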

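  • Feature generation
    • Relies mainly on prior knowledge and a deep understanding of the data
    • For example, extract the fractional part of a floating-point value as a separate feature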
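
As a small illustration of the fractional-part idea, assuming a hypothetical price column (the values are made up):

```python
import pandas as pd

# Hypothetical price column; prices ending in .99 or .49 may carry signal.
prices = pd.Series([4.99, 10.00, 2.49, 7.50])

# Fractional part, rounded to cents to hide floating-point noise.
fractional_part = (prices % 1).round(2)
print(fractional_part.tolist())
```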

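Categorical and ordinal data

  • Feature preprocessing
    • Label encoding (mostly for tree-based models, sometimes for non-tree-based ones)
      • alphabetical order: sklearn.preprocessing.LabelEncoder
      • order of appearance: pandas.factorize
      • frequency encoding (well suited to test data that contains categories absent from the training data)
    • One-hot encoding (mostly for non-tree-based models, sometimes for tree-based ones)
      • store the result as a sparse matrix
  • Feature generation
    • Enumerate combinations of different categorical features to form new categorical features (linear models and KNN)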
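
The encodings above might look like this on a toy frame. The city and device columns are hypothetical and only serve to show the calls.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data with two categorical features.
train = pd.DataFrame({
    "city": ["paris", "london", "paris", "berlin", "paris"],
    "device": ["mobile", "desktop", "mobile", "mobile", "desktop"],
})

# Alphabetical label encoding: berlin=0, london=1, paris=2.
alphabetical = LabelEncoder().fit_transform(train["city"])

# Order-of-appearance encoding: paris=0, london=1, berlin=2.
appearance, levels = pd.factorize(train["city"])

# Frequency encoding: replace each category with its share of the rows.
freq = train["city"].value_counts(normalize=True)
frequency = train["city"].map(freq)

# One-hot encoding, stored sparsely to save memory on high-cardinality features.
one_hot = pd.get_dummies(train["city"], sparse=True)

# Feature interaction: concatenate two categoricals into a new categorical.
interaction = train["city"] + "_" + train["device"]
```

The interaction column can then be one-hot encoded like any other categorical feature.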

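Datetime and coordinate data

Datetime data

  • Feature generation
    • Periodicity
      • Day number in week, month, season, year
      • second, minute, hour
    • Time since an event
      • Problem-independent, e.g. time since 1 January 1970
      • Problem-dependent, e.g. number of days until the next holiday
    • Difference between two date features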
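
A rough sketch of these date features with pandas. The signup and purchase columns and the holiday date are invented for the example.

```python
import pandas as pd

# Hypothetical frame with two datetime features.
df = pd.DataFrame({
    "signup": pd.to_datetime(["2018-10-15 08:30:00", "2018-12-20 23:05:10"]),
    "purchase": pd.to_datetime(["2018-10-16 09:00:00", "2018-12-24 12:00:00"]),
})

# Periodicity
df["day_of_week"] = df["signup"].dt.dayofweek  # Monday = 0
df["month"] = df["signup"].dt.month
df["hour"] = df["signup"].dt.hour

# Time since a problem-independent reference point: seconds since 1970-01-01.
df["unix_seconds"] = (df["signup"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Time until a problem-dependent event: whole days until an assumed holiday.
holiday = pd.Timestamp("2018-12-25")
df["days_to_holiday"] = (holiday - df["signup"]).dt.days

# Difference between two date features, in whole days.
df["days_signup_to_purchase"] = (df["purchase"] - df["signup"]).dt.days
```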

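Coordinate data

  • Feature generation
    • Distance to key locations (requires external data)
    • Split the coordinates into a grid or cluster them, then compute the distance from each point in a cell to a chosen point, or from each point in a cluster to the cluster center
    • Point density (within a given radius)
    • Value of the area, e.g. general prices or housing prices (within a given radius)
  • Feature preprocessing
    • Rotate the coordinates (e.g. by 45°)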
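
Two of these coordinate features, sketched with NumPy on made-up planar points. The reference point and the rotation angle are assumptions; real latitude/longitude data would normally need a proper geographic distance.

```python
import numpy as np

# Hypothetical planar coordinates (x, y), one row per point.
points = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

# Distance to a key location (here an invented reference point).
center = np.array([0.0, 0.0])
dist_to_center = np.linalg.norm(points - center, axis=1)

# Rotation by 45 degrees. Tree-based models split on axis-aligned thresholds,
# so rotated axes can make a diagonal boundary easier to find.
theta = np.deg2rad(45)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = points @ rotation.T
```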

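Missing values

  • Find hidden NaNs by visualizing the data distribution
  • Fill methods
    • -999, -1, etc.
    • median, mean, etc.
    • try to reconstruct the missing values (e.g. with linear regression)
  • Feature generation
    • add an indicator feature that marks whether a value is missing
    • be especially careful when generating features from filled-in values; in general, if you plan to generate features, it is better not to fill missing values beforehand
  • XGBoost handles NaN natively, so it is not sensitive to missing values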
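
A small sketch of the missing-value steps, assuming a hypothetical age column in which -999 was used as a placeholder for missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical feature; -999 is a hidden NaN that a histogram would expose as a spike.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, -999.0, 31.0]})

# Make the hidden NaN explicit.
df["age"] = df["age"].replace(-999.0, np.nan)

# Build the indicator before filling, so the information is not lost.
df["age_is_missing"] = df["age"].isnull().astype(int)

# Fill with the median of the observed values, in a separate column.
df["age_filled"] = df["age"].fillna(df["age"].median())
```

Keeping the unfilled column around makes it possible to generate further features from the original values rather than from the imputed ones.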

Text data

  • Feature generation
    • Bag of words: sklearn.feature_extraction.text.CountVectorizer
    • TF-IDF: sklearn.feature_extraction.text.TfidfVectorizer
    • N-grams (the ngram_range parameter of the vectorizers above)
  • Feature preprocessing
    • lowercasing
    • lemmatization (reduce a word to its dictionary form)
    • stemming
    • stop-word removal (nltk)
  • Word2Vec, Doc2Vec, GloVe, FastText, etc.
  • Pipeline
    1. Preprocessing
    2. N-grams, then TF-IDF
    3. or Word2Vec, etc.
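
The first two pipeline steps could be sketched with scikit-learn as below. The two documents are made up, and stemming or lemmatization (which would need nltk or a similar library) is left out of this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus.
docs = ["The cat sat on the mat", "The dog sat on the log"]

# Bag of words with lowercasing: one column per distinct token.
bow = CountVectorizer(lowercase=True)
bow_matrix = bow.fit_transform(docs)

# TF-IDF over unigrams and bigrams, with English stop words removed.
tfidf = TfidfVectorizer(lowercase=True, ngram_range=(1, 2), stop_words="english")
tfidf_matrix = tfidf.fit_transform(docs)

print(sorted(bow.vocabulary_))
print(sorted(tfidf.vocabulary_))
```

Both matrices are sparse, so they can be passed directly to linear models without converting them to dense arrays.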

Image data

  • Feature maps from different layers can be combined
Ewan Li


Ewan's IT Blog

© 2019 Ewan Li