Abracadabra


Naive Bayes

Posted on 2017-02-25

Text Classification with Naive Bayes

Preparing the data

Building word vectors from text

Below we write a function that converts a vocabulary list and a document into a word vector.

def LoadDataSet():
    """ Load a hand-crafted data set as a list of tokenized postings.
    Returns:
        posting_list: the list of tokenized postings.
        class_vec: the class label corresponding to each posting.
    """
    posting_list = [['my', 'dog', 'has', 'flea',
                     'problems', 'help', 'please'],
                    ['maybe', 'not', 'take', 'him',
                     'to', 'dog', 'park', 'stupid'],
                    ['my', 'dalmation', 'is', 'so', 'cute',
                     'I', 'love', 'him'],
                    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                    ['mr', 'licks', 'ate', 'my', 'steak', 'how',
                     'to', 'stop', 'him'],
                    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    class_vec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return posting_list, class_vec


def CreateVocabList(data_set):
    """ Create a vocabulary list from the data set.
    Arguments:
        data_set: the data source.
    Returns:
        vocab_list: the vocabulary list.
    """
    vocab_set = set([])
    for document in data_set:
        vocab_set = vocab_set | set(document)
    return list(vocab_set)


def SetOfWords2Vec(vocab_list, input_set):
    """ Convert a posting (list of words) into a 0/1 word vector.
    Arguments:
        vocab_list: the vocabulary list.
        input_set: the posting to convert into a vector.
    Returns:
        return_vec: the resulting vector.
    """
    # Initialize
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary" % word)
    return return_vec

Now let's test these functions:

In [13]: import bayes
In [14]: list_of_posts, list_classes = bayes.LoadDataSet()
In [15]: my_vocab_list = bayes.CreateVocabList(list_of_posts)
In [16]: my_vocab_list
Out[16]:
['help',
'worthless',
'I',
'take',
'love',
'maybe',
'stupid',
'to',
'not',
'please',
'quit',
'park',
'posting',
'dog',
'dalmation',
'steak',
'my',
'how',
'food',
'so',
'stop',
'is',
'garbage',
'flea',
'problems',
'has',
'buying',
'ate',
'him',
'licks',
'mr',
'cute']
In [17]: bayes.SetOfWords2Vec(my_vocab_list, list_of_posts[0])
Out[17]:
[1,
0,
0,
0,
0,
0,
0,
0,
0,
1,
0,
0,
0,
1,
0,
0,
1,
0,
0,
0,
0,
0,
0,
1,
1,
1,
0,
0,
0,
0,
0,
0]
In [18]: bayes.SetOfWords2Vec(my_vocab_list, list_of_posts[3])
Out[18]:
[0,
1,
0,
0,
0,
0,
1,
0,
0,
0,
0,
0,
1,
0,
0,
0,
0,
0,
0,
0,
1,
0,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0]

Everything seems to work, so we can move on to the next step.

Training

Computing probabilities from word vectors

Let's look at Bayes' theorem:

$$p(c_i | w) = \frac{p(w | c_i) p(c_i)}{p(w)}$$

Here $w$ denotes the word vector. Computing $p(c_i)$ is straightforward. Note that, by the naive Bayes (conditional independence) assumption, we have:

$$p(w | c_i) = p(w_0, w_1, w_2, \cdots, w_N | c_i) = p(w_0 | c_i)p(w_1 | c_i)p(w_2 | c_i) \cdots p(w_N | c_i)$$

When we need to predict the class of a new sample:

[Figure: nb]
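
The figure presumably illustrates the prediction step. In formula form, since $p(w)$ is the same for every class, a new sample $w$ is simply assigned to the class with the largest numerator:

$$\hat{c} = \arg\max_{c_i} \; p(c_i) \prod_{j} p(w_j \mid c_i)$$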

With that, everything is clear. Here is the pseudocode:

count the number of documents in each class
for each document:
    for each class:
        if a token appears in the document:
            increment the count for that token
            increment the total token count
for each class:
    for each token:
        divide that token's count by the total token count to get the conditional probability
return the conditional probabilities for every class

Note that $p(w_j | c_i)$ has to be estimated over the whole training set. The implementation is as follows:

from numpy import array, log, ones, zeros  # NumPy functions used throughout bayes.py


def TrainNaiveBayes0(train_matrix, train_category):
    """ The training method.
    Arguments:
        train_matrix: the training data (one word vector per document).
        train_category: the training labels.
    Returns:
        p0_vect: the conditional probabilities of w given c0.
        p1_vect: the conditional probabilities of w given c1.
        p_abusive: the prior probability of c1.
    """
    num_train_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_abusive = sum(train_category) / num_train_docs
    p0_num = zeros(num_words)
    p1_num = zeros(num_words)
    p0_denom = 0.0
    p1_denom = 0.0
    for i in range(num_train_docs):
        if train_category[i] == 1:
            p1_num += train_matrix[i]
            p1_denom += sum(train_matrix[i])
        else:
            p0_num += train_matrix[i]
            p0_denom += sum(train_matrix[i])
    p1_vect = p1_num / p1_denom
    p0_vect = p0_num / p0_denom
    return p0_vect, p1_vect, p_abusive

Again, let's test it:

In [25]: train_mat = []
In [26]: for post_in_doc in list_of_posts:
...: train_mat.append(bayes.SetOfWords2Vec(my_vocab_list, post_in_doc))
...:
In [27]: p0_v, p1_v, p_ab = bayes.TrainNaiveBayes0(train_mat, list_classes)
In [28]: p_ab
Out[28]: 0.5
In [29]: p0_v
Out[29]:
array([ 0.04166667, 0. , 0.04166667, 0. , 0.04166667,
0. , 0. , 0.04166667, 0. , 0.04166667,
0. , 0. , 0. , 0.04166667, 0.04166667,
0.04166667, 0.125 , 0.04166667, 0. , 0.04166667,
0.04166667, 0.04166667, 0. , 0.04166667, 0.04166667,
0.04166667, 0. , 0.04166667, 0.08333333, 0.04166667,
0.04166667, 0.04166667])
In [30]: p1_v
Out[30]:
array([ 0. , 0.10526316, 0. , 0.05263158, 0. ,
0.05263158, 0.15789474, 0.05263158, 0.05263158, 0. ,
0.05263158, 0.05263158, 0.05263158, 0.10526316, 0. ,
0. , 0. , 0. , 0.05263158, 0. ,
0.05263158, 0. , 0.05263158, 0. , 0. ,
0. , 0.05263158, 0. , 0.05263158, 0. ,
0. , 0. ])

However, the code above has a couple of flaws. First, $p(w_j | c_i)$ may be zero for some word, and then the whole product becomes zero. So we make a small modification (why 2? Initializing each word count to 1 and each denominator to 2 is a form of Laplace, i.e. add-one, smoothing; the 2 can be read as the number of values a binary word feature can take):

p0_num = ones(num_words)
p1_num = ones(num_words)
p0_denom = 2.0
p1_denom = 2.0

The other problem is numerical underflow, so we switch to log probabilities:

p1_vect = log(p1_num / p1_denom)
p0_vect = log(p0_num / p0_denom)
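
Putting the two fixes together, the training function presumably ends up looking like the sketch below (the name TrainNaiveBayes1 is mine; this is just the same function with the smoothed counts and log probabilities substituted in):

from numpy import log, ones


def TrainNaiveBayes1(train_matrix, train_category):
    """ Sketch: TrainNaiveBayes0 with Laplace smoothing and log probabilities. """
    num_train_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_abusive = sum(train_category) / num_train_docs
    # Laplace smoothing: pretend every word was seen once in each class
    p0_num = ones(num_words)
    p1_num = ones(num_words)
    p0_denom = 2.0
    p1_denom = 2.0
    for i in range(num_train_docs):
        if train_category[i] == 1:
            p1_num += train_matrix[i]
            p1_denom += sum(train_matrix[i])
        else:
            p0_num += train_matrix[i]
            p0_denom += sum(train_matrix[i])
    # logs turn tiny products into manageable sums and avoid underflow
    p1_vect = log(p1_num / p1_denom)
    p0_vect = log(p0_num / p0_denom)
    return p0_vect, p1_vect, p_abusive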

Testing

Finally, we write a classification function and a test function:

def ClassifyNaiveBayes(vec_to_classify, p0_vect, p1_vect, p_abusive):
    """ Classify a word vector.
    Arguments:
        vec_to_classify: the vector to classify.
        p0_vect: the log conditional probabilities of w given c0.
        p1_vect: the log conditional probabilities of w given c1.
        p_abusive: the prior probability of c1.
    Returns:
        0: the predicted class is 0.
        1: the predicted class is 1.
    """
    p1 = sum(vec_to_classify * p1_vect) + log(p_abusive)
    p0 = sum(vec_to_classify * p0_vect) + log(1 - p_abusive)
    if p1 > p0:
        return 1
    else:
        return 0


def TestNaiveBayes():
    """ A test method.
    """
    list_of_posts, list_of_classes = LoadDataSet()
    my_vocab_list = CreateVocabList(list_of_posts)
    train_mat = []
    for post_in_doc in list_of_posts:
        train_mat.append(SetOfWords2Vec(my_vocab_list, post_in_doc))
    p0_v, p1_v, p_ab = TrainNaiveBayes0(
        array(train_mat), array(list_of_classes))
    test_entry = ['love', 'my', 'dalmation']
    this_doc = array(SetOfWords2Vec(my_vocab_list, test_entry))
    print(test_entry, 'classified as: ',
          ClassifyNaiveBayes(this_doc, p0_v, p1_v, p_ab))
    test_entry = ['stupid', 'garbage']
    this_doc = array(SetOfWords2Vec(my_vocab_list, test_entry))
    print(test_entry, 'classified as: ',
          ClassifyNaiveBayes(this_doc, p0_v, p1_v, p_ab))

Test

In [32]: bayes.TestNaiveBayes()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1

Ok, bravo!

Bag-of-Words Model

One remaining issue: so far we have only recorded whether each word appears, which is known as the set-of-words model. If a word can occur more than once in a document, we need the bag-of-words model instead.

def BagOfWords2Vec(vocab_list, input_set):
    """ Convert a posting (list of words) into a count vector.
    Arguments:
        vocab_list: the vocabulary list.
        input_set: the posting to convert into a vector.
    Returns:
        return_vec: the resulting vector.
    """
    # Initialize
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] += 1
        else:
            print("the word: %s is not in my Vocabulary" % word)
    return return_vec

Gaussian Naive Bayes

The plain naive Bayes algorithm takes discrete input features, so it cannot handle continuous inputs directly. In that case the usual assumption is that each input variable follows a normal distribution, which makes $p(w_j | c_i)$ computable again. The whole procedure is as follows:

[Figure: gnb_algo]

[Figure: gnb_mle]
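
The two figures presumably show the Gaussian naive Bayes algorithm and its maximum-likelihood parameter estimates. For reference, under the Gaussian assumption the class-conditional likelihood of a continuous feature $x_j$ is

$$p(x_j \mid c_i) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$

where $\mu_{ij}$ and $\sigma_{ij}^2$ are simply the sample mean and variance of feature $j$ over the training examples of class $c_i$.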

Let's run a quick experiment with scikit-learn:

In [33]: from sklearn import datasets
...: iris = datasets.load_iris()
...: from sklearn.naive_bayes import GaussianNB
...: gnb = GaussianNB()
...: y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
...: print("Number of mislabeled points out of a total %d points : %d"
...: % (iris.data.shape[0],(iris.target != y_pred).sum()))
...:
Number of mislabeled points out of a total 150 points : 6

CS231n Lecture 2 Notes

Posted on 2017-02-24

Image Classification

Goal

Assign a label, drawn from a predefined set of categories, to an input image.

Significance

This is a core problem in computer vision.

Example

[Figure: classify_example]

Challenges

  1. Viewpoint variation
  2. Scale variation (how far the object is from the camera)
  3. Deformation
  4. Partial occlusion
  5. Illumination conditions
  6. Background clutter
  7. Intra-class variation

[Figure: challenges]

Approach

Data-driven (i.e., it relies on training data)

Pipeline

Input -> Learning -> Evaluation

Nearest Neighbor Classifier

[Figure: nearest_neighbor]

Distance metrics

L1 distance

$$d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right|$$

[Figure: l1-distance]

L2 distance

$$d_2 (I_1, I_2) = \sqrt{\sum_{p} \left( I^p_1 - I^p_2 \right)^2}$$

Example code

Loading the data

Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # a magic function we provide
# flatten out all images to be one-dimensional
Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072
Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072

Prediction and evaluation

nn = NearestNeighbor() # create a Nearest Neighbor classifier class
nn.train(Xtr_rows, Ytr) # train the classifier on the training images and labels
Yte_predict = nn.predict(Xte_rows) # predict labels on the test images
# and now print the classification accuracy, which is the average number
# of examples that are correctly predicted (i.e. label matches)
print 'accuracy: %f' % ( np.mean(Yte_predict == Yte) )

Basic implementation

import numpy as np

class NearestNeighbor(object):
  def __init__(self):
    pass

  def train(self, X, y):
    """ X is N x D where each row is an example. Y is 1-dimension of size N """
    # the nearest neighbor classifier simply remembers all the training data
    self.Xtr = X
    self.ytr = y

  def predict(self, X):
    """ X is N x D where each row is an example we wish to predict label for """
    num_test = X.shape[0]
    # lets make sure that the output type matches the input type
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    # loop over all test rows
    for i in xrange(num_test):
      # find the nearest training image to the i'th test image
      # using the L1 distance (sum of absolute value differences)
      distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
      min_index = np.argmin(distances) # get the index with smallest distance
      Ypred[i] = self.ytr[min_index] # predict the label of the nearest example

    return Ypred

Results

L1-distance 38.6% on CIFAR-10

L2-distance 35.4% on CIFAR-10

L1 vs. L2: L2 is less tolerant of large differences between two vectors than L1 (it penalizes them much more heavily). The corresponding code change is sketched below.
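
For comparison, the only change needed inside predict for the L2 distance is the distance computation itself. A small sketch (l2_distances is a hypothetical helper name that mirrors the distance line in predict):

import numpy as np

def l2_distances(Xtr, x):
    """ L2 (Euclidean) distances between one test row x and every training row. """
    # the square root is monotonic, so dropping it does not change the argmin
    return np.sqrt(np.sum(np.square(Xtr - x), axis=1))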

k-Nearest Neighbor Classifier

[Figure: kNN]

Use a validation set to tune the hyperparameters

Example code

# assume we have Xtr_rows, Ytr, Xte_rows, Yte as before
# recall Xtr_rows is 50,000 x 3072 matrix
Xval_rows = Xtr_rows[:1000, :] # take first 1000 for validation
Yval = Ytr[:1000]
Xtr_rows = Xtr_rows[1000:, :] # keep last 49,000 for train
Ytr = Ytr[1000:]

# find hyperparameters that work best on the validation set
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:

  # use a particular value of k and evaluation on validation data
  nn = NearestNeighbor()
  nn.train(Xtr_rows, Ytr)
  # here we assume a modified NearestNeighbor class that can take a k as input
  Yval_predict = nn.predict(Xval_rows, k = k)
  acc = np.mean(Yval_predict == Yval)
  print 'accuracy: %f' % (acc,)

  # keep track of what works on the validation set
  validation_accuracies.append((k, acc))

Cross-validation

[Figure: cv]

[Figure: cv_result]
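
The figures presumably show 5-fold cross-validation and the resulting accuracy-versus-k plot. A minimal sketch of the idea, assuming the same modified NearestNeighbor that accepts k (this is my illustration, not code from the lecture note):

num_folds = 5
X_folds = np.array_split(Xtr_rows, num_folds)
y_folds = np.array_split(Ytr, num_folds)

for k in [1, 3, 5, 10, 20, 50, 100]:
    accuracies = []
    for fold in range(num_folds):
        # hold one fold out for validation, train on the rest
        X_val, y_val = X_folds[fold], y_folds[fold]
        X_train = np.concatenate(X_folds[:fold] + X_folds[fold + 1:])
        y_train = np.concatenate(y_folds[:fold] + y_folds[fold + 1:])
        nn = NearestNeighbor()
        nn.train(X_train, y_train)
        accuracies.append(np.mean(nn.predict(X_val, k=k) == y_val))
    print('k = %d, mean accuracy = %f' % (k, np.mean(accuracies)))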

Pros and cons of the nearest neighbor classifier

Pros

Training is fast.

Cons

  1. Prediction at test time is very slow.

    Possible remedies: 1. approximate nearest neighbor (ANN) methods 2. FLANN

  2. Pixel-wise distance metrics are a poor measure of perceptual similarity.

    [Figure: kNN_shortage_1]

[Figure: kNN_shortage_2]

(http://cs231n.github.io/assets/pixels_embed_cifar10_big.jpg)

Further reading…

t-SNE http://lvdmaaten.github.io/tsne/

random projection http://scikit-learn.org/stable/modules/random_projection.html

INTUITION FAILS IN HIGH DIMENSIONS http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

Recognizing and Learning Object Categories http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html

Linear Classification

A mapping function from an image to label scores

$$f(x_i, W, b) = W x_i + b$$

x shape is [D x 1], W shape is [K x D], b shape is [K x 1]

Notes:

  1. W stacks the parameters of K separate classifiers (one per row), so the whole model is really K classifiers evaluated at once
  2. Vectorizing the computation gives a large speedup (a small sketch follows below)
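
A tiny sketch of the score computation with these shapes (the numbers are random and only illustrate the dimensions):

import numpy as np

D, K = 3072, 10                   # input dimension, number of classes
x = np.random.randn(D, 1)         # one flattened image, [D x 1]
W = np.random.randn(K, D) * 0.01  # weights, [K x D]
b = np.zeros((K, 1))              # biases, [K x 1]
scores = W.dot(x) + b             # class scores, [K x 1]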

Interpreting the linear classifier

  1. The weights W encode how much each label cares about each position and color in the image. For example, a "sun" classifier might put large weights on round regions and on yellow pixels.

[Figure: linear_classification_interpret_1]

  2. An image can be viewed as a single point in a high-dimensional space

[Figure: linear_clasification_interpret_2]

  3. The linear classifier can be viewed as template matching

Each row of W can be regarded as a template. Through the inner product, the column vector formed by each image is compared against every template and the best match is chosen, so this is also a kind of nearest neighbor algorithm, except that each class has a single learned template.

[Figure: templates]

As the figure above shows, each template ends up being a compromise over all the images of its class.

Folding the bias term into W

[Figure: wb]
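
In code, the bias trick presumably boils down to appending a constant 1 to every input and attaching b as an extra column of W, so a single matrix multiply does everything. Reusing x, W, b from the small sketch above:

x_ext = np.vstack([x, [[1.0]]])  # [D+1 x 1]: append a constant 1
W_ext = np.hstack([W, b])        # [K x D+1]: append b as the last column
scores = W_ext.dot(x_ext)        # same class scores as W.dot(x) + b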

Data preprocessing

Mean-centering the data
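
A minimal sketch of mean-centering, assuming the Xtr_rows / Xte_rows arrays from the CIFAR-10 loading code earlier in this note:

mean_image = np.mean(Xtr_rows, axis=0)  # per-pixel mean over the training set
Xtr_rows = Xtr_rows - mean_image        # center the training data
Xte_rows = Xte_rows - mean_image        # apply the same shift to the test data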


Qiniu cloud images batch upload and directory synchronization

Posted on 2017-02-24

Recently I have needed images for my blog posts, so I use Qiniu Cloud's external image links. Its content manager cannot create directories, however, which makes naming and uploading images quite a hassle. After searching around and reading the official documentation, I found that the official batch-upload tool works well, so I am recording it here and pasting the relevant parts of the official docs below for future reference.

[Figure: docu]


Get POIs using the God-Map (AMap) APIs

Posted on 2017-02-21

The pseudocode is as follows:

read the data from the Excel file
for each house:
    extract its location field (longitude and latitude)
    pass the location field as a parameter to the map API
    filter the returned values and store the result back into the original data set

The implementation is as follows:

import os
import xlrd
import pickle
import requests


def ReadHousesInfoFromExcel(
        file_name='houses_nadrop.xls', sheet_name='小区信息'):
    """ Read the houses' detail information from the excel-type file.
    Arguments:
        file_name: the name of the excel-type file.
        sheet_name: the name of the sheet of the excel file.
    Returns:
        houses: a list of dicts containing the detail information of each house.
    """
    HOUSES_FILE_NAME = 'houses.pkl'
    HOUSES_DETAIL_TAB = ['name', 'address', 'property_category', 'area',
                         'avg_price', 'location', 'property_costs',
                         'volume_rate', 'green_rate']
    houses = []
    if os.path.isfile(HOUSES_FILE_NAME):
        with open(HOUSES_FILE_NAME, 'rb') as f:
            houses = pickle.load(f)
    else:
        workBook = xlrd.open_workbook(file_name)
        bookSheet = workBook.sheet_by_name(sheet_name)
        # read from the second row because the first row holds the column labels
        for row in range(1, bookSheet.nrows):
            house = {}
            for col in range(bookSheet.ncols):
                cel = bookSheet.cell(row, col)
                try:
                    val = cel.value
                except:
                    pass
                val = str(val)
                house[HOUSES_DETAIL_TAB[col]] = val
            houses.append(house)
        with open(HOUSES_FILE_NAME, 'wb') as f:
            pickle.dump(houses, f)
    return houses


def Geocode(location, poi_type):
    """ A tool that calls the God-Map api.
    Arguments:
        location: the location of the house.
        poi_type: the poi type.
    Returns:
        answer: the JSON-type data that contains the pois information.
    """
    location = str(location).strip()
    parameters = {'location': location,
                  'key': 'e798a5bfb344a09977b79552ae415974',
                  'types': poi_type,
                  'offset': 10,
                  'page': 1,
                  'extensions': 'base'}
    base = 'http://restapi.amap.com/v3/place/around'
    try:
        response = requests.get(base, parameters)
        answer = response.json()
    except Exception as e:
        print('error!', e)
        answer = 'null'
    finally:
        pass
    return answer


def GetPOI(houses):
    """ Get the pois information of the houses according to the location.
    Arguments:
        houses: the house detail information.
    Returns:
        houses_with_pois: the house detail information
            that contains the pois information.
    """
    POI_TYPE_LAB = ['subway_station', 'bus_station', 'parking_lot',
                    'primary_school', 'secondary_school', 'university',
                    'mall', 'park']
    POI_TYPE_CODE = ['150500', '150700', '150904', '141203', '141202',
                     '141201', '060100', '110101']
    KEEP_INFO_LAB = ['name', 'location', 'distance']
    NO_INFO_NOW = '-'
    SIZE = len(houses)
    houses_with_pois = houses.copy()
    count = 0
    for house in houses_with_pois:
        count = count + 1
        if count % 100 == 0:
            print(count, '', SIZE)
        house['pois'] = {}
        for poi_type_index in range(len(POI_TYPE_LAB)):
            poi_info_json = Geocode(house['location'],
                                    POI_TYPE_CODE[poi_type_index])
            if poi_info_json == 'null' or poi_info_json['pois'] is None:
                house['pois'][POI_TYPE_LAB[poi_type_index]] = NO_INFO_NOW
            else:
                house['pois'][POI_TYPE_LAB[poi_type_index]] = []
                for poi in poi_info_json['pois']:
                    pois_without_useless = {}
                    for key in poi.keys():
                        if key in KEEP_INFO_LAB:
                            pois_without_useless[key] = poi[key]
                    house['pois'][POI_TYPE_LAB[poi_type_index]].append(
                        pois_without_useless)
    # return houses_with_pois
    return houses_with_pois


if __name__ == '__main__':
    houses = ReadHousesInfoFromExcel()
    # answer = Geocode(houses[0]['location'], '150905')
    houses_with_pois = GetPOI(houses)

To summarize, a few points worth noting:

  1. The location parameter passed in parameters must be well-formed, with no leading or trailing whitespace
  2. You must not change the size of a dictionary inside a for loop that iterates over it; "size" here means not just the number of elements but also the overall space the dict occupies (a small example follows this summary)
  3. Pay attention to how pickle is used
  4. Values read from Excel should be converted to str

The whole process is quite straightforward; it is the details that deserve attention.
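
A small illustration of point 2 (my own example, not from the project code): adding a key while iterating over a dict raises an error.

d = {'a': 1}
for key in d:
    d['b'] = 2  # RuntimeError: dictionary changed size during iteration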


Python Data Analysis - Learning Notes - Ch02

Posted on 2017-02-19

Parsing the data with Python's built-in json module and converting it into dictionaries

The data looks like this:

'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko)
Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr":
"MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov",
"r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918,
"cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Core code:

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
records[0]
-----------------------------------
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
'al': 'en-US,en;q=0.8',
'c': 'US',
'cy': 'Danvers',
'g': 'A6qOVH',
'gr': 'MA',
'h': 'wfLQtf',
'hc': 1331822918,
'hh': '1.usa.gov',
'l': 'orofrog',
'll': [42.576698, -70.954903],
'nk': 1,
'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
't': 1331923247,
'tz': 'America/New_York',
'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

Counting the time zone field (pure Python vs. pandas)

First, extract the time zone field from each record and collect them in a list

time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]
-----------------------------------
['America/New_York',
'America/Denver',
'America/New_York',
'America/Sao_Paulo',
'America/New_York',
'America/New_York',
'Europe/Warsaw',
'',
'',
'']

Counting with pure Python

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

The following approach is more concise

from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int) # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

If we want the top ten time zones and their counts

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
top_counts(counts)
--------------------------------------
[(33, 'America/Sao_Paulo'),
(35, 'Europe/Madrid'),
(36, 'Pacific/Honolulu'),
(37, 'Asia/Tokyo'),
(74, 'Europe/London'),
(191, 'America/Denver'),
(382, 'America/Los_Angeles'),
(400, 'America/Chicago'),
(521, ''),
(1251, 'America/New_York')]

Alternatively, we can use Python's standard library

from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
--------------------------------
[('America/New_York', 1251),
('', 521),
('America/Chicago', 400),
('America/Los_Angeles', 382),
('America/Denver', 191),
('Europe/London', 74),
('Asia/Tokyo', 37),
('Pacific/Honolulu', 36),
('Europe/Madrid', 35),
('America/Sao_Paulo', 33)]

Doing the same task with pandas

The main data structure in pandas is the DataFrame, which represents data as a table

from pandas import DataFrame, Series
import pandas as pd
frame = DataFrame(records)
frame

[Figure: dataframe_data_repr]

frame['tz'][:10]
-------------------------------
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz, dtype: object

Counting

tz_counts = frame['tz'].value_counts()
tz_counts[:10]
--------------------------------------------------
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Name: tz, dtype: int64

Filling in missing and unknown values

clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
----------------------------------------------
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
Name: tz, dtype: int64

Let's plot it

plt.figure(figsize=(10, 4))
tz_counts[:10].plot(kind='barh', rot=0)

[Figure: majority_tz]

Next, let's do some work on the browser (user agent) information

A Series represents a single column of a DataFrame

results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
---------------------------------------------------
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0
dtype: object

We can count these as well

results.value_counts()[:8]
-----------------------------------------
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
dtype: int64

Group the time zones by Windows vs. non-Windows users

cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe['a'].str.contains('Windows'),
'Windows', 'Not Windows')
operating_system[:5]
-----------------------------------------------------------------
array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'],
dtype='<U11')
by_tz_os = cframe.groupby(['tz', operating_system])

Let's see what by_tz_os looks like

by_tz_os.size()

[Figure: pandas_group_by_data_pic]

Now let's look at the neat effect of unstack()

[Figure: pandas_group_by_data_unstack]
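
The agg_counts table used in the next snippet is not defined above; presumably it is built from the grouped sizes like this (an assumption based on the surrounding code):

agg_counts = by_tz_os.size().unstack().fillna(0)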

Sort it and see where things rank

# Use to sort in ascending order
indexer = agg_counts.sum(1).argsort()
indexer[:10]
------------------------------------------------
tz
24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55
dtype: int64

Take the top ten and have a look

count_subset = agg_counts.take(indexer)[-10:]
count_subset

[Figure: pandas_group_by_data_sort_top10]

Plot this as well

count_subset.plot(kind='barh', stacked=True)

[Figure: pandas_group_by_data_sort_top10_pic]

Let's see what proportion each of the two categories accounts for

normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)

[Figure: pandas_group_by_data_sort_top10_percent]

Joining the movie rating tables

import pandas as pd
import os
encoding = 'latin1'
upath = os.path.expanduser('ch02/movielens/users.dat')
rpath = os.path.expanduser('ch02/movielens/ratings.dat')
mpath = os.path.expanduser('ch02/movielens/movies.dat')
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']
users = pd.read_csv(upath, sep='::', header=None, names=unames, encoding=encoding)
ratings = pd.read_csv(rpath, sep='::', header=None, names=rnames, encoding=encoding)
movies = pd.read_csv(mpath, sep='::', header=None, names=mnames, encoding=encoding)

Let's see what the data looks like

users[:5]

[Figure: pandas_ch02_users]

ratings[:5]

[Figure: pandas_ch02_ratings]

movies[:5]

[Figure: pandas_ch02_movies]

Multi-table join

data = pd.merge(pd.merge(ratings, users), movies)
data

[Figure: pandas_ch02_multi_table_joint]

data.ix[0]
--------------------------------------------
user_id 1
movie_id 1193
rating 5
timestamp 978300760
gender F
age 1
occupation 10
zip 48067
title One Flew Over the Cuckoo's Nest (1975)
genres Drama
Name: 0, dtype: object

Compute each movie's average rating by gender

mean_ratings = data.pivot_table('rating', index='title',
columns='gender', aggfunc='mean')
mean_ratings[:5]

[Figure: pandas_ch02_movie_avg_score_by_gender]

Filter out movies with fewer than 250 ratings

ratings_by_title = data.groupby('title').size()
ratings_by_title[:5]
-----------------------------------------
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles[:10]
-----------------------------------------
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
'101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
'13th Warrior, The (1999)', '2 Days in the Valley (1996)',
'20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
'2010 (1984)'],
dtype='object', name='title')

Indexing with .ix here is a label-based selection: it restricts mean_ratings to the titles that appear in active_titles (effectively an intersection with the index).

mean_ratings = mean_ratings.ix[active_titles]
mean_ratings
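
As a side note, .ix has since been deprecated in pandas; if I am not mistaken, the equivalent label-based selection today would be:

mean_ratings = mean_ratings.loc[active_titles]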

[Figure: pandas_ch02_movie_avg_score_by_gender_ratings_more_than_250]

Sort in descending order by the ratings given by female viewers

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]

[Figure: pandas_ch02_movie_female_favor_top_10]

US Baby Names 1880-2010

import pandas as pd
names1880 = pd.read_csv('ch02/names/yob1880.txt', names=['name', 'sex', 'births'])
names1880

[Figure: pandas_ch02_us_baby_name]

Merge the data for all the years

# 2010 is the last available year right now
years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = 'ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

Do an aggregation

total_births = names.pivot_table('births', index='year',
columns='sex', aggfunc=sum)
total_births.tail()

[Figure: pandas_ch02_us_baby_name_addition]

Compute each name's proportion of the births

def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
names

[Figure: pandas_ch02_us_baby_name_prop]

Run a quick sanity check

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
--------------------------------------------
True

Select the top 1000 names for every year/sex combination

def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)

Give it a flat index, with help from numpy

top1000.index = np.arange(len(top1000))

Analyzing naming trends

Split the data into boys and girls

boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

Compute the total number of births for each name in each year

total_births = top1000.pivot_table('births', index='year', columns='name',
aggfunc=sum)
total_births

[Figure: pandas_ch02_us_baby_name_counts_per_year]

Pick a few names and see how their totals change over the years

subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False,
title="Number of births per year")

[Figure: pandas_ch02_us_baby_name_trend]

Measuring the increase in naming diversity

Measure the change in diversity by the proportion of births accounted for by the top 1000 names

table = top1000.pivot_table('prop', index='year',
columns='sex', aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex',
yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

[Figure: pandas_ch02_us_baby_name_diversity]

Another approach: count how many names it takes to cover 50% of all births

That is, accumulate the proportions from the top and see at which name the cumulative share reaches 50%

First, let's look at boys in 2010

df = boys[boys.year == 2010]
prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:10]
--------------------------------------------------
260877 0.011523
260878 0.020934
260879 0.029959
260880 0.038930
260881 0.047817
260882 0.056579
260883 0.065155
260884 0.073414
260885 0.081528
260886 0.089621
Name: prop, dtype: float64

It looks like the 50% mark is reached at position 116; since indexing starts at 0, that means 117 names.

prop_cumsum.values.searchsorted(0.5)
---------------------------------------------------
116

Now look at boys in 1900

df = boys[boys.year == 1900]
in1900 = df.sort_index(by='prop', ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5) + 1
---------------------------------------------------
25

So this approach works

Apply the same operation to the entire data set

def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
diversity.head()

[Figure: pandas_ch02_us_baby_name_number_in_half_percent]

diversity.plot(title="Number of popular names in top 50%")

[Figure: pandas_ch02_us_baby_name_diversity_2]

The “Last letter” Revolution

Extract the last letter of each name, keeping the index aligned

# extract last letter from name column
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', index=last_letters,
columns=['sex', 'year'], aggfunc=sum)

Pull out three individual years to look at

subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
subtable.head()

[Figure: pandas_ch02_us_baby_name_last_letter]

Compute the letter proportions

subtable.sum()
-------------------------------------
sex year
F 1910 396416.0
1960 2022062.0
2010 1759010.0
M 1910 194198.0
1960 2132588.0
2010 1898382.0
dtype: float64
letter_prop = subtable / subtable.sum().astype(float)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
legend=False)

[Figure: pandas_ch02_us_baby_name_last_letter_prop]

Finally, look at all the years and produce a trend plot

letter_prop = table / table.sum().astype(float)
dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T
dny_ts.plot()

[Figure: pandas_ch02_us_baby_name_last_letter_prop_trend]

Boy names that became girl names (and vice versa)

Take names beginning with 'lesl' as an example

all_names = top1000.name.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesley_like = all_names[mask]
lesley_like
----------------------------------------------
array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

Filter them out of the original data set

filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
----------------------------------------------
name
Leslee 1082
Lesley 35022
Lesli 929
Leslie 370429
Lesly 10067
Name: births, dtype: int64

Aggregate and compute the proportions

table = filtered.pivot_table('births', index='year',
columns='sex', aggfunc='sum')
table = table.div(table.sum(1), axis=0)
table.tail()

[Figure: pandas_ch02_us_baby_name_b2g_prop]

Look at the trend

table.plot(style={'M': 'k-', 'F': 'k--'})

[Figure: pandas_ch02_us_baby_name_b2g_prop_trend]
