
Python for Data Analysis - Learning Notes - Ch02

Use Python's built-in json module to parse the data and convert each record into a dict.

The data looks like this:

'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Core code:

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
records[0]
-----------------------------------
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
'al': 'en-US,en;q=0.8',
'c': 'US',
'cy': 'Danvers',
'g': 'A6qOVH',
'gr': 'MA',
'h': 'wfLQtf',
'hc': 1331822918,
'hh': '1.usa.gov',
'l': 'orofrog',
'll': [42.576698, -70.954903],
'nk': 1,
'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
't': 1331923247,
'tz': 'America/New_York',
'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
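
The list comprehension above leaves the file handle open and assumes every line is valid JSON. A minimal, more defensive sketch (same path as above; assumes the file may contain the occasional malformed line):

import json

path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = []
with open(path) as f:  # the context manager closes the file for us
    for line in f:
        try:
            records.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            pass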

Counting the time-zone field (pure Python vs. pandas)

First, extract the time-zone field from the records and put it in a list:

time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]
-----------------------------------
['America/New_York',
'America/Denver',
'America/New_York',
'America/Sao_Paulo',
'America/New_York',
'America/New_York',
'Europe/Warsaw',
'',
'',
'']

Counting with pure Python:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

The following approach is more concise:

from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

If we need the top ten time zones and their counts:

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

counts = get_counts(time_zones)
top_counts(counts)
--------------------------------------
[(33, 'America/Sao_Paulo'),
(35, 'Europe/Madrid'),
(36, 'Pacific/Honolulu'),
(37, 'Asia/Tokyo'),
(74, 'Europe/London'),
(191, 'America/Denver'),
(382, 'America/Los_Angeles'),
(400, 'America/Chicago'),
(521, ''),
(1251, 'America/New_York')]

Alternatively, use the Python standard library:

from collections import Counter

counts = Counter(time_zones)
counts.most_common(10)
--------------------------------
[('America/New_York', 1251),
('', 521),
('America/Chicago', 400),
('America/Los_Angeles', 382),
('America/Denver', 191),
('Europe/London', 74),
('Asia/Tokyo', 37),
('Pacific/Honolulu', 36),
('Europe/Madrid', 35),
('America/Sao_Paulo', 33)]
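
As a side note, the standard-library heapq module can pull the top n entries without sorting everything; a sketch, assuming the counts mapping built above:

import heapq

# ten largest (count, tz) pairs, equivalent to top_counts(counts)
heapq.nlargest(10, ((count, tz) for tz, count in counts.items()))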

Doing the same task with pandas

The main data structure in pandas is the DataFrame, which represents data as a table.

from pandas import DataFrame, Series
import pandas as pd
frame = DataFrame(records)
frame

[figure: dataframe_data_repr]

frame['tz'][:10]
-------------------------------
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz, dtype: object

Count the values:

tz_counts = frame['tz'].value_counts()
tz_counts[:10]
--------------------------------------------------
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Name: tz, dtype: int64

Filling in missing and unknown values

clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
----------------------------------------------
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
Name: tz, dtype: int64
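
The boolean-indexed assignment above can trigger chained-assignment warnings in recent pandas; an equivalent one-liner sketch using replace:

# same cleanup without item assignment
clean_tz = frame['tz'].fillna('Missing').replace('', 'Unknown')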

Let's plot it:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
tz_counts[:10].plot(kind='barh', rot=0)

[figure: majority_tz]
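
As an aside, the same chart can be drawn with seaborn (the book's second edition does this); a sketch, assuming seaborn is installed:

import seaborn as sns

subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)  # horizontal bars, largest on top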

Next, let's do some work with the users' browser (user-agent) information.

A Series is essentially one column of a DataFrame.

results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
---------------------------------------------------
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0
dtype: object
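
The same first-token extraction can be done without leaving pandas, via the vectorized .str accessor; a sketch, assuming frame from above:

# split each user-agent string on whitespace and keep the first token
results = frame.a.dropna().str.split().str[0]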

We can count these the same way:

results.value_counts()[:8]
-----------------------------------------
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
dtype: int64

Group the time zones by Windows vs. non-Windows users

cframe = frame[frame.a.notnull()]
import numpy as np

operating_system = np.where(cframe['a'].str.contains('Windows'),
                            'Windows', 'Not Windows')
operating_system[:5]
-----------------------------------------------------------------
array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'],
dtype='<U11')
by_tz_os = cframe.groupby(['tz', operating_system])

Let's see what by_tz_os looks like:

by_tz_os.size()

[figure: pandas_group_by_data_pic]

Now look at the effect of unstack(), which pivots the inner level of the group index into columns (the book stores the result as agg_counts, which the steps below use):

agg_counts = by_tz_os.size().unstack().fillna(0)

[figure: pandas_group_by_data_unstack]

Sort it to see the ranking:

# argsort returns the positions that would sort the row totals ascending
indexer = agg_counts.sum(1).argsort()
indexer[:10]
------------------------------------------------
tz
24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55
dtype: int64

Take the top ten:

count_subset = agg_counts.take(indexer)[-10:]
count_subset

[figure: pandas_group_by_data_sort_top10]
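
In newer pandas the argsort/take dance can be replaced with nlargest; a sketch, assuming agg_counts from above (note nlargest returns rows in descending order, so the bars will be flipped relative to the book's figure):

# labels of the 10 largest row totals, then select those rows
top_rows = agg_counts.sum(1).nlargest(10).index
count_subset = agg_counts.loc[top_rows]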

Plot this as well:

count_subset.plot(kind='barh', stacked=True)

[figure: pandas_group_by_data_sort_top10_pic]

Normalize to see the relative proportion of the two groups:

normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)

[figure: pandas_group_by_data_sort_top10_percent]

Joining the movie-ratings tables

import pandas as pd
import os

encoding = 'latin1'
upath = os.path.expanduser('ch02/movielens/users.dat')
rpath = os.path.expanduser('ch02/movielens/ratings.dat')
mpath = os.path.expanduser('ch02/movielens/movies.dat')
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']
# engine='python' because the default C parser does not support the
# multi-character '::' separator
users = pd.read_csv(upath, sep='::', header=None, names=unames,
                    encoding=encoding, engine='python')
ratings = pd.read_csv(rpath, sep='::', header=None, names=rnames,
                      encoding=encoding, engine='python')
movies = pd.read_csv(mpath, sep='::', header=None, names=mnames,
                     encoding=encoding, engine='python')

Let's see what the data looks like:

users[:5]

[figure: pandas_ch02_users]

ratings[:5]

[figure: pandas_ch02_ratings]

movies[:5]

[figure: pandas_ch02_movies]

Multi-table join:

data = pd.merge(pd.merge(ratings, users), movies)
data

[figure: pandas_ch02_multi_table_joint]

data.iloc[0]  # .ix is removed in modern pandas; .iloc selects by position
--------------------------------------------
user_id 1
movie_id 1193
rating 5
timestamp 978300760
gender F
age 1
occupation 10
zip 48067
title One Flew Over the Cuckoo's Nest (1975)
genres Drama
Name: 0, dtype: object

Compute each movie's mean rating by gender:

mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
mean_ratings[:5]

[figure: pandas_ch02_movie_avg_score_by_gender]
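
pivot_table here is shorthand for a groupby followed by unstack; an equivalent sketch:

# mean rating per (title, gender), with genders pivoted into columns
mean_ratings_alt = data.groupby(['title', 'gender'])['rating'].mean().unstack()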

Filter out movies with fewer than 250 ratings:

ratings_by_title = data.groupby('title').size()
ratings_by_title[:5]
-----------------------------------------
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles[:10]
-----------------------------------------
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
'101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
'13th Warrior, The (1999)', '2 Days in the Valley (1996)',
'20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
'2010 (1984)'],
dtype='object', name='title')

Here .ix selects rows by label, restricting mean_ratings to the active titles (an index lookup rather than a set intersection). Since .ix is deprecated, use .loc instead:

mean_ratings = mean_ratings.loc[active_titles]
mean_ratings

[figure: pandas_ch02_movie_avg_score_by_gender_ratings_more_than_250]

Sort in descending order by the ratings from female viewers:

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]

[figure: pandas_ch02_movie_female_favor_top_10]

US Baby Names 1880-2010

import pandas as pd
names1880 = pd.read_csv('ch02/names/yob1880.txt', names=['name', 'sex', 'births'])
names1880

[figure: pandas_ch02_us_baby_name]

Combine the data for all years:

# 2010 is the last available year right now
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
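
If the yobYYYY.txt files are the only ones in the directory, the loop can also be driven by the filesystem rather than a hard-coded year range; a sketch using the standard-library glob (the year is recovered from the file name, so this assumes the naming scheme above):

import glob
import pandas as pd

pieces = []
for path in sorted(glob.glob('ch02/names/yob*.txt')):
    frame = pd.read_csv(path, names=['name', 'sex', 'births'])
    frame['year'] = int(path[-8:-4])  # 'yob1880.txt' -> 1880
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)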

Aggregate the data:

total_births = names.pivot_table('births', index='year',
                                 columns='sex', aggfunc=sum)

total_births.tail()

[figure: pandas_ch02_us_baby_name_addition]

Compute each name's share of births:

def add_prop(group):
    # cast to float so the division is true division
    # (guards against integer floor division under Python 2)
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)

names

[figure: pandas_ch02_us_baby_name_prop]
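
The groupby-apply above copies every group through Python; a faster vectorized sketch that produces the same prop column, using transform:

# divide each birth count by its (year, sex) group total
names['prop'] = names.births / names.groupby(['year', 'sex']).births.transform('sum')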

Run a sanity check (the prop values within each group should sum to 1):

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
--------------------------------------------
True

Select the top 1,000 names for each year/sex pair:

def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
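
An equivalent sketch that skips the helper function, using DataFrame.nlargest inside the groupby (tie-breaking may differ slightly, and with group_keys=False the result already has a flat row index):

# top 1,000 rows by births within each (year, sex) group
top1000 = names.groupby(['year', 'sex'], group_keys=False).apply(
    lambda g: g.nlargest(1000, 'births'))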

Replace the hierarchical group index with a plain integer index (via numpy):

top1000.index = np.arange(len(top1000))

Split the data into boys and girls:

boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

Compute the total births per name per year:

total_births = top1000.pivot_table('births', index='year', columns='name',
                                   aggfunc=sum)
total_births

[figure: pandas_ch02_us_baby_name_counts_per_year]

Pick a few names and see how their totals change over the years:

subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False,
            title="Number of births per year")

[figure: pandas_ch02_us_baby_name_trend]

Measuring the increase in naming diversity

Measure the change in diversity by the total proportion of births covered by the top 1,000 names:

table = top1000.pivot_table('prop', index='year',
                            columns='sex', aggfunc=sum)
table.plot(title='Sum of top1000.prop by year and sex',
           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

[figure: pandas_ch02_us_baby_name_diversity]

Another approach: count how many names it takes to account for 50% of births.

That is, accumulate the proportions from the largest name down and see at which name the cumulative sum reaches 50%.

First, look at boys in 2010:

df = boys[boys.year == 2010]
# sort_index(by=...) is deprecated; sort_values is the modern equivalent
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:10]
--------------------------------------------------
260877 0.011523
260878 0.020934
260879 0.029959
260880 0.038930
260881 0.047817
260882 0.056579
260883 0.065155
260884 0.073414
260885 0.081528
260886 0.089621
Name: prop, dtype: float64

The answer comes back 116, but positions are 0-based, so it is actually the 117th name.

prop_cumsum.values.searchsorted(0.5)
---------------------------------------------------
116

Now look at boys in 1900:

df = boys[boys.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5) + 1
---------------------------------------------------
25

So this approach works.

Apply the same operation to the whole dataset:

def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
diversity.head()

[figure: pandas_ch02_us_baby_name_number_in_half_percent]

diversity.plot(title="Number of popular names in top 50%")

[figure: pandas_ch02_us_baby_name_diversity_2]

The “Last letter” Revolution

Extract the last letter of each name, keeping the index aligned:

# extract last letter from name column
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', index=last_letters,
                          columns=['sex', 'year'], aggfunc=sum)

Pull out three representative years:

subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
subtable.head()

[figure: pandas_ch02_us_baby_name_last_letter]

Compute the letter proportions:

subtable.sum()
-------------------------------------
sex year
F 1910 396416.0
1960 2022062.0
2010 1759010.0
M 1910 194198.0
1960 2132588.0
2010 1898382.0
dtype: float64
letter_prop = subtable / subtable.sum().astype(float)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
                      legend=False)

[figure: pandas_ch02_us_baby_name_last_letter_prop]

Finally, look at all years and draw a trend chart:

letter_prop = table / table.sum().astype(float)
# .ix is deprecated; use .loc for label-based selection
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T

dny_ts.plot()

[figure: pandas_ch02_us_baby_name_last_letter_prop_trend]

Boy names that became girl names (and vice versa)

Take names containing 'lesl' as an example:

all_names = top1000.name.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesley_like = all_names[mask]
lesley_like
----------------------------------------------
array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

Filter them out of the dataset:

filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
----------------------------------------------
name
Leslee 1082
Lesley 35022
Lesli 929
Leslie 370429
Lesly 10067
Name: births, dtype: int64

Aggregate and compute the proportions:

table = filtered.pivot_table('births', index='year',
                             columns='sex', aggfunc='sum')
table = table.div(table.sum(1), axis=0)
table.tail()

[figure: pandas_ch02_us_baby_name_b2g_prop]
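
A quick sanity check on the normalization: after dividing by the row sums, each year's M and F proportions should add up to 1.

# every row of the normalized table should sum to 1.0
np.allclose(table.sum(1), 1)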

Look at the trend:

table.plot(style={'M': 'k-', 'F': 'k--'})

[figure: pandas_ch02_us_baby_name_b2g_prop_trend]