14.1 来自Bitly的USA.gov数据

    以每小时快照为例,文件中各行的格式为JSON(即JavaScript Object Notation,这是一种常用的Web数据格式)。例如,如果我们只读取某个文件中的第一行,那么所看到的结果应该是下面这样:

    Python有内置或第三方模块可以将JSON字符串转换成Python字典对象。这里,我将使用json模块及其loads函数逐行加载已经下载好的数据文件:

    1. path = 'datasets/bitly_usagov/example.txt'
    2. records = [json.loads(line) for line in open(path)]

    现在,records对象就成为一组Python字典了:

    1. In [18]: records[0]
    2. Out[18]:
    3. {'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko)
    4. Chrome/17.0.963.78 Safari/535.11',
    5. 'al': 'en-US,en;q=0.8',
    6. 'c': 'US',
    7. 'cy': 'Danvers',
    8. 'g': 'A6qOVH',
    9. 'gr': 'MA',
    10. 'h': 'wfLQtf',
    11. 'hc': 1331822918,
    12. 'hh': '1.usa.gov',
    13. 'l': 'orofrog',
    14. 'll': [42.576698, -70.954903],
    15. 'nk': 1,
    16. 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
    17. 't': 1331923247,
    18. 'tz': 'America/New_York',
    19. 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

    假设我们想要知道该数据集中最常出现的是哪个时区(即tz字段),得到答案的办法有很多。首先,我们用列表推导式取出一组时区:

    1. In [12]: time_zones = [rec['tz'] for rec in records]
    2. ---------------------------------------------------------------------------
    3. KeyError Traceback (most recent call last)
    4. <ipython-input-12-db4fbd348da9> in <module>()
    5. ----> 1 time_zones = [rec['tz'] for rec in records]
    6. <ipython-input-12-db4fbd348da9> in <listcomp>(.0)
    7. ----> 1 time_zones = [rec['tz'] for rec in records]
    8. KeyError: 'tz'

    晕!原来并不是所有记录都有时区字段。这个好办,只需在列表推导式末尾加上一个if ‘tz’in rec判断即可:

    1. In [13]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
    2. In [14]: time_zones[:10]
    3. Out[14]:
    4. ['America/New_York',
    5. 'America/Denver',
    6. 'America/New_York',
    7. 'America/Sao_Paulo',
    8. 'America/New_York',
    9. 'America/New_York',
    10. 'Europe/Warsaw',
    11. '',
    12. '',
    13. '']

    只看前10个时区,我们发现有些是未知的(即空的)。虽然可以将它们过滤掉,但现在暂时先留着。接下来,为了对时区进行计数,这里介绍两个办法:一个较难(只使用标准Python库),另一个较简单(使用pandas)。计数的办法之一是在遍历时区的过程中将计数值保存在字典中:

    1. def get_counts(sequence):
    2. counts = {}
    3. for x in sequence:
    4. if x in counts:
    5. counts[x] += 1
    6. else:
    7. counts[x] = 1
    8. return counts

    如果使用Python标准库的更高级工具,那么你可能会将代码写得更简洁一些:

    1. from collections import defaultdict
    2. def get_counts2(sequence):
    3. counts = defaultdict(int) # values will initialize to 0
    4. for x in sequence:
    5. counts[x] += 1
    6. return counts

    我将逻辑写到函数中是为了获得更高的复用性。要用它对时区进行处理,只需将time_zones传入即可:

    1. In [17]: counts = get_counts(time_zones)
    2. In [18]: counts['America/New_York']
    3. Out[18]: 1251
    4. In [19]: len(time_zones)
    5. Out[19]: 3440

    如果想要得到前10位的时区及其计数值,我们需要用到一些有关字典的处理技巧:

    1. def top_counts(count_dict, n=10):
    2. value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    3. value_key_pairs.sort()
    4. return value_key_pairs[-n:]

    如果你搜索Python的标准库,你能找到collections.Counter类,它可以使这项工作更简单:

    1. In [22]: from collections import Counter
    2. In [23]: counts = Counter(time_zones)
    3. In [24]: counts.most_common(10)
    4. Out[24]:
    5. [('America/New_York', 1251),
    6. ('', 521),
    7. ('America/Chicago', 400),
    8. ('America/Los_Angeles', 382),
    9. ('America/Denver', 191),
    10. ('Europe/London', 74),
    11. ('Asia/Tokyo', 37),
    12. ('Pacific/Honolulu', 36),
    13. ('Europe/Madrid', 35),
    14. ('America/Sao_Paulo', 33)]

    用pandas对时区进行计数

    从原始记录的集合创建DateFrame,与将记录列表传递到pandas.DataFrame一样简单:

    1. In [25]: import pandas as pd
    2. In [26]: frame = pd.DataFrame(records)
    3. In [27]: frame.info()
    4. <class 'pandas.core.frame.DataFrame'>
    5. RangeIndex: 3560 entries, 0 to 3559
    6. Data columns (total 18 columns):
    7. _heartbeat_ 120 non-null float64
    8. al 3094 non-null object
    9. cy 2919 non-null object
    10. g 3440 non-null object
    11. gr 2919 non-null object
    12. h 3440 non-null object
    13. hc 3440 non-null float64
    14. hh 3440 non-null object
    15. kw 93 non-null object
    16. l 3440 non-null object
    17. ll 2919 non-null object
    18. nk 3440 non-null float64
    19. r 3440 non-null object
    20. t 3440 non-null float64
    21. tz 3440 non-null object
    22. u 3440 non-null object
    23. dtypes: float64(4), object(14)
    24. memory usage: 500.7+ KB
    25. In [28]: frame['tz'][:10]
    26. Out[28]:
    27. 0 America/New_York
    28. 1 America/Denver
    29. 2 America/New_York
    30. 3 America/Sao_Paulo
    31. 4 America/New_York
    32. 5 America/New_York
    33. 6 Europe/Warsaw
    34. 7
    35. 8
    36. 9
    37. Name: tz, dtype: object

    这里frame的输出形式是摘要视图(summary view),主要用于较大的DataFrame对象。我们然后可以对Series使用value_counts方法:

    1. In [29]: tz_counts = frame['tz'].value_counts()
    2. In [30]: tz_counts[:10]
    3. Out[30]:
    4. America/New_York 1251
    5. 521
    6. America/Chicago 400
    7. America/Los_Angeles 382
    8. America/Denver 191
    9. Europe/London 74
    10. Asia/Tokyo 37
    11. Pacific/Honolulu 36
    12. Europe/Madrid 35
    13. America/Sao_Paulo 33
    14. Name: tz, dtype: int64

    我们可以用matplotlib可视化这个数据。为此,我们先给记录中未知或缺失的时区填上一个替代值。fillna函数可以替换缺失值(NA),而未知值(空字符串)则可以通过布尔型数组索引加以替换:

    1. In [31]: clean_tz = frame['tz'].fillna('Missing')
    2. In [32]: clean_tz[clean_tz == ''] = 'Unknown'
    3. In [33]: tz_counts = clean_tz.value_counts()
    4. In [34]: tz_counts[:10]
    5. Out[34]:
    6. America/New_York 1251
    7. Unknown 521
    8. America/Chicago 400
    9. America/Los_Angeles 382
    10. America/Denver 191
    11. Missing 120
    12. Europe/London 74
    13. Asia/Tokyo 37
    14. Pacific/Honolulu 36
    15. Europe/Madrid 35
    16. Name: tz, dtype: int64

    此时,我们可以用seaborn包创建水平柱状图(结果见图14-1):

    1. In [36]: import seaborn as sns
    2. In [37]: subset = tz_counts[:10]
    3. In [38]: sns.barplot(y=subset.index, x=subset.values)

    a字段含有执行URL短缩操作的浏览器、设备、应用程序的相关信息:

    1. In [39]: frame['a'][1]
    2. Out[39]: 'GoogleMaps/RochesterNY'
    3. In [40]: frame['a'][50]
    4. Out[40]: 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2)
    5. Gecko/20100101 Firefox/10.0.2'
    6. In [41]: frame['a'][51][:50] # long line
    7. Out[41]: 'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9'

    将这些”agent”字符串中的所有信息都解析出来是一件挺郁闷的工作。一种策略是将这种字符串的第一节(与浏览器大致对应)分离出来并得到另外一份用户行为摘要:

    1. In [42]: results = pd.Series([x.split()[0] for x in frame.a.dropna()])
    2. In [43]: results[:5]
    3. Out[43]:
    4. 0 Mozilla/5.0
    5. 1 GoogleMaps/RochesterNY
    6. 2 Mozilla/4.0
    7. 3 Mozilla/5.0
    8. 4 Mozilla/5.0
    9. dtype: object
    10. In [44]: results.value_counts()[:8]
    11. Out[44]:
    12. Mozilla/5.0 2594
    13. Mozilla/4.0 601
    14. GoogleMaps/RochesterNY 121
    15. Opera/9.80 34
    16. Mozilla/6.0 5
    17. BlackBerry8520/5.0.0.681 4
    18. dtype: int64

    现在,假设你想按Windows和非Windows用户对时区统计信息进行分解。为了简单起见,我们假定只要agent字符串中含有”Windows”就认为该用户为Windows用户。由于有的agent缺失,所以首先将它们从数据中移除:

    1. In [45]: cframe = frame[frame.a.notnull()]

    接下来就可以根据时区和新得到的操作系统列表对数据进行分组了:

    1. In [49]: by_tz_os = cframe.groupby(['tz', 'os'])

    分组计数,类似于value_counts函数,可以用size来计算。并利用unstack对计数结果进行重塑:

    1. In [50]: agg_counts = by_tz_os.size().unstack().fillna(0)
    2. In [51]: agg_counts[:10]
    3. Out[51]:
    4. os Not Windows Windows
    5. tz
    6. 245.0 276.0
    7. Africa/Cairo 0.0 3.0
    8. Africa/Casablanca 0.0 1.0
    9. Africa/Ceuta 0.0 2.0
    10. Africa/Johannesburg 0.0 1.0
    11. Africa/Lusaka 0.0 1.0
    12. America/Anchorage 4.0 1.0
    13. America/Argentina/Buenos_Aires 1.0 0.0
    14. America/Argentina/Cordoba 0.0 1.0
    15. America/Argentina/Mendoza 0.0 1.0

    最后,我们来选取最常出现的时区。为了达到这个目的,我根据agg_counts中的行数构造了一个间接索引数组:

    1. # Use to sort in ascending order
    2. In [52]: indexer = agg_counts.sum(1).argsort()
    3. In [53]: indexer[:10]
    4. Out[53]:
    5. tz
    6. 24
    7. Africa/Cairo 20
    8. Africa/Casablanca 21
    9. Africa/Ceuta 92
    10. Africa/Johannesburg 87
    11. Africa/Lusaka 53
    12. America/Anchorage 54
    13. America/Argentina/Buenos_Aires 57
    14. America/Argentina/Cordoba 26
    15. America/Argentina/Mendoza 55
    16. dtype: int64

    然后我通过take按照这个顺序截取了最后10行最大值:

    1. In [54]: count_subset = agg_counts.take(indexer[-10:])
    2. In [55]: count_subset
    3. Out[55]:
    4. os Not Windows Windows
    5. tz
    6. America/Sao_Paulo 13.0 20.0
    7. Europe/Madrid 16.0 19.0
    8. Pacific/Honolulu 0.0 36.0
    9. Asia/Tokyo 2.0 35.0
    10. Europe/London 43.0 31.0
    11. America/Denver 132.0 59.0
    12. America/Los_Angeles 130.0 252.0
    13. America/Chicago 115.0 285.0
    14. 245.0 276.0
    15. America/New_York 339.0 912.0

    pandas有一个简便方法nlargest,可以做同样的工作:

    1. In [56]: agg_counts.sum(1).nlargest(10)
    2. Out[56]:
    3. tz
    4. America/New_York 1251.0
    5. 521.0
    6. America/Chicago 400.0
    7. America/Los_Angeles 382.0
    8. America/Denver 191.0
    9. Europe/London 74.0
    10. Asia/Tokyo 37.0
    11. Pacific/Honolulu 36.0
    12. Europe/Madrid 35.0
    13. America/Sao_Paulo 33.0
    14. dtype: float64

    然后,如这段代码所示,可以用柱状图表示。我传递一个额外参数到seaborn的barpolt函数,来画一个堆积条形图(见图14-2):

    1. # Rearrange the data for plotting
    2. In [58]: count_subset = count_subset.stack()
    3. In [59]: count_subset.name = 'total'
    4. In [60]: count_subset = count_subset.reset_index()
    5. In [61]: count_subset[:10]
    6. Out[61]:
    7. tz os total
    8. 0 America/Sao_Paulo Not Windows 13.0
    9. 1 America/Sao_Paulo Windows 20.0
    10. 2 Europe/Madrid Not Windows 16.0
    11. 3 Europe/Madrid Windows 19.0
    12. 4 Pacific/Honolulu Not Windows 0.0
    13. 5 Pacific/Honolulu Windows 36.0
    14. 6 Asia/Tokyo Not Windows 2.0
    15. 7 Asia/Tokyo Windows 35.0
    16. 8 Europe/London Not Windows 43.0
    17. 9 Europe/London Windows 31.0
    18. In [62]: sns.barplot(x='total', y='tz', hue='os', data=count_subset)

    图14-2 最常出现时区的Windows和非Windows用户

    这张图不容易看出Windows用户在小分组中的相对比例,因此标准化分组百分比之和为1:

    1. def norm_total(group):
    2. group['normed_total'] = group.total / group.total.sum()
    3. return group
    4. results = count_subset.groupby('tz').apply(norm_total)

    再次画图,见图14-3:

    1. In [65]: sns.barplot(x='normed_total', y='tz', hue='os', data=results)

    我们还可以用groupby的transform方法,更高效的计算标准化的和: