18. Natural Language Processing

Natural language processing (NLP) is the approach of using computers to analyze text data.

Here is natural language processing on Wikipedia.

NLTK is the main Python module for text analysis.

The NLTK organization website is available online, and they have a whole book of tutorials as well.

NLTK

NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

In this notebook, we will walk through some basic text analysis, using some useful functionality from the NLTK package.

To work with text data, you typically need corpora - datasets of text - to compare against. NLTK has many such datasets available, but it does not install them by default (as the full collection would be very large). Below we will download some of these datasets.

    # Import nltk (needed for the downloads below)
    import nltk

    # If needed: return to this cell, uncomment the code below, and run it.
    # This works around SSL certificate errors that can block the NLTK downloads.
    import ssl
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context

    # Download some useful data files from NLTK
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    nltk.download('treebank')
    '''
    [nltk_data] Downloading package punkt to /Users/tom/nltk_data...
    [nltk_data]   Package punkt is already up-to-date!
    [nltk_data] Downloading package stopwords to /Users/tom/nltk_data...
    [nltk_data]   Package stopwords is already up-to-date!
    [nltk_data] Downloading package averaged_perceptron_tagger to
    [nltk_data]     /Users/tom/nltk_data...
    [nltk_data]   Package averaged_perceptron_tagger is already up-to-
    [nltk_data]     date!
    [nltk_data] Downloading package maxent_ne_chunker to
    [nltk_data]     /Users/tom/nltk_data...
    [nltk_data]   Package maxent_ne_chunker is already up-to-date!
    [nltk_data] Downloading package words to /Users/tom/nltk_data...
    [nltk_data]   Package words is already up-to-date!
    [nltk_data] Downloading package treebank to /Users/tom/nltk_data...
    [nltk_data]   Package treebank is already up-to-date!
    True
    '''

    # Set up a test sentence of data to explore
    sentence = "UC San Diego is a great place to study cognitive science."

Tokenization is the process of splitting text data into 'tokens', which are meaningful pieces of data.

More information on tokenization is available online.

Tokenization can be done at different levels - for example, you can tokenize text into sentences, and/or tokenize text into words (a sentence-level sketch follows the next code block).

    # Tokenize our sentence at the word level
    tokens = nltk.word_tokenize(sentence)

    # Check out the word-tokenized data
    print(tokens)
    # ['UC', 'San', 'Diego', 'is', 'a', 'great', 'place', 'to', 'study', 'cognitive', 'science', '.']
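
Tokenizing at the sentence level works the same way. Here is a minimal sketch, using nltk.sent_tokenize on a made-up two-sentence string (the text variable is just for illustration):

    # Tokenize a short text at the sentence level
    text = "NLTK is a Python library. It supports many common NLP tasks."
    print(nltk.sent_tokenize(text))
    # ['NLTK is a Python library.', 'It supports many common NLP tasks.']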

Part-of-speech (POS) tagging labels words with their grammatical categories. Here is part-of-speech tagging on Wikipedia.

    # Apply part-of-speech tagging to our sentence
    tags = nltk.pos_tag(tokens)

    # Check out the POS tags for our data
    print(tags)
    # [('UC', 'NNP'), ('San', 'NNP'), ('Diego', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('place', 'NN'), ('to', 'TO'), ('study', 'VB'), ('cognitive', 'JJ'), ('science', 'NN'), ('.', '.')]

    # Check out the documentation describing what all of the abbreviations mean
    nltk.help.upenn_tagset()
    '''
    $: dollar
        $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
    '': closing quotation mark
        ' ''
    (: opening parenthesis
        ( [ {
    ): closing parenthesis
        ) ] }
    ,: comma
        ,
    --: dash
        --
    .: sentence terminator
        . ! ?
    :: colon or ellipsis
        : ; ...
    CC: conjunction, coordinating
        & 'n and both but either et for less minus neither nor or plus so
        therefore times v. versus vs. whether yet
    CD: numeral, cardinal
        mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
        seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
        fifteen 271,124 dozen quintillion DM2,000 ...
    DT: determiner
        all an another any both del each either every half la many much nary
        neither no some such that the them these this those
    EX: existential there
        there
    FW: foreign word
        lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
        terram fiche oui corporis ...
    IN: preposition or conjunction, subordinating
        astride among uppon whether out inside pro despite on by throughout
        below within for towards near behind atop around if like until below
    JJ: adjective or numeral, ordinal
        third ill-mannered pre-war regrettable oiled calamitous first separable
        ectoplasmic battery-powered participatory fourth still-to-be-named
        multilingual multi-disciplinary ...
    JJR: adjective, comparative
        bleaker braver breezier briefer brighter brisker broader bumper busier
        calmer cheaper choosier cleaner clearer closer colder commoner costlier
        cozier creamier crunchier cuter ...
    JJS: adjective, superlative
        calmest cheapest choicest classiest cleanest clearest closest commonest
        corniest costliest crassest creepiest crudest cutest darkest deadliest
        dearest deepest densest dinkiest ...
    LS: list item marker
        A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
        SP-44007 Second Third Three Two * a b c d first five four one six three
        two
    MD: modal auxiliary
        can cannot could couldn't dare may might must need ought shall should
        shouldn't will would
    NN: noun, common, singular or mass
        common-carrier cabbage knuckle-duster Casino afghan shed thermostat
        investment slide humour falloff slick wind hyena override subhumanity
        machinist ...
    NNP: noun, proper, singular
        Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
        Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
        Shannon A.K.C. Meltex Liverpool ...
    NNPS: noun, proper, plural
        Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
        Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
        Apache Apaches Apocrypha ...
    NNS: noun, common, plural
        undergraduates scotches bric-a-brac products bodyguards facets coasts
        divestitures storehouses designs clubs fragrances averages
        subjectivists apprehensions muses factory-jobs ...
    PDT: pre-determiner
        all both half many quite such sure this
    POS: genitive marker
        ' 's
    PRP: pronoun, personal
        hers herself him himself hisself it itself me myself one oneself ours
        ourselves ownself self she thee theirs them themselves they thou thy us
    PRP$: pronoun, possessive
        her his mine my our ours their thy your
    RB: adverb
        occasionally unabatingly maddeningly adventurously professedly
        stirringly prominently technologically magisterially predominately
        swiftly fiscally pitilessly ...
    RBR: adverb, comparative
        further gloomier grander graver greater grimmer harder harsher
        healthier heavier higher however larger later leaner lengthier less-
        perfectly lesser lonelier longer louder lower more ...
    RBS: adverb, superlative
        best biggest bluntest earliest farthest first furthest hardest
        heartiest highest largest least less most nearest second tightest worst
    RP: particle
        aboard about across along apart around aside at away back before behind
        by crop down ever fast for forth from go high i.e. in into just later
        low more off on open out over per pie raising start teeth that through
        under unto up up-pp upon whole with you
    SYM: symbol
        % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
    TO: "to" as preposition or infinitive marker
        to
    UH: interjection
        Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
        huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
        man baby diddle hush sonuvabitch ...
    VB: verb, base form
        ask assemble assess assign assume atone attention avoid bake balkanize
        bank begin behold believe bend benefit bevel beware bless boil bomb
        boost brace break bring broil brush build ...
    VBD: verb, past tense
        dipped pleaded swiped regummed soaked tidied convened halted registered
        cushioned exacted snubbed strode aimed adopted belied figgered
        speculated wore appreciated contemplated ...
    VBG: verb, present participle or gerund
        telegraphing stirring focusing angering judging stalling lactating
        hankerin' alleging veering capping approaching traveling besieging
        encrypting interrupting erasing wincing ...
    VBN: verb, past participle
        multihulled dilapidated aerosolized chaired languished panelized used
        experimented flourished imitated reunifed factored condensed sheared
        unsettled primed dubbed desired ...
    VBP: verb, present tense, not 3rd person singular
        predominate wrap resort sue twist spill cure lengthen brush terminate
        appear tend stray glisten obtain comprise detest tease attract
        emphasize mold postpone sever return wag ...
    VBZ: verb, present tense, 3rd person singular
        bases reconstructs marks mixes displeases seals carps weaves snatches
        slumps stretches authorizes smolders pictures emerges stockpiles
        seduces fizzes uses bolsters slaps speaks pleads ...
    WDT: WH-determiner
        that what whatever which whichever
    WP: WH-pronoun
        that what whatever whatsoever which who whom whosoever
    WP$: WH-pronoun, possessive
        whose
    WRB: Wh-adverb
        how however whence whenever where whereby whereever wherein whereof why
    '''

Named entity recognition seeks to label words with the kinds of entities that they relate to.

Here is named entity recognition on Wikipedia.
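
NLTK can also chunk named entities out of POS-tagged tokens. As a minimal sketch (not part of the original walkthrough), we can reuse the tags computed above; nltk.ne_chunk relies on the maxent_ne_chunker and words resources downloaded at the start:

    # Apply named entity chunking to our POS-tagged tokens
    entities = nltk.ne_chunk(tags)

    # Check out the result - proper nouns such as 'San Diego' should be
    # grouped into labeled entity subtrees
    print(entities)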

'Stop words' are the most common words of a language, which we often want to filter out before doing text analysis.

Here are stop words on Wikipedia.

    # Check out the corpus of stop words in English
    print(nltk.corpus.stopwords.words('english'))
    # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Text Encoding

One of the key components of NLP is deciding how to encode the text data.

Common encodings are:

  • Bag of Words (BoW)
    • Text is encoded as a collection of words and their frequencies (a quick sketch follows this list)
  • Term Frequency-Inverse Document Frequency (TF-IDF)
    • TF-IDF is a weighting that stores words with relation to how common they are across the corpus
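
As a quick sketch of the BoW idea (using scikit-learn's CountVectorizer, which is not otherwise used in this notebook, on two made-up documents):

    from sklearn.feature_extraction.text import CountVectorizer

    # Encode two toy documents as a documents-by-words matrix of counts
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(["the cat sat", "the cat sat on the mat"])

    # Each column is a word; each row holds one document's word frequencies
    print(vectorizer.get_feature_names_out())
    # ['cat' 'mat' 'on' 'sat' 'the']
    print(bow.toarray())
    # [[1 0 0 1 1]
    #  [1 1 1 1 2]]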

We will walk through examples of both BoW and TF-IDF text encodings.

    # Imports
    %matplotlib inline

    # Standard Python has some useful string tools
    import string

    # Collections is part of standard Python, with some useful data objects
    from collections import Counter

    import numpy as np
    import matplotlib.pyplot as plt

    # Scikit-learn has some useful NLP tools, such as the TFIDF vectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

The data we will look at is a small subset of the BookCorpus dataset. The original dataset is available online.

The original dataset was collected from more than 11,000 books, and has already been tokenized at both the sentence and word level. The small subset provided and used here contains the first 10,000 sentences.

    # Load the data
    with open('files/book10k.txt', 'r') as f:
        sents = f.readlines()

    # Check out the data - print out the first and last sentences, as examples
    print(sents[0])
    print(sents[-1])
    '''
    the half-ling book one in the fall of igneeria series kaylee soderburg copyright 2013 kaylee soderburg all rights reserved .
    alejo was sure the fact that he was nervously repeating mass along with five wrinkly , age-encrusted spanish women meant that stalin was rethinking whether he was going to pay the price .
    '''

    # Preprocessing: remove all extra whitespace from the sentences
    sents = [sent.strip() for sent in sents]

We will start by looking at word frequencies in the documents, and then print out the top 10 most frequently occurring words.
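
The cell that builds the word counts is missing from this copy; here is a minimal sketch of that step, tokenizing each sentence with NLTK and tallying the words with the Counter imported above (the names tokens, all_words, and counts are ours, with counts being the object used in the next cells):

    # Tokenize each sentence into words
    tokens = [nltk.word_tokenize(sent) for sent in sents]

    # Flatten the per-sentence tokens into one list, and count each word
    all_words = [word for sent_tokens in tokens for word in sent_tokens]
    counts = Counter(all_words)

    # Check out the 10 most frequently occurring words
    print(counts.most_common(10))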

If you scroll through the word list above, one thing you might notice is that it still includes punctuation. Let's remove those tokens.

    # The 'string' module (standard library) has a useful list of punctuation characters
    print(string.punctuation)
    # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

    # Drop all punctuation tokens from the counts object
    for punc in string.punctuation:
        if punc in counts:
            counts.pop(punc)

    # Get the top 10 most frequently occurring words
    top10 = counts.most_common(10)

    # Extract the top words and their counts
    top10_words = [it[0] for it in top10]
    top10_counts = [it[1] for it in top10]

    # Plot a barplot of the most frequent words in the text
    plt.barh(top10_words, top10_counts)
    plt.title('Term Frequency');
    plt.xlabel('Frequency');

As we can see, words like 'the', 'was', and 'a' appear most often in the documents.

These frequently occurring words are not very useful for figuring out what these documents are about, or as a way of using and understanding this text data.

    # Drop all stop words from the counts
    for stop in nltk.corpus.stopwords.words('english'):
        if stop in counts:
            counts.pop(stop)

    # Get the top 20 most frequent words, from the data with stop words removed
    top20 = counts.most_common(20)

    # Plot a barplot of the most frequent words in the text
    plt.barh([it[0] for it in top20], [it[1] for it in top20])
    plt.title('Term Frequency');
    plt.xlabel('Frequency');

This looks potentially more relevant/useful. We could continue exploring this BoW model, but for now, let's move on and explore the data with TF-IDF.

    # Initialize a TFIDF object
    tfidf = TfidfVectorizer(analyzer='word',
                            sublinear_tf=True,
                            max_features=5000,
                            tokenizer=nltk.word_tokenize)

    # Apply the TFIDF transformation to our data
    # Note that this takes the sentences, tokenizes them, and then applies TFIDF
    tfidf_books = tfidf.fit_transform(sents).toarray()

The TfidfVectorizer will compute the inverse document frequency (IDF) for each word.

The TF-IDF score is then computed as TF * IDF, which serves to down-weight frequently occurring words. These TF-IDF values are stored in the tfidf_books variable, which is an n_documents x n_words matrix that encodes the documents in a TF-IDF representation.
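
For reference, a sketch of the quantities the vectorizer computes, assuming scikit-learn's default smoothed IDF together with the sublinear_tf=True option set above, where $n$ is the number of documents, $f_{t,d}$ is the raw count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$ (scikit-learn also L2-normalizes each document vector by default):

$$\mathrm{tf}(t,d) = 1 + \log f_{t,d}, \qquad \mathrm{idf}(t) = \log\frac{1+n}{1+\mathrm{df}(t)} + 1, \qquad \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)$$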

Let's start by plotting the IDF for each of the top 10 most frequently occurring words (from the first analysis).
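
That plotting cell is not reproduced in this copy; here is a minimal sketch of how it could be done, looking up each word's column index with tfidf.vocabulary_ and its weight in tfidf.idf_ (this assumes each of these very frequent words made it into the 5000-feature vocabulary, which they will have; top10_idf is our name):

    # Look up the IDF weight of each of the 10 most frequent words
    top10_idf = [tfidf.idf_[tfidf.vocabulary_[word]] for word in top10_words]

    # Plot the IDF scores for these very common words
    plt.barh(top10_words, top10_idf)
    plt.xlabel('IDF Score');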


We can compare that plot to the following one, which shows the 10 words with the highest IDF scores.

    # Get the words with the highest IDF scores
    inds = np.argsort(tfidf.idf_)[::-1][:10]

    # Look up tokens via get_feature_names_out(), which is ordered by column
    # index (the vocabulary_ dict is not ordered that way)
    feature_names = tfidf.get_feature_names_out()
    top_IDF_tokens = [feature_names[ind] for ind in inds]
    top_IDF_scores = tfidf.idf_[inds]

    # Plot the words with the highest IDF scores
    plt.barh(top_IDF_tokens, top_IDF_scores)
    plt.xlabel('IDF Score');

As we can see, the words that occur frequently across the documents get much lower IDF scores than the rarer words.

With TF-IDF, we have successfully down-weighted the words that occur frequently across the documents. This allows us to represent documents by their most distinctive words, which can be a more useful way to represent text data.