6.3. 语言模型数据集（周杰伦专辑歌词）

首先读取这个数据集，看看前40个字符是什么样的。

from mxnet import nd
import random
import zipfile
 
with zipfile.ZipFile('../data/jaychou_lyrics.txt.zip') as zin:
    with zin.open('jaychou_lyrics.txt') as f:
        corpus_chars = f.read().decode('utf-8')
corpus_chars[:40]

Out[1]:

'想要有直升机\n想要和你飞到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'

这个数据集有6万多个字符。为了打印方便，我们把换行符替换成空格，然后仅使用前1万个字符来训练模型。

In [2]:

corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]

In [3]:

Out[3]:

之后，将训练数据集中每个字符转化为索引，并打印前20个字符及其对应的索引。

In [4]:

corpus_indices = [char_to_idx[char] for char in corpus_chars]
sample = corpus_indices[:20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)

chars: 想要有直升机 想要和你飞到宇宙去 想要和

我们将以上代码封装在d2lzh包里的load_data_jay_lyrics函数中，以方便后面章节调用。调用该函数后会依次得到corpus_indices、、idx_to_char和vocab_size这4个变量。

下面的代码每次从数据里随机采样一个小批量。其中批量大小batch_size指每个小批量的样本数，为每个样本所包含的时间步数。在随机采样中，每个样本是原始序列上任意截取的一段序列。相邻的两个随机小批量在原始序列上的位置不一定相毗邻。因此，我们无法用一个小批量最终时间步的隐藏状态来初始化下一个小批量的隐藏状态。在训练模型时，每次随机采样前都需要重新初始化隐藏状态。

In [5]:

让我们输入一个从0到29的连续整数的人工序列。设批量大小和时间步数分别为2和6。打印随机采样每次读取的小批量样本的输入X和标签Y。可见，相邻的两个随机小批量在原始序列上的位置不一定相毗邻。

In [6]:

my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X:
[[ 6.  7.  8.  9. 10. 11.]
 [18. 19. 20. 21. 22. 23.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 7.  8.  9. 10. 11. 12.]
 [19. 20. 21. 22. 23. 24.]]
<NDArray 2x6 @cpu(0)>
 
X:
[[ 0.  1.  2.  3.  4.  5.]
 [12. 13. 14. 15. 16. 17.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 1.  2.  3.  4.  5.  6.]
 [13. 14. 15. 16. 17. 18.]]
<NDArray 2x6 @cpu(0)>

6.3.3.2. 相邻采样

In [7]:

# 本函数已保存在d2lzh包中方便以后使用
def data_iter_consecutive(corpus_indices, batch_size, num_steps, ctx=None):
    corpus_indices = nd.array(corpus_indices, ctx=ctx)
    data_len = len(corpus_indices)
    batch_len = data_len // batch_size
        batch_size, batch_len))
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

同样的设置下，打印相邻采样每次读取的小批量样本的输入X和标签。相邻的两个随机小批量在原始序列上的位置相毗邻。

In [8]:

X:
[[ 0.  1.  2.  3.  4.  5.]
 [15. 16. 17. 18. 19. 20.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 1.  2.  3.  4.  5.  6.]
 [16. 17. 18. 19. 20. 21.]]
<NDArray 2x6 @cpu(0)>
 
X:
[[ 6.  7.  8.  9. 10. 11.]
 [21. 22. 23. 24. 25. 26.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 7.  8.  9. 10. 11. 12.]
 [22. 23. 24. 25. 26. 27.]]
<NDArray 2x6 @cpu(0)>

时序数据采样方式包括随机采样和相邻采样。使用这两种方式的循环神经网络训练在实现上略有不同。

你还能想到哪些采样小批量时序数据的方法？
如果我们希望一个序列样本是一个完整的句子，这会给小批量采样带来什么样的问题？