
    Date: 2021.01

    Summary: This example briefly shows how to implement image search with the open-source PaddlePaddle framework.




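    This tutorial is written for Paddle 2.0. If your environment uses a different version, please first consult Paddle 2.0 on the official website.

    This example uses the CIFAR-10 dataset, a classic dataset consisting of 50,000 training images and 10,000 test images, each an RGB image 32 pixels high and 32 pixels wide. paddle.vision.datasets.Cifar10 makes it easy to download the data and provides an iterator for accessing the samples in order; the code below then scales the pixel values to the range [0, 1]. We store the training data and the test data in two separate NumPy arrays for use in training and evaluation later.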

    import numpy as np
    import paddle
    import paddle.vision.transforms as T

    # Convert each image from HWC to CHW layout
    transform = T.Compose([T.Transpose((2, 0, 1))])

    cifar10_train = paddle.vision.datasets.Cifar10(mode='train', transform=transform)
    x_train = np.zeros((50000, 3, 32, 32))
    y_train = np.zeros((50000, 1), dtype='int32')

    for i in range(len(cifar10_train)):
        train_image, train_label = cifar10_train[i]
        # normalize the data
        x_train[i, :, :, :] = train_image / 255.
        y_train[i, 0] = train_label
    y_train = np.squeeze(y_train)

    print(x_train.shape)
    print(y_train.shape)

    (50000, 3, 32, 32)
    (50000,)

    cifar10_test = paddle.vision.datasets.Cifar10(mode='test', transform=transform)
    x_test = np.zeros((10000, 3, 32, 32), dtype='float32')
    y_test = np.zeros((10000, 1), dtype='int64')

    for i in range(len(cifar10_test)):
        test_image, test_label = cifar10_test[i]
        # normalize the data
        x_test[i, :, :, :] = test_image / 255.
        y_test[i, 0] = test_label
    y_test = np.squeeze(y_test)

    print(x_test.shape)
    print(y_test.shape)

    3.2 Data exploration


    import numpy as np
    from PIL import Image

    height_width = 32

    def show_collage(examples):
        # Each cell holds one image plus a 2-pixel white margin
        box_size = height_width + 2
        num_rows, num_cols = examples.shape[:2]

        collage = Image.new(
            mode="RGB",
            size=(num_cols * box_size, num_rows * box_size),
            color=(255, 255, 255),
        )
        for row_idx in range(num_rows):
            for col_idx in range(num_cols):
                array = (np.array(examples[row_idx, col_idx]) * 255).astype(np.uint8)
                # CHW -> HWC, the layout PIL expects
                array = array.transpose(1, 2, 0)
                collage.paste(
                    Image.fromarray(array), (col_idx * box_size, row_idx * box_size)
                )

        # Enlarge the collage 2x for easier viewing
        collage = collage.resize((2 * num_cols * box_size, 2 * num_rows * box_size))
        return collage

    # Show a 5x5 grid of randomly chosen training images
    sample_idxs = np.random.randint(0, 50000, size=(5, 5))
    examples = x_train[sample_idxs]
    show_collage(examples)

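    3.3 Building the training data

    Training samples for an image retrieval model differ from those of a typical classification task: each sample is not an (image, class) pair but a triple of the form (image0, image1, similar_or_not). That is, each training sample consists of two images, and its label is a flag (0 or 1) indicating whether the two images are similar.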



    from collections import defaultdict

    # Map each class index to the indices of the training images in that class
    class_idx_to_train_idxs = defaultdict(list)
    for y_train_idx, y in enumerate(y_train):
        class_idx_to_train_idxs[y].append(y_train_idx)

    # Same mapping for the test images
    class_idx_to_test_idxs = defaultdict(list)
    for y_test_idx, y in enumerate(y_test):
        class_idx_to_test_idxs[y].append(y_test_idx)

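    With the index above, we can prepare a data iterator for PaddlePaddle. Each iteration yields 2 * number of classes images, which is 20 images for CIFAR-10. The first 10 images and the last 10 images each contain one randomly drawn image from every one of the 10 classes. As a result, each training step sees 10 similar pairs and 90 dissimilar pairs: every one of the first 10 images is similar to the image at the corresponding position among the last 10, and dissimilar to the other 9.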
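
    The pairing scheme can be checked with a few lines of plain Python. This is only an illustrative sketch and not part of the tutorial's pipeline; the name pair_labels is hypothetical and introduced here for the illustration.

```python
# Sketch: which (first-half, second-half) image pairs in one batch are similar.
num_classes = 10  # CIFAR-10, as in the tutorial

# Row i is the image of class i in the first half of the batch,
# column j is the image of class j in the second half.
# Only the diagonal (same class) forms a similar pair.
pair_labels = [[1 if i == j else 0 for j in range(num_classes)]
               for i in range(num_classes)]

num_similar = sum(sum(row) for row in pair_labels)
num_dissimilar = num_classes * num_classes - num_similar
print(num_similar, num_dissimilar)  # 10 90
```

    The label matrix is the identity matrix, which is why the training code later needs only the class indices as labels rather than an explicit flag per pair.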

    import random

    num_classes = 10

    def reader_creator(num_batchs):
        def reader():
            iter_step = 0
            while True:
                # Stop once num_batchs batches have been generated
                if iter_step >= num_batchs:
                    break
                iter_step += 1
                # x[0] holds the anchors, x[1] the positives, one image per class
                x = np.empty((2, num_classes, 3, height_width, height_width), dtype=np.float32)
                for class_idx in range(num_classes):
                    examples_for_class = class_idx_to_train_idxs[class_idx]
                    anchor_idx = random.choice(examples_for_class)
                    positive_idx = random.choice(examples_for_class)
                    # Make sure the positive is a different image from the anchor
                    while positive_idx == anchor_idx:
                        positive_idx = random.choice(examples_for_class)
                    x[0, class_idx] = x_train[anchor_idx]
                    x[1, class_idx] = x_train[positive_idx]
                yield x
        return reader

    # num_batchs: how many batchs to generate
    def anchor_positive_pairs(num_batchs=100):
        return reader_creator(num_batchs)
    pairs_train_reader = anchor_positive_pairs(num_batchs=1000)

    # Fetch one batch and check its shape
    print(next(pairs_train_reader()).shape)

    (2, 10, 3, 32, 32)

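    Our goal is to first map each image to a representation in a high-dimensional space and then compute the similarity between images in that space. The network below converts an image of shape (3, 32, 32) into a vector of shape (8,). Some references call this vector an embedding; note that it is distinct from the word embeddings used in natural language processing. The model consists of three consecutive convolutions followed by global average pooling, and then a fully connected linear layer that maps into an 8-dimensional vector space. To simplify the later cosine-similarity computation, we also normalize the output at the end (this takes care of the denominator of the cosine similarity).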

    import paddle
    import paddle.nn.functional as F

    class MyNet(paddle.nn.Layer):
        def __init__(self):
            super(MyNet, self).__init__()

            self.conv1 = paddle.nn.Conv2D(in_channels=3,
                                          out_channels=32,
                                          kernel_size=(3, 3),
                                          stride=2)

            self.conv2 = paddle.nn.Conv2D(in_channels=32,
                                          out_channels=64,
                                          kernel_size=(3, 3),
                                          stride=2)

            self.conv3 = paddle.nn.Conv2D(in_channels=64,
                                          out_channels=128,
                                          kernel_size=(3, 3),
                                          stride=2)

            self.global_pool = paddle.nn.AdaptiveAvgPool2D((1, 1))

            self.fc1 = paddle.nn.Linear(in_features=128, out_features=8)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = self.conv3(x)
            x = F.relu(x)
            # (N, 128, H, W) -> (N, 128, 1, 1) -> (N, 128)
            x = self.global_pool(x)
            x = paddle.squeeze(x, axis=[2, 3])
            x = self.fc1(x)
            # L2-normalize so that a dot product equals the cosine similarity
            x = x / paddle.norm(x, axis=1, keepdim=True)
            return x
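
    To see how the three stride-2 convolutions shrink the 32x32 input before pooling, the spatial sizes can be traced by hand. The sketch below assumes no padding and a dilation of 1, which matches the Conv2D calls above as long as those arguments are left at their defaults; conv_out_size is a hypothetical helper introduced only for this illustration.

```python
def conv_out_size(size, kernel_size=3, stride=2, padding=0):
    """Spatial output size of a convolution (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Trace the height/width through conv1, conv2 and conv3
size = 32
sizes = [size]
for _ in range(3):
    size = conv_out_size(size)
    sizes.append(size)

print(sizes)  # [32, 15, 7, 3]
```

    Under these assumptions the last convolution produces a 128-channel 3x3 feature map, which the global average pooling reduces to a single 128-dimensional vector per image.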


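    • The inverse_temperature parameter keeps softmax in a region where its gradients are more pronounced (comparable to the scale operation applied after the dot product in dot-product attention).

    • The computation first uses the network above to compute the embeddings of the first 10 images (the anchors) and of the last 10 images, then uses matmul to compute the similarity of each of the first 10 images to each of the last 10. The resulting similarities is therefore a Tensor of shape (10, 10).

    • The class labels are constructed accordingly as the values 0 to num_classes - 1, so that the learning objective pushes the similarity of similar images toward 1.0 and that of dissimilar images toward -1.0.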
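
    The three points above can be illustrated on a toy problem in plain Python. This is a minimal sketch under stated assumptions, not the tutorial's implementation: it uses 3 classes and hand-picked 2-dimensional embeddings, and the helper names l2_normalize, similarity_matrix and cross_entropy are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(anchors, positives):
    """Dot products of unit vectors, i.e. cosine similarities."""
    return [[sum(a * p for a, p in zip(anchor, positive))
             for positive in positives] for anchor in anchors]

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over the rows of logits."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]
    return total / len(logits)

# Toy embeddings: row i of each list belongs to class i
anchors = [l2_normalize(v) for v in [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]]
positives = [l2_normalize(v) for v in [[0.9, 0.2], [0.1, 1.0], [-1.0, 0.0]]]

similarities = similarity_matrix(anchors, positives)
labels = list(range(len(anchors)))  # row i should match column i
inverse_temperature = 1.0 / 0.2     # same value as in the tutorial

loss_plain = cross_entropy(similarities, labels)
scaled = [[s * inverse_temperature for s in row] for row in similarities]
loss_scaled = cross_entropy(scaled, labels)
print(loss_plain, loss_scaled)
```

    Because cosine similarities are confined to [-1, 1], the unscaled softmax can never become very peaked, so the loss stays well above zero even when every anchor already matches its positive. Multiplying by the inverse temperature widens the range of the logits, and in this toy case the scaled loss is lower than the unscaled one.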

    def train(model):
        print('start training ... ')
        model.train()

        inverse_temperature = paddle.to_tensor(np.array([1.0/0.2], dtype='float32'))

        epoch_num = 20
        opt = paddle.optimizer.Adam(learning_rate=0.0001,
                                    parameters=model.parameters())

        for epoch in range(epoch_num):
            for batch_id, data in enumerate(pairs_train_reader()):
                anchors_data, positives_data = data[0], data[1]

                anchors = paddle.to_tensor(anchors_data)
                positives = paddle.to_tensor(positives_data)

                anchor_embeddings = model(anchors)
                positive_embeddings = model(positives)

                # (10, 10) matrix of cosine similarities, scaled by the inverse temperature
                similarities = paddle.matmul(anchor_embeddings, positive_embeddings, transpose_y=True)
                similarities = paddle.multiply(similarities, inverse_temperature)

                # Row i should be most similar to column i
                sparse_labels = paddle.arange(0, num_classes, dtype='int64')

                loss = F.cross_entropy(similarities, sparse_labels)

                if batch_id % 500 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))
                loss.backward()
                opt.step()
                opt.clear_grad()

    model = MyNet()
    train(model)

    start training ...
    epoch: 0, batch_id: 0, loss is: [2.2846317]
    epoch: 0, batch_id: 500, loss is: [2.0111878]
    epoch: 1, batch_id: 0, loss is: [2.1171227]
    epoch: 1, batch_id: 500, loss is: [2.1604505]
    epoch: 2, batch_id: 0, loss is: [2.2643456]
    epoch: 2, batch_id: 500, loss is: [1.9459085]
    epoch: 3, batch_id: 0, loss is: [2.044874]
    epoch: 3, batch_id: 500, loss is: [2.6040092]
    epoch: 4, batch_id: 0, loss is: [2.2173238]
    epoch: 4, batch_id: 500, loss is: [1.9844944]
    epoch: 5, batch_id: 0, loss is: [1.8081882]
    epoch: 5, batch_id: 500, loss is: [1.7608368]
    epoch: 6, batch_id: 0, loss is: [2.3919208]
    epoch: 6, batch_id: 500, loss is: [2.057749]
    epoch: 7, batch_id: 0, loss is: [1.7965529]
    epoch: 7, batch_id: 500, loss is: [1.8363149]
    epoch: 8, batch_id: 0, loss is: [1.6242621]
    epoch: 8, batch_id: 500, loss is: [2.052803]
    epoch: 9, batch_id: 0, loss is: [1.7524099]
    epoch: 9, batch_id: 500, loss is: [1.820884]
    epoch: 10, batch_id: 0, loss is: [1.7788585]
    epoch: 10, batch_id: 500, loss is: [1.9079857]
    epoch: 11, batch_id: 0, loss is: [1.7813282]
    epoch: 11, batch_id: 500, loss is: [1.7013695]
    epoch: 12, batch_id: 0, loss is: [2.0464826]
    epoch: 12, batch_id: 500, loss is: [1.6375948]
    epoch: 13, batch_id: 0, loss is: [2.0308146]
    epoch: 13, batch_id: 500, loss is: [1.7633543]
    epoch: 14, batch_id: 0, loss is: [1.7758572]
    epoch: 14, batch_id: 500, loss is: [1.6636188]
    epoch: 15, batch_id: 0, loss is: [1.7562834]
    epoch: 15, batch_id: 500, loss is: [1.9864613]
    epoch: 16, batch_id: 0, loss is: [1.5613587]
    epoch: 16, batch_id: 500, loss is: [1.7808621]
    epoch: 17, batch_id: 0, loss is: [2.0996895]
    epoch: 17, batch_id: 500, loss is: [1.7851509]
    epoch: 18, batch_id: 0, loss is: [1.5448205]
    epoch: 18, batch_id: 500, loss is: [1.7916664]
    epoch: 19, batch_id: 0, loss is: [1.7407477]
    epoch: 19, batch_id: 500, loss is: [1.47673]



    # Assumes near_neighbours_per_example (an int) and indicies (for each test
    # image, the test-image indices ordered from most to least similar) are
    # already defined.
    examples = np.empty(
        (
            num_classes,
            near_neighbours_per_example + 1,
            3,
            height_width,
            height_width,
        ),
        dtype=np.float32,
    )

    for row_idx in range(num_classes):
        examples_for_class = class_idx_to_test_idxs[row_idx]
        anchor_idx = random.choice(examples_for_class)

        # Column 0 of each row is the query image
        examples[row_idx, 0] = x_test[anchor_idx]
        # Skip position 0 of the ranking, presumably the query image itself
        anchor_near_neighbours = indicies[anchor_idx][1:near_neighbours_per_example+1]
        for col_idx, nn_idx in enumerate(anchor_near_neighbours):
            examples[row_idx, col_idx + 1] = x_test[nn_idx]
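
    The snippet above relies on a ranking of nearest neighbours for every test image, whose construction is not shown here. One plausible way to build such a ranking is to sort each row of the embedding similarity matrix in descending order. The plain-Python sketch below illustrates that idea on four toy embeddings; it is an assumption about how the ranking could be produced, and the names nearest_neighbour_indices and toy_embeddings are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def nearest_neighbour_indices(embeddings):
    """For each embedding, rank all embeddings by descending dot product."""
    ranking = []
    for query in embeddings:
        sims = [sum(q * e for q, e in zip(query, other)) for other in embeddings]
        order = sorted(range(len(embeddings)), key=lambda j: sims[j], reverse=True)
        ranking.append(order)
    return ranking

# Two tight groups: items 0 and 1 point along x, items 2 and 3 along y
toy_embeddings = [l2_normalize(v)
                  for v in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]]
toy_indices = nearest_neighbour_indices(toy_embeddings)

# Position 0 of each ranking is the query itself, so take position 1 onward
nearest = [row[1:2] for row in toy_indices]
print(nearest)  # [[1], [0], [3], [2]]
```

    With unit-length embeddings every item has similarity 1.0 with itself, so it ranks first in its own row. That is why the slice starts at 1 when collecting neighbours.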


