Protecting User Privacy with Differential Privacy

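Differential privacy is a mechanism for protecting the privacy of user data. Privacy here refers to the attributes of an individual user; attributes shared by a group of users are generally not regarded as private. For example, "smokers are more likely to get lung cancer" discloses nobody's privacy, but "Zhang San smokes and has lung cancer" discloses Zhang San's privacy. Suppose we know that hospital A saw 100 patients today and that 10 of them have lung cancer. If we also know the diagnoses of 99 of those patients, we can infer whether the remaining patient has lung cancer. Stealing private information in this way is called a differencing attack.

Differential privacy is a method for defending against differencing attacks. By adding noise, it ensures that for two datasets differing in only one record, the probabilities of obtaining the same result through model inference are very close. In other words, once differential privacy is applied, what an attacker learns from the data of 100 patients is almost the same as what the attacker learns from the data of 99 patients, so the condition of the remaining patient cannot be inferred.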
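The hospital example can be sketched in a few lines of Python. This is a toy illustration only, not MindArmour code: the patient records, the `count_cancer` and `noisy_count` functions, and the choice of Laplace noise with `epsilon=0.5` are all assumptions made up for the example.

```python
import random

# Hypothetical records for the 100 patients: True means "has lung cancer".
records = [True] * 10 + [False] * 90
target_index = 0  # the one patient whose diagnosis the attacker does not know
known = [r for i, r in enumerate(records) if i != target_index]


def count_cancer(rows):
    """Exact counting query: number of patients with lung cancer."""
    return sum(rows)


# Differencing attack: published total minus the 99 known records.
inferred = count_cancer(records) - count_cancer(known)
leaked = inferred == 1  # the exact difference reveals the target's diagnosis


def noisy_count(rows, epsilon, rng):
    """Counting query with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    scale = 1.0 / epsilon
    # The difference of two exponential samples is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(rows) + noise


rng = random.Random(0)
noisy_diff = noisy_count(records, epsilon=0.5, rng=rng) - count_cancer(known)
print("exact difference:", inferred)  # 1, so the target has lung cancer
print("noisy difference:", round(noisy_diff, 2))
```

With the exact count the difference is always 0 or 1, so the attack succeeds. With the noisy count the difference is a real number spread around the true value, and a single released answer no longer settles the question.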

Differential Privacy in Machine Learning

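Machine learning algorithms typically update model parameters over large amounts of data in order to learn the features of that data. Ideally, these algorithms learn patterns that generalize well, such as "patients who smoke are more likely to get lung cancer", rather than features of specific individuals, such as "Zhang San is a smoker and has lung cancer". However, machine learning algorithms do not distinguish between general features and individual features. When machine learning is used for an important task such as lung cancer diagnosis, the published model may unintentionally reveal individual features from the training set, and a malicious attacker may obtain private information about Zhang San from the published model. It is therefore necessary to protect machine learning models with differential privacy.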

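Differential privacy is defined [1] as:

$Pr[\mathcal{K}(D)\in S] \le e^{\epsilon} Pr[\mathcal{K}(D') \in S]+\delta$

For two datasets $D$ and $D'$ that differ in only one record, the probability that the randomized algorithm $\mathcal{K}$ produces an output in any subset $S$ of the possible results satisfies the inequality above. $\epsilon$ is the differential privacy budget and $\delta$ is the perturbation. The smaller $\epsilon$ and $\delta$ are, the closer the output distributions of $\mathcal{K}$ on $D$ and $D'$.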
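The inequality can be checked numerically for one simple randomized algorithm. The sketch below is an illustration under stated assumptions, not part of MindArmour: it assumes a counting query whose true answers on $D$ and $D'$ are 10 and 9, releases the count with Laplace noise of scale $1/\epsilon$, takes $\delta = 0$, and tests the inequality on a grid of intervals $S$. The function names are hypothetical.

```python
import math


def laplace_cdf(x, mu, scale):
    """CDF of the Laplace distribution centred at mu."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / scale)
    return 1.0 - 0.5 * math.exp(-(x - mu) / scale)


def interval_prob(lo, hi, mu, scale):
    """Probability that a Laplace(mu, scale) sample falls in [lo, hi)."""
    return laplace_cdf(hi, mu, scale) - laplace_cdf(lo, mu, scale)


epsilon, delta = 0.5, 0.0
scale = 1.0 / epsilon           # one record changes a count by at most 1
count_d, count_d_prime = 10, 9  # true counts on the neighbouring datasets

worst_ratio = 0.0
holds = True
for k in range(-20, 60):        # output sets S = [k/2, k/2 + 0.5)
    lo, hi = k / 2, k / 2 + 0.5
    p = interval_prob(lo, hi, count_d, scale)
    q = interval_prob(lo, hi, count_d_prime, scale)
    # Small tolerance because the bound is tight in the tails.
    holds = holds and p <= math.exp(epsilon) * q + delta + 1e-12
    worst_ratio = max(worst_ratio, p / q)

print("inequality holds on all tested sets:", holds)
print("largest ratio:", round(worst_ratio, 4))
print("bound e^epsilon:", round(math.exp(epsilon), 4))
```

The largest observed ratio approaches $e^{\epsilon}$ without exceeding it beyond rounding error, which is what the definition requires for this mechanism.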

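Measuring Differential Privacy

Differential privacy can be measured by $\epsilon$ and $\delta$.

- $\epsilon$: the upper limit on how much the output probability can change when one record is added to or removed from the dataset. We usually want $\epsilon$ to be a small constant; the smaller the value, the stricter the differential privacy condition.
- $\delta$: bounds the probability that the model's behavior changes arbitrarily. It is usually set to a small constant, and a value smaller than the reciprocal of the training dataset size is recommended.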
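To make the two quantities concrete, the sketch below prints the factor $e^{\epsilon}$ for a few budgets and picks a $\delta$ below the reciprocal of the dataset size. Both helper functions are hypothetical and exist only for this illustration; the figure of 60000 samples is the size of the MNIST training set.

```python
import math


def max_probability_ratio(epsilon):
    """Factor e^epsilon by which one record may scale an output probability
    (ignoring the additive delta term)."""
    return math.exp(epsilon)


def suggest_delta(num_samples):
    """Largest power of ten strictly below 1 / num_samples."""
    exponent = math.floor(math.log10(1.0 / num_samples))
    if 10.0 ** exponent * num_samples >= 1.0:
        exponent -= 1
    return 10.0 ** exponent


for eps in (0.1, 1.0, 5.0):
    ratio = max_probability_ratio(eps)
    print(f"epsilon={eps}: probabilities may differ by a factor of up to {ratio:.2f}")

# MNIST has 60000 training samples, so 1/N is about 1.67e-5.
delta = suggest_delta(60000)
print("delta:", delta)
```

Rounding down to a power of ten is just one convenient convention; any value below 1/N satisfies the recommendation.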

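Differential Privacy in MindArmour

This section uses the LeNet model and the MNIST dataset as an example to show how to train a neural network model on MindSpore with a differential privacy optimizer.

Implementation

The example needs common Python modules, MindSpore modules, and the differential privacy modules of MindArmour.

Parameter Configuration

Set the running environment, dataset path, model training parameters, checkpoint storage parameters, and differential privacy parameters. Replace the data_path value with the path to your own dataset. More configuration options are available for reference.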

from easydict import EasyDict as edict

cfg = edict({
     'num_classes': 10,  # the number of classes of model's output
     'lr': 0.01,  # the learning rate of model's optimizer
     'momentum': 0.9,  # the momentum value of model's optimizer
     'epoch_size': 10,  # training epochs
     'batch_size': 256,  # batch size for training
     'image_height': 32,  # the height of training samples
     'image_width': 32,  # the width of training samples
     'save_checkpoint_steps': 234,  # the interval steps for saving checkpoint file of the model
     'keep_checkpoint_max': 10,  # the maximum number of checkpoint files would be saved
     'device_target': 'Ascend',  # device used
     'data_path': '../../common/dataset/MNIST',  # the path of training and testing data set
     'dataset_sink_mode': False,  # whether deliver all training data to device one time
     'micro_batches': 32,  # the number of small batches split from an original batch
     'norm_bound': 1.0,  # the clip bound of the gradients of model's training parameters
     'initial_noise_multiplier': 0.05,  # the initial multiplication coefficient of the noise added to training
     # parameters' gradients
     'noise_mechanisms': 'Gaussian',  # the method of adding noise in gradients while training
     'clip_mechanisms': 'Gaussian',  # the method of adaptive clipping gradients while training
     'clip_decay_policy': 'Linear', # Decay policy of adaptive clipping, decay_policy must be in ['Linear', 'Geometric'].
     'clip_learning_rate': 0.001, # Learning rate of update norm clip.
     'target_unclipped_quantile': 0.9, # Target quantile of norm clip.
     'fraction_stddev': 0.01, # The stddev of Gaussian normal which used in empirical_fraction.
     'optimizer': 'Momentum'  # the base optimizer used for Differential privacy training
})

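Configure the necessary information, including the environment information and the execution mode.
```
from mindspore import context
context.set_context(mode=context.GRAPH_MODE, device_target=cfg.device_target)
```
For details about the interface configuration, see the description of the context.set_context API.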

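Load the dataset and convert it to the MindSpore data format.

Build the Model

LeNet is used here as an example. You can also build and train your own model.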

from mindspore import nn
from mindspore.common.initializer import TruncatedNormal


def conv(in_channels, out_channels, kernel_size, stride=1, padding=0):
    """Create a bias-free 2D convolution with truncated-normal weights."""
    weight = weight_variable()
    return nn.Conv2d(in_channels, out_channels,
                     kernel_size=kernel_size, stride=stride, padding=padding,
                     weight_init=weight, has_bias=False, pad_mode="valid")


def fc_with_initialize(input_channels, out_channels):
    """Create a dense layer with truncated-normal weights and bias."""
    weight = weight_variable()
    bias = weight_variable()
    return nn.Dense(input_channels, out_channels, weight, bias)


def weight_variable():
    """Return a truncated-normal initializer with sigma 0.05."""
    return TruncatedNormal(0.05)


class LeNet5(nn.Cell):
    """
    LeNet network
    """
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = conv(1, 6, 5)
        self.conv2 = conv(6, 16, 5)
        self.fc1 = fc_with_initialize(16*5*5, 120)
        self.fc2 = fc_with_initialize(120, 84)
        self.fc3 = fc_with_initialize(84, 10)
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

Load the LeNet network, define the loss function, configure the checkpoint, and load the data with the generate_mnist_dataset function from the dataset loading step.

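network = LeNet5()
# Assumption: the loss definition is missing from this excerpt; softmax
# cross-entropy is used here as a typical choice for MNIST classification.
net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
config_ck = CheckpointConfig(save_checkpoint_steps=cfg.save_checkpoint_steps,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet",
                             directory='./trained_ckpt_file/',
                             config=config_ck)
# get training dataset
ds_train = generate_mnist_dataset(os.path.join(cfg.data_path, "train"),
                                  cfg.batch_size)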

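Configure the parameters of the differential privacy optimizer.
- Check that micro_batches and batch_size meet the requirement: batch_size must be evenly divisible by micro_batches.
- Instantiate the differential privacy factory class.
- Set the optimizer type. SGD, Momentum, and Adam are currently supported.
- Set up the RDP differential privacy budget monitor, which tracks how the differential privacy budget $\epsilon$ changes at each step.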
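The first item in the list, the divisibility check, can be sketched as follows. The function name `split_into_micro_batches` is hypothetical and is not a MindArmour API; the values 256 and 32 are the batch_size and micro_batches settings from the configuration.

```python
def split_into_micro_batches(batch_size, micro_batches):
    """Return the size of each micro batch, or raise if the split is uneven."""
    if micro_batches <= 0:
        raise ValueError("micro_batches must be a positive integer")
    if batch_size % micro_batches != 0:
        raise ValueError("micro_batches must divide batch_size evenly")
    return batch_size // micro_batches


# 256 samples per batch, split into 32 micro batches.
micro_batch_size = split_into_micro_batches(batch_size=256, micro_batches=32)
print(micro_batch_size)  # 8
```

Failing early with a clear error is cheaper than discovering a shape mismatch in the middle of training.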

To wrap the LeNet model as a differential privacy model, simply pass the network to DPModel.

# Create the DP model for training.
model = DPModel(micro_batches=cfg.micro_batches,
                norm_bound=cfg.norm_bound,
                noise_mech=noise_mech,
                clip_mech=clip_mech,
                network=network,
                loss_fn=net_loss,
                optimizer=net_opt,
                metrics={"Accuracy": Accuracy()})
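The role of the micro_batches, norm_bound, and noise parameters can be illustrated with the general idea of differentially private SGD from [3]: clip each micro-batch gradient, add Gaussian noise, then average. The sketch below is a plain Python illustration of that idea under simplifying assumptions (a fixed clipping bound and a fixed noise multiplier). It is not MindArmour's implementation, and the function names and toy gradients are made up for the example.

```python
import math
import random


def clip_by_l2_norm(grad, norm_bound):
    """Scale grad down so that its L2 norm is at most norm_bound."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, norm_bound / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_average_gradient(micro_batch_grads, norm_bound, noise_multiplier, rng):
    """Clip each micro-batch gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_by_l2_norm(g, norm_bound) for g in micro_batch_grads]
    summed = [sum(column) for column in zip(*clipped)]
    stddev = noise_multiplier * norm_bound
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(micro_batch_grads) for n in noisy]


# Toy gradients for 4 micro batches of a 3-parameter model (made-up numbers).
grads = [[3.0, 4.0, 0.0], [0.1, 0.2, 0.2], [-6.0, 0.0, 8.0], [0.0, 0.5, 0.0]]
rng = random.Random(0)
update = dp_average_gradient(grads, norm_bound=1.0, noise_multiplier=0.05, rng=rng)
print(update)
```

Clipping limits how much any single micro batch can move the parameters, and the noise standard deviation is tied to that same bound, which is why the two settings are configured together.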

Train and test the model.

LOGGER.info(TAG, "============== Starting Training ==============")
model.train(cfg['epoch_size'], ds_train,
            callbacks=[ckpoint_cb, LossMonitor(), rdp_monitor],
            dataset_sink_mode=cfg.dataset_sink_mode)
LOGGER.info(TAG, "============== Starting Testing ==============")
ckpt_file_name = 'trained_ckpt_file/checkpoint_lenet-10_234.ckpt'
param_dict = load_checkpoint(ckpt_file_name)
load_param_into_net(network, param_dict)
ds_eval = generate_mnist_dataset(os.path.join(cfg.data_path, 'test'),
                                 batch_size=cfg.batch_size)
acc = model.eval(ds_eval, dataset_sink_mode=False)
LOGGER.info(TAG, "============== Accuracy: %s  ==============", acc)
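The checkpoint file name loaded above, checkpoint_lenet-10_234.ckpt, is consistent with 10 epochs of 234 steps each. The arithmetic can be sketched as follows, assuming that the MNIST training set has 60000 samples and that the incomplete last batch of each epoch is dropped. Both assumptions, and the two helper functions, are not stated in this text.

```python
def steps_per_epoch(num_samples, batch_size, drop_remainder=True):
    """Number of optimizer steps in one pass over the data."""
    if drop_remainder:
        return num_samples // batch_size
    return -(-num_samples // batch_size)  # ceiling division


def last_checkpoint_name(prefix, epochs, steps):
    """Hypothetical helper that formats '<prefix>-<epoch>_<step>.ckpt'."""
    return f"{prefix}-{epochs}_{steps}.ckpt"


steps = steps_per_epoch(num_samples=60000, batch_size=256)
print(steps)  # 234
print(last_checkpoint_name("checkpoint_lenet", 10, steps))
```

If you change batch_size or epoch_size in the configuration, the name of the final checkpoint changes accordingly, so ckpt_file_name has to be updated to match.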

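Run the command.

Run the script from the command line with the Python interpreter. The example script is named lenet5_dp.py; replace it with the name of your own script.

Display the result.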

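Without differential privacy, the accuracy of the LeNet model stabilizes at 99%. With Gaussian noise and adaptive clipping, the differentially private LeNet model still converges, and its accuracy stabilizes at around 95% (the sample run below reports 0.9698).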
```
============== Starting Training ==============
...
============== Starting Testing ==============
============== Accuracy: 0.9698  ==============
```

References

[1] C. Dwork and J. Lei. Differential privacy and robust statistics. In STOC, pages 371–380. ACM, 2009.

[2] I. Mironov. Rényi differential privacy. In IEEE Computer Security Foundations Symposium, 2017.

[3] M. Abadi et al. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.