# Custom Op

In MNN, adding an Op involves the following steps:

1. Add model conversion for the Op.
2. Add the Op's implementation, i.e. an Execution, for the corresponding backend(s) (CPU / Metal / Vulkan / OpenCL / OpenGL).

First, add the Op's name and its parameter description to the schema, as follows:

1. If the Op comes from Caffe, add its parameter description under CaffeOp.fbs; if it comes from Tensorflow, add it under TensorflowOp.fbs (for Pooling, for example, this is the `Pool` table).
2. Add the Op's name to the OpType list in MNN.fbs, then add the parameter description's name (i.e. the name after `table`, `Pool` here) to the OpParameter list.

### Tensorflow model conversion

1. Add xxxTf.cpp under the tensorflow converter directory; see PoolingTf.cpp for reference. There are two ways to declare the converter:
```c++
// Option 1: declare the converter class by hand
class PoolingTf : public tfOpConverter {
public:
    virtual void run(MNN::OpT *dstOp, TmpNode *srcNode, TmpGraph *tempGraph);
    PoolingTf() {}
    virtual ~PoolingTf() {}
    virtual MNN::OpType opType();
    virtual MNN::OpParameter type();
};
// Option 2: use the macro
// DECLARE_OP_CONVERTER(PoolingTf);
```
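In run(), parse the Tensorflow attributes from srcNode and fill in dstOp. The converter also has to be registered; a minimal sketch, assuming the tensorflow path uses the same REGISTER_CONVERTER macro as the Tensorflow Lite section below, keyed by the node's op type name in the Tensorflow graph:

```c++
// Sketch: register PoolingTf for Tensorflow "MaxPool" nodes
// (the second argument is assumed to be the Tensorflow op type name).
REGISTER_CONVERTER(PoolingTf, MaxPool);
```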

### Caffe model conversion

1. Add xxx.cpp under the caffe converter directory (see Pool.cpp for reference) and parse the Caffe parameters in the run function to obtain the concrete parameters, as in the sketch below.
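A minimal sketch of such a converter, modeled on the Tensorflow converter above; the base class name and run signature on the Caffe path are assumptions here, so treat Pool.cpp as the authoritative form:

```c++
// Sketch of a Caffe converter (base class and signatures assumed; see Pool.cpp).
class Pool : public OpConverter {
public:
    virtual void run(MNN::OpT* dstOp, const caffe::LayerParameter& parameters,
                     const caffe::LayerParameter& weight) {
        auto pool = new MNN::PoolT; // the parameter table declared in CaffeOp.fbs
        const auto& p = parameters.pooling_param();
        // Copy kernel size, stride, pad and pooling type from p into pool here.
        dstOp->main.value = pool;
    }
    virtual MNN::OpType opType() { return MNN::OpType_Pooling; }
    virtual MNN::OpParameter type() { return MNN::OpParameter_Pool; }
};
```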

### ONNX model conversion

1. Add the Op conversion xxxOnnx.cpp and implement the following three functions:

```c++
MNN::OpType PoolingOnnx::opType() { return MNN::OpType_Pooling; }
MNN::OpParameter PoolingOnnx::type() { return MNN::OpParameter_Pool; }

void PoolingOnnx::run(MNN::OpT* dstOp, const onnx::NodeProto* onnxNode,
                      std::vector<const onnx::TensorProto*> initializers);
```

2. In `PoolingOnnx::run(MNN::OpT* dstOp, const onnx::NodeProto* onnxNode, std::vector<const onnx::TensorProto*> initializers)`, `onnxNode` carries the original ONNX node information; weights and other data must be fetched from `initializers`, as in the sketch below.
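For illustration, node attributes are read through the standard ONNX protobuf accessors; a minimal sketch (the `kernel_shape` attribute name is ONNX Pooling's convention, used here only as an example):

```c++
// Inside run(): scan the node's attributes (standard onnx protobuf API).
for (int i = 0; i < onnxNode->attribute_size(); ++i) {
    const auto& attr = onnxNode->attribute(i);
    if (attr.name() == "kernel_shape") {
        int kernelY = attr.ints(0); // repeated int64 attribute values
        int kernelX = attr.ints(1);
        // ... store into the Op's parameter table ...
    }
}
```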
### Tensorflow Lite model conversion
1. Add XXXTflite.cpp:

```c++
DECLARE_OP_COVERTER(XXXTflite);

// The following functions must be implemented:
XXXTflite::opType(bool quantizedModel);
XXXTflite::type(bool quantizedModel);
XXXTflite::run(MNN::OpT *dstOp, const std::unique_ptr<tflite::OperatorT> &tfliteOp,
               const std::vector<std::unique_ptr<tflite::TensorT> > &tfliteTensors,
               const std::vector<std::unique_ptr<tflite::BufferT> > &tfliteModelBuffer,
               const std::vector<std::unique_ptr<tflite::OperatorCodeT> > &tfliteOpSet,
               bool quantizedModel)
// Compared with the Tensorflow interface there is an extra quantizedModel parameter.
// If quantizedModel is true, the model is quantized and the Op must be converted
// to the corresponding quantized Op; if false, it is converted to a float Op.

// In run(), the input/output tensor indices must be set:
dstOp->inputIndexes.resize(1);
dstOp->outputIndexes.resize(1);
dstOp->inputIndexes[0]  = tfliteOp->inputs[0];
dstOp->outputIndexes[0] = tfliteOp->outputs[0];

// Register the Op converter:
using namespace tflite;
REGISTER_CONVERTER(SqueezeTflite, BuiltinOperator_SQUEEZE);
```
## Shape computation

Compute the output tensor's dimensions from the input tensor's dimensions, and set the output tensor's data type. Inherit from the base class SizeComputer and implement the onComputeSize function: return false if the input dimensions are still unknown, and true once the computation has completed. Pooling, for example:
```c++
class PoolSizeComputer : public SizeComputer {
public:
    virtual bool onComputeSize(const MNN::Op* op, const std::vector<Tensor*>& inputs,
                               const std::vector<Tensor*>& outputs) const override {
        MNN_ASSERT(1 == inputs.size());
        MNN_ASSERT(1 == outputs.size());
        auto input  = inputs[0];
        auto output = outputs[0];
        ::memcpy(output->buffer().dim, input->buffer().dim,
                 input->buffer().dimensions * sizeof(halide_dimension_t));

        // Pool parameters
        auto layer = op->main_as_Pool();
        int outw = 1;
        int outh = 1;
        int w = input->width();
        int h = input->height();
        if (layer->pad_x() > 0) {
            w += layer->pad_x() * 2;
        }
        if (layer->pad_y() > 0) {
            h += layer->pad_y() * 2;
        }

        if (layer->pad_type() == PoolPadType_SAME) { // Tensorflow padding mode SAME
            outw = ceil((float)w / (float)layer->stride_x());
            outh = ceil((float)h / (float)layer->stride_y());
        } else if (layer->pad_type() == PoolPadType_VALID) { // Tensorflow padding mode VALID
            outw = ceil((float)(w - layer->kernel_x() + 1) / (float)layer->stride_x());
            outh = ceil((float)(h - layer->kernel_y() + 1) / (float)layer->stride_y());
        } else {
            outw = UP_DIV(w - layer->kernel_x(), layer->stride_x()) + 1;
            outh = UP_DIV(h - layer->kernel_y(), layer->stride_y()) + 1;
        }

        // Return false when the input information is unknown
        if (outw <= 0 || outh <= 0) {
            return false;
        }
        // Pooling only changes height and width
        output->buffer().dim[3].extent = outw;
        output->buffer().dim[2].extent = outh;
        return true;
    }
};
// Register the shape computation
REGISTER_SHAPE(XXXComputer, OpType_XXX);
```

## CPU

Add CPUXXX.hpp and CPUXXX.cpp under the CPU backend directory.

1. Class declaration: inherit from the base class Execution; the main functions to implement are onResize() and onExecute():
```c++
class CPUPool : public Execution {
public:
    CPUPool(Backend *b, const Pool *parameter);
    virtual ~CPUPool() = default;
    // If onExecute needs a cache, acquire it in this function;
    // if no cache is needed, this function can be left undeclared.
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    // The actual Op execution
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;

private:
    const Pool *mParameter;
    Tensor mCacheLine;
};
```
2. Implementation: in onResize(), call backend()->onAcquireBuffer(&mCacheLine, Backend::DYNAMIC) to acquire the cache and backend()->onReleaseBuffer(&mCacheLine, Backend::DYNAMIC) to release it again, so the memory can be reused. In onExecute(), perform the necessary input checks to catch problems early, and return NO_ERROR once execution completes successfully. A sketch follows.
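A minimal sketch of this acquire/release pattern plus the creator registration (the cache shape, the OUT_OF_MEMORY handling, and the exact CPUBackend::Creator/REGISTER_CPU_OP_CREATOR signatures are assumptions modeled on the Metal and Vulkan creators shown below; check the real CPUPool.cpp for the authoritative form):

```c++
ErrorCode CPUPool::onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) {
    // Describe the scratch tensor, then acquire it from the backend's dynamic pool.
    mCacheLine.buffer().type          = halide_type_of<float>();
    mCacheLine.buffer().dimensions    = 1;
    mCacheLine.buffer().dim[0].extent = inputs[0]->width() * 4; // illustrative size
    if (!backend()->onAcquireBuffer(&mCacheLine, Backend::DYNAMIC)) {
        return OUT_OF_MEMORY;
    }
    // Releasing a DYNAMIC buffer here only marks it reusable by later Ops;
    // it stays valid for this Op's onExecute within the same inference.
    backend()->onReleaseBuffer(&mCacheLine, Backend::DYNAMIC);
    return NO_ERROR;
}

// Creator + registration (signature assumed):
class CPUPoolCreator : public CPUBackend::Creator {
public:
    virtual Execution *onCreate(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs,
                                const MNN::Op *op, Backend *backend) const override {
        return new CPUPool(backend, op->main_as_Pool());
    }
};
REGISTER_CPU_OP_CREATOR(CPUPoolCreator, OpType_Pooling);
```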

## Metal

Add MetalXXX.hpp and MetalXXX.cpp under the Metal directory.

1. Declaration

Inherit from the base class Execution and declare the constructor, the destructor, and the onResize and onExecute functions:

```c++
class MetalPooling : public Execution {
public:
    MetalPooling(Backend *backend, const Pool *pooling);
    virtual ~MetalPooling();
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;

private:
    bool mGlobal;
    PoolType mPoolType;
    int mKernelX;
    int mKernelY;
    int mStrideX;
    int mStrideY;
    int mPadX;
    int mPadY;
    id<MTLBuffer> mConstBuffer;
};
```

2. Implementation

- Unlike CPU tensors, which keep their data behind the host pointer, Metal keeps the data pointer in deviceId, and what deviceId stores is an id<MTLBuffer>:

  ```c++
  auto buffer = (__bridge id<MTLBuffer>)(void *)tensor->deviceId();
  ```
- Op-specific parameters and the like can be stored through an id<MTLBuffer>. The buffer's data type may differ from the tensor's, and a buffer may even mix several data types, as long as the correct length is specified when it is created. For example:

  ```c++
  auto buffer = [context newDeviceBuffer:2 * sizeof(int) + 2 * sizeof(__fp16) access:CPUWriteOnly];
  ((__fp16 *)buffer.contents)[0] = mAlpha / mLocalSize;  // alpha
  ((__fp16 *)buffer.contents)[1] = mBeta;                // beta
  ((int *)buffer.contents)[1]    = mLocalSize;           // local size
  ((int *)buffer.contents)[2]    = inputs[0]->channel(); // channel
  ```
- When creating a buffer, an access permission must be specified. There are currently three:
  - CPUReadWrite: data is shared between CPU and GPU; generally used for device buffers.
  - CPUWriteOnly: data is written by the CPU and never read back by it; generally used for parameter buffers.
  - CPUTransparent: data lives only on the GPU; generally used for heap buffers.

- MNNMetalContext has two similar sets of buffer-creation interfaces that differ only in data lifetime: memory occupied by a device buffer is never reused within a single inference pass, whereas memory occupied by a heap buffer can be reused by other Ops after -[MNNMetalContext releaseHeapBuffer:] is called. As a rule, heap is only used together with CPUTransparent.

  heap is only effective on iOS 10+; on iOS 9 and below it falls back to device.

- When using Metal, do not create your own device or library unless there is a special reason. Loading a library and compiling functions are time-consuming, and MNNMetalContext already does the necessary caching. Executing Metal through the context looks like this:

  ```c++
  auto context   = (__bridge MNNMetalContext *)backend->context();
  auto kernel    = /* metal kernel name NSString */;
  auto encoder   = [context encoder];
  auto bandwidth = [context load:kernel encoder:encoder];
  /* encoder set buffer(s)/sampler(s) */
  [context dispatchEncoder:encoder
                   threads:{x, y, z}
        maxThreadsPerGroup:maxThreadsPerThreadgroup]; // recommended way to dispatch
  [encoder endEncoding];
  ```

3. Implement the Metal Op Creator and register it:

```c++
class MetalPoolingCreator : public MetalBackend::Creator {
public:
    virtual Execution *onCreate(const std::vector<Tensor *> &inputs, const MNN::Op *op, Backend *backend) const {
        return new MetalPooling(backend, op->main_as_Pool());
    }
};
REGISTER_METAL_OP_CREATOR(MetalPoolingCreator, OpType_Pooling);
```

## Vulkan

1. Shader

Add the shader (*.comp) under the glsl directory. Pooling's input memory layout defaults to NC4HW4, so it is implemented with image; otherwise a buffer implementation is used.

Then run the makeshader.py script to compile the shaders.

2. Class declaration

```c++
class VulkanPool : public VulkanBasicExecution {
public:
    VulkanPool(const Op* op, Backend* bn);
    virtual ~VulkanPool();
    ErrorCode onEncode(const std::vector<Tensor*>& inputs, const std::vector<Tensor*>& outputs,
                       const VulkanCommandPool::Buffer* cmdBuffer) override;

private:
    // Parameters needed by the GPU shader
    std::shared_ptr<VulkanBuffer> mConstBuffer;
    // Pipeline
    const VulkanPipeline* mPoolPipeline;
    const Pool* mCommon;
    std::shared_ptr<VulkanPipeline::DescriptorSet> mDescriptorSet;
};
```

3. Implementation

Implement onEncode(): first check the memory layout (if it is NC4HW4, the shader uses the image implementation, otherwise the buffer one), and return NO_ERROR when encoding completes. A schematic outline follows.
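An outline of what onEncode() typically records, given the members declared above (a schematic sketch, not MNN's actual Pooling code):

```c++
ErrorCode VulkanPool::onEncode(const std::vector<Tensor*>& inputs, const std::vector<Tensor*>& outputs,
                               const VulkanCommandPool::Buffer* cmdBuffer) {
    // 1. Fill mConstBuffer with the shader's uniform parameters
    //    (kernel size, strides, pads, input/output extents).
    // 2. Bind the input/output images and mConstBuffer through mDescriptorSet.
    // 3. Record the dispatch of mPoolPipeline into cmdBuffer, with the
    //    work-group count derived from the output extents.
    return NO_ERROR;
}
```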

4. Implement the Vulkan Execution Creator and register it:

```c++
class VulkanPoolCreator : public VulkanBackend::Creator {
public:
    virtual Execution* onCreate(const std::vector<Tensor*>& inputs, const MNN::Op* op,
                                Backend* backend) const override {
        return new VulkanPool(op, backend);
    }
};
static bool gResistor = []() {
    VulkanBackend::addCreator(OpType_Pooling, new VulkanPoolCreator);
    return true;
}();
```

## OpenCL

1. Kernel

Add the kernel (*.cl) under the cl directory. Pooling is implemented with image; the memory layout is (H: batch * height, W: channel/4 * width * channel4).

```c
__kernel void pooling(GLOBAL_SIZE_3_DIMS __read_only image2d_t input, __private const int in_height,
                      __private const int in_width, __private const int out_height, __private const int pad_top,
                      __private const int pad_left, __private const int stride_h, __private const int stride_w,
                      __private const int pooling_size_h, __private const int pooling_size_w,
                      __write_only image2d_t output) {
    const int out_channel_idx = get_global_id(0);
    const int out_width_idx   = get_global_id(1);
    const int out_hb_idx      = get_global_id(2);

    if (out_channel_idx >= global_size_dim0 || out_width_idx >= global_size_dim1 || out_hb_idx >= global_size_dim2) {
        return;
    }
    const int out_width = global_size_dim1;

    const int n_b               = out_hb_idx / out_height;
    const int mod_b             = out_hb_idx - mul24(n_b, out_height);
    const int batch_idx         = mul24(n_b, in_height);
    const int in_height_start   = mad24(mod_b, stride_h, -pad_top);
    const int in_width_start    = mad24(out_width_idx, stride_w, -pad_left);
    const int in_channel_offset = mul24(out_channel_idx, in_width);

    DATA_TYPE4 res = (DATA_TYPE4)(MIN_VALUE);
    for (int height = 0; height < pooling_size_h; ++height) {
        int in_height_idx = in_height_start + height;
        in_height_idx = select(batch_idx + in_height_idx, -1, (in_height_idx < 0 || in_height_idx >= in_height));
        if (in_height_idx != -1) {
            for (int width = 0; width < pooling_size_w; ++width) {
                int in_width_idx = in_width_start + width;
                in_width_idx =
                    select(in_channel_offset + in_width_idx, -1, (in_width_idx < 0 || in_width_idx >= in_width));
                if (in_width_idx != -1) {
                    DATA_TYPE4 in = READ_IMAGE(input, SAMPLER, (int2)(in_width_idx, in_height_idx));
                    res = MNN_MAX(res, in);
                }
            }
        }
    }
    const int pos = mad24(out_channel_idx, out_width, out_width_idx);
    WRITE_IMAGE(output, (int2)(pos, out_hb_idx), res);
}
```

    Then execute opencl_codegen.py to generate the string map corresponding to the kernel.

Note on the macros used in the kernel:

a. GLOBAL_SIZE_3_DIMS: corresponds to the specified global work group size.
b. READ_IMAGE / WRITE_IMAGE: read from / write to the image.
c. DATA_TYPE: the specified data type (float/half/int32).
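For orientation, GLOBAL_SIZE_3_DIMS should expand to the three __private global-size arguments that the bounds check at the top of the kernel reads; a sketch of plausible definitions (illustrative only; the real definitions live with the kernels and may differ):

```c
#define GLOBAL_SIZE_3_DIMS \
    __private const int global_size_dim0, __private const int global_size_dim1, __private const int global_size_dim2,
#define DATA_TYPE4  float4      // or half4 / int4, chosen at build time
#define READ_IMAGE  read_imagef // or read_imageh / read_imagei, matching DATA_TYPE
#define WRITE_IMAGE write_imagef
```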

2. Class declaration

Add xxx.h and XXX.cpp under the OpenCL execution directory. The class inherits from Execution, as follows:

```c++
template <typename T>
class PoolOp : public Execution {
public:
    PoolOp(const std::vector<Tensor *> &inputs, const MNN::Op *op, Backend *backend);
    virtual ~PoolOp() = default;
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    bool buildPoolingKernel();
    std::vector<uint32_t> poolLocalWS(const std::vector<uint32_t> &gws, const uint32_t maxWorkGroupSize);

private:
    const Pool *mPoolParams;
    PoolType mPoolType;
    PoolPadType mPadType;
    std::vector<int> mStrides{1, 1};
    std::vector<int> mKernels{1, 1};
    std::vector<int> mPaddings{0, 0};
    std::vector<int> mDilations{1, 1};
    cl::Kernel mKernel;
    uint32_t mMaxWorkGroupSize;
    std::vector<int> mInputShape;
    OpenCLBackend *mOpenCLBackend;
};
```

3. Implementation

Implement onResize() (optional) and onExecute(); return NO_ERROR when execution completes.

4. Implement the OpenCL Op Creator and register it, as sketched below.
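A minimal sketch, assuming the OpenCLCreatorRegister/TypedCreator registration helpers of the OpenCL backend (the helper names and the cl_data_t type parameter may differ across MNN versions):

```c++
// Sketch: register PoolOp for OpType_Pooling via a file-scope registrar object.
OpenCLCreatorRegister<TypedCreator<PoolOp<cl_data_t>>> __Pool_op(OpType_Pooling);
```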

## OpenGL

1. Shader

- Add the shader (*.glsl) under OpenGL/glsl; no file header is needed.
- Run makeShader.py in that directory.

2. Executor

Add the executor under OpenGL/execution/; see GLPool.h and GLPool.cpp for reference.

3. Registration

OpenGL does not use the abstract-factory scheme; instead, modify GLBackend.cpp under the OpenGL directory, as sketched below.
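Since there is no creator registry here, the new Op is wired in directly where GLBackend creates executions; a schematic of the kind of change involved (the function and constructor signatures are assumptions, not the actual GLBackend.cpp code):

```c++
// Schematic: dispatch on the op type inside GLBackend's execution-creation path.
Execution* GLBackend::onCreate(const std::vector<Tensor*>& inputs, const std::vector<Tensor*>& outputs,
                               const MNN::Op* op) {
    switch (op->type()) {
        case OpType_Pooling:
            return new GLPool(op->main_as_Pool(), this); // hypothetical constructor
        // ... cases for the other supported ops ...
        default:
            return nullptr; // unsupported op
    }
}
```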