Integrating Caffe2 on iOS/Android
If you would like to see a working Caffe2 implementation on mobile, currently Android only, check out this demo project.
The high-level flow is:
- Distribute (Asset Pipeline, Mobile Config, etc.) the models to devices.
- Instantiate a caffe2::Predictor instance (iOS) or Caffe2 instance (Android) to expose the model to your code.
- Pass inputs to the model, get outputs back.
Key objects:
- caffe2::NetDef - a (typically binary-serialized) Google Protobuf instance that encapsulates the computation graph and the pretrained weights; a deserialization sketch follows below.
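For instance, a minimal sketch of parsing a NetDef bundled with the app - the function name and path handling here are ours, for illustration only:

```cpp
#include <fstream>
#include <sstream>
#include <string>

#include "caffe2/proto/caffe2.pb.h"

// Parse a binary-serialized NetDef from disk; the path comes from wherever
// your asset pipeline placed the model file.
caffe2::NetDef loadNetDef(const std::string& path) {
  std::ifstream file(path, std::ios::binary);
  std::stringstream buffer;
  buffer << file.rdbuf();

  caffe2::NetDef net;
  if (!net.ParseFromString(buffer.str())) {
    // Handle a missing, corrupt, or truncated model file here.
  }
  return net;
}
```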
Caffe2 is composed of:
- A core library, comprising the Workspace, Blob, Net, and Operator classes.
- An operator library: a range of Operator implementations (such as convolution, etc.)
It’s pure C++, with the only non-optional dependencies being:
- Google Protobuf (the lite version, ~300kb)
- Eigen - a BLAS is required on Android for certain primitives, and Eigen, a vectorized vector/matrix manipulation library, is the fastest benchmarked on ARM.
For some use cases you can also bundle NNPACK, which specifically optimizes convolutions on ARM. It’s optional (but recommended).
Intuitive overview
A model consists of two parts: a set of weights (typically floating-point numbers) that represent the learned parameters, which are updated during training, and a set of ‘operations’ that form a computation graph describing how to combine the input data (which varies with each graph pass) with the learned parameters (which are constant across passes). The parameters (and the intermediate states in the computation graph) live in a Workspace, which is essentially a map from names to Blob instances, where a Blob represents an arbitrarily typed pointer - typically to a TensorCPU, an n-dimensional array (a la Python’s numpy ndarray, Torch’s Tensor, etc.).
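A minimal sketch of how these classes fit together, using the Caffe2 core C++ API (the blob name and shape are illustrative):

```cpp
#include "caffe2/core/tensor.h"
#include "caffe2/core/workspace.h"

void fillInput(caffe2::Workspace& ws) {
  // A Blob is an arbitrarily typed pointer; here it will hold a TensorCPU.
  caffe2::Blob* blob = ws.CreateBlob("data");
  auto* tensor = blob->GetMutable<caffe2::TensorCPU>();

  // An n-dimensional array, shaped here like one 224x224 RGB image.
  tensor->Resize(1, 3, 224, 224);
  float* values = tensor->mutable_data<float>();
  // `values` now points at 1*3*224*224 floats owned by the Workspace.
}
```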
The core class is caffe2::Predictor, which exposes the constructor:
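(Reconstructed from the Predictor header of this era; the exact signature, e.g. an optional parent Workspace argument, may vary between releases.)

```cpp
Predictor(const NetDef& init_net, const NetDef& predict_net);
```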
where the two NetDef inputs are Google Protocol Buffer objects that represent the two computation graphs described above - the init_net typically runs a set of operations that deserialize the weights into the Workspace, and the predict_net specifies how to execute the computation graph for each input.
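Putting the pieces together, a hedged end-to-end sketch (run and TensorVector follow the predictor header of this era; the header path, shape, and preprocessing are illustrative):

```cpp
#include "caffe2/core/predictor.h"

void runOnce(const caffe2::NetDef& init_net, const caffe2::NetDef& predict_net) {
  caffe2::Predictor predictor(init_net, predict_net);

  caffe2::TensorCPU input;
  input.Resize(1, 3, 224, 224);
  float* data = input.mutable_data<float>();
  // ... fill `data` with your preprocessed pixels ...

  caffe2::Predictor::TensorVector inputs{&input};
  caffe2::Predictor::TensorVector outputs;
  predictor.run(inputs, &outputs);

  // outputs[0] points into the Predictor's Workspace; copy it out if you
  // need it to outlive the next run() call.
}
```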
Usage considerations
The Predictor is a stateful class - typically the flow is to instantiate it once and reuse it for multiple requests. The setup overhead can be trivial or substantial, depending on the use case. The constructor does the following:
- Constructs the Workspace object
- Executes the init_net, allocating memory and setting the values of the parameters.
One key point is that all the initialization is in a sense “statically” verifiable - if the constructor fails (by throwing an exception) on one machine, then it will always fail on every machine. Before exporting the NetDef instances, verify that Net construction can execute correctly.
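One way to do this at export time, sketched in Python (the input blob name "data" and its shape are assumptions for your model):

```python
import numpy as np
from caffe2.python import workspace

# init_net and predict_net are the NetDef protos you are about to export.
# Run the init net once: this creates and fills every parameter blob.
workspace.RunNetOnce(init_net)

# Feed a dummy input so net construction can resolve all external inputs.
workspace.FeedBlob("data", np.random.randn(1, 3, 224, 224).astype(np.float32))

# CreateNet raises here, rather than on the device, if the predict net is
# malformed (missing blobs, bad operator arguments, etc.).
workspace.CreateNet(predict_net)
```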
Performance considerations
For a convolutional model, it is recommended to use NNPACK, since it is substantially faster (~2x-3x) than the standard implementation used in most frameworks. Set OperatorDef::engine to NNPACK on the convolution operators. Example:
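A sketch of an export-time pass over the predict net (the helper name pick_engines is ours; it assumes you hold the NetDef protobuf in Python):

```python
import copy

def pick_engines(net):
    """Return a copy of `net` with convolutions routed to NNPACK."""
    net = copy.deepcopy(net)
    for op in net.op:
        if op.type == "Conv":
            op.engine = "NNPACK"
    return net
```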
For non-convolutional (e.g. ranking) workloads, the key computational primitives are often fully-connected layers (e.g. FullyConnectedOp in Caffe2, InnerProductLayer in Caffe, nn.Linear in Torch). For these use cases, you can fall back to a BLAS library - specifically Accelerate on iOS and Eigen on Android.
Memory considerations
The memory usage of an instantiated and run Predictor is roughly the sum of the size of the weights and the total size of the activations. There is no ‘static’ memory allocated; all allocations are tied to the Workspace instance owned by the Predictor, so there should be no memory impact after all Predictor instances are deleted.
It’s recommended before exporting to run something like:
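A sketch using Caffe2’s Python memonger module (the exact helper and its signature may differ between releases; "data" is a placeholder for your model’s input blob):

```python
from caffe2.python import core, memonger

# Wrap the predict NetDef and let memonger rewrite it so that activation
# blobs are reused where the graph topology allows it.
predict_net = core.Net(predict_net_proto)
optimized_net = memonger.optimize_inference_for_dag(predict_net, ["data"])
```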
This will automatically share activations where valid in the topological ordering of the graph (see the memonger module for a more detailed discussion).
Startup considerations on iOS
Caffe2 registers its Operator implementations through static initializers (e.g. the REGISTER_CPU_OPERATOR macro), which run when the binary is loaded and therefore contribute to app startup time on iOS.