Data Input
One way to feed data into OneFlow is to pass a NumPy ndarray object directly as a parameter to the job function.
Another approach is to use DataLoader of OneFlow and its related operators. It can load and pre-process datasets of a particular format from the file system.
Working directly with NumPy data is easy and convenient, but only for small amounts of data. When the dataset is too large, preparing the NumPy data can become a bottleneck. This approach is therefore better suited to the early stages of a project, when the goal is to quickly validate and improve the algorithm.
The DataLoader of OneFlow uses techniques such as multi-threading and data pipelining to make data loading and preprocessing more efficient. However, the dataset must be in a format that OneFlow already supports; otherwise, you need to develop your own DataLoader for that format. We therefore recommend this approach for mature projects.
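To illustrate why multi-threading and pipelining help, here is a minimal sketch in plain Python. It is not OneFlow's implementation: `load_sample`, `preprocess`, and `prefetching_loader` are hypothetical names. A background thread fills a bounded queue, so loading and preprocessing can overlap with the work of the consumer.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def load_sample(index):
    """Hypothetical loader: stands in for reading one record from disk."""
    return index


def preprocess(sample):
    """Hypothetical preprocessing step: stands in for decoding or resizing."""
    return sample * 2


def prefetching_loader(num_samples, buffer_size=4):
    """Yield preprocessed samples while a worker thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)

    def worker():
        for i in range(num_samples):
            # put() blocks when the buffer is full, so the worker never
            # runs more than buffer_size samples ahead of the consumer.
            buffer.put(preprocess(load_sample(i)))
        buffer.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _SENTINEL:
            break
        yield item


print(list(prefetching_loader(5)))  # [0, 2, 4, 6, 8]
```

With a single worker and a FIFO queue, the samples arrive in their original order. A real DataLoader would typically use several workers and handle shuffling and batching as well.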
We can directly use a NumPy ndarray as data input during training or prediction with OneFlow:
You can download the code and run it:
The following output is expected:
(32, 1, 28, 28) (32,)
In the example, NumPy data (`images_in` and `labels_in`) is generated randomly according to the shape and data type requirements of the job function.
The NumPy data `images_in` and `labels_in` is then passed directly as parameters when the job function is called.
images, labels = test_job(images_in, labels_in)
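As a rough illustration of this step, the sketch below generates random NumPy data with the shapes printed above. It uses only NumPy and does not call OneFlow. `check_input` is a hypothetical helper, and the dtypes `float32` and `int32` are assumptions, since the job function definition is not reproduced here.

```python
import numpy as np

BATCH_SIZE = 32

# Assumed input requirements: shapes come from the output above,
# dtypes are assumptions made for this sketch.
IMAGE_SPEC = ((BATCH_SIZE, 1, 28, 28), np.float32)
LABEL_SPEC = ((BATCH_SIZE,), np.int32)


def check_input(array, spec):
    """Hypothetical helper: raise if `array` does not match (shape, dtype)."""
    shape, dtype = spec
    if array.shape != shape or array.dtype != np.dtype(dtype):
        raise ValueError(
            f"expected {shape} {np.dtype(dtype)}, got {array.shape} {array.dtype}"
        )
    return array


# Random images in [-10, 10) and random class labels in [0, 10).
images_in = np.random.uniform(-10, 10, IMAGE_SPEC[0]).astype(IMAGE_SPEC[1])
labels_in = np.random.randint(0, 10, LABEL_SPEC[0]).astype(LABEL_SPEC[1])

check_input(images_in, IMAGE_SPEC)
check_input(labels_in, LABEL_SPEC)
print(images_in.shape, labels_in.shape)  # (32, 1, 28, 28) (32,)
```

Checking shape and dtype before the call makes mismatches show up as a clear error in your own code rather than inside the framework.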
`oneflow.typing.Numpy.Placeholder` is the placeholder for a NumPy `ndarray`. OneFlow also provides other placeholders that can represent more complex forms of NumPy data. For more details, please refer to The Definition and Call of Job Function.
Under the `oneflow.data` module, there are DataLoader operators for loading datasets and associated data preprocessing operators. A DataLoader is usually named `data.xxx_reader`, such as the existing `data.ofrecord_reader` and `data.coco_reader`, which support OneFlow's native `OFRecord` format and the COCO dataset, respectively.
In addition, there are data preprocessing operators that process the data after the DataLoader has loaded it. The following code uses `data.OFRecordImageDecoderRandomCrop` for random image cropping and `data.OFRecordRawDecoder` for image decoding. You can refer to the API documentation for more details.
The following example reads an OFRecord file and processes images from the ImageNet dataset. The complete code is in the script `of_data_pipeline.py`.
This script requires an OFRecord dataset; you can make your own according to [this article](./extended_topics/how_to_make_of_dataset.md).
The following commands run the script with our pre-prepared dataset:
wget https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/docs/basics_topics/part-00000
python3 of_data_pipeline.py
The following output is expected:
Code Explanation
There are generally two stages in using the OneFlow DataLoader: loading data and preprocessing data.
`flow.data.ofrecord_reader` in the script is responsible for loading data from the file system into memory.
ofrecord = flow.data.ofrecord_reader(
    "path/to/ImageNet/ofrecord",
    batch_size=batch_size,
    data_part_num=1,
    part_name_suffix_length=5,
    random_shuffle=True,
    shuffle_after_epoch=True,
)
The first argument specifies the directory where the OFRecord files are located. For the other parameters, please refer to the API documentation.
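To make `data_part_num` and `part_name_suffix_length` more concrete, the sketch below lists the file names such a reader would look for. It assumes that the files use a `part-` prefix followed by a zero-padded index, as in the `part-00000` file downloaded above. `expected_part_files` is a hypothetical helper written for this illustration, not a OneFlow API.

```python
import os


def expected_part_files(data_dir, data_part_num, part_name_suffix_length, prefix="part-"):
    """Hypothetical helper: build the assumed part file paths of an OFRecord dataset."""
    return [
        # Zero-pad the part index to the requested suffix length.
        os.path.join(data_dir, f"{prefix}{i:0{part_name_suffix_length}d}")
        for i in range(data_part_num)
    ]


files = expected_part_files(
    "path/to/ImageNet/ofrecord", data_part_num=1, part_name_suffix_length=5
)
print([os.path.basename(f) for f in files])  # ['part-00000']
```

Under this assumption, a dataset split into several parts would need `data_part_num` to match the number of files in the directory.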
If the return value of the DataLoader is a basic data type, it can be used directly as input to a downstream operator. Otherwise, a data preprocessing operator must be called to preprocess it further.
For example, in the script:
image = flow.data.OFRecordImageDecoderRandomCrop(
    ofrecord, "encoded", color_space=color_space
)
rsz = flow.image.Resize(
    image, resize_x=224, resize_y=224, color_space=color_space
)
rng = flow.random.CoinFlip(batch_size=batch_size)
normal = flow.image.CropMirrorNormalize(
    rsz,
    mirror_blob=rng,
    color_space=color_space,
    mean=[123.68, 116.779, 103.939],
    std=[58.393, 57.12, 57.375],
    output_dtype=flow.float,
)
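The arithmetic suggested by the last step can be sketched in NumPy. This is not OneFlow's implementation. It assumes that each image is a height × width × 3 array, that the coin flip decides per image whether to mirror horizontally, and that the output uses a channel-first layout. `mirror_normalize` is a hypothetical name; the mean and standard deviation are the values from the script.

```python
import numpy as np

MEAN = np.array([123.68, 116.779, 103.939], dtype=np.float32)
STD = np.array([58.393, 57.12, 57.375], dtype=np.float32)


def mirror_normalize(image_hwc, mirror):
    """Flip horizontally if `mirror` is truthy, then normalize each channel.

    Assumes `image_hwc` has shape (height, width, 3).
    Returns a float32 array of shape (3, height, width).
    """
    image = image_hwc.astype(np.float32)
    if mirror:
        image = image[:, ::-1, :]  # reverse the width axis
    image = (image - MEAN) / STD  # broadcasts over the channel axis
    return np.ascontiguousarray(image.transpose(2, 0, 1))  # HWC -> CHW


rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 224, 224, 3), dtype=np.uint8)
flips = rng.integers(0, 2, size=4)  # one coin flip per image
normal = np.stack([mirror_normalize(img, flip) for img, flip in zip(batch, flips)])
print(normal.shape, normal.dtype)  # (4, 3, 224, 224) float32
```

Random mirroring is a common augmentation for image classification, and subtracting the per-channel mean and dividing by the standard deviation puts the inputs on a comparable scale.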
OneFlow provides a number of DataLoaders and preprocessing operators; refer to `oneflow.data` for details. These operators will be enriched and optimized in the future, and users can also customize the DataLoader to meet specific needs.