Bulk export

    This page documents bulk export for YugabyteDB’s Cassandra-compatible API. To export data from a YugabyteDB (or even an Apache Cassandra) table, you can use the cassandra-unloader tool.

    We will first create a source YugabyteDB table and populate it with data. Then we will export the data using the cassandra-unloader tool. A generic gaming user profile use case serves as the running example for the export process.

        # Sample usage:
        # To generate a 10 GB (10240 MB) file:
        # % python gen_csv.py <outfile_name> <outfile_size_MB>
        # % python gen_csv.py file01.csv 10240
        #
        import numpy as np
        import uuid
        import csv
        import os
        import sys

        outfile = sys.argv[1]          # output file name
        outsize_mb = int(sys.argv[2])  # target file size in MB
        print("Outfile = " + outfile)
        print("Outfile Size (MB) = " + str(outsize_mb))
        chunksize = 10000              # rows written per iteration
        with open(outfile, 'a') as csvfile:
            # Keep appending chunks until the file reaches the target size.
            while (os.path.getsize(outfile) // 1024**2) < outsize_mb:
                data = [[uuid.uuid4() for i in range(chunksize)],
                        np.random.random(chunksize) * 1000,
                        np.random.random(chunksize) * 50,
                        np.random.randint(1000000, size=(chunksize,)),
                        [uuid.uuid4() for i in range(chunksize)]]
                csvfile.writelines(['%s,%.6f,%.6f,%i,%s\n' % row for row in zip(*data)])
                csvfile.flush()  # so that getsize() sees the rows just written

    Sample rows generated by the script can be inspected as follows.

        $ head file00.csv
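
    Each row written by the generator has the shape `uuid,float,float,int,uuid`. A minimal sketch of a sanity check for that format follows; the helper names `parse_row` and `check_csv` are hypothetical, the sample row uses made-up values, and the column names are assumed to follow the order of the schema used later on this page.

```python
import csv
import io
import uuid


def parse_row(fields):
    """Convert one CSV row into typed values; raises ValueError if malformed."""
    if len(fields) != 5:
        raise ValueError("expected 5 columns, got %d" % len(fields))
    user_id, score1, score2, points, object_id = fields
    return (uuid.UUID(user_id), float(score1), float(score2),
            int(points), uuid.UUID(object_id))


def check_csv(stream, limit=10):
    """Parse up to `limit` rows from an open text stream and return them."""
    rows = []
    for i, fields in enumerate(csv.reader(stream)):
        if i >= limit:
            break
        rows.append(parse_row(fields))
    return rows


# Illustrative row with made-up values, in the format the generator writes.
sample = ("9db2f6a0-1c3b-4c1e-8f0a-2b7d5e6f7a81,512.250000,12.500000,4242,"
          "0e4f8c1d-7a2b-4d3c-9e5f-1a2b3c4d5e6f\n")
rows = check_csv(io.StringIO(sample))
print(rows[0][1], rows[0][3])
```

    To check a real file, pass an open file object instead of the in-memory sample, for example `check_csv(open("file00.csv", newline=""))`.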

    To generate 5 CSV files of about 5 GB each, run the following commands.

        python ./gen_csv.py file00.csv 5120 &
        python ./gen_csv.py file01.csv 5120 &
        python ./gen_csv.py file02.csv 5120 &
        python ./gen_csv.py file03.csv 5120 &
        python ./gen_csv.py file04.csv 5120 &
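
    The background jobs above can also be started from a single Python driver. This is only a sketch: `gen_commands` and `run_all` are hypothetical helpers, and it assumes `gen_csv.py` sits in the current directory and that the files are named `file00.csv` through `file04.csv`.

```python
import subprocess
import sys


def gen_commands(count, size_mb, script="./gen_csv.py"):
    """Build one gen_csv.py command line per output file (file00.csv, ...)."""
    return [[sys.executable, script, "file%02d.csv" % i, str(size_mb)]
            for i in range(count)]


def run_all(commands):
    """Start every command in the background, then wait for all of them.

    Returns the list of exit codes, in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


commands = gen_commands(5, 5120)
for cmd in commands:
    print(" ".join(cmd[1:]))
# To actually generate the files (about 25 GB in total), call:
#     run_all(commands)
```

    Waiting on every process makes it easy to confirm that all generators exited with status 0 before the load step starts.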

    To load the generated files into the table, first download the cassandra-loader tool as shown below.

        $ wget https://github.com/yugabyte/cassandra-loader/releases/download/v0.0.27-yb-2/cassandra-loader

    The files can be queued up for loading one at a time. Sample invocation:

        ./cassandra-loader \
            -schema "load.users(user_id, score1, score2, points, object_id)" \
            -boolStyle 1_0 \
            -numFutures 1000 \
            -rate 10000 \
            -queryTimeout 65 \
            -numRetries 10 \
            -progressRate 200000 \
            -host <clusterNodeIP> \
            -f file01.csv
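
    Queuing the files one at a time can be scripted. The sketch below reuses only the flags from the sample invocation; `loader_command` and `load_files` are hypothetical helpers, and the loader binary is assumed to be executable at `./cassandra-loader`.

```python
import shlex
import subprocess

SCHEMA = "load.users(user_id, score1, score2, points, object_id)"


def loader_command(csv_path, host, loader="./cassandra-loader"):
    """Build the cassandra-loader argument list for one CSV file."""
    return [loader,
            "-schema", SCHEMA,
            "-boolStyle", "1_0",
            "-numFutures", "1000",
            "-rate", "10000",
            "-queryTimeout", "65",
            "-numRetries", "10",
            "-progressRate", "200000",
            "-host", host,
            "-f", csv_path]


def load_files(csv_paths, host, run=subprocess.run):
    """Load files sequentially; stop at the first failure.

    Returns the list of files that loaded successfully.
    """
    loaded = []
    for path in csv_paths:
        result = run(loader_command(path, host))
        if result.returncode != 0:
            break
        loaded.append(path)
    return loaded


# Print the command for one file; replace the placeholder with a node address.
print(shlex.join(loader_command("file01.csv", "<clusterNodeIP>")))
```

    Stopping at the first non-zero exit code keeps a failed load from being masked by later files; `load_files(["file00.csv", "file01.csv"], host)` would run the real loader once per file.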
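
    To export the data, download the cassandra-unloader tool.

        $ wget https://github.com/brianmhess/cassandra-loader/releases/download/v0.0.27/cassandra-unloader

    Then run it against the table to write the rows out in CSV format.

        ./cassandra-unloader \
            -schema "load.users(user_id, score1, score2, points, object_id)" \
            -boolStyle 1_0 \
            -host <clusterNodeIP> \
            -f outfile.csv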
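
    After an export, it can be useful to confirm that every loaded key came back out. The following is a minimal sketch, not part of either tool: `read_keys` and `compare_exports` are hypothetical helpers, `user_id` is assumed to be the first column in both the source and exported files, and all keys are assumed to fit in memory (which will not hold for the full dataset generated above). The demonstration uses made-up data in temporary files.

```python
import csv
import os
import tempfile


def read_keys(paths, key_column=0):
    """Return the set of values found in `key_column` across all CSV files."""
    keys = set()
    for path in paths:
        with open(path, newline="") as f:
            for fields in csv.reader(f):
                if fields:  # skip blank lines
                    keys.add(fields[key_column])
    return keys


def compare_exports(source_paths, exported_paths):
    """Return (missing, unexpected): keys only in the source, keys only in the export."""
    source = read_keys(source_paths)
    exported = read_keys(exported_paths)
    return source - exported, exported - source


# Tiny self-contained demonstration with made-up data in temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.csv")
    out = os.path.join(tmp, "out.csv")
    with open(src, "w") as f:
        f.write("a,1.0,2.0,3,x\nb,4.0,5.0,6,y\n")
    with open(out, "w") as f:
        f.write("b,4.0,5.0,6,y\n")  # row "a" is deliberately missing
    missing, unexpected = compare_exports([src], [out])
    print(sorted(missing), sorted(unexpected))
```

    Comparing keys rather than whole rows avoids false mismatches from row ordering or from differences in how floating-point values are formatted on export.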

    For additional options to cassandra-unloader, see the tool's documentation in the cassandra-loader repository.