STREAM LOAD

Load data into a table in streaming mode.

SYNOPSIS

curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_host:http_port/api/{db}/{table}/_stream_load

DESCRIPTION

This statement loads data into the specified table. Unlike a normal load, this load method is synchronous.

This type of load still guarantees the atomicity of a batch of load tasks: either all data is loaded successfully or none of it is.

This operation also updates the data of the rollup tables associated with the base table.

This is a synchronous operation that returns the results to the user after the entire data load is completed.

Currently, HTTP chunked and non-chunked uploads are supported. For non-chunked mode, Content-Length must be used to indicate the length of the uploaded content, which ensures data integrity.

In addition, it is recommended to set the Expect header to 100-continue, which avoids unnecessary data transmission in certain error scenarios.

OPTIONS

Users can pass load parameters through HTTP request headers.

label

A label identifying this load. Data with the same label cannot be loaded more than once, so specifying a label lets users avoid loading the same data repeatedly.

Currently, Palo retains the labels of successful loads from the most recent 30 minutes.
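As an illustration of label-based deduplication, the following Python sketch derives the label from the payload, so that a retry of the same data reuses the same label. The helper make_label and the job name daily_orders are hypothetical and not part of Palo. Deduplication can only apply while the label is still retained.

```python
import hashlib


def make_label(job_name: str, payload: bytes) -> str:
    """Return a label that is identical for identical payloads (hypothetical helper)."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"{job_name}_{digest}"


# A retry with the same bytes yields the same label, so it can be sent
# again as -H "label:<value>" without loading the data twice.
payload = b"1\t10\n2\t20\n"
label = make_label("daily_orders", payload)
headers = {"label": label}
```

A content-derived label is only one possible choice; a timestamp or batch identifier works equally well as long as a retry reuses the original value.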

column_separator

Specifies the column separator in the load file. The default is \t. If the separator is an invisible character, write it as its hexadecimal code prefixed with \x.

For example, the separator \x01 used by Hive files is specified as -H "column_separator:\x01"
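The \x convention above can be sketched in Python as follows. The helper separator_header is hypothetical, and restricting the input to a single ASCII character is an assumption of this sketch, not a documented limit.

```python
def separator_header(sep: str) -> str:
    """Format a one-character separator as a column_separator header (hypothetical helper)."""
    if len(sep) != 1 or ord(sep) > 0x7F:
        raise ValueError("expected a single ASCII character")
    if sep.isprintable():
        # Visible characters are passed through unchanged.
        return f"column_separator:{sep}"
    # Invisible characters are written as \x followed by two hex digits.
    return f"column_separator:\\x{ord(sep):02x}"


hive_header = separator_header("\x01")
csv_header = separator_header(",")
```

The returned string can be passed to curl as the argument of -H.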

columns

Specifies the correspondence between the columns in the load file and the columns in the table. If the columns in the source file correspond exactly to the table schema, this field can be omitted. If they do not, this field is required to describe the conversion. A column can take two forms. The first corresponds directly to a field in the load file and is written as the field name.

The second is a derived column, written as column_name = expression. A few examples:

Example 1: There are three columns “c1, c2, c3” in the table. The three columns in the source file correspond to “c3, c2, c1” in order; then you need to specify -H "columns: c3, c2, c1"

Example 2: There are three columns in the table, “c1, c2, c3”. The first three columns in the source file correspond to them in order, but the file has one extra column; then you need to specify -H "columns: c1, c2, c3, xxx"

The last column only needs an arbitrary name to serve as a placeholder.
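The positional mapping in the two examples can be illustrated with a short Python sketch. It only models how names in the columns header are matched to fields by position; it is not how Palo implements the load, and it does not handle derived columns. The function map_row is hypothetical.

```python
def map_row(fields, columns_header, table_columns):
    """Assign file fields to table columns by the order given in the columns header."""
    names = [name.strip() for name in columns_header.split(",")]
    named = dict(zip(names, fields))
    # Names that are not table columns (placeholders) are dropped.
    return {col: named[col] for col in table_columns if col in named}


table = ["c1", "c2", "c3"]
# Example 1: the file stores the columns in reverse order.
reordered = map_row(["x", "y", "z"], "c3, c2, c1", table)
# Example 2: the file has an extra trailing column named xxx.
trimmed = map_row(["1", "2", "3", "junk"], "c1, c2, c3, xxx", table)
```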

where

Used to select the data to load. Rows that do not satisfy the condition are filtered out.

Example 1: to load only rows whose k1 column equals 20180601, specify -H "where: k1 = 20180601" when loading

max_filter_ratio

The maximum proportion of data that can be filtered out (for reasons such as data irregularity). The default is zero tolerance. Rows filtered out by the where condition do not count as irregular data.
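A small worked example of the tolerance check, in Python. It assumes the ratio is computed as filtered rows divided by the sum of filtered and loaded rows, and that a ratio exactly equal to the limit is accepted. Both points are assumptions of this sketch and are not stated above. Rows removed by the where condition are left out of the calculation.

```python
def within_tolerance(loaded: int, filtered: int, max_filter_ratio: float = 0.0) -> bool:
    """Check an assumed filter ratio against max_filter_ratio (illustrative only)."""
    total = loaded + filtered
    if total == 0:
        return True
    return filtered / total <= max_filter_ratio


# 20 irregular rows out of 100 is a ratio of 0.2.
ok_at_limit = within_tolerance(loaded=80, filtered=20, max_filter_ratio=0.2)
too_many = within_tolerance(loaded=79, filtered=21, max_filter_ratio=0.2)
# With the default of zero tolerance, a single irregular row fails the check.
zero_tolerance = within_tolerance(loaded=99, filtered=1)
```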

partitions

Specifies the partitions targeted by this load. If the user can determine which partitions the data belongs to, it is recommended to specify them. Data that does not belong to these partitions will be filtered out.

For example, to load into partitions p1 and p2, specify -H "partitions: p1, p2"

timeout

Specifies the timeout for the load, in seconds. The default is 600 seconds. The range is from 1 second to 259200 seconds.

strict_mode

Specifies whether strict load mode is enabled for this load. It is enabled by default. To disable it, specify -H "strict_mode: false".

timezone

Specifies the time zone used for this load. The default is the East Eight time zone (UTC+8). This parameter affects the results of all time zone related functions involved in the load.

exec_mem_limit

Memory limit for the load, in bytes. The default is 2 GB.
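For clients that build the request in code rather than with curl, the options above can be collected into a header dictionary. The following Python sketch covers a few of them. The function build_headers is hypothetical, and only the timeout range check comes from the text above.

```python
def build_headers(label=None, timeout=None, strict_mode=None, max_filter_ratio=None):
    """Collect a few Stream Load options into HTTP headers (hypothetical helper)."""
    # Expect: 100-continue is the header recommended in DESCRIPTION.
    headers = {"Expect": "100-continue"}
    if label is not None:
        headers["label"] = label
    if timeout is not None:
        if not 1 <= timeout <= 259200:
            raise ValueError("timeout must be between 1 and 259200 seconds")
        headers["timeout"] = str(timeout)
    if strict_mode is not None:
        headers["strict_mode"] = "true" if strict_mode else "false"
    if max_filter_ratio is not None:
        headers["max_filter_ratio"] = str(max_filter_ratio)
    return headers


headers = build_headers(label="123", timeout=100, strict_mode=False, max_filter_ratio=0.2)
```

Options that are left unset are omitted from the request, so the server-side defaults described above apply.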

RETURN VALUES

After the load is completed, information about this load is returned in JSON format. The fields currently included are:

Status: load status.
- Success: indicates that the load is successful and the data is visible.
- Publish Timeout: Indicates that the load job has been committed successfully, but for some reason the data is not immediately visible. Users can treat the load as successful and do not have to retry it.
- Label Already Exists: Indicates that the label is already occupied by another job, which either succeeded or is still loading. The user needs to use the get label state command to determine the subsequent operations.
- Other: The load failed. The user can retry the job with the same label.
NumberLoadedRows: The number of data rows loaded this time, valid only when the status is Success
NumberFilteredRows: The number of rows filtered out by this load, that is, the number of rows with unqualified data quality
NumberUnselectedRows: The number of rows filtered out by the where condition for this load
LoadBytes: The amount of source file data loaded this time
LoadTimeMs: Time spent on this load, in milliseconds
ErrorURL: URL of the specific content of the filtered data; only the first 1000 items are retained
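The status handling described above can be sketched in Python as follows. The sketch assumes the status is returned under the key Status; the sample response is invented for illustration and is not real server output. The function next_action and its return values are hypothetical.

```python
import json


def next_action(response_text: str) -> str:
    """Map the load status in a JSON response to a follow-up action (illustrative only)."""
    result = json.loads(response_text)
    status = result.get("Status")  # assumed key name for the load status
    if status in ("Success", "Publish Timeout"):
        # Publish Timeout means committed but not yet visible: no retry needed.
        return "done"
    if status == "Label Already Exists":
        # Another job holds the label; its state must be checked first.
        return "check_label"
    # Any other status: the load failed and can be retried with the same label.
    return "retry"


# Invented sample response using the field names listed above.
sample = '{"Status": "Success", "NumberLoadedRows": 10, "NumberFilteredRows": 0}'
action = next_action(sample)
```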

ERRORS

You can view the load error details with the following statement:

SHOW LOAD WARNINGS ON 'url'

where url is the URL given by ErrorURL.

EXAMPLES

Load the data from the local file ‘testData’ into the table ‘testTbl’ in the database ‘testDb’, using a label for deduplication and specifying a timeout of 100 seconds

curl --location-trusted -u root -H "label:123" -H "timeout:100" -T testData http://host:port/api/testDb/testTbl/_stream_load

Load the data from the local file ‘testData’ into the ‘testTbl’ table in the database ‘testDb’, allowing a 20% error rate (user is in default_cluster)

curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -T testData http://host:port/api/testDb/testTbl/_stream_load

Load the data from the local file ‘testData’ into the ‘testTbl’ table in the database ‘testDb’, allowing a 20% error rate and specifying the column names of the file (user is in default_cluster)

curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -H "columns: k2, k1, v1" -T testData http://host:port/api/testDb/testTbl/_stream_load

Load the data from the local file ‘testData’ into the p1 and p2 partitions of the ‘testTbl’ table in the database ‘testDb’, allowing a 20% error rate

curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -H "partitions: p1, p2" -T testData http://host:port/api/testDb/testTbl/_stream_load

Load using streaming mode (user is in default_cluster)

seq 1 10 | awk '{OFS="\t"}{print $1, $1 * 10}' | curl --location-trusted -u root -T - http://host:port/api/testDb/testTbl/_stream_load

Load a table with HLL columns. These can be columns in the table or columns generated from the data. You can also use hll_empty to fill columns that are not in the data

curl --location-trusted -u root -H "columns: k1, k2, v1=hll_hash(k1), v2=hll_empty()" -T testData http://host:port/api/testDb/testTbl/_stream_load

Load data with strict mode filtering enabled and set the time zone to Africa/Abidjan

curl --location-trusted -u root -H "strict_mode: true" -H "timezone: Africa/Abidjan" -T testData http://host:port/api/testDb/testTbl/_stream_load

Load a table with BITMAP columns, generating them from the data with to_bitmap and using bitmap_empty to fill columns that are not in the data

curl --location-trusted -u root -H "columns: k1, k2, v1=to_bitmap(k1), v2=bitmap_empty()" -T testData http://host:port/api/testDb/testTbl/_stream_load