STREAM LOAD
description
NAME:
stream-load: load data into a table in streaming mode
SYNOPSIS
curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT \
http://fe_host:http_port/api/{db}/{table}/_stream_load
DESCRIPTION
This statement is used to import data into the specified table. Unlike a normal Load, this method imports data synchronously.
This import method still guarantees the atomicity of a batch of imported data: either all of the data is imported successfully, or all of it fails.
This operation also updates the data of the rollup tables associated with the base table.
Because this is a synchronous operation, the import result is returned to the user after the entire import finishes.
Currently, both HTTP chunked and non-chunked uploads are supported. For non-chunked uploads, the Content-Length header must be set to indicate the length of the uploaded content, which ensures data integrity.
In addition, it is recommended to set the Expect: 100-continue request header to avoid unnecessary data transmission in some error scenarios.
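For example, a minimal request sketch that sets this header (fe_host, http_port, example_db, example_tbl, and data.file are placeholders):
curl --location-trusted -u user:passwd -H "Expect: 100-continue" -T data.file \
    http://fe_host:http_port/api/example_db/example_tbl/_stream_load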
OPTIONS
Users can pass in import parameters through the HTTP request headers.
label: the label of this import. Data with the same label cannot be imported more than once, so users can avoid importing the same data repeatedly by specifying a label.
Currently, StarRocks keeps the labels of recently successful imports for 30 minutes.
column_separator: used to specify the column separator in the import file. The default is \t. If the separator is an invisible character, it must be prefixed with \x and written in hexadecimal.
For example, the separator \x01 of a Hive file should be specified as -H "column_separator: \x01".
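A full command sketch for that case (the label, file name, database, and table are illustrative):
curl --location-trusted -u root -H "label: hive_file_1" -H "column_separator: \x01" \
    -T hive_data http://host:port/api/testDb/testTbl/_stream_load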
columns: used to specify the mapping between the columns in the import file and the columns in the table. If the columns in the source file correspond exactly to the table schema, this field does not need to be specified.
If the source file does not correspond to the table schema, this field is needed to perform some data conversion. Columns can take two forms: one maps directly to a field in the import file and is written simply as the field name;
the other is a derived column, with the syntax column_name = expression. A few examples help to illustrate this.
Example 1: the table has three columns "c1, c2, c3", and the three columns in the source file correspond to "c3, c2, c1" in turn; then you need to specify -H "columns: c3, c2, c1".
Example 2: the table has three columns "c1, c2, c3". The first three columns in the source file correspond in turn, but the file contains one extra column; then you need to specify -H "columns: c1, c2, c3, xxx";
the last column can be given any name as a placeholder.
Example 3: the table has three columns "year, month, day", and the source file contains only one time column in the format "2018-06-01 01:02:03";
then you can specify -H "columns: col, year=year(col), month=month(col), day=day(col)" to complete the import.
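Example 3 expressed as a complete command sketch (the label and file name are illustrative):
curl --location-trusted -u root -H "label: time_split_1" \
    -H "columns: col, year=year(col), month=month(col), day=day(col)" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load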
where: used to filter the data to be imported. If users need to filter out unwanted data, they can set this option.
Example 1: to import only rows whose column k1 equals 20180601, specify -H "where: k1 = 20180601" during the import.
max_filter_ratio: the maximum proportion of data that is allowed to be filtered out (due to data irregularity and similar reasons). The default is zero tolerance. Rows filtered out by the where condition do not count toward this ratio.
partitions: used to specify the partitions targeted by this import. If users can determine which partitions the data belongs to, specifying this item is recommended. Data that does not belong to these partitions will be filtered out.
For example, to import into partitions p1 and p2, specify -H "partitions: p1, p2".
timeout: specifies the import timeout in seconds. The default is 600 seconds. The settable range is 1 second to 259200 seconds.
strict_mode: specifies whether strict mode is enabled for this import. It is off by default. It can be enabled with -H "strict_mode: true".
timezone: specifies the time zone used for this import. The default is UTC+8 (the East Eight zone). This parameter affects the results of all time zone related functions involved in the import.
exec_mem_limit: the memory limit for the import, in bytes. The default is 2 GB.
format: specifies the format of the imported data. The default is csv; json is also supported.
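Several of these headers can be combined in a single request. For example, a sketch that loads a csv file with a 300-second timeout and a 4 GB memory limit (the label, file name, and values are illustrative):
curl --location-trusted -u root -H "label: csv_1" -H "format: csv" \
    -H "timeout: 300" -H "exec_mem_limit: 4294967296" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load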
jsonpaths: there are two ways to import json data: simple mode and matched mode.
Simple mode: the mode used when the jsonpaths parameter is not set. In this mode the json data is required to be an object, for example:
{"k1": 1, "k2": 2, "k3": "hello"}, where k1, k2 and k3 are column names.
Matched mode: used when the json data is relatively complex and the corresponding values need to be matched through the jsonpaths parameter.
strip_outer_array: Boolean type. true means that the json data starts with an array and the array should be flattened. The default value is false. For example:
[
{"k1" : 1, "v1" : 2},
{"k1" : 3, "v1" : 4}
]
When strip_outer_array is true, two rows of data will be generated when the data is finally imported into StarRocks.
json_root: json_root is a legal jsonpath string used to specify the root node of the json document. The default value is "" (empty).
RETURN VALUES
After the import is completed, the relevant content of the import is returned in JSON format. The following fields are currently included:
Status: the final status of the import.
Success: indicates that the import succeeded and the data is visible;
Publish Timeout: indicates that the import job has been successfully committed, but for some reason the data is not yet visible. The user can treat it as successful and does not have to retry the import.
Label Already Exists: indicates that the label is already in use by another job, which may have been imported successfully or may still be importing.
Users need to use the get label state command to determine the subsequent operation.
Other: the import failed. The user can specify a label to retry the job.
Message: a detailed description of the import status. The specific failure reason is returned when the import fails.
NumberTotalRows: the total number of rows read from the data stream.
NumberLoadedRows: the number of rows imported, valid only when the Status is Success.
NumberFilteredRows: the number of rows filtered out in this import, that is, rows with unqualified data quality.
NumberUnselectedRows: the number of rows filtered out by the where condition in this import.
LoadBytes: the volume of source file data imported this time.
LoadTimeMs: the time taken by this import, in milliseconds.
ErrorURL: a URL for viewing the specific content of the filtered rows. Only the first 1000 rows are kept.
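For reference, a successful response might look like the following sketch (the field values are illustrative, and the exact set of fields can vary by version):
{
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 1000,
    "NumberLoadedRows": 1000,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 40888,
    "LoadTimeMs": 2144
}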
ERRORS
You can view the import error details with the following statement:
SHOW LOAD WARNINGS ON 'url'
where url is the URL given by ErrorURL.
example
Import the data in the local file 'testData' into the table 'testTbl' in the database 'testDb', use a label for deduplication, and specify a timeout of 100 seconds.
curl --location-trusted -u root -H "label:123" -H "timeout:100" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Import the data in the local file 'testData' into the table 'testTbl' in the database 'testDb', use a label for deduplication, and only import rows whose k1 equals 20180601.
curl --location-trusted -u root -H "label:123" -H "where: k1=20180601" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Import the data in the local file 'testData' into the table 'testTbl' in the database 'testDb', allowing an error rate of 20% (the user is in the default_cluster).
curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Import the data in the local file 'testData' into the table 'testTbl' in the database 'testDb', allow an error rate of 20%, and specify the column names of the file (the user is in the default_cluster).
curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" \
    -H "columns: k2, k1, v1" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Import the data in the local file 'testData' into the p1 and p2 partitions of the table 'testTbl' in the database 'testDb', allowing an error rate of 20%.
curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" \
    -H "partitions: p1, p2" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Import using streaming (the user is in the default_cluster).
seq 1 10 | awk '{OFS="\t"}{print $1, $1 * 10}' | curl --location-trusted -u root -T - \
    http://host:port/api/testDb/testTbl/_stream_load
Import into a table containing HLL columns. The HLL columns can be generated from columns in the data, or filled with hll_empty when the corresponding column is not present in the data.
curl --location-trusted -u root \
    -H "columns: k1, k2, v1=hll_hash(k1), v2=hll_empty()" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Import data with strict mode filtering, and set the time zone to Africa/Abidjan.
curl --location-trusted -u root -H "strict_mode: true" \
    -H "timezone: Africa/Abidjan" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Import into a table containing BITMAP columns. The BITMAP columns can be generated from columns in the data, or filled with bitmap_empty when the corresponding column is not present in the data.
curl --location-trusted -u root \
    -H "columns: k1, k2, v1=to_bitmap(k1), v2=bitmap_empty()" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Simple mode, importing json data
Table schema: `category` varchar(512) NULL COMMENT "", `author` varchar(512) NULL COMMENT "", `title` varchar(512) NULL COMMENT "", `price` double NULL COMMENT ""
json data format:
{"category":"C++","author":"avc","title":"C++ primer","price":895}
Import command:
curl --location-trusted -u root -H "label:123" -H "format: json" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
To improve throughput, importing multiple records in one request is supported. json data format:
[
{"category":"C++","author":"avc","title":"C++ primer","price":89.5},
{"category":"Java","author":"avc","title":"Effective Java","price":95},
{"category":"Linux","author":"avc","title":"Linux kernel","price":195}
]
Matched mode, importing json data
json data format:
[
{"category":"xuxb111","author":"1avc","title":"SayingsoftheCentury","price":895},
{"category":"xuxb222","author":"2avc","title":"SayingsoftheCentury","price":895},
{"category":"xuxb333","author":"3avc","title":"SayingsoftheCentury","price":895}
]
Import precisely by specifying jsonpaths. In this example, only the category, author, and price attributes are imported:
curl --location-trusted -u root \
    -H "columns: category, price, author" -H "label:123" -H "format: json" \
    -H "jsonpaths: [\"$.category\",\"$.price\",\"$.author\"]" -H "strip_outer_array: true" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
Note:
1. If the json data starts with an array and each object in the array is a record, strip_outer_array needs to be set to true to flatten the array.
2. If the json data starts with an array and each object in the array is a record, then when setting the jsonpaths, the root node is actually an object in the array.
User specified json root node
json data format:
{
"RECORDS":[
{"category":"11","title":"SayingsoftheCentury","price":895,"timestamp":1589191587},
{"category":"22","author":"2avc","price":895,"timestamp":1589191487},
{"category":"33","author":"3avc","title":"SayingsoftheCentury","timestamp":1589191387}
]
}
Import precisely by specifying jsonpaths. In this example, only the category, author, and price attributes are imported:
curl --location-trusted -u root \
    -H "columns: category, price, author" -H "label:123" -H "format: json" \
    -H "jsonpaths: [\"$.category\",\"$.price\",\"$.author\"]" -H "strip_outer_array: true" -H "json_root: $.RECORDS" -T testData \
    http://host:port/api/testDb/testTbl/_stream_load
keyword
STREAM,LOAD