Feature Support: Data Loading and Unloading
This document outlines the features of various data loading and unloading methods supported by StarRocks.
File format
Loading file formats
| Loading method | Data source | CSV | JSON [3] | Parquet | ORC | Avro | ProtoBuf | Thrift |
|---|---|---|---|---|---|---|---|---|
| Stream Load | Local file systems, applications, connectors | Yes | Yes | To be supported | To be supported | To be supported | To be supported | To be supported |
| INSERT from FILES | HDFS, S3, OSS, Azure, GCS, NFS(NAS) [5] | Yes (v3.3+) | To be supported | Yes (v3.1+) | Yes (v3.1+) | To be supported | To be supported | To be supported |
| Broker Load | HDFS, S3, OSS, Azure, GCS, NFS(NAS) [5] | Yes | Yes (v3.2.3+) | Yes | Yes | To be supported | To be supported | To be supported |
| Routine Load | Kafka | Yes | Yes | To be supported | To be supported | Yes (v3.0+) [1] | To be supported | To be supported |
| Spark Load |  | Yes | To be supported | Yes | Yes | To be supported | To be supported | To be supported |
| Connectors | Flink, Spark | Yes | Yes | To be supported | To be supported | To be supported | To be supported | To be supported |
| Kafka Connector [2] | Kafka | Yes (v3.0+) | Yes (v3.0+) | To be supported | To be supported | Yes (v3.0+) | To be supported | To be supported |
| PIPE [4] | Consistent with INSERT from FILES |  |  |  |  |  |  |  |
[1], [2]: Schema Registry is required.
[3]: JSON supports a variety of CDC formats. For details about the JSON CDC formats supported by StarRocks, see JSON CDC formats.
[4]: Currently, only INSERT from FILES is supported for loading with PIPE.
[5]: You need to mount a NAS device as NFS under the same directory on each BE or CN node to access the files in NFS via the `file://` protocol.
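As a quick illustration of INSERT from FILES, the following is a minimal sketch of loading Parquet files from S3 into an existing StarRocks table. The bucket, path, region, credentials, and table name are placeholders, not values from this document.

```sql
-- Minimal sketch: load Parquet files from S3 into an existing table (v3.1+).
-- The bucket, path, region, credentials, and table name are placeholders.
INSERT INTO example_table
SELECT * FROM FILES(
    "path" = "s3://example-bucket/dataset/*.parquet",
    "format" = "parquet",
    "aws.s3.region" = "us-west-2",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
);
```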
JSON CDC formats
| CDC format | Stream Load | Routine Load | Broker Load | INSERT from FILES | Kafka Connector [1] |
|---|---|---|---|---|---|
| Debezium | To be supported | To be supported | To be supported | To be supported | Yes (v3.0+) |
| Canal | To be supported | To be supported | To be supported | To be supported | To be supported |
| Maxwell | To be supported | To be supported | To be supported | To be supported | To be supported |
[1]: You must configure the `transforms` parameter while loading Debezium CDC format data into Primary Key tables in StarRocks.
Unloading file formats
| Unloading method | Table format | Remote storage | CSV | JSON | Parquet | ORC |
|---|---|---|---|---|---|---|
| INSERT INTO FILES | N/A | HDFS, S3, OSS, Azure, GCS, NFS(NAS) [3] | Yes (v3.3+) | To be supported | Yes (v3.2+) | Yes (v3.3+) |
| INSERT INTO Catalog | Hive | HDFS, S3, OSS, Azure, GCS | Yes (v3.3+) | To be supported | Yes (v3.2+) | Yes (v3.3+) |
|  | Iceberg | HDFS, S3, OSS, Azure, GCS | To be supported | To be supported | Yes (v3.2+) | To be supported |
|  | Hudi/Delta | To be supported | To be supported | To be supported | To be supported | To be supported |
| EXPORT | N/A | HDFS, S3, OSS, Azure, GCS | Yes [1] | To be supported | To be supported | To be supported |
| PIPE | To be supported [2] |  |  |  |  |  |
[1]: Configuring the Broker process is supported.
[2]: Currently, unloading data using PIPE is not supported.
[3]: You need to mount a NAS device as NFS under the same directory on each BE or CN node to access the files in NFS via the `file://` protocol.
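For reference, the following is a minimal sketch of unloading query results to S3 as Parquet with INSERT INTO FILES. The bucket, path, credentials, and source table are placeholders.

```sql
-- Minimal sketch: unload query results to S3 as Parquet files (v3.2+).
-- Bucket, path, region, credentials, and the source table are placeholders.
INSERT INTO FILES(
    "path" = "s3://example-bucket/unload/",
    "format" = "parquet",
    "aws.s3.region" = "us-west-2",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
)
SELECT * FROM example_table;
```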
File format-related parameters
Loading file format-related parameters
| File format | Parameter | Stream Load | INSERT from FILES | Broker Load | Routine Load | Spark Load |
|---|---|---|---|---|---|---|
| CSV | column_separator | Yes | Yes (v3.3+) | Yes [1] | Yes [1] | Yes [1] |
|  | row_delimiter | Yes | Yes [2] (v3.1+) | Yes [3] (v2.2+) | To be supported |  |
|  | enclose | Yes (v3.0+) | Yes (v3.0+) | Yes (v3.0+) | To be supported |  |
|  | escape | Yes (v3.0+) | Yes (v3.0+) | Yes (v3.0+) | To be supported |  |
|  | skip_header | To be supported |  |  |  |  |
|  | trim_space | Yes (v3.0+) |  |  |  |  |
| JSON | jsonpaths | Yes | To be supported | Yes (v3.2.3+) | Yes | To be supported |
|  | strip_outer_array | Yes | To be supported | Yes (v3.2.3+) | Yes | To be supported |
|  | json_root | Yes | To be supported | Yes (v3.2.3+) | Yes | To be supported |
|  | ignore_json_size | To be supported |  |  |  |  |
[1]: The corresponding parameter is `COLUMNS TERMINATED BY`.
[2]: The corresponding parameter is `ROWS TERMINATED BY`.
[3]: The corresponding parameter is `ROWS TERMINATED BY`.
Unloading file format-related parameters
| File format | Parameter | INSERT INTO FILES | EXPORT |
|---|---|---|---|
| CSV | column_separator | Yes (v3.3+) | Yes |
|  | line_delimiter [1] | Yes (v3.3+) | Yes |
[1]: The corresponding parameter in data loading is `row_delimiter`.
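The following is a minimal sketch of these CSV unloading parameters in an EXPORT statement. The table name, HDFS path, and credentials are placeholders.

```sql
-- Minimal sketch: EXPORT with the CSV unloading parameters listed above.
-- The table name, HDFS path, and credentials are placeholders.
EXPORT TABLE example_db.example_table
TO "hdfs://<hdfs_host>:<hdfs_port>/unload/"
PROPERTIES (
    "column_separator" = ",",
    "line_delimiter" = "\n"
)
WITH BROKER
(
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>"
);
```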
Compression formats
Loading compression formats
| File format | Compression format | Stream Load | Broker Load | INSERT from FILES | Routine Load | Spark Load |
|---|---|---|---|---|---|---|
| CSV |  | Yes [1] | Yes [2] | To be supported | To be supported | To be supported |
| JSON |  | Yes (v3.2.7+) [3] | To be supported | N/A | To be supported | N/A |
| Parquet |  | N/A | Yes [4] | To be supported | To be supported | Yes [4] |
| ORC |  | N/A | Yes [4] | To be supported | To be supported | Yes [4] |
[1]: Currently, only when loading CSV files with Stream Load can you specify the compression format by using `format=gzip`, indicating gzip-compressed CSV files. `deflate` and `bzip2` formats are also supported.
[2]: Broker Load does not support specifying the compression format of CSV files by using the parameter `format`. Broker Load identifies the compression format by the suffix of the file name. The suffix of gzip-compressed files is `.gz`, and that of zstd-compressed files is `.zst`. Besides, other `format`-related parameters, such as `trim_space` and `enclose`, are also not supported.
[3]: Supports specifying the compression format by using `compression = gzip`.
[4]: Supported by the Arrow library. You do not need to configure the `compression` parameter.
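As a sketch of the behavior described in footnote [2], the following Broker Load job loads gzip-compressed CSV files whose compression is inferred from the `.gz` suffix; no `format` or `compression` parameter is set. The label, paths, table, and credentials are placeholders.

```sql
-- Minimal sketch of footnote [2]: Broker Load infers gzip compression from the
-- .gz suffix of the source files. Label, paths, table, and credentials are placeholders.
LOAD LABEL example_db.example_gzip_label
(
    DATA INFILE("s3a://example-bucket/input/*.csv.gz")
    INTO TABLE example_table
    COLUMNS TERMINATED BY ","
)
WITH BROKER
(
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "us-west-2"
);
```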
Unloading compression formats
| File format | Compression format | INSERT INTO FILES | INSERT INTO Catalog (Hive) | INSERT INTO Catalog (Iceberg) | INSERT INTO Catalog (Hudi/Delta) | EXPORT |
|---|---|---|---|---|---|---|
| CSV |  | To be supported | To be supported | To be supported | To be supported | To be supported |
| JSON | N/A | N/A | N/A | N/A | N/A | N/A |
| Parquet |  | Yes (v3.2+) | Yes (v3.2+) | Yes (v3.2+) | To be supported | N/A |
| ORC |  | Yes (v3.2+) | Yes (v3.2+) | Yes (v3.2+) | To be supported | N/A |
Credentials
Loading - Authentication
| Authentication | Stream Load | INSERT from FILES | Broker Load | Routine Load | External Catalog |
|---|---|---|---|---|---|
| Single Kerberos | N/A | Yes (v3.1+) | Yes [1] (versions earlier than v2.5) | Yes [2] (v3.1.4+) | Yes |
| Kerberos Ticket Granting Ticket (TGT) | N/A | To be supported | Yes (v3.1.10+/v3.2.1+) |  |  |
| Single KDC Multiple Kerberos | N/A |  |  |  |  |
| Basic access authentications (Access Key pair, IAM Role) | N/A | Yes (HDFS and S3-compatible object storage) | Yes (HDFS and S3-compatible object storage) | Yes [3] | Yes |
[1]: For HDFS, StarRocks supports both simple authentication and Kerberos authentication.
[2]: When the security protocol is set to `sasl_plaintext` or `sasl_ssl`, both SASL and GSSAPI (Kerberos) authentications are supported.
[3]: When the security protocol is set to `sasl_plaintext` or `sasl_ssl`, both SASL and PLAIN authentications are supported.
Unloading - Authentication
| Authentication | INSERT INTO FILES | EXPORT |
|---|---|---|
| Single Kerberos | To be supported | To be supported |
Loading - Other parameters and features
| Parameter and feature | Stream Load | INSERT from FILES | INSERT from SELECT/VALUES | Broker Load | PIPE | Routine Load | Spark Load |
|---|---|---|---|---|---|---|---|
| partial_update | Yes (v3.0+) | Yes [1] (v3.3+) | Yes [1] (v3.3+) | Yes (v3.0+) | N/A | Yes (v3.0+) | To be supported |
| partial_update_mode | Yes (v3.1+) | To be supported | To be supported | Yes (v3.1+) | N/A | To be supported | To be supported |
| COLUMNS FROM PATH | N/A | Yes (v3.2+) | N/A | Yes | N/A | N/A | Yes |
| timezone or session variable time_zone [2] | Yes [3] | Yes [4] | Yes [4] | Yes [4] | To be supported | Yes [4] | To be supported |
| Time accuracy - Microsecond | Yes | Yes | Yes | Yes (v3.1.11+/v3.2.6+) | To be supported | Yes | Yes |
[1]: From v3.3 onwards, StarRocks supports Partial Updates in Row mode for INSERT INTO by specifying the column list.
[2]: Setting the time zone by the parameter or the session variable will affect the results returned by functions such as strftime(), alignment_timestamp(), and from_unixtime().
[3]: Only the parameter `timezone` is supported.
[4]: Only the session variable `time_zone` is supported.
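The following is a minimal sketch of what footnotes [1] and [4] describe, assuming a hypothetical Primary Key table named `orders` with illustrative column names; additional session settings may apply depending on your version.

```sql
-- Minimal sketch of footnotes [1] and [4]. Table and column names are placeholders,
-- and `orders` is assumed to be a Primary Key table.
SET time_zone = 'Asia/Shanghai';   -- session variable used by SQL-based loading (footnote [4])

-- On v3.3+, an INSERT that lists only some columns performs a partial update in Row mode:
INSERT INTO orders (order_id, order_status)
VALUES (1001, 'shipped');
```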
Unloading - Other parameters and features
| Parameter and feature | INSERT INTO FILES | EXPORT |
|---|---|---|
| target_max_file_size | Yes (v3.2+) | To be supported |
| single | Yes (v3.2+) | To be supported |
| partitioned_by | Yes (v3.2+) | To be supported |
| Session variable time_zone | To be supported | To be supported |
| Time accuracy - Microsecond | To be supported | To be supported |
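As a closing sketch, the statement below combines several of the unloading parameters listed above in a single INSERT INTO FILES job. The bucket, path, credentials, source table, and the partition column `dt` are placeholders.

```sql
-- Minimal sketch: unloading with partitioned_by and target_max_file_size (v3.2+).
-- Bucket, path, credentials, source table, and the `dt` column are placeholders.
INSERT INTO FILES(
    "path" = "s3://example-bucket/unload_by_day/",
    "format" = "parquet",
    "partitioned_by" = "dt",
    "target_max_file_size" = "1073741824",  -- roughly 1 GB per output file
    "aws.s3.region" = "us-west-2",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
)
SELECT * FROM example_table;
```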