Data loading
1. What do I do if the "close index channel failed" or "too many tablet versions" error occurs?
This error occurs because load jobs are run too frequently and the data is not compacted in a timely manner. As a result, the number of data versions generated during loading exceeds the maximum number of data versions allowed (1000 by default). Use one of the following methods to resolve this issue:
- Increase the amount of data loaded in each individual job, thereby reducing the loading frequency.
- Modify the following configuration items in the BE configuration file be.conf of each BE, thereby accelerating compactions:

      cumulative_compaction_num_threads_per_disk = 4
      base_compaction_num_threads_per_disk = 2
      cumulative_compaction_check_interval_seconds = 2
After you modify these settings, monitor memory usage and disk I/O to make sure they remain normal.
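If your StarRocks version supports updating mutable BE configuration items at runtime, you can also apply such settings without restarting the BE. A minimal sketch, assuming the default BE HTTP port 8040 and that the item is runtime-mutable in your version:

    # Update a compaction setting on one BE at runtime (assumed endpoint and port;
    # verify that this item is mutable in your version)
    curl -XPOST "http://be_host:8040/api/update_config?cumulative_compaction_check_interval_seconds=2"

Runtime updates are not persisted across restarts, so write the final values into be.conf as well.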
2. What do I do if the "Label Already Exists" error occurs?
This error occurs because the load job has the same label as another load job in the same StarRocks database that has run successfully or is still running.
Stream Load jobs are submitted over HTTP. Retry logic is typically built into the HTTP clients of most programming languages. When the StarRocks cluster receives a load job request from an HTTP client, it immediately starts to process the request, but it may not return the job result to the HTTP client in a timely manner. As a result, the HTTP client sends the same load job request again. However, the StarRocks cluster is already processing the first request and therefore returns the "Label Already Exists" error for the second request.
To check whether load jobs submitted by using different loading methods have the same label or have been submitted repeatedly, do as follows:
- View the FE log and check whether the label of the failed load job is recorded twice. If so, the client has submitted the load job request twice.

  NOTE
  The StarRocks cluster does not distinguish among the labels of load jobs based on the loading method. Therefore, load jobs submitted by using different loading methods may have the same label.
- Run SHOW LOAD WHERE LABEL = "xxx" to check for load jobs that have the same label and are in the FINISHED state.

  NOTE
  xxx is the label that you want to check.
Before you submit a load job, we recommend that you estimate the time required to load the data and adjust the client-side request timeout accordingly. This prevents the client from submitting the load job request multiple times.
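A practical way to make retries safe is to have the client generate one unique label per logical batch and reuse that exact label on every retry of that batch: if the first request already succeeded, StarRocks rejects the retry instead of loading the data twice. A minimal Stream Load sketch (the host, port, database, table, file, and credentials are placeholders):

    # Stream Load with an explicit, client-generated label (all names are placeholders)
    curl --location-trusted -u root: \
        -H "label:orders_20230101_batch001" \
        -H "column_separator:," \
        -T orders.csv \
        http://fe_host:8030/api/example_db/orders/_stream_load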
3. What do I do if the "ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel" error occurs?
Execute SHOW LOAD, and use the error URL in the returned execution result to view the error details.
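For example (the label is a placeholder, and the exact form of the error URL varies by deployment):

    -- Find the failed job and its error URL
    SHOW LOAD WHERE LABEL = "label_of_failed_job"\G
    -- The URL field points at the rejected rows, which you can fetch with curl, e.g.:
    -- curl "http://be_host:8040/api/_load_error_log?file=..."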
Common data quality errors are as follows:
"convert csv string to INT failed."
Strings from a source column failed to be transformed into the data type of the matching destination column. For example,
abc
failed to be transformed into a numeric value."the length of input is too long than schema."
Values from a source column are in lengths that are not supported by the matching destination column. For example, the source column values of CHAR data type exceed the destination column's maximum length specified at table creation, or the source column values of INT data type exceed 4 bytes.
"actual column number is less than schema column number."
After a source row is parsed based on the specified column separator, the number of columns obtained is smaller than the number of columns in the destination table. A possible reason is that the column separator specified in the load command or statement differs from the column separator that is actually used in that row.
"actual column number is more than schema column number."
After a source row is parsed based on the specified column separator, the number of columns obtained is greater than the number of columns in the destination table. A possible reason is that the column separator specified in the load command or statement differs from the column separator that is actually used in that row.
"the frac part length longer than schema scale."
The decimal parts of values from a DECIMAL-type source column exceed the specified length.
"the int part length longer than schema precision."
The integer parts of values from a DECIMAL-type source column exceed the specified length.
"there is no corresponding partition for this key."
The value in the partition column for a source row is not within the partition range.
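If a small proportion of unqualified rows is acceptable, you can let the load job tolerate them instead of failing. For Stream Load this is controlled by the max_filter_ratio parameter; a minimal sketch (placeholders as above):

    # Allow up to 10% of rows to be filtered out for data quality errors
    curl --location-trusted -u root: \
        -H "label:orders_20230101_batch002" \
        -H "max_filter_ratio:0.1" \
        -T orders.csv \
        http://fe_host:8030/api/example_db/orders/_stream_load

Rows filtered this way are still reported through the error URL, so you can inspect what was dropped.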
4. What do I do if RPC times out?
Check the setting of the write_buffer_size configuration item in the BE configuration file be.conf of each BE. This configuration item controls the maximum size of each memory block on the BE and defaults to 100 MB. If the maximum size is too large, Remote Procedure Calls (RPCs) may time out. To resolve this issue, adjust the settings of the write_buffer_size and tablet_writer_rpc_timeout_sec configuration items in the BE configuration file. For more information, see BE configurations.
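For example, you might keep write_buffer_size at or below its default and lengthen the RPC timeout instead. The values below are illustrative assumptions, not tuning recommendations (the defaults are 104857600 bytes and 600 seconds in recent versions):

    # be.conf (illustrative values)
    write_buffer_size = 104857600
    tablet_writer_rpc_timeout_sec = 900

Restart the BE for changes in be.conf to take effect.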
5. What do I do if the "Value count does not match column count" error occurs?
After a load job fails, you can use the error URL returned in the job result to retrieve the error details. The "Value count does not match column count" error indicates a mismatch between the number of columns in the source data file and the number of columns in the destination StarRocks table:
    Error: Value count does not match column count. Expect 3, but got 1. Row: 2023-01-01T18:29:00Z,cpu0,80.99
    Error: Value count does not match column count. Expect 3, but got 1. Row: 2023-01-01T18:29:10Z,cpu1,75.23
    Error: Value count does not match column count. Expect 3, but got 1. Row: 2023-01-01T18:29:20Z,cpu2,59.44
The reason for this issue is as follows:
The column separator specified in the load command or statement differs from the column separator actually used in the source data file. In the preceding example, the CSV-formatted data file consists of three columns, which are separated by commas (,). However, \t is specified as the column separator in the load command or statement. As a result, the three columns from the source data file are incorrectly parsed into one column.
Specify the comma (,) as the column separator in the load command or statement, and then submit the load job again.
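A minimal Stream Load sketch with the separator declared explicitly (the host, port, database, table, file, and credentials are placeholders):

    # Declare the comma separator so the three CSV columns are parsed correctly
    curl --location-trusted -u root: \
        -H "label:cpu_metrics_20230101" \
        -H "column_separator:," \
        -T cpu_metrics.csv \
        http://fe_host:8030/api/example_db/cpu_metrics/_stream_load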