Generated columns
Since v3.1, StarRocks supports generated columns, which can be used to accelerate queries with complex expressions. A generated column precomputes and stores the result of an expression, and StarRocks supports rewriting queries against generated columns, which significantly accelerates queries that contain the same complex expressions.
You can define one or more generated columns to store the results of expressions at table creation. Then, when you execute a query that contains an expression whose results are stored in a generated column, the CBO rewrites the query to read the data directly from the generated column. Alternatively, you can query the data in a generated column directly.
Because computing expressions takes time during loading, it is recommended to evaluate the impact of generated columns on loading performance. It is also advised to create generated columns at table creation rather than adding or modifying them afterward, because adding or modifying generated columns after table creation is time-consuming and costly.
Basic operations
Create generated columns
Syntax
<col_name> <data_type> [NULL] AS <expr> [COMMENT 'string']
Create generated columns at table creation (Recommended)
Create a table named test_tbl1 with five columns, of which newcol1 and newcol2 are generated columns whose values are computed by using the specified expressions that reference the regular columns data_array and data_json respectively.
CREATE TABLE test_tbl1
(
id INT NOT NULL,
data_array ARRAY<int> NOT NULL,
data_json JSON NOT NULL,
newcol1 DOUBLE AS array_avg(data_array),
newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
NOTICE:
- Generated columns must be defined after regular columns.
- Aggregate functions cannot be used in the expressions for generated columns.
- The expressions for generated columns cannot reference other generated columns or auto-increment columns, but they can reference multiple regular columns (see the sketch after this list).
- The data type of a generated column must match the data type of the result generated by the expression for the generated column.
- Generated columns cannot be created on Aggregate tables.
- Currently, the StarRocks shared-data mode does not support generated columns.
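The following sketch illustrates the rules above with a hypothetical table demo_tbl (not part of the examples that follow): the generated column is defined after all regular columns, and its expression references two regular columns.

CREATE TABLE demo_tbl
(
    id INT NOT NULL,
    price DOUBLE NOT NULL,
    quantity INT NOT NULL,
    -- A generated column can reference multiple regular columns
    -- and must be defined after them.
    total DOUBLE AS price * quantity
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);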
Add generated columns after table creation
NOTICE
This operation is time-consuming and resource-intensive. Therefore, it is recommended to add generated columns at table creation. If it is unavoidable to use ALTER TABLE to add generated columns, it is recommended to evaluate the cost and time involved in advance.
Create a table named test_tbl2 with three regular columns: id, data_array, and data_json. Insert a data row into the table.

-- Create a table.
CREATE TABLE test_tbl2
(
    id INT NOT NULL,
    data_array ARRAY<int> NOT NULL,
    data_json JSON NOT NULL
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

-- Insert a data row.
INSERT INTO test_tbl2 VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));

-- Query the table.
MySQL [example_db]> select * from test_tbl2;
+------+------------+------------------+
| id   | data_array | data_json        |
+------+------------+------------------+
|    1 | [1,2]      | {"a": 1, "b": 2} |
+------+------------+------------------+
1 row in set (0.04 sec)
Execute ALTER TABLE ... ADD COLUMN ... to add the generated columns newcol1 and newcol2, whose values are computed by evaluating the expressions based on the values of the regular columns data_array and data_json.

ALTER TABLE test_tbl2 ADD COLUMN newcol1 DOUBLE AS array_avg(data_array);
ALTER TABLE test_tbl2 ADD COLUMN newcol2 String AS json_string(json_query(data_json, "$.a"));
NOTICE:
- Adding generated columns to Aggregate tables is not supported.
- Regular columns must be defined before generated columns. When you use the ALTER TABLE ... ADD COLUMN ... statement to add a regular column without specifying its position, the system automatically places it before the generated columns. You cannot use AFTER to explicitly place a regular column after a generated column, as illustrated in the sketch below.
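As a sketch of the placement rule (illustrative only and not executed in this walkthrough; the column name extra_flag is hypothetical), adding a regular column without a position places it before the generated columns:

-- extra_flag is placed before newcol1 and newcol2 because no position is specified;
-- specifying AFTER newcol1 would fail.
ALTER TABLE test_tbl2 ADD COLUMN extra_flag BOOLEAN NULL;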
Query the table data.
MySQL [example_db]> SELECT * FROM test_tbl2;
+------+------------+------------------+---------+---------+
| id   | data_array | data_json        | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
|    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
+------+------------+------------------+---------+---------+
1 row in set (0.04 sec)

The result shows that the generated columns newcol1 and newcol2 are added to the table, and that StarRocks automatically computes their values based on the expressions.
Load data into generated columns
During data loading, StarRocks automatically calculates the values for generated columns based on the expressions. You cannot specify the values of generated columns. The following example uses the INSERT INTO statement to load data:
Use INSERT INTO to insert a record into the test_tbl1 table. Note that you cannot specify the values for the generated columns in the VALUES () clause.

INSERT INTO test_tbl1 (id, data_array, data_json) VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
Query the table data.
MySQL [example_db]> SELECT * FROM test_tbl1;
+------+------------+------------------+---------+---------+
| id   | data_array | data_json        | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
|    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
The results show that StarRocks automatically computes the values for the generated columns newcol1 and newcol2 based on the expressions.

NOTICE:

The following error is returned if you specify values for the generated columns during data loading:

MySQL [example_db]> INSERT INTO test_tbl1 (id, data_array, data_json, newcol1, newcol2) VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
ERROR 1064 (HY000): Getting analyzing error. Detail message: materialized column 'newcol1' can not be specified.

MySQL [example_db]> INSERT INTO test_tbl1 VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
ERROR 1064 (HY000): Getting analyzing error. Detail message: Column count doesn't match value count.
Modify generated columns
NOTICE
This operation is time-consuming and resource-intensive. If it is unavoidable to use ALTER TABLE to modify generated columns, it is recommended to evaluate the cost and time involved in advance.
You can modify the data type and expression of a generated column.
Create a table test_tbl3 with five columns, of which newcol1 and newcol2 are generated columns whose values are computed by using the specified expressions that reference the regular columns data_array and data_json respectively. Insert a data row into the table.

-- Create a table.
MySQL [example_db]> CREATE TABLE test_tbl3
(
    id INT NOT NULL,
    data_array ARRAY<int> NOT NULL,
    data_json JSON NOT NULL,
    -- The data types and expressions of the generated columns are specified as follows:
    newcol1 DOUBLE AS array_avg(data_array),
    newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

-- Insert a data row.
INSERT INTO test_tbl3 (id, data_array, data_json) VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));

-- Query the table.
MySQL [example_db]> select * from test_tbl3;
+------+------------+------------------+---------+---------+
| id   | data_array | data_json        | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
|    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
Modify the generated columns newcol1 and newcol2:

Change the data type of the generated column newcol1 to ARRAY<INT> and change its expression to data_array.

ALTER TABLE test_tbl3 MODIFY COLUMN newcol1 ARRAY<INT> AS data_array;
Modify the expression of the generated column newcol2 to extract the values of the field b from the regular column data_json.

ALTER TABLE test_tbl3 MODIFY COLUMN newcol2 String AS json_string(json_query(data_json, "$.b"));
View the modified schema and the data in the table.
View the modified schema.
MySQL [example_db]> show create table test_tbl3\G
*************************** 1. row ***************************
       Table: test_tbl3
Create Table: CREATE TABLE test_tbl3 (
  id int(11) NOT NULL COMMENT "",
  data_array array<int(11)> NOT NULL COMMENT "",
  data_json json NOT NULL COMMENT "",
  -- After modification, the data types and expressions of the generated columns are as follows:
  newcol1 array<int(11)> NULL AS example_db.test_tbl3.data_array COMMENT "",
  newcol2 varchar(65533) NULL AS json_string(json_query(example_db.test_tbl3.data_json, '$.b')) COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(id)
DISTRIBUTED BY HASH(id)
PROPERTIES (...);
1 row in set (0.00 sec)
Query the table data after modification. The result shows that StarRocks recalculates the values of the generated columns newcol1 and newcol2 based on the modified expressions.

MySQL [example_db]> select * from test_tbl3;
+------+------------+------------------+---------+---------+
| id   | data_array | data_json        | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
|    1 | [1,2]      | {"a": 1, "b": 2} | [1,2]   | 2       |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
Drop a generated column
Drop the column newcol1 from the table test_tbl3.

ALTER TABLE test_tbl3 DROP COLUMN newcol1;
NOTICE:
If a generated column references a regular column in its expression, you cannot directly drop or modify that regular column. Instead, you must first drop the generated column, and then drop or modify the regular column.
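For example, after the previous step, test_tbl3 still has the generated column newcol2, which references data_json, so data_json cannot be dropped directly. A minimal sketch of the required order, based on the rule above:

-- Drop the generated column that references data_json first.
ALTER TABLE test_tbl3 DROP COLUMN newcol2;
-- Then the regular column can be dropped (or modified).
ALTER TABLE test_tbl3 DROP COLUMN data_json;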
Query rewrites
If the expression in a query matches the expression of a generated column, the optimizer automatically rewrites the query to directly read the values of the generated column.
Suppose that you create a table test_tbl4 with the following schema:

CREATE TABLE test_tbl4
(
    id INT NOT NULL,
    data_array ARRAY<int> NOT NULL,
    data_json JSON NOT NULL,
    newcol1 DOUBLE AS array_avg(data_array),
    newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

If you query the table test_tbl4 by using the statement SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4;, the query involves only the regular columns data_array and data_json. However, the expressions in the query match the expressions of the generated columns newcol1 and newcol2. In this case, the execution plan shows that the CBO automatically rewrites the query to read the values of the generated columns newcol1 and newcol2.

MySQL [example_db]> EXPLAIN SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4;
+---------------------------------------+
| Explain String                        |
+---------------------------------------+
| PLAN FRAGMENT 0                       |
|  OUTPUT EXPRS:4: newcol1 | 5: newcol2 |   -- The query is rewritten to read data from the generated columns newcol1 and newcol2.
|   PARTITION: RANDOM                   |
|                                       |
|   RESULT SINK                         |
|                                       |
|   0:OlapScanNode                      |
|      TABLE: test_tbl4                 |
|      PREAGGREGATION: ON               |
|      partitions=0/1                   |
|      rollup: test_tbl4                |
|      tabletRatio=0/0                  |
|      tabletList=                      |
|      cardinality=1                    |
|      avgRowSize=2.0                   |
+---------------------------------------+
15 rows in set (0.00 sec)
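Because the rewrite is transparent, you can get the same results by querying the generated columns directly, without relying on the CBO:

SELECT newcol1, newcol2 FROM test_tbl4;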
Partial updates and generated columns
To perform partial updates on a Primary Key table, you must specify all the regular columns referenced by the generated columns in the columns parameter. The following example uses Stream Load to perform partial updates.
Create a table test_tbl5 with five columns, of which newcol1 and newcol2 are generated columns whose values are computed by using the specified expressions that reference the regular columns data_array and data_json respectively. Insert a data row into the table.

-- Create a table.
CREATE TABLE test_tbl5
(
    id INT NOT NULL,
    data_array ARRAY<int> NOT NULL,
    data_json JSON NULL,
    newcol1 DOUBLE AS array_avg(data_array),
    newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

-- Insert a data row.
INSERT INTO test_tbl5 (id, data_array, data_json) VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));

-- Query the table.
MySQL [example_db]> select * from test_tbl5;
+------+------------+------------------+---------+---------+
| id   | data_array | data_json        | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
|    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
Prepare a CSV file my_data1.csv to update some columns in the test_tbl5 table.

1|[3,4]|{"a": 3, "b": 4}
2|[3,4]|{"a": 3, "b": 4}
Use Stream Load with the my_data1.csv file to update some columns of the test_tbl5 table. You need to set partial_update:true and specify all the regular columns referenced by the generated columns in the columns parameter.

curl --location-trusted -u <username>:<password> -H "label:1" \
    -H "column_separator:|" \
    -H "partial_update:true" \
    -H "columns:id,data_array,data_json" \
    -T my_data1.csv -XPUT \
    http://<fe_host>:<fe_http_port>/api/example_db/test_tbl5/_stream_load
Query the table data.
MySQL [example_db]> select * from test_tbl5;
+------+------------+------------------+---------+---------+
| id   | data_array | data_json        | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
|    1 | [3,4]      | {"a": 3, "b": 4} |     3.5 | 3       |
|    2 | [3,4]      | {"a": 3, "b": 4} |     3.5 | 3       |
+------+------------+------------------+---------+---------+
2 rows in set (0.01 sec)
An error is returned by Stream Load if you perform partial updates without specifying all the regular columns referenced by the generated columns.
Prepare a CSV file my_data2.csv.

1|[3,4]
2|[3,4]
When you perform partial updates by using Stream Load with the my_data2.csv file, the values for the data_json column are not provided, and the columns parameter of the Stream Load job does not include the data_json column. In this case, even though the data_json column allows NULL values, Stream Load returns an error, because the data_json column is referenced by the generated column newcol2.
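A sketch of such a failing Stream Load job follows, mirroring the earlier curl command but omitting data_json from the columns parameter (the label value label:2 is assumed; the exact error message depends on the StarRocks version):

curl --location-trusted -u <username>:<password> -H "label:2" \
    -H "column_separator:|" \
    -H "partial_update:true" \
    -H "columns:id,data_array" \
    -T my_data2.csv -XPUT \
    http://<fe_host>:<fe_http_port>/api/example_db/test_tbl5/_stream_load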