- StarRocks
- Introduction to StarRocks
- Quick Start
- Deployment
- Deployment overview
- Prepare
- Deploy
- Deploy shared-nothing StarRocks
- Deploy and use shared-data StarRocks
- Manage
- Table Design
- Understand StarRocks table design
- Table types
- Data distribution
- Data compression
- Sort keys and prefix indexes
- Data Loading
- Concepts
- Overview of data loading
- Load data from a local file system or a streaming data source using HTTP PUT
- Load data from HDFS
- Load data from cloud storage
- Load data from Apache Kafka®
- Continuously load data from Apache Kafka®
- Load data from Apache Spark™
- Load data using INSERT
- Load data using Stream Load transaction interface
- Realtime synchronization from MySQL
- Continuously load data from Apache Flink®
- Change data through loading
- Transform data at loading
- Data Unloading
- Query Data Lakes
- Query Acceleration
- Gather CBO statistics
- Synchronous materialized views
- Asynchronous materialized views
- Colocate Join
- Lateral Join
- Query Cache
- Index
- Computing the Number of Distinct Values
- Sorted streaming aggregate
- Integrations
- Administration
- Management
- Data recovery
- User Privilege and Authentication
- Performance Tuning
- Reference
- SQL Reference
- User Account Management
- Cluster Management
- ADD SQLBLACKLIST
- ADMIN CANCEL REPAIR TABLE
- ADMIN CHECK TABLET
- ADMIN REPAIR TABLE
- ADMIN SET CONFIG
- ADMIN SET REPLICA STATUS
- ADMIN SHOW CONFIG
- ADMIN SHOW REPLICA DISTRIBUTION
- ADMIN SHOW REPLICA STATUS
- ALTER RESOURCE GROUP
- ALTER STORAGE VOLUME
- ALTER SYSTEM
- CANCEL DECOMMISSION
- CREATE FILE
- CREATE RESOURCE GROUP
- CREATE STORAGE VOLUME
- DELETE SQLBLACKLIST
- DESC STORAGE VOLUME
- DROP FILE
- DROP RESOURCE GROUP
- DROP STORAGE VOLUME
- EXPLAIN
- INSTALL PLUGIN
- KILL
- SET
- SET DEFAULT STORAGE VOLUME
- SHOW BACKENDS
- SHOW BROKER
- SHOW COMPUTE NODES
- SHOW FILE
- SHOW FRONTENDS
- SHOW FULL COLUMNS
- SHOW INDEX
- SHOW PLUGINS
- SHOW PROC
- SHOW PROCESSLIST
- SHOW RESOURCE GROUP
- SHOW SQLBLACKLIST
- SHOW STORAGE VOLUMES
- SHOW TABLE STATUS
- SHOW VARIABLES
- UNINSTALL PLUGIN
- DDL
- ALTER DATABASE
- ALTER MATERIALIZED VIEW
- ALTER TABLE
- ALTER VIEW
- ALTER RESOURCE
- ANALYZE TABLE
- BACKUP
- CANCEL ALTER TABLE
- CANCEL BACKUP
- CANCEL RESTORE
- CREATE ANALYZE
- CREATE DATABASE
- CREATE EXTERNAL CATALOG
- CREATE FUNCTION
- CREATE INDEX
- CREATE MATERIALIZED VIEW
- CREATE REPOSITORY
- CREATE RESOURCE
- CREATE TABLE
- CREATE TABLE AS SELECT
- CREATE TABLE LIKE
- CREATE VIEW
- DROP ANALYZE
- DROP CATALOG
- DROP DATABASE
- DROP FUNCTION
- DROP INDEX
- DROP MATERIALIZED VIEW
- DROP REPOSITORY
- DROP RESOURCE
- DROP STATS
- DROP TABLE
- DROP VIEW
- HLL
- KILL ANALYZE
- RECOVER
- REFRESH EXTERNAL TABLE
- RESTORE
- SET CATALOG
- SHOW ANALYZE JOB
- SHOW ANALYZE STATUS
- SHOW FUNCTION
- SHOW META
- SHOW RESOURCES
- TRUNCATE TABLE
- USE
- DML
- ALTER LOAD
- ALTER ROUTINE LOAD
- BROKER LOAD
- CANCEL LOAD
- CANCEL EXPORT
- CANCEL REFRESH MATERIALIZED VIEW
- CREATE ROUTINE LOAD
- DELETE
- DROP TASK
- EXPORT
- GROUP BY
- INSERT
- PAUSE ROUTINE LOAD
- REFRESH MATERIALIZED VIEW
- RESUME ROUTINE LOAD
- SELECT
- SHOW ALTER TABLE
- SHOW ALTER MATERIALIZED VIEW
- SHOW BACKUP
- SHOW CATALOGS
- SHOW CREATE CATALOG
- SHOW CREATE DATABASE
- SHOW CREATE MATERIALIZED VIEW
- SHOW CREATE TABLE
- SHOW CREATE VIEW
- SHOW DATA
- SHOW DATABASES
- SHOW DELETE
- SHOW DYNAMIC PARTITION TABLES
- SHOW EXPORT
- SHOW LOAD
- SHOW MATERIALIZED VIEWS
- SHOW PARTITIONS
- SHOW PROPERTY
- SHOW REPOSITORIES
- SHOW RESTORE
- SHOW ROUTINE LOAD
- SHOW ROUTINE LOAD TASK
- SHOW SNAPSHOT
- SHOW TABLES
- SHOW TABLET
- SHOW TRANSACTION
- SPARK LOAD
- STOP ROUTINE LOAD
- STREAM LOAD
- SUBMIT TASK
- UPDATE
- Auxiliary Commands
- Data Types
- Keywords
- Function Reference
- Function list
- Java UDFs
- Window functions
- Lambda expression
- Aggregate Functions
- any_value
- approx_count_distinct
- array_agg
- avg
- bitmap
- bitmap_agg
- count
- corr
- covar_pop
- covar_samp
- group_concat
- grouping
- grouping_id
- hll_empty
- hll_hash
- hll_raw_agg
- hll_union
- hll_union_agg
- max
- max_by
- min
- min_by
- multi_distinct_sum
- multi_distinct_count
- percentile_approx
- percentile_cont
- percentile_disc
- retention
- stddev
- stddev_samp
- sum
- variance, variance_pop, var_pop
- var_samp
- window_funnel
- Array Functions
- all_match
- any_match
- array_agg
- array_append
- array_avg
- array_concat
- array_contains
- array_contains_all
- array_cum_sum
- array_difference
- array_distinct
- array_filter
- array_generate
- array_intersect
- array_join
- array_length
- array_map
- array_max
- array_min
- array_position
- array_remove
- array_slice
- array_sort
- array_sortby
- array_sum
- arrays_overlap
- array_to_bitmap
- cardinality
- element_at
- reverse
- unnest
- Bit Functions
- Bitmap Functions
- base64_to_bitmap
- bitmap_agg
- bitmap_and
- bitmap_andnot
- bitmap_contains
- bitmap_count
- bitmap_from_string
- bitmap_empty
- bitmap_has_any
- bitmap_hash
- bitmap_intersect
- bitmap_max
- bitmap_min
- bitmap_or
- bitmap_remove
- bitmap_subset_in_range
- bitmap_subset_limit
- bitmap_to_array
- bitmap_to_base64
- bitmap_to_string
- bitmap_union
- bitmap_union_count
- bitmap_union_int
- bitmap_xor
- intersect_count
- sub_bitmap
- to_bitmap
- JSON Functions
- Overview of JSON functions and operators
- JSON operators
- JSON constructor functions
- JSON query and processing functions
- Map Functions
- Binary Functions
- Conditional Functions
- Cryptographic Functions
- Date Functions
- add_months
- adddate
- convert_tz
- current_date
- current_time
- current_timestamp
- date
- date_add
- date_diff
- date_format
- date_slice
- date_sub, subdate
- date_trunc
- datediff
- day
- dayname
- dayofmonth
- dayofweek
- dayofyear
- days_add
- days_diff
- days_sub
- from_days
- from_unixtime
- hour
- hours_add
- hours_diff
- hours_sub
- last_day
- makedate
- microseconds_add
- microseconds_sub
- minute
- minutes_add
- minutes_diff
- minutes_sub
- month
- monthname
- months_add
- months_diff
- months_sub
- next_day
- now
- previous_day
- quarter
- second
- seconds_add
- seconds_diff
- seconds_sub
- str_to_date
- str2date
- time_slice
- time_to_sec
- timediff
- timestamp
- timestampadd
- timestampdiff
- to_date
- to_days
- unix_timestamp
- utc_timestamp
- week
- week_iso
- weekofyear
- weeks_add
- day_of_week_iso
- weeks_diff
- weeks_sub
- year
- years_add
- years_diff
- years_sub
- Geographic Functions
- Math Functions
- String Functions
- append_trailing_char_if_absent
- ascii
- char
- char_length
- character_length
- concat
- concat_ws
- ends_with
- find_in_set
- group_concat
- hex
- hex_decode_binary
- hex_decode_string
- instr
- lcase
- left
- length
- locate
- lower
- lpad
- ltrim
- money_format
- null_or_empty
- parse_url
- repeat
- replace
- reverse
- right
- rpad
- rtrim
- space
- split
- split_part
- starts_with
- strleft
- strright
- str_to_map
- substring
- trim
- ucase
- unhex
- upper
- url_decode
- url_encode
- Pattern Matching Functions
- Percentile Functions
- Scalar Functions
- Struct Functions
- Table Functions
- Utility Functions
- cast function
- hash function
- AUTO_INCREMENT
- Generated columns
- System variables
- User-defined variables
- Error code
- System limits
- AWS IAM policies
- SQL Reference
- FAQ
- Benchmark
- Ecosystem Release Notes
- Developers
- Contribute to StarRocks
- Code Style Guides
- Use the debuginfo file for debugging
- Development Environment
- Trace Tools
List partitioning
Since v3.1, StarRocks supports list partitioning. Data is partitioned based on a predefined value list for each partition, which can accelerate queries and facilitate management according to enumerated values.
Introduction
You need to explicitly specify the column values list in each partition. These values do not need to be continuous, unlike the continuous time or numeric range required in Range Partitioning. During data loading, StarRocks will store the data in the corresponding partition based on the mapping between the data's partitioning column values and the predefined column values for each partition.
List partitioning is suitable for storing data whose columns contain a small number of enum values, and you often query and manage data based on these enum values. For example, columns represent geographical locations, states, and categories. Each value in a column represents an independent category. By partitioning data based on enum values, you can improve query performance and facilitate data management.
List partitioning is particularly useful for scenarios where a partition needs to include multiple values for each partitioning column. For example, if a table includes the City
column representing the city where an individual is from, and you often query and manage data by states and cities. At table creation, you can use the City
column as the partitioning column for List Partitioning and specify that data of various cities within the same state are placed in one partition, for example PARTITION pCalifornia VALUES IN ("Los Angeles", "San Francisco", "San Diego")
. This method can accelerate queries based on states and cities while facilitating data management.
If a partition only needs to contain data with the same value of each partitioning column, it is recommended to use expression partitioning.
Comparison between list partitioning and expression partitioning
The main difference between list partitioning and expression partitioning (recommended) is that list partitioning requires you to manually create partitions one by one. On the other hand, expression partitioning can automatically create partitions during loading to simplify partitioning. And in most cases, expression partitioning can replace list partitioning. The specific comparison between the two is shown in the following table:
Partitioning method | List partitioning | Expression partitioning |
---|---|---|
Syntax | PARTITION BY LIST (partition_columns)( PARTITION <partition_name> VALUES IN (value_list) [, ...] ) | PARTITION BY <partition_columns> |
Multiple values for each partitioning column in a partition | Supported. A partition can store data with different values in each partitioning column. In the following example, when the loaded data contains values Los Angeles , San Francisco , and San Diego in the city column, all the data is stored in one partition.pCalifornia .PARTITION BY LIST (city) ( PARTITION pCalifornia VALUES IN ("Los Angeles","San Francisco","San Diego") [, ...] ) | Not supported. A partition stores data with the same value in the partitioning column For example, the expression PARTITION BY (city) is used in expression partitioning. When the loaded data contains values Los Angeles , San Francisco , and San Diego in the city column, StarRocks will automatically create three partitions pLosAngeles , pSanFrancisco , and pSanDiego . The three partitions respectively store data with values Los Angeles, San Francisco , and San Diego in the city column. |
Create partitions before data loading | Supported. Partitions need to be created at table creation. | No need to do so. Partitions can be automatically created during data loading. |
Automatically create List partitions during data loading | Not supported. If a partition corresponding to the data does not exist during data loading, an error is returned. | Supported. If a partition corresponding to the data does not exist during data loading, StarRocks automatically creates the partition to store the data. Each partition can only contain data with the same value for the partitioning column. |
SHOW CREATE TABLE | Returned the partitions definition in the CREATE TABLE statement. | After data is loaded, the statement returns the result with the partition clause used in the CREATE TABLE statement, that is, PARTITION BY (partition_columns) . However, the returned result does not show any automatically created partitions. If you need to view the automatically created partitions, execute SHOW PARTITIONS FROM <table_name> . |
Usage
Syntax
PARTITION BY LIST (partition_columns)(
PARTITION <partition_name> VALUES IN (value_list)
[, ...]
)
partition_columns::=
<column> [,<column> [, ...] ]
value_list ::=
value_item [, value_item [, ...] ]
value_item ::=
{ <value> | ( <value> [, <value>, [, ...] ] ) }
Parameters
Parameters | Parameters | Description |
---|---|---|
partition_columns | YES | the ames of the partitioning columns. The partitioning column values can be string (BINARY not supported), date or datetime, integer, and boolean values. The partitioning column allows NULL values. |
partition_name | YES | Partition name. It is recommended to set appropriate partition names based on the business scenario to differentiate the data in different partitions. |
value_list | YES | A list of partitioning column values in a partition. |
Examples
Example 1: Suppose you frequently query details of data center billing based on states or cities. At table creation, you can specify the partitioning column as city
and specify that each partition stores data of cities within the same state. This method can accelerate queries for specific states or cities and facilitate data management.
CREATE TABLE t_recharge_detail1 (
id bigint,
user_id bigint,
recharge_money decimal(32,2),
city varchar(20) not null,
dt varchar(20) not null
)
DUPLICATE KEY(id)
PARTITION BY LIST (city) (
PARTITION pLos_Angeles VALUES IN ("Los Angeles"),
PARTITION pSan_Francisco VALUES IN ("San Francisco")
)
DISTRIBUTED BY HASH(`id`);
Example 2: Suppose you frequently query details of data center billing based on time ranges and specific states or cities. During table creation, you can specify the partitioning columns as dt
and city
. This way, data of specific dates and specific states or cities is stored into the same partition, improving query speed and facilitating data management.
CREATE TABLE t_recharge_detail4 (
id bigint,
user_id bigint,
recharge_money decimal(32,2),
city varchar(20) not null,
dt varchar(20) not null
) ENGINE=OLAP
DUPLICATE KEY(id)
PARTITION BY LIST (dt,city) (
PARTITION p202204_California VALUES IN (
("2022-04-01", "Los Angeles"),
("2022-04-01", "San Francisco"),
("2022-04-02", "Los Angeles"),
("2022-04-02", "San Francisco")
),
PARTITION p202204_Texas VALUES IN (
("2022-04-01", "Houston"),
("2022-04-01", "Dallas"),
("2022-04-02", "Houston"),
("2022-04-02", "Dallas")
)
)
DISTRIBUTED BY HASH(`id`);
Limits
- List partitioning does support dynamic partitoning and creating multiple partitions at a time.
- Currently, StarRocks's shared-data mode does not support this feature.
- When the
ALTER TABLE <table_name> DROP PARTITION <partition_name>;
statement is used to delete a partition created by using list partitioning, data in the partition is directly removed and cannot be recovered. - Currently you cannot backup and restore partitions created by the list partitioning.
- Currently, StarRocks does not support creating asynchronous materialized views with base tables created with the list partitioning strategy.