- StarRocks
- Introduction to StarRocks
- Quick Start
- Table Design
- Data Loading
- Concepts
- Overview of data loading
- Load data from a local file system or a streaming data source using HTTP PUT
- Load data from HDFS or cloud storage
- Continuously load data from Apache Kafka®
- Bulk load using Apache Spark™
- Load data using INSERT
- Synchronize data from MySQL in real time
- Continuously load data from Apache Flink®
- Change data through loading
- Transform data at loading
- Data Unloading
- Query Data Sources
- Query Acceleration
- Administration
- Deployment
- Management
- Data Recovery
- User Privilege and Authentication
- Performance Tuning
- Reference
- SQL Reference
- User Account Management
- Cluster Management
- ADD SQLBLACKLIST
- ADMIN CANCEL REPAIR TABLE
- ADMIN CHECK TABLET
- ADMIN REPAIR TABLE
- ADMIN SET CONFIG
- ADMIN SET REPLICA STATUS
- ADMIN SHOW CONFIG
- ADMIN SHOW REPLICA DISTRIBUTION
- ADMIN SHOW REPLICA STATUS
- ALTER RESOURCE GROUP
- ALTER SYSTEM
- CANCEL DECOMMISSION
- CREATE FILE
- CREATE RESOURCE GROUP
- DELETE SQLBLACKLIST
- DROP FILE
- DROP RESOURCE GROUP
- EXPLAIN
- INSTALL PLUGIN
- KILL
- SET
- SHOW BACKENDS
- SHOW BROKER
- SHOW FILE
- SHOW FRONTENDS
- SHOW FULL COLUMNS
- SHOW INDEX
- SHOW PLUGINS
- SHOW PROC
- SHOW PROCESSLIST
- SHOW RESOURCE GROUP
- SHOW SQLBLACKLIST
- SHOW TABLE STATUS
- SHOW VARIABLES
- UNINSTALL PLUGIN
- DDL
- ALTER DATABASE
- ALTER TABLE
- ALTER VIEW
- ALTER RESOURCE
- BACKUP
- CANCEL BACKUP
- CANCEL RESTORE
- CREATE DATABASE
- CREATE INDEX
- CREATE MATERIALIZED VIEW
- CREATE REPOSITORY
- CREATE RESOURCE
- CREATE TABLE AS SELECT
- CREATE TABLE LIKE
- CREATE TABLE
- CREATE VIEW
- CREATE FUNCTION
- DROP DATABASE
- DROP INDEX
- DROP MATERIALIZED VIEW
- DROP REPOSITORY
- DROP RESOURCE
- DROP TABLE
- DROP VIEW
- DROP FUNCTION
- HLL
- RECOVER
- RESTORE
- SHOW RESOURCES
- SHOW FUNCTION
- TRUNCATE TABLE
- USE
- DML
- ALTER ROUTINE LOAD
- BROKER LOAD
- CANCEL LOAD
- CANCEL EXPORT
- CANCEL REFRESH MATERIALIZED VIEW
- DELETE
- EXPORT
- GROUP BY
- INSERT
- PAUSE ROUTINE LOAD
- RESUME ROUTINE LOAD
- ROUTINE LOAD
- SELECT
- SHOW ALTER TABLE
- SHOW BACKUP
- SHOW CREATE TABLE
- SHOW CREATE VIEW
- SHOW DATA
- SHOW DATABASES
- SHOW DELETE
- SHOW DYNAMIC PARTITION TABLES
- SHOW EXPORT
- SHOW LOAD
- SHOW PARTITIONS
- SHOW PROPERTY
- SHOW REPOSITORIES
- SHOW RESTORE
- SHOW ROUTINE LOAD
- SHOW ROUTINE LOAD TASK
- SHOW SNAPSHOT
- SHOW TABLES
- SHOW TABLET
- SHOW TRANSACTION
- SPARK LOAD
- STOP ROUTINE LOAD
- STREAM LOAD
- Auxiliary Commands
- Data Types
- Function Reference
- Java UDFs
- Window functions
- Aggregate Functions
- Array Functions
- Bit Functions
- Bitmap Functions
- base64_to_bitmap
- bitmap_agg
- bitmap_and
- bitmap_andnot
- bitmap_contains
- bitmap_count
- bitmap_from_string
- bitmap_empty
- bitmap_has_any
- bitmap_hash
- bitmap_intersect
- bitmap_max
- bitmap_min
- bitmap_or
- bitmap_remove
- bitmap_to_array
- bitmap_to_string
- bitmap_union
- bitmap_union_count
- bitmap_union_int
- bitmap_xor
- intersect_count
- to_bitmap
- Conditional Functions
- Cryptographic Functions
- Date Functions
- add_months
- adddate
- convert_tz
- current_date
- current_time
- current_timestamp
- date
- date_add
- date_format
- date_sub, subdate
- date_trunc
- datediff
- day
- dayname
- dayofmonth
- dayofweek
- dayofyear
- days_add
- days_diff
- days_sub
- from_days
- from_unixtime
- hour
- hours_add
- hours_diff
- hours_sub
- microseconds_add
- microseconds_sub
- minute
- minutes_add
- minutes_diff
- minutes_sub
- month
- monthname
- months_add
- months_diff
- months_sub
- now
- quarter
- second
- seconds_add
- seconds_diff
- seconds_sub
- str_to_date
- str2date
- time_slice
- time_to_sec
- timediff
- timestamp
- timestampadd
- timestampdiff
- to_date
- to_days
- unix_timestamp
- utc_timestamp
- week
- weekofyear
- weeks_add
- weeks_diff
- weeks_sub
- year
- years_add
- years_diff
- years_sub
- Geographic Functions
- JSON Functions
- Overview of JSON functions and operators
- JSON operators
- JSON constructor functions
- JSON query and processing functions
- Math Functions
- String Functions
- Pattern Matching Functions
- Percentile Functions
- Scalar Functions
- Utility Functions
- cast function
- hash function
- System variables
- Error code
- System limits
- SQL Reference
- FAQ
- Benchmark
- Developers
- Contribute to StarRocks
- Code Style Guides
- Use the debuginfo file for debugging
- Development Environment
- Trace Tools
Performance Optimization
Table Creation
Data Model Selection
StarRocks supports three kinds of data model: AGGREGATE KEY, UNIQUE KEY, and DUPLICATE KEY. All three are sorted by KEY.
- AGGREGATE KEY: When the AGGREGATE KEY is the same, the old and new records are aggregated. The currently supported aggregate functions are SUM, MIN, MAX, REPLACE. Aggregate model can aggregate data in advance, which is suitable for reporting and multi-dimensional analyses.
CREATE TABLE site_visit
(
siteid INT,
city SMALLINT,
username VARCHAR(32),
pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;
UNIQUE KEY: When the UNIQUE KEY is the same, the new record overwrites the old one.. Currently, UNIQUE KEY
functions similarly as REPLACE
of AGGREGATE KEY
.Both are suitable for analyses involving constant updates.
CREATE TABLE sales_order
(
orderid BIGINT,
status TINYINT,
username VARCHAR(32),
amount BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid) BUCKETS 10;
DUPLICATE KEY: Only need to specify the sort key. Records with the same DUPLICATE KEY exist at the same time. It is suitable for analyses that don’t involve aggregating data in advance.
CREATE TABLE session_data
(
visitorid SMALLINT,
sessionid BIGINT,
visittime DATETIME,
city CHAR(20),
province CHAR(20),
ip varchar(32),
brower CHAR(20),
url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;
In-memory tables
StarRocks supports caching data in memory to speed up queries. In-memory tables are suitable for dimension tables with a small number of rows.
CREATE TABLE memory_table
(
visitorid SMALLINT,
sessionid BIGINT,
visittime DATETIME,
city CHAR(20),
province CHAR(20),
ip varchar(32),
brower CHAR(20),
url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10
PROPERTIES (
"in_memory"="true"
);
Colocate Table
To speed up queries, tables with the same distribution can use a common bucketing column. In that case, data can be joined locally without being transferred across the cluster during the join
operation.
CREATE TABLE colocate_table
(
visitorid SMALLINT,
sessionid BIGINT,
visittime DATETIME,
city CHAR(20),
province CHAR(20),
ip varchar(32),
brower CHAR(20),
url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10
PROPERTIES(
"colocate_with" = "group1"
);
For more information about colocate join and replica management, refer to Colocate join
Flat table and star schema
To adapt to the front-end business, flat tables don’t differentiate between dimension information and index information. Such flat tables often do not perform as well as expected because:
- The schema has many fields When there are a large number of key columns in the aggregation model, it may lead to an increase in the number of columns that need to be sorted during the import.
- Updates on dimension information will be reflected to the table. The frequency of updates directly affects query efficiency.
It is recommended to use star schema to distinguish dimension tables and index tables. This schema can place dimension tables with frequent updates in MySQL, and dimension tables with fewer updates in StarRocks. Creating replicas of dimension tables in StarRocks can improve join performance.
Partition and bucket
StarRocks supports two levels of partitioning storage, the first level is RANGE partition, the second level is HASH bucket.
RANGE partition: RANGE partition is used to divide the data into different intervals (can be understood as dividing the original table into multiple sub-tables). Most users choose to set partitions by time, which has the following advantages:
Data can be distinguished between hot and cold
Be able to leverage StarRocks tiered storage (SSD + SATA)
Faster to delete data by partition
HASH bucket: Divides data into different buckets according to the hash value.
It is recommended to use a column with a large degree of differentiation for bucketing to avoid data skewing
To facilitate data recovery, it is recommended to keep the size of each bucket around 10GB. Please consider the number of buckets when creating a table or adding a partition .
Sparse index and bloomfilter
StarRocks stores data in an ordered manner and builds a sparse index with block (1024 rows) granularity.
The sparse index selects a fixed-length prefix (currently 36 bytes) in the schema as the index content.
When creating a table, it is recommended to put the common filter fields ahead of the schema. The most differentiated and frequent the query fields should be placed first.
A varchar field should be placed to the end of a sparse index. If the varchar field appears first and gets truncated, the index may be less than 36 bytes long.
For the above
site_visit
tablesite_visit(siteid, city, username, pv)
The siteid, city, and username rows occupy 4, 2, and 32 bytes respectively, so the content of the prefix index is the first 30 bytes of siteid + city + username.
In addition to sparse indexes, StarRocks also provides bloomfilter indexes, which are effective for filtering columns with large differentiations. If you don't want to place a varchar field in the sparse index, you can create a bloomfilter index.
Inverted Index
StarRocks adopts Bitmap Indexing technology to support inverted indexes that can be applied to all columns of the duplicate data model and the key column of the aggregate and unique data models. Bitmap Index is suitable for columns with a small value range, such as gender, city, and province. As the range expands, the bitmap index expands in parallel.
Materialized view (rollup)
A rollup is essentially a materialized index of the original table (base table). When creating a rollup, only some columns of the base table can be selected as the schema, and the order of the fields in the schema can be different from that of the base table. Below are some use cases of using a rollup:
- The data aggregation in the base table is not high, because the base table has differentiated fields. In this case, you may consider selecting some columns to create rollups. For the
site_visit
table above:
site_visit(siteid, city, username, pv)
siteid
may lead to poor data aggregation. You may create a rollup with onlycity
andpv
.
ALTER TABLE site_visit ADD ROLLUP rollup_city(city, pv);
- The prefix index in the base table cannot be hit, because the way the base table is built cannot cover all the query patterns. In this case, you may consider creating a rollup to adjust the column order. For the
session_data
table above:
session_data(visitorid, sessionid, visittime, city, province, ip, brower, url)
- If there are cases where you can analyze visits by browser and province, you can create a separate rollup.
ALTER TABLE session_data ADD ROLLUP rollup_brower(brower,province,ip,url) DUPLICATE KEY(brower,province);
Import
StarRocks currently supports two types of imports – broker load and stream load. StarRocks guarantees atomicity for single batch imports, even if multiple tables are imported at once.
- stream load: Import by http in micro batch. Latency of importingf 1 MB data is maintained at the second level. Suitable for high frequency import.
- broker load: Import by pull. Suitable for batch data import on a daily basis.
Schema change
There are three ways to change schemas in StarRocks – sorted schema change, direct schema change, and linked schema change.
- Sorted schema change: Change the sorting of a column and reorder the data. For example, deleting a column in a sorted schema leads to data reorder.
ALTER TABLE site_visit DROP COLUMN city;
- Direct schema change: Transform the data instead of reordering it.For example, changing the column type, adding a column to a sparse index, etc.
ALTER TABLE site_visit MODIFY COLUMN username varchar(64);
- Linked schema change: Complete changes without transforming the data. For example, adding columns.
ALTER TABLE site_visit ADD COLUMN click bigint SUM default '0';
It is recommended to consider schemas during table creation to avoid problems in schema changes.