Deploy and use shared-data StarRocks
This topic describes how to deploy and use a shared-data StarRocks cluster.
The shared-data StarRocks cluster is engineered specifically for the cloud, on the premise of separating storage and compute. It stores data in object storage that is compatible with the S3 protocol (for example, AWS S3 and MinIO). This gives you not only cheaper storage and better resource isolation, but also elastic scalability for your cluster. When the local cache is hit, the query performance of a shared-data cluster matches that of a classic cluster.
Compared to the classic StarRocks architecture, separation of storage and compute offers a wide range of benefits. By decoupling these components, StarRocks provides:
- Inexpensive and seamlessly scalable storage.
- Elastically scalable compute. Because data is no longer stored on BE nodes, scaling does not require data migration or shuffling across nodes.
- Local disk cache for hot data to boost query performance.
- Adjustable time-to-live (TTL) for cached hot data. StarRocks automatically removes expired cached data to save local disk space.
- Asynchronous data ingestion into object storage, which significantly improves loading performance.
Architecture of the shared-data StarRocks cluster (diagram omitted).
This feature is supported from v3.0.
Deploy a shared-data StarRocks cluster
The deployment of a shared-data StarRocks cluster is similar to that of a classic StarRocks cluster. The only difference lies in the parameters in the FE and BE configuration files, fe.conf and be.conf. This section lists only the configuration items you need to add to these files when you deploy a shared-data StarRocks cluster. For detailed instructions on deploying a StarRocks cluster, see Deploy StarRocks.
Configure FE nodes for shared-data StarRocks
Before starting FEs, add the following configuration items to the FE configuration file fe.conf:
| Configuration item | Description |
| --- | --- |
| run_mode | The running mode of the StarRocks cluster. Valid values: shared_data and shared_nothing (default). shared_data runs StarRocks in shared-data mode; shared_nothing runs StarRocks in classic mode. CAUTION: You cannot adopt the shared_data and shared_nothing modes simultaneously for a StarRocks cluster. Mixed deployment is not supported. DO NOT change run_mode after the cluster is deployed; otherwise, the cluster fails to restart. Transforming a classic cluster into a shared-data cluster, or vice versa, is not supported. |
| cloud_native_meta_port | The cloud-native meta service RPC port. Default: 6090. |
| cloud_native_storage_type | The type of object storage you use. Valid value: S3 (default). |
| aws_s3_path | The S3 path used to store data. It consists of the name of your S3 bucket and the sub-path (if any) under it. |
| aws_s3_endpoint | The endpoint used to access your S3 bucket, for example, https://s3.us-west-2.amazonaws.com. |
| aws_s3_region | The region in which your S3 bucket resides, for example, us-west-2. |
| aws_s3_use_aws_sdk_default_behavior | Whether to use the default authentication credentials of the AWS SDK. Valid values: true and false (default). |
| aws_s3_use_instance_profile | Whether to use Instance Profile and Assumed Role as credential methods for accessing S3. Valid values: true and false (default). |
| aws_s3_access_key | The Access Key ID used to access your S3 bucket. |
| aws_s3_secret_key | The Secret Access Key used to access your S3 bucket. |
| aws_s3_iam_role_arn | The ARN of the IAM role that has privileges on the S3 bucket in which your data files are stored. |
| aws_s3_external_id | The external ID of the AWS account that is used for cross-account access to your S3 bucket. |
If you use AWS S3
If you use the default authentication credentials of the AWS SDK to access S3, add the following configuration items:
```
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
aws_s3_path = <s3_path>
aws_s3_region = <region>
aws_s3_endpoint = <endpoint_url>
aws_s3_use_aws_sdk_default_behavior = true
```
If you use IAM user-based credentials (Access Key and Secret Key) to access S3, add the following configuration items:
```
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
aws_s3_path = <s3_path>
aws_s3_region = <region>
aws_s3_endpoint = <endpoint_url>
aws_s3_access_key = <access_key>
aws_s3_secret_key = <secret_key>
```
If you use Instance Profile to access S3, add the following configuration items:
```
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
aws_s3_path = <s3_path>
aws_s3_region = <region>
aws_s3_endpoint = <endpoint_url>
aws_s3_use_instance_profile = true
```
If you use Assumed Role to access S3, add the following configuration items:
```
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
aws_s3_path = <s3_path>
aws_s3_region = <region>
aws_s3_endpoint = <endpoint_url>
aws_s3_use_instance_profile = true
aws_s3_iam_role_arn = <role_arn>
```
If you use Assumed Role to access S3 from an external AWS account, add the following configuration items:
```
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
aws_s3_path = <s3_path>
aws_s3_region = <region>
aws_s3_endpoint = <endpoint_url>
aws_s3_use_instance_profile = true
aws_s3_iam_role_arn = <role_arn>
aws_s3_external_id = <external_id>
```
If you use GCP Cloud Storage:
```
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
aws_s3_path = <s3_path>
# For example: us-east-1
aws_s3_region = <region>
# For example: https://storage.googleapis.com
aws_s3_endpoint = <endpoint_url>
aws_s3_access_key = <access_key>
aws_s3_secret_key = <secret_key>
```
If you use MinIO:
```
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
aws_s3_path = <s3_path>
# For example: us-east-1
aws_s3_region = <region>
# For example: http://172.26.xx.xxx:39000
aws_s3_endpoint = <endpoint_url>
aws_s3_access_key = <access_key>
aws_s3_secret_key = <secret_key>
```
Configure BE nodes for shared-data StarRocks
Before starting BEs, add the following configuration items to the BE configuration file be.conf:

```
starlet_port = <starlet_port>
storage_root_path = <storage_root_path>
```
| Configuration item | Description |
| --- | --- |
| starlet_port | The BE heartbeat service port. Default: 9070. |
| storage_root_path | The directory of the storage volume on which the local cached data depends, together with the storage medium type. Separate multiple volumes with semicolons (;). If the storage medium is SSD, append ,medium:ssd to the directory; if it is HDD, append ,medium:hdd. Example: /data1,medium:hdd;/data2,medium:ssd. Default: ${STARROCKS_HOME}/storage. |
NOTE
The data is cached under the directory <storage_root_path>/starlet_cache.
Use your shared-data StarRocks cluster
The usage of a shared-data StarRocks cluster is also similar to that of a classic StarRocks cluster. The only difference is that you must create tables of a special type so that your data is stored in object storage.
Create a table
After connecting to your shared-data StarRocks cluster, create a database and then a table in it. Currently, shared-data StarRocks clusters support the following table types:
- Duplicate Key table
- Aggregate table
- Unique Key table
NOTE
Currently, Primary Key tables are not supported in shared-data StarRocks clusters.
The following example creates a database cloud_db and a table detail_demo based on the Duplicate Key table type, enables the local disk cache, sets the cache expiration time to 30 days, and disables asynchronous data ingestion into object storage:
```SQL
CREATE DATABASE cloud_db;
USE cloud_db;
CREATE TABLE IF NOT EXISTS detail_demo (
    recruit_date  DATE          NOT NULL COMMENT "YYYY-MM-DD",
    region_num    TINYINT       COMMENT "range [-128, 127]",
    num_plate     SMALLINT      COMMENT "range [-32768, 32767]",
    tel           INT           COMMENT "range [-2147483648, 2147483647]",
    id            BIGINT        COMMENT "range [-2^63 + 1 ~ 2^63 - 1]",
    password      LARGEINT      COMMENT "range [-2^127 + 1 ~ 2^127 - 1]",
    name          CHAR(20)      NOT NULL COMMENT "range char(m), m in (1-255)",
    profile       VARCHAR(500)  NOT NULL COMMENT "upper limit value 65533 bytes",
    ispass        BOOLEAN       COMMENT "true/false"
)
DUPLICATE KEY(recruit_date, region_num)
DISTRIBUTED BY HASH(recruit_date, region_num) BUCKETS 96
PROPERTIES (
    "enable_storage_cache" = "true",
    "storage_cache_ttl" = "2592000",
    "enable_async_write_back" = "false"
);
```
In addition to the regular table PROPERTIES, you must specify the following PROPERTIES when creating a table in a shared-data StarRocks cluster:
| Property | Description |
| --- | --- |
| enable_storage_cache | Whether to enable the local disk cache. Default: true. To enable the local disk cache, you must specify the directory of the disk in the BE configuration item starlet_cache_dir. |
| storage_cache_ttl | The duration for which StarRocks caches loaded data on the local disk when the local disk cache is enabled. Expired data is deleted from the local disk. If the value is set to -1, the cached data does not expire. Default: 2592000 (30 days). CAUTION: If you disable the local disk cache, you do not need to set this item. Setting this item to a value other than 0 while the local disk cache is disabled causes unexpected behavior in StarRocks. |
| enable_async_write_back | Whether to allow data to be written into object storage asynchronously. Default: false. |
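To verify these properties on an existing table, one simple option (a sketch using the detail_demo table created above) is to inspect the table's DDL:

```SQL
-- The output includes the PROPERTIES section of the table.
SHOW CREATE TABLE detail_demo;
```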
View table information
You can view the information of tables in a specific database using SHOW PROC "/dbs/<db_id>". See SHOW PROC for more information.
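If you do not know the database ID, a minimal sketch (assuming the standard PROC layout, where "/dbs" lists databases with their IDs) is to list the databases first and then drill in:

```SQL
-- List all databases with their IDs, then inspect the one you need.
SHOW PROC "/dbs";
SHOW PROC "/dbs/<db_id>";
```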
Example:
```
mysql> SHOW PROC "/dbs/xxxxx";
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
| TableId | TableName   | IndexNum | PartitionColumnName | PartitionNum | State  | Type         | LastConsistencyCheckTime | ReplicaCount | PartitionType | StoragePath                  |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
| 12003   | detail_demo | 1        | NULL                | 1            | NORMAL | CLOUD_NATIVE | NULL                     | 8            | UNPARTITIONED | s3://xxxxxxxxxxxxxx/1/12003/ |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
```
The Type of a table in a shared-data StarRocks cluster is CLOUD_NATIVE. In the field StoragePath, StarRocks returns the object storage directory where the table is stored.
Load data into a shared-data StarRocks cluster
Shared-data StarRocks clusters support all loading methods provided by StarRocks. See Overview of data loading for more information.
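For example, a minimal sketch that loads a few rows into the detail_demo table created above using INSERT, one of the supported loading methods (the values are illustrative only):

```SQL
-- Load two sample rows via INSERT INTO VALUES.
INSERT INTO detail_demo VALUES
    ('2023-02-21', 1, 100, 1234567, 100000001, 11111111111, 'Alice', 'profile text', true),
    ('2023-02-22', 2, 200, 7654321, 100000002, 22222222222, 'Bob', 'another profile', false);
```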
Query in a shared-data StarRocks cluster
Tables in a shared-data StarRocks cluster support all types of queries provided by StarRocks. See StarRocks SELECT for more information.
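For example, a simple aggregation over the detail_demo table created above works exactly as it would on a classic cluster:

```SQL
-- Count rows per region loaded since the start of 2023.
SELECT region_num, COUNT(*) AS recruit_cnt
FROM detail_demo
WHERE recruit_date >= '2023-01-01'
GROUP BY region_num
ORDER BY recruit_cnt DESC;
```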