Version: Candidate-3.4

Feature Support: Data Lake Analytics

From v2.3 onwards, StarRocks supports managing external data sources and analyzing data in data lakes via external catalogs.

This document outlines the features supported for external catalogs and the versions in which each feature is supported.

Universal features

This section lists the features common to all external catalogs, including storage systems, file readers, credentials, privileges, and Data Cache.

External storage systems

Storage System | Supported Version
HDFS | v2.3+
AWS S3 | v2.3+
Microsoft Azure Storage | v3.0+
Google GCS | v3.0+
Alibaba Cloud OSS | v3.1+
Huawei Cloud OBS | v3.1+
Tencent Cloud COS | v3.1+
Volcengine TOS | v3.1+
Kingsoft Cloud KS3 | v3.1+
MinIO | v3.1+
Ceph S3 | v3.1+

In addition to the native support for the storage systems listed above, StarRocks also supports the following types of object storage services:

  • HDFS-compatible object storage services such as COS Cloud HDFS, OSS-HDFS, and OBS PFS
    • Description: You need to specify the object storage URI prefix in the BE configuration item fallback_to_hadoop_fs_list, and upload the .jar package provided by the cloud vendor to the directory /lib/hadoop/hdfs/. Note that you must create the external catalog using the prefix you specified in fallback_to_hadoop_fs_list.
    • Supported Version(s): v3.1.9+, v3.2.4+
  • S3-compatible object storage services other than those listed above
    • Description: You need to specify the object storage URI prefix in the BE configuration item s3_compatible_fs_list. Note that you must create the external catalog using the prefix you specified in s3_compatible_fs_list.
    • Supported Version(s): v3.1.9+, v3.2.4+
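
The prefix requirement described above can be checked before a catalog is created. The following Python sketch is illustrative only: the helper names, the example prefixes, and the assumption that the configuration value is a comma-separated list of URI prefixes are not taken from this document.

```python
def parse_prefix_list(config_value: str) -> list[str]:
    """Split a comma-separated prefix list (assumed format) into clean prefixes."""
    return [p.strip() for p in config_value.split(",") if p.strip()]


def uri_matches_configured_prefix(uri: str, config_value: str) -> bool:
    """Return True if the storage URI starts with one of the configured prefixes."""
    return any(uri.startswith(prefix) for prefix in parse_prefix_list(config_value))


# Hypothetical value for s3_compatible_fs_list (or fallback_to_hadoop_fs_list).
configured = "s3a://, mystore://"

print(uri_matches_configured_prefix("mystore://bucket/warehouse", configured))  # True
print(uri_matches_configured_prefix("other://bucket/warehouse", configured))    # False
```

A check like this only confirms that the catalog URI and the BE configuration agree on the prefix; it does not verify that the storage service itself is reachable.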

Compression formats

This section only lists the compression formats supported by each file format. For the file formats supported by each external catalog, please refer to the section on the corresponding external catalog.

File Format | Compression Formats
Parquet | NO_COMPRESSION, SNAPPY, LZ4, ZSTD, GZIP, LZO (v3.1.5+)
ORC | NO_COMPRESSION, ZLIB, SNAPPY, LZO, LZ4, ZSTD
Text | NO_COMPRESSION, LZO (v3.1.5+)
Avro | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), BZIP2 (v3.2.1+)
RCFile | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), GZIP (v3.2.1+)
SequenceFile | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), BZIP2 (v3.2.1+), GZIP (v3.2.1+)

note

The Avro, RCFile, and SequenceFile file formats are read through the Java Native Interface (JNI) instead of the native readers within StarRocks. Therefore, the read performance for these file formats may not be as good as that of Parquet and ORC.

Management, credential, and access control

Feature | Description | Supported Version(s)
Information Schema | Supports Information Schema for external catalogs. | v3.2+
Data lake access control | Supports StarRocks' native RBAC model for external catalogs. You can manage the privileges of databases, tables, and views (currently, Hive views and Iceberg views only) in external catalogs just like those in the default catalog of StarRocks. | v3.0+
Reuse external services on Apache Ranger | Supports reusing the external service (such as the Hive Service) on Apache Ranger for access control. | v3.1.9+
Kerberos authentication | Supports Kerberos authentication for HDFS or Hive Metastore. | v2.3+

Data Cache

  • Data Cache (Block Cache)
    • Description: StarRocks has supported the Data Cache feature (then called Block Cache) since v2.5. The original implementation was based on CacheLib, which limited how far it could be extended and optimized. In v3.0, StarRocks refactored the cache implementation and added new features to Data Cache, and performance has improved with each subsequent version.
    • Supported Version(s): v2.5+
  • Data rebalancing among local disks
    • Description: Supports a data rebalancing strategy that keeps data skew under 10%.
    • Supported Version(s): v3.2+
  • Replace Block Cache with Data Cache
    • Description: Parameter changes
      • BE Configurations:
        • Replace block_cache_enable with datacache_enable.
        • Replace block_cache_mem_size with datacache_mem_size.
        • Replace block_cache_disk_size with datacache_disk_size.
        • Replace block_cache_disk_path with datacache_disk_path.
        • Replace block_cache_meta_path with datacache_meta_path.
        • Replace block_cache_block_size with datacache_block_size.
      • Session Variables:
        • Replace enable_scan_block_cache with enable_scan_datacache.
        • Replace enable_populate_block_cache with enable_populate_datacache.
      After the cluster is upgraded to a version where Data Cache is available, the Block Cache parameters still take effect. Once Data Cache is enabled, the new parameters override the old ones. Do not mix the two groups of parameters; otherwise, some parameters will not take effect.
    • Supported Version(s): v3.2+
  • New metrics for API that monitors Data Cache
    • Description: Supports a dedicated API for monitoring Data Cache, including cache capacity and hits. You can view Data Cache metrics via the interface http://${BE_HOST}:${BE_HTTP_PORT}/api/datacache/stat.
    • Supported Version(s): v3.2.3+
  • Memory Tracker for Data Cache
    • Description: Supports Memory Tracker for Data Cache. You can view the memory-related metrics via the interface http://${BE_HOST}:${BE_HTTP_PORT}/mem_tracker.
    • Supported Version(s): v3.1.8+
  • Data Cache Warmup
    • Description: By executing CACHE SELECT, you can proactively populate the cache with the desired data from remote storage so that the first query does not spend too much time fetching the data. CACHE SELECT does not return result data or perform calculations. It only fetches data.
    • Supported Version(s): v3.3+
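
The Block Cache to Data Cache renames listed above are mechanical, so they can be applied with a small script. The following Python sketch is one possible approach, not a StarRocks tool: the function name `migrate_parameters` and the example values are hypothetical, while the parameter names come from the list above. It rejects input that sets both the old and the new name of a parameter, since mixing the two groups is not allowed.

```python
# Old Block Cache name -> new Data Cache name, as listed above.
BE_CONFIG_RENAMES = {
    "block_cache_enable": "datacache_enable",
    "block_cache_mem_size": "datacache_mem_size",
    "block_cache_disk_size": "datacache_disk_size",
    "block_cache_disk_path": "datacache_disk_path",
    "block_cache_meta_path": "datacache_meta_path",
    "block_cache_block_size": "datacache_block_size",
}

SESSION_VARIABLE_RENAMES = {
    "enable_scan_block_cache": "enable_scan_datacache",
    "enable_populate_block_cache": "enable_populate_datacache",
}


def migrate_parameters(params: dict, renames: dict) -> dict:
    """Return a copy of params with Block Cache names replaced by Data Cache names.

    Raises ValueError if both the old and the new name of a parameter are set.
    """
    migrated = {}
    for name, value in params.items():
        new_name = renames.get(name, name)
        if new_name != name and new_name in params:
            raise ValueError(f"both {name} and {new_name} are set")
        migrated[new_name] = value
    return migrated


# Hypothetical BE settings; unrelated items pass through unchanged.
old_settings = {"block_cache_enable": "true", "block_cache_disk_size": "100G", "some_other_item": "1"}
print(migrate_parameters(old_settings, BE_CONFIG_RENAMES))
```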

Hive Catalog

Metadata

Hive Catalog's support for Hive Metastore (HMS) and AWS Glue mostly overlaps except that the automatic incremental update feature for HMS is not recommended. The default configuration is recommended in most cases.

The performance of metadata retrieval largely depends on the performance of the user's HMS or HDFS NameNode. Please consider all factors and base your judgment on test results.

  • [Default and Recommended] Best performance with a tolerance of minute-level data inconsistency
    • Configuration: You can use the default setting. Data updated within 10 minutes (by default) is not visible. Old data will be returned to queries within this duration.
    • Advantage: Best query performance.
    • Disadvantage: Data inconsistency caused by latency.
    • Supported Version(s): v2.5.5+ (Disabled by default in v2.5 and enabled by default in v3.0+)
  • Instant visibility of newly loaded data (files) without manual refresh
    • Configuration: Disable the cache for the metadata of the underlying data files by setting the catalog property enable_remote_file_cache to false.
    • Advantage: Visibility of file changes with no delay.
    • Disadvantage: Lower performance when the file metadata cache is disabled. Each query must access the file list.
    • Supported Version(s): v2.5.5+
  • Instant visibility of partition changes without manual refresh
    • Configuration: Disable the cache for the Hive partition names by setting the catalog property enable_cache_list_names to false.
    • Advantage: Visibility of partition changes with no delay.
    • Disadvantage: Lower performance when the partition name cache is disabled. Each query must access the partition list.
    • Supported Version(s): v2.5.5+
tip

If you need data changes to become visible in real time but the performance of your HMS is not optimized, you can enable the cache, disable the automatic incremental update, and manually refresh the metadata (using REFRESH EXTERNAL TABLE) via a scheduling system whenever there is a data change upstream.
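
The three metadata options above differ only in which catalog properties are changed from their defaults. The following Python sketch summarizes that choice; the function name and its arguments are hypothetical, and only the property names enable_remote_file_cache and enable_cache_list_names come from this document.

```python
def hive_metadata_cache_properties(instant_file_visibility: bool = False,
                                   instant_partition_visibility: bool = False) -> dict:
    """Return the catalog properties that must differ from the defaults.

    An empty result means the default (and recommended) configuration applies.
    """
    properties = {}
    if instant_file_visibility:
        # Disable the cache for the metadata of the underlying data files.
        properties["enable_remote_file_cache"] = "false"
    if instant_partition_visibility:
        # Disable the cache for the Hive partition names.
        properties["enable_cache_list_names"] = "false"
    return properties


print(hive_metadata_cache_properties())                              # {}
print(hive_metadata_cache_properties(instant_file_visibility=True))  # {'enable_remote_file_cache': 'false'}
```

Each property that is switched off trades query performance for freshness, so it is worth disabling only the cache that matches the kind of change you need to see immediately.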

Storage system

Feature | Description | Supported Version(s)
Recursive sub-directory listing | Enable recursive sub-directory listing by setting the catalog property enable_recursive_listing to true. When recursive listing is enabled, StarRocks will read data from a table and its partitions and from the subdirectories within the physical locations of the table and its partitions. This feature is designed to address the issue of multi-layer nested directories. | v2.5.9+, v3.0.4+ (Disabled by default in v2.5 and v3.0, and enabled by default in v3.1+)

File formats and data types

File formats

Feature | Supported File Formats
Read | Parquet, ORC, TEXT, Avro, RCFile, SequenceFile
Sink | Parquet (v3.2+), ORC (v3.3+), TEXT (v3.3+)

Data types

INTERVAL, BINARY, and UNION types are not supported.

TEXT-formatted Hive tables do not support MAP and STRUCT types.

Hive views

StarRocks supports querying Hive views from v3.1.0 onwards.

When StarRocks executes queries against a Hive view, it tries to parse the definition of the view using the syntax of StarRocks and Trino. An error is returned if StarRocks cannot parse the definition of the view. StarRocks may fail to parse Hive views created with functions exclusive to Hive or Spark.

Query statistics interfaces

Feature | Supported Version(s)
Supports SHOW CREATE TABLE to view Hive table schema | v3.0+
Supports ANALYZE to collect statistics | v3.2+
Supports collecting histograms and STRUCT subfield statistics | v3.3+

Data sinking

Feature | Supported Version(s) | Note
CREATE DATABASE | v3.2+ | Specifying the location for a database created in Hive is optional. If you do not specify the location for the database, you must specify the location for each table created under the database. Otherwise, an error will be returned. If you have specified the location for the database, tables without a location specified will inherit the location of the database. If you have specified locations for both the database and the table, the table's location takes precedence.
CREATE TABLE | v3.2+ | For both partitioned and non-partitioned tables.
CREATE TABLE AS SELECT | v3.2+ |
INSERT INTO/OVERWRITE | v3.2+ | For both partitioned and non-partitioned tables.
CREATE TABLE LIKE | v3.2.4+ |
Sink file size | v3.3+ | You can define the maximum size of each data file to be sunk using the session variable connector_sink_target_max_file_size.
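
The location rules for CREATE DATABASE and CREATE TABLE described above amount to a simple precedence check. The following Python sketch restates them; the function name and the example paths are hypothetical, and it reports only which location applies, not the final path that is derived from it.

```python
from typing import Optional


def location_source(database_location: Optional[str],
                    table_location: Optional[str]) -> str:
    """Return which location applies to a new table: 'table' or 'database'.

    Raises ValueError when neither location is specified, mirroring the
    error described in the table above.
    """
    if table_location is not None:
        # A location given for the table wins over the database's location.
        return "table"
    if database_location is not None:
        # The table inherits the location of its database.
        return "database"
    raise ValueError("no location specified for the database or the table")


# Hypothetical paths for illustration.
print(location_source("s3://bucket/db", "s3://bucket/custom/tbl"))  # table
print(location_source("s3://bucket/db", None))                      # database
```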

Iceberg Catalog

Metadata

Iceberg Catalog supports HMS, Glue, and Tabular as its metastore. The default configuration is recommended in most cases.

Please note that the default value of the session variable enable_iceberg_metadata_cache has been changed to accommodate different scenarios:

  • From v3.2.1 to v3.2.3, this parameter is set to true by default, regardless of what metastore service is used.
  • In v3.2.4 and later, if the Iceberg cluster uses AWS Glue as metastore, this parameter still defaults to true. However, if the Iceberg cluster uses other metastore services such as Hive metastore, this parameter defaults to false.
  • From v3.3.0 onwards, the default value of this parameter is set to true again because StarRocks supports the new Iceberg metadata framework. Iceberg Catalog and Hive Catalog now use the same metadata polling mechanism and FE configuration item background_refresh_metadata_interval_millis.

Feature | Supported Version(s)
Distributed metadata plan (Recommended for scenarios with a large volume of metadata) | v3.3+
Manifest Cache (Recommended for scenarios with a small volume of metadata but high demand on latency) | v3.3+
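
The version-dependent default of enable_iceberg_metadata_cache described in this section can be summarized as a lookup. The following Python sketch encodes only what is stated above; the function name and the metastore labels "glue" and "hive" are hypothetical, and versions earlier than v3.2.1 return None because this section does not describe them.

```python
from typing import Optional, Tuple


def default_iceberg_metadata_cache(version: Tuple[int, int, int],
                                   metastore: str) -> Optional[bool]:
    """Return the documented default of enable_iceberg_metadata_cache.

    version is a (major, minor, patch) tuple; metastore is a label such as
    "glue" or "hive" (labels chosen for this sketch).
    """
    if version < (3, 2, 1):
        return None  # not covered by the text above
    if version <= (3, 2, 3):
        return True  # true regardless of the metastore service
    if version < (3, 3, 0):
        return metastore == "glue"  # true for AWS Glue, false otherwise
    return True  # true again from v3.3.0 onwards


print(default_iceberg_metadata_cache((3, 2, 4), "hive"))  # False
print(default_iceberg_metadata_cache((3, 3, 0), "hive"))  # True
```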

File formats

Feature | Supported File Formats
Read | Parquet, ORC
Sink | Parquet

  • Both Parquet-formatted and ORC-formatted Iceberg V1 tables support position deletes and equality deletes.
  • ORC-formatted Iceberg V2 tables support position deletes from v3.0.0, and Parquet-formatted ones support position deletes from v3.1.0.
  • ORC-formatted Iceberg V2 tables support equality deletes from v3.1.8 and v3.2.3, and Parquet-formatted ones support equality deletes from v3.2.5.

Iceberg views

StarRocks supports querying Iceberg views from v3.3.2 onwards. Currently, only Iceberg views created through StarRocks are supported.

note

When StarRocks executes queries against an Iceberg view, it tries to parse the definition of the view using the syntax of StarRocks and Trino. An error is returned if StarRocks cannot parse the definition of the view. StarRocks may fail to parse Iceberg views created with functions exclusive to Iceberg or Spark.

Query statistics interfaces

Feature | Supported Version(s)
Supports SHOW CREATE TABLE to view Iceberg table schema | v3.0+
Supports ANALYZE to collect statistics | v3.2+
Supports collecting histograms and STRUCT subfield statistics | v3.3+

Data sinking

Feature | Supported Version(s) | Note
CREATE DATABASE | v3.1+ | Specifying the location for a database created in Iceberg is optional. If you do not specify the location for the database, you must specify the location for each table created under the database. Otherwise, an error will be returned. If you have specified the location for the database, tables without a location specified will inherit the location of the database. If you have specified locations for both the database and the table, the table's location takes precedence.
CREATE TABLE | v3.1+ | For both partitioned and non-partitioned tables.
CREATE TABLE AS SELECT | v3.1+ |
INSERT INTO/OVERWRITE | v3.1+ | For both partitioned and non-partitioned tables.

Miscellaneous supports

Feature | Supported Version(s)
Supports reading TIMESTAMP-type partition formats yyyy-MM-ddTHH:mm and yyyy-MM-dd HH:mm. | v2.5.19+, v3.1.9+, v3.2.3+
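
To make the two TIMESTAMP partition formats above concrete, the following Python sketch parses sample values in both shapes. It illustrates the formats only and is not how StarRocks reads them; the function name and sample values are hypothetical, and the strptime patterns are the Python equivalents of yyyy-MM-ddTHH:mm and yyyy-MM-dd HH:mm.

```python
from datetime import datetime

# Python equivalents of yyyy-MM-ddTHH:mm and yyyy-MM-dd HH:mm.
PARTITION_FORMATS = ("%Y-%m-%dT%H:%M", "%Y-%m-%d %H:%M")


def parse_partition_timestamp(value: str) -> datetime:
    """Parse a partition value written in either of the two formats."""
    for fmt in PARTITION_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unsupported partition value: {value!r}")


print(parse_partition_timestamp("2024-01-15T08:30"))  # 2024-01-15 08:30:00
print(parse_partition_timestamp("2024-01-15 08:30"))  # 2024-01-15 08:30:00
```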

Hudi Catalog

  • StarRocks supports querying the Parquet-formatted data in Hudi, and supports SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION compression formats for Parquet files.
  • StarRocks fully supports Hudi's Copy On Write (COW) tables and Merge On Read (MOR) tables.
  • StarRocks supports SHOW CREATE TABLE to view Hudi table schema from v3.0.0 onwards.

Delta Lake Catalog

  • StarRocks supports querying the Parquet-formatted data in Delta Lake, and supports SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION compression formats for Parquet files.
  • StarRocks does not support querying the MAP-type and STRUCT-type data in Delta Lake.
  • StarRocks supports SHOW CREATE TABLE to view Delta Lake table schema from v3.0.0 onwards.

JDBC Catalog

Catalog type | Supported Version(s)
MySQL | v3.0+
PostgreSQL | v3.0+
ClickHouse | v3.3+
Oracle | v3.2.9+
SQL Server | v3.2.9+

MySQL

Feature | Supported Version(s)
Metadata cache | v3.3+

Data type correspondence

MySQL | StarRocks | Supported Version(s)
BOOLEAN | BOOLEAN | v2.3+
BIT | BOOLEAN | v2.3+
SIGNED TINYINT | TINYINT | v2.3+
UNSIGNED TINYINT | SMALLINT | v3.0.6+, v3.1.2+
SIGNED SMALLINT | SMALLINT | v2.3+
UNSIGNED SMALLINT | INT | v3.0.6+, v3.1.2+
SIGNED INTEGER | INT | v2.3+
UNSIGNED INTEGER | BIGINT | v3.0.6+, v3.1.2+
SIGNED BIGINT | BIGINT | v2.3+
UNSIGNED BIGINT | LARGEINT | v3.0.6+, v3.1.2+
FLOAT | FLOAT | v2.3+
REAL | FLOAT | v3.0.1+
DOUBLE | DOUBLE | v2.3+
DECIMAL | DECIMAL32 | v2.3+
CHAR | VARCHAR(columnsize) | v2.3+
VARCHAR | VARCHAR | v2.3+
TEXT | VARCHAR(columnsize) | v3.0.1+
DATE | DATE | v2.3+
TIME | TIME | v3.1.9+, v3.2.4+
TIMESTAMP | DATETIME | v2.3+
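
In the table above, each UNSIGNED integer type maps to the next wider StarRocks type. The following Python sketch shows the arithmetic behind that pattern: an unsigned N-bit maximum does not fit in a signed N-bit type but does fit in the next wider one. The dictionary and function names are hypothetical; the bit widths are the standard storage sizes of these types.

```python
# Bit widths of the MySQL unsigned types and the signed StarRocks types.
MYSQL_UNSIGNED_BITS = {
    "UNSIGNED TINYINT": 8,
    "UNSIGNED SMALLINT": 16,
    "UNSIGNED INTEGER": 32,
    "UNSIGNED BIGINT": 64,
}
STARROCKS_SIGNED_BITS = {"TINYINT": 8, "SMALLINT": 16, "INT": 32, "BIGINT": 64, "LARGEINT": 128}

# Mapping taken from the table above.
UNSIGNED_MAPPING = {
    "UNSIGNED TINYINT": "SMALLINT",
    "UNSIGNED SMALLINT": "INT",
    "UNSIGNED INTEGER": "BIGINT",
    "UNSIGNED BIGINT": "LARGEINT",
}


def unsigned_max(bits: int) -> int:
    """Largest value of an unsigned integer with the given width."""
    return 2 ** bits - 1


def signed_max(bits: int) -> int:
    """Largest value of a signed (two's complement) integer with the given width."""
    return 2 ** (bits - 1) - 1


def fits(mysql_type: str, starrocks_type: str) -> bool:
    """Return True if every value of mysql_type fits in starrocks_type."""
    return unsigned_max(MYSQL_UNSIGNED_BITS[mysql_type]) <= signed_max(STARROCKS_SIGNED_BITS[starrocks_type])


print(fits("UNSIGNED TINYINT", "TINYINT"))   # False: 255 > 127
print(fits("UNSIGNED TINYINT", "SMALLINT"))  # True: 255 <= 32767
```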

PostgreSQL

Data type correspondence

PostgreSQL | StarRocks | Supported Version(s)
BIT | BOOLEAN | v2.3+
SMALLINT | SMALLINT | v2.3+
INTEGER | INT | v2.3+
BIGINT | BIGINT | v2.3+
REAL | FLOAT | v2.3+
DOUBLE | DOUBLE | v2.3+
NUMERIC | DECIMAL32 | v2.3+
CHAR | VARCHAR(columnsize) | v2.3+
VARCHAR | VARCHAR | v2.3+
TEXT | VARCHAR(columnsize) | v2.3+
DATE | DATE | v2.3+
TIMESTAMP | DATETIME | v2.3+

ClickHouse

Supported from v3.3.0 onwards.

Oracle

Supported from v3.2.9 onwards.

SQL Server

Supported from v3.2.9 onwards.

Elasticsearch Catalog

Elasticsearch Catalog is supported from v3.1.0 onwards.

Paimon Catalog

Paimon Catalog is supported from v3.1.0 onwards.

MaxCompute Catalog

MaxCompute Catalog is supported from v3.3.0 onwards.

Kudu Catalog

Kudu Catalog is supported from v3.3.0 onwards.