Bloom Filter Indexing
Principle
What is Bloom Filter
Bloom Filter is a data structure used to determine whether an element is a member of a set. Its advantage is that it is space- and time-efficient; its disadvantage is that it has a certain false positive rate.
A Bloom Filter consists of a bit array and n hash functions. All bits in the array are initially set to 0. When an element is inserted, each of the n hash functions maps the element to a position in the bit array, and the corresponding n bits are set to 1.
To check whether an element is in the set, the element is hashed with the same n hash functions. If any of the resulting positions holds a 0, the element definitely does not exist. If all of the positions hold a 1, the element may exist, but its presence cannot be confirmed: the bit array has a limited number of slots, so the positions computed for this element may all have been set by other, existing elements. Therefore, in the all-1 case, the underlying data still has to be read to confirm whether the element actually exists.
What is Bloom Filter Indexing
When creating a table in StarRocks, you can specify the columns to be indexed by BloomFilter through `PROPERTIES ( "bloom_filter_columns" = "c1,c2,c3" )`. When querying, the Bloom filter can quickly tell whether a given value may exist in a column: if the filter determines that the value does not exist, the corresponding data file does not need to be read at all; if all the checked bits are 1, the data block still has to be read to confirm whether the target value actually exists. Note that a Bloom filter index cannot determine which specific row contains the value.
Suitable scenarios
A bloom filter index can be built when the following conditions are met.
- Bloom Filter indexes are suitable for non-prefix filtering.
- The query frequently filters on the column, and most of the query conditions are `in` and `=` filters (see the example after this list).
- Unlike Bitmap indexes, Bloom Filter indexes are suitable for columns with high cardinality.
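For example, assuming a hypothetical table `orders` with a Bloom filter index on the high-cardinality column `customer_id`, these are the kinds of queries the index can help prune:
-- Only data files whose Bloom filter may contain the value need to be read.
SELECT * FROM orders WHERE customer_id = 327642;
SELECT * FROM orders WHERE customer_id IN (327642, 874012, 112358);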
How to use
Create an index
When creating a table, specify the columns to be indexed in the `bloom_filter_columns` property:
PROPERTIES ( "bloom_filter_columns"="k1,k2,k3" )
View an index
View the Bloom filter indexes defined on `table_name`:
SHOW CREATE TABLE table_name;
Delete an index
Deleting an index means removing the index columns from the `bloom_filter_columns` property:
ALTER TABLE example_db.my_table SET ("bloom_filter_columns" = "");
Modify an index
Modifying an index means modifying the `bloom_filter_columns` property:
ALTER TABLE example_db.my_table SET ("bloom_filter_columns" = "k1,k2,k3");
Notes
- Bloom Filter indexing is not supported for TINYINT, FLOAT, and DOUBLE columns.
- Bloom Filter indexing only accelerates `in` and `=` filter queries.
- To see whether a query hits the Bloom filter index, check the query's profile information.