Vector Index
This topic introduces the vector index feature of StarRocks and how to perform an approximate nearest neighbor search (ANNS) with it.
The vector index feature is only supported in shared-nothing clusters of v3.4 or later.
Overview
Currently, StarRocks supports vector indexing at the Segment file level. The index maps each search item to its row ID within Segment files, allowing fast data retrieval by directly locating the corresponding data rows instead of performing brute-force vector distance calculations. StarRocks provides two types of vector indexes: Inverted File with Product Quantization (IVFPQ) and Hierarchical Navigable Small World (HNSW), each with its own organizational structure.
Inverted File with Product Quantization (IVFPQ)
Inverted File with Product Quantization (IVFPQ) is a method for approximate nearest neighbor search in large-scale high-dimensional vectors, commonly used in vector retrieval tasks within deep learning and machine learning. IVFPQ consists of two main components: the inverted file and product quantization.
- Inverted File: This component is an indexing method. It divides the dataset into multiple clusters (or Voronoi cells), each with a centroid (seed point), and assigns each data point (vector) to its nearest cluster center. For a vector search, IVFPQ only needs to search the nearest cluster, significantly reducing the search scope and complexity.
- Product Quantization: This component is a data compression technique. It splits high-dimensional vectors into subvectors and quantizes each subvector to map it to the nearest point in a predefined set, reducing storage and computation costs while retaining high precision.
By combining inverted files and product quantization, IVFPQ enables efficient approximate nearest neighbor search in large, high-dimensional datasets.
Hierarchical Navigable Small World (HNSW)
Hierarchical Navigable Small World (HNSW) is a graph-based algorithm for high-dimensional nearest neighbor search, also widely used in vector retrieval tasks.
HNSW builds a hierarchical graph structure in which each layer is a navigable small world (NSW) graph. In the graph, each vertex represents a data point, and edges indicate similarity between vertices. The higher layers of the graph contain fewer vertices with sparser connections for fast global search, while the lower layers include all vertices with denser connections for precise local search.
For a vector search, HNSW searches at the top layer first, quickly identifying an approximate nearest neighbor region, then moves down layer by layer to find the exact nearest neighbor at the bottom layer.
HNSW offers both efficiency and precision, making it adaptable to various data and query distributions.
Comparison between IVFPQ and HNSW
- Data Compression Ratio: IVFPQ has a higher compression ratio (around 1:0.15). Because PQ compresses the vectors, the index calculation only provides preliminary, roughly ranked results, and an additional fine-ranking step is required to produce the final ranking, which leads to higher computation costs and latency. HNSW has a lower compression ratio (around 1:0.8) and provides precise ranking without extra processing, resulting in lower computation costs and latency but higher storage costs.
- Recall Adjustment: Both indexes support recall rate tuning through parameter adjustments, but IVFPQ incurs higher computation costs at similar recall rates.
- Caching Strategy: IVFPQ allows balancing memory cost and computation latency by adjusting the cache proportion of index blocks, while HNSW currently only supports full-file caching.
Usage
Each table supports only one vector index.
Prerequisites
Before creating vector indexes, you must enable this feature by setting the FE configuration item enable_experimental_vector to true.
Execute the following statement to enable it dynamically:
ADMIN SET FRONTEND CONFIG ("enable_experimental_vector" = "true");
To enable it permanently, you must add enable_experimental_vector = true to the FE configuration file fe.conf and restart the FE.
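That is, add the following line to fe.conf:

enable_experimental_vector = true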
Create vector index
This tutorial creates vector indexes while creating tables. You can also append vector indexes to an existing table. See Append vector index for detailed instructions.
- The following example creates an HNSW vector index hnsw_vector on column vector in table hnsw.

CREATE TABLE hnsw (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX hnsw_vector (vector) USING VECTOR (
"index_type" = "hnsw",
"dim"="5",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"M" = "16",
"efconstruction" = "40"
)
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;

- The following example creates an IVFPQ vector index ivfpq_vector on column vector in table ivfpq.

CREATE TABLE ivfpq (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX ivfpq_vector (vector) USING VECTOR (
"index_type" = "ivfpq",
"dim"="5",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"nbits" = "16",
"nlist" = "40"
)
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;
Index construction parameters
USING VECTOR
- Default: N/A
- Required: Yes
- Description: Creates a vector index.
index_type
- Default: N/A
- Required: Yes
- Description: Vector index type. Valid values: hnsw and ivfpq.
dim
- Default: N/A
- Required: Yes
- Description: Dimension of the index. After the index is built, vectors that do not meet the dimension requirement are rejected when loaded into the base column. It must be an integer greater than or equal to 1.
metric_type
- Default: N/A
- Required: Yes
- Description: Metric type (measurement function) of the vector index. Valid values:
  - l2_distance: Euclidean distance. The smaller the value, the higher the similarity.
  - cosine_similarity: Cosine similarity. The larger the value, the higher the similarity.
is_vector_normed
- Default: false
- Required: No
- Description: Whether the vectors are normalized. Valid values are true and false. It takes effect only when metric_type is cosine_similarity. If the vectors are normalized, the calculated distances fall within [-1, 1]. The vectors must satisfy that the sum of the squares of their elements is 1; otherwise, an error is returned.
M
- Default: 16
- Required: No
- Description: HNSW-specific parameter. The number of bi-directional connections created for every new element during graph construction. It must be an integer greater than or equal to 2. The value of M directly affects the efficiency and accuracy of graph construction and search. During graph construction, each vertex attempts to establish connections with its nearest M vertices. If a vertex already has M connections but finds a closer vertex, the farthest connection is deleted and a new connection is established with the closer vertex. A vector search starts from an entry point and finds the nearest neighbors along the vertices connected to it. Therefore, the larger the value of M, the larger the search scope for each vertex and the higher the search efficiency, but also the higher the cost of graph construction and storage.
efconstruction
- Default: 40
- Required: No
- Description: HNSW-specific parameter. The size of the candidate list containing the nearest neighbors. It must be an integer greater than or equal to 1. It is used to control the search depth during graph construction. Specifically, efconstruction defines the size of the search list (also known as the candidate list) for each vertex during graph construction; this list stores the neighbor candidates of the current vertex. The larger the value of efconstruction, the more candidates are considered as neighbors of the vertex during graph construction, and therefore the better the quality of the graph (for example, better connectivity), but also the higher the time consumption and computational complexity of graph construction.
nbits
- Default: 16
- Required: No
- Description: IVFPQ-specific parameter. Precision of Product Quantization (PQ). It must be a multiple of 8. With IVFPQ, each vector is divided into multiple subvectors, and then each subvector is quantized. nbits defines the precision of quantization, that is, how many binary digits each subvector is quantized into. The larger the value of nbits, the higher the quantization precision, but the higher the storage and computation costs.
nlist
- Default: 16
- Required: No
- Description: IVFPQ-specific parameter. Number of clusters, or inverted lists. It must be an integer greater than or equal to 1. With IVFPQ, the dataset is divided into clusters, and the centroid of each cluster corresponds to an inverted list. A vector search first finds the cluster centroid closest to the data point, and then retrieves the nearest neighbors in the corresponding inverted list. Therefore, the value of nlist affects both the accuracy and the efficiency of the search. The larger the value of nlist, the finer the granularity of clustering, thus the higher the search accuracy, but also the higher the search complexity.
M_IVFPQ
- Default: N/A
- Required: Yes
- Description: IVFPQ-specific parameter. Number of subvectors the original vector is split into. The IVFPQ index divides a dim-dimensional vector into M_IVFPQ equal-length subvectors. Therefore, it must be a factor of the value of dim. See the sketch after this parameter list.
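As a minimal sketch, the following index definition sets M_IVFPQ explicitly. The table and index names are hypothetical; with dim set to 5, M_IVFPQ is set to 5 (a factor of dim), so each vector is split into five 1-dimensional subvectors.

-- Hypothetical IVFPQ index definition that sets M_IVFPQ explicitly
CREATE TABLE ivfpq_demo (
    id BIGINT(20) NOT NULL COMMENT "",
    vector ARRAY<FLOAT> NOT NULL COMMENT "",
    INDEX ivfpq_demo_vector (vector) USING VECTOR (
        "index_type" = "ivfpq",
        "dim" = "5",
        "metric_type" = "l2_distance",
        "is_vector_normed" = "false",
        "nbits" = "16",
        "nlist" = "40",
        "M_IVFPQ" = "5"  -- must be a factor of dim
    )
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;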
Append vector index
You can also add vector indexes to an existing table using CREATE INDEX or ALTER TABLE ADD INDEX.
Example:
CREATE INDEX ivfpq_vector
ON ivfpq (vector)
USING VECTOR (
"index_type" = "ivfpq",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"dim"="5",
"nlist" = "256",
"nbits"="10"
);
ALTER TABLE ivfpq
ADD INDEX ivfpq_vector (vector)
USING VECTOR (
"index_type" = "ivfpq",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"dim"="5",
"nlist" = "256",
"nbits"="10"
);
Manage vector indexes
View vector index
You can view the definition of a vector index using the SHOW CREATE TABLE statement:
Example:
mysql> SHOW CREATE TABLE hnsw \G
*************************** 1. row ***************************
Table: hnsw
Create Table: CREATE TABLE hnsw (
id bigint(20) NOT NULL COMMENT "",
vector array<float> NOT NULL COMMENT "",
INDEX hnsw_vector (vector) USING VECTOR("dim" = "5", "efconstruction" = "40", "index_type" = "hnsw", "is_vector_normed" = "false", "M" = "16", "metric_type" = "l2_distance") COMMENT ''
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"compression" = "LZ4",
"fast_schema_evolution" = "true",
"replicated_storage" = "false",
"replication_num" = "3"
);
1 row in set (0.00 sec)
Drop vector index
You can drop vector indexes using DROP INDEX or ALTER TABLE DROP INDEX.
DROP INDEX ivfpq_vector ON ivfpq;
ALTER TABLE ivfpq DROP INDEX ivfpq_vector;
Perform ANNS with vector indexes
Before running a vector search, make sure the FE configuration item enable_experimental_vector is set to true.
Requirements for vector index-based queries
SELECT *, <vector_index_distance_func>(v1, [1,2,3]) as dis
FROM table_name
WHERE <vector_index_distance_func>(v1, [1,2,3]) <= 10
ORDER BY <vector_index_distance_func>(v1, [1,2,3])
LIMIT 10
To use a vector index in queries, all the following requirements must be met (an example query that satisfies them is shown after this list):
- ORDER BY requirements:
  - ORDER BY clause format: The ORDER BY clause must follow the format ORDER BY <vector_index_distance_func>(vector_column, constant_array), without additional ORDER BY columns.
    - Function name requirements for <vector_index_distance_func>:
      - If metric_type is l2_distance, the function name must be approx_l2_distance.
      - If metric_type is cosine_similarity, the function name must be approx_cosine_similarity.
    - Parameter requirements for <vector_index_distance_func>:
      - One of the parameters, constant_array, must be a constant ARRAY<FLOAT> whose dimension matches the vector index dim.
      - The other parameter, vector_column, must be the column corresponding to the vector index.
  - ORDER direction requirements:
    - If metric_type is l2_distance, the order must be ASC.
    - If metric_type is cosine_similarity, the order must be DESC.
  - A LIMIT N clause is required.
- Predicate requirements:
  - All predicates must be <vector_index_distance_func> expressions, combined using AND and comparison operators (> or <). The comparison operator direction must align with the ASC/DESC order. Specifically:
    - Requirement 1:
      - If metric_type is l2_distance: col_ref <= constant.
      - If metric_type is cosine_similarity: col_ref >= constant.
      - Here, col_ref refers to the result of <vector_index_distance_func>(vector_column, constant_array) and can be cast to FLOAT or DOUBLE types, for example:
        - approx_l2_distance(v1, [1,2,3])
        - CAST(approx_l2_distance(v1, [1,2,3]) AS FLOAT)
        - CAST(approx_l2_distance(v1, [1,2,3]) AS DOUBLE)
    - Requirement 2:
      - The predicate must use AND, with each child predicate meeting Requirement 1.
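For illustration, a query against a hypothetical table whose vector index uses metric_type = cosine_similarity flips both the comparison operator and the sort direction; the table name and the 0.8 threshold below are illustrative only.

-- Hypothetical table cosine_table with a vector index whose metric_type is cosine_similarity
SELECT id, approx_cosine_similarity(vector, [1,2,3,4,5]) AS sim
FROM cosine_table
WHERE approx_cosine_similarity(vector, [1,2,3,4,5]) >= 0.8
ORDER BY approx_cosine_similarity(vector, [1,2,3,4,5]) DESC
LIMIT 10;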
Preparation
Create tables with vector indexes and insert vector data into them:
CREATE TABLE test_hnsw (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX index_vector (vector) USING VECTOR (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"M" = "512",
"dim"="5")
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;
INSERT INTO test_hnsw VALUES
(1, [1,2,3,4,5]),
(2, [4,5,6,7,8]);
CREATE TABLE test_ivfpq (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX index_vector (vector) USING VECTOR (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"nlist" = "256",
"nbits"="10")
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;
INSERT INTO test_ivfpq VALUES
(1, [1,2,3,4,5]),
(2, [4,5,6,7,8]);
Perform vector search
Approximate search
Approximate searches will hit the vector index and thereby accelerate the search process.
The following example searches for the top 1 approximate nearest neighbor of the vector [1,1,1,1,1].
SELECT id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_hnsw
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
Scalar-vector joint search
You can combine scalar search with vector search.
SELECT id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_hnsw
WHERE id = 1
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
Range search
You can perform range search on vector data.
The following example pushes the score < 40 condition down to the index and filters vectors by score range.
SELECT * FROM (
SELECT id, approx_l2_distance([1,1,1,1,1], vector) score
FROM test_hnsw
) a
WHERE score < 40
ORDER BY score
LIMIT 1;
Precise calculation
Precise calculations ignore the vector index and directly calculate the distances between vectors, returning exact results.
SELECT id, l2_distance([1,1,1,1,1], vector)
FROM test_hnsw WHERE id = 1
ORDER BY l2_distance([1,1,1,1,1], vector)
LIMIT 1;
NOTE
Different distance metric functions define "similarity" differently. For l2_distance, smaller values indicate higher similarity; for cosine_similarity, larger values indicate higher similarity. Therefore, when calculating the topN, the sorting (ORDER BY) direction should match the similarity direction of the metric. Use ORDER BY ... ASC LIMIT x for l2_distance, and ORDER BY ... DESC LIMIT x for cosine_similarity.
Fine tune index parameters for search
Parameter tuning is essential in vector search because it affects both performance and accuracy. It is recommended to tune the search parameters on a small dataset and move to large datasets only after the expected recall and latency are achieved.
Search parameters are passed through hints in SQL statements.
For HNSW index
Before proceeding, make sure the vector column is built with the HNSW index.
SELECT
/*+ SET_VAR (ann_params='{efsearch=256}') */
id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_hnsw
WHERE id = 1
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
Parameter:
efsearch
- Default: 16
- Required: No
- Description: Parameter that controls the precision-speed tradeoff. During a hierarchical graph structure search, this parameter controls the size of the candidate list during the search. The larger the value of efsearch, the higher the accuracy, but the lower the speed.
For IVFPQ index
Before proceeding, make sure the vector column is built with the IVFPQ index.
SELECT
/*+ SET_VAR (ann_params='{
nprobe=256,
max_codes=0,
scan_table_threshold=0,
polysemous_ht=0,
range_search_confidence=0.1
}') */
id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_ivfpq
WHERE id = 1
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
Parameter:
nprobe
- Default: 1
- Required: No
- Description: Number of inverted lists checked during the search. The larger the value of nprobe, the higher the accuracy, but the lower the speed.
max_codes
- Default: 0
- Required: No
- Description: Maximum codes checked per inverted list. This parameter will also affect accuracy and speed.
scan_table_threshold
- Default: 0
- Required: No
- Description: Parameter that controls polysemous hashing. When the Hamming distance between an element's hash and the hash of the vector to be searched is below this threshold, the element is added to the candidate list.
polysemous_ht
- Default: 0
- Required: No
- Description: Parameter that controls polysemous hashing. When the Hamming distance between an element's hash and the hash of the vector to be searched is below this threshold, the element is directly added to the results.
range_search_confidence
- Default: 0.1
- Required: No
- Description: Confidence of approximate range searches. Value range: [0, 1]. Setting it to 1 produces the most accurate results.
Calculate approximate recall
You can calculate approximate recall by intersecting the topK elements from brute-force retrieval with those from approximate retrieval: Recall = TP / (TP + FN).
-- Approximate retrieval
SELECT id
FROM test_hnsw
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 5;
8
9
7
5
1
-- Brute-force retrieval
SELECT id
FROM test_hnsw
ORDER BY l2_distance([1,1,1,1,1], vector)
LIMIT 5;
8
9
5
7
10
In the preceding example, approximate retrieval returns 8, 9, 7, 5, and 1, while brute-force retrieval returns the correct result: 8, 9, 5, 7, and 10. Of the approximate results, 8, 9, 7, and 5 are true positives, so the recall is 4/5 = 80%.
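As a rough sketch, the same recall can also be computed directly in SQL by joining the two top-K result sets. The following query assumes the test_hnsw table above and K = 5.

-- Fraction of the approximate top-5 results that also appear in the exact top-5
SELECT COUNT(*) / 5.0 AS recall
FROM (
    SELECT id FROM test_hnsw
    ORDER BY approx_l2_distance([1,1,1,1,1], vector)
    LIMIT 5
) AS approx_topk
JOIN (
    SELECT id FROM test_hnsw
    ORDER BY l2_distance([1,1,1,1,1], vector)
    LIMIT 5
) AS exact_topk ON approx_topk.id = exact_topk.id;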
Check if vector indexes take effect
Execute EXPLAIN against the query statement. If the OlapScanNode properties show VECTORINDEX: ON, the vector index has been applied to the approximate vector search.
Example:
> EXPLAIN SELECT id FROM t_test_vector_table ORDER BY approx_l2_distance([1,1,1,1,1], vector) LIMIT 5;
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:1: id |
| PARTITION: UNPARTITIONED |
| |
| RESULT SINK |
| |
| 4:Project |
| | <slot 1> : 1: id |
| | limit: 5 |
| | |
| 3:MERGING-EXCHANGE |
| limit: 5 |
| |
| PLAN FRAGMENT 1 |
| OUTPUT EXPRS: |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 03 |
| UNPARTITIONED |
| |
| 2:TOP-N |
| | order by: <slot 3> 3: approx_l2_distance ASC |
| | offset: 0 |
| | limit: 5 |
| | |
| 1:Project |
| | <slot 1> : 1: id |
| | <slot 3> : 4: __vector_approx_l2_distance |
| | |
| 0:OlapScanNode |
| TABLE: t_test_vector_table |
| VECTORINDEX: ON |
| IVFPQ: OFF, Distance Column: <4:__vector_approx_l2_distance>, LimitK: 5, Order: ASC, Query Vector: [1, 1, 1, 1, 1], Predicate Range: -1.0 |
| PREAGGREGATION: ON |
| partitions=1/1 |
| rollup: t_test_vector_table |
| tabletRatio=1/1 |
| tabletList=11302 |
| cardinality=2 |
| avgRowSize=4.0 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
Limitations
- Each table supports only one vector index.
- Searches on multiple vector indexes in a single query statement are unsupported.
- Vector searches cannot be used in subqueries or joins.