ds_hll_estimate
Estimates the approximate distinct count from a serialized HyperLogLog sketch. This function is part of the DataSketches HLL family of functions for approximate distinct counting.
ds_hll_estimate
takes a VARBINARY serialized sketch created by ds_hll_accumulate
or ds_hll_combine
and returns the estimated number of distinct values.
It is based on Apache DataSketches and provides high precision for approximate distinct counting. For more information, see HyperLogLog Sketches.
Syntaxβ
bigint ds_hll_estimate(sketch)
Parametersβ
sketch
: A VARBINARY column containing serialized HyperLogLog sketches created byds_hll_accumulate
ords_hll_combine
.
Return Typeβ
Returns a BIGINT representing the estimated distinct count.
Examplesβ
-- Create test table
CREATE TABLE t1 (
id BIGINT,
province VARCHAR(64),
age SMALLINT,
dt VARCHAR(10)
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 3;
-- Insert test data
INSERT INTO t1 SELECT generate_series, generate_series, generate_series % 100, "2024-07-24"
FROM table(generate_series(1, 1000));
-- Create table to store sketches
CREATE TABLE t2 (
`id` bigint,
`dt` varchar(10),
`ds_id` binary,
`ds_province` binary,
`ds_age` binary,
`ds_dt` binary
) ENGINE=OLAP
DISTRIBUTED BY HASH(id) BUCKETS 3;
-- Store sketches created by ds_hll_accumulate
INSERT INTO t2 SELECT id, dt,
ds_hll_accumulate(id),
ds_hll_accumulate(province, 10),
ds_hll_accumulate(age, 20, "HLL_6"),
ds_hll_accumulate(dt, 10, "HLL_8")
FROM t1;
-- Estimate distinct counts grouped by date
SELECT dt,
ds_hll_estimate(ds_id),
ds_hll_estimate(ds_province),
ds_hll_estimate(ds_age),
ds_hll_estimate(ds_dt)
FROM t2
GROUP BY dt
ORDER BY 1
LIMIT 3;
Related Functionsβ
ds_hll_accumulate
: Creates serialized HyperLogLog sketches from datads_hll_combine
: Combines multiple serialized sketches into a single sketchds_hll_count_distinct
: Direct approximate distinct counting function
Keywordsβ
ds_HLL_ESTIMATE, HLL, HYPERLOGLOG, APPROXIMATE, DISTINCT, COUNT, ESTIMATE