SELECT
Description
Queries data from one or more tables, views, or materialized views. The SELECT statement generally consists of the following clauses:
- WITH
- WHERE and operators
- GROUP BY
- HAVING
- UNION
- INTERSECT
- EXCEPT/MINUS
- ORDER BY
- LIMIT
- OFFSET
- Joins
- Subqueries
- DISTINCT
- Alias
SELECT can work as an independent statement or a clause nested in other statements. The output of the SELECT clause can be used as the input of other statements.
StarRocks' query statement basically conforms to the SQL92 standard. Here is a brief description of the supported SELECT usage.
NOTE
To query data from tables, views, or materialized views in a StarRocks internal table, you must have the SELECT privilege on these objects. To query data from tables, views, or materialized views in an external data source, you must have the USAGE privilege on the corresponding external catalog.
WITH
A clause that can be added before a SELECT statement to define an alias for a complex expression that is referenced multiple times inside SELECT.
Similar to CRATE VIEW, but the table and column names defined in the clause do not persist after the query ends and do not conflict with names in the actual table or VIEW.
The benefits of using a WITH clause are:
Convenient and easy to maintain, reducing duplication within queries.
It is easier to read and understand SQL code by abstracting the most complex parts of a query into separate blocks.
Examples:
-- Define one subquery at the outer level, and another at the inner level as part of the
-- initial stage of the UNION ALL query.
with t1 as (select 1),t2 as (select 2)
select * from t1 union all select * from t2;
Join
Join operations combine data from two or more tables and then return a result set of some columns from some of them.
StarRocks supports self joins, cross joins, inner joins, outer joins, semi joins, and anti joins. Outer joins include left joins, right joins, and full joins.
Syntax:
SELECT select_list FROM
table_or_subquery1 [INNER] JOIN table_or_subquery2 |
table_or_subquery1 {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN table_or_subquery2 |
table_or_subquery1 {LEFT | RIGHT} SEMI JOIN table_or_subquery2 |
table_or_subquery1 {LEFT | RIGHT} ANTI JOIN table_or_subquery2 |
[ ON col1 = col2 [AND col3 = col4 ...] |
USING (col1 [, col2 ...]) ]
[other_join_clause ...]
[ WHERE where_clauses ]
SELECT select_list FROM
table_or_subquery1, table_or_subquery2 [, table_or_subquery3 ...]
[other_join_clause ...]
WHERE
col1 = col2 [AND col3 = col4 ...]
SELECT select_list FROM
table_or_subquery1 CROSS JOIN table_or_subquery2
[other_join_clause ...]
[ WHERE where_clauses ]
Self Join
StarRocks supports self-joins, which are self-joins and self-joins. For example, different columns of the same table are joined.
There is actually no special syntax to identify self-join. The conditions on both sides of a join in a self-join come from the same table.
We need to assign them different aliases.
Examples:
SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;
Cross Join
Cross join can produce a lot of results, so cross join should be used with caution.
Even if you need to use cross join, you need to use filter conditions and ensure that fewer results are returned. Example:
SELECT * FROM t1, t2;
SELECT * FROM t1 CROSS JOIN t2;
Inner Join
Inner join is the most well-known and commonly used join. Returns results from columns requested by two similar tables, joined if the columns of both tables contain the same value.
If a column name of both tables is the same, we need to use the full name (in the form of table_name.column_name) or alias the column name.
Examples:
The following three queries are equivalent.
SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;
Outer Join
Outer join returns the left or right table or all rows of both. If there is no matching data in another table, set it to NULL. Example:
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;
Equivalent and unequal join
Usually, users use the most equal join, which requires the operator of the join condition to be an equal sign.
Unequal join can be used on join conditions!=, Equal sign. Unequal joins produce a large number of results and may exceed the memory limit during calculation.
Use with caution. Unequal join only supports inner join. Example:
SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;
SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id > t2.id;
Semi Join
Left semi join returns only the rows in the left table that match the data in the right table, regardless of how many rows match the data in the right table.
This row of the left table is returned at most once. Right semi join works similarly, except that the data returned is a right table.
Examples:
SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;
Anti Join
Left anti join only returns rows from the left table that do not match the right table.
Right anti join reverses this comparison, returning only rows from the right table that do not match the left table. Example:
SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT ANTI JOIN t2 ON t1.id = t2.id;
Equi-join and Non-equi-join
The various joins supported by StarRocks can be classified as equi-joins and non-equi-joins depending on the join conditions specified in the joins.
Equi-joins | Self joins, cross joins, inner joins, outer joins, semi joins, and anti joins |
---|---|
Non-equi-joins | cross joins, inner joins, left semi joins, left anti joins, and outer joins |
-
Equi-joins
An equi-join uses a join condition in which two join items are combined by the
=
operator. Example:a JOIN b ON a.id = b.id
. -
Non-equi-joins
A non-equi-join uses a join condition in which two join items are combined by a comparison operator such as
<
,<=
,>
,>=
, or<>
. Example:a JOIN b ON a.id < b.id
. Non-equi-joins run slower than equi-joins. We recommend that you exercise caution when you use non-equi-joins.The following two examples show how to run non-equi-joins:
SELECT t1.id, c1, c2
FROM t1
INNER JOIN t2 ON t1.id < t2.id;
SELECT t1.id, c1, c2
FROM t1
LEFT JOIN t2 ON t1.id > t2.id;
ORDER BY
The ORDER BY clause of a SELECT statement sorts the result set by comparing the values from one or more columns.
ORDER BY is a time- and resource-consuming operation because all the results must be sent to one node for merging before the results can be sorted. Sorting consumes more memory resources than a query without ORDER BY.
Therefore, if you only need the first N
results from the sorted result set, you can use the LIMIT clause, which reduces memory usage and network overhead. If the LIMIT clause is not specified, the first 65535 results are returned by default.
Syntax:
ORDER BY <column_name>
[ASC | DESC]
[NULLS FIRST | NULLS LAST]
ASC
specifies that the results should be returned in ascending order. DESC
specifies that the results should be returned in descending order. If the order is not specified, ASC (ascending) is the default. Example:
select * from big_table order by tiny_column, short_column desc;
Sort order for NULL values: NULLS FIRST
indicates that NULL values should be returned before non-NULL values. NULLS LAST
indicates that NULL values should be returned after non-NULL values.
Examples:
select * from sales_record order by employee_id nulls first;
GROUP BY
The GROUP BY clause is often used with aggregate functions such as COUNT(), SUM(), AVG(), MIN(), and MAX().
The column specified by GROUP BY will not participate in the aggregation operation. The GROUP BY clause can be added with the Having clause to filter the results produced by the aggregate function.
Examples:
select tiny_column, sum(short_column)
from small_table
group by tiny_column;
+-------------+---------------------+
| tiny_column | sum('short_column')|
+-------------+---------------------+
| 1 | 2 |
| 2 | 1 |
+-------------+---------------------+
Syntax
SELECT ...
FROM ...
[ ... ]
GROUP BY [
, ... |
GROUPING SETS [, ...] ( groupSet [ , groupSet [ , ... ] ] ) |
ROLLUP(expr [ , expr [ , ... ] ]) |
CUBE(expr [ , expr [ , ... ] ])
]
[ ... ]
Parameters
groupSet
represents a set composed of columns, aliases or expressions in the select list. groupSet ::= { ( expr [ , expr [ , ... ] ] )}
expr
indicates the column, alias or expression in the select list.
Note
StarRocks supports syntax like PostgreSQL. The syntax examples are as follows:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a), (b), ( ) );
SELECT a, b,c, SUM( d ) FROM tab1 GROUP BY ROLLUP(a,b,c)
SELECT a, b,c, SUM( d ) FROM tab1 GROUP BY CUBE(a,b,c)
ROLLUP(a,b,c)
is equivalent to the followingGROUPING SETS
statement:
GROUPING SETS (
(a,b,c),
( a, b ),
( a),
( )
)
CUBE ( a, b, c )
is equivalent to the followingGROUPING SETS
statement:
GROUPING SETS (
( a, b, c ),
( a, b ),
( a, c ),
( a ),
( b, c ),
( b ),
( c ),
( )
)
Examples
The following is an example of actual data:
SELECT * FROM t;
+------+------+------+
| k1 | k2 | k3 |
+------+------+------+
| a | A | 1 |
| a | A | 2 |
| a | B | 1 |
| a | B | 3 |
| b | A | 1 |
| b | A | 4 |
| b | B | 1 |
| b | B | 5 |
+------+------+------+
8 rows in set (0.01 sec)
SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ( (k1, k2), (k2), (k1), ( ) );
+------+------+-----------+
| k1 | k2 | sum(`k3`) |
+------+------+-----------+
| b | B | 6 |
| a | B | 4 |
| a | A | 3 |
| b | A | 5 |
| NULL | B | 10 |
| NULL | A | 8 |
| a | NULL | 7 |
| b | NULL | 11 |
| NULL | NULL | 18 |
+------+------+-----------+
9 rows in set (0.06 sec)
> SELECT k1, k2, GROUPING_ID(k1,k2), SUM(k3) FROM t GROUP BY GROUPING SETS ((k1, k2), (k1), (k2), ());
+------+------+---------------+----------------+
| k1 | k2 | grouping_id(k1,k2) | sum(`k3`) |
+------+------+---------------+----------------+
| a | A | 0 | 3 |
| a | B | 0 | 4 |
| a | NULL | 1 | 7 |
| b | A | 0 | 5 |
| b | B | 0 | 6 |
| b | NULL | 1 | 11 |
| NULL | A | 2 | 8 |
| NULL | B | 2 | 10 |
| NULL | NULL | 3 | 18 |
+------+------+---------------+----------------+
9 rows in set (0.02 sec)
GROUP BY GROUPING SETS
| CUBE
| ROLLUP
is an extension of the GROUP BY clause. It can realize the aggregation of groups of multiple sets in a GROUP BY clause. The result is equivalent to the UNION operation of multiple corresponding GROUP BY clauses.
GROUP BY clause is a special case of GROUP BY GROUPING SETS containing only one element. For example, the GROUPING SETS statement:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a), (b), ( ) );
The query result is equivalent to:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
UNION
SELECT null, null, SUM( c ) FROM tab1
GROUPING(expr)
indicates whether a column is an aggregate column. If it is an aggregate column, it is 0, otherwise it is 1.
GROUPING_ID(expr [ , expr [ , ... ] ])
is similar to GROUPING. GROUPING_ ID calculates the bitmap value of a column list according to the specified column order, and each bit is the value of GROUPING.
GROUPING_ID() function returns the decimal value of the bit vector.
HAVING
The HAVING clause does not filter row data in a table, but filters the results of aggregate functions.
Generally speaking, HAVING is used with aggregate functions (such as COUNT(), SUM(), AVG(), MIN(), MAX()) and GROUP BY clauses.
Examples:
select tiny_column, sum(short_column)
from small_table
group by tiny_column
having sum(short_column) = 1;
+-------------+---------------------+
|tiny_column | sum('short_column') |
+-------------+---------------------+
| 2 | 1 |
+-------------+---------------------+
1 row in set (0.07 sec)
select tiny_column, sum(short_column)
from small_table
group by tiny_column
having tiny_column > 1;
+-------------+---------------------+
|tiny_column | sum('short_column') |
+-------------+---------------------+
| 2 | 1 |
+-------------+---------------------+
1 row in set (0.07 sec)
LIMIT
LIMIT clauses are used to limit the maximum number of rows returned. Setting the maximum number of rows returned can help StarRocks optimize memory usage.
This clause is mainly used in the following scenarios:
Returns the result of the top-N query.
Think about what's included in the table below.
The size of the query result set needs to be limited because of the large amount of data in the table or because the where clause does not filter too much data.
Instructions for use: The value of the LIMIT clause must be a numeric literal constant.
Examples:
mysql> select tiny_column from small_table limit 1;
+-------------+
|tiny_column |
+-------------+
| 1 |
+-------------+
1 row in set (0.02 sec)
mysql> select tiny_column from small_table limit 10000;
+-------------+
|tiny_column |
+-------------+
| 1 |
| 2 |
+-------------+
2 rows in set (0.01 sec)
OFFSET
The OFFSET clause causes the result set to skip the first few rows and return the following results directly.
The result set defaults to start at line 0, so offset 0 and no offset return the same results.
Generally speaking, OFFSET clauses need to be used with ORDER BY and LIMIT clauses to be valid.
Examples:
mysql> select varchar_column from big_table order by varchar_column limit 3;
+----------------+
| varchar_column |
+----------------+
| beijing |
| chongqing |
| tianjin |
+----------------+
3 rows in set (0.02 sec)
mysql> select varchar_column from big_table order by varchar_column limit 1 offset 0;
+----------------+
|varchar_column |
+----------------+
| beijing |
+----------------+
1 row in set (0.01 sec)
mysql> select varchar_column from big_table order by varchar_column limit 1 offset 1;
+----------------+
|varchar_column |
+----------------+
| chongqing |
+----------------+
1 row in set (0.01 sec)
mysql> select varchar_column from big_table order by varchar_column limit 1 offset 2;
+----------------+
|varchar_column |
+----------------+
| tianjin |
+----------------+
1 row in set (0.02 sec)
Note: It is allowed to use offset syntax without order by, but offset does not make sense at this time.
In this case, only the limit value is taken, and the offset value is ignored. So without order by.
Offset exceeds the maximum number of rows in the result set and is still a result. It is recommended that users use offset with order by.
UNION
Combines the result of multiple queries.
Syntax:
query_1 UNION [DISTINCT | ALL] query_2
- DISTINCT (default) returns only unique rows. UNION is equivalent to UNION DISTINCT.
- ALL combines all rows, including duplicates. Since de-duplication is memory intensive, queries using UNION ALL are faster and less memory-consuming. For better performance, use UNION ALL.
NOTE
Each query statement must return the same number of columns and the columns must have compatible data types.
Examples:
Create tables select1
and select2
.
CREATE TABLE select1(
id INT,
price INT
)
DISTRIBUTED BY HASH(id);
INSERT INTO select1 VALUES
(1,2),
(1,2),
(2,3),
(5,6),
(5,6);
CREATE TABLE select2(
id INT,
price INT
)
DISTRIBUTED BY HASH(id);
INSERT INTO select2 VALUES
(2,3),
(3,4),
(5,6),
(7,8);
Example 1: Return all IDs in the two tables, including duplicates.
mysql> (select id from select1) union all (select id from select2) order by id;
+------+
| id |
+------+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 5 |
| 5 |
| 5 |
| 7 |
+------+
11 rows in set (0.02 sec)
Example 2: Return all the unique IDs in the two tables. The following two statements are equivalent.
mysql> (select id from select1) union (select id from select2) order by id;
+------+
| id |
+------+
| 1 |
| 2 |
| 3 |
| 5 |
| 7 |
+------+
6 rows in set (0.01 sec)
mysql> (select id from select1) union distinct (select id from select2) order by id;
+------+
| id |
+------+
| 1 |
| 2 |
| 3 |
| 5 |
| 7 |
+------+
5 rows in set (0.02 sec)
Example 3: Return the first three IDs among all the unique IDs in the two tables. The following two statements are equivalent.
mysql> (select id from select1) union distinct (select id from select2)
order by id
limit 3;
++------+
| id |
+------+
| 1 |
| 2 |
| 3 |
+------+
4 rows in set (0.11 sec)
mysql> select * from (select id from select1 union distinct select id from select2) as t1
order by id
limit 3;
+------+
| id |
+------+
| 1 |
| 2 |
| 3 |
+------+
3 rows in set (0.01 sec)