Skip to main content
Version: Latest-4.1

Arrow Flight SQL Best Practices

Arrow Flight SQL is the fastest way to pull large result sets out of StarRocks. Against the MySQL protocol, on the same hardware and against the same cluster, Arrow Flight is consistently faster: 3×–9Γ— faster at the raw protocol fetch, and 19×–97Γ— faster end-to-end to a pandas DataFrame. The exact factor depends on row count, column shape, and which MySQL client you compare against. But the speedup is not automatic: how the client code reads the result has a large effect on the end-to-end time, and a few simple mistakes can give back most of it.

This page shows the overall numbers you can expect, summarises the aspects that affect them, and then describes each aspect with the code change and the measured impact.

Overall Performance​

Two comparisons follow. The first measures only the protocol fetch β€” how long it takes for the bytes to arrive and be parsed, with no language-level object conversion. The second measures a real-world Python application where data is read into a pandas DataFrame. See Test Environment for the hardware.

Protocol-level fetch (Arrow Flight ADBC vs mysql --quick)​

fetch_arrow_table() drains the network into Arrow buffers without converting cells into Python objects. mysql --quick drains the MySQL wire protocol with a streaming C client that parses rows. Both are protocol-only β€” neither pays for language-native object materialization.

WorkloadRowsMySQL protocol
(mysql --quick)
Arrow Flight
(fetch_arrow_table)
Speedup
Single numeric column (SELECT id)1 M831 ms215 ms3.9Γ—
Single numeric column (SELECT id)5 M2,216 ms456 ms4.9Γ—
Single numeric column (SELECT id)10 M4,166 ms1,163 ms3.6Γ—
Single numeric column (SELECT id)100 M35,629 ms6,737 ms5.3Γ—
20 numeric columns (SELECT *)1 M1,994 ms370 ms5.4Γ—
20 numeric columns (SELECT *)5 M9,665 ms1,251 ms7.7Γ—
20 numeric columns (SELECT *)10 M18,461 ms2,577 ms7.2Γ—
20 numeric columns (SELECT *)100 M178,416 ms19,047 ms9.4Γ—
20 VARCHAR columns (SELECT *)1 M4,549 ms1,294 ms3.5Γ—
20 VARCHAR columns (SELECT *)5 M19,077 ms5,959 ms3.2Γ—
20 VARCHAR columns (SELECT *)10 M36,079 ms11,499 ms3.1Γ—
20 VARCHAR columns (SELECT *)100 M370,858 ms164,508 ms (chunked)2.3Γ—

Real-world Python application β€” pd.read_sql with ADBC vs PyMySQL​

The canonical Python pipeline is pd.read_sql(sql, conn) β†’ pandas.DataFrame. The connection object you hand it is the entire migration: pass a PyMySQL Connection and pandas calls cursor.fetchall() + pd.DataFrame(rows), walking every row to build the DataFrame. Pass an ADBC Flight SQL connection and pandas uses ADBC's native Arrow fetch + near-zero-copy DataFrame conversion.

WorkloadRowspd.read_sql(sql,
adbc_conn)
pd.read_sql(sql,
pymysql_conn)
Speedup
Single numeric column (SELECT id)1 M320 ms6,185 ms19.3Γ—
Single numeric column (SELECT id)5 M421 ms30,751 ms73.0Γ—
Single numeric column (SELECT id)10 M970 ms61,524 ms63.4Γ—
Single numeric column (SELECT id)100 M6,024 ms585,556 ms97.2Γ—
20 numeric columns (SELECT *)1 M522 ms27,521 ms52.7Γ—
20 numeric columns (SELECT *)5 M1,530 ms141,500 ms92.5Γ—
20 numeric columns (SELECT *)10 M2,689 ms255,408 ms95.0Γ—
20 numeric columns (SELECT *)100 M24,568 msOOMβ€”
20 VARCHAR columns (SELECT *)1 M1,560 ms31,407 ms20.1Γ—
20 VARCHAR columns (SELECT *)5 M6,937 ms154,560 ms22.3Γ—
20 VARCHAR columns (SELECT *)10 M13,260 ms304,647 ms23.0Γ—

Each cell is fetch + convert = total; the speedup is total vs total. Narrow numeric queries hit the largest ratio because the PyMySQL side allocates a Python int per cell during fetch and pandas then walks the tuple list during conversion β€” the ADBC side skips both costs. Arrow's columnar memory format wins twice: it skips per-cell Python object allocation during fetch, and makes the DataFrame conversion almost free afterwards.

If your existing code already uses pd.read_sql, the migration is one line:

import adbc_driver_manager
import adbc_driver_flightsql.dbapi as fl
import pandas as pd

with fl.connect(
uri="grpcs://host:443",
db_kwargs={
adbc_driver_manager.DatabaseOptions.USERNAME.value: "admin",
adbc_driver_manager.DatabaseOptions.PASSWORD.value: "...",
}) as conn:
df = pd.read_sql("SELECT * FROM my_table LIMIT 5000000", conn)

Test Environment​

ComponentDetails
Client hostAWS EC2 t3.2xlarge, same VPC subnet as the cluster
Cluster3 FE + 2 BE on m6g.xlarge; Arrow Flight on grpcs://…:443, MySQL on :9030
Java stackOpenJDK 17, jdbc:arrow-flight-sql, arrow-jdbc, parquet-hadoop
Python stackpython 3.12, pyarrow 24.0, adbc-driver-flightsql 1.11, PyMySQL 1.2
WorkloadTwo 20-column tables β€” one VARCHAR-heavy, one all-integer β€” plus single-column projections; row counts of 1 M, 5 M, 10 M, and 100 M via SELECT … LIMIT N
MySQL drain modecursor.fetchall() buffered for all measurements

Choosing a Client​

Before any code-level tuning, the biggest single decision is which client API you read results through. Interact with StarRocks via Arrow Flight SQL covers the full setup for Python ADBC, the Arrow Flight JDBC driver, the Java ADBC driver, and the native FlightClient. For performance, those collapse into two paths:

  • Raw Arrow batches via FlightSqlClient or ADBC (recommended). This is the columnar end-to-end path the Flight SQL protocol is designed for: your code receives VectorSchemaRoot batches and reads them with primitive-returning vector accessors, with no per-row object allocation. End-to-end (drain + typed conversion), this path is about 10Γ— faster than Java MySQL JDBC on 10 M numeric rows, and up to 97Γ— faster than PyMySQL on 100 M narrow numeric queries delivered as a pandas DataFrame. Use it whenever your downstream code can consume columnar data (Pandas, Arrow, ML pipelines, Parquet writers, custom analytics).
  • Arrow Flight JDBC driver (jdbc:arrow-flight-sql). Use this when you need a drop-in ResultSet for an existing JDBC code path, or for BI tools like Tableau, Power BI, and DBeaver where the JDBC interface is required. JDBC's API forces the driver to return a boxed Object for every cell, so this path cannot reach the performance of raw Arrow batches. The JDBC driver is still substantially faster than MySQL JDBC; it is the right tool when JDBC compatibility is the requirement.

The per-aspect tables further down switch baselines: they compare the Java Arrow Flight JDBC driver against Java MySQL JDBC, not against PyMySQL. The Java MySQL JDBC connector is much faster at row materialization than PyMySQL β€” for example the same 5 M VARCHAR SELECT * takes ~22 s through Java MySQL JDBC versus ~105 s through PyMySQL β€” so the Java ratios you'll see are smaller than the Python numbers in Overall Performance. Java MySQL JDBC is the right baseline when you are choosing between Java drivers.

The four aspects below apply within whichever client you choose: Aspect 1 is for JDBC consumers, Aspects 2–3 are for raw-batch consumers, and Aspect 4 covers Parquet output from either.

What Affects Performance​

The speedups above assume the client code is written for Arrow. The following four aspects each move the needle by 2Γ— or more on the right workload. Getting them right is the difference between the "tuned" column in the table above and a fetch that looks no faster than MySQL.

  1. JDBC accessor method. Use rs.getObject(i) with a typed cast for numeric columns. rs.getString(i) forces the driver to format every value as a string.
  2. Vector resolution scope. When consuming raw VectorSchemaRoot batches, resolve each FieldVector once per batch outside the row loop, not once per row.
  3. Typed .get(i) for numerics. On numeric vectors, the typed .get(i) returns a primitive with no allocation. The generic accessors box every value.
  4. Parquet writer choice. PyArrow writes Parquet directly from the Arrow stream with no row-by-row code. Java has no pre-built library for this β€” every Java path requires a hand-written WriteSupport<VectorSchemaRoot> on top of parquet-hadoop.

Aspect 1 β€” JDBC: Use Typed Column Access​

When using the Arrow Flight JDBC driver, use rs.getObject(i) and cast to the expected Java type. This lets the driver return the native Java type directly without an extra conversion step, which matters most for numeric columns.

while (rs.next()) {
Integer id = (Integer) rs.getObject(1);
String name = (String) rs.getObject(2);
Long ts = (Long) rs.getObject(3);
}

Benchmark: JDBC Accessor Methods (includes network)​

WorkloadMySQL JDBCArrow Flight JDBC, typed getObject()Speedup
VARCHAR β€” 5 M22,651 ms12,660 ms1.79Γ—
VARCHAR β€” 10 M49,216 ms27,646 ms1.78Γ—
Numeric β€” 5 M16,043 ms3,123 ms5.14Γ—
Numeric β€” 10 M38,134 ms9,123 ms4.18Γ—

Aspect 2 β€” Raw Arrow Batches: Pre-Resolve Vectors and Use Typed Access​

When consuming raw Arrow batches via the native FlightSqlClient (i.e., iterating over VectorSchemaRoot objects), follow two rules.

Resolve vectors once per batch, not once per row. Call root.getVector("column_name") before the row loop so the lookup cost is paid once per batch rather than once per row.

Use typed .get(i) for numeric vectors. This returns a primitive value directly β€” no heap allocation, no GC pressure.

IntVector      idVec    = (IntVector)      root.getVector("id");
SmallIntVector yearVec = (SmallIntVector) root.getVector("birth_year");
TinyIntVector monthVec = (TinyIntVector) root.getVector("birth_month");

for (int i = 0; i < rowCount; i++) {
record.id = idVec.get(i); // int β€” no allocation
record.birthYear = yearVec.get(i); // short β€” no allocation
record.birthMonth = monthVec.get(i); // byte β€” no allocation
}

Benchmark: Arrow Batch Conversion Cost (pre-fetched)​

The Arrow Flight numbers below isolate the conversion cost: batches are drained from the cluster into memory first, then timed.

WorkloadMySQL JDBCTyped .get*(), vectors resolved once per batchSpeedup
VARCHAR β€” 5 M22,651 ms11,921 ms1.90Γ—
VARCHAR β€” 10 M49,216 ms24,686 ms1.99Γ—
Numeric β€” 5 M16,043 ms1,141 ms14.1Γ—
Numeric β€” 10 M38,134 ms2,092 ms18.2Γ—

Aspect 3 β€” Writing Results to Parquet​

Apache Arrow does not include a pre-built Parquet writer for VectorSchemaRoot. If your goal is simply to export query results to Parquet files, INSERT INTO FILES lets StarRocks write the files server-side without any client-side conversion code. The options below apply when you need client-side Parquet output via Arrow Flight.

PyArrow handles the Arrow β†’ Parquet conversion with no custom write logic. It preserves column types natively (INT32, INT64, TIMESTAMP_MICROS, etc.).

Streaming batch-by-batch from Arrow Flight:

import pyarrow.flight as fl
import pyarrow.parquet as pq

client = fl.connect("grpc+tls://host:443")
info = client.get_flight_info(
fl.FlightDescriptor.for_command(b"SELECT ..."))

reader = client.do_get(info.endpoints[0].ticket)
with pq.ParquetWriter("output.parquet", reader.schema_arrow, compression="snappy") as writer:
for batch in reader:
writer.write_batch(batch)

If the full result fits in memory:

table = reader.read_all()
pq.write_table(table, "output.parquet", compression="snappy")

Via ADBC (the recommended Python Flight SQL client):

import adbc_driver_flightsql.dbapi as fl_sql
import pyarrow.parquet as pq

with fl_sql.connect("grpcs://host:443", db_kwargs={"username": "admin", "password": "..."}) as conn:
with conn.cursor() as cur:
cur.execute("SELECT * FROM my_table LIMIT 5000000")
pq.write_table(cur.fetch_arrow_table(), "output.parquet", compression="snappy")

Option 2: Java WriteSupport​

For Java, build a custom WriteSupport<VectorSchemaRoot> on top of org.apache.parquet:parquet-hadoop. Build the schema and writer once per job, then use typed vector reads inside WriteSupport.write().

Build schema and writer once:

MessageType parquetSchema = new SchemaConverter().fromArrow(arrowSchema).getParquetSchema();
ParquetWriter<VectorSchemaRoot> writer = /* build once per job */;

// Per batch:
writer.write(batch);

Use typed reads inside WriteSupport.write():

class ArrowWriteSupport extends WriteSupport<VectorSchemaRoot> {
private RecordConsumer recordConsumer;

@Override
public void prepareForWrite(RecordConsumer consumer) {
this.recordConsumer = consumer;
}

@Override
public void write(VectorSchemaRoot root) {
int rowCount = root.getRowCount();
for (FieldVector vec : root.getFieldVectors()) {
if (vec instanceof IntVector) {
IntVector iv = (IntVector) vec;
for (int row = 0; row < rowCount; row++) {
recordConsumer.addInteger(iv.get(row));
}
} // else if (vec instanceof SmallIntVector) ... BigIntVector ... VarCharVector ...
}
}
}

Parquet Benchmark​

Numbers include both Parquet encoding and file I/O cost (see Test Environment). VARCHAR and numeric tables are benchmarked separately because they stress different parts of the Arrow encoding path: VARCHAR columns require offset-buffer arithmetic on variable-length data, while numeric columns use fixed-width typed vectors where the gains from typed access are much larger.

Java (5 M and 10 M rows)​

Both rows use the same parquet-hadoop write path (MySqlParquetConverter + arrow-jdbc adapter, batch size 65,536) so the only variable is the inbound JDBC driver.

ApproachRowsVARCHAR UNCOMPvs MySQLVARCHAR Snappyvs MySQLNumeric UNCOMPvs MySQLNumeric Snappyvs MySQL
MySQL JDBC β†’ Parquet5 M55,477 ms1.0Γ—54,888 ms1.0Γ—24,006 ms1.0Γ—25,289 ms1.0Γ—
Arrow Flight JDBC β†’ Parquet5 M46,341 ms1.20Γ—47,881 ms1.15Γ—13,978 ms1.72Γ—14,297 ms1.77Γ—
MySQL JDBC β†’ Parquet10 M110,229 ms1.0Γ—116,999 ms1.0Γ—50,509 ms1.0Γ—49,126 ms1.0Γ—
Arrow Flight JDBC β†’ Parquet10 M91,386 ms1.21Γ—102,534 ms1.14Γ—29,739 ms1.70Γ—30,102 ms1.63Γ—

Python (PyArrow 24.0.0 / ADBC 1.11.0)​

The MySQL baseline is the same Java MySQL JDBC β†’ Parquet number from the table above; "MySQL β†’ PyArrow" is not a real path because there is no MySQL β†’ Arrow adapter outside of arrow-jdbc. Python numbers were collected at 5 M only.

ApproachVARCHAR UNCOMPvs MySQLVARCHAR Snappyvs MySQLNumeric UNCOMPvs MySQLNumeric Snappyvs MySQL
MySQL JDBC β†’ Parquet (Java baseline, 5 M)55,477 ms1.0Γ—54,888 ms1.0Γ—24,006 ms1.0Γ—25,289 ms1.0Γ—
Arrow Flight + PyArrow (5 M)10,675 ms5.20Γ—14,128 ms3.89Γ—3,953 ms6.07Γ—3,848 ms6.57Γ—

PyArrow adds almost no overhead on top of the raw network fetch and requires far less code than the Java path. Use PyArrow unless Java is a hard requirement.

Summary​

Use caseRecommendation
Arrow Flight JDBCUse getObject() with typed cast
Raw VectorSchemaRoot batchesResolve vectors once per batch; use typed .get(i) for numeric columns
Arrow β†’ Parquet in Pythonpyarrow.parquet via ADBC β€” single function call, no custom code
Arrow β†’ Parquet in JavaHand-written WriteSupport<VectorSchemaRoot> with typed vector reads
Rocky the happy otterStarRocks Assistant

AI generated answers are based on docs and other sources. Please test answers in non-production environments.