Version: Stable-3.1

Broker Load

1. Does Broker Load support re-running load jobs that have been run successfully and are in the FINISHED state?

Broker Load does not support re-running load jobs that have been run successfully and are in the FINISHED state. Also, to prevent data loss and duplication, Broker Load does not allow reusing the labels of successfully run load jobs. You can use SHOW LOAD to view the history of load jobs and find the load job that you want to re-run. Then, you can copy the information of that load job and use the job information, except the label, to create another load job.

2. When I load data from HDFS by using Broker Load, what do I do if the date and time values loaded into the destination StarRocks table are 8 hours later than the date and time values from the source data file?

Both the destination StarRocks table and the Broker Load job are compiled at creation to use a China Standard Time (CST) time zone (specified by using the timezone parameter). However, the server is set to run based on a Coordinated Universal Time (UTC) time zone. As a result, 8 extra hours are added to the date and time values from the source data file during data loading. To prevent this issue, do not specify the timezone parameter when you create the destination StarRocks table.

3. When I load ORC-formatted data by using Broker Load, what do I do if the `ErrorMsg: type:ETL_RUN_FAIL; msg:Cannot cast '<slot 6>' from VARCHAR to ARRAY<VARCHAR(30)>` error occurs?

The source data file has different column names than the destination StarRocks table. In this situation, you must use the SET clause in the load statement to specify the column mapping between the file and the table. When executing the SET clause, StarRocks needs to perform a type inference, but it fails in invoking the cast function to transform the source data to the destination data types. To resolve this issue, make sure that the source data file has the same column names as the destination StarRocks table. As such, the SET clause is not needed and therefore StarRocks does not need to invoke the cast function to perform data type conversions. Then the Broker Load job can be run successfully.

4. The Broker Load job does not report errors, but why am I unable to query the loaded data?

Broker Load is an asynchronous loading method. The load job may still fail even if the load statement does not return errors. After you run a Broker Load job, you can use SHOW LOAD to view the result and errmsg of the load job. Then, you can modify the job configuration and retry.

5. What do I do if the "failed to send batch" or "TabletWriter add batch with unknown id" error occurs?

The amount of time taken to write the data exceeds the upper limit, causing a timeout error. To resolve this issue, modify the settings of the session variable query_timeout and the BE configuration item streaming_load_rpc_max_alive_time_sec based on your business requirements.

6. What do I do if the "LOAD-RUN-FAIL; msg:OrcScannerAdapter::init_include_columns. col name = xxx not found" error occurs?

If you are loading Parquet- or ORC-formatted data, check whether the column names held in the first row of the source data file are the same as the column names of the destination StarRocks table.

(tmp_c1,tmp_c2)
SET
(
   id=tmp_c2,
   name=tmp_c1
)

The preceding example maps the tmp_c1 and tmp_c2 columns of the source data file onto the name and id columns of the destination StarRocks table, respectively. If you do not specify the SET clause, the column names specified in the column_list parameter are used to declare the column mapping. For more information, see BROKER LOAD.

NOTICE

If the source data file is an ORC-formatted file generated by Apache Hive™ and the first row of the file holds (_col0, _col1, _col2, ...), the "Invalid Column Name" error may occur. If this error occurs, you need to use the SET clause to specify the column mapping.

7. How do I handle errors such as the error that causes the Broker Load job to run for an excessively long period of time?

View the FE log file fe.log and search for the ID of the load job based on the job label. Then, view the BE log file be.INFO and retrieve the log records of the load job based on the job ID to locate the root cause of the error.

8. How do I configure an Apache HDFS cluster that runs in HA mode?

If an HDFS cluster runs in high availability (HA) mode, configure it as follows:

dfs.nameservices: the name of the HDFS cluster, for example, "dfs.nameservices" = "my_ha".
dfs.ha.namenodes.xxx: the name of the NameNode in the HDFS cluster. If you specify multiple NameNode names, separate them with commas (,). xxx is the HDFS cluster name that you have specified in dfs.nameservices, for example, "dfs.ha.namenodes.my_ha" = "my_nn".
dfs.namenode.rpc-address.xxx.nn: the RPC address of the NameNode in the HDFS cluster. nn is the NameNode name that you have specified in dfs.ha.namenodes.xxx, for example, "dfs.namenode.rpc-address.my_ha.my_nn" = "host:port".
dfs.client.failover.proxy.provider: the provider of the NameNode to which the client will connect. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

Example:

(
    "dfs.nameservices" = "my-ha",
    "dfs.ha.namenodes.my-ha" = "my-namenode1, my-namenode2",
    "dfs.namenode.rpc-address.my-ha.my-namenode1" = "nn1-host:rpc_port",
    "dfs.namenode.rpc-address.my-ha.my-namenode2" = "nn2-host:rpc_port",
    "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)

The HA mode can be used with simple authentication or Kerberos authentication. For example, to use simple authentication to access an HDFS cluster that runs in HA mode, you need to specify the following configurations:

(
    "username"="user",
    "password"="passwd",
    "dfs.nameservices" = "my-ha",
    "dfs.ha.namenodes.my-ha" = "my_namenode1, my_namenode2",
    "dfs.namenode.rpc-address.my-ha.my-namenode1" = "nn1-host:rpc_port",
    "dfs.namenode.rpc-address.my-ha.my-namenode2" = "nn2-host:rpc_port",
    "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)

You can add the configurations of the HDFS cluster to the hdfs-site.xml file. This way, you only need to specify the file path and authentication information when you use brokers to load data from the HDFS cluster.

9. How do I configure Hadoop ViewFS Federation?

Copy the ViewFs-related configuration files core-site.xml and hdfs-site.xml to the broker/conf directory.

If you have a custom file system, you also need to copy the file system-related .jar files to the broker/lib directory.

10. When I access an HDFS cluster that requires Kerberos authentication, what do I do if the "Can't get Kerberos realm" error occurs?

Check that the /etc/krb5.conf file is configured on all hosts on which brokers are deployed.

If the error persists, add -Djava.security.krb5.conf:/etc/krb5.conf to the end of the JAVA_OPTS variable in the broker startup script.

Broker Load

1. Does Broker Load support re-running load jobs that have been run successfully and are in the FINISHED state?​

2. When I load data from HDFS by using Broker Load, what do I do if the date and time values loaded into the destination StarRocks table are 8 hours later than the date and time values from the source data file?​

3. When I load ORC-formatted data by using Broker Load, what do I do if the ErrorMsg: type:ETL_RUN_FAIL; msg:Cannot cast '<slot 6>' from VARCHAR to ARRAY<VARCHAR(30)> error occurs?​

4. The Broker Load job does not report errors, but why am I unable to query the loaded data?​

5. What do I do if the "failed to send batch" or "TabletWriter add batch with unknown id" error occurs?​

6. What do I do if the "LOAD-RUN-FAIL; msg:OrcScannerAdapter::init_include_columns. col name = xxx not found" error occurs?​

7. How do I handle errors such as the error that causes the Broker Load job to run for an excessively long period of time?​

8. How do I configure an Apache HDFS cluster that runs in HA mode?​

9. How do I configure Hadoop ViewFS Federation?​

10. When I access an HDFS cluster that requires Kerberos authentication, what do I do if the "Can't get Kerberos realm" error occurs?​

What did you think of this doc?