Version: Candidate-3.4

Python UDF

Since version 3.4.0, StarRocks supports writing user-defined functions (UDFs) in Python.

This article describes how to write and use StarRocks Python UDFs.

Currently, StarRocks supports only scalar Python UDFs.

Prerequisites

Before using StarRocks' Python UDF functionality, make sure that the following requirements are met:

  • Python 3.8 or later is installed.

  • The UDF feature is enabled. You can set the FE configuration item enable_udf to true in the FE configuration file fe/conf/fe.conf to enable this feature, and then restart the FE nodes to make the settings take effect. For more information, see Parameter configuration.

  • The location of the Python interpreter is configured on the BE nodes. Add the BE configuration item python_envs to set the directory where the Python interpreter is installed, for example, /opt/Python-3.8/.
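As a quick sanity check of the first prerequisite, the sketch below can be run with the interpreter that python_envs points to. The helper names meets_minimum and has_module are hypothetical and are not part of StarRocks. Checking for pyarrow is an assumption based on the vectorized example later in this article, which imports it.

```python
import importlib.util
import sys

# Minimum interpreter version stated in the prerequisites.
MINIMUM = (3, 8)


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


print("Python", sys.version.split()[0], "ok:", meets_minimum(sys.version_info))
# Assumption: pyarrow is only needed for UDFs that use input = "arrow".
print("pyarrow available:", has_module("pyarrow"))
```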

Develop and use UDFs

Develop a scalar UDF

Syntax

CREATE [GLOBAL] FUNCTION function_name(arg_type [, ...])
RETURNS return_type
[PROPERTIES ("key" = "value" [, ...]) | key="value" [...] ]
[AS $$ $$]

Create a Python inline UDF with scalar input

echo example

CREATE FUNCTION python_echo(INT) RETURNS
INT
type = 'Python'
symbol = 'echo'
file = 'inline'
AS
$$
def echo(x):
    return x
$$
;
| Parameter | Description |
| --- | --- |
| symbol | Name of the Python function that the UDF executes. |
| type | Marks the type of the UDF being created. The value Python indicates a Python-based UDF. |
| input | Type of input. Valid values: scalar and arrow. Default: scalar. |
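One way to work with an inline UDF is to keep the Python body in a single string, exercise it locally, and then embed the same string in the statement. The sketch below does this for the echo example. The helpers load_udf and render_create_function are hypothetical, and calling the function once per value only approximates how scalar input is presumably invoked.

```python
import textwrap

# Python body of the UDF, kept in one place so the tested code is the submitted code.
UDF_SOURCE = textwrap.dedent("""\
    def echo(x):
        return x
""")


def load_udf(source, symbol):
    """Execute the UDF source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[symbol]


def render_create_function(name, arg_type, return_type, symbol, source):
    """Render an inline CREATE FUNCTION statement in the layout of the example above."""
    return (
        f"CREATE FUNCTION {name}({arg_type}) RETURNS\n"
        f"{return_type}\n"
        "type = 'Python'\n"
        f"symbol = '{symbol}'\n"
        "file = 'inline'\n"
        "AS\n"
        "$$\n"
        f"{source}"
        "$$\n"
        ";"
    )


echo = load_udf(UDF_SOURCE, "echo")
# Local stand-in for scalar input: one call per value.
print([echo(value) for value in [1, 2, 3]])  # [1, 2, 3]

statement = render_create_function("python_echo", "INT", "INT", "echo", UDF_SOURCE)
print(statement)
```

The rendered statement can then be pasted into a SQL client.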

Create a Python inline UDF with vectorized input

To speed up UDF processing, StarRocks also provides vectorized input. Set input to "arrow" so that the function receives PyArrow arrays rather than single values.

CREATE FUNCTION python_add(INT) RETURNS
INT
type = 'Python'
symbol = 'add'
input = "arrow"
AS
$$
import pyarrow.compute as pc
def add(x):
    return pc.add(x, 1)
$$
;

Create a Python packaged UDF

Create the package. First, package the module into xxx.py.zip, which must conform to the zipimport format.

> tree .
.
├── main.py
└── yaml
    ├── composer.py
    ├── constructor.py
    ├── cyaml.py
    ├── dumper.py
    ├── emitter.py
    ├── error.py
    ├── events.py
    ├── __init__.py
    ├── loader.py
    ├── nodes.py
    ├── parser.py
> cat main.py
import numpy
import yaml

def echo(a):
    return yaml.__version__
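
Create the function. Then, create the UDF, setting file to the URL of the package and symbol to the module name and function name joined by a dot.

CREATE FUNCTION py_pack(string)
RETURNS string
type = "Python"
file = "http://HTTP_IP:HTTP_PORT/m1.py.zip"
symbol = "main.echo"
;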

Note that the URL must end with .py.zip here.
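The packaging step can be sketched with the standard library alone. The example below writes a stand-in module into m1.py.zip and then resolves a module.function symbol through Python's zipimport machinery, to confirm that the archive is importable. The helpers build_package and load_symbol and the module name udf_main are hypothetical. The stand-in plays the role of main.py without its third-party imports, and the way StarRocks itself loads the archive may differ.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Hypothetical stand-in for main.py, without third-party dependencies.
MODULE_SOURCE = "def echo(a):\n    return a\n"


def build_package(zip_path, sources):
    """Write {archive_name: source_text} entries into a zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in sources.items():
            archive.writestr(name, text)
    return zip_path


def load_symbol(zip_path, symbol):
    """Import 'module.function' from the archive via sys.path (zipimport)."""
    module_name, func_name = symbol.rsplit(".", 1)
    sys.path.insert(0, zip_path)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(module_name)
    finally:
        sys.path.remove(zip_path)
    return getattr(module, func_name)


workdir = tempfile.mkdtemp()
package = build_package(
    os.path.join(workdir, "m1.py.zip"), {"udf_main.py": MODULE_SOURCE}
)
echo = load_symbol(package, "udf_main.echo")
print(echo("hello"))  # hello
```

If the import fails locally, the archive layout is the first thing to check: modules and packages must sit at the root of the zip, not inside an extra top-level directory.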

Mapping between SQL data types and Python data types

Scalar:

| SQL Type | Python 3 Type |
| --- | --- |
| TINYINT/SMALLINT/INT/BIGINT/LARGEINT | int |
| STRING | str |
| DOUBLE | float |
| BOOLEAN | bool |
| DATETIME | datetime.datetime |
| FLOAT | float |
| CHAR | str |
| VARCHAR | str |
| DATE | datetime.date |
| DECIMAL | decimal.Decimal |
| ARRAY | list |
| MAP | dict |
| STRUCT | collections.namedtuple |
| JSON | dict |

Vectorized:

| SQL Type | Python 3 Type |
| --- | --- |
| BOOLEAN | pyarrow.lib.BooleanArray |
| TINYINT | pyarrow.lib.Int8Array |
| SMALLINT | pyarrow.lib.Int16Array |
| INT | pyarrow.lib.Int32Array |
| BIGINT | pyarrow.lib.Int64Array |
| FLOAT | pyarrow.FloatArray |
| DOUBLE | pyarrow.DoubleArray |
| VARCHAR | pyarrow.StringArray |
| DECIMAL | pyarrow.Decimal128Array |
| DATE | pyarrow.Date32Array |
| TIME | pyarrow.TimeArray |
| ARRAY | pyarrow.ListArray |
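To illustrate the scalar mapping, the sketch below shows a function body written against the Python types that DATE, DECIMAL, ARRAY, and MAP arguments are mapped to. The function summarize and its argument names are hypothetical, and the values are built by hand here rather than passed in by StarRocks.

```python
import datetime
import decimal


def summarize(order_date, amount, tags, attrs):
    """Hypothetical UDF body taking DATE, DECIMAL, ARRAY, and MAP arguments."""
    day = order_date.isoformat()                         # DATE    -> datetime.date
    rounded = amount.quantize(decimal.Decimal("0.01"))   # DECIMAL -> decimal.Decimal
    tag_text = ",".join(tags)                            # ARRAY   -> list
    total = sum(attrs.values())                          # MAP     -> dict
    return f"{day} {rounded} [{tag_text}] {total}"       # returned as a str


result = summarize(
    datetime.date(2024, 1, 15),
    decimal.Decimal("19.999"),
    ["a", "b"],
    {"x": 1, "y": 2},
)
print(result)  # 2024-01-15 20.00 [a,b] 3
```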