[SPARK-52238][PYTHON] Python client for Declarative Pipelines #50963
Conversation
Is Declarative Pipelines supposed to be supported only in connect mode?
```python
from pyspark.sql.pipelines.block_connect_access import block_spark_connect_execution_and_analysis


class BlockSparkConnectAccessTests(ReusedConnectTestCase):
```
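One possible shape for a test in this class, assuming `block_spark_connect_execution_and_analysis` is a context manager that raises when Spark Connect execution or analysis is attempted (inferred from its name, not confirmed here) and that `self.spark` is provided by `ReusedConnectTestCase`:

```python
    def test_execution_is_blocked(self):
        # Building the plan alone is fine; Connect DataFrames are lazy.
        df = self.spark.range(5)
        with block_spark_connect_execution_and_analysis():
            # Triggering execution inside the block should be rejected.
            with self.assertRaises(Exception):
                df.collect()
```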
New tests should be registered in dev/sparktestsupport/modules.py; otherwise they are skipped.
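For context, registering them means adding the new test modules to the relevant `Module`'s `python_test_goals` list in `dev/sparktestsupport/modules.py`. An illustrative sketch of the entries (exact placement and surrounding fields omitted):

```python
# Entries that would be added to the pyspark-sql Module's python_test_goals
# in dev/sparktestsupport/modules.py (illustrative; exact placement may differ).
python_test_goals = [
    # ... existing goals ...
    "pyspark.sql.tests.pipelines.test_block_connect_access",
    "pyspark.sql.tests.pipelines.test_cli",
    "pyspark.sql.tests.pipelines.test_decorators",
    "pyspark.sql.tests.pipelines.test_graph_element_registry",
]
```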
Do we plan to also test this file in classic mode?
This should only be tested in connect mode – do I need to add something to the file to set that up?
```python
class GraphElementRegistryTest(unittest.TestCase):
    def test_graph_element_registry(self):
        spark = SparkSession.builder.getOrCreate()
```
Why not reuse ReusedSQLTestCase (for classic) and ReusedConnectTestCase (for connect)?
Ah – this test actually doesn't need a SparkSession. Updating it to take it out.
@zhengruifeng this initial implementation is just for Connect. Connect is more straightforward to support, because Connect DataFrames are lazier than classic DataFrames. This means we can evaluate the user's decorated query function immediately rather than call back after all upstream datasets have been resolved. However, it's designed in a way that can support classic in the future – by implementing a ...
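To make the laziness point concrete, here is a minimal sketch (not the actual implementation – the registry class and method names below are made up) of how a decorator can evaluate the query function eagerly when DataFrames are lazy logical plans:

```python
from typing import Callable, Dict

from pyspark.sql import DataFrame


class FlowRegistry:
    """Hypothetical registry collecting flow definitions for the dataflow graph."""

    def __init__(self) -> None:
        self.flows: Dict[str, DataFrame] = {}

    def register_flow(self, name: str, df: DataFrame) -> None:
        # A Connect DataFrame is just an unresolved plan, so storing it here
        # triggers no execution or analysis on the server.
        self.flows[name] = df


_registry = FlowRegistry()


def materialized_view(func: Callable[[], DataFrame]) -> Callable[[], DataFrame]:
    # Because Connect DataFrames are lazy, the user's function can be called
    # immediately at decoration time; the resulting plan is registered and only
    # resolved later, once the whole graph is known. A classic-mode
    # implementation would instead have to defer this call.
    _registry.register_flow(func.__name__, func())
    return func
```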
`init` command as described in this design doc: https://docs.google.com/document/d/1LrwYt99MO8Pt2xgQlMoVvoBjlX0EMfF_-NkWGQYi39E/edit?tab=t.0

### How I tested

```
./python/run-tests --modules pyspark-sql --testnames 'pyspark.sql.tests.pipelines.test_cli'
dev/lint-python --compile --black --custom-pyspark-error --flake8
~/oss/bin/spark-pipelines init --name demo2
cd demo2
~/oss/bin/spark-pipelines run --remote sc://localhost
```

`init` output:

```
Pipeline project 'demo3' created successfully. To run your pipeline:
  cd 'demo3'
  spark-pipelines run
```

`run` output:

```
Loading pipeline spec from /Users/sandy.ryza/sdp-test/demo2/pipeline.yml...
Spark session created.
Creating dataflow graph...
Registering graph elements...
Loading definitions. Root directory: /Users/sandy.ryza/sdp-test/demo2.
Found 1 files matching glob 'transformations/**/*.py'
Importing /Users/sandy.ryza/sdp-test/demo2/transformations/example_python_materialized_view.py...
Found 1 files matching glob 'transformations/**/*.sql'
Registering SQL file /Users/sandy.ryza/sdp-test/demo2/transformations/example_sql_materialized_view.sql...
Starting run...
Starting execution...
2025-05-21T17:20:26.155Z: Flow `spark_catalog`.`default`.`example_python_materialized_view` is QUEUED.
2025-05-21T17:20:26.155Z: Flow `spark_catalog`.`default`.`example_sql_materialized_view` is QUEUED.
2025-05-21T17:20:26.156Z: Flow 'spark_catalog.default.example_python_materialized_view' is PLANNING.
2025-05-21T17:20:26.156Z: Flow `spark_catalog`.`default`.`example_python_materialized_view` is STARTING.
2025-05-21T17:20:26.156Z: Flow `spark_catalog`.`default`.`example_python_materialized_view` is RUNNING.
2025-05-21T17:20:26.629Z: Flow 'spark_catalog.default.example_python_materialized_view' has COMPLETED.
2025-05-21T17:20:27.164Z: Flow 'spark_catalog.default.example_sql_materialized_view' is PLANNING.
2025-05-21T17:20:27.165Z: Flow `spark_catalog`.`default`.`example_sql_materialized_view` is STARTING.
2025-05-21T17:20:27.165Z: Flow `spark_catalog`.`default`.`example_sql_materialized_view` is RUNNING.
2025-05-21T17:20:27.462Z: Flow 'spark_catalog.default.example_sql_materialized_view' has COMPLETED.
2025-05-21T17:20:29.216Z: Run has COMPLETED.
```
```python
import argparse
import importlib.util
import os
import yaml
```
Seems like the PySpark tests fail if yaml is not installed (https://github.com/apache/spark/actions/runs/15516895252/job/43685059103). I think we should skip the tests if yaml is not found.
Is this what we do for other dependencies? I'd be worried that, if we accidentally make a change to Spark CI that avoids installing pyyaml when we'd otherwise expect it to be installed, then the tests could get broken and we wouldn't find out.
I could alternatively help track down why it isn't installed in that situation and fix it?
Yeah, we do – the tests basically work without any dependencies. That build is a scheduled build dedicated to testing without any dependencies.
There are a bunch of scheduled jobs that we don't run for PR builders at https://github.com/apache/spark/actions, e.g., JDK 17, 21, Maven, MacOS etc.
Got it – I'll make this change. Any chance you have a reference for how we skip tests for other missing dependencies, so I can add something consistent?
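The usual approach in the PySpark test suite is a module-level import check plus `unittest.skipIf`; a minimal sketch of that pattern (the `have_yaml` flag and message names here are made up, not an existing PySpark helper):

```python
import unittest

try:
    import yaml  # noqa: F401

    have_yaml = True
    yaml_requirement_message = None
except ImportError as e:
    have_yaml = False
    yaml_requirement_message = str(e)


@unittest.skipIf(not have_yaml, yaml_requirement_message)
class CLISuiteRequiringYaml(unittest.TestCase):
    def test_spec_parsing(self) -> None:
        # Only runs when pyyaml is importable in the test environment.
        self.assertTrue(have_yaml)
```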
### What changes were proposed in this pull request?

Adds the Python client for Declarative Pipelines. This implements the command line interface and Python APIs described in the [Declarative Pipelines SPIP](https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0#heading=h.9g6a5f8v6xig).

#### Python API for defining pipeline graph elements

The Python API consists of these APIs for defining flows and datasets in a pipeline dataflow graph (see their docstrings for more details):

- `create_streaming_table`
- `append_flow`
- `materialized_view`
- `table`
- `temporary_view`

Example file of definitions:

```python
from pyspark.sql import SparkSession
from pyspark import pipelines as sdp

spark = SparkSession.active()


@sdp.materialized_view
def baby_names_raw():
    return (
        spark.read.option("header", "true").csv("babynames.csv")
        .withColumnRenamed("First Name", "First_Name")
    )
```

#### Command line interface

The CLI is implemented as a Spark Connect client. It enables launching runs of declarative pipelines. It accepts a YAML spec, which specifies where on the local filesystem to look for the Python and SQL files that contain the definitions of the flows and datasets that make up the pipeline dataflow graph.

Example usage:

```
bin/spark-pipelines run --remote sc://localhost --spec pipeline.yml
```

Example output:

```
Loading pipeline spec from pipeline.yaml...
Creating Spark session...
Creating dataflow graph...
Registering graph elements...
Loading definitions. Root directory: ..
Found 1 files matching glob 'transformations/**/*.py'
Importing transformations/baby_names_raw.py...
Found 1 files matching glob 'transformations/**/*.sql'
Registering SQL file transformations/baby_names_prepared.sql...
Starting run...
Starting execution...
2025-05-20T15:08:01.395Z: Flow `spark_catalog`.`default`.`baby_names_raw` is QUEUED.
2025-05-20T15:08:01.398Z: Flow `spark_catalog`.`default`.`baby_names_prepared` is QUEUED.
2025-05-20T15:08:01.402Z: Flow 'spark_catalog.default.baby_names_raw' is PLANNING.
2025-05-20T15:08:01.403Z: Flow `spark_catalog`.`default`.`baby_names_raw` is STARTING.
2025-05-20T15:08:01.404Z: Flow `spark_catalog`.`default`.`baby_names_raw` is RUNNING.
2025-05-20T15:08:03.096Z: Flow 'spark_catalog.default.baby_names_raw' has COMPLETED.
2025-05-20T15:08:03.422Z: Flow 'spark_catalog.default.baby_names_prepared' is PLANNING.
2025-05-20T15:08:03.422Z: Flow `spark_catalog`.`default`.`baby_names_prepared` is STARTING.
2025-05-20T15:08:03.422Z: Flow `spark_catalog`.`default`.`baby_names_prepared` is RUNNING.
2025-05-20T15:08:03.875Z: Flow 'spark_catalog.default.baby_names_prepared' has COMPLETED.
2025-05-20T15:08:05.492Z: Run has COMPLETED.
```

#### Architecture diagram

<img width="1256" alt="image" src="https://github.com/user-attachments/assets/0fa6428f-b506-493b-a788-7f047a2a7946" />

### Why are the changes needed?

In order to implement Declarative Pipelines, as described in the SPIP.

### Does this PR introduce _any_ user-facing change?

No previous behavior is changed, but new behavior is introduced.

### How was this patch tested?

#### Unit testing

Includes unit tests for:

- Python API error cases – test_decorators.py
- Command line functionality – test_cli.py
- The harness for registering graph elements while evaluating pipeline definition Python files – test_graph_element_registry.py
- Code for blocking execution and analysis within decorated query functions – test_block_connect_access.py

Note that, once the backend is wired up, we will submit additional unit tests that cover end-to-end pipeline execution with Python.

#### CLI testing

With the Declarative Pipelines Spark Connect backend (coming in a future PR), I ran the CLI and confirmed that it executed a pipeline as expected.

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#50963 from sryza/sdp-python.

Lead-authored-by: Sandy Ryza <[email protected]>
Co-authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>