modified files to build for Databricks runtime 11.3 LTS compliant versions #313

Merged: 4 commits, Mar 6, 2025

4 changes: 2 additions & 2 deletions .github/workflows/push.yml
@@ -31,10 +31,10 @@ jobs:
           sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
           java -version

-      - name: Set up Python 3.8
+      - name: Set up Python 3.9.21
         uses: actions/setup-python@v5
         with:
-          python-version: '3.8.12'
+          python-version: '3.9.21'
           cache: 'pipenv'

       - name: Check Python version

4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -24,10 +24,10 @@ jobs:
           sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
           java -version

-      - name: Set up Python 3.8
+      - name: Set up Python 3.9.21
         uses: actions/setup-python@v5
         with:
-          python-version: '3.8.12'
+          python-version: '3.9.21'
           cache: 'pipenv'

       - name: Check Python version

7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,13 @@ All notable changes to the Databricks Labs Data Generator will be documented in
#### Fixed
* Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime

+#### Changed
+* Changed base Databricks runtime version to DBR 11.3 LTS (based on Apache Spark 3.3.0)
+
+#### Added
+* Added support for serialization to/from JSON format
+
+
### Version 0.4.0 Hotfix 2

#### Fixed
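
The JSON support noted in the changelog ties into the `SerializableToDict` base class visible in the code changes below. A minimal round-trip sketch follows; the helper names used here (`_toInitializationDict`, `fromInitializationDict`) are assumptions inferred from that base class, not confirmed public API:

```python
# Sketch only: round-trip a generation spec through JSON.
# The helper names below are assumptions based on the SerializableToDict
# base class in this PR; the released API may differ.
import json

import dbldatagen as dg

spec = dg.DataGenerator(sparkSession=spark, name="users", rows=1000)  # assumes an active `spark` session
as_json = json.dumps(spec._toInitializationDict())  # hypothetical: spec -> dict -> JSON
restored = dg.DataGenerator.fromInitializationDict(json.loads(as_json))  # hypothetical inverse
```
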
17 changes: 7 additions & 10 deletions CONTRIBUTING.md
@@ -19,10 +19,7 @@ Dependent packages are not installed automatically by the `dbldatagen` package.

## Python compatibility

-The code has been tested with Python 3.8.12 and later.
-
-Older releases were tested with Python 3.7.5 but as of this release, it requires the Databricks
-runtime 9.1 LTS or later.
+The code has been tested with Python 3.9.21 and later.

## Checking your code for common issues

@@ -46,7 +43,7 @@ Our recommended mechanism for building the code is to use a `conda` or `pipenv`
But it can be built with any Python virtualization environment.

### Spark dependencies
-The builds have been tested against Spark 3.2.1. This requires the OpenJDK 1.8.56 or later version of Java 8.
+The builds have been tested against Spark 3.3.0. This requires the OpenJDK 1.8.56 or later version of Java 8.
The Databricks runtimes use the Azul Zulu version of OpenJDK 8 and we have used these in local testing.
These are not installed automatically by the build process, so you will need to install them separately.
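
A quick way to sanity-check a local environment against these expectations (a minimal sketch; it only prints versions):

```python
# Verify the local toolchain roughly matches the documented build environment.
import subprocess

import pyspark

print("pyspark version:", pyspark.__version__)    # expect 3.3.0 for DBR 11.3 LTS parity
subprocess.run(["java", "-version"], check=True)  # expect an OpenJDK / Azul Zulu 1.8 build
```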

@@ -75,7 +72,7 @@ To build with `pipenv`, perform the following commands:
- Run `make dist` from the main project directory
- The resulting wheel file will be placed in the `dist` subdirectory

-The resulting build has been tested against Spark 3.2.1
+The resulting build has been tested against Spark 3.3.0

## Creating the HTML documentation

@@ -161,19 +158,19 @@ See https://legacy.python.org/dev/peps/pep-0008/

# Github expectations
When running the unit tests on Github, the environment should use the same environment as the latest Databricks
-runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 10.4 onwards,
+runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 11.3 onwards,
unit tests will be run on the environment corresponding to the latest LTS release.

-Libraries will use the same versions as the earliest supported LTS release - currently 10.4 LTS
+Libraries will use the same versions as the earliest supported LTS release - currently 11.3 LTS

This means for the current build:

- Use of Ubuntu 22.04 for the test runner
- Use of Java 8
-- Use of Python 3.11
+- Use of Python 3.9.21 when testing / building the image

See the following resources for more information
- https://docs.databricks.com/en/release-notes/runtime/15.4lts.html
-- https://docs.databricks.com/en/release-notes/runtime/10.4lts.html
+- https://docs.databricks.com/en/release-notes/runtime/11.3lts.html
- https://github.com/actions/runner-images/issues/10636

16 changes: 8 additions & 8 deletions Pipfile
@@ -10,7 +10,7 @@ sphinx = ">=2.0.0,<3.1.0"
nbsphinx = "*"
numpydoc = "==0.8"
pypandoc = "*"
-ipython = "==7.31.1"
+ipython = "==7.32.0"
pydata-sphinx-theme = "*"
recommonmark = "*"
sphinx-markdown-builder = "*"

@@ -19,13 +19,13 @@ prospector = "*"

[packages]
numpy = "==1.22.0"
-pyspark = "==3.1.3"
-pyarrow = "==4.0.1"
-wheel = "==0.38.4"
-pandas = "==1.2.4"
-setuptools = "==65.6.3"
-pyparsing = "==2.4.7"
+pyspark = "==3.3.0"
+pyarrow = "==7.0.0"
+wheel = "==0.37.0"
+pandas = "==1.3.4"
+setuptools = "==58.0.4"
+pyparsing = "==3.0.4"
jmespath = "==0.10.0"

[requires]
-python_version = "3.8.12"
+python_version = "3.9.21"
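
A quick check that an activated environment actually picked up these pins (illustrative only):

```python
# Confirm the pinned interpreter and key packages are active in the environment.
import sys

import pandas
import pyarrow
import pyspark

print(sys.version.split()[0])  # expect 3.9.21 per [requires]
print(pandas.__version__, pyarrow.__version__, pyspark.__version__)  # expect 1.3.4, 7.0.0, 3.3.0
```
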
6 changes: 3 additions & 3 deletions README.md
@@ -83,8 +83,8 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
contains details of installation using alternative mechanisms.

## Compatibility
-The Databricks Labs Data Generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
-compatible with the Databricks runtime 10.4 LTS and later releases. For full Unity Catalog support,
+The Databricks Labs Data Generator framework can be used with Pyspark 3.3.0 and Python 3.9.21 or later. These are
+compatible with the Databricks runtime 11.3 LTS and later releases. For full Unity Catalog support,
we recommend using Databricks runtime 13.2 or later (Databricks 13.3 LTS or above preferred)

For full library compatibility for a specific Databricks Spark release, see the Databricks

@@ -155,7 +155,7 @@ The GitHub repository also contains further examples in the examples directory.

## Spark and Databricks Runtime Compatibility
The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
-older LTS versions at least from 10.4 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
+older LTS versions at least from 11.3 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
including `current` and `preview`.

While we don't specifically drop support for older runtimes, changes in Pyspark APIs or
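
In practice the new floor can be asserted up front in user code; a hedged illustration (the checks below are illustrative, not part of the library):

```python
# Illustrative guard for the new minimum versions (not part of dbldatagen).
import sys

import pyspark
from packaging.version import Version  # assumes the `packaging` package is available

assert sys.version_info >= (3, 9), "dbldatagen now targets Python 3.9 or later"
assert Version(pyspark.__version__) >= Version("3.3.0"), "dbldatagen now targets PySpark 3.3.0 or later"
```
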
18 changes: 11 additions & 7 deletions dbldatagen/column_generation_spec.py
@@ -95,7 +95,7 @@ class ColumnGenerationSpec(SerializableToDict):
     # restrict spurious messages from java gateway
     logging.getLogger("py4j").setLevel(logging.WARNING)

-    def __init__(self, name, colType=None, minValue=0, maxValue=None, step=1, prefix='', random=False,
+    def __init__(self, name, colType=None, *, minValue=0, maxValue=None, step=1, prefix='', random=False,
                  distribution=None, baseColumn=None, randomSeed=None, randomSeedMethod=None,
                  implicit=False, omit=False, nullable=True, debug=False, verbose=False,
                  seedColumnName=DEFAULT_SEED_COLUMN,

@@ -529,18 +529,22 @@ def _setup_logger(self):
         else:
             self.logger.setLevel(logging.WARNING)

-    def _computeAdjustedRangeForColumn(self, colType, c_min, c_max, c_step, c_begin, c_end, c_interval, c_range,
+    def _computeAdjustedRangeForColumn(self, colType, c_min, c_max, c_step, *, c_begin, c_end, c_interval, c_range,
                                        c_unique):
         """Determine adjusted range for data column
         """
         assert colType is not None, "`colType` must be non-None instance"

         if type(colType) is DateType or type(colType) is TimestampType:
-            return self._computeAdjustedDateTimeRangeForColumn(colType, c_begin, c_end, c_interval, c_range, c_unique)
+            return self._computeAdjustedDateTimeRangeForColumn(colType, c_begin, c_end, c_interval,
+                                                               c_range=c_range,
+                                                               c_unique=c_unique)
         else:
-            return self._computeAdjustedNumericRangeForColumn(colType, c_min, c_max, c_step, c_range, c_unique)
+            return self._computeAdjustedNumericRangeForColumn(colType, c_min, c_max, c_step,
+                                                              c_range=c_range,
+                                                              c_unique=c_unique)

-    def _computeAdjustedNumericRangeForColumn(self, colType, c_min, c_max, c_step, c_range, c_unique):
+    def _computeAdjustedNumericRangeForColumn(self, colType, c_min, c_max, c_step, *, c_range, c_unique):
         """Determine adjusted range for data column

         Rules:

@@ -589,7 +593,7 @@ def _computeAdjustedNumericRangeForColumn(self, colType, c_min, c_max, c_step, c

         return result

-    def _computeAdjustedDateTimeRangeForColumn(self, colType, c_begin, c_end, c_interval, c_range, c_unique):
+    def _computeAdjustedDateTimeRangeForColumn(self, colType, c_begin, c_end, c_interval, *, c_range, c_unique):
         """Determine adjusted range for Date or Timestamp data column
         """
         effective_begin, effective_end, effective_interval = None, None, None

@@ -656,7 +660,7 @@ def _getUniformRandomSQLExpression(self, col_name):
         else:
             return "rand()"

-    def _getScaledIntSQLExpression(self, col_name, scale, base_columns, base_datatypes=None, compute_method=None,
+    def _getScaledIntSQLExpression(self, col_name, scale, base_columns, *, base_datatypes=None, compute_method=None,
                                    normalize=False):
         """ Get scaled numeric expression
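
The `*` added across these signatures is Python's keyword-only marker: every parameter after it must be passed by name, which protects long argument lists like these from silent positional mix-ups. A self-contained sketch of the effect (illustrative function, not repo code):

```python
# Illustration of the keyword-only marker `*` used throughout this change.
def adjusted_range(col_type, c_min, c_max, c_step, *, c_range=None, c_unique=None):
    """Toy stand-in for _computeAdjustedRangeForColumn-style signatures."""
    return c_min, c_max, c_step, c_range, c_unique

adjusted_range("int", 0, 100, 1, c_range=None, c_unique=10)  # OK: keyword arguments spelled out
# adjusted_range("int", 0, 100, 1, None, 10)  # TypeError: takes 4 positional arguments but 6 were given
```
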
4 changes: 2 additions & 2 deletions dbldatagen/data_analyzer.py
@@ -92,7 +92,7 @@ def _displayRow(self, row):

         return ", ".join(results)

-    def _addMeasureToSummary(self, measureName, summaryExpr="''", fieldExprs=None, dfData=None, rowLimit=1,
+    def _addMeasureToSummary(self, measureName, *, summaryExpr="''", fieldExprs=None, dfData=None, rowLimit=1,
                              dfSummary=None):
         """ Add a measure to the summary dataframe

@@ -340,7 +340,7 @@ def _generatorDefaultAttributesFromType(cls, sqlType, colName=None, dataSummary=
         return result

     @classmethod
-    def _scriptDataGeneratorCode(cls, schema, dataSummary=None, sourceDf=None, suppressOutput=False, name=None):
+    def _scriptDataGeneratorCode(cls, schema, *, dataSummary=None, sourceDf=None, suppressOutput=False, name=None):
         """
         Generate outline data generator code from an existing dataframe
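
`_scriptDataGeneratorCode` backs `DataAnalyzer`'s code-generation helpers; typical use is along these lines (a sketch assuming an active `spark` session, an existing dataframe `df`, and the documented `scriptDataGeneratorFromData` entry point):

```python
import dbldatagen as dg

analyzer = dg.DataAnalyzer(sparkSession=spark, df=df)    # assumes `spark` and `df` exist
generated_code = analyzer.scriptDataGeneratorFromData()  # emits outline DataGenerator code for df's schema
```
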
10 changes: 5 additions & 5 deletions dbldatagen/data_generator.py
@@ -76,7 +76,7 @@ class DataGenerator(SerializableToDict):

     # logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.NOTSET)

-    def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
+    def __init__(self, sparkSession=None, name=None, *, randomSeedMethod=None,
                  rows=1000000, startingId=0, randomSeed=None, partitions=None, verbose=False,
                  batchSize=None, debug=False, seedColumnName=DEFAULT_SEED_COLUMN,
                  random=False,

@@ -782,7 +782,7 @@ def _checkColumnOrColumnList(self, columns, allowId=False):
                              f" column `{columns}` must refer to defined column")
         return True

-    def withColumnSpec(self, colName, minValue=None, maxValue=None, step=1, prefix=None,
+    def withColumnSpec(self, colName, *, minValue=None, maxValue=None, step=1, prefix=None,
                        random=None, distribution=None,
                        implicit=False, dataRange=None, omit=False, baseColumn=None, **kwargs):
         """ add a column specification for an existing column

@@ -842,7 +842,7 @@ def hasColumnSpec(self, colName):
         """
         return colName in self._columnSpecsByName

-    def withColumn(self, colName, colType=StringType(), minValue=None, maxValue=None, step=1,
+    def withColumn(self, colName, colType=StringType(), *, minValue=None, maxValue=None, step=1,
                    dataRange=None, prefix=None, random=None, distribution=None,
                    baseColumn=None, nullable=True,
                    omit=False, implicit=False, noWarn=False,

@@ -1058,7 +1058,7 @@ def withStructColumn(self, colName, fields=None, asJson=False, **kwargs):

         return newDf

-    def _generateColumnDefinition(self, colName, colType=None, baseColumn=None,
+    def _generateColumnDefinition(self, colName, colType=None, baseColumn=None, *,
                                   implicit=False, omit=False, nullable=True, **kwargs):
         """ generate field definition and column spec

@@ -1591,7 +1591,7 @@ def scriptTable(self, name=None, location=None, tableFormat="delta", asHtml=Fals

         return results

-    def scriptMerge(self, tgtName=None, srcName=None, updateExpr=None, delExpr=None, joinExpr=None, timeExpr=None,
+    def scriptMerge(self, tgtName=None, srcName=None, *, updateExpr=None, delExpr=None, joinExpr=None, timeExpr=None,
                     insertExpr=None,
                     useExplicitNames=True,
                     updateColumns=None, updateColumnExprs=None,
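
With `withColumn` and `withColumnSpec` now keyword-only after their leading arguments, call sites name every option explicitly. A minimal sketch of the resulting style (assumes an active `spark` session):

```python
import dbldatagen as dg
from pyspark.sql.types import IntegerType

ds = (
    dg.DataGenerator(spark, name="example", rows=100000, partitions=4)         # assumes `spark` exists
    .withColumn("code", IntegerType(), minValue=1, maxValue=100, random=True)  # options passed by keyword
)
df = ds.build()
```

A positional call such as `.withColumn("code", IntegerType(), 1, 100)` now raises a `TypeError`, which is the point of the change.
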
4 changes: 2 additions & 2 deletions dbldatagen/text_generator_plugins.py
@@ -69,7 +69,7 @@ class _FnCallContext:
     def __init__(self, txtGen):
         self.textGenerator = txtGen

-    def __init__(self, fn, init=None, initPerBatch=False, name=None, rootProperty=None):
+    def __init__(self, fn, *, init=None, initPerBatch=False, name=None, rootProperty=None):
         super().__init__()
         assert fn is not None or callable(fn), "Function must be provided with signature fn(context, oldValue)"
         assert init is None or callable(init), "Init function must be a callable function or lambda if passed"

@@ -284,7 +284,7 @@ class FakerTextFactory(PyfuncTextFactory):

     _defaultFakerTextFactory = None

-    def __init__(self, locale=None, providers=None, name="FakerText", lib=None,
+    def __init__(self, *, locale=None, providers=None, name="FakerText", lib=None,
                  rootClass=None):

         super().__init__(name)
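
`FakerTextFactory` becomes keyword-only from its first parameter, so construction looks like the following sketch (assumes the `faker` package is installed):

```python
from dbldatagen.text_generator_plugins import FakerTextFactory

# All constructor options must now be passed by name:
faker_factory = FakerTextFactory(locale=["en_US"], name="FakerText")
```
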
3 changes: 2 additions & 1 deletion dbldatagen/text_generators.py
@@ -429,7 +429,8 @@ def _prepareTemplateStrings(self, genTemplate, escapeSpecialMeaning=False):

         return num_placeholders, retval

-    def _applyTemplateStringsForTemplate(self, baseValue, genTemplate, placeholders, rnds, escapeSpecialMeaning=False):
+    def _applyTemplateStringsForTemplate(self, baseValue, genTemplate, placeholders, rnds, *,
+                                         escapeSpecialMeaning=False):
         """ Vectorized implementation of template driven text substitution

         Apply substitutions to placeholders using random numbers
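
`_applyTemplateStringsForTemplate` is the vectorized engine behind template-driven text columns; at the public surface that corresponds to usage like this sketch (assumes an active `spark` session; the template follows the library's documented `\\w`-style placeholders):

```python
import dbldatagen as dg

df = (
    dg.DataGenerator(spark, rows=1000)                    # assumes `spark` exists
    .withColumn("email", template=r'\\w.\\w@\\w.com')     # templated text column
    .build()
)
```
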
4 changes: 2 additions & 2 deletions makefile
@@ -27,11 +27,11 @@ prepare: clean

 create-dev-env:
 	@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
-	conda create -n $(ENV_NAME) python=3.8.10
+	conda create -n $(ENV_NAME) python=3.9.21

 create-github-build-env:
 	@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
-	conda create -n pip_$(ENV_NAME) python=3.8
+	conda create -n pip_$(ENV_NAME) python=3.9.21

 install-dev-dependencies:
 	@echo "$(OK_COLOR)=> installing dev environment requirements$(NO_COLOR)"

18 changes: 9 additions & 9 deletions python/dev_require.txt
@@ -1,19 +1,19 @@
 # The following packages are used in building the test data generator framework.
 # All packages used are already installed in the Databricks runtime environment for version 6.5 or later
 numpy==1.22.0
-pandas==1.2.4
+pandas==1.3.4
 pickleshare==0.7.5
 py4j>=0.10.9.3
-pyarrow==4.0.1
-pyspark>=3.2.1,<=3.3.0
-python-dateutil==2.8.1
-six==1.15.0
-pyparsing==2.4.7
+pyarrow==7.0.0
+pyspark==3.3.0
+python-dateutil==2.8.2
+six==1.16.0
+pyparsing==3.0.4
 jmespath==0.10.0

 # The following packages are required for development only
-wheel==0.36.2
-setuptools==52.0.0
+wheel==0.37.0
+setuptools==58.0.4
 bumpversion
 pytest
 pytest-cov

@@ -28,7 +28,7 @@ sphinx_rtd_theme
 nbsphinx
 numpydoc==0.8
 pypandoc
-ipython==7.22.0
+ipython==7.32.0
 recommonmark
 sphinx-markdown-builder
 Jinja2 < 3.1

18 changes: 9 additions & 9 deletions python/require.txt
@@ -1,19 +1,19 @@
 # The following packages are used in building the test data generator framework.
 # All packages used are already installed in the Databricks runtime environment for version 6.5 or later
 numpy==1.22.0
-pandas==1.2.5
+pandas==1.3.4
 pickleshare==0.7.5
 py4j==0.10.9
-pyarrow==4.0.1
-pyspark>=3.2.1
-python-dateutil==2.8.1
-six==1.15.0
-pyparsing==2.4.7
+pyarrow==7.0.0
+pyspark==3.3.0
+python-dateutil==2.8.2
+six==1.16.0
+pyparsing==3.0.4
 jmespath==0.10.0

 # The following packages are required for development only
-wheel==0.36.2
-setuptools==52.0.0
+wheel==0.37.0
+setuptools==58.0.4
 bumpversion
 pytest
 pytest-cov

@@ -27,7 +27,7 @@ sphinx_rtd_theme
 nbsphinx
 numpydoc==0.8
 pypandoc
-ipython==7.22.0
+ipython==7.32.0
 recommonmark
 sphinx-markdown-builder
 Jinja2 < 3.1

2 changes: 1 addition & 1 deletion setup.py
@@ -55,5 +55,5 @@
         "Intended Audience :: Developers",
         "Intended Audience :: System Administrators"
     ],
-    python_requires='>=3.8.10',
+    python_requires='>=3.9.21',
 )