
Commit da1df6b

ronanstokes-db and nfx authored
Feature standard datasets - part 1 (#258)
* work in progress
* wip
* added implementations for Datasets describe and listing
* bumpedBuild
* fixed dataset provider imports
* wip
* initial working version
* added telephony plans
* initial working version: added plugin mechanics, initial user table and part of telephony plans
* Added tokei.rs badge (#253) [![lines of code](https://tokei.rs/b1/github/databrickslabs/dbldatagen)](https://github.com/databrickslabs/dbldatagen)
* Prep for release 036 (#251)
* prep for version 0.3.6
* added telephony plans
* initial implementation
* added basic/iot dataset
* wip
* work in progress
* wip
* work in progress
* wip
* additional coverage tests
* additional coverage

---------

Co-authored-by: Serge Smertin <[email protected]>
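
The history above calls out "implementations for Datasets describe and listing". A rough sketch of exercising that surface; the `list` and `describe` calls here are assumptions inferred from the commit message, not confirmed API at this commit:

```python
import dbldatagen as dg

# Hypothetical calls sketching the "describe and listing" feature named in
# the commit message; exact method names and signatures are assumptions.
dg.Datasets.list()                  # enumerate registered standard datasets
dg.Datasets.describe("basic/user")  # show summary and options for one dataset
```
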
1 parent 2d51200 · commit da1df6b

18 files changed: 3089 additions & 603 deletions

CHANGELOG.md

Lines changed: 4 additions & 4 deletions
@@ -1,18 +1,19 @@
-# Databricks Labs Data Generator Release Notes
+# Databricks Labs Synthetic Data Generator Release Notes

 ## Change History
 All notable changes to the Databricks Labs Data Generator will be documented in this file.

 ### Unreleased

-### Changed
+#### Changed
 * Modified data generator to allow specification of constraints to the data generation process
 * Updated documentation for generating text data.
 * Modified data distributions to use abstract base classes
 * Migrated data distribution tests to use `pytest`

-### Added
+#### Added
 * Added classes for constraints on the data generation via new package `dbldatagen.constraints`
+* Added support for standard data sets via the new package `dbldatagen.datasets`


 ### Version 0.3.6 Post 1
@@ -24,7 +25,6 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 #### Fixed
 * Fixed scenario where `DataAnalyzer` is used on dataframe containing a column named `summary`

-
 ### Version 0.3.6

 #### Changed

Pipfile.lock

Lines changed: 1176 additions & 563 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 12 additions & 0 deletions
@@ -53,6 +53,7 @@ used in other computations
 * plugin mechanism to allow use of 3rd party libraries such as Faker
 * Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
 * Generate synthetic data generation code from existing schema or data (experimental)
+* Use of standard datasets for quick generation of synthetic data

 Details of these features can be found in the online documentation -
 [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
@@ -110,6 +111,17 @@ in your environment.

 Once the library has been installed, you can use it to generate a data frame composed of synthetic data.

+The easiest way to use the data generator is to use one of the standard datasets, which can be further
+customized for your use case.
+
+```buildoutcfg
+import dbldatagen as dg
+df = dg.Datasets(spark, "basic/user").get(rows=1000_000).build()
+num_rows = df.count()
+```
+
+You can also define fully custom data sets using the `DataGenerator` class.
+
 For example

 ```buildoutcfg
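
Since the README snippet above calls `.build()` on the result of `get`, the object returned by `get` is the dataset's `DataGenerator` specification, so it can be extended before building. A minimal sketch of that customization; the `credit_score` column is purely illustrative and not part of the standard dataset:

```python
from pyspark.sql import SparkSession

import dbldatagen as dg

spark = SparkSession.builder.getOrCreate()  # or the ambient Databricks session

# Retrieve the standard dataset's specification without building it yet.
df_spec = dg.Datasets(spark, "basic/user").get(rows=100_000)

# Illustrative customization: widen the standard dataset with an extra column.
df_spec = df_spec.withColumn("credit_score", "int",
                             minValue=300, maxValue=850, random=True)

df = df_spec.build()
```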

dbldatagen/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -30,6 +30,7 @@
 from .utils import ensure, topologicalSort, mkBoundsList, coalesce_values, \
     deprecated, parse_time_interval, DataGenError, split_list_matching_condition, strip_margins, \
     json_value_from_path, system_time_millis
+
 from ._version import __version__
 from .column_generation_spec import ColumnGenerationSpec
 from .column_spec_options import ColumnSpecOptions
@@ -43,11 +44,12 @@
 from .text_generators import TemplateGenerator, ILText, TextGenerator
 from .text_generator_plugins import PyfuncText, PyfuncTextFactory, FakerTextFactory, fakerText
 from .html_utils import HtmlUtils
+from .datasets_object import Datasets

 __all__ = ["data_generator", "data_analyzer", "schema_parser", "daterange", "nrange",
            "column_generation_spec", "utils", "function_builder",
            "spark_singleton", "text_generators", "datarange", "datagen_constants",
-           "text_generator_plugins", "html_utils"
+           "text_generator_plugins", "html_utils", "datasets_object"
           ]
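
With `Datasets` re-exported from the package root by the `from .datasets_object import Datasets` line above, either import style works:

```python
import dbldatagen as dg            # then dg.Datasets(spark, "basic/user")
from dbldatagen import Datasets    # or Datasets(spark, "basic/user") directly
```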

dbldatagen/datasets/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+from .dataset_provider import DatasetProvider, dataset_definition
+from .basic_user import BasicUserProvider
+from .multi_table_telephony_provider import MultiTableTelephonyProvider
+
+__all__ = ["dataset_provider",
+           "basic_user",
+           "multi_table_telephony_provider"
+          ]
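
The commit message mentions "plugin mechanics": providers register themselves via the `dataset_definition` decorator, as `BasicUserProvider` below illustrates. A minimal sketch of a third-party provider following the same pattern; the `demo/products` name and its columns are hypothetical:

```python
from dbldatagen.datasets import DatasetProvider, dataset_definition


# Hypothetical provider; `autoRegister=True` is assumed to make it resolvable
# via dg.Datasets(spark, "demo/products"), mirroring how basic/user registers.
@dataset_definition(name="demo/products", summary="Product Catalog", autoRegister=True)
class ProductCatalogProvider(DatasetProvider.NoAssociatedDatasetsMixin, DatasetProvider):
    def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1, partitions=-1,
                          **options):
        import dbldatagen as dg

        if rows is None or rows < 0:
            rows = DatasetProvider.DEFAULT_ROWS
        if partitions is None or partitions < 0:
            partitions = self.autoComputePartitions(rows, 2)

        # Two illustrative columns; any dbldatagen column spec could go here.
        return (dg.DataGenerator(sparkSession=sparkSession, rows=rows, partitions=partitions)
                .withColumn("product_id", "long", minValue=1, maxValue=1_000_000)
                .withColumn("price", "decimal(10,2)", minValue=1.0, maxValue=999.99, random=True))
```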

dbldatagen/datasets/basic_user.py

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+from .dataset_provider import DatasetProvider, dataset_definition
+
+
+@dataset_definition(name="basic/user", summary="Basic User Data Set", autoRegister=True, supportsStreaming=True)
+class BasicUserProvider(DatasetProvider.NoAssociatedDatasetsMixin, DatasetProvider):
+    """
+    Basic User Data Set
+    ===================
+
+    This is a basic user data set with customer id, name, email, ip address, and phone number.
+
+    It takes the following options when retrieving the table:
+        - random: if True, generates random data
+        - dummyValues: number of additional dummy value columns to generate (to widen row size if necessary)
+        - rows: number of rows to generate. Default is 100000
+        - partitions: number of partitions to use. If -1, it will be computed based on the number of rows
+        -
+
+    As the data specification is a DataGenerator object, you can add further columns to the data set and
+    add constraints (when the feature is available).
+
+    Note that this dataset does not use any features that would prevent it from being used as a source for a
+    streaming dataframe, and so the flag `supportsStreaming` is set to True.
+    """
+    MAX_LONG = 9223372036854775807
+    COLUMN_COUNT = 5
+
+    @DatasetProvider.allowed_options(options=["random", "dummyValues"])
+    def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1, partitions=-1,
+                          **options):
+        import dbldatagen as dg
+
+        generateRandom = options.get("random", False)
+        dummyValues = options.get("dummyValues", 0)
+
+        if rows is None or rows < 0:
+            rows = DatasetProvider.DEFAULT_ROWS
+
+        if partitions is None or partitions < 0:
+            partitions = self.autoComputePartitions(rows, self.COLUMN_COUNT + dummyValues)
+
+        assert tableName is None or tableName == DatasetProvider.DEFAULT_TABLE_NAME, "Invalid table name"
+        df_spec = (
+            dg.DataGenerator(sparkSession=sparkSession, rows=rows,
+                             partitions=partitions,
+                             randomSeedMethod="hash_fieldname")
+            .withColumn("customer_id", "long", minValue=1000000, maxValue=self.MAX_LONG, random=generateRandom)
+            .withColumn("name", "string",
+                        template=r'\w \w|\w \w \w', random=generateRandom)
+            .withColumn("email", "string",
+                        template=r'\w.\w@\w.com|\w@\w.co.u\k', random=generateRandom)
+            .withColumn("ip_addr", "string",
+                        template=r'\n.\n.\n.\n', random=generateRandom)
+            .withColumn("phone", "string",
+                        template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd',
+                        random=generateRandom)
+        )
+
+        if dummyValues > 0:
+            df_spec = df_spec.withColumn("dummy", "long", random=True, numColumns=dummyValues,
+                                         minValue=1, maxValue=self.MAX_LONG)
+
+        return df_spec