[SYNPY-1578] DatasetCollection OOP Model #1189

Merged
merged 38 commits into from
Apr 17, 2025
a084116
adds initial DatasetCollection implementation
BWMac Apr 10, 2025
2d4b8e6
adds unit tests
BWMac Apr 10, 2025
3d14cd7
pre-commit
BWMac Apr 10, 2025
cd9e910
updates docstrings
BWMac Apr 10, 2025
e1a1e8a
adds integration tests
BWMac Apr 11, 2025
41eaf74
adds docs pages
BWMac Apr 11, 2025
25f12ff
removes example script section from dataset documentation
BWMac Apr 11, 2025
3a5b017
adds dataset collection tutorial
BWMac Apr 11, 2025
2ffdeab
fixes tutorial script
BWMac Apr 11, 2025
c43a172
adds tutorial path to mkdocs.yml
BWMac Apr 11, 2025
0b40a21
bullet points
BWMac Apr 11, 2025
73baab0
fixes tutorial code lines
BWMac Apr 11, 2025
9d2984e
fixes tutorial references
BWMac Apr 11, 2025
c4866f6
test doc format fix
BWMac Apr 14, 2025
6c196a3
fixes dataset docs
BWMac Apr 14, 2025
daedf46
fixes sync integration tests
BWMac Apr 14, 2025
84a73e3
fixes DatasetCollection docstrings
BWMac Apr 14, 2025
60fa4f7
refactors entity factory
BWMac Apr 14, 2025
cd208d6
fixes argument error
BWMac Apr 14, 2025
3a70496
updates test strings
BWMac Apr 14, 2025
b7e728e
Merge branch 'develop' into synpy-1578-oop-model-dataset-collection
BWMac Apr 14, 2025
7512c0e
pre-commit
BWMac Apr 14, 2025
e0e82bd
Update docs/tutorials/python/dataset_collection.md
BWMac Apr 15, 2025
27a7656
updates tutorials
BWMac Apr 15, 2025
cd775a5
removes elif block
BWMac Apr 15, 2025
0b2603f
pre-commit
BWMac Apr 15, 2025
66c562a
removes unused cleanup
BWMac Apr 15, 2025
87950e9
updates version handling and tests
BWMac Apr 15, 2025
1368c7a
fix async tests
BWMac Apr 15, 2025
5d39a79
addresses comments
BWMac Apr 15, 2025
f279036
fixes docstrings
BWMac Apr 15, 2025
d63b212
adds retry logic for uncaught async jobs
BWMac Apr 15, 2025
345a2ee
set max on timeout
BWMac Apr 15, 2025
7a993ba
addresses comments
BWMac Apr 15, 2025
fe17606
updates unit test for version num
BWMac Apr 16, 2025
80edc7e
fixes incorrect line number
BWMac Apr 16, 2025
6bc4374
adds missing snapshot tests
BWMac Apr 16, 2025
f646dc1
corrects type hint
BWMac Apr 16, 2025
31 changes: 31 additions & 0 deletions docs/reference/experimental/async/dataset_collection.md
@@ -0,0 +1,31 @@
# Dataset Collection

Contained within this file are experimental interfaces for working with the Synapse Python
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
at your own risk.

## API reference

::: synapseclient.models.DatasetCollection
    options:
        inherited_members: true
        members:
        - add_item_async
        - remove_item_async
        - store_async
        - get_async
        - delete_async
        - update_rows_async
        - snapshot_async
        - query_async
        - query_part_mask_async
        - add_column
        - delete_column
        - reorder_column
        - rename_column
        - get_permissions
        - get_acl
        - set_permissions
---
::: synapseclient.models.EntityRef
---
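
The first nine members listed for `DatasetCollection` above are the `_async` counterparts of the synchronous methods documented on the sync reference page. A small sketch of that naming relationship, using only the member names that appear on the two pages; `to_async_name` is a hypothetical helper for illustration and is not part of `synapseclient`.

```python
# Synchronous member names, as listed on the sync reference page.
SYNC_IO_MEMBERS = [
    "add_item",
    "remove_item",
    "store",
    "get",
    "delete",
    "update_rows",
    "snapshot",
    "query",
    "query_part_mask",
]


def to_async_name(sync_name):
    """Hypothetical helper: map a sync member name to its async counterpart."""
    return f"{sync_name}_async"


async_members = [to_async_name(name) for name in SYNC_IO_MEMBERS]
print(async_members)
```

The remaining members on this page (`add_column` through `set_permissions`) are listed under the same names on both pages.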
10 changes: 0 additions & 10 deletions docs/reference/experimental/sync/dataset.md
@@ -4,16 +4,6 @@ Contained within this file are experimental interfaces for working with the Syna
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
at your own risk.

-## Example Script:
-
-<details class="quote">
-<summary>Working with Synapse datasets</summary>
-
-```python
-{!docs/scripts/object_orientated_programming_poc/oop_poc_dataset.py!}
-```
-</details>
-
## API reference

::: synapseclient.models.Dataset
31 changes: 31 additions & 0 deletions docs/reference/experimental/sync/dataset_collection.md
@@ -0,0 +1,31 @@
# Dataset Collection

Contained within this file are experimental interfaces for working with the Synapse Python
Client. Unless otherwise noted these interfaces are subject to change at any time. Use
at your own risk.

## API reference

::: synapseclient.models.DatasetCollection
    options:
        inherited_members: true
        members:
        - add_item
        - remove_item
        - store
        - get
        - delete
        - update_rows
        - snapshot
        - query
        - query_part_mask
        - add_column
        - delete_column
        - reorder_column
        - rename_column
        - get_permissions
        - get_acl
        - set_permissions
---
::: synapseclient.models.EntityRef
---
24 changes: 12 additions & 12 deletions docs/tutorials/python/dataset.md
@@ -1,7 +1,7 @@
# Datasets
Datasets in Synapse are a way to organize, annotate, and publish sets of files for others to use. Datasets behave similarly to Tables and EntityViews, but provide some default behavior that makes it easy to put a group of files together.

-This tutorial will walk through basics of working with datasets using the Synapse Python client.
+This tutorial will walk through basics of working with datasets using the Synapse Python Client.

# Tutorial Purpose
In this tutorial, you will:
@@ -29,15 +29,15 @@ In this tutorial, you will:
Let's get started by authenticating with Synapse and retrieving the ID of your project.

```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=17-23}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=3-24}
```

## 2. Create your Dataset

Next, we will create the dataset. We will use the project ID to tell Synapse where we want the dataset to be created. After this step, we will have a Dataset object with all of the needed information to start building the dataset.

```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=27-28}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=29-30}
```

Because we haven't added any files to the dataset yet, it will be empty, but if you view the dataset's schema in the UI, you will notice that datasets come with default columns that help to describe each file that we add to the dataset.
@@ -50,20 +50,20 @@ Let's add some files to the dataset now. There are three ways to add files to a

1. Add an Entity Reference to a file with its ID and version
```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=32-34}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=34-36}
```
2. Add a File with its ID and version
```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=36-38}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=38-40}
```
3. Add a Folder. When adding a folder, all child files inside of the folder are added to the dataset recursively.
```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=40-42}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=42-44}
```

Whenever we make changes to the dataset, we need to call the `store()` method to save the changes to Synapse.
```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=44}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=46}
```

And now we are able to see our dataset with all of the files that we added to it.
@@ -75,37 +75,37 @@ And now we are able to see our dataset with all of the files that we added to it
Now that we have a dataset with some files in it, we can retrieve the dataset from Synapse the next time we need to use it.

```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=48-50}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=50-52}
```

## 5. Query the dataset

Now that we have a dataset with some files in it, we can query the dataset to find files that match certain criteria.

```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=54-57}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=56-59}
```

## 6. Add a custom column to the dataset

We can also add a custom column to the dataset. This will allow us to annotate files in the dataset with additional information.

```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=61-67}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=63-69}
```

Our custom column isn't all that useful empty, so let's update the dataset with some values.

```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=70-78}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=72-80}
```

## 7. Save a snapshot of the dataset

Finally, let's save a snapshot of the dataset. This creates a read-only version of the dataset that captures the current state of the dataset and can be referenced later.

```python
-{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=82-86}
+{!docs/tutorials/python/tutorial_scripts/dataset.py!lines=84-88}
```

## Source Code for this Tutorial
112 changes: 112 additions & 0 deletions docs/tutorials/python/dataset_collection.md
@@ -0,0 +1,112 @@
# Dataset Collections
Dataset Collections are a way to organize, annotate, and publish sets of datasets for others to use. Dataset Collections behave similarly to Tables and EntityViews, but provide some default behavior that makes it easy to put a group of datasets together.

This tutorial will walk through the basics of working with Dataset Collections using the Synapse Python Client.

# Tutorial Purpose
In this tutorial, you will:

- Create a Dataset Collection
- Add datasets to the collection
- Add a custom column to the collection
- Update the collection with new annotations
- Query the collection
- Save a snapshot of the collection

# Prerequisites
* This tutorial assumes that you have a project in Synapse and have already created datasets that you would like to add to a Dataset Collection.
* If you need help creating datasets, you can refer to the [dataset tutorial](./dataset.md).
* Pandas must be installed as shown in the [installation documentation](../installation.md).

## 1. Get the ID of your Synapse project

Let's get started by authenticating with Synapse and retrieving the ID of your project.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=3-16}
```

## 2. Create your Dataset Collection

Next, we will create the Dataset Collection using the project ID to tell Synapse where we want the Dataset Collection to be created. After this step, we will have a Dataset Collection object with all of the necessary information to start building the collection.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=25-33}
```

Because we haven't added any datasets to the collection yet, it will be empty, but if you view the Dataset Collection's schema in the UI, you will notice that Dataset Collections come with default columns.

![Dataset Collection Default Schema](./tutorial_screenshots/dataset_collection_default_schema.png)

## 3. Add Datasets to the Dataset Collection

Now, let's add some datasets to the collection. We will loop through our dataset IDs and add each dataset to the collection using the `add_item` method.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=37-38}
```

Whenever we make changes to the Dataset Collection, we need to call the `store()` method to save the changes to Synapse.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=40}
```

And now we are able to see our Dataset Collection with all of the datasets that we added to it.

![Dataset Collection with Datasets](./tutorial_screenshots/dataset_collection_with_datasets.png)

## 4. Retrieve the Dataset Collection

Now that our Dataset Collection has been created and we have added some Datasets to it, we can retrieve the Dataset Collection from Synapse the next time we need to use it.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=44-46}
```

## 5. Add a custom column to the Dataset Collection

In addition to the default columns, you may want to annotate items in your Dataset Collection using custom columns.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=50-56}
```

Our custom column isn't all that useful empty, so let's update the Dataset Collection with some values.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=59-67}
```
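
In the tutorial script, `update_rows` is called with `primary_keys=["id"]`, so the values need one annotation per dataset ID. A minimal sketch of building equal-length columns with plain Python; the IDs are the placeholder IDs from the tutorial script, and `annotation_columns` is a hypothetical helper name, not part of the client.

```python
# Placeholder IDs from the tutorial script; replace with your dataset IDs.
DATASET_IDS = ["syn65987017", "syn65987019", "syn65987020"]


def annotation_columns(dataset_ids, value):
    """Return column-oriented data with one annotation value per dataset ID."""
    # Multiply the list, not the string: [value] * 3 yields three values,
    # while [value * 3] yields a single concatenated string.
    return {
        "id": list(dataset_ids),
        "my_annotation": [value] * len(dataset_ids),
    }


columns = annotation_columns(DATASET_IDS, "good dataset")
print(columns["my_annotation"])
```

Because both columns have the same length, the dictionary can be passed to `pd.DataFrame(...)` in the same way the script builds `modified_data`.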

## 6. Query the Dataset Collection

If you want to query your Dataset Collection for items that match certain criteria, you can do so using the `query` method.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=71-74}
```
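
The query passed to `query` is a plain string that names the collection by its Synapse ID. A rough sketch of assembling such a string; `build_query` is a hypothetical helper, `syn123` is a made-up ID, and the sketch assumes the filter value contains no single quotes because it does no escaping.

```python
import re

# Synapse IDs have the form "syn" followed by digits.
SYN_ID_PATTERN = re.compile(r"^syn\d+$")


def build_query(collection_id, columns, where_column, where_value):
    """Hypothetical helper: build a SELECT string like the one in the tutorial script."""
    if not SYN_ID_PATTERN.match(collection_id):
        raise ValueError(f"Not a Synapse ID: {collection_id!r}")
    column_list = ", ".join(columns)
    # No quote escaping is done here; the value is assumed to be a simple string.
    return (
        f"SELECT {column_list} FROM {collection_id} "
        f"WHERE {where_column} = '{where_value}'"
    )


query = build_query("syn123", ["id", "my_annotation"], "my_annotation", "good dataset")
print(query)
```

The resulting string would be passed as the `query` argument, with the collection's own `id` in place of the made-up one.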

## 7. Save a snapshot of the Dataset Collection

Finally, let's save a snapshot of the Dataset Collection. This creates a read-only version that captures the collection's current state and can be referenced later.

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!lines=77}
```

## Source Code for this Tutorial

<details class="quote">
<summary>Click to show me</summary>

```python
{!docs/tutorials/python/tutorial_scripts/dataset_collection.py!}
```
</details>
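
Each step above embeds a range of lines from this script, for example `lines=37-38` for the `add_item` loop. A minimal sketch of that kind of slicing, assuming line ranges are 1-indexed and inclusive, which is consistent with the excerpts used in this tutorial; `slice_lines` is a hypothetical helper and not the include extension's own code.

```python
def slice_lines(text, first, last=None):
    """Return lines first..last of text, counting from 1 and including both ends."""
    if last is None:
        last = first  # a single line number, as in lines=40
    lines = text.splitlines()
    return "\n".join(lines[first - 1 : last])


# A stand-in ten-line "script" so the sketch runs without any files.
script = "\n".join(f"line {n}" for n in range(1, 11))

print(slice_lines(script, 3, 5))
print(slice_lines(script, 7))
```

This is why adding or removing a line in the script means the `lines=` ranges in the tutorial need to be rechecked.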

## References
- [DatasetCollection](../../reference/experimental/sync/dataset_collection.md)
- [Dataset](../../reference/experimental/sync/dataset.md)
- [Project](../../reference/experimental/sync/project.md)
- [Column][synapseclient.models.Column]
- [syn.login][synapseclient.Synapse.login]
6 changes: 4 additions & 2 deletions docs/tutorials/python/tutorial_scripts/dataset.py
@@ -17,9 +17,11 @@
syn = Synapse()
syn.login()

-project = Project(name="My Testing Project").get()  # Replace with your project name
+project = Project(
+    name="My uniquely named project about Alzheimer's Disease"
+).get()  # Replace with your project name
project_id = project.id
-print(project_id)
+print(f"My project ID is {project_id}")

# Next, let's create the dataset. We'll use the project id as the parent id.
# To begin, the dataset will be empty, but if you view the dataset's schema in the UI,
77 changes: 77 additions & 0 deletions docs/tutorials/python/tutorial_scripts/dataset_collection.py
@@ -0,0 +1,77 @@
"""Here is where you'll find the code for the DatasetCollection tutorial."""

import pandas as pd

from synapseclient import Synapse
from synapseclient.models import Column, ColumnType, Dataset, DatasetCollection, Project

# First, let's get the project that we want to create the DatasetCollection in
syn = Synapse()
syn.login()

project = Project(
    name="My uniquely named project about Alzheimer's Disease"
).get()  # Replace with your project name
project_id = project.id
print(f"My project ID is {project_id}")

# This tutorial assumes that you have already created datasets that you would like to add to a DatasetCollection.
# If you need help creating datasets, you can refer to the dataset tutorial.

# For this example, we will be using datasets already created in the project.
# Let's create the DatasetCollection. We'll use the project id as the parent id.
# At first, the DatasetCollection will be empty, but if you view the DatasetCollection's schema in the UI,
# you will notice that DatasetCollections come with default columns.
DATASET_IDS = [
    "syn65987017",
    "syn65987019",
    "syn65987020",
]  # Replace with your dataset IDs
test_dataset_collection = DatasetCollection(
    parent_id=project_id, name="test_dataset_collection"
).store()
print(f"My DatasetCollection's ID is {test_dataset_collection.id}")

# Now, let's add some datasets to the collection. We will loop through our dataset ids and add each dataset to the
# collection using the `add_item` method.
for dataset_id in DATASET_IDS:
    test_dataset_collection.add_item(Dataset(id=dataset_id).get())
# Our changes won't be persisted to Synapse until we call the `store` method on our DatasetCollection.
test_dataset_collection.store()

# Now that our DatasetCollection with all of our datasets has been created, the next time we want to use it,
# we can retrieve it from Synapse.
my_retrieved_dataset_collection = DatasetCollection(id=test_dataset_collection.id).get()
print(f"My DatasetCollection's ID is still {my_retrieved_dataset_collection.id}")
print(f"My DatasetCollection has {len(my_retrieved_dataset_collection.items)} items")

# In addition to the default columns, you may want to annotate items in your DatasetCollection using
# custom columns.
my_retrieved_dataset_collection.add_column(
    column=Column(
        name="my_annotation",
        column_type=ColumnType.STRING,
    )
)
my_retrieved_dataset_collection.store()

# Now that our custom column has been added, we can update the DatasetCollection with new annotations.
modified_data = pd.DataFrame(
    {
        "id": DATASET_IDS,
        "my_annotation": ["good dataset"] * len(DATASET_IDS),
    }
)
my_retrieved_dataset_collection.update_rows(
    values=modified_data, primary_keys=["id"], dry_run=False
)

# If you want to query your DatasetCollection for items that match certain criteria, you can do so
# using the `query` method.
rows = my_retrieved_dataset_collection.query(
    query=f"SELECT id, my_annotation FROM {my_retrieved_dataset_collection.id} WHERE my_annotation = 'good dataset'"
)
print(rows)

# Create a snapshot of the DatasetCollection
my_retrieved_dataset_collection.snapshot(comment="test snapshot")