This repository accompanies the preprint Dissecting regulatory syntax in human development with scalable multiomics and deep learning (Liu*, Jessa*, Kim*, Ng*, ..., Kundaje+, Farh+, Greenleaf+, bioRxiv, 2025).
* equal contribution
+ co-corresponding
- The repository is on GitHub here, and you can view a rendered version here
- This repository primarily contains code; see the Data availability section for links to data, and our documentation here for download instructions and explanations of data formats
- Jump to the Code to produce the figures section for links to code and rendered HTMLs for the analyses presented in each figure
- Codebase
- Code to produce the figures
- Data availability
- Installation and system requirements
- Demo
- Citation
## Codebase

This repository is meant to complement the Materials & Methods section by providing code for the custom analyses in the manuscript, in order to improve reproducibility of the main results. However, it is not a fully executable workflow.
- `code/` --> pipelines, scripts, and analysis notebooks for data processing and analysis
  - `utils/` --> `.R` files with custom functions and palettes used throughout the analysis
  - `01-preprocessing/`
    - `01-snakemake/` --> config files for processing raw bcl files into fragment files and count matrices
    - `02-archr_seurat_scripts/` --> per-organ preprocessing scripts to create final Seurat objects and ArchR projects
    - `03-global/` --> creation of global objects (e.g. global peak set, marker genes)
  - `02-global_analysis/`
    - `01` --> global QC and metadata visualizations per organ and per sample
    - `02`, `03` --> construction of a dendrogram based on cell type similarity
    - `04` --> calculation of TF expression levels
  - `03-chrombpnet/` (detailed README here)
    - `00` --> prepare inputs for training ChromBPNet models
    - `01` --> train and interpret ChromBPNet models
    - `02` --> assembly of the motif compendium/lexicon
    - `03` --> downstream analysis of ChromBPNet models and motif syntax/synergy
  - `04-enhancers/`
    - `01` --> export global accessible candidate cis-regulatory elements (acCREs)
    - `02` --> convert fragment files to tagAlign format for running the Activity-By-Contact (ABC) model
    - `03` --> ABC workflow config files
    - `04` --> acCRE co-accessibility analysis
    - `05` --> acCRE peak-to-gene linkage analysis
    - `06` --> acCRE ABC enhancer-to-promoter linkage analysis
    - `07` --> overlap of HDMA acCREs with ENCODE v4 cCREs
    - `08`, `09` --> overlap of HDMA acCREs with VISTA enhancers
  - `05-misc/`
    - `01` --> create the global BPCells object
    - `02` --> examples of plotting tracks using BPCells
    - `04` --> examples of ChromBPNet use cases, including how to load models and make predictions
  - `06-variants/`
    - `00` to `03` --> analyses related to eQTLs
    - `04` to `05` --> causal variant analysis with gchromVAR
    - `06` --> variant scoring using ChromBPNet models
    - `07a` to `07c` --> plot variant scoring results
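As context for the variant-scoring steps in `06-variants`, a common ChromBPNet-style variant effect summary is the log fold change of total predicted accessibility between alleles. The sketch below uses made-up predicted profiles, and the repository's actual scoring code may compute additional or different statistics:

```python
import numpy as np

# Hypothetical predicted accessibility profiles (per-base values)
# for the reference and alternate allele of one variant.
rng = np.random.default_rng(1)
pred_ref = rng.uniform(0.0, 2.0, size=1000)
pred_alt = pred_ref.copy()
pred_alt[490:510] *= 3.0  # pretend the alt allele strengthens a motif

# Log fold change of total predicted coverage: one common
# summary statistic for scoring variant effects.
lfc = np.log2(pred_alt.sum() / pred_ref.sum())
print(lfc > 0)  # True: alt allele increases predicted accessibility
```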
Our pipeline for processing SHARE-seq data is available at https://github.com/GreenleafLab/shareseq-pipeline (v1.0.0).
## Code to produce the figures

Code to reproduce analyses is saved in `code`. The table below contains pointers to code for the key analyses associated with each figure. The links in the Analysis column lead to rendered HTMLs, where possible, and the links in the Path column lead to scripts or notebooks within the repository.
## Data availability

All data and analysis products (including fragment files, count matrices, cell annotations, global acCRE annotations, ChromBPNet models, the motif lexicon, motif instances, and genomic tracks) are deposited at https://zenodo.org/communities/hdma. A list of all data types and the corresponding URLs and DOIs is provided in Table S14 of the manuscript.

We provide a detailed description of the main data types deposited on Zenodo here, along with a demonstration of how to programmatically download files of interest.

All genomic tracks are also hosted online for interactive visualization with the WashU Genome Browser: https://epigenomegateway.wustl.edu/browser2022/?genome=hg38&hub=https://human-dev-multiome-atlas.s3.amazonaws.com/tracks/HDMA_trackhub.json. We demonstrate how to load tracks here.
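Programmatic downloads can be driven from the Zenodo REST API, which returns a JSON record whose `files` entries carry a filename (`key`) and a download link. A minimal sketch is below; the record ID and filenames are placeholders, not real HDMA records, and the mocked response only illustrates the API's shape:

```python
import fnmatch
import json
from urllib.request import urlopen


def matching_file_urls(record_json, pattern):
    """Return (filename, url) pairs for files in a Zenodo record
    whose name matches a glob pattern."""
    return [
        (f["key"], f["links"]["self"])
        for f in record_json.get("files", [])
        if fnmatch.fnmatch(f["key"], pattern)
    ]


# In practice, fetch a record's metadata from the Zenodo REST API
# (the record ID below is a placeholder):
#   record = json.load(urlopen("https://zenodo.org/api/records/1234567"))
# Here we use a mocked response with the same shape:
record = {
    "files": [
        {"key": "Adrenal_fragments.tsv.gz",
         "links": {"self": "https://zenodo.org/api/records/1234567/files/Adrenal_fragments.tsv.gz/content"}},
        {"key": "Liver_fragments.tsv.gz",
         "links": {"self": "https://zenodo.org/api/records/1234567/files/Liver_fragments.tsv.gz/content"}},
    ]
}

for name, url in matching_file_urls(record, "Adrenal_*"):
    print(name)  # Adrenal_fragments.tsv.gz
```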
## Installation and system requirements

This repository can be cloned locally (expected time: several minutes, due to the large HTMLs included in the repository). Scripts are meant to be run on a Linux OS inside a Slurm-enabled HPC system. We performed our analysis on CentOS 7.9.2009.

To run the code, you will need to set up the appropriate R and python environments, for example using renv and conda, respectively.
- Analysis conducted in R was performed using R 4.1.2, and the package versions for each analysis are listed in the "Session info" section at the end of each rendered HTML (e.g. here).
- Analysis conducted in python depended on key packages and their own dependencies. Package environments were managed using conda, and the conda environment used for a particular analysis is typically specified in the associated Slurm submission script. A full list of package versions in each environment is located at `code/envs`.
Our repository is designed to enable reproducibility of the results by providing exact code and software/package versions, although it is not a fully executable workflow.
## Demo

Below, we describe the major inputs and outputs for each section of the code. To demo the code on a small dataset, we link here to inputs and outputs from the Adrenal organ (~3k cells total). Note that some analyses are, by definition, integrative, and won't be possible to execute with only one organ. All files are hosted on Zenodo (https://zenodo.org/communities/hdma/records?q=&l=list&p=1&s=10).

When outputs are not specified, the major outputs are typically plots and data visualizations, which can be viewed above in the section "Code to produce the figures". Code is intended to be run on a Slurm-enabled high-performance compute cluster allowing for parallel jobs. On a small dataset such as the Adrenal organ, expected execution time on an HPC is on the order of several days.
- `code/01-preprocessing`:
- `code/02-global_analysis`:
  - Inputs: Seurat object
- `code/03-chrombpnet`:
  - Inputs: ArchR project
  - Outputs:
    - Trained ChromBPNet models in h5 format, five folds per cell type
    - Basepair-resolution contribution scores in h5 format, one per cell type
    - Bigwig tracks viewable in a genome browser
    - De novo motifs in h5 format, one per cell type
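Since models are trained in five folds per cell type, a typical downstream step is averaging across folds before interpretation. The sketch below uses toy arrays with illustrative shapes (the actual h5 layout and array dimensions of the deposited files may differ): fold-averaged contribution scores are projected onto the one-hot-encoded sequence to obtain a per-base importance track.

```python
import numpy as np

# Illustrative shapes: 5 folds x 4 bases x L positions of
# hypothetical contribution scores for one genomic region.
rng = np.random.default_rng(0)
L = 10
per_fold_contribs = rng.normal(size=(5, 4, L))

# One-hot encoded sequence for the same region (4 x L).
seq_onehot = np.eye(4)[rng.integers(0, 4, size=L)].T

# Average contribution scores across the five folds...
mean_contribs = per_fold_contribs.mean(axis=0)

# ...then keep only the contribution at the observed base,
# yielding one importance value per position.
per_base_importance = (mean_contribs * seq_onehot).sum(axis=0)

print(per_base_importance.shape)  # (10,)
```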
- `code/04-enhancers`:
  - Inputs:
  - Outputs:
    - ABC loops in TSV format, one per cell type
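The ABC loop tables are plain TSVs and can be read with standard tooling. A minimal sketch using Python's `csv` module is below; the column names and score threshold are illustrative, so check the header of the actual deposited files:

```python
import csv
import io

# Toy ABC-style loop table; real files may use different column names.
tsv = """chr\tstart\tend\tTargetGene\tABC.Score
chr1\t1000\t1500\tGENE1\t0.42
chr1\t5000\t5600\tGENE2\t0.07
"""

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Keep enhancer-gene links above an example score threshold.
high_conf = [r for r in rows if float(r["ABC.Score"]) >= 0.1]
print([r["TargetGene"] for r in high_conf])  # ['GENE1']
```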
- `code/06-variants`:
We provide a few notebooks with examples of how to interact with HDMA data, analysis outputs, and trained models:

- How to download specific files or data for specific cell types from across the Zenodo records: DATA.md (html)
- Plotting genomic tracks using BPCells: `code/05-misc/02-bp_cells_plotting_examples.Rmd` (html)
- Use cases for ChromBPNet models and outputs, including visualizing predicted accessibility and contribution scores at a region of interest, loading models, making new predictions, and predicting variant effects: `code/05-misc/04-ChromBPNet_use_cases.ipynb` (html)
## Citation

If you use this data or code, please cite:
Dissecting regulatory syntax in human development with scalable multiomics and deep learning. Betty B. Liu, Selin Jessa, Samuel H. Kim, Yan Ting Ng, Soon il Higashino, Georgi K. Marinov, Derek C. Chen, Benjamin E. Parks, Li Li, Tri C. Nguyen, Sean K. Wang, Austin T. Wang, Serena Y. Tan, Michael Kosicki, Len A. Pennacchio, Eyal Ben-David, Anca M. Pasca, Anshul Kundaje, Kyle K.H. Farh, William J. Greenleaf, bioRxiv 2025.04.30.651381; doi: https://doi.org/10.1101/2025.04.30.651381