scATAC Walkthrough
About
ABC can predict cell-type-specific enhancer-gene interactions using only the chromatin accessibility profile obtained through scATAC-seq. Given the sparsity of scATAC data, we recommend pseudo-bulking the scATAC fragments by cell cluster. Each pseudo-bulked scATAC profile should contain at least 3 million (preferably 6 million) unique fragments to ensure optimal performance.
We’ll walk through an example of running a scATAC dataset through ABC.
1. Converting to TagAlign
Prerequisite
Make sure you’ve followed the environment setup steps in Getting Started and activate your abc conda environment.
When working with scATAC files, you may start off with a fragment file. Since ABC works with tagAlign file format, we’ll want to convert our fragment file to tagAlign. If you have a tagAlign file already, great! You can skip to the stage where we create the biosamples table.
Let’s download an example fragment file from the internet
(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ wget https://www.encodeproject.org/files/ENCFF794UXO/@@download/ENCFF794UXO.tar.gz
(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ tar -xf ENCFF794UXO.tar.gz
(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ ls encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments/
fragments.tsv.gz fragments.tsv.gz.tbi
Let’s convert the fragment file to .tagAlign and index it
(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ cd encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments
(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments] (job 37050919) $ LC_ALL=C zcat fragments.tsv.gz | sed '/^#/d' | awk -v OFS='\t' '{mid=int(($2+$3)/2); print $1,$2,mid,"N",1000,"+"; print $1,mid+1,$3,"N",1000,"-"}' | sort -k 1,1V -k 2,2n -k3,3n --parallel 5 | bgzip -c > tagAlign.gz # Adjust --parallel 5 based on number of cpus you have. The more cpus, the faster
(abc-env) [atan5133@sh03-04n24 /oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments] (job 37151429) $ tabix -p bed tagAlign.gz
Sorting the tagAlign typically takes a while and varies based on the size of the file. Setting LC_ALL=C and adding parallelization makes the sorting much faster.
2. Creating Your Config
Once we have the tagAlign file, we’ll create our biosample_table config. Let’s assume we don’t have any HiC data, so we’ll opt to use powerlaw as our estimate for contact. Our biosamples config would look like the following:
biosample |
DHS |
ATAC |
H3K27ac |
default_accessibility_feature |
HiC_file |
HiC_type |
HiC_resolution |
alt_TSS |
alt_genes |
|---|---|---|---|---|---|---|---|---|---|
K562 |
/oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments/tagAlign.gz |
ATAC |
Update the ABC Config to point the biosample tsv file you just created
3. Running ABC
(abc-env) [atan5133@sh02-ln01 login /oak/stanford/groups/engreitz/Users/atan5133/
ABC-Enhancer-Gene-Prediction]$ snakemake -j1
All your results will go to the results directory of your ABC directory!
The actual predictions will be stored under the results/{biosample}/Predictions folder.