scATAC Walkthrough

About

ABC can predict cell-type-specific enhancer-gene interactions using only the chromatin accessibility profile obtained through scATAC-seq. Given the sparsity of scATAC data, we recommend pseudo-bulking the scATAC fragments by cell cluster. Each pseudo-bulked scATAC profile should contain at least 3 million (preferably 6 million) unique fragments to ensure optimal performance.

We’ll walk through an example of running a scATAC dataset through ABC.

1. Converting to TagAlign

Prerequisite

Make sure you’ve followed the environment setup steps in Getting Started and activate your abc conda environment.

When working with scATAC files, you may start off with a fragment file. Since ABC works with tagAlign file format, we’ll want to convert our fragment file to tagAlign. If you have a tagAlign file already, great! You can skip to the stage where we create the biosamples table.

Let’s download an example fragment file from the internet

(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ wget https://www.encodeproject.org/files/ENCFF794UXO/@@download/ENCFF794UXO.tar.gz

(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ tar -xf ENCFF794UXO.tar.gz

(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ ls encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments/
    fragments.tsv.gz  fragments.tsv.gz.tbi

Let’s convert the fragment file to .tagAlign and index it

(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ cd encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments

(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments] (job 37050919) $ LC_ALL=C zcat fragments.tsv.gz | sed '/^#/d' | awk -v OFS='\t' '{mid=int(($2+$3)/2); print $1,$2,mid,"N",1000,"+"; print $1,mid+1,$3,"N",1000,"-"}' | sort -k 1,1V -k 2,2n -k3,3n --parallel 5 | bgzip -c > tagAlign.gz  # Adjust --parallel 5 based on number of cpus you have. The more cpus, the faster

(abc-env) [atan5133@sh03-04n24 /oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments] (job 37151429) $ tabix -p bed tagAlign.gz

Sorting the tagAlign typically takes a while and varies based on the size of the file. Setting LC_ALL=C and adding parallelization makes the sorting much faster.

2. Creating Your Config

Once we have the tagAlign file, we’ll create our biosample_table config. Let’s assume we don’t have any HiC data, so we’ll opt to use powerlaw as our estimate for contact. Our biosamples config would look like the following:

biosample	DHS	ATAC	H3K27ac	default_accessibility_feature	HiC_file	HiC_type	HiC_resolution	alt_TSS	alt_genes
K562		/oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments/tagAlign.gz		ATAC

Update the ABC Config to point the biosample tsv file you just created

3. Running ABC

(abc-env) [atan5133@sh02-ln01 login /oak/stanford/groups/engreitz/Users/atan5133/
ABC-Enhancer-Gene-Prediction]$ snakemake -j1

All your results will go to the results directory of your ABC directory! The actual predictions will be stored under the results/{biosample}/Predictions folder.