Scprep is a lightweight scRNA-seq toolkit for Python Data Scientists
Most scRNA-seq toolkits are written in R, but we develop our tools in Python. Currently, Scanpy is the most popular toolkit for scRNA-seq analysis in Python. In fact, if you’d prefer to use that, you can find most of our lab’s analytical methods including PHATE and MAGIC in scanpy. However, scanpy has a highly structured framework for data representation and a steep learning curve that is unnecessary for users already comfortable with the suite of methods available in Pandas, Scipy, and Sklearn.
To accommodate these users (including ourselves) we developed
scprep (Single Cell PREParation).
scprep makes it easier to use the Pandas / Scipy / Sklearn ecosystem for scRNA-seq analysis. Most of
scprep is composed of helper functions to perform tasks common to single cell data like loading counts matrices, filtering & normalizing cells by library size, and calculating common statistics. The key advantage of scprep is that data can be stored in Pandas DataFrames, Numpy arrays, Scipy Sparse Matices, it just works.
For users starting out, you might find it more valuable to spend time spent getting comfortable with these tools because they can be used for analysis of all kinds of data, not just scRNA-seq. If you want to learn more, checkout the Numpy Quickstart Tutorial, the Pandas Tutorials, the Matplotlib Tutorials, and consider getting a copy of Wes McKinney’s book Python for Data Analysis. McKinney is the original author of Pandas, and I’m glad I went through the book at the beginning of my PhD.
From here on out, I will assume that you have Python, Numpy, Pandas, Scipy, Scikit-learn, and Matplotlib all installed.
$ pip install --user scprep
scprep you’ll find tools for loading single cell data (
scprep.io), library size normalization (
scprep.normalize), transforming data (
scprep.transform), performing statistical calculations (
scprep.stats), and many others. Check out the screp documentation for the full API.
1.1 - Using scprep to load a 10X Genomics counts matrix
Most of the labs we work with use 10X Genomics for single cell sequencing, so we’ll use this as an example.
scprep.io also has helper functions for csv, tsv, and mtx files.
Say your 10X data is in
/home/user/data/my_sample. To load your data, run:
import scprep data_uf = scprep.io.load_10X('/home/user/data/my_sample/outs/filtered_gene_bc_matrices/my_genome/', sparse=True, gene_labels='both')
data_uf is my shorthand for ‘data_unfiltered’.
load_10X and other
scprep.io functions return a Pandas
SparseDataFrame by default. The
SparseDataFrame behaves similarly to regular Pandas
DataFrames, but take up much less memory by storing only non-zero values and their indices in memory. Note: operations are sparse matrices are slower than dense matrices. If you can afford to store the dense matrix in memory, set
As an aside, many genome annotations (the descriptions of where genes are located on a given genome) contain non-unique gene symbols, so with
gene_labels=both we store the gene symbol and the gene ID as
If you look at the head of your data, you should see something like this:
Now, let’s start preprocessing your data.
1.2 - Merging batches using
This is the point to merge batches if you have multiple samples from the same experiment that you want to compare. Although we’ll talk briefly about batch effects later, we’ll write a whole post about how to correct them at another point.
How to combine samples
Here, we are taking counts matrices from each sample and stacking them vertically. We’ve implemented this in the
For this example, I’m going to use the Embryoid Body timecourse covered in the PHATE tutorial.
Two important notes:
First, it’s important to know exactly which rows of the data matrix correspond to which sample so that we can separate them during downstream analysis. To facilitate this,
combine_batches() takes an list
batch_labels that contains one label per sample and returns an array
sample labels that contains the sample label for each row in the resulting matrix.
Second, it’s possible (and exceedingly common) for the same cell barcode to appear in multiple experiments. This becomes an issue when combining batches because you want each row to have a unique index. To solve this problem, we append the sample label to the cell barcode by setting
append_to_cell_names=True. This turns cell barcode
AAACATACCAGAGG-1 from sample
# Loading each of 5 timepoints T1 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T0_1A"), gene_labels='both') T2 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T2_3B"), gene_labels='both') T3 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T4_5C"), gene_labels='both') T4 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T6_7D"), gene_labels='both') T5 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T8_9E"), gene_labels='both') # Concatenating data matrices EBT_counts, sample_labels = scprep.utils.combine_batches( data=[T1, T2, T3, T4, T5], batch_labels=["Day0-3", "Day6-9", "Day12-15", "Day18-21", "Day24-27"], append_to_cell_names=True ) # remove objects from memory del T1, T2, T3, T4, T5
It’s good practice to remove the original data matrices from memory and avoid doubling the memory usage of our scripts.
1.3 - Filtering cells by library size
Why we filter cells by library size
In scRNA-seq the library size of a cell is the number of unique mRNA molecules detected in that cell. These unique molecules are identified using a random barcode incorporated during the first round of reverse transcription. This barcode is called a Unique Molecule Identifier, and often we refer to the number unique mRNAs in a cell as the number of UMIs. To read more about UMIs, Smith et al. (2017) write about how sequencing errors and PCR amplification errors lead to innaccurate quantification of UMIs/cell.
Depending on the method of scRNA-seq, the amount of library size filtering done can vary. The 10X Genomics CellRanger tool, the DropSeq and InDrops pipelines, and the Umitools package each have their own method and cutoff for determining real cells from empty droplets. You can take these methods at face value or set some manual cutoffs based on your data.
Visualing the library size distribution using
There is a helper function for plotting library size from a gene expression matrix in scprep called
Here we can see a typical library size for a 10X dataset from Datlinger et al. (2017). In this sample we see that there is a long tail of cells that have very high library sizes.
Selecting a cutoff
Several papers describe strategies for picking a maximum and minimum threshold that can be found with a quick google search for “library size threshold single cell RNA seq”.
Most of these pick an arbitrary measure such as a certain number of deviations below or above the mean or median library size. We find that spending too much time worrying about the exact threshold is inefficient.
For the above dataset, I would remove all cells with more than 25,000 UMI / cell in fear they might represent doublets of cells. I will generally also remove all cells with fewer than 500 reads per cell.
Filtering cells by library size
You can do this using
scprep.filter.filter_library_size(). The syntax looks like:
data = scprep.filter.filter_library_size(data_uf, cutoff=25000, keep_cells='below')
And now when we plot the library size we see:
Note that there are many different valid distributions of library sizes. See the following three libraries, all generated from different tissues in the zebrafish embryo. One of the libraries is from a failed experiment, and the other two are from published papers.
Can you tell which is which?
Here, Sample 1 is from Pandley et al. (2018), Sample 2 is an internal failed experiment, and Sample 3 is from Farrell et al. (2018). The low library size in Sample 2 is the giveaway with 90% of cells having fewer than 1200 UMI/cell and a mode at 325 UMI/cell. Additionally, this library generated a Low Fraction Reads in Cells alert in the Cell Ranger web summary with only 33% of reads assigned to cells. To learn more about what this means, read this post in the 10X Forums..
1.4 - Filtering lowly expressed genes
Why remove lowly expressed genes?
Capturing RNA from single cells is a noisy process. The first round of reverse transcription is done in the presence of cell lysate. This results in capture of only 10-40% of the mRNA molecules in a cell leading to a phenomenon called dropout where some lowly expressed genes are not detected in cells in which they are expressed [1, 2, 3, 4]. As a result, some genes are so lowly expressed (or expressed not at all) that we do not have sufficient observations of that gene to make any inferences on its expression.
Lowly expressed genes that may only be represented by a handful of mRNAs may not appear in a given dataset. Others might only be present in a small number of cells. Because we lack sufficient information about these genes, we remove lowly expressed genes from the gene expression matrix during preprocessing. Typically, if a gene is detected in fewer than 5 or 10 cells, it gets removed.
Here, we can see that in the T cell dataset from Datlinger et al. (2017), there are many genes that are detected in very few cells.
genes_per_cell = np.sum(t_cell_data > 0, axis=0) fig, ax = plt.subplots(1, figsize=(4,6)) ax.hist(genes_per_cell, bins=100) ax.set_xlabel('# cells in which gene is expressed') ax.set_ylabel('# of genes') ax.set_yscale('symlog') ax.set_title('Gene detection across cells') fig.tight_layout()
Here, around 16,000/35,000 genes are detected in fewer than 10 cells. We can remove these columns from the gene expression matrix moving forward.
How to remove lowly expressed genes using
The syntax to remove these genes is similar to filtering on library size. The scprep function is
scprep.filter.remove_rare_genes(). You can use it like this:
data = scprep.filter.remove_rare_genes(data, cutoff=0, min_cells=5)
Another advantage of removing rare genes from our counts matrix is that it reduces the dimensions of the matrix that we’re working with. Instead of needing to do computations over a matrix that is 10,000x35,000, we are now working with one that is 10,000x16,000.
1.5 - Library size normalization
Why library size normalize?
Now that we have our cells filtered by library size and removed lowly expressed genes from our dataset, it is time to normalize the data. Library size normalization is meant to align the scales of gene expression across cells that have different # of UMIs / cell. This is equivalent to comparing only the ratios of genes detected within a cell as opposed to comparing the absolute quantity of each RNA.
Mathematically, this process involves dividing the count of each gene in each cell by the # UMIs in that cell. Optionally, one may then multiply the gene expression in all cells by a constant, such as the median # UMIs /cell in an experiment. This is purely a stylistic choice, but makes interpreting gene expression values a little easier.
Normalizing library size using
This step is accomplished using the
scprep.normalize.library_size_normalize() function with the following syntax:
data_ln = scprep.normalize.library_size_normalize(data)
Here, I use
data_ln to refer to data that has been library size normalized.
1.6 - Removing cells with high mitochondrial gene expression
What does high mitochondrial gene expression indicate?
Generally, we assume that cells with high detection of mitochondrial RNAs have undergone degradation of the mitochondrial membrane as a result of apoptosis. This may be from stress during dissociation, culture, or really anywhere in the experimental pipeline. As with the high and low library size cells, we want to remove the long tail from the distribution. In a successful experiment, it’s typical for 5-10% of the cells to have this apoptotic signature.
Plotting mitochondrial expression
Let’s look at the mitochondrial expression from Datlinger et al. (2017). Here, the dashed line is the 95th percentile of mitochondrial expression.
# get mitochondrial genes mitochondrial_gene_list = np.array([g.startswith('MT-') for g in data_ln.columns]) # get expression mito_exp = data_ln.loc[:,mitochondrial_gene_list].mean(axis=1) # plotting fig, ax = plt.subplots(1, figsize=(6,5)) ax.hist(mito_exp, bins=100) ax.axvline(np.percentile(mito_exp, 95)) ax.set_xlabel('Mean mitochondrial expression') ax.set_ylabel('# of cells') ax.set_title('Mitochondrial expression') fig.tight_layout()
Removing cells with high mitochondrial expression using
Each genome and species will have its own list of mitochondrial genes and associated symbols. In human and mouse, these genes start with ‘MT-’. However, as there’s no standard, we didn’t want to include a function for filtering mitochondrial expression directly in
scprep. Instead, we provide the
scprep.filter.filter_values() function. This method takes
data and an array
values and removes all cells from
values is above or below the set
Here, this would look like:
# filter cells above 95th percentile data_ln = scprep.filter.filter_values(data_ln, mito_expression, percentile=95, keep_cells='below')
1.7 - Transforming data
Our final step for preprocessing is to transform the data so that it’s usable for the algorithms that we’re using later.
The purpose of transforming data is to make sure that each gene or feature in our counts matrix is counted equally. Because of math, if we’re calculating the Euclidean distance between two cells, genes that are more highly expressed (i.e. have larger values) will be considered more important and will have a larger impact on downstream analysis.
There are many transforms, but the two most common for scRNA-seq are the log-transform and the square-root transform. In CyTOF, the arcsinh transform is also popular. You can access all of these using
One note: the log-transform doesn’t like zeros, which are incredibly common in single cell datasets. To overcome this, people commonly add a pseudocount to their data, either 1 or a very small value called machine epsilon. Personally, I don’t like this because it skews the relationships between small values, which are a huge portion of single cell counts matrices.
Instead, we use the squareroot transform 99% of the time. The syntax couldn’t be simpler:
data_sq = scprep.transform.sqrt(data_ln)
We’re done preprocessing!
Congratulations, we’re ready to start visualizing!