As single-cell RNA-sequencing experiments become more popular, we keep hearing the same few questions: “How should I start analyzing my data?” “What’s the advantage of using PHATE over t-SNE?” “How should I cluster my data?” “How can I identify differentially expressed genes between these two clusters?”
High-throughput single-cell technologies are only a few years old, and there are not many (if any) standards for analyzing a dataset. That said, there are a few nice tutorials out there. Rahul Satija’s lab at the NY Genome Center, the home of the popular Seurat toolkit, has a collection of easy-to-follow tutorials. In May of this year, the Hemberg lab at the Sanger Institute posted a great single-cell course as part of the Bioinformatics Training Unit at Cambridge. John Marioni’s lab at EMBL-EBI and the University of Cambridge has a paper on F1000 on low-level analysis of single-cell RNA-seq data with Bioconductor. A quick Google search will reveal a handful of other tutorials.
I encourage reading widely and getting a sense of what other groups do. There are many ways to do single cell analysis. Some ways are definitely bad (don’t get us started on clustering on t-SNE dimensions), but usually the important thing is trying to understand how the methods you apply work and why you might pick one tool or another. In general, you should view the results of most of these analyses as hypotheses that need to be validated by the specific tools of your discipline.
This being said, in the Krishnaswamy lab, we have gone through hundreds of samples of scRNA-seq data and have learned a lot along the way. We have developed a growing number of novel methods for analyzing these datasets and extracting biological meaning. You can find all of our tools publicly available on our lab GitHub: https://github.com/krishnaswamylab/.
In this introductory post, my goal is to go through the basics of analyzing a single-cell RNA-sequencing dataset composed of a few samples. I’m assuming you already know what scRNA-seq is and have read a few papers in the field. I’m also assuming you’ve heard of Python and are willing to learn NumPy, Pandas, and Matplotlib. Most of the tools in the lab have been ported to R, and several are also accessible through the Scanpy (Python) and Seurat (R) packages, but all of us in the lab use Python when we’re analyzing datasets.
Here’s the basic workflow we’re going to cover in this post:
1. What is single cell RNA-seq?
2. Preprocessing: filtering, normalization, transformation
3. Visualization
4. Clustering
5. Differential expression
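To make the preprocessing step a little more concrete, here is a minimal sketch in plain NumPy of filtering, library-size normalization, and a square-root transform, followed by a PCA projection as a starting point for visualization. Everything in it is an assumption made for illustration: the counts matrix is synthetic, the function names `preprocess` and `pca` are hypothetical, and the thresholds are arbitrary rather than recommended values. It is not the lab's own tooling, which later posts in the series cover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts matrix (cells x genes), for illustration only.
counts = rng.poisson(lam=2.0, size=(200, 50)).astype(float)
counts[:5] = 0      # mimic a few empty droplets
counts[:, :3] = 0   # mimic genes that were never detected


def preprocess(counts, min_library_size=10, min_cells_per_gene=3):
    """Filter, normalize, and transform a cells x genes counts matrix.

    The thresholds are arbitrary placeholders, not recommended values.
    """
    # Filtering: drop cells with very few total counts.
    library_size = counts.sum(axis=1)
    counts = counts[library_size >= min_library_size]

    # Filtering: drop genes detected in very few cells.
    cells_per_gene = (counts > 0).sum(axis=0)
    counts = counts[:, cells_per_gene >= min_cells_per_gene]

    # Normalization: rescale every cell to the median library size.
    library_size = counts.sum(axis=1, keepdims=True)
    normalized = counts / library_size * np.median(library_size)

    # Transformation: square root to dampen highly expressed genes.
    return np.sqrt(normalized)


def pca(data, n_components=2):
    """Project cells onto the top principal components via SVD."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


data = preprocess(counts)
embedding = pca(data)
print(data.shape, embedding.shape)
```

The two-column `embedding` can be passed straight to a Matplotlib scatter plot. On real data you would choose the filtering thresholds by looking at the library-size and gene-detection distributions rather than hard-coding them.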
We already have a set of tutorials for some of our tools on our lab GitHub (www.github.com/krishnaswamylab), but here I will take a higher-level approach to starting an analysis and offer some insights that will hopefully make your own analysis easier.
To get started, visit: 1. What is single cell?