Predicting Cell Phenotypes from scRNAseq Data


Ryan Nayebi*

Mentor: Brianna Chrisman

*Woodside Priory School


The use of machine learning to analyse data obtained through single-cell RNA sequencing (scRNAseq) provides valuable insights into cellular biology. Applications of this type of data analysis range from understanding tumor heterogeneity (with potential for creating lifesaving, patient-specific treatment) to studying stochastic gene expression.[17] In this project, we analyse the performance of a series of supervised learning models on three sets of scRNAseq data. Each model’s goal is to predict cell type based on the expression levels of different genes within each cell. By training these models on our data we discover correlations between expression levels of specific genes and their cellular phenotypes. Cell-type classification is highly relevant in medicine. For example, classifying healthy versus cancerous cells can be used in detecting an early onset of cancer or detecting how far a tumor has spread.[17] Additionally, our research is highly relevant for educational purposes. It provides students in both biology and computer science a real-world application of the central dogma of molecular biology and fundamentals of machine learning (e.g., dimensionality reduction, cross-validation, grid searching, supervised learning models, and confusion matrices). In the end, we discovered which supervised learning methods (including their specific hyperparameters) performed the best for differentiating cell types based on gene expression on each dataset. These models and analysis have implications for establishing a framework for future research in cell-type classification, and can be found at


The goal for this research project was to create a machine learning classifier that can differentiate cell types based on gene expression in order to be used for practical purposes.

Gene expression, the process in which DNA creates a product such as proteins, contains far more information about a cell compared to a phenotypic observation of the cell.[18] A cell’s phenotype is the observable traits of a cell.[18] Sometimes a cell may have genes that aren’t expressed, and thus aren’t observable.[18] Phenotypic observation is the observation of a cell’s expressed traits and characteristics. The ability to observe and make meaning of genetic information (at the single cell level) opens a new lens into cellular biology; one which may help detect an early onset of a specific disease or perhaps to confirm or deny a diagnosis.[17] For example, being able to detect a cell as malignant (cancerous) or benign (non cancerous) can inform doctors if they need to immediately try localising the cancer. Furthermore, the ability to detect the specific attributes in areas of a malignant tumour can allow doctors to provide treatment specific to a patient’s needs.

The use of this classifier can also be extended for educational purposes for both biology and machine learning. For example, seeing how different levels of gene expression may correlate with specific types of cells, can help students gain a better and more real-world understanding of the central dogma of molecular biology. Additionally, this project showcases many different methods of supervised learning, and may be a good starting point for seeing the practical side of the theory taught in many machine learning classes.

Gene Expression

Every cell in our body generally contains the same set of genetic information (genome) for encoding the functional behavior of a cell.[18] Despite having the same information, our cells perform very different functions.[18] To accomplish this, each cell only reads, or expresses, specific subsets of the entire genome. For example, cardiac pacemaker cells (within the heart) would not use the subset of the genetic information that helps create the proteins exclusive to foveolar cells (within the stomach). During transcription, an enzyme called RNA polymerase transcribes a section of the DNA to create messenger RNA (mRNA).[18] Ribosomes then use this mRNA to create proteins in a process known as translation.[18] These proteins are what perform cellular processes and functions.[18] Therefore, because pacemaker cells do not need to create proteins that secrete mucus for stomach protection, they do not transcribe, or express that subsection of the genome. These irrelevant sections of the genome are suppressed in many different ways in different organisms, including by RNA interference (RNAi)[14] and shifts in chromatin structure caused by regulatory proteins (repressors).

Single Cell Sequencing

Although the cells generally have the same set of genetic information, new errors frequently arise in the genome every time a cell undergoes mitosis.[18] This is especially true for malignant tumors. Harsh tumours often have a high genetic variance, allowing the tumour to mutate and have a better chance of withstanding and surviving treatment.[18] Assuming all cells have the exact same genome is erroneous, which is why looking at cells at their individual level is highly important.

Labelling a group of cells as a collective whole, where each cell has the same general function, rather than observing individual cells that make up a functioning unit, obscures important details only observable at the single cell level.[1, 17] For example, it is difficult to see if only specific cells within the unit show a specific phenotype or if all the cells within the unit show a phenotype. A collection of cells may be diverse and heterogeneous; however this is not obvious without single cell sequencing. Single cell sequencing also allows us to identify cells based on their RNA. While cells might have the exact same set of DNA, RNA levels differ greatly from cell to cell.[1] Furthermore, DNA does not inform us about which genes are being expressed.[18]

In order to look at specific cells within a group of similar cells, researchers must break bonds between cells while keeping them alive. Researchers use special indicators to mark the groups of cells as healthy (and thus viable) or unhealthy. The healthy cells are then isolated and lysed. Lysis is the process in which a cell breaks down as its plasma membrane becomes damaged.[18] The details of this isolation and lysis process vary depending on which technique is implemented. Some of the most popular single-cell isolation methods include fluorescence activated cell sorting (FACS)[1,2], microfluidic technology[1,3], and microdroplet-based microfluidics[1,4]. After single cell isolation, the genetic material is often isolated and converted into complementary DNA (cDNA) via reverse transcription[7]. cDNA is much more stable and easier to work with than RNA. Note that cDNA still retains the same information as the RNA. During this reverse transcription process, DNA sequences called genetic barcodes are often attached so that tracing the genetic information back to its original cell is much easier. Commonly, second-strand synthesis follows this reverse transcription process, where second strands are created from the first strand cDNA (usually via poly(A) tailing or a template-switching mechanism)[5]. Lastly, the cDNA is amplified before sequencing[1].

By the time the data is in our hands, we have expression levels from a plethora of cells. In the data, we can see how much an individual gene is being expressed within that cell. These expression levels are derived from observing mRNA. The amount of mRNA of a gene is an indirect indicator of how much a gene is being expressed, as mRNA is created during transcription.

Machine Learning

This process of single-cell sequencing yields a large amount of data. We used supervised learning techniques in order to make sense of the data at hand. Supervised learning is a technique where the computer learns from labeled data, as opposed to unsupervised learning where the algorithm finds patterns within the data on its own[19]. In other words, in supervised learning, when an input is provided, the machine learning algorithm predicts an output value. This prediction is based on previously seen labeled data entries. In this project, we trained various supervised learning models to predict the phenotype of a cell based on how much each gene is expressed (which is obtained via scRNAseq). The model types we chose to use were: logistic regression, K-Nearest Neighbors (KNN), Gaussian Naive Bayes, random forest, and support vector machines.

Generally speaking, logistic regression fits the data based on maximum likelihood. Through maximum likelihood, the weights of the linear combination (which represent the correlations between inputs and outputs) change to best fit the patterns of the data.
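As a minimal sketch of fitting by maximum likelihood, the snippet below trains a two-class logistic regression with plain gradient ascent on the average log-likelihood. It is an illustration only: the toy data, the function names, the learning rate, and the number of steps are assumptions, and the project itself used scikit-learn's implementation on multi-class data.

```python
import numpy as np

def fit_logistic_regression(X, y, learning_rate=0.1, n_steps=2000):
    """Fit binary logistic regression by gradient ascent on the log-likelihood."""
    # Prepend a column of ones so the first weight acts as the intercept
    X_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = np.zeros(X_bias.shape[1])
    for _ in range(n_steps):
        # Predicted probability of class 1 for every sample (sigmoid of the linear combination)
        probabilities = 1.0 / (1.0 + np.exp(-X_bias @ weights))
        # Gradient of the average log-likelihood with respect to the weights
        gradient = X_bias.T @ (y - probabilities) / len(y)
        weights += learning_rate * gradient
    return weights

def predict(weights, X):
    """Predict class 1 when the linear combination is positive (probability above 0.5)."""
    X_bias = np.hstack([np.ones((X.shape[0], 1)), X])
    return (X_bias @ weights > 0).astype(int)

# Toy data (assumed for illustration): two well-separated groups with two features each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

weights = fit_logistic_regression(X, y)
accuracy = (predict(weights, X) == y).mean()
print("weights:", weights, "training accuracy:", accuracy)
```

Each update nudges the weights in the direction that makes the observed labels more probable, which is the sense in which the weights "change to best fit the patterns of the data".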

In KNN, classification depends on the nearest data-points around the new input[19]. The number of nearest data-points (nearest neighbors) is provided by the user and represents the ‘K’ in ‘KNN’. For example, let K = 5. First, we will “plot” the new data point amongst the rest of the data, and then look at the 5 nearest neighbors. Based on those other 5 data points, if say 4 of them are in category α, and 1 is in category β, the new data point will be classified as α. This is because a majority of the neighbors are categorized as α.
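The K = 5 example can be sketched in a few lines of plain Python. The points, labels, and the `knn_predict` helper below are made up for illustration (they are not from the project's datasets or code), and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count how often each label appears
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data (illustrative values only)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
          (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
labels = ["alpha", "alpha", "alpha", "alpha", "beta", "beta", "beta"]

# The 5 nearest neighbours of (1.5, 1.5) are 4 "alpha" points and 1 "beta" point
prediction = knn_predict(points, labels, (1.5, 1.5), k=5)
print(prediction)
```

In practice K is a hyperparameter: a very small K is sensitive to outliers, while a very large K blurs the boundaries between categories.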

The theory behind the Gaussian Naive Bayes supervised learning technique is based on statistics. First, Gaussian curves are created for each feature based on the training data. Next, the prior probabilities are calculated, denoted as $P(y)$. Then we calculate the likelihood of each feature based on the Gaussian curves, denoted as $P(x_i \mid y)$. Often, we then take the log of all of these likelihoods (to prevent underflow). This is written as:

$$\log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y)$$

Where $n$ is the number of features

Or in other notations[6]:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

Where $\sigma_y$ and $\mu_y$ are estimated based on maximum likelihood

Afterwards, we choose the output depending on which outcome is more likely to occur, based on these calculations.
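The steps above (fit a Gaussian per feature and class, add the log prior to the summed log likelihoods, pick the most likely class) can be sketched from scratch as follows. This is an illustrative re-implementation with made-up data and hypothetical helper names; the project used scikit-learn's Gaussian Naive Bayes rather than this code.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior, and a per-feature mean and variance, for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n_features = len(rows[0])
        means = [sum(r[i] for r in rows) / len(rows) for i in range(n_features)]
        # Maximum-likelihood variance, plus a tiny constant to avoid dividing by zero
        variances = [
            sum((r[i] - means[i]) ** 2 for r in rows) / len(rows) + 1e-9
            for i in range(n_features)
        ]
        prior = len(rows) / len(X)
        model[label] = (prior, means, variances)
    return model

def log_scores(model, x):
    """Return log prior plus the sum of per-feature Gaussian log likelihoods, per class."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xi, mu, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(model, x):
    """Choose the class with the highest log score."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)

# Toy expression values for two hypothetical genes (illustrative only)
X = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.2], [4.0, 1.0], [4.2, 0.9], [3.8, 1.1]]
y = ["type_a", "type_a", "type_a", "type_b", "type_b", "type_b"]

model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 5.1]))
```

Working with sums of logs instead of products of probabilities keeps the numbers in a range that floating point can represent, which matters when there are thousands of features.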

In the random forest algorithm, bootstrapped datasets are created from the training data to build a series of decision trees. A bootstrapped dataset is created by randomly sampling from the training data with replacement, a process called bootstrapping[21]. Each level in each decision tree uses a random subset of the variables from the bootstrapped dataset. By doing this multiple times we end up with a large set of decision trees. When we feed a new data-point into the random forest algorithm to classify it, we pass it through each decision tree and keep track of the output of each decision tree. The majority label across all the decision trees determines the classification. We then calculate the out-of-bag error based on the data excluded from the bootstrapped dataset. Next, we can change the number of variables used per level in each tree and find the optimal number of variables, the one which minimises the out-of-bag error. Interestingly, because of the plethora of decision trees, random forest can also use these trees to create proximity matrices to help fill in missing data.
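A rough sketch of tuning the number of variables with the out-of-bag error, using scikit-learn's `RandomForestClassifier`, is shown below. The synthetic data, the candidate values, and the number of trees are assumptions for illustration and are not the settings used in this project. Note that scikit-learn draws the random subset of variables at every split rather than once per level.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an expression matrix: 300 "cells", 20 "genes", 3 classes
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

oob_errors = {}
for max_features in (2, 4, 8, 16):
    forest = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,  # number of variables considered at each split
        bootstrap=True,             # each tree is trained on a bootstrapped dataset
        oob_score=True,             # score each sample using only trees that never saw it
        random_state=0,
    )
    forest.fit(X, y)
    oob_errors[max_features] = 1.0 - forest.oob_score_

# Pick the setting with the lowest out-of-bag error
best_max_features = min(oob_errors, key=oob_errors.get)
print(oob_errors, best_max_features)
```

Because the out-of-bag samples were left out of each tree's bootstrapped dataset, this error behaves like a built-in validation score and needs no separate held-out set.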

Similar to logistic regression, support vector machines (SVM) create a threshold used to classify the observations. If the data has n features (that is, each observation lies in ℝ^n), the support vector classifier will be an (n-1)-dimensional hyperplane. However, when data overlaps (see figures 1 and 2 for visuals), it is difficult to create a logical threshold. To overcome this, support vector machines increase the dimensions of the data in order to create thresholds that handle the overlap more accurately. To transform the data into higher dimensions, different kernel functions are used. Some kernel functions work better than others depending on the dataset provided.


After a dimensionality increase, the data may look like:

[Figure: scatter chart]
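The effect of the kernel choice can be illustrated with a small sketch on synthetic data where one class sits inside a ring formed by the other, so no straight line separates them. The dataset, split, and settings below are assumptions chosen for illustration, not the project's data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not separable by a straight line in 2-D
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear kernel keeps the data in its original two dimensions
linear_accuracy = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# The radial basis function kernel implicitly maps the data to a higher-dimensional space
rbf_accuracy = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("linear:", linear_accuracy, "rbf:", rbf_accuracy)
```

On data like this the linear kernel cannot do much better than guessing, while the radial basis function kernel separates the two rings, which is why the kernel is worth treating as a hyperparameter.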

After having chosen a model, it is important to tune its hyperparameters. A hyperparameter is a setting chosen before training, where changing it affects the way in which learning occurs. As a simple example, one may decide to reduce the learning rate [18] over time during gradient descent [18] to avoid overshooting an extremum. We used GridSearchCV to tune our hyperparameters[6]. We create a search space based on the range of values per hyperparameter and use grid search to find the best model. Grid search tests every combination of hyperparameter values in the search space and uses cross-validation to ensure that the highest performing set of hyperparameters is truly the best for the model on the specific dataset.
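A minimal sketch of a grid search with scikit-learn's `GridSearchCV` follows. The synthetic data, the search space, and the 5-fold setting are assumptions for illustration; the actual search spaces used in this project are in the project repository.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data: 200 samples with 10 features
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Search space: every combination of these values is tried (3 x 2 x 1 = 6 candidates)
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale"],
}

# cv=5 means each candidate is scored by 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
```

The cost grows quickly: the number of model fits is the number of candidates multiplied by the number of folds, so widening any one range multiplies the total training time.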


Data Collection

The first step in this project was to find and narrow down the datasets. We needed to choose a handful of datasets for building and training cell type classifiers that had large volumes of data collected from various types of cells. In the end, we chose to focus on three single-cell datasets from published research within the Broad Institute. The first dataset is from Defining T Cell States Associated with Response to Checkpoint Immunotherapy in Melanoma[8, 9]. Here, researchers strived to understand and combat failures in checkpoint immunotherapy for melanoma. The researchers collected their data by taking tumour samples from melanoma patients, specifically those treated with checkpoint inhibitors (drugs that block proteins)[10, 11, 18, 20]. From these samples, they profiled the transcriptomes in order to analyse and improve checkpoint immunotherapy (blocks these proteins from binding with their partner proteins, allowing T cells to attack the cancer cells that are within the body) [10, 11, 18, 20]. The second dataset is from T Helper Cell Cytokines Modulate Intestinal Stem Cell Renewal and Differentiation, where researchers strived to gain a better understanding of how stem cells within the small intestine differentiate into specific cells[10, 11]. The third dataset is from Mitogenic and Progenitor Programs in Single Pilocytic Astrocytoma Cells, where researchers performed scRNA sequencing in order to better understand how mutations affected brain tumours[12, 13].


After this, we preprocessed the data in order to make it more streamlined and easier to work with. First, we got rid of unnecessary columns, rows, and labels that would not add any insight to the data, such as the unique name/code given to each individual cell. After this, we saw that the metadata and the gene expression data within the same dataset had different dimensions. To take care of this issue we used only the cells that the metadata and gene expression data had in common. We did this so that we could train our models: if we had cells that contained no labels, it would be impossible to properly identify them. Note that this does not skew our model, as we are using all the usable data available.
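Keeping only the cells shared by the metadata and the expression data can be sketched with pandas as below. The table layout (cells as rows, indexed by cell name), the gene names, and the `cell_type` column are hypothetical; the real files are organised differently for each dataset.

```python
import pandas as pd

# Hypothetical expression table: one row per cell, one column per gene
expression = pd.DataFrame(
    {"GENE_A": [0.0, 2.1, 3.3, 1.0], "GENE_B": [5.2, 0.0, 1.1, 0.4]},
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Hypothetical metadata table: labels exist only for some cells
metadata = pd.DataFrame(
    {"cell_type": ["T cell", "B cell", "T cell"]},
    index=["cell_2", "cell_3", "cell_5"],
)

# Keep only the cells present in both tables
common_cells = expression.index.intersection(metadata.index)
X = expression.loc[common_cells]              # features for the shared cells
y = metadata.loc[common_cells, "cell_type"]   # labels, in the same row order as X

print(X.shape, y.to_dict())
```

Selecting both tables with the same `common_cells` index guarantees that row i of the features and entry i of the labels refer to the same cell.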

Feature Reduction

As there were thousands of cells, we knew that fitting the models to the original data would take a long time. Thus we decided to use principal component analysis (PCA) to reduce the number of features, while still maintaining the original patterns within the data. PCA works by using eigendecomposition[15] or singular value decomposition (SVD)[16] to find the directions along which the data varies the most (principal components). After this, we input the lower dimension data (which still contains most of the initial information) into the models to make training and fitting the models much easier.
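As a sketch of the SVD route, the snippet below centres the data, decomposes it, and projects it onto the leading principal components. The `pca_svd` helper and the synthetic matrix (50 features generated from 2 hidden factors) are assumptions for illustration; the project used a library implementation.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its first n_components principal components using SVD."""
    # PCA requires centred data: subtract each feature's mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    # Fraction of the total variance captured by each component
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    return projected, explained_variance_ratio[:n_components]

# Synthetic data: 100 samples, 50 features driven by only 2 underlying factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 50))

projected, ratio = pca_svd(X, n_components=2)
print(projected.shape, ratio)
```

The explained variance ratio is a useful check on how many components to keep: if a handful of components captures little of the variance, reducing to them discards a significant portion of the information.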

Figures 3, 6, and 8 visually show what the data looks like after performing PCA and projecting the data points with color-coding onto a 2d plane.


For each model, we built a confusion matrix to see how well the model performed on each cell type. A confusion matrix is a table that depicts the performance of a supervised learning model in detail. Rather than reporting only a percentage or the number of errors the model made, a confusion matrix shows what types of misclassifications the model made. The main diagonal of the confusion matrix (running from top left to bottom right) represents the correctly predicted cases. The numbers outside of that diagonal are misclassifications. For example in figure 4, row G07 (memory T-cells), column G10 has a value of 9. This means that the model misclassified 9 G10 cells (Regulatory T-cells) as G07 cells. The color of each position in the confusion matrix corresponds to its value, according to the scale specified in each figure. Combining the confusion matrices with the 2d PCA plots provides more insight into how well the model performed. If one group of cells that seems to have little correlation with anything drags down the accuracy score of the model, and the rest of the cells have been accurately predicted, we can conclude our model is mostly performing well.
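A confusion matrix can be built by hand in a few lines, as sketched below with made-up labels. The layout follows the reading given in the text (rows are predicted types, columns are true types); scikit-learn's `confusion_matrix` uses the transposed layout, with true labels on the rows and predictions on the columns.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions per (predicted class, true class) pair."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        # Row = what the model predicted, column = what the cell really was
        matrix[index[predicted]][index[true]] += 1
    return matrix

# Made-up labels for five cells (illustrative only)
classes = ["G07", "G10"]
true_labels = ["G07", "G07", "G10", "G10", "G10"]
predicted_labels = ["G07", "G07", "G07", "G10", "G10"]

matrix = confusion_matrix(true_labels, predicted_labels, classes)

# Off-diagonal entry: number of G10 cells that were misclassified as G07
g10_predicted_as_g07 = matrix[0][1]

# Accuracy is the sum of the diagonal divided by the number of cells
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(true_labels)
print(matrix, g10_predicted_as_g07, accuracy)
```

Unlike the single accuracy number, the off-diagonal entries show which pairs of cell types the model confuses with each other.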

For individual graphs, specific grid search parameters, and optimized hyperparameters, please refer to the project repository at


As seen in figure 3, the clusters were well defined and were not difficult to work with. The outliers were easily taken care of. Logistic regression ended up working the best (see figure 4 for confusion matrix and table 1 for accuracies). The hyperparameters that worked the best were very similar to the default parameters of the scikit-learn model. The major difference was that the solver used was ‘saga’. The rest of the hyperparameters stayed mostly at their defaults.

Despite KNN not performing as accurately as logistic regression, we ended up with a nice accuracy vs K-value plot, suggesting that the hyperparameter for number of neighbors was successfully optimised (see figure 5).

Cell Type Key For Figures 3-4 [8,9]:

Dendritic cells
Exhausted CD8+ T-cells
Regulatory T-cells
Cytotoxic Lymphocytes
Exhausted/HS CD8+ T-cells
Memory T-cells
Lymphocytes exhausted/cell-cycle

Brain Cells

As seen in figure 6, the data points clustered by cell type fairly well. There are some outliers, but not too many where the data seems scattered. In this dataset random forest ended up performing the best. The two hyperparameters we tuned were max_features and n_estimators. The combination that yielded the best performance used the square root of the number of features for the max features and 954 estimators for the number of estimators (see figure 7 for confusion matrix and table 1 for accuracies).

Cell Type Key For Figures 6-7 [12,13]:

T cell


Stem Cells

As seen in figure 8, the cells did not cluster by type as well as in the other two datasets. In this dataset, an SVM performed the best. We ended up tuning a couple of hyperparameters including the value for C, the kernel type, and the setting for gamma. (other hyperparameters that affected performance remained at their default values). The combination that ended up performing the best was using a polynomial kernel with degree 1, a C of ~3.6, and gamma was set to “scale” (see figure 9 for confusion matrix and table 1 for accuracies).

Table 1: Percent Accuracy Per Model Per Dataset

Models (columns): Logistic Regression, K-Nearest Neighbors, Gaussian Naive Bayes, Random Forest, Support Vector Machine

T-cell dataset:
Stem-cell dataset:
Brain-cell dataset: 83.2 %

Advantages and Limitations of Different Models

Unfortunately, because of the high computational costs, we were unable to tune the hyperparameters for random forest in our T-cell dataset and were unable to tune SVM in the brain cell dataset over the ranges we would have liked to grid search over. We speculate that the reason it took so long to train was due to the sheer number of cells in each dataset, and the large number of training iterations grid search requires. Despite doing a dimensionality reduction via PCA, the models still would not finish training. Although we could reduce the number of features to only a couple of principal components, we saw that by doing this a significant portion of the information was lost. In the future, we would aim to have access to more computing power and perhaps re-write some of the models to take advantage of GPUs and TPUs as well.

In the T-cell dataset, we believe that logistic regression ended up performing better than KNN and Gaussian Naive Bayes because the dataset had thousands of dimensions and because the data was log-normalized when we initially downloaded it. We speculate that using KNN in high dimensions is a difficult task. After testing the model’s performance using only a couple of principal components, it still did not perform well. We believe this is the case because by only using a few principal components we do not retain a significant portion of the information in the dataset. Gaussian Naive Bayes likely did not perform well because its calculations assume that the data follows a Gaussian distribution. Despite being log-normalized, the scRNAseq data was likely not a perfect Gaussian distribution. However, we emphasise that given that our datasets had up to 11 classes to predict, even seemingly low accuracies (around 50%) are much better than random guesses. The dataset and model combination with the best accuracy was the T-cell dataset with logistic regression. We believe that this is because the dataset had a large amount of data, and thus the outliers have less of an impact on the overall model. This would mean that the model did not overfit and did an overall good job of classifying the cells. We can also see from figure 3 that the groups in this dataset were relatively well defined. This would help the classifier as well. We saw that SVM performed almost as well as logistic regression on the T-cell dataset. However, for practical purposes, even if SVM performed slightly better than logistic regression, we would still choose logistic regression. This is because grid searching over SVM took a lot longer to complete compared to grid searching over logistic regression. The same goes for random forest.
We think SVM performed well because we grid searched over multiple different kernels, and the radial basis function kernel ended up working the best. This makes sense because the data, even if transformed into higher dimensions, would be difficult to separate via a linear or polynomial kernel function.

In the stem-cell dataset, the SVM performed the best. Although the brain-cell dataset has fewer cells compared to the stem-cell dataset, the brain-cell dataset has more features than the stem-cell dataset. We believe that this is why we were unable to grid search over a range of SVM hyperparameters on the brain-cell dataset with GridSearchCV. Although the stem-cell dataset has fewer features, it still has a significant number of features overall. However, for similar reasons to the T-cell dataset, logistic regression also performed quite well, while GNB and KNN did not. Typically, we see that random forest performs better than SVM on data with large numbers of features. However, this is not the case with the stem-cell dataset. Note that many of the examples we saw did not use PCA. Perhaps PCA disproportionately affects random forest.

Lastly, in the brain-cell dataset, we can see from table 1 that the random forest classifier works the best. We think that when confronted with large amounts of data collected in labs, with large numbers of features, one should steer away from using GNB and KNN. They consistently performed subpar, while the other models usually performed well, especially logistic regression (much cannot be said about random forest and SVM as those were not implemented for all of the datasets due to computational limits).


Main Findings

After analysing the data and the resulting figures, we came to the conclusion that more complexity is not always necessary, especially at the high school level. Although SVM and random forest performed very well, they both took a large amount of computational power and time. On the other hand, a simple logistic regression performed almost as well as, if not better than, SVM and random forest, while taking much less computational power and time. Thus for finding general trends within a dataset, with a relatively high performance, logistic regression is often a great choice. However, it is still important to test the other models if possible. In settings where accuracy is of utmost importance, such as in product design, testing models such as SVM and random forest is crucial.

Furthermore, our goals for this research project were reached: we were able to create a supervised learning model that predicts phenotypes from scRNAseq data with practical uses in the medical field, create a standard method of preprocessing data and comparing models against each other, and discover which supervised learning techniques worked the best for predicting phenotypes from scRNAseq data.

In addition to establishing a framework for future research in cell-type classification, our research is highly relevant for educational purposes. We hope that it can provide students in both biology and computer science a real-world application of the central dogma of molecular biology and the fundamentals of machine learning.

Future Directions

We have three next goals for this project. First, we would like to incorporate neural networks and see how they compare with the models we have currently trained. Second, we would love to create a more pleasant UI so that this code can be used in the real world for real applications. Currently, these models are just in a notebook, which everyday users would not find pleasant to interface with. Lastly, we would like to try writing the code to be compatible with GPUs (graphics processing units) and TPUs (tensor processing units), to be able to train the models faster.

Personal Reflections (Ryan Nayebi)

Personally, what went well for me was my own growth as a student interested in computer science. Machine learning was a relatively new area of computer science whose surface I had only scratched. I had been comfortable with AI concepts, however searches such as A* and alpha-beta differ vastly from supervised learning methods such as random forest and Gaussian Naive Bayes. I found myself becoming much more fluent in numpy, pandas, and scikit-learn as well. Initially, it was a little challenging getting used to the different attributes in each package. After using them in this project, I feel more comfortable with these packages. I also loved the fact that I could combine my interest in genetics with computer science in one project. On the biological side, I learned a lot more about scRNA-seq and the different techniques by which scRNA-seq is carried out, concepts which I never learned in high school biology.

With regards to the coding, what went well was creating the PCA plots, confusion matrices, and getting my first logistic regression working. These went relatively seamlessly. However, when it came to cleaning and preprocessing the data (at least for the first dataset), I had a lot of trouble. This was mainly because I knew what I needed to do, but it was initially difficult trying to translate my ideas into the numpy and pandas code.

Trying to reduce dimensions of the data gave me some difficulties at first. Before using PCA I tried using standard deviations to remove the least important aspects of the data, however this did not end up going well as I deleted much more than I intended to. I saw that using PCA is a much better method of dimensionality reduction.

Another challenging aspect of the project was getting my SVM to train on certain datasets. 24 hours was not a sufficient amount of time to train the SVM. (Note this also happened when I ran GridSearchCV with random forest on the T-cells). Due to this problem with time, it was difficult to debug. Furthermore, with the resources available, I exceeded the runtime available to me, and thus could not train SVM on the brain-cell dataset.

Works Cited:

  1. Hwang, Byungjin, Ji Hyun Lee, and Duhee Bang. “Single-Cell RNA Sequencing Technologies and Bioinformatics Pipelines.” Nature News. Nature Publishing Group, August 7, 2018. 
  2. Julius, M. H., T. Masuda, and L. A. Herzenberg. “Demonstration That Antigen-Binding Cells Are Precursors of Antibody-Producing Cells after Purification with a Fluorescence-Activated Cell Sorter.” Proceedings of the National Academy of Sciences of the United States of America. U.S. National Library of Medicine, July 1972. 
  3. Whitesides, George M. “The Origins and the Future of Microfluidics.” Nature News. Nature Publishing Group, July 26, 2006.
  4. Thorsen, Todd, et al. “Dynamic Pattern Formation in a Vesicle-Generating Microfluidic Device.” Physical review letters. U.S. National Library of Medicine, April 30, 2001. 
  5. Islam, Saiful, et al. “Quantitative Single-Cell RNA-Seq with Unique Molecular Identifiers.” Nature methods. U.S. National Library of Medicine, December 22, 2013. 
  6. Pedregosa, Fabian, et al. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research, 2011. 
  7. Sambrook, Joseph, and David W. Russell. “Amplification of CDNA Generated by Reverse Transcription of MRNA.” Cold Spring Harbor Protocols, May 1, 2019. 
  8. Sade-Feldman, Moshe, et al. “Defining T Cell States Associated with Response to Checkpoint Immunotherapy in Melanoma.” Cell, U.S. National Library of Medicine, January 10, 2019. 
  9. Sade-Feldman, Moshe, et al. “Study: Defining T Cell States Associated with Response to Checkpoint Immunotherapy in Melanoma 16291 Cells.” Single Cell Portal, January 10, 2019.
  10. Biton, Moshe, et al. “T Helper Cell Cytokines Modulate Intestinal Stem Cell Renewal and Differentiation.” Cell, U.S. National Library of Medicine, November 1, 2018. 
  11. Biton, Moshe, et al. “Study: Intestinal Stem Cell 28462 Cells.” Single Cell Portal, November 1, 2019. 
  12. Reitman, Zachary J., et al. “Mitogenic and Progenitor Gene Programmes in Single Pilocytic Astrocytoma Cells.” Nature Communications, U.S. National Library of Medicine, August 19, 2019. 
  13. Reitman, Zachary J. et al. “Study: Pilocytic Astrocytoma Single Cell RNA-Seq 931 Cells.” Single Cell Portal, August 19, 2019. 
  14. “RNA Interference (RNAi).” National Center for Biotechnology Information, U.S. National Library of Medicine,
  15. Weisstein, Eric W. “Eigen Decomposition.” Wolfram MathWorld,
  16. Weisstein, Eric W. “Singular Value Decomposition.” Wolfram MathWorld,
  17. Wang, Yong, and Nicholas E Navin. “Advances and Applications of Single-Cell Sequencing Technologies.” Molecular cell. U.S. National Library of Medicine, May 21, 2015. 
  18. Audesirk, Teresa, Gerald Audesirk, and Bruce E. Byers. Biology: Life on Earth with Physiology, Ninth Edition. Boston, MA: Benjamin Cummings, 2011. 
  19. Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. London: Pearson, 2009. 
  20. “NCI Dictionary of Cancer Terms.” National Cancer Institute. National Cancer Institute, n.d. 
  21. Frost, Jim. “Introduction to Bootstrapping in Statistics with an Example.” Statistics By Jim, October 7, 2018. 

About the author

Ryan is currently a high school senior in the Bay Area interested in computer science and biology, and enjoys pursuing projects where he can apply computer science to other fields. He hopes that through his academic growth, he can apply his skills in helping the world become a better place. In his free time, Ryan enjoys running, playing tennis, and various board games.
