In extreme cases, where only a few cells have been collected for some subjects, interpretation of gene expression differences should be handled with caution. Here, we present the DS results comparing CF and non-CF pigs only in secretory cells from the small airways. In the bulk RNA-seq, genes with adjusted P-values less than 0.05 and at least a 2-fold difference in gene expression between CD66+ and CD66- basal cells are considered true positives and all others are considered true negatives. The volcano plot for the subject method shows three genes with adjusted P-value < 0.05 (-log10(FDR) > 1.3), whereas the other six methods detected a much larger number of genes. In Supplementary Figure S14f, wilcox produces better ranked gene lists of known markers than both subject and mixed and, again, the mixed method has the worst performance. See Supplementary Material for brief example code demonstrating the usage of aggregateBioVar. (c) Volcano plots show results of three methods (subject, wilcox and mixed) used to identify CD66+ and CD66- basal cell marker genes. In practice, we have omitted comparisons of gene expression in rare cell types because the gene expression profiles had high variation, and the reliability of the comparisons was questionable. We designed a simulation study to examine characteristics of using subjects or cells as units of analysis for DS testing under data simulated from the proposed model. #' @param de_groups The two group labels to use for differential expression, supplied as a vector.
Four of the methods were applications of the FindMarkers function in the R package Seurat (Butler et al., 2018; Satija et al., 2015; Stuart et al., 2019) with different options for the type of test performed: for the method wilcox, cell counts were normalized, log-transformed and a Wilcoxon rank sum test was performed for each gene; for the method NB, cell counts were modeled using a negative binomial generalized linear model; for the method MAST, cell counts were modeled using a hurdle model based on the MAST software (Finak et al., 2015) and for the method DESeq2, cell counts were modeled using the DESeq2 software (Love et al., 2014). Andrew L Thurman, Jason A Ratcliff, Michael S Chimenti, Alejandro A Pezzulo, Differential gene expression analysis for multi-subject single-cell RNA-sequencing studies with aggregateBioVar, Bioinformatics, Volume 37, Issue 19, 1 October 2021, Pages 3243-3251, https://doi.org/10.1093/bioinformatics/btab337. NPV is the fraction of undetected genes that were not differentially expressed. Supplementary Figure S11 shows cumulative distribution functions (CDFs) of permutation P-values and method P-values. Third, the proposed model also ignores many aspects of the gene expression distribution in favor of simplicity. #' @return Returns a volcano plot from the output of the FindMarkers function from the Seurat package, which is a ggplot object that can be modified or plotted. I prefer to apply a threshold when showing volcano plots, displaying any points with extreme / impossible p-values (e.g. < 10e-20) with a different symbol at the top of the graph. FindMarkers from Seurat returns p-values as 0 for highly significant genes.
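The performance measures used throughout (TPR, FPR, PPV and NPV) all follow from a two-by-two table of detected versus truly differentially expressed genes. The following is a minimal base R sketch of that bookkeeping; the function name `classification_metrics` and the toy vectors are hypothetical and are not taken from the paper or from any package.

```r
# Confusion-matrix metrics for a set of detected genes against a gold standard.
# `detected` and `truth` are logical vectors with one entry per gene.
classification_metrics <- function(detected, truth) {
  stopifnot(is.logical(detected), is.logical(truth),
            length(detected) == length(truth))
  tp <- sum(detected & truth)    # detected and truly differentially expressed
  fp <- sum(detected & !truth)   # detected but not differentially expressed
  fn <- sum(!detected & truth)   # differentially expressed but missed
  tn <- sum(!detected & !truth)  # correctly left undetected
  c(TPR = tp / (tp + fn),        # recall
    FPR = fp / (fp + tn),
    PPV = tp / (tp + fp),        # precision
    NPV = tn / (tn + fn))        # undetected genes that are truly null
}

# Toy example with 10 genes (made-up values, for illustration only).
truth    <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
detected <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
metrics <- classification_metrics(detected, truth)
print(round(metrics, 3))
```

With real results, `detected` would typically come from thresholding adjusted P-values and `truth` from the simulation flags or a bulk RNA-seq gold standard.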
For each method, we compared the permutation P-values to the P-values directly computed by each method, which we define as the method P-values. Next, we matched the empirical moments of the distributions of $E_{ijc}$ and $E_{ij}$ to the population moments. The volcano plots for subject and mixed show a stronger association between effect size (absolute log2-transformed fold change) and statistical significance (negative log10-transformed adjusted P-value). The subject method has the strongest type I error rate control and highest PPVs, wilcox has the highest TPRs and mixed has intermediate performance with better TPRs than subject yet lower FPRs than wilcox (Supplementary Table S2). First, a random proportion of genes, pDE, were flagged as differentially expressed. Second, there may be imbalances in the numbers of cells collected from different subjects. In summary, here we (i) suggested a modeling framework for scRNA-seq data from multiple biological sources, (ii) showed how failing to account for biological variation could inflate the FDR of DS analysis and (iii) provided a formal justification for the validity of pseudobulking to allow DS analysis to be performed on scRNA-seq data using software designed for DS analysis of bulk RNA-seq data (Crowell et al., 2020; Lun et al., 2016; McCarthy et al., 2017). Crowell et al. (2020) provide a thorough comparison of a variety of DGE methods for scRNA-seq with biological replicates including: (i) marker detection methods, (ii) pseudobulk methods, where gene counts are aggregated between cells from different biological samples and (iii) mixed models, where models for gene expression are adjusted for sample-specific or batch effects.
For example, a simple definition of $s_{jc}$ is the number of unique molecular identifiers (UMIs) collected from cell c of subject j. PR curves for DS analysis methods. Subject-level gene expression scores were computed as the average counts per million for all cells from each subject. We proceed as follows. Volcano plots are commonly used to display the results of RNA-seq or other omics experiments. To better illustrate the assumptions of the theorem, consider the case when the size factor $s_{jc}$ is the same for all cells in a sample j and denote the common size factor as $s_j^*$. I would like to create a volcano plot to compare differentially expressed genes (DEGs) across two samples: a "before" and an "after" treatment. Results for alternative performance measures, including receiver operating characteristic (ROC) curves, TPRs and false positive rates (FPRs) can be found in Supplementary Figures S7 and S8. These analyses provide guidance on strengths and weaknesses of different methods in practice. Seurat utilizes R's plotly graphing library to create interactive plots. Overall, the subject and mixed methods had the highest concordance between permutation and method P-values. When only 1% of genes were differentially expressed, the mixed method had a larger area under the curve than the other five methods. In this comparison, many genes were detected by all seven methods. The intra-cluster correlations are between 0.9 and 1, whereas the inter-cluster correlations are between 0.51 and 0.62. As the SD of the log fold change increases, the width of the distribution of effect sizes increases, so that the signal-to-noise ratio for differentially expressed genes is larger. The resulting matrix contains counts of each gene for each subject and can be analyzed using software for bulk RNA-seq data.
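The aggregation step that produces a gene-by-subject matrix can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy counts, the subject labels and the object names (`counts`, `subject`, `pseudobulk`, `mean_cpm`) are made up, and the aggregateBioVar package itself is not used here.

```r
# Toy gene-by-cell count matrix: 3 genes x 6 cells from 3 subjects.
# All values and names are made up for illustration.
counts <- matrix(
  c(5, 0, 2, 1, 0, 3,
    1, 1, 0, 4, 2, 2,
    0, 3, 3, 0, 1, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)
subject <- c("pig1", "pig1", "pig2", "pig2", "pig3", "pig3")

# Pseudobulk: sum counts over the cells of each subject. rowsum() groups
# rows, so the matrix is transposed before and after to group cells.
pseudobulk <- t(rowsum(t(counts), group = subject))

# Per-cell counts per million, then the average over each subject's cells.
cell_cpm <- t(t(counts) / colSums(counts)) * 1e6
mean_cpm <- sapply(split(seq_along(subject), subject), function(idx) {
  rowMeans(cell_cpm[, idx, drop = FALSE])
})

print(pseudobulk)
print(round(mean_cpm))
```

The `pseudobulk` matrix has one column per subject and could be passed to bulk RNA-seq software, while `mean_cpm` corresponds to the subject-level expression scores described above.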
The negative binomial distribution has a convenient interpretation as a hierarchical model, which is particularly useful for sequencing studies. In addition to simulated data, we analysed an animal model dataset containing large and small airway epithelia from CF and non-CF pigs (Rogers et al., 2008). This is done by passing the Seurat object used to make the plot into CellSelector(), as well as an identity class. Therefore, as experiments that include biological replication become more common, statistical frameworks to account for multiple sources of biological variability will be critical, as recently described by Lähnemann et al. (2020). Data for the analysis of human skin biopsies were obtained from GEO accession GSE130973. The expression level of gene i for group 1, $\beta_{i1}$, was matched to the pig data by setting $e^{\beta_{i1}} = \sum_{j,c} K_{ijc} / \sum_{i'} \sum_{j,c} K_{i'jc}$. Raw gene-by-cell count matrices for pig scRNA-seq data are available as GEO accession GSE150211. Next, we applied our approach for marker detection and DS analysis to published human datasets. To use, simply make a ggplot2-based scatter plot (such as DimPlot() or FeaturePlot()) and pass the resulting plot to HoverLocator(). For example, consider a hypothetical gene having heterogeneous expression in CF pigs, where cells were either low expressors or high expressors, versus homogeneous expression in non-CF pigs, where cells were moderate expressors. This figure suggests that the methods that account for between-subject differences in gene expression (subject and mixed) will detect different sets of genes than the methods that treat cells as the units of analysis.
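The hierarchical interpretation of the negative binomial distribution can be checked numerically: a Poisson count whose rate is gamma distributed is marginally negative binomial. The sketch below uses base R only; the symbols `mu` and `phi` and their values are assumptions chosen for illustration, not parameters reported in the paper.

```r
set.seed(1)
mu  <- 10      # mean expression (assumed value)
phi <- 0.5     # dispersion (assumed value)
n   <- 200000  # number of simulated counts

# Stage 1: latent expression from a gamma distribution with mean mu and
# variance phi * mu^2 (shape = 1 / phi, scale = mu * phi).
lambda <- rgamma(n, shape = 1 / phi, scale = mu * phi)

# Stage 2: counts from a Poisson distribution given the latent expression.
k <- rpois(n, lambda)

# Marginally, k is negative binomial with mean mu and
# variance mu + phi * mu^2; compare the empirical moments.
summary_stats <- c(mean = mean(k), var = var(k), nb_var = mu + phi * mu^2)
print(round(summary_stats, 2))
```

The empirical variance should land close to `mu + phi * mu^2`, well above the Poisson variance of `mu`, which is the overdispersion that the gamma stage contributes.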
Specifically, if $K_{ijc}$ is the count of gene i in cell c from pig j, we defined $E_{ijc} = K_{ijc} / \sum_{i'} K_{i'jc}$ to be the normalized expression for cell c from subject j and $E_{ij} = \sum_{c} K_{ijc} / \sum_{i'} \sum_{c} K_{i'jc}$ to be the normalized expression for subject j. Overall, mixed seems to have the best performance, with a good tradeoff between FPRs and TPRs. The analyses presented here have illustrated how different results could be obtained when data were analysed using different units of analysis. Gene counts were simulated from the model in Section 2.1. In our simulation study, we also found that the pseudobulk method was conservative, but in some settings, mixed models had inflated FDR. If we omit DESeq2, which seems to be an outlier, the other six methods form two distinct clusters, with cluster 1 composed of wilcox, NB, MAST and Monocle, and cluster 2 composed of subject and mixed. Define $K_{ijc}$ to be the count for gene i in cell c collected from subject j, and a size factor $s_{jc}$ related to the amount of information collected from cell c in subject j ($i = 1, \ldots, G$; $c = 1, \ldots, C_j$; $j = 1, \ldots, n$). Marker detection methods were found to have unacceptable FDR due to pseudoreplication bias, in which cells from the same individual are correlated but treated as independent replicates, and pseudobulk methods were found to be too conservative, in the sense that too many differentially expressed genes were undiscovered. First, the CF and non-CF labels were permuted between subjects. The vertical axis gives the precision (PPV) and the horizontal axis gives recall (TPR).
We compared the performances of subject, wilcox and mixed for DS analysis of the scRNA-seq from healthy and IPF subjects within AT2 and AM cells using bulk RNA-seq of purified AT2 and AM cell type fractions as a gold standard, similar to the method used in Section 3.5. For each subject, the number of cells and numbers of UMIs per cell were matched to the pig data. We will create a volcano plot colouring all significant genes. Supplementary Figure S10 shows concordance between adjusted P-values for each method. In order to objectively measure the performance of our tested approaches in scRNA-seq DS analysis, we compared them to a gold standard consisting of bulk RNA-seq analysis of purified/sorted cell types. Figure 5d shows ROC and PR curves for the three scRNA-seq methods using the bulk RNA-seq as a gold standard. If a gene was not differentially expressed, the value of $\beta_{i2}$ was set to 0. In stage iii, technical variation in counts is generated from a Poisson distribution. The FindAllMarkers() function has three important arguments that provide thresholds for determining whether a gene is a marker; one of them, logfc.threshold, is the minimum log2 fold change for average expression of a gene in a cluster relative to the average expression in all other clusters combined. Further, they used flow cytometry to isolate alveolar type II (AT2) cell and alveolar macrophage (AM) fractions from the lung samples and profiled these PCTs using bulk RNA-seq. Further, subject has the highest AUPR (0.21) followed by mixed (0.14) and wilcox (0.08). We set $x_{j1} = 1$ for all j and define $x_{j2}$ as a dummy variable indicating that subject j belongs to the treated group.
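A PR curve and its area (AUPR) can be computed from a ranked gene list and gold-standard labels as follows. This is a minimal base R sketch; the helper names `pr_curve` and `aupr`, the toy data and the trapezoidal convention are assumptions, and the source does not state how its AUPR values were calculated. Ties in P-values are not treated specially here.

```r
# Precision-recall curve from per-gene P-values and gold-standard labels.
pr_curve <- function(p_values, truth) {
  ord <- order(p_values)   # most significant genes first
  truth <- truth[ord]
  tp <- cumsum(truth)      # true positives among the top-k genes
  k <- seq_along(truth)    # number of genes called at each cutoff
  data.frame(recall = tp / sum(truth), precision = tp / k)
}

# Area under the PR curve by the trapezoidal rule (one of several
# conventions). The curve is started at recall 0 with the precision
# of the top-ranked gene.
aupr <- function(pr) {
  recall <- c(0, pr$recall)
  precision <- c(pr$precision[1], pr$precision)
  sum(diff(recall) * (head(precision, -1) + tail(precision, -1)) / 2)
}

# Toy example (made-up values): 5 genes, 2 of them true positives.
p_values <- c(0.001, 0.01, 0.02, 0.2, 0.5)
truth <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
pr <- pr_curve(p_values, truth)
area <- aupr(pr)
print(pr)
print(area)
```

Because precision depends on the proportion of true positives among all genes, AUPR values are only comparable between methods evaluated against the same gold standard.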
Here, we propose a statistical model for scRNA-seq gene counts, describe a simple method for estimating model parameters and show that failing to account for additional biological variation in scRNA-seq studies can inflate false discovery rates (FDRs) of statistical tests. We have developed the software package aggregateBioVar (available on Bioconductor) to facilitate broad adoption of pseudobulk-based DE testing; aggregateBioVar includes a detailed vignette, has low code complexity and minimal dependencies and is highly interoperable with existing RNA-seq analysis software using Bioconductor core data structures. The subject method had the highest PPV, and the NB method had the lowest PPV in all nine simulation settings. It is helpful to inspect the proposed model under a simplifying assumption. In bulk RNA-seq studies, gene counts are often assumed to follow a negative binomial distribution (Hardcastle and Kelly, 2010; Leng et al., 2013; Love et al., 2014; Robinson et al., 2010). Rows correspond to different proportions of differentially expressed genes, pDE, and columns correspond to different SDs of (natural) log fold change. These methods appear to form two clusters: the cell-level methods (wilcox, NB, MAST, DESeq2 and Monocle) and the subject-level method (subject), with mixed sharing modest concordance with both clusters. One such subtype, defined by expression of CD66, was further processed by sorting basal cells according to detection of CD66 and profiling by bulk RNA-seq. A volcano plot enables quick visual identification of genes with large fold changes that are also statistically significant. The marker genes list can be a list or a dictionary. Figure 3a shows the area under the PR curve (AUPR) for each method and simulation setting.
Make sure a label exists on your cells in the metadata corresponding to treatment (before and after). You will be returned a gene list of p-values, log fold changes and other statistics. Plots a volcano plot from the output of the FindMarkers function from the Seurat package or, alternatively, the GEX_cluster_genes function.
# search for positive markers
monocyte.de.markers <- FindMarkers(pbmc, ident.1 = "CD14+ Mono", ident.2 = NULL, only.pos = TRUE)
head(monocyte.de.markers)
For higher numbers of differentially expressed genes (pDE > 0.01), the subject method had lower NPV values when the SD of the log fold change was 0.5 and similar or higher NPV values when it was greater than 0.5. Infinite p-values are set to a defined value of the highest -log(p) + 100. In a study in which a treatment has the effect of altering the composition of cells, subjects in the treatment and control groups may have different numbers of cells of each cell type. The genes detected by the subject and mixed methods have high inter-group (CF versus non-CF) and low intra-group (between subject) variability, whereas the wilcox, NB, MAST, DESeq2 and Monocle methods tend to be sensitive to a highly variable gene expression pattern from the third CF pig. You can download this dataset from SeuratData. In addition to changes to FeaturePlot(), several other plotting functions have been updated and expanded with new features, taking over the role of now-deprecated functions. The expression parameter for the difference between groups 1 and 2, $\beta_{i2}$, was varied in order to evaluate the properties of DS analysis under a number of different scenarios. We will call genes significant here if they have FDR < 0.01 and an absolute log2 fold change of at least 0.58 (equivalent to a fold change of 1.5).
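The significance rule and the handling of p-values reported as 0 can be prepared before plotting. The sketch below is illustrative only: the data frame `de` is made up, and the column names `avg_log2FC` and `p_val_adj` are assumed to follow FindMarkers-style output. It labels genes using FDR < 0.01 and an absolute log2 fold change of at least 0.58, and caps infinite -log10 values at the highest finite value plus 100.

```r
# Toy differential expression results (made-up values).
de <- data.frame(
  gene       = paste0("gene", 1:6),
  avg_log2FC = c(2.1, -1.3, 0.2, 0.9, -0.4, 3.0),
  p_val_adj  = c(1e-8, 0.004, 0.5, 0.03, 1e-5, 0)
)

fdr_cutoff <- 0.01
lfc_cutoff <- 0.58  # roughly log2(1.5), i.e. a 1.5-fold change

de$neg_log10_fdr <- -log10(de$p_val_adj)

# Adjusted P-values of 0 give Inf; cap them at the highest finite value + 100
# and remember which points were capped so they can get a different symbol.
de$capped <- is.infinite(de$neg_log10_fdr)
de$neg_log10_fdr[de$capped] <- max(de$neg_log10_fdr[!de$capped]) + 100

# Label genes by direction when they pass both thresholds.
de$status <- "not significant"
de$status[de$p_val_adj < fdr_cutoff & de$avg_log2FC >= lfc_cutoff] <- "up"
de$status[de$p_val_adj < fdr_cutoff & de$avg_log2FC <= -lfc_cutoff] <- "down"

print(de)
```

The resulting columns can be mapped directly to a scatter plot, with `avg_log2FC` on the horizontal axis, `neg_log10_fdr` on the vertical axis, `status` as colour and `capped` as point shape. The capping step assumes that at least one adjusted P-value is non-zero.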
A common use of DGE analysis for scRNA-seq data is to perform comparisons between pre-defined subsets of cells (referred to here as marker detection methods); many methods have been developed to perform this analysis (Butler et al., 2018; Delmans and Hemberg, 2016; Finak et al., 2015; Guo et al., 2015; Kharchenko et al., 2014; Korthauer et al., 2016; Miao et al., 2018; Qiu et al., 2017a, b; Wang et al., 2019; Wang and Nabavi, 2018). As scRNA-seq studies grow in scope, due to technological advances making these studies both less labor-intensive and less expensive, biological replication will become the norm. To obtain permutation P-values, we measured the proportion of permutation test statistics less than or equal to the observed test statistic, which is the permutation test statistic under the observed labels. FindMarkers: Finds markers (differentially expressed genes) for identified clusters. Among the three genes detected by subject, the genes CFTR and CD36 were detected by all methods, whereas only subject, wilcox, MAST and Monocle detected APOB. True positives were identified as those genes in the bulk RNA-seq analysis with FDR < 0.05 and |log2(IPF/healthy)| > 1. Supplementary Figure S14 shows the results of marker detection for T cells and macrophages. The wilcox, MAST and Monocle methods had intermediate performance in these nine settings. In terms of identifying the true positives, wilcox and mixed had better performance (TPR = 0.62 and 0.56, respectively) than subject (TPR = 0.34). This suggests that methods that fail to account for between-subject differences in gene expression are more sensitive to biological variation between subjects, leading to more false discoveries.
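The permutation P-value described above can be sketched for a single gene by relabeling subjects and recomputing a test statistic. In this base R illustration, the expression values, the seven-subject layout and the choice of a two-sample t-test P-value as the statistic are all assumptions; the statistic is one where smaller values are more extreme, so "less than or equal to the observed statistic" counts relabelings at least as extreme as the observed one.

```r
# Subject-level expression for one gene (made-up values): an assumed toy
# layout of seven subjects, four in group 1 and three in group 2.
expr  <- c(5.1, 4.8, 5.6, 5.0, 7.9, 8.4, 7.2)
group <- c(1, 1, 1, 1, 2, 2, 2)

# Test statistic: a two-sample t-test P-value (an assumed choice), so
# smaller values indicate stronger evidence of a group difference.
stat <- function(labels) {
  t.test(expr[labels == 1], expr[labels == 2])$p.value
}
observed <- stat(group)

# Enumerate every way of assigning 3 of the 7 subjects to group 2.
relabelings <- combn(length(expr), 3)
perm_stats <- apply(relabelings, 2, function(idx) {
  labels <- rep(1, length(expr))
  labels[idx] <- 2
  stat(labels)
})

# Permutation P-value: proportion of permutation statistics less than or
# equal to the observed one (the observed labeling is among those enumerated).
perm_p <- mean(perm_stats <= observed)
print(c(n_relabelings = length(perm_stats), perm_p = perm_p))
```

With so few subjects there are only 35 distinct relabelings, so the smallest attainable permutation P-value is 1/35, which limits how small permutation P-values can be in small studies.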
Specifically, we considered a setting in which there were two groups of subjects to compare, containing four and three subjects, respectively, with 21 731 genes. Next, we used subject, wilcox and mixed to test for differences in expression between healthy and IPF subjects within the AT2 and AM cell populations. In Supplementary Figure S14e, we find that the subject and wilcox methods produce ranked gene lists with higher frequencies of marker genes than the mixed method, with subject having a slightly higher detection of known markers than wilcox. First, it is assumed that prerequisite steps in the bioinformatic pipeline produced cells that conform to the assumptions of the proposed model. The cluster contains hundreds of computation nodes with varying numbers of processor cores and memory, but all jobs were submitted to the same job queue, ensuring that the relative computation times for these jobs were comparable. This issue is most likely to arise with rare cell types, in which few or no cells are profiled for any subject.
baseplot <- DimPlot(pbmc3k.final, reduction = "umap")
# Add custom labels and titles
baseplot + labs(title = "Clustering of 2,700 PBMCs")
Generally, tests for marker detection, such as the wilcox method, are sufficient if type I error rate control is less of a concern than type II error rate; in circumstances where type I error rate is most important, methods like subject and mixed can be used. Results for analysis of CF and non-CF pig small airway secretory cells. They also thank Paul A. Reyfman and Alexander V. Misharin for sharing bulk RNA-seq data used in this study. I have successfully installed ggplot, normalized my datasets, merged the datasets, etc., but what I do not understand is how to transfer the sequencing data to the ggplot function. Plotting multiple plots was previously achieved with the CombinePlots() function.
Further, if we assume that, for some constants $k_1$ and $k_2$, $C_j^{-1} \sum_c s_{jc} \to k_1$ and $C_j^{-1} \sum_c s_{jc}^2 \to k_2$ as $C_j \to \infty$, then the variance of $K_{ij}$ is $\mu_{ij} + \{\phi_i + o(1)\} \mu_{ij}^2$. If subjects are composed of different proportions of types A and B, DS results could be due to different cell compositions rather than different mean expression levels. This creates a data.frame with gene names as rows, and includes avg_log2FC and adjusted p-values. In the first stage of the hierarchy, gene expression for each sample is assumed to follow a gamma distribution with mean expression modeled as a function of sample-specific covariates. The results of our comparisons are shown in Figure 6. When samples correspond to different experimental subjects, the first stage characterizes biological variation in gene expression between subjects. When only 1% of genes were differentially expressed (pDE = 0.01), all methods had NPV values near 1. You can now select these cells by creating a ggplot2-based scatter plot (such as with DimPlot() or FeaturePlot()) and passing the returned plot to CellSelector(). To consider characteristics of a real dataset, we matched fixed quantities and parameters of the model to empirical values from a small airway secretory cell subset from the newborn pig data we present again in Section 3.2. In your DoHeatmap() call, you do not provide features, so the function does not know which genes/features to use for the heatmap. #' @param min_pct The minimum percentage of cells in either group to express a gene for it to be tested. Below is a brief demonstration, but please see the patchwork package website for more details and examples. Then the regression model from Section 2.1 simplifies to $\log q_{ij} = \beta_{i1} + \beta_{i2} x_{j2}$.
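To make the hierarchy concrete, the following base R sketch simulates counts for a single gene under a two-group design and aggregates them by subject. It is a rough illustration under several assumptions that are not taken from the source: the gamma form of the cell-level stage, the parameter names (`beta1`, `beta2`, `phi`, `sigma`) and every numeric value are hypothetical.

```r
set.seed(42)
n_subjects <- 7
x2 <- c(0, 0, 0, 0, 1, 1, 1)    # dummy variable: 1 = treated group
beta1 <- log(1e-4)              # baseline log expression (assumed)
beta2 <- log(2)                 # log fold change between groups (assumed)
phi <- 0.1                      # between-subject dispersion (assumed)
sigma <- 0.5                    # between-cell dispersion (assumed)
n_cells <- rep(50, n_subjects)  # cells per subject (assumed)
size_factor <- 5000             # UMIs per cell, common to all cells (assumed)

# Regression model for the mean: log q_j = beta1 + beta2 * x_j2.
q <- exp(beta1 + beta2 * x2)

# Stage i: subject-level expression, gamma distributed with mean q_j.
lambda_subject <- rgamma(n_subjects, shape = 1 / phi, scale = q * phi)

# Stages ii and iii, followed by aggregation to one count per subject.
pseudobulk <- vapply(seq_len(n_subjects), function(j) {
  # Stage ii (assumed form): cell-level expression, gamma with mean lambda_j.
  lambda_cell <- rgamma(n_cells[j], shape = 1 / sigma,
                        scale = lambda_subject[j] * sigma)
  # Stage iii: Poisson technical variation scaled by the size factor.
  k <- rpois(n_cells[j], size_factor * lambda_cell)
  as.numeric(sum(k))
}, numeric(1))

print(data.frame(subject = seq_len(n_subjects), treated = x2, count = pseudobulk))
```

Repeating the simulation with a larger `phi` spreads the subject-level counts further apart within each group, which is the between-subject variation that cell-level tests do not account for.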
In our simulation, the analysis focused on transcriptome-wide data simulated from the proposed model for scRNA-seq counts under different numbers of differentially expressed genes and different signal-to-noise ratios.