
Enhancing Rigor in Computational Methods for Biological Data Analysis

By Xinzhou Ge


Tags: Bioinformatics, Computational biology, Cross-computing tools and techniques


"When I first read the paper on the MACS algorithm as a graduate student, I was struck by a mistake in its algorithm that seemed so evident. Yet the algorithm is so popular that I even doubted myself." This was shared with me while I was in my second year of graduate studies in the Department of Statistics at UCLA. My advisor, Dr. Jingyi Jessica Li, and I were discussing problems with the MACS algorithm [1], one of the most popular computational methods for biological data analysis. By that time, I had developed my "statistical mindset" through four years of rigorous undergraduate training in classical statistics, followed by two years of advanced graduate training. Accustomed to strict mathematical rigor, I was surprised to find such an apparent flaw in a well-regarded algorithm—a flaw that seemed to leap out from the perspective of a trained statistician.

The field of computational biology, while relatively nascent compared to traditional statistics, has rapidly expanded alongside technological advancements. The advent of high-throughput genomic technologies in recent decades has generated numerous new biological data types, providing valuable resources for advancing research in human health and fostering medical breakthroughs. The analysis of these novel datasets necessitates the creation of innovative computational methodologies, which have been a driving force in the evolution of computational biology.

While many of these computational methods incorporate statistical models and may even include theoretical results, the primary users are often biologists. These users might not possess the extensive statistical knowledge needed to fully comprehend the complexities of the algorithms they are using. Consequently, when it comes to choosing an algorithm, users may prioritize one that was published first, enjoys popularity, or comes with user-friendly software over one that is statistically more rigorous. This can sometimes lead to overlooking robust statistical validation in favor of convenience or convention.

"What if we systematically identified the misused statistics in these prevalent algorithms and alerted biologists to exercise caution when utilizing them?" This simple idea quickly emerged from our discussion. Based on this thought, I started a thorough literature review and discovered that a significant number of highly cited computational biology papers did indeed exhibit methodological issues. This revelation was an indicator of the lack of rigor in the field and catalyzed my decision to pursue research aimed at fortifying the statistical rigor of biological data analysis.

Feature Screening from High-Throughput Biological Data

Our investigative journey started with the MACS algorithm, the subject of our initial discussion. We pinpointed a critical issue within this method: the generation of invalid p-values, a problem not unique to MACS but rather widespread across high-throughput biological data analyses involving two conditions.

High-throughput technologies have revolutionized the field of biology, allowing for the system-wide quantification of biological features such as genes, genomic regions, and proteins. The term "high-throughput" refers to the ability to assay a vast number of features, typically in the thousands, simultaneously. The primary objective in analyzing high-throughput datasets is to distinguish between two different conditions to identify "interesting features."

The two conditions when compared could be different cell types, cancerous versus normal tissue samples, or experimental versus control groups. Through such comparative analysis, the goal is to identify features that exhibit a significant relationship to the conditions under study. For instance, in the context of chromatin immunoprecipitation sequencing (ChIP-seq) data, an application like MACS focuses on the identification of protein-binding sites across the genome (a process known as "peak calling") [1, 2]. ChIP-seq technology quantifies the binding affinity between proteins and DNA across various genomic regions. By contrasting ChIP-seq results from an experimental sample against a control, scientists aim to identify genomic regions with elevated interaction intensities in the experimental condition as opposed to the control condition. These regions are considered "interesting" as their heightened activity is presumed to be condition specific.

Another application of feature screening in high-throughput biological data is the detection of differentially expressed genes (DEGs) using genome-wide expression profiling techniques like microarrays and RNA sequencing (RNA-seq) [3, 4]. RNA-seq, which leverages the power of next-generation sequencing technologies, offers a comprehensive snapshot of the transcriptome, capturing the expression levels of genes within a given sample.

The central pursuit of DEG analysis is to identify genes that exhibit significant expression differences between two comparative conditions. This analysis provides insights into the genes that may be upregulated or downregulated in response to specific biological states or experimental manipulations, thereby shedding light on the molecular mechanisms underlying those conditions.

False Discovery Rate Control

In scientific research, the interesting features constitute only a small proportion of all features, and the remaining majority is referred to as "uninteresting features." The identified features, also called the "discoveries" from feature screening analysis, are subject to further investigation and validation. For example, if DEG analysis identifies an interesting gene that is differentially expressed between treated and untreated patient samples, this gene may be related to the treatment, and we can further validate this discovery using biological experiments. Hence, to reduce experimental validation, which is often laborious and expensive, researchers require reliable discoveries that contain few false ones.

Given that experimental validation can be resource-intensive and costly, there is a critical need for the initial discovery process to be as precise as possible, minimizing the inclusion of false discoveries. Accordingly, the false discovery rate (FDR) has been developed as a statistical criterion for ensuring the reliability of discoveries [5]. Technically, the FDR is defined as the expected proportion of uninteresting features among the discoveries. FDR control refers to the goal of finding discoveries such that the FDR stays under a pre-specified threshold (e.g., 0.05). By controlling the FDR at a small value, biologists aim to ensure that their discoveries are as reliable as possible.

The Problem of Invalid P-Values

Existing computational methods for FDR control primarily rely on p-values, one per feature. Among the p-value-based FDR control methods, the most classic and popular ones are the Benjamini-Hochberg (BH) procedure [5] and Storey's q-value [6]. All of these methods set a p-value cutoff based on the pre-specified FDR threshold. However, the computation of p-values requires certain distributional assumptions that may not always hold in the biological context, or it requires a substantial number of replicates, which can be impractical due to constraints in resources or sample availability. These limitations highlight a critical challenge in applying p-value-based methods to high-throughput biological data analysis.
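To make the p-value-based workflow concrete, here is a minimal sketch of the BH procedure in Python, written from its textbook description rather than taken from any particular package. The per-feature p-values are assumed to come from some upstream test, and the simulated p-values at the end are purely illustrative.

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Benjamini-Hochberg procedure: return a boolean mask of discoveries.

    pvals : 1-D array of per-feature p-values (one per gene/region).
    fdr   : target false discovery rate, e.g. 0.05.
    """
    pvals = np.asarray(pvals)
    m = pvals.size
    order = np.argsort(pvals)                  # sort p-values ascending
    sorted_p = pvals[order]
    # BH compares the k-th smallest p-value with (k/m) * fdr
    thresholds = (np.arange(1, m + 1) / m) * fdr
    below = sorted_p <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.max(np.nonzero(below)[0])   # largest k satisfying the bound
        discoveries[order[:k_max + 1]] = True  # reject all features up to k_max
    return discoveries

# Illustrative example: 10,000 "genes", most with uniform (null) p-values
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=9800), rng.beta(0.1, 10, size=200)])
print(benjamini_hochberg(p, fdr=0.05).sum(), "discoveries at 5% FDR")
```

The procedure is only as good as its input: if the upstream p-values are invalid, the BH cutoff no longer guarantees the target FDR, which is exactly the failure mode discussed next.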

For instance, in peak calling analysis, the MACS algorithm assumes the data follow a Poisson distribution—an assumption that is often contested by empirical observations. Additionally, when comparing the measurements from the two conditions, MACS treats the data from the negative control sample as fixed, and only the measurements from the experimental condition are considered random. This leads to the calculation of p-values using a one-sample test, even though the situation calls for a two-sample test, given that measurements from both the experimental and control samples are random variables. This misalignment between the model's assumptions and the actual test leads to invalid p-values.
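To illustrate the distinction with a deliberately simplified model (this is not MACS's actual implementation, and the counts are made up), the sketch below contrasts a one-sample Poisson p-value, which treats the control count as a fixed rate, with a two-sample comparison that treats both counts as random; conditioning on the total and using a binomial test is one standard way to compare two Poisson rates.

```python
from scipy import stats

# Read counts in one genomic window (illustrative numbers only)
treat_count = 30     # experimental (ChIP) sample
control_count = 15   # control (input) sample

# One-sample view: treat the control count as a *fixed* Poisson rate
# and ask how extreme the experimental count is under that rate.
p_one_sample = stats.poisson.sf(treat_count - 1, mu=control_count)

# Two-sample view: both counts are random. Conditional on the total,
# the treatment count is Binomial(n = total, p = 0.5) under the null
# of equal rates (assuming equal sequencing depth for simplicity).
total = treat_count + control_count
p_two_sample = stats.binomtest(treat_count, n=total, p=0.5,
                               alternative="greater").pvalue

print(f"one-sample Poisson p-value: {p_one_sample:.4f}")
print(f"two-sample (conditional binomial) p-value: {p_two_sample:.4f}")
```

With counts like these, the one-sample calculation reports a much smaller p-value than the two-sample one, which is the direction of error that inflates discoveries.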

The rationale behind MACS's model selection is somewhat understandable, as peak calling analyses typically involve only a single sample from each condition, making a two-sample test practically infeasible with such limited data. The developers of the algorithm were likely compelled to make compromises due to these constraints, a fact that has historically been overlooked by users.

Similarly, in DEG analysis, popular algorithms like DESeq2 and edgeR assume that gene expression measurements follow a negative binomial distribution. This assumption, however, does not hold in the presence of outliers, which are not uncommon in biological data. Outliers can violate the assumed distribution and thereby cause invalid p-values.

A P-Value-Free FDR Control Framework

These scenarios illustrate the difficulties of getting valid p-values in high-throughput biological data analysis. Often, bioinformatics tools produce ill-posed p-values, which can lead to unreliable FDR control or a lack of power to detect true biological signals. Since p-values are intermediary tools and the ultimate objective is precise FDR control, one possible solution is to control the FDR without using p-values.

Inspired by the Barber-Candès (BC) procedure [7], which theoretically controls the FDR without reliance on p-values, in 2021 we introduced a comprehensive statistical framework named Clipper [8]. Clipper aims to provide reliable FDR control for high-throughput biological data analysis, without the need for p-values or specific assumptions about data distributions. It is a robust and adaptable framework, suitable for an array of feature screening analyses on high-throughput biological data with diverse traits, such as varying distributions, numbers of replicates, and the presence of outliers.

Clipper has two principal steps: construction and thresholding of contrast scores. First, Clipper assigns a contrast score to each feature—for example, to each gene—serving as an alternative to a p-value. This score summarizes the feature's measurements between two conditions and reflects the feature's level of interest. Second, as its name suggests, Clipper establishes a cutoff on features' contrast scores and calls as discoveries the features whose contrast scores exceed the cutoff. Clipper is a flexible framework that only requires a minimal input: all features' measurements under two conditions and a target FDR threshold (e.g., 5%).
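As a rough illustration of the idea, here is a minimal sketch in the spirit of Clipper's contrast scores and BC-style thresholding, not Clipper's actual implementation: it assumes one replicate per condition, takes the contrast score to be a simple difference between the two conditions' measurements, and uses simulated Gaussian scores only to show the mechanics.

```python
import numpy as np

def bc_style_threshold(contrast, fdr=0.05):
    """Pick a cutoff t on contrast scores so that the estimated FDR,
    (1 + #{C <= -t}) / max(1, #{C >= t}), is at most the target.

    contrast : 1-D array of contrast scores (e.g., experimental minus
               control measurement for each feature); large positive
               values indicate interesting features, and the estimate
               relies on null scores being roughly symmetric about zero.
    """
    contrast = np.asarray(contrast)
    candidates = np.sort(np.abs(contrast[contrast != 0]))
    for t in candidates:                      # scan cutoffs from small to large
        fdr_hat = (1 + np.sum(contrast <= -t)) / max(1, np.sum(contrast >= t))
        if fdr_hat <= fdr:
            return t
    return np.inf                             # no cutoff achieves the target

# Illustrative example: 10,000 features, 300 with a true positive shift
rng = np.random.default_rng(1)
scores = rng.normal(0, 1, size=10_000)
scores[:300] += 3.0
t = bc_style_threshold(scores, fdr=0.05)
print("cutoff:", t, "discoveries:", int(np.sum(scores >= t)))
```

The key design point is that negative contrast scores act as an internal negative control for estimating the number of false positives among the large positive scores, so no p-value or distributional model is needed.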

Requiring only basic statistical assumptions that are commonly accepted in bioinformatics, Clipper provides a theoretical framework with an FDR control guarantee, applicable to feature screening analyses regardless of the number of biological samples involved. This positions Clipper as a pioneering approach for ensuring robust and transparent discovery in high-throughput biological analyses.

Inflated False Discoveries on Real-Data Examples

Despite our demonstration that computational methods yielding invalid p-values may produce a significant number of false discoveries, we've encountered a notable divergence between the perspectives of statisticians and biologists. From a statistician's standpoint, any rate of false positives exceeding the target threshold is a serious concern. However, dialogues with some biologists have revealed a different viewpoint. They often express a degree of acceptance toward a marginally higher false discovery rate, for instance, 10% instead of the targeted 5%, questioning the extent of its severity.

This feedback brought us to the realization that biologists tend to trust empirical evidence over theoretical results. Therefore, to convincingly articulate the importance of rigorous FDR control, we recognized the need to present more concrete examples.

A compelling opportunity to illustrate the pitfalls of false discoveries came through our collaborative work with Dr. Wei Li's laboratory at the University of California, Irvine. In this partnership, we conducted a thorough analysis using DESeq2 and edgeR on many real population-level RNA-seq datasets [9, 10, 11] to assess their efficacy in identifying DEGs between two conditions.

One of the more striking findings emerged from an immunotherapy dataset, which included samples from 51 untreated and 58 treated patients undergoing anti-PD-1 therapy [9]. DESeq2 and edgeR identified DEGs with a mere 8% concordance. This low overlap prompted us to investigate whether the methods were truly controlling the FDR within the desired 5% threshold.

To explore this, we first generated many negative-control datasets by randomly permuting the two-condition labels (pre-nivolumab and on-nivolumab) of the 109 RNA-seq samples in this immunotherapy dataset. Since any DEGs identified from these permuted datasets are, by construction, false positives, we used the permuted datasets to evaluate the FDRs of DESeq2 and edgeR. Surprisingly, DESeq2 and edgeR had 84.88% and 78.89% chances, respectively, of identifying more DEGs from a permuted dataset than from the original dataset. These results raised concerns about exaggerated false positives identified by DESeq2 and edgeR on the original dataset.
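The logic of that check can be sketched as follows; `call_degs` is a hypothetical placeholder for whatever DEG caller is being evaluated (in our study, DESeq2 and edgeR were run through their own R interfaces), and the permutation count is illustrative.

```python
import numpy as np

def permutation_check(counts, labels, call_degs, n_perm=1000, seed=0):
    """Compare the number of DEGs found on the original labels with the
    numbers found on label-permuted (negative-control) datasets.

    counts    : genes-by-samples count matrix.
    labels    : array of two-condition labels, one per sample.
    call_degs : hypothetical callable (counts, labels) -> set of DEG indices;
                stands in for DESeq2/edgeR, which we ran from R in practice.
    """
    rng = np.random.default_rng(seed)
    n_original = len(call_degs(counts, labels))
    n_permuted = []
    for _ in range(n_perm):
        permuted = rng.permutation(labels)     # break any true label-gene link
        n_permuted.append(len(call_degs(counts, permuted)))
    n_permuted = np.asarray(n_permuted)
    # Any DEG called on a permuted dataset is, by construction, a false positive.
    frac_exceeding = np.mean(n_permuted > n_original)
    return n_original, n_permuted, frac_exceeding
```

The fraction returned at the end is the quantity reported above: how often a method finds more "discoveries" in label-scrambled data than in the real data.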

Adding to the intrigue, we observed that the genes with larger fold changes—often presumed by biologists to be more credible DEGs—were more frequently flagged as significant in the permuted datasets by both methods. This raises concerns about the potential misdirection of valuable experimental validation resources towards these likely false positives.

Out of curiosity and as a means of verification, we investigated the biological functions of the spurious DEGs identified by DESeq2 or edgeR from the permuted datasets. Unexpectedly, the top five enriched gene ontology (GO) terms of these spurious DEGs included immune-related terms. Hence, if these spurious DEGs were not removed by FDR control, they would mislead researchers into believing there was an immune-response difference between pre-nivolumab and on-nivolumab patients, an undoubtedly undesirable consequence that DEG analysis must avoid.

This investigation of false discoveries in DEG identification methods underscores the critical importance of robust statistical practices in computational biology. Inflated false discovery rates in DEG analysis may lead to substantial downstream consequences, including misdirected research efforts and misguided therapeutic strategies. To mitigate these problems, we recommend conducting a preliminary "sanity check" on algorithms prior to their application and employing non-parametric statistical tests when dealing with datasets with large sample sizes. These steps are crucial for ensuring the validity and reliability of computational methods in the biological sciences, ultimately safeguarding the integrity of scientific discovery and application.
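For large-sample datasets, one such non-parametric option is a per-gene Wilcoxon rank-sum test followed by BH correction, sketched below under simplifying assumptions: normalization, gene filtering, and other preprocessing are omitted, and the function name is ours.

```python
import numpy as np
from scipy import stats

def wilcoxon_deg_screen(counts, labels, fdr=0.05):
    """Per-gene Wilcoxon rank-sum test between two conditions, with BH correction.

    counts : genes-by-samples matrix of (normalized) expression values.
    labels : boolean array, True for condition A samples, False for condition B.
    """
    counts = np.asarray(counts)
    labels = np.asarray(labels, dtype=bool)
    group_a = counts[:, labels]
    group_b = counts[:, ~labels]
    # One rank-sum test per gene; no distributional model for the counts.
    pvals = np.array([
        stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        for a, b in zip(group_a, group_b)
    ])
    # Benjamini-Hochberg adjustment (same procedure as sketched earlier)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = (np.arange(1, m + 1) / m) * fdr
    below = pvals[order] <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.max(np.nonzero(below)[0])
        discoveries[order[:k_max + 1]] = True
    return pvals, discoveries
```

Because the rank-sum test makes no parametric assumption about the expression distribution, it is far less sensitive to outliers than negative-binomial-based tests, at the cost of requiring enough samples per condition to have power.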

What Now?

The publication of our findings has sparked widespread interest and discussion [12]. We've received feedback from biologists who were unaware of the potential flaws in the computational tools they've relied on for years. Some algorithm developers have reached out to discuss enhanced versions of their methods that may mitigate the issue of false discoveries. These ongoing conversations are vital, as they contribute to a growing recognition of the need for statistical rigor in computational methodologies.

The application of computational methods has expanded to various disciplines beyond the realms of computer science and statistics. Yet, many users, including biologists, may not fully grasp the underlying algorithms, their assumptions, or the specific conditions under which they are valid. Take, for example, DESeq2 and edgeR, which were initially designed for small-sample datasets at a time when sequencing data was very expensive. Despite the evolution of data acquisition and the increase in sample sizes, these methods remain in widespread use, even when more appropriate solutions exist. Misapplication of these tools can lead to questionable biological conclusions. As statisticians, we feel it is our duty to communicate these ideas to users of computational methods, thereby reinforcing the integrity of research in the biomedical sciences.

It has been more than five years since my advisor, Jessica, and I first discussed the subject of rigor in computational methods. Since then, we have both advanced in our careers—Jessica to a full professorship at UCLA Statistics and myself to an assistant professorship at Oregon State University. Our commitment to identifying and correcting the misuse of statistical models in biological data analysis remains, and we will continue to work toward enhancing the rigor of computational methods, striving to improve the reliability of outcomes in the field of biomedical science.

References

[1] Zhang, Y. et al. Model-based analysis of ChIP-seq (MACS). Genome Biology 9, R137 (2008), 1–9.

[2] Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell 38, 4 (2010), 576–589.

[3] Robinson, M. D., McCarthy, D. J., and Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 1 (2010), 139–140.

[4] Love, M. I., Huber, W., and Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).

[5] Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 1 (1995), 289–300.

[6] Storey, J. D. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B 64, 3 (2002), 479–498.

[7] Barber, R. F. and Candès, E. J. Controlling the false discovery rate via knockoffs. The Annals of Statistics 43, 5 (2015), 2055–2085.

[8] Ge, X. et al. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biology 22, 288 (2021).

[9] Riaz, N. et al. Tumor and microenvironment evolution during immunotherapy with Nivolumab. Cell 171, 4 (2017), 934–949.e16.

[10] The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 6509 (2020), 1318–1330.

[11] The Cancer Genome Atlas Research Network, Weinstein, J. N., et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics 45, 10 (2013), 1113–1120.

[12] Li, Y., Ge, X., Peng, F., Li, W., and Li, J. J. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biology 23, 1 (2022), 79.

Author

Xinzhou Ge is an assistant professor in the Department of Statistics at Oregon State University. He received his Ph.D. in statistics from UCLA under the supervision of Dr. Jingyi Jessica Li. His research interests include developing statistical models and enhancing statistical rigor in high-throughput biological data analysis.


Copyright is held by the owner/author(s). Publication rights licensed to ACM.
