An evaluation of machine learning and classical statistical methods for discovery in large-scale translational data


A recent paper in the journal Nature Methods, "Statistics versus machine learning" by Bzdok, Altman, and Krzywinski, concludes that random forests correctly identify a greater proportion of truly dysregulated genes than traditional methods based on p-value adjustments such as Benjamini-Hochberg. However, the random forests approach, as implemented, requires prior knowledge of the number of truly dysregulated genes, information that is not available in practice. Thus, the resulting conclusions are biased against traditional methods. In related work, second-generation p-values have recently been proposed for large-scale inference. We evaluate the viability and accuracy of this approach as well. Simulation studies of microarray gene expression data are used, and we maintain the original simulation structure proposed by Bzdok, Altman, and Krzywinski. The methods compared for identifying dysregulated genes in the simulated data were unadjusted p-values, Benjamini-Hochberg adjusted p-values, random forest importance levels, and second-generation p-values. When all methods are given the same prior information, second-generation p-values generally outperform all other methods. The finding that random forest importance levels outperform classical methods was neither reproduced nor validated in our examination. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation. In this context, machine learning methods fail to outperform standard methods and, given their additional complexity, are not the optimal approach.
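To illustrate the contrast between unadjusted and Benjamini-Hochberg adjusted p-values mentioned above, here is a minimal Python sketch of the standard Benjamini-Hochberg step-up rule. The function names and the example p-values are hypothetical, chosen for illustration only; they are not taken from the simulation code behind this talk.

```python
def unadjusted(pvalues, alpha=0.05):
    """Flag every test whose raw p-value is at or below alpha."""
    return [p <= alpha for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up rule at false discovery rate alpha.

    Returns a list of booleans, in the original order, marking which
    hypotheses are rejected.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten genes (illustrative only).
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.060, 0.074, 0.205, 0.212, 0.216]

n_unadjusted = sum(unadjusted(example_pvalues))
n_bh = sum(benjamini_hochberg(example_pvalues))
print(n_unadjusted, n_bh)
```

On these illustrative values the unadjusted rule flags more genes than the Benjamini-Hochberg rule, which is the usual trade-off between the number of discoveries and control of the false discovery rate.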

Mar 10, 2019 12:00 AM
Washington, DC
Megan Hollister Murray
PhD Candidate and Research Assistant in Biostatistics