

Best practice science in the age of irreplicability

Written by Thomas Wallis
Replication – the ability to find the same effect repeatedly in independent experiments – is a cornerstone of experimental science. However, the published literature is likely full of irreplicable results [1, 2]. There are many reasons for this problem, but the root cause is arguably that the incentive structure of science has selected for flashiness and surprise rather than for truth and rigour. Authors who publish in high-impact journals tend to be rewarded with jobs, grants, and career success, whether or not the result turns out to actually be replicable [3].
These incentives can facilitate poor experimental and statistical practices that make faulty conclusions more likely. Jens Klinzing wrote a nice overview of these issues for the last edition of Neuromag [4]. In this article, I’m going to take his lead and discuss in more detail a few practices that you can incorporate into your work, now and in the future, to help ensure the quality of scientific output.
The garden of forking paths
I believe that the vast majority of scientists are honestly trying to do the best and most accurate science they can. One of the most startling realisations I have had over the past few years, however, is how easy it is for even well-intentioned researchers to unconsciously mislead themselves (and thus also the larger scientific community) [5].
In practice, it’s almost always the case that numerous decisions about how to test the research question of interest are made during or after data collection. By allowing our analyses to depend on the particular data we observe in an experiment, we invite the possibility that we are shining a spotlight on noise: random fluctuations that are a property of this dataset and not a replicable property of the world at large.
In an article you should definitely read, Gelman and Loken characterise this as walking through a “garden of forking paths” [6]. The point this article makes, which should give us all pause, is that even if you do not sit and try a bunch of different analyses until you find the one that “worked” (i.e. gave the result you wanted), you might still be on thin inferential ice. Given a different dataset (but the same experiment), you might have done a different analysis and possibly drawn a different conclusion.
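To see how little it takes, here is a minimal simulation sketch in Python (using NumPy and SciPy; the design, sample sizes and the particular “fork” are my own invented illustration, not anything taken from Gelman and Loken). Two groups are drawn from the same distribution, so there is no true effect, and the only researcher degree of freedom is one apparently reasonable decision made after seeing the data: if the planned test comes up non-significant, exclude extreme values and test again.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 10_000
n_per_group = 20
false_positives = 0

for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, n_per_group)  # group A: pure noise, no true effect
    b = rng.normal(0.0, 1.0, n_per_group)  # group B: pure noise, no true effect

    # Planned analysis: a plain two-sample t-test.
    p = stats.ttest_ind(a, b).pvalue

    # Forking path: only when that test "fails", the researcher notices some
    # extreme values, excludes anything beyond 2 SD, and re-runs the test.
    if p >= 0.05:
        a_trim = a[np.abs(a - a.mean()) < 2 * a.std()]
        b_trim = b[np.abs(b - b.mean()) < 2 * b.std()]
        p = stats.ttest_ind(a_trim, b_trim).pvalue

    false_positives += p < 0.05

print("long-run rate of p < .05:", false_positives / n_experiments)
# Comes out above the nominal 0.05, even though nobody consciously
# tried lots of tests -- the analysis simply depended on the data.

A single data-dependent decision is enough to buy extra chances at a spurious “significant” result; in a real analysis pipeline there are usually dozens of such decisions.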
Distinguishing exploratory and confirmatory experiments
When your analyses depend on your data, you are conducting exploratory research. A confirmatory test, in contrast, is one in which everything is pre-specified before data collection. Among other things, this distinction is crucial for the interpretation of p-values. The fabled “0.05” cutoff should, in theory, ensure that no more than 5% of findings declared “significant” across a body of literature are false positives (i.e. cases where the null is actually true). However, p-values only correspond to their nominal false-positive rates for confirmatory research – when your hypotheses, design and analysis plan are defined before data collection. For exploratory analyses, the true false-positive rate can be far higher (see Jens’ article in the last Neuromag).
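A quick back-of-the-envelope sketch shows how fast this escalates. If, purely for illustration, we treat k candidate analyses of pure noise as roughly independent, the chance that at least one of them comes out “significant” at the 0.05 level is 1 - 0.95^k (real analyses of the same data are correlated, so the exact numbers will differ, but the direction does not):

# Chance of at least one false positive among k roughly independent
# candidate analyses, each tested at alpha = 0.05.
for k in (1, 3, 5, 10, 20):
    print(f"{k:2d} candidate analyses -> P(at least one p < .05) = {1 - 0.95 ** k:.2f}")

# k =  1 (a single pre-specified, confirmatory test): 0.05
# k =  5: about 0.23
# k = 20: about 0.64

With only a handful of plausible-looking analysis options, the effective false-positive rate is already several times the nominal 5%.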
Currently, exploratory research is almost always presented as if it is confirmatory