Christopher Lance
Best practice science in the age of irreplicability
Written by Thomas Wallis
Replication, the ability to find the same effect repeatedly in independent experiments, is a cornerstone of experimental science. However, the published literature is likely full of irreplicable results [1, 2]. There are many reasons for this problem, but the root cause is arguably that the incentive structure of science has selected for flashiness and surprise rather than for truth and rigour. Authors who publish in high-impact journals tend to be rewarded with jobs, grants, and career success, whether or not the result turns out to actually be replicable [3].
These incentives can facilitate poor experimental and statistical practices that make faulty conclusions more likely. Jens Klinzing wrote a nice overview of these issues for the last edition of Neuromag [4]. In this article, I’m going to take his lead and discuss in more detail a few practices that you can incorporate into your work, now and in the future, to help ensure the quality of scientific output.
The garden of forking paths
I believe that the vast majority of scientists are honestly trying to do the best and most accurate science they can. One of the most startling realisations I have had over the past few years, however, is how easy it is for even well-intentioned researchers to unconsciously mislead themselves (and thus also the larger scientific community) [5].
In practice, it’s almost always the case that numerous decisions about how to test the research question of interest are made during or after data collection. By allowing our analyses to depend on the particular data we observe in an experiment, we invite the possibility that we are shining a spotlight on noise: random fluctuations that are a property of this dataset and not a replicable property of the world at large.
In an article you should definitely read, Gelman and Loken characterise this as walking through a “garden of forking paths” [6]. The point this article makes, which should give us all pause, is that even if you do not sit and try a bunch of different analyses until you find the one that “worked” (i.e. gave the result you wanted), you might still be on thin inferential ice. Given a different dataset (but the same experiment), you might have done a different analysis and possibly drawn a different conclusion.
Distinguishing exploratory and confirmatory experiments
When your analyses depend on your data, you are conducting exploratory research. A confirmatory test, in contrast, is one in which everything is pre-specified before data collection. Among other things, this distinction is crucial for the interpretation of p-values. The fabled “0.05” cutoff should, in theory, ensure that across a body of literature, 5% or fewer of the tests in which the null is actually true are declared “significant” (i.e. are false positives). However, p-values only correspond to their nominal false-positive rates for confirmatory research: when your hypotheses, design and analysis plan are defined before data collection. For exploratory analyses the true false-positive rates can be far higher (see Jens’ article in the last Neuromag).
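To make the gap between nominal and true false-positive rates concrete, here is a minimal simulation sketch. It is an illustration, not an analysis taken from any of the cited articles. Every detail is an assumption chosen for simplicity: two groups with no real difference, five independent outcome measures per experiment, and a z-test with known unit variance. The "confirmatory" analysis always tests the first, pre-specified outcome. The "exploratory" analysis reports whichever outcome gives the smallest p-value. This is the blunt version of the problem; the forking-paths argument is that subtler data-dependent choices can inflate error rates in a similar way.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
STD_NORMAL = NormalDist()  # standard normal, used to turn z into a p-value


def z_test_p_value(group_a, group_b):
    """Two-sided z-test for a difference in means.

    Assumes equal group sizes and a known variance of 1 in both groups
    (a simplification made only for this illustration).
    """
    n = len(group_a)
    standard_error = (2.0 / n) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate(n_experiments=2000, n_per_group=30, n_outcomes=5, seed=1):
    """Return (confirmatory, exploratory) false-positive rates under a true null."""
    rng = random.Random(seed)
    confirmatory_hits = 0
    exploratory_hits = 0
    for _ in range(n_experiments):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: the null is true.
            a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            p_values.append(z_test_p_value(a, b))
        # Confirmatory: only the outcome chosen before seeing the data.
        confirmatory_hits += p_values[0] < ALPHA
        # Exploratory: whichever outcome happened to look best.
        exploratory_hits += min(p_values) < ALPHA
    return confirmatory_hits / n_experiments, exploratory_hits / n_experiments


confirmatory_rate, exploratory_rate = simulate()
print(f"confirmatory false-positive rate: {confirmatory_rate:.3f}")
print(f"exploratory false-positive rate:  {exploratory_rate:.3f}")
```

Under these assumptions the confirmatory rate should land near the nominal 5%, while the exploratory rate should land near 1 - 0.95^5, or roughly 23%, even though no single test was carried out incorrectly.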
Currently , exploratory research is almost always presented as if it is con-
16 | NEUROMAG | May 2017