
The innocent p value has been abused to within an inch of its life. It’s been misused, misinterpreted, miscommunicated, and misunderstood. But missing it’s not: researchers, journal editors, journalists, and readers have become ever more dependent on a simple p value to make snap judgments of good or bad. Yet using the p value as it has traditionally been used in research fundamentally miscasts the metric and undermines the validity and reproducibility of the very science it is meant to support.

In an effort to shed light on the plight of the poor p value and the damage its misuse appears to be doing to the scientific endeavor, the American Statistical Association (ASA), for the first time in its 177-year history, released a position statement on a specific matter of statistical practice.1

“We are really looking to have a conversation […] with lots of people about how to take what we know about what is good using p values and what is not,” said ASA executive director Ronald L. Wasserstein, PhD, in an interview with CardioSource WorldNews. What they want, he added, is to “more effectively pass that along to scientists everywhere so that we can do a better job of making inferences using statistics.”
After months of debate among a group of experts representing a wide variety of viewpoints, the ASA released its statement in March 2016, conceding that the hard-fought ‘agreement’ among the experts breaks no new ground but rather frames some of the issues that have been debated for years. The statement comprises six guiding principles with accompanying explanations that seek to bring clarity to the issue. (See Six P-rinciples sidebar.)
According to the ASA, “The issues touched on here affect not only research, but research funding, journal practices, career advancement, scientific education, public policy, journalism, and law.” The question (and we don’t mean hypothesis-generating): how can the system be significantly changed?
p < 0.05: Not a Yes, Rather a Firm “Maybe”

“P values are a tremendously valuable tool,” said Stuart Pocock, MSc, PhD, from the London School of Hygiene and Tropical Medicine, in an interview with CSWN. “There are those who think we should get rid of them and one journal actually banned p values for a while, which is a bit over the top. The change needs to be around how we evaluate p values.”
Dr. Pocock is a leading name in statistics in cardiology and has collaborated on numerous important trials. In late 2015, he authored a four-part practical guide to the essentials of statistical analysis and reporting of randomized clinical trials (RCTs) in the Journal of the American College of Cardiology,2-5 which included extensive discussion of statistical controversies in RCT reporting and interpretation.
Because most readers of scientific papers understand statistics less well than statisticians do (and, let’s face it, most readers are too busy and distracted to dig deeply into the science of statistical probability), a clear trend has been to reduce all data analysis down to whether the p was significant. This kind of “bright line” (such as p < 0.05) conclusion making, said the ASA, can lead to poor choices. “A conclusion does not immediately become ‘true’ on one side of the divide and ‘false’ on the other.”
Rather, “there is a difference between there being a bright line at which at some point you have to make a decision and there being a bright line about how much you are learning from a particular piece of evidence,” said Dr. Wasserstein.
Ironically, the p value was never meant to determine good from bad or true from false. When first introduced in the 1920s by the British statistician Ronald Fisher in his book Statistical Methods for Research Workers, the idea was simply to use the test to determine whether the data were worthy of further analysis and not a product of randomness. (It probably didn’t help his argument when his effort was dubbed Fisher’s exact test; but, face it, Fisher’s worthiness test doesn’t have quite the same ring of authority.)
According to Fisher, the p value is “the probability of the observed result, plus more extreme results, if the null hypothesis were true.” There’s that word “probability” again.
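To make Fisher’s definition concrete, here is a minimal sketch in Python (a hypothetical coin-flip example of our own, not drawn from the article or from any trial): suppose 20 flips of a coin yield 15 heads, and the null hypothesis is that the coin is fair.

    # A minimal sketch, assuming a hypothetical experiment: 20 coin flips
    # yield 15 heads; the null hypothesis is that the coin is fair.
    from scipy.stats import binom

    n, k = 20, 15                       # flips performed, heads observed
    # Fisher's definition: probability of the observed result plus more
    # extreme results, computed under the null hypothesis.
    p_value = binom.sf(k - 1, n, 0.5)   # P(X >= 15) for a fair coin
    print(f"p = {p_value:.4f}")         # prints p = 0.0207

A p of about 0.02 says only that 15 or more heads would be unusual if the coin were fair; in Fisher’s original spirit, that flags the data as worth a closer look, nothing more.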
“Fisher offered the idea of p values as a means of protecting researchers from declaring truth based on patterns in noise,” wrote Andrew Gelman, PhD, and Eric Loken, PhD, in American Scientist.6 The title of their 2014 paper: The Statistical Crisis in Science. “In an ironic twist, p values are now often used to lend credence to noisy claims based on small samples.”
Dr. Gelman is the director of the Applied Statistics Center at Columbia University; Dr. Loken is a research associate professor at Pennsylvania State University.
Who Crowned 0.05 King Anyway?

Importantly, we need to keep in mind that the p value is not the probability of the null hypothesis being true. Nor is it correct to say that if the p value is greater than 0.05, then the null hypothesis is true. (Dr. Pocock calls this widespread misinterpretation “utter rubbish.”) Rather, a nonsignificant p value means that the data have provided little or no evidence that the null hypothesis is false. Certainly, when the p value is very close to the 0.05 standard cutoff, it’s easy to see that further analysis or data are needed to differentiate whether there is a null effect or just a small effect.
Nor is it necessarily true that p values provide a measure of the strength of the evidence against the null hypothesis. Some argue that while a p value of 0.05 does not provide strong evidence against the null hypothesis, it is reasonable to say that a p value < 0.001 does. However, others have cautioned that because p values are dependent on sample size, a p value of 0.001 should not be interpreted as providing more support for rejecting the null hypothesis than one of 0.05.
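The sample-size caveat is easy to demonstrate with a simulation (our own illustrative sketch, not an analysis from any cited study): the same true effect is tested at two sample sizes.

    # A simulation sketch, assuming a fixed true difference of 0.3 standard
    # deviations between two groups; only the sample size changes.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng()        # results vary slightly run to run
    for n in (25, 2500):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(0.3, 1.0, n)    # identical effect in both runs
        stat, p = ttest_ind(treated, control)
        print(f"n per group = {n:4d}  p = {p:.4g}")
    # The large trial almost always yields a tiny p value, while the small
    # trial, built on the identical true effect, usually misses p < 0.05.

Which is the point: a p of 0.001 may simply mean the sample was large, not that the evidence against the null is stronger.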
“Making decisions with such limited flexibility is usually neither realistic nor prudent,” wrote Journal of the American Medical Association senior editor Demetrios N. Kyriacou, MD, PhD, from Northwestern University Feinberg School of Medicine, Chicago, IL, in a recent editorial.7
“For example, it would be unreasonable to decide that a new cancer medication was ineffective because the calculated p value from a phase II trial was 0.051 and the predetermined level of statistical significance was considered to be less than 0.05.”
Added Dr. Wasserstein: “Part of our message is that p = 0.049 is not qualitatively different in any respect from p = 0.05 or 0.051.” Furthermore, what sometimes appears to be done when the data yield a p = 0.05 or 0.051 “is to play with your analysis a little bit until you get something a little safer.”
If we fix the p value issue, he noted, that tinkering wouldn’t be necessary. Say the result garners a p = 0.08 but the effect size is “massive”; then, he said, “you still have something to talk about, you don’t have anything to excuse anymore, and you don’t have to try to talk your way out of p = 0.08.”
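A toy calculation (hypothetical numbers, not from the article) shows just how little separates the two sides of the bright line:

    # A toy sketch, assuming 100 coin flips: a single extra head is all it
    # takes to cross the 0.05 threshold.
    from scipy.stats import binomtest

    for heads in (58, 59):
        result = binomtest(heads, n=100, p=0.5, alternative="greater")
        print(f"{heads}/100 heads: p = {result.pvalue:.4f}")
    # 58/100 heads gives p of roughly 0.067; 59/100 gives roughly 0.044.
    # Nearly identical data land on opposite sides of the cutoff.

One head out of a hundred is the entire difference between “significant” and “not.”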
Six P-rinciples

1. P values can indicate how incompatible the data are with a specified statistical model.

2. P values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p value does not provide a good measure of evidence regarding a model or hypothesis.