Around the Reproducibility of Scientific Research in the Big Data Era: A Knockoff Filter for Controlling the False Discovery Rate
Emmanuel Candès
Big Data and Computational Scalability, University of Warwick, July 2015
Collaborator
Rina Foygel Barber
A crisis? The Economist: Oct 19th, 2013
Problems with scientific research
How science goes wrong
Scientific research has changed the world. Now it needs to change itself
Oct 19th 2013 | From the print edition
A SIMPLE idea underpins science: “trust, but verify”. Results should always be subject to challenge from experiment. That simple but powerful idea has generated a vast body of knowledge. Since its birth in the 17th century, modern science has changed the world beyond recognition, and overwhelmingly for the better. But success can breed complacency. Modern scientists are doing too much trusting and not enough verifying—to the detriment of the whole of science, and of humanity. Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis (see article: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble). A rule of thumb among biotechnology venture-capitalists is that half of published research cannot be replicated. Even that may be optimistic. Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers. A leading computer scientist frets that three-quarters of papers in his subfield are bunk. In 2000-10 roughly 80,000 patients took part in clinical trials based on research that was later retracted because of mistakes or improprieties.
A crisis? The Economist: Oct 19th, 2013
Unreliable research
Trouble at the lab
Scientists like to think of science as self-correcting. To an alarming degree, it is not
Oct 19th 2013 | From the print edition
“I SEE a train wreck looming,” warned Daniel Kahneman, an eminent psychologist, in an open letter last year. The premonition concerned research on a phenomenon known as “priming”. Priming studies suggest that decisions can be influenced by apparently irrelevant actions or events that took place just before the cusp of choice. They have been a boom area in psychology over the past decade, and some of their insights have already made it out of the lab and into the toolkits of policy wonks keen on “nudging” the populace.
Snippets
Systematic attempts to replicate widely cited priming experiments have failed
Amgen could only replicate 6 of 53 studies they considered landmarks in basic cancer science
Bayer HealthCare could only replicate about 25% of 67 seminal studies
Early report (Kaplan, ’08): 50% of Phase III FDA studies ended in failure
New Yorker: December, 2010
ANNALS OF SCIENCE
THE TRUTH WEARS OFF
Is there something wrong with the scientific method? by Jonah Lehrer, December 13, 2010
Many results that are rigorously proved and accepted start shrinking in later studies.
On September 18, 2007, a few dozen neuroscientists, psychiatrists, and drug-company executives gathered in a hotel conference room in Brussels to hear some startling news. It had to do with a class of drugs known as atypical or second-generation antipsychotics, which came on the market in the early nineties. The drugs, sold under brand names such as Abilify, Seroquel, and Zyprexa, had been tested on schizophrenics in several large clinical trials, all of which had demonstrated a dramatic decrease in the subjects’ psychiatric symptoms. As a result, second-generation antipsychotics had become one of the fastest-growing and most profitable pharmaceutical classes. By 2001, Eli Lilly’s Zyprexa was generating more revenue than Prozac. It remains the company’s top-selling drug. But the data presented at the Brussels meeting made it clear that something strange was happening: the therapeutic power of the drugs appeared to be steadily waning.
http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer
“Significance chasing”
“Publication bias”
“Selective reporting”
“Why most published research findings are false” (Ioannidis, ’05)
The New York Times Jan. 2014
Personal and societal concern
Great danger in seeing erosion of public confidence in science
Seems like scientific community is responding
Reproducibility Initiative
http://validation.scienceexchange.com/
[Excerpt from an editorial on mental-illness research: genetics and neuroimaging studies were hoped to reveal biological signatures unique to each disorder; researchers should be encouraged to investigate the causes of mental illness from the bottom up, as the US National Institute of Mental Health is doing.]
Nature’s 18-point checklist: April 25, 2013
ANNOUNCEMENT
Reducing our irreproducibility
Over the past year, Nature has published a string of articles that highlight failures in the reliability and reproducibility of published research (collected and freely available at go.nature.com/huhbyr). The problems arise in laboratories, but journals such as this one compound them when they fail to exert sufficient scrutiny over the results that they publish, and when they do not publish enough information for other researchers to assess results properly.
From next month, Nature and the Nature research journals will introduce editorial measures to address the problem by improving the consistency and quality of reporting in life-sciences articles. To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.
Central to this initiative is a checklist intended to prompt authors to disclose technical and statistical information in their submissions, and to encourage referees to consider aspects important for research reproducibility (go.nature.com/oloeip). It was developed after discussions with researchers on the problems that lead to irreproducibility, including workshops organized last year by US National Institutes of Health (NIH) institutes. It also draws on published concerns about reporting standards (or the lack of them) and the collective experience of editors at Nature journals.
The checklist is not exhaustive. It focuses on a few experimental and analytical design elements that are crucial for the interpretation of research results but are often reported incompletely. For example, authors will need to describe methodological parameters that can introduce bias or influence robustness, and provide precise characterization of key reagents that may be subject to biological variability, such as cell lines and antibodies. The checklist also consolidates existing policies about data deposition and presentation. We will also demand more precise descriptions of statistics, and
we will commission statisticians as consultants on certain papers, at the editor’s discretion and at the referees’ suggestion. We recognize that there is no single way to conduct an experimental study. Exploratory investigations cannot be done with the same level of statistical rigour as hypothesis-testing studies. Few academic laboratories have the means to perform the level of validation required, for example, to translate a finding from the laboratory to the clinic. However, that should not stand in the way of a full report of how a study was designed, conducted and analysed that will allow reviewers and readers to adequately interpret and build on the results. To allow authors to describe their experimental design and methods in as much detail as necessary, the participating journals, including Nature, will abolish space restrictions on the methods section. To further increase transparency, we will encourage authors to provide tables of the data behind graphs and figures. This builds on our established data-deposition policy for specific experiments and large data sets. The source data will be made available directly from the figure legend, for easy access. We continue to encourage authors to share detailed methods and reagent descriptions by depositing protocols in Protocol Exchange (www.nature.com/ protocolexchange), an open resource linked from the primary paper. Renewed attention to reporting and transparency is a small step. Much bigger underlying issues contribute to the problem, and are beyond the reach of journals alone. Too few biologists receive adequate training in statistics and other quantitative aspects of their subject. Mentoring of young scientists on matters of rigour and transparency is inconsistent at best. In academia, the ever increasing pressures to publish and chase funds provide little incentive to pursue studies and publish results that contradict or confirm previous papers. Those who document the validity or irreproducibility of a published piece of work seldom get a welcome from journals and funders, even as money and effort are wasted on false assumptions. Tackling these issues is a long-term endeavour that will require the commitment of funders, institutions, researchers and publishers. It is encouraging that NIH institutes have led community discussions on this topic and are considering their own recommendations. We urge others to take note of these and of our initiatives, and do whatever they can to improve research reproducibility. ■
NAS President’s address: April 27, 2015
Big data and a new scientific paradigm
Collect data first =⇒ Ask questions later
Large data sets available prior to formulation of hypotheses
Need to quantify “reliability” of hypotheses generated by data snooping
Very different from hypothesis-driven research

What does statistics have to offer?
Account for the “look everywhere” effect
Understand reliability in the context of all hypotheses that have been explored
Most discoveries may be false: Sorić (’89)
1000 hypotheses to test, of which 100 are potential discoveries (true effects); see also The Economist
Power ≈ 80% −→ true positives ≈ 80; false positives (5% level) ≈ 45
=⇒ false discovery rate ≈ 45/125 ≈ 36% among the reported discoveries
Power ≈ 30% −→ true positives ≈ 30; false positives (5% level) ≈ 45
=⇒ false discovery rate ≈ 60%
More false negatives than true positives!
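The arithmetic behind these numbers is easy to check. Here is a minimal sketch (my own, not from the slides) of the expected counts under the stated assumptions: 1000 tests, 100 true effects, each test run at the 5% level.

```python
# Expected false discovery rate in Soric's thought experiment.
def expected_fdr(n_tests=1000, n_signals=100, power=0.8, alpha=0.05):
    true_positives = power * n_signals               # e.g. 0.8 * 100 = 80
    false_positives = alpha * (n_tests - n_signals)  # 0.05 * 900 = 45
    return false_positives / (false_positives + true_positives)

print(expected_fdr(power=0.8))   # ~0.36
print(expected_fdr(power=0.3))   # ~0.60
```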
Example: meta-analysis in neuroscience
Button et al. (2013), “Power failure: why small sample size undermines the reliability of neuroscience”
False Discovery Rate (FDR): Benjamini & Hochberg (’95)
H1, . . . , Hn hypotheses subject to some testing procedure
FDR = E[ #false discoveries / #discoveries ]   (with the convention ‘0/0 = 0’)
A natural type I error
Under independence (and PRDS), simple rules control the FDR (BHq)
Widely used
FDR control with BHq (under independence)
FDR: expected proportion of false discoveries
Sorted p-values: p(1) ≤ p(2) ≤ . . . ≤ p(n) (from most to least significant); target FDR q
BHq rejects the k̂ most significant hypotheses, where k̂ is the largest k such that p(k) ≤ qk/n
[Figure: sorted p-values vs. index, with the line of slope q/n; the rejections are the p-values below the last crossing of that line]
The cut-off is adaptive to the number of non-nulls
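For reference, a minimal sketch of the BHq step-up rule just described (a standard implementation, not code from the talk; the helper name `bhq` is mine).

```python
import numpy as np

def bhq(pvals, q=0.1):
    """Benjamini-Hochberg step-up rule: return indices of rejected hypotheses."""
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)                      # most to least significant
    thresh = q * np.arange(1, n + 1) / n       # adaptive cut-off q*k/n
    below = p[order] <= thresh
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0]) + 1       # largest k with p_(k) <= q*k/n
    return order[:k]                           # reject the k most significant

# toy example: 90 null p-values and 10 very small ones
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(0, 1e-3, size=10)])
print(bhq(pvals, q=0.1))
```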
Earlier work on multiple comparisons
Henry Scheffé · John Tukey · Rupert Miller · Westfall and Young (’93)
Controlled Variable Selection
Contemporary problem: inference for sparse regression

Find locations on the genome that influence a trait: e.g. cholesterol level

[Figure: Manhattan plots of −log10(P value) across chromosomes 1–22 for HDL cholesterol, LDL cholesterol and triglycerides, with significant loci annotated by nearby genes (e.g. CETP, LPL, LIPC, ABCA1, GALNT2, MVK/MMAB, LCAT, LIPG, APOE cluster, APOB, SORT1/CELSR2/PSRC1, LDLR, PCSK9, NCAN/CILP2, B4GALT4, B3GALT4, APOA5, GCKR, ANGPTL3, RBKS, MLXIPL, TRIB1). Source: Diabetes Genetics Initiative GWAS, 347,010 SNPs (MAF > 5%) genotyped in 2,758 Finnish and Swedish individuals with the Affymetrix 500K Mapping Array Set; roughly 2,261,000 genotyped and/or imputed SNPs tested for association with plasma lipid levels in 8,816 phenotyped individuals.]

Linear model (y = Xβ + z) + e.g. Lasso fit

min ½‖y − Xβ̂‖₂² + λ‖β̂‖₁

y: cholesterol level of patients
X: genotype matrix; e.g. Xi,j = # of alleles of recessive type at location j
z: environmental factors (not accounted for by genetics)

How do we control the FDR of the selected variables {i : β̂i ≠ 0}?
Sparse regression
Simulated data with n = 1500, p = 500
Lasso fitted model for λ = 1.75
[Figure: fitted coefficients β̂j vs. index j; the nonzero β̂j are the selected variables. Which of them are true positives and which are false positives?]
Here FDP = 26/55 = 47%: 55 variables selected, 26 of them false
Estimate FDP? Forget it...
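A small simulation in the spirit of this example (my own sketch, not the talk's code). Note that scikit-learn's `Lasso` minimizes (1/2n)‖y − Xb‖² + α‖b‖₁, so α plays the role of λ/n; the value λ = 1.75 comes from the slide, while the true support, amplitudes and seed are arbitrary choices of mine.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k, amplitude = 1500, 500, 30, 3.5

X = rng.standard_normal((n, p)) / np.sqrt(n)        # columns roughly unit-norm
beta = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta[support] = amplitude * rng.choice([-1, 1], size=k)
y = X @ beta + rng.standard_normal(n)

lam = 1.75                                          # penalty on the 1/2||.||^2 scale
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
selected = np.flatnonzero(fit.coef_)

false_pos = np.setdiff1d(selected, support)
fdp = len(false_pos) / max(len(selected), 1)        # '0/0 = 0' convention
print(f"selected {len(selected)} variables, FDP = {fdp:.2f}")
```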
Controlled variable selection
y = Xβ + z = Σ_j βj Xj + z,   where y and z are n × 1, X is n × p, β is p × 1;   y ∼ N(Xβ, σ²I)
Goal: select a set of features Xj without too many false positives
FDR (false discovery rate) = E[ #false positives / #features selected ]   (‘0/0 = 0’)
The ratio inside the expectation is the false discovery proportion (FDP)
Context of multiple testing (with possibly very many irrelevant variables):
Hj : βj = 0,   j = 1, . . . , p
This lecture
Novel procedure for variable selection controlling FDR in finite sample settings
1. Construct knockoffs
For each feature Xj, construct a knockoff feature X̃j (the construction is given in the Details section)
Requires n ≥ p (and full column rank) for identifiability

Why? For a null feature Xj (βj⋆ = 0),
Xj′y = Xj′Xβ⋆ + Xj′z  =d  X̃j′Xβ⋆ + X̃j′z = X̃j′y
(for a null j, Xj′Xβ⋆ = X̃j′Xβ⋆ since βj⋆ = 0 and the knockoff has the same correlations with all other features; and Xj′z =d X̃j′z since ‖Xj‖₂ = ‖X̃j‖₂)

Lemma (Pairwise exchangeability property)
For any subset N of nulls,
[X X̃]′_swap(N) y  =d  [X X̃]′ y
=⇒ knockoffs are a ‘control group’ for the nulls
Knockoff method
Compute the Lasso with the augmented matrix:
β̂(λ) = argmin_{b ∈ R^{2p}}  ½‖y − [X X̃]·b‖₂² + λ‖b‖₁
[Figure: fitted coefficients for λ = 1.75 on the simulated dataset, 500 original features and 500 knockoff features]
Lasso selects 49 original features & 24 knockoff features
Pairwise exchangeability =⇒ probably ≈ 24 false positives among the 49 original features
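A hedged sketch of this step (not the talk's code): given a knockoff matrix X̃, for instance one built with the equi-correlated construction shown in the Details section, fit the lasso on the augmented design and count how many originals and how many knockoffs are selected. The function name and interface are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def augmented_lasso_counts(X, X_tilde, y, lam, true_support):
    """Fit the lasso on [X, X_tilde] at penalty lam (on the 1/2||.||^2 scale)
    and report selected originals vs. knockoffs (illustrative sketch)."""
    n, p = X.shape
    XX = np.hstack([X, X_tilde])                     # n x 2p augmented design
    fit = Lasso(alpha=lam / n, fit_intercept=False).fit(XX, y)
    sel_orig = np.flatnonzero(fit.coef_[:p])
    sel_knock = np.flatnonzero(fit.coef_[p:])
    n_false = len(np.setdiff1d(sel_orig, true_support))
    print(f"{len(sel_orig)} originals and {len(sel_knock)} knockoffs selected; "
          f"{n_false} of the originals are actually false")
    return sel_orig, sel_knock
```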
2. Statistics
Zj = sup{λ : b̂j(λ) ≠ 0}   (first time Xj enters the model)
Z̃j = sup{λ : b̃j(λ) ≠ 0}   (first time X̃j enters the model)
Test statistic Wj for feature j:
Wj = max(Zj, Z̃j) · (+1 if Zj > Z̃j, −1 if Zj < Z̃j)
[Diagram: W statistics placed along the |W| axis with their signs; large |W| = the pair entered early (significant), small |W| = entered late]
Many other choices (later)
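One way to compute these statistics numerically (a sketch under the assumption that entry times are read off a discretized lasso path; `lasso_path` returns penalties on scikit-learn's 1/n scale, and ties on the grid give W = 0).

```python
import numpy as np
from sklearn.linear_model import lasso_path

def knockoff_statistics(X, X_tilde, y):
    """Lasso-path statistics of the slide (sketch): Z_j = largest lambda at which a
    column of [X, X_tilde] is active, W_j = max(Z_j, Zt_j) * sign(Z_j - Zt_j)."""
    n, p = X.shape
    XX = np.hstack([X, X_tilde])
    alphas, coefs, _ = lasso_path(XX, y, n_alphas=200)   # alphas decreasing
    lambdas = alphas * n                                  # back to the 1/2||.||^2 scale
    active = np.abs(coefs) > 0                            # (2p, n_alphas) boolean
    # largest lambda at which each variable is active, 0 if it never enters
    Z = np.array([lambdas[row].max() if row.any() else 0.0 for row in active])
    Zo, Zk = Z[:p], Z[p:]
    return np.maximum(Zo, Zk) * np.sign(Zo - Zk)
```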
Variation
Forward selection on the augmented design [X X̃]
Record the first time (rank) at which either the original or its knockoff enters
Tag ‘+’ if the original comes before its knockoff, ‘−’ otherwise
[Diagram: tags ordered by rank of entry; early entry = significant]
Pairwise exchangeability of the nulls
[Diagram: for a null feature, the pair (Xj, X̃j), and hence (Zj, Z̃j), is exchangeable; for a non-null (signal) it is not]
[Scatter plot: value of λ when X̃j enters vs. value of λ when Xj enters, for signal and null features; estimated FDP at threshold t = 1.5]
Consequence of exchangeability
[Diagram: null and non-null W statistics on an axis with threshold t; the signs of the nulls are i.i.d. ±1]
Signs of the nulls are i.i.d. ±1, independent of |W| (the ordering):
(W1, . . . , Wp)  =d  (W1 · ε1, . . . , Wp · εp)
with a sign sequence {εj} independent of W, where εj = +1 for all non-null j and εj ∼ i.i.d. ±1 for null j
Signs → 1-bit p-values
Knockoff estimate of FDR
[Diagram: W statistics with thresholds −t and t]
FDP(t) = #{j null : Wj ≥ t} / (#{j : Wj ≥ t} ∨ 1)
       ≈ #{j null : Wj ≤ −t} / (#{j : Wj ≥ t} ∨ 1)
       ≤ #{j : Wj ≤ −t} / (#{j : Wj ≥ t} ∨ 1)  :=  FDP̂(t)
3. Selection and FDR control
Select features with large and positive statistics: {j : Wj ≥ T}
Knockoff:   T = min{ t ∈ W : #{j : Wj ≤ −t} / (#{j : Wj ≥ t} ∨ 1) ≤ q }
Theorem (Knockoff)
E[ V / (R + q⁻¹) ] ≤ q,   where V = # false positives and R = total # of selections
Knockoff+:   T = min{ t ∈ W : (1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1) ≤ q }
Theorem (Knockoff+)
E[ V / (R ∨ 1) ] ≤ q   (i.e. FDR ≤ q)
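Once the W statistics are in hand, the data-dependent threshold is a few lines; a minimal sketch (illustrative helper names, not the talk's code).

```python
import numpy as np

def knockoff_threshold(W, q=0.2, plus=True):
    """Smallest t among the observed |W_j| such that
    (offset + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q  (offset = 1 for knockoff+)."""
    offset = 1 if plus else 0
    for t in np.sort(np.unique(np.abs(W[W != 0]))):
        if (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf                     # no threshold works: select nothing

def knockoff_select(W, q=0.2, plus=True):
    return np.flatnonzero(W >= knockoff_threshold(W, q, plus))
```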
Stopping rule
T = min{ t : (0/1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1) ≤ q }   (offset 0 for knockoff, 1 for knockoff+)
[Diagram: scan the threshold t from large to small along the W axis]
Stop the first time the ratio between the # of negatives and the # of positives falls below q
Why does all this work?
T = min{ t : (1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1) ≤ q }
[Diagram: W statistics with threshold t]
FDP(T) = #{j null : Wj ≥ T} / (#{j : Wj ≥ T} ∨ 1)
       = [ (1 + #{j null : Wj ≤ −T}) / (#{j : Wj ≥ T} ∨ 1) ] · [ #{j null : Wj ≥ T} / (1 + #{j null : Wj ≤ −T}) ]
       ≤ q · #{j null : Wj ≥ T} / (1 + #{j null : Wj ≤ −T})  =  q · V⁺(T) / (1 + V⁻(T))
where V⁺(t) = #{j null : Wj ≥ t} and V⁻(t) = #{j null : Wj ≤ −t}
V⁺(t)/(1 + V⁻(t)) is a super-martingale w.r.t. a well-defined filtration, and T is a stopping time
Optional stopping time theorem
[Diagram: null and non-null W statistics with threshold t]
FDR ≤ q · E[ V⁺(T) / (1 + V⁻(T)) ] ≤ q · E[ V⁺(0) / (1 + V⁻(0)) ] = q · E[ V⁺(0) / (1 + #nulls − V⁺(0)) ] ≤ q
since V⁺(0) ∼ Binomial(#nulls, 1/2)
Comparison with other methods
Permutation methods
Let Xπ = X with rows randomly permuted. Then
[X Xπ]′ [X Xπ] ≈ [ Σ, 0 ; 0, Σ ]
so the permuted features do not preserve the correlations with the original design.
[Figure: Zj = value of λ when a variable enters the model vs. index of column in the augmented matrix [X Xπ]; original non-null features, original null features, permuted features]
FDR over 1000 trials (nominal level q = 20%):
Knockoff method: 12.29%
Permutation method: 45.61%
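A quick numerical illustration of this Gram structure (my own sketch): permuting the rows keeps the internal correlations of the control block but destroys the cross-correlations with the original columns, which is why permuted columns are a poor control group here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))      # equicorrelated features
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
X /= np.linalg.norm(X, axis=0)                       # unit-norm columns

X_perm = X[rng.permutation(n), :]                    # rows randomly permuted
A = np.hstack([X, X_perm])
G = A.T @ A
print(np.round(G[:p, :p], 2))   # ~ Sigma: original correlations preserved
print(np.round(G[:p, p:], 2))   # ~ 0: permuted columns decorrelated from the originals
```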
Other methods
Benjamini-Hochberg (BHq)
BHq + log-factor correction
BHq with whitened noise

y ∼ N(Xβ, σ²I)   ⟺   β̂LS ∼ N(β, σ²(X′X)⁻¹)
Apply BHq to Zj = β̂jLS / (σ √((Σ⁻¹)jj)),   Σ = X′X
Not known to control FDR → log-factor correction (Benjamini–Yekutieli)
Empirical results
Features ∼ N(0, In), n = 3000, p = 1000
k = 30 variables with regression coefficients of magnitude 3.5

Method                         FDR (%) (nominal q = 20%)   Power (%)   Theor. FDR control?
Knockoff+ (equivariant)        14.40                       60.99       Yes
Knockoff (equivariant)         17.82                       66.73       No
Knockoff+ (SDP)                15.05                       61.54       Yes
Knockoff (SDP)                 18.72                       67.50       No
BHq                            18.70                       48.88       No
BHq + log-factor correction     2.20                       19.09       Yes
BHq with whitened noise        18.79                        2.33       Yes
Effect of sparsity level
Same setup with amplitudes set to 3.5 (q = 0.2)
[Figure: FDR (%) and power (%) vs. sparsity level k (50 to 200) for Knockoff, Knockoff+ and BHq, with the nominal level marked]
Effect of signal amplitude
Same setup with k = 30 (q = 0.2)
[Figure: FDR (%) and power (%) vs. amplitude A (2.8 to 4.2) for Knockoff, Knockoff+ and BHq, with the nominal level marked]
Effect of feature correlation
Features ∼ N(0, Θ) with Θjk = ρ^|j−k|
n = 3000, p = 1000, k = 30, amplitude = 3.5
[Figure: FDR (%) and power (%) vs. feature correlation ρ (0 to 0.8) for Knockoff, Knockoff+ and BHq, with the nominal level marked]
Application to real HIV data
HIV drug resistance

Drug type   # drugs   Sample size   # protease or RT positions genotyped   # mutations appearing ≥ 3 times in sample
PI          6         848           99                                      209
NRTI        6         639           240                                     294
NNRTI       3         747           240                                     319

Response y: log-fold-increase of lab-tested drug resistance
Covariate Xj: presence or absence of mutation #j

Data from R. Shafer (Stanford), available at:
http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/
PI-type drug resistance
TSM list: mutations associated with the PI class of drugs in general, not specialized to the individual drugs in the class
[Figure: q = 0.2. Number of HIV-1 protease positions where mutations were selected by Knockoff and by BHq, split into positions appearing / not appearing in the TSM list, for resistance to APV, ATV, IDV, LPV, NFV and SQV; dataset sizes n=768/p=201, n=329/p=147, n=826/p=208, n=516/p=184, n=843/p=209, n=825/p=208. Horizontal line = # HIV-1 protease positions in the TSM list.]
NRTI-type drug resistance
[Figure: validation against the treatment-selected mutation (TSM) panel. Number of HIV-1 RT positions selected by Knockoff and by BHq (appearing / not appearing in the TSM list) for resistance to 3TC, ABC, AZT, D4T, DDI and TDF; dataset sizes n=633/p=292, n=628/p=294, n=630/p=292, n=630/p=293, n=632/p=292, n=353/p=218.]
NNRTI-type drug resistance
[Figure: validation against the treatment-selected mutation (TSM) panel. Number of HIV-1 RT positions selected by Knockoff and by BHq (appearing / not appearing in the TSM list) for resistance to DLV, EFV and NVP; dataset sizes n=732/p=311, n=734/p=318, n=746/p=319.]
Heavy-tailed noise
Design as in the HIV example (sparse matrix)
Errors drawn as residuals from the HIV example
Regression coefficients entered manually

            FDR (q = 20%)   Power
Knockoff+   20.31%          60.67%
BHq         25.47%          69.42%
Details
Knockoff constructions (n ≥ 2p)
X̃ = X(I − Σ⁻¹ diag{s}) + ŨC,   Σ = X′X
(here Ũ is an n × p orthonormal matrix orthogonal to the column span of X, and C′C = 2 diag{s} − diag{s} Σ⁻¹ diag{s})
[X X̃]′ [X X̃] = [ Σ, Σ − diag{s} ; Σ − diag{s}, Σ ]  :=  G
G ⪰ 0   ⟺   diag{s} ⪰ 0 and 2Σ − diag{s} ⪰ 0
Knockoff constructions (n ≥ 2p)
Equi-correlated knockoffs: sj = 2λmin(Σ) ∧ 1
⟨Xj, X̃j⟩ = 1 − (2λmin(Σ) ∧ 1)
Under equivariance, this minimizes the value of |⟨Xj, X̃j⟩|

SDP knockoffs:
minimize Σj |1 − sj|   subject to   sj ≥ 0,  diag{s} ⪯ 2Σ
⟺   minimize Σj (1 − sj)   subject to   sj ≥ 0,  diag{s} ⪯ 2Σ
A highly structured semidefinite program (SDP)
Other possibilities
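A sketch of the equi-correlated construction above (assumes unit-norm columns, full column rank and n ≥ 2p; the choices of Ũ and of the square root C follow the displayed formulas, but the helper name and the final sanity check are mine).

```python
import numpy as np

def equicorrelated_knockoffs(X):
    """Equi-correlated fixed-X knockoffs: X_tilde = X(I - Sigma^{-1} diag{s}) + U_tilde C."""
    n, p = X.shape
    Sigma = X.T @ X
    s = np.full(p, min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0))   # s_j = 2*lambda_min ^ 1
    Sigma_inv_S = np.linalg.solve(Sigma, np.diag(s))             # Sigma^{-1} diag{s}
    # U_tilde: n x p orthonormal matrix orthogonal to the column span of X
    Q, _ = np.linalg.qr(X, mode='complete')
    U_tilde = Q[:, p:2 * p]
    # C'C = 2 diag{s} - diag{s} Sigma^{-1} diag{s}, take a symmetric square root
    A = 2 * np.diag(s) - np.diag(s) @ Sigma_inv_S
    vals, vecs = np.linalg.eigh(A)
    C = (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return X @ (np.eye(p) - Sigma_inv_S) + U_tilde @ C

# sanity check of the Gram structure on a random design
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
X /= np.linalg.norm(X, axis=0)
Xk = equicorrelated_knockoffs(X)
print(np.abs(Xk.T @ Xk - X.T @ X).max())          # ~0: knockoffs preserve the Gram matrix
D = X.T @ Xk - X.T @ X
print(np.abs(D - np.diag(np.diag(D))).max())      # ~0: X'Xk matches X'X off the diagonal
```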
Symmetric statistics W([X X̃], y)
Sufficiency property:
W = f( [X X̃]′[X X̃],  [X X̃]′y )
Anti-symmetry property: swapping the columns Xj and X̃j for j ∈ S changes the signs:
Wj([X X̃]_swap(S), y) = Wj([X X̃], y) · (+1 if j ∉ S, −1 if j ∈ S)
[Diagram: null and non-null W statistics on an axis with threshold t]
Examples of statistics
Lasso path: Zj = sup{λ : β̂j(λ) ≠ 0}, j = 1, . . . , 2p, with β̂(λ) the solution to the augmented Lasso
Wj = (Zj ∨ Zj+p) · sign(Zj − Zj+p)   or   Wj = Zj − Zj+p
Same with any penalized estimate min ½‖y − Xb‖₂² + λP(b)
Forward selection / orthogonal matching pursuit: Z1, . . . , Z2p = (reverse) order in which the 2p variables entered the model; Wj = (Zj ∨ Zj+p) · sign(Zj − Zj+p)
Statistics based on LS estimates: Wj = |β̂jLS|² − |β̂j+pLS|²   or   Wj = |β̂jLS| − |β̂j+pLS|
. . . (endless possibilities)
Extensions
Other type-I errors
Can control the familywise error rate (FWER), k-FWER, ...: Janson and Su (’15)
[Diagram: ‘+’ and ‘−’ tags ordered by |W|; early entry = significant]
T: time at which m knockoffs have entered; reject the hypotheses tagged ‘+’ appearing before T
Expected number of false discoveries: E[V] ≤ m
Other models
Suppose we wish to test for groups:
y = Σ_{g∈G} Xg βg + z,   Hg : βg = 0
Group lasso: min ½‖y − Σg Xg β̂g‖₂² + λ Σg ‖β̂g‖₂
Forward group selection, ...
Construct group knockoffs for exchangeability
Calculate statistics; e.g. signed reversed ranks
Compute the threshold as before
[Diagram: ‘+’ and ‘−’ statistics ordered by |W|; early entry = significant]
Provides FDR control
Summary
Knockoff filter = inference machine
You design the statistics, knockoffs take care of the inference
Works under any design X (handles arbitrary correlations)
Does not require any knowledge of the noise level σ
Very powerful when the effects are sparse
Open research: n < p...
FDR control is an extremely useful concept even away from counting errors and successes
BHq
y ∼ N(Xβ, σ²I)   ⟺   β̂LS ∼ N(β, σ²(X′X)⁻¹)
The statistics are independent iff X′X is diagonal (orthogonal design)
BHq
Orthogonal model: y ∼ N(β, I) (σ = 1 is known)
TBH = min{ t : p · P{|N(0,1)| ≥ t} / #{j : |yj| = |βj + zj| ≥ t} ≤ q }

The knockoff procedure (with the lasso) is quite different. Make a control group:
[X X̃] = [ Ip, 0 ; 0, Ip ]   =⇒   statistics = (y, z′),   y ∼ N(β, I) independent from z′ ∼ N(0, I)
Compute the threshold (knockoff+) via
T = min{ t : FDP̂(t) = #{j : |z′j| ≥ t and |z′j| > |yj|} / #{j : |yj| ≥ t and |yj| > |z′j|} ≤ q }
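A toy simulation of this orthogonal-design comparison (my own sketch; it uses the ‘1 +’ offset from the knockoff+ definition given earlier, and the signal size, seed and level are arbitrary choices).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, k, A, q = 1000, 200, 3.0, 0.2
beta = np.zeros(p); beta[:k] = A
y = beta + rng.standard_normal(p)                  # observed coordinates, sigma = 1
z = rng.standard_normal(p)                         # knockoff coordinates: pure noise copy

# BHq on two-sided p-values
pv = 2 * norm.sf(np.abs(y))
order = np.argsort(pv)
ok = pv[order] <= q * np.arange(1, p + 1) / p
bh_sel = order[:ok.nonzero()[0].max() + 1] if ok.any() else np.array([], int)

# knockoff+ threshold: smallest t with (1 + #negatives) / #positives <= q
ko_sel = np.array([], int)
for t in np.sort(np.abs(np.concatenate([y, z]))):
    neg = np.sum((np.abs(z) >= t) & (np.abs(z) > np.abs(y)))
    pos = np.sum((np.abs(y) >= t) & (np.abs(y) > np.abs(z)))
    if (1 + neg) / max(pos, 1) <= q:
        ko_sel = np.flatnonzero((np.abs(y) >= t) & (np.abs(y) > np.abs(z)))
        break

for name, sel in [("BHq", bh_sel), ("knockoff+", ko_sel)]:
    fdp = np.sum(sel >= k) / max(len(sel), 1)      # first k coordinates are the signals
    power = np.sum(sel < k) / k
    print(f"{name}: {len(sel)} selected, FDP = {fdp:.2f}, power = {power:.2f}")
```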
Empirical comparison
p = 1000; # true signals is 200 → fraction of nulls is 0.8
[Figure: FDR (%) and power (%) of the BHq and knockoff+ methods vs. the size A of the regression coefficients (signal magnitude), A from 1 to 6, with the nominal level marked]
Some Closing Remarks: How About Prediction/Estimation?
FDR for estimation: y = Xβ + z
Goal: predict the response from the explanatory variables
Sparse (modern) setup: p large and only a few important variables

Model selection via Cp:
Cp(S) = ‖y − Xβ̂[S]‖² + 2σ²|S|   (RSS + 2σ² × model dim.),   S ⊂ {1, . . . , p}
β̂[S]: fitted LS coefficients from the variables in S
Cp(S) is unbiased for the prediction error of model S

A bias/variance spectrum: Cp (low bias) ←→ FDR ←→ FWER (low variance)
FDR thresholding (Abramovich and Benjamini (’96))
Orthogonal design: X′X = Ip   =⇒   X′y ∼ N(β, σ²Ip)
Select a nominal level q and perform BH(q) testing:
β̂i = Xi′y  if |Xi′y| ≥ t_FDR,   and  β̂i = 0  otherwise

Theorem (Abramovich, Benjamini, Donoho, Johnstone (’05))
Sparsity class ℓ0(k) = {β : ‖β‖0 ≤ k}
Minimax risk R(k) = inf_β̂ sup_{β∈ℓ0(k)} E‖β̂ − β‖₂²
Set q < 1/2. Then asymptotically, as p → ∞ and k ∈ [(log p)⁵, p^{1−δ}],
E‖β̂FDR − β‖₂² = E‖Xβ̂FDR − Xβ‖₂² = R(k)(1 + o(1))
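A sketch of this estimator (BH applied to the coordinates of X′y, then hard thresholding); the function name is illustrative, not from the talk.

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold_estimator(Xty, sigma=1.0, q=0.1):
    """FDR (hard) thresholding in the orthogonal design of this slide:
    run BH(q) on two-sided p-values of X'y ~ N(beta, sigma^2 I) and keep
    only the coordinates that BH rejects."""
    p = Xty.size
    pv = 2 * norm.sf(np.abs(Xty) / sigma)
    order = np.argsort(pv)
    ok = pv[order] <= q * np.arange(1, p + 1) / p
    beta_hat = np.zeros(p)
    if ok.any():
        rejected = order[:ok.nonzero()[0].max() + 1]
        beta_hat[rejected] = Xty[rejected]        # keep X_i'y above t_FDR, zero otherwise
    return beta_hat
```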
Other connections
SLOPE: Bogdan, van den Berg, Sabatti, Su and Candès (’13)
min ½‖y − Xβ̂‖₂² + λ1|β̂|(1) + λ2|β̂|(2) + . . . + λp|β̂|(p)
λ1 ≥ λ2 ≥ . . . ≥ λp ≥ 0,   |β̂|(1) ≥ |β̂|(2) ≥ . . . ≥ |β̂|(p)   [order statistics]

Adaptive minimaxity: C. and Su (’15)
Sparse class ℓ0(k) = {β : ‖β‖0 ≤ k}
Fix q ∈ (0, 1] and set λi = σ · Φ⁻¹(1 − iq/2p)   (BHq)
For some linear models, adaptive minimax estimation:
sup_{β∈ℓ0(k)} E‖Xβ̂SLOPE − Xβ‖₂² = R(k)(1 + o(1))