Around the Reproducibility of Scientific Research in the Big Data Era:
A Knockoff Filter for Controlling the False Discovery Rate

Emmanuel Candès

Big Data and Computational Scalability, University of Warwick, July 2015

Collaborator

Rina Foygel Barber

A crisis? The Economist: Oct 19th, 2013

Problems with scientific research
How science goes wrong
Scientific research has changed the world. Now it needs to change itself

Oct 19th 2013 | From the print edition A SIMPLE idea underpins science: “trust, but verify”. Results should always be subject to challenge from experiment. That simple but powerful idea has generated a vast body of knowledge. Since its birth in the 17th century, modern science has changed the world beyond recognition, and overwhelmingly for the better. But success can breed complacency. Modern scientists are doing too much trusting and not enough verifying—to the detriment of the whole of science, and of humanity. Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis (see article (http://www.economist.com/news/briefing/21588057-scientists-thinkscience-self-correcting-alarming-degree-it-not-trouble) ). A rule of thumb among biotechnology venture-capitalists is that half of published research cannot be replicated. Even that may be optimistic. Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers. A leading computer scientist frets that three-quarters of papers in his subfield are bunk. In 2000-10 roughly 80,000 patients took part in clinical trials based on research that was later retracted because of mistakes or improprieties.

A crisis? The Economist: Oct 19th, 2013

Unreliable research
Trouble at the lab
Scientists like to think of science as self-correcting. To an alarming degree, it is not

Oct 19th 2013 | From the print edition
“I SEE a train wreck looming,” warned Daniel Kahneman, an eminent psychologist, in an open letter last year. The premonition concerned research on a phenomenon known as “priming”. Priming studies suggest that decisions can be influenced by apparently irrelevant actions or events that took place just before the cusp of choice. They have been a boom area in psychology over the past decade, and some of their insights have already made it out of the lab and into the toolkits of policy wonks keen on “nudging” the populace.

Snippets

Systematic attempts to replicate widely cited priming experiments have failed
Amgen could only replicate 6 of 53 studies they considered landmarks in basic cancer science
Bayer HealthCare could only replicate about 25% of 67 seminal studies
Early report (Kaplan, ’08): 50% of Phase III FDA studies ended in failure

New Yorker: December, 2010

ANNALS OF SCIENCE
THE TRUTH WEARS OFF
Is there something wrong with the scientific method?
by Jonah Lehrer, December 13, 2010
Many results that are rigorously proved and accepted start shrinking in later studies.

On September 18, 2007, a few dozen neuroscientists, psychiatrists, and drug-company executives gathered in a hotel conference room in Brussels to hear some startling news. It had to do with a class of drugs known as atypical or second-generation antipsychotics, which came on the market in the early nineties. The drugs, sold under brand names such as Abilify, Seroquel, and Zyprexa, had been tested on schizophrenics in several large clinical trials, all of which had demonstrated a dramatic decrease in the subjects’ psychiatric symptoms. As a result, second-generation antipsychotics had become one of the fastest-growing and most profitable pharmaceutical classes. By 2001, Eli Lilly’s Zyprexa was generating more revenue than Prozac. It remains the company’s top-selling drug. But the data presented at the Brussels meeting made it clear that something strange was happening: the therapeutic power of the drugs appeared to be steadily waning.

“Significance chasing”
“Publication bias”
“Selective reporting”
“Why most published research findings are false” (Ioannidis, ’05)

http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer

The New York Times Jan. 2014

Personal and societal concern

Great danger in seeing erosion of public confidence in science

Seems like the scientific community is responding

Reproducibility Initiative: http://validation.scienceexchange.com/

[Screenshot excerpt from a Nature piece on mental-health research: genetics and neuroimaging studies were hoped to reveal biological signatures unique to each disorder, and researchers should be encouraged to investigate the causes of mental illness from the bottom up, as the US National Institute of Mental Health is doing.]

Nature’s 18-point checklist: April 25, 2013

ANNOUNCEMENT: Reducing our irreproducibility

Over the past year, Nature has published a string of articles that highlight failures in the reliability and reproducibility of published research (collected and freely available at go.nature.com/huhbyr). The problems arise in laboratories, but journals such as this one compound them when they fail to exert sufficient scrutiny over the results that they publish, and when they do not publish enough information for other researchers to assess results properly. From next month, Nature and the Nature research journals will introduce editorial measures to address the problem by improving the consistency and quality of reporting in life-sciences articles. To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data. Central to this initiative is a checklist intended to prompt authors to disclose technical and statistical information in their submissions, and to encourage referees to consider aspects important for research reproducibility (go.nature.com/oloeip). It was developed after discussions with researchers on the problems that lead to irreproducibility, including workshops organized last year by US National Institutes of Health (NIH) institutes. It also draws on published concerns about reporting standards (or the lack of them) and the collective experience of editors at Nature journals. The checklist is not exhaustive. It focuses on a few experimental and analytical design elements that are crucial for the interpretation of research results but are often reported incompletely. For example, authors will need to describe methodological parameters that can introduce bias or influence robustness, and provide precise characterization of key reagents that may be subject to biological variability, such as cell lines and antibodies. The checklist also consolidates existing policies about data deposition and presentation. We will also demand more precise descriptions of statistics, and

we will commission statisticians as consultants on certain papers, at the editor’s discretion and at the referees’ suggestion. We recognize that there is no single way to conduct an experimental study. Exploratory investigations cannot be done with the same level of statistical rigour as hypothesis-testing studies. Few academic laboratories have the means to perform the level of validation required, for example, to translate a finding from the laboratory to the clinic. However, that should not stand in the way of a full report of how a study was designed, conducted and analysed that will allow reviewers and readers to adequately interpret and build on the results. To allow authors to describe their experimental design and methods in as much detail as necessary, the participating journals, including Nature, will abolish space restrictions on the methods section. To further increase transparency, we will encourage authors to provide tables of the data behind graphs and figures. This builds on our established data-deposition policy for specific experiments and large data sets. The source data will be made available directly from the figure legend, for easy access. We continue to encourage authors to share detailed methods and reagent descriptions by depositing protocols in Protocol Exchange (www.nature.com/ protocolexchange), an open resource linked from the primary paper. Renewed attention to reporting and transparency is a small step. Much bigger underlying issues contribute to the problem, and are beyond the reach of journals alone. Too few biologists receive adequate training in statistics and other quantitative aspects of their subject. Mentoring of young scientists on matters of rigour and transparency is inconsistent at best. In academia, the ever increasing pressures to publish and chase funds provide little incentive to pursue studies and publish results that contradict or confirm previous papers. Those who document the validity or irreproducibility of a published piece of work seldom get a welcome from journals and funders, even as money and effort are wasted on false assumptions. Tackling these issues is a long-term endeavour that will require the commitment of funders, institutions, researchers and publishers. It is encouraging that NIH institutes have led community discussions on this topic and are considering their own recommendations. We urge others to take note of these and of our initiatives, and do whatever they can to improve research reproducibility. ■


NAS President’s address: April 27, 2015

Big data and a new scientific paradigm

Collect data first   ⟹   Ask questions later

Large data sets available prior to formulation of hypotheses
Need to quantify “reliability” of hypotheses generated by data snooping
Very different from hypothesis-driven research

What does statistics have to offer?

Account for the “look everywhere” effect
Understand reliability in the context of all hypotheses that have been explored

Most discoveries may be false: Sorić (’89)      (see also The Economist)

1000 hypotheses to test, 100 of them potential discoveries

Power ≈ 80% −→ true positives ≈ 80; false positives (5% level) ≈ 45   ⟹   false discovery rate ≈ 36%
Power ≈ 30% −→ true positives ≈ 30; false positives (5% level) ≈ 45   ⟹   false discovery rate ≈ 60%

More false negatives than true positives!
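
These numbers can be checked with a couple of lines; a small sketch (not from the talk) using the power, level and prevalence quoted above:

```python
# Soric-style false discovery rate arithmetic for 1000 tests,
# 100 of which are potential (true) discoveries.
n_tests, n_nonnull = 1000, 100
n_null = n_tests - n_nonnull
alpha = 0.05                          # per-test significance level

for power in (0.80, 0.30):
    true_pos = power * n_nonnull      # expected true positives
    false_pos = alpha * n_null        # expected false positives (= 45)
    fdr = false_pos / (false_pos + true_pos)
    print(f"power {power:.0%}: TP ~ {true_pos:.0f}, FP ~ {false_pos:.0f}, FDR ~ {fdr:.0%}")
# power 80%: TP ~ 80, FP ~ 45, FDR ~ 36%
# power 30%: TP ~ 30, FP ~ 45, FDR ~ 60%
```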

Example: meta-analysis in neuroscience
Button et al. (2013), “Power failure: why small sample size undermines the reliability of neuroscience”

False Discovery Rate (FDR): Benjamini & Hochberg (’95)

H_1, ..., H_n hypotheses subject to some testing procedure

FDR = E[ #false discoveries / #discoveries ]      (convention ‘0/0 = 0’)

Natural type I error
Under independence (and PRDS), simple rules control the FDR (BHq)
Widely used

FDR control with BHq (under independence)

FDR: expected proportion of false discoveries
Sorted p-values: p_(1) ≤ p_(2) ≤ ... ≤ p_(n) (from most to least significant)
Target FDR level q

[Figure: sorted p-values plotted against their index for 100 tests, shown for two data sets with different numbers of non-nulls; BHq rejects the hypotheses whose sorted p-values fall below its data-dependent cut-off.]

The cut-off is adaptive to the number of non-nulls
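
For concreteness, a minimal sketch of the BHq step-up rule (generic NumPy, not code from the talk): sort the p-values and reject the k smallest, where k is the largest index with p_(k) ≤ qk/n.

```python
import numpy as np

def bhq(pvalues, q=0.1):
    """Benjamini-Hochberg step-up rule: return indices of rejected hypotheses."""
    p = np.asarray(pvalues)
    n = p.size
    order = np.argsort(p)                      # most to least significant
    thresh = q * np.arange(1, n + 1) / n       # BH comparison values q*k/n
    below = np.nonzero(p[order] <= thresh)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    k = below.max() + 1                        # largest k with p_(k) <= q*k/n
    return order[:k]                           # reject the k smallest p-values

# toy example: 10 strong signals among 100 tests
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(0, 1e-3, 10), rng.uniform(0, 1, 90)])
print(bhq(p, q=0.1))
```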

Earlier work on multiple comparisons

Henry Scheffé
John Tukey
Rupert Miller
Westfall and Young (’93)

Controlled Variable Selection

Contemporary problem: inference for sparse regression

Find locations on the genome that influence a trait: e.g. cholesterol level

[Figure: Manhattan plots of −log10(P value) across chromosomes 1–22 for HDL cholesterol, LDL cholesterol and triglycerides, from a genome-wide association study of blood lipids (Diabetes Genetics Initiative data, ~2,261,000 genotyped and/or imputed SNPs). Labelled peaks include CETP, LPL, LIPC, ABCA1, GALNT2, MVK/MMAB, LCAT, LIPG for HDL; the APOE cluster, APOB, SORT1/CELSR2/PSRC1, LDLR, PCSK9, NCAN/CILP2, B3GALT4, B4GALT4 for LDL; and APOA5, GCKR, TRIB1, ANGPTL3, RBKS, MLXIPL for triglycerides.]

Linear model (y = Xβ + z) + e.g. Lasso fit:

min_β̂  (1/2)‖y − Xβ̂‖₂² + λ‖β̂‖₁

y: cholesterol level of patients
X: genotype matrix; e.g. X_{i,j} = # of alleles of recessive type at location j
z: environmental factors (not accounted for by genetics)

How do we control the FDR of the selected variables {i : β̂_i ≠ 0}?
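
As a toy stand-in for such a regression, the sketch below simulates y = Xβ + z and records the Lasso selections {i : β̂_i ≠ 0}. It assumes scikit-learn, whose Lasso objective carries a 1/(2n) factor, so its alpha corresponds to λ/n relative to the objective above; sizes and amplitudes are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 1500, 500, 30
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)                 # unit-norm columns
beta = np.zeros(p); beta[:k] = 3.5             # k true effects
y = X @ beta + rng.standard_normal(n)

# scikit-learn minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1, so alpha = lambda/n.
lam = 1.75
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
selected = np.flatnonzero(fit.coef_)           # {i : beta_hat_i != 0}
print(len(selected), "variables selected")
```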

Sparse regression

Simulated data with n = 1500, p = 500
Lasso fitted model for λ = 1.75

[Figure: fitted Lasso coefficients β̂_j plotted against index j = 1, ..., 500; a handful of large coefficients stand out from many smaller nonzero ones. Which of the selected variables are true positives and which are false positives?]

FDP = 26/55 = 47%

Estimate FDP? Forget it...
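
In a simulation the FDP can of course be computed because β is known; a self-contained sketch (same illustrative setup as above, scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, k = 1500, 500, 30
X = rng.standard_normal((n, p)); X /= np.linalg.norm(X, axis=0)
beta = np.zeros(p); beta[:k] = 3.5
y = X @ beta + rng.standard_normal(n)

coef = Lasso(alpha=1.75 / n, fit_intercept=False).fit(X, y).coef_
selected = np.flatnonzero(coef)
false_pos = np.sum(beta[selected] == 0)        # known only because beta is known
fdp = false_pos / max(len(selected), 1)
print(f"selected {len(selected)}, FDP = {fdp:.0%}")   # on real data FDP is unobservable
```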

Controlled variable selection

y = Xβ + z = Σ_j β_j X_j,   y: n×1, X: n×p, β: p×1, z: n×1,   y ∼ N(Xβ, σ²I)

Goal: select a set of features X_j without too many false positives

FDR (false discovery rate) = E[ #false positives / #features selected ]      (‘0/0 = 0’)
(the expected false discovery proportion)

Context of multiple testing (with possibly very many irrelevant variables):

H_j: β_j = 0,   j = 1, ..., p

This lecture

Novel procedure for variable selection controlling FDR in finite sample settings
Requires n ≥ p (and full column rank) for identifiability

Why? Construct knockoffs

For a null feature X_j,

X_j' y = X_j' Xβ + X_j' z   =_d   X̃_j' Xβ + X̃_j' z = X̃_j' y

Lemma (Pairwise exchangeability property)
For any subset of nulls N,

[X X̃]_swap(N)' y   =_d   [X X̃]' y

⟹ knockoffs are a ‘control group’ for the nulls

Knockoff method

Compute the Lasso with the augmented matrix:

β̂(λ) = argmin_{b ∈ R^{2p}}  (1/2)‖y − [X X̃]b‖₂² + λ‖b‖₁

[Figure: fitted coefficients for λ = 1.75 on the simulated dataset, plotted for the 500 original features followed by the 500 knockoff features.]

Lasso selects 49 original features & 24 knockoff features
Pairwise exchangeability ⟹ probably ≈ 24 false positives among the 49 original features
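
A sketch of this step, assuming a knockoff matrix X̃ has already been built (for instance with the equi-correlated recipe given later in the deck) and using scikit-learn’s Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso

def count_selections(X, X_knock, y, lam):
    """Fit the lasso on [X X_knock] and count selected originals vs knockoffs."""
    n, p = X.shape
    aug = np.hstack([X, X_knock])
    coef = Lasso(alpha=lam / n, fit_intercept=False).fit(aug, y).coef_
    n_orig = np.count_nonzero(coef[:p])
    n_knock = np.count_nonzero(coef[p:])
    # by exchangeability, roughly as many null originals as knockoffs get selected,
    # so n_knock is a rough count of the false positives among the originals
    return n_orig, n_knock
```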

2. Statistics

Z_j = sup{λ : b_j(λ) ≠ 0}    first time X_j enters the model
Z̃_j = sup{λ : b̃_j(λ) ≠ 0}    first time X̃_j enters the model

Test statistic W_j for feature j:

W_j = max(Z_j, Z̃_j) · { +1 if Z_j > Z̃_j,  −1 if Z_j < Z̃_j }

[Diagram: the statistics laid out on the |W| axis with their signs; features entering early (large |W|) are significant, features entering late sit near 0.]

Many other choices (later)
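
A sketch of these statistics using scikit-learn’s lasso_path on the augmented design; the returned alphas live on scikit-learn’s 1/(2n)-scaled objective, which only rescales the Z’s and leaves signs and ordering unchanged.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def lasso_signed_max_stats(X, X_knock, y):
    """W_j = max(Z_j, Z~_j) * sign(Z_j - Z~_j), with Z the lambda at which a
    variable first enters the lasso path on the augmented design [X X_knock]."""
    p = X.shape[1]
    aug = np.hstack([X, X_knock])
    alphas, coefs, _ = lasso_path(aug, y)      # coefs: (2p, n_alphas), alphas decreasing
    Z = np.zeros(2 * p)
    for j in range(2 * p):
        nz = np.nonzero(coefs[j])[0]
        if nz.size:
            Z[j] = alphas[nz[0]]               # largest lambda with a nonzero coefficient
    Z_orig, Z_knock = Z[:p], Z[p:]
    return np.maximum(Z_orig, Z_knock) * np.sign(Z_orig - Z_knock)
```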

Variation

Forward selection on the augmented design [X X̃]
First time (rank) at which either the original or its knockoff enters
Tag ‘+’ if the original comes before its knockoff, ‘−’ otherwise

[Diagram: tagged ranks; early entries (significant) versus late entries.]

Pairwise exchangeability of the nulls

[Figure: scatter of (value of λ when X_j enters, value of λ when X̃_j enters) for signal and null features, with the estimated FDP at threshold t = 1.5 indicated; for null features the two entry values are exchangeable.]

Consequence of exchangeability

Signs of the nulls are i.i.d. ±1, independent of |W| (the ordering):

(W_1, ..., W_p)   =_d   (W_1 · ε_1, ..., W_p · ε_p)

with the sign sequence {ε_j} independent of W, ε_j = +1 for all non-null j, and ε_j ∼ i.i.d. {±1} for null j

Signs → 1-bit p-values

Knockoff estimate of FDR

FDP(t) = #{j null : W_j ≥ t} / (#{j : W_j ≥ t} ∨ 1)
       ≈ #{j null : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1)
       ≤ #{j : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1)      (the knockoff estimate of FDP at threshold t)

3. Selection and FDR control

Select features with large and positive statistics: {W_j ≥ T}

Knockoff:    T = min{ t ∈ W : #{j : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1) ≤ q }

Theorem (Knockoff)
E[ V / (R + q⁻¹) ] ≤ q,   where V: # false positives, R: total # of selections

Knockoff+:   T = min{ t ∈ W : (1 + #{j : W_j ≤ −t}) / (#{j : W_j ≥ t} ∨ 1) ≤ q }

Theorem (Knockoff+)
E[ V / (R ∨ 1) ] ≤ q
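
A sketch of the threshold and selection step for a given vector of statistics W (the offset 1 gives knockoff+, 0 the basic knockoff):

```python
import numpy as np

def knockoff_select(W, q=0.2, plus=True):
    """Select {j : W_j >= T} with T the knockoff(+) threshold at level q."""
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0              # knockoff+ adds 1 to the numerator
    candidates = np.sort(np.abs(W[W != 0]))    # candidate thresholds t among the |W_j|
    for t in candidates:                       # T = smallest t meeting the level
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)             # no valid threshold: select nothing
```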

Stopping rule

T = min{ t : (0/1 + #{j : W_j ≤ −t}) / (#{j : W_j ≥ t} ∨ 1) ≤ q }      (0 for knockoff, 1 for knockoff+)

Stop the first time the ratio between the # of negatives and the # of positives falls below q

Why does all this work?

T = min{ t : (1 + #{j : W_j ≤ −t}) / (#{j : W_j ≥ t} ∨ 1) ≤ q }

FDP(T) = #{j null : W_j ≥ T} / (#{j : W_j ≥ T} ∨ 1)
       = [ #{j null : W_j ≥ T} / (1 + #{j null : W_j ≤ −T}) ] · [ (1 + #{j null : W_j ≤ −T}) / (#{j : W_j ≥ T} ∨ 1) ]
       ≤ q · #{j null : W_j ≥ T} / (1 + #{j null : W_j ≤ −T})
       = q · V⁺(T) / (1 + V⁻(T))

V⁺(t)/(1 + V⁻(t)) is a super-martingale w.r.t. a well-defined filtration, and T is a stopping time

Optional stopping time theorem

FDR ≤ q · E[ V⁺(T) / (1 + V⁻(T)) ]
    ≤ q · E[ V⁺(0) / (1 + V⁻(0)) ]                  (optional stopping for the super-martingale)
    = q · E[ V⁺(0) / (1 + #nulls − V⁺(0)) ]         with V⁺(0) ∼ Binomial(#nulls, 1/2)
    ≤ q

Comparison with other methods

Permutation methods

Let X^π = X with rows randomly permuted. Then

[X X^π]' [X X^π] ≈ ( Σ, 0 ; 0, Σ )

[Figure: value of λ (Z_j) at which each column of the augmented matrix [X X^π] enters the model, for original non-null features (signals), original null features, and permuted features.]

FDR over 1000 trials (nominal level q = 20%):
Knockoff method      12.29%
Permutation method   45.61%
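
A small numerical illustration of this point (assumed correlated design, NumPy only): the within-X Gram block keeps the correlations, while the cross block between X and the row-permuted copy is essentially zero, which is why permuted features make a poor control group.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 3000, 5, 0.6
Theta = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1)-type covariance
X = rng.multivariate_normal(np.zeros(p), Theta, size=n)
X /= np.linalg.norm(X, axis=0)

X_perm = X[rng.permutation(n)]                 # rows randomly permuted
Sigma = X.T @ X                                # within-original correlations: ~ Theta
cross = X.T @ X_perm                           # original vs permuted correlations: ~ 0
print(np.round(Sigma, 2))
print(np.round(cross, 2))
```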

Other methods

Benjamini-Hochberg (BHq)
BHq + log factor correction
BHq with whitened noise

y ∼ N(Xβ, σ²I)   ⟺   β̂^LS ∼ N(β, σ²(X'X)⁻¹)

Apply BHq to Z_j = β̂_j^LS / (σ √((Σ⁻¹)_{jj}))

Not known to control FDR → log factor correction (Benjamini-Yekutieli)
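
A sketch of this baseline (σ assumed known, n ≥ p, SciPy for the normal tail; not the authors’ code):

```python
import numpy as np
from scipy.stats import norm

def bhq_on_ls(X, y, sigma, q=0.2):
    """BHq applied to z-scores of the least-squares estimate (sigma assumed known)."""
    Sigma_inv = np.linalg.inv(X.T @ X)
    beta_ls = Sigma_inv @ X.T @ y
    z = beta_ls / (sigma * np.sqrt(np.diag(Sigma_inv)))
    pvals = 2 * norm.sf(np.abs(z))             # two-sided p-values
    # BH step-up on these (generally dependent) p-values
    order = np.argsort(pvals)
    n = pvals.size
    below = np.nonzero(pvals[order] <= q * np.arange(1, n + 1) / n)[0]
    return order[: below.max() + 1] if below.size else np.array([], dtype=int)
```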

Empirical results

Features N(0, I_n), n = 3000, p = 1000
k = 30 variables with regression coefficients of magnitude 3.5

Method                         FDR (%) (nominal q = 20%)   Power (%)   Theor. FDR control?
Knockoff+ (equivariant)        14.40                       60.99       Yes
Knockoff (equivariant)         17.82                       66.73       No
Knockoff+ (SDP)                15.05                       61.54       Yes
Knockoff (SDP)                 18.72                       67.50       No
BHq                            18.70                       48.88       No
BHq + log-factor correction     2.20                       19.09       Yes
BHq with whitened noise        18.79                        2.33       Yes

Effect of sparsity level

Same setup with amplitudes set to 3.5 (q = 0.2)

[Figure: FDR (%) and power (%) of Knockoff, Knockoff+ and BHq versus the sparsity level k from 50 to 200, with the nominal level shown as a reference in the FDR panel.]

Effect of signal amplitude

Same setup with k = 30 (q = 0.2)

[Figure: FDR (%) and power (%) of Knockoff, Knockoff+ and BHq versus the amplitude A from 2.8 to 4.2, with the nominal level shown as a reference in the FDR panel.]

Effect of feature correlation

Features ∼ N(0, Θ) with Θ_{jk} = ρ^{|j−k|}
n = 3000, p = 1000, k = 30 and amplitude = 3.5

[Figure: FDR (%) and power (%) of Knockoff, Knockoff+ and BHq versus the feature correlation ρ from 0.0 to 0.8, with the nominal level shown as a reference in the FDR panel.]

Application to real HIV data

HIV drug resistance

Drug type   # drugs   Sample size   # protease or RT positions genotyped   # mutations appearing ≥ 3 times in sample
PI          6         848           99                                      209
NRTI        6         639           240                                     294
NNRTI       3         747           240                                     319

Response y: log-fold-increase of lab-tested drug resistance
Covariate X_j: presence or absence of mutation #j

Data from R. Shafer (Stanford), available at:
http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/

PI-type drug resistance

TSM list: mutations associated with the PI class of drugs in general; not specialized to the individual drugs in the class

[Figure: for each PI drug (APV, ATV, IDV, LPV, NFV, SQV), the number of HIV-1 protease positions selected by Knockoff and by BHq, split into positions appearing / not appearing in the TSM list. Data set sizes range from n = 329, p = 147 to n = 843, p = 209.]

Figure caption: q = 0.2. # positions on the HIV-1 protease where mutations were selected. Horizontal line = # HIV-1 protease positions in the TSM list.

NRTI-type drug resistance

[Figure: for each NRTI drug (3TC, ABC, AZT, D4T, DDI, TDF), the number of HIV-1 RT positions selected by Knockoff and by BHq, split into positions appearing / not appearing in the TSM list. Data set sizes range from n = 353, p = 218 to n = 633, p = 292.]

Figure caption: Validation against the treatment-selected mutation (TSM) panel

NNRTI-type drug resistance

[Figure: for each NNRTI drug (DLV, EFV, NVP), the number of HIV-1 RT positions selected by Knockoff and by BHq, split into positions appearing / not appearing in the TSM list. Data set sizes: n = 732, p = 311; n = 734, p = 318; n = 746, p = 319.]

Figure caption: Validation against the treatment-selected mutation (TSM) panel

Heavy-tailed noise

Design as in the HIV example (sparse matrix)
Errors as residuals from the HIV example
Regression coefficients entered manually

Method      FDR (q = 20%)   Power
Knockoff+   20.31%          60.67%
BHq         25.47%          69.42%

Details

Knockoff constructions (n ≥ 2p)

X̃ = X(I − Σ⁻¹ diag{s}) + ŨC

[X X̃]' [X X̃] = ( Σ, Σ − diag{s} ; Σ − diag{s}, Σ ) =: G

G ⪰ 0   ⟺   diag{s} ⪰ 0  and  2Σ − diag{s} ⪰ 0

Knockoff constructions (n ≥ 2p)

Equi-correlated knockoffs: s_j = 2λ_min(Σ) ∧ 1, so that ⟨X_j, X̃_j⟩ = 1 − 2λ_min(Σ) ∧ 1
Under equivariance, minimizes the value of |⟨X_j, X̃_j⟩|

SDP knockoffs:
minimize Σ_j |1 − s_j|  subject to  s_j ≥ 0, diag{s} ⪯ 2Σ
⟺ minimize Σ_j (1 − s_j)  subject to  s_j ≥ 0, diag{s} ⪯ 2Σ

Highly structured semidefinite program (SDP)
Other possibilities
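
A minimal sketch of the equi-correlated construction (unit-norm columns and n ≥ 2p assumed; eigenvalues are clipped at zero purely for numerical safety):

```python
import numpy as np

def equicorrelated_knockoffs(X):
    """Equi-correlated fixed-X knockoffs: X~ = X(I - Sigma^{-1} S) + U~ C."""
    n, p = X.shape
    X = X / np.linalg.norm(X, axis=0)          # normalize columns
    Sigma = X.T @ X
    Sigma_inv = np.linalg.inv(Sigma)
    s = min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0) * np.ones(p)   # s_j = 2*lambda_min ^ 1
    S = np.diag(s)
    # C'C = 2S - S Sigma^{-1} S; take a symmetric square root, clipping tiny negatives
    A = 2 * S - S @ Sigma_inv @ S
    w, V = np.linalg.eigh(A)
    C = (V * np.sqrt(np.clip(w, 0, None))) @ V.T
    # U~: n x p orthonormal matrix orthogonal to the column span of X
    Q, _ = np.linalg.qr(np.hstack([X, np.random.default_rng(0).standard_normal((n, p))]))
    U_tilde = Q[:, p:2 * p]
    return X @ (np.eye(p) - Sigma_inv @ S) + U_tilde @ C
```

One can check that the resulting Gram matrix of [X X̃] matches the block form G displayed above.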

Symmetric statistics W([X X̃], y)

Sufficiency property:
W([X X̃], y) = f( [X X̃]' [X X̃], [X X̃]' y )

Anti-symmetry property: swapping changes signs
W_j([X X̃]_swap(S), y) = W_j([X X̃], y) · { +1 if j ∉ S,  −1 if j ∈ S }

Examples of statistics

Z_j = sup{λ : β̂_j(λ) ≠ 0}, j = 1, ..., 2p, with β̂(λ) the solution to the augmented Lasso
W_j = (Z_j ∨ Z_{j+p}) · sign(Z_j − Z_{j+p})   or   W_j = Z_j − Z_{j+p}

Same with a penalized estimate: min (1/2)‖y − Xb‖₂² + λP(b)

Forward selection / orthogonal matching pursuit: Z_1, ..., Z_{2p} the (reverse) order in which the 2p variables entered the model; W_j = (Z_j ∨ Z_{j+p}) · sign(Z_j − Z_{j+p})

Statistics based on LS estimates: W_j = |β̂_j^LS|² − |β̂_{j+p}^LS|²   or   |β̂_j^LS| − |β̂_{j+p}^LS|

... (endless possibilities)

Extensions

Other type-I errors

Can control the familywise error rate (FWER), k-FWER, ...: Janson and Su (’15)

T: time at which m knockoffs have entered; reject the hypotheses tagged ‘+’ that appear before T
Expected number of false discoveries: E[V] ≤ m
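
One reading of this rule, as a hedged sketch (not necessarily Janson and Su’s exact procedure): walk down the |W| ordering, stop when the m-th knockoff “wins”, and reject the ‘+’-tagged originals seen before that point.

```python
import numpy as np

def select_before_mth_knockoff(W, m=1):
    """Select the '+' hypotheses appearing before the m-th '-' in the |W| ordering.
    Per the slide, the intended procedure has E[# false discoveries] <= m."""
    order = np.argsort(-np.abs(W))             # most to least significant
    selected, negatives = [], 0
    for j in order:
        if W[j] < 0:
            negatives += 1
            if negatives >= m:
                break
        elif W[j] > 0:
            selected.append(j)
    return np.array(selected, dtype=int)
```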

Other models

Suppose we wish to test for groups:

y = Σ_{g ∈ G} X_g β_g + z,   H_g: β_g = 0

Group lasso: min (1/2)‖y − Σ_g X_g β̂_g‖₂² + λ Σ_g ‖β̂_g‖₂
Forward group selection, ...

Construct group knockoffs for exchangeability
Calculate statistics; e.g. signed reversed ranks
Compute the threshold as before
Provides FDR control

Summary

Knockoff filter = inference machine
You design the statistics, knockoffs take care of inference
Works under any design X (handles arbitrary correlations)
Does not require any knowledge of the noise level σ
Very powerful when effects are sparse

Open research: n < p ...

FDR control is an extremely useful concept even away from counting errors and successes

BHq

y ∼ N(Xβ, σ²I)   ⟺   β̂^LS ∼ N(β, σ²(X'X)⁻¹)

The statistics are independent iff X'X is diagonal (orthogonal design)

Orthogonal model: y ∼ N(β, I) (σ = 1 is known)

T_BH = min{ t : p · P{|N(0,1)| ≥ t} / #{j : |y_j| = |β_j + z_j| ≥ t} ≤ q }

The knockoff procedure (with the lasso) is quite different. Make a control group:

[X X̃] = ( I_p, 0 ; 0, I_p )   ⟹   statistics = ( y ; z′ )

with y ∼ N(β, I) independent from z′ ∼ N(0, I). Compute the threshold (knockoff+) via

T = min{ t : #{j : |z′_j| ≥ t and |z′_j| > |y_j|} / #{j : |y_j| ≥ t and |y_j| > |z′_j|} ≤ q }

Empirical comparison

p = 1000; # true signals is 200 → fraction of nulls is 0.8

[Figure: FDR and power of the BHq and knockoff+ methods versus the size A of the regression coefficients (signal magnitude), for A between 1 and 6, with the nominal level shown as a reference in the FDR panel.]



Some Closing Remarks: How About Prediction/Estimation?

FDR for estimation: y = Xβ + z Goal: predict response from explanatory variables Sparse (modern) setup: p large and only few important variables

FDR for estimation: y = Xβ + z Goal: predict response from explanatory variables Sparse (modern) setup: p large and only few important variables

Model selection via Cp 2 ˆ Cp (S) = ky − X β[S]k + 2σ 2 | {z } RSS

|S| |{z}

model dim.

ˆ β[S]: fitted LS coefficients from variables in S Cp (S) unbiased for prediction error of model S

S ⊂ {1, . . . , p}

FDR for estimation: y = Xβ + z Goal: predict response from explanatory variables Sparse (modern) setup: p large and only few important variables

Low bias Cp FWER FDR

Low variance

FDR thresholding (Abramovich and Benjamini (’96)) Orthogonal design: X T X = Ip =⇒ X T y ∼ N (β, σ 2 Ip ) Select nominal level q and perform BH(q) testing ( XiT y |XiT y| ≥ tFDR βˆi = 0 otherwise

FDR thresholding (Abramovich and Benjamini (’96)) Orthogonal design: X T X = Ip =⇒ X T y ∼ N (β, σ 2 Ip ) Select nominal level q and perform BH(q) testing ( XiT y |XiT y| ≥ tFDR βˆi = 0 otherwise

Theorem (Abramovich, Benjamini, Donoho, Johnstone (’05)) Sparsity class `0 (k) = {β : kβk0 ≤ k} Minimax risk

R(k) = inf sup

βˆ β∈`0 (k)

E kβˆ − βk22

FDR thresholding (Abramovich and Benjamini (’96)) Orthogonal design: X T X = Ip =⇒ X T y ∼ N (β, σ 2 Ip ) Select nominal level q and perform BH(q) testing ( XiT y |XiT y| ≥ tFDR βˆi = 0 otherwise

Theorem (Abramovich, Benjamini, Donoho, Johnstone (’05)) Sparsity class `0 (k) = {β : kβk0 ≤ k} Minimax risk

R(k) = inf sup

βˆ β∈`0 (k)

E kβˆ − βk22

Set q < 1/2. Then asymptotically, as p → ∞ and k ∈ [(log p)5 , p1−δ ] E kβˆFDR − βk22 = E kX βˆFDR − Xβk22 = R(k)(1 + o(1))

Other connections

SLOPE: Bogdan, van den Berg, Sabatti, Su and Candès (’13)

min (1/2)‖y − Xβ̂‖₂² + λ_1|β̂|_(1) + λ_2|β̂|_(2) + ... + λ_p|β̂|_(p)

with λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0 and |β̂|_(1) ≥ |β̂|_(2) ≥ ... ≥ |β̂|_(p) [order statistics]

Adaptive minimaxity: C. and Su (’15)

Sparse class ℓ_0(k) = {β : ‖β‖_0 ≤ k}
Fix q ∈ (0, 1] and set λ_i = σ · Φ⁻¹(1 − iq/2p)   (BHq weights)
For some linear models, adaptive minimax estimation:

sup_{β ∈ ℓ_0(k)} E‖Xβ̂_SLOPE − Xβ‖₂² = R(k)(1 + o(1))