Received 9 January 2017; Revised 25 October 2017; Accepted Pending
DOI: xxx/xxxx
MEBoost: Variable Selection in the Presence of Measurement Error
arXiv:1701.02349v3 [stat.CO] 25 Oct 2017
Ben Brown1 | Timothy Weaver2 | Julian Wolfson3

1 NAMSA, Minnesota, USA
2 Minneapolis Medical Research Foundation, Minnesota, USA
3 Division of Biostatistics, School of Public Health, University of Minnesota, Minnesota, USA
Summary

We present a novel method for variable selection in regression models when covariates are measured with error. The iterative algorithm we propose, Measurement Error Boost (MEBoost), follows a path defined by estimating equations that correct for covariate measurement error. Via simulation, we evaluated our method and compared its performance to the recently-proposed Convex Conditioned Lasso (CoCoLasso) and to the "naive" Lasso, which does not correct for measurement error. Increasing the degree of measurement error increased prediction error and decreased the probability of accurate covariate selection, but this loss of accuracy was least pronounced when using MEBoost. We illustrate the use of MEBoost in practice by analyzing data from the Box Lunch Study, a clinical trial in nutrition where several variables are based on self-report and hence measured with error.

KEYWORDS:
Boosting, High-Dimensional Data, Machine Learning, Measurement Error, Variable Selection

Correspondence
Ben Brown, 400 US-Hwy 169, Minneapolis, MN 55426. Email: bbrown@namsa.com

Present Address
400 US-Hwy 169, Minneapolis, MN 55426
1 INTRODUCTION
Variable selection is a well-studied problem in situations where covariates are measured without error. However, it is common for covariate measurements to be error-prone or subject to random variation around some mean value. Consider, for instance, a study wherein subjects report their daily food intake on the basis of a dietary recall questionnaire. There is variation from day to day in an individual's calorie consumption, but it is also well established in the nutrition literature that there is error associated with the recall or measurement of the number of calories in a meal 1,2. In the usual regression setting, ignoring measurement error leads to biased coefficient estimation 3, and hence the presence of measurement error has the potential to affect the performance of variable selection procedures. In this example, we may be able to build a predictive model from these mismeasured dietary recall data and then apply it to more expensive data that can be measured with reduced or eliminated measurement error, such as data collected with the help of a nutritionist or through prepackaged meals.

There has been relatively little research on variable selection in the presence of measurement error. Sorensen 4 introduced a variation of the Lasso that allows for Normal, i.i.d., additive covariate measurement error. Datta and Zou 5 proposed the convex conditioned Lasso (CoCoLasso), which corrects for both additive and multiplicative measurement error in the normal case. Both of these methods are applicable to linear models for continuous outcomes, but do not easily extend to regression models for other outcome types (e.g., binary or count data). Meanwhile, there is a sizable statistical literature on methods for performing estimation and inference for low-dimensional regression parameters in the presence of measurement error 3,6,7, but these approaches do not address the variable selection problem and cannot be applied in large p, small n problems.
We propose a novel method for variable selection in the presence of measurement error, MEBoost, which leverages estimating equations that have been proposed for low-dimensional estimation and inference in this setting. MEBoost is a computationally efficient path-following algorithm that moves iteratively in directions defined by these estimating equations, requiring only the calculation (not the solution) of an estimating equation at each step. As a result, it is much faster than alternative approaches involving, e.g., a matrix projection calculation at each step. MEBoost is also flexible: the version that we describe is based on estimating equations proposed by Nakamura 8, which apply to some generalized linear models,
and the underlying MEBoost algorithm can easily incorporate measurement error-corrected estimating equations for other regression models. We conducted a simulation study to compare MEBoost to the Convex Conditioned Lasso (CoCoLasso) proposed by Datta and Zou 5 and to the "naive" Lasso, which ignores measurement error. We also applied MEBoost to data from the Box Lunch Study, a clinical trial in nutrition where caloric intake across a number of food categories was based on self-report and hence measured with error.
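To make the path-following idea concrete, the sketch below applies a MEBoost-style update to a linear model with additive measurement error, using the Nakamura-type corrected estimating equation $W'(y - W\beta) + n\Delta\beta$. The step size, iteration count, and data-generating choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 10
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]

# True covariates X, outcome y, and contaminated covariates W = X + U.
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)
delta = 0.3
Delta = delta * np.eye(p)
W = X + rng.normal(scale=np.sqrt(delta), size=(n, p))

def corrected_score(beta):
    # Nakamura-type corrected estimating equation for the linear model:
    # the naive score W'(y - W beta) plus the correction term n * Delta @ beta.
    return W.T @ (y - W @ beta) + n * Delta @ beta

# Path-following: at each step, nudge only the coordinate with the largest
# score component by a small fixed amount (illustrative step size).
beta = np.zeros(p)
eps = 0.01
for _ in range(2000):
    u = corrected_score(beta)
    j = int(np.argmax(np.abs(u)))
    beta[j] += eps * np.sign(u[j])

print(np.round(beta, 2))  # large entries concentrate on the first three coordinates
```

Note that each iteration only evaluates the estimating equation; it never solves it, which is the source of the computational savings described above.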
2 BACKGROUND

2.1 Regression in the Presence of Covariate Measurement Error
Our discussion of measurement error models draws heavily from Fuller 7. When modeling error, the covariates can be treated as random or fixed values. Structural models consider the covariates to be random quantities, and functional models consider the covariates to be fixed 9. We consider a structural model. Let $Y = X\beta + \epsilon$, where $X$ is a (random) matrix of covariates of dimension $n \times p$, $\beta$ is a vector of coefficients of length $p$, $\epsilon$ is a vector of Normally distributed i.i.d. random errors of length $n$, and $Y$ is the resulting outcome vector, also of length $n$. In an additive measurement error model, we assume that what is observed is not $X$ but rather the "contaminated" or "error-prone" matrix $W = X + U$, where $U$ is a random $n \times p$ matrix. When a model is fit that ignores measurement error, i.e., one that assumes the true model is $Y = W\beta_W + \epsilon$, the resulting estimates $\hat{\beta}_W$ are said to be naive and satisfy

$$E[\hat{\beta}_W] = (\Sigma_{XX} + \Delta)^{-1} \Sigma_{XX} \beta \qquad (1)$$

where $\beta$ is the true coefficient vector, $\Sigma_{XX}$ is the covariance matrix of the covariates, and $\Delta \equiv \Sigma_{UU}$ is the covariance matrix of the measurement error. In the case of linear regression with a single covariate, (1) simplifies to an attenuating factor that biases the coefficient estimate towards zero. However, with multiple covariates the bias may increase, decrease, or even change the sign of the estimated coefficients. Notably, measurement error affecting a single covariate can bias coefficient estimates in all of the covariates, even those that are not measured with error 9.
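The bias in (1) is easy to verify numerically. In the following numpy sketch the covariates have identity covariance and the measurement error has covariance $\Delta = 0.5 I$, so (1) predicts attenuation by a factor of $1/1.5$; the sample size and covariance choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 2
beta = np.array([1.0, 0.5])

# True covariates with identity covariance (Sigma_XX = I).
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Additive measurement error U with covariance Delta = 0.5 * I.
delta = 0.5
W = X + rng.normal(scale=np.sqrt(delta), size=(n, p))

# Naive least squares on the contaminated covariates.
beta_naive = np.linalg.lstsq(W, y, rcond=None)[0]

# Theoretical expectation from equation (1): (Sigma_XX + Delta)^{-1} Sigma_XX beta.
Sigma_XX = np.eye(p)
Delta = delta * np.eye(p)
beta_expected = np.linalg.solve(Sigma_XX + Delta, Sigma_XX @ beta)

print(beta_naive)     # close to beta_expected, not to beta
print(beta_expected)  # beta / 1.5
```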
2.2 Variable Selection in the Presence of Measurement Error
Ma 10 presented methods to account for measurement error while performing variable selection in parametric and semi-parametric settings. Focusing on the parametric setting, they proposed a wide-scoping method that can be used in more than just generalized linear models. The method relies on deriving the full likelihood of each observation and its corresponding score function, $S^*_{\text{eff}}(W_i, Y_i, \beta)$, choosing a penalty function and finding its derivative, $p'(\beta)$, then solving the penalized estimating equations:

$$\sum_{i=1}^{n} S^*_{\text{eff}}(W_i, Y_i, \beta) - n p'(\beta) = 0 \qquad (2)$$

Solving the penalized equations can be very difficult computationally, especially in the high-dimensional setting. We therefore compare our method with faster methods that are variants of the Lasso, which can be solved much more quickly.
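The structure of (2) can be illustrated by evaluating (not solving) its left-hand side. The sketch below uses the naive linear-model score in place of $S^*_{\text{eff}}$ for simplicity, and an L1 penalty whose derivative is taken as the subgradient $\lambda\,\mathrm{sign}(\beta)$; both simplifications are illustrative assumptions.

```python
import numpy as np

def penalized_estimating_eq(beta, W, y, lam):
    """Left-hand side of equation (2), specialized to a linear model.

    The score here is the naive linear-model score summed over
    observations; in the measurement-error setting a corrected score
    would be substituted. The L1 penalty derivative is the subgradient
    lam * sign(beta), with sign(0) taken as 0."""
    n = len(y)
    score = W.T @ (y - W @ beta)           # sum_i S(W_i, Y_i, beta)
    penalty_grad = n * lam * np.sign(beta)  # n * p'(beta)
    return score - penalty_grad

rng = np.random.default_rng(2)
W = rng.standard_normal((50, 3))
y = W @ np.array([1.0, 0.0, -1.0]) + rng.standard_normal(50)

# At a candidate beta, a root-finder would drive this vector to zero.
print(penalized_estimating_eq(np.array([1.0, 0.0, -1.0]), W, y, lam=0.1))
```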
2.3 Lasso in the Presence of Measurement Error
Sorensen et al. 4 analyze the Lasso 11 in the presence of measurement error by studying the properties of

$$\hat{\beta}_{\text{Lasso},\lambda_n} = \arg\min_{\alpha} \|Y - W\alpha\|_2^2 + \lambda_n \|\alpha\|_1. \qquad (3)$$

$\hat{\beta}_{\text{Lasso},\lambda_n}$ is asymptotically biased when $\lambda_n/n \to 0$ as $n \to \infty$, since $E[\hat{\beta}_{\text{Lasso},\lambda_n}] = (\Sigma_{XX} + \Delta)^{-1} \Sigma_{XX} \beta$. Notice this is the same bias that is introduced when naive linear regression is performed on observed covariates. Sorensen et al. 4 derive a lower bound on the magnitude of the nonzero coefficient elements below which the corresponding covariate will not be selected, and an upper bound on the $L_1$ estimation error $\|\hat{\beta}_W - \beta\|_1$. They show that the lower bound increases with increasing measurement error: measurement error adds non-informative noise to the system, so for the relevant covariates to be identified their signal must be stronger. Increased measurement error also increases the upper bound on the estimation error. Sign-consistent selection is also impacted by the presence of covariate measurement error. Sorensen et al. 4 establish a lower bound on the probability of sign-consistent selection in this setting. The result requires that the Irrepresentability Condition with Measurement Error (IC-ME) holds. The IC-ME requires that the measurements of the relevant and irrelevant covariates have limited correlation, relative to the size of the relevant measured covariate correlation. Note that the sample correlation of the irrelevant covariates is not considered. By studying the form of the lower bound, it can be concluded that (at least when using the Lasso) measurement error introduces a greater distortion in the selection of irrelevant covariates than in the selection of relevant covariates.
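The attenuation that (3) inherits from the naive regression bias can be seen in a small experiment. The coordinate-descent Lasso below, and the choices of $\lambda$, sample size, and error variance, are illustrative sketches rather than anything from the paper.

```python
import numpy as np

def lasso_cd(A, y, lam, n_iter=200):
    """Minimal coordinate-descent Lasso via soft-thresholding (sketch only)."""
    n, p = A.shape
    beta = np.zeros(p)
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - A @ beta + A[:, j] * beta[j]   # partial residual
            rho = A[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
n, p = 500, 8
beta_true = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0])
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)
W = X + rng.normal(scale=0.8, size=(n, p))  # substantial measurement error

b_clean = lasso_cd(X, y, lam=50.0)  # Lasso on the true covariates
b_naive = lasso_cd(W, y, lam=50.0)  # naive Lasso on contaminated covariates

print(np.round(b_clean, 2))  # signal coordinates close to beta_true
print(np.round(b_naive, 2))  # signal coordinates attenuated toward zero
```

The attenuated signal coefficients sit closer to the selection threshold, which is one way to see why the lower bound on detectable signal strength grows with the measurement error.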
Sorensen et al. 4 introduced an iterative method to obtain the Regularized Corrected Lasso with a constraint on the radius $R$:

$$\hat{\beta}_{\text{RCL}} = \arg\min_{\|\beta\|_1 \le R} \left\{ \|y - W\beta\|_2^2 - n\beta'\Delta\beta + \lambda\|\beta\|_1 \right\}. \qquad (4)$$
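The $-n\beta'\Delta\beta$ term in (4) offsets the extra curvature that the measurement error adds, since $E[W'W] = X'X + n\Delta$. In the unpenalized, low-dimensional case the corresponding corrected score has the closed-form zero $(W'W - n\Delta)^{-1}W'y$, which the following sketch (assuming $\Delta$ is known, as (4) also requires) compares with the naive estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5000, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)
delta = 0.5
Delta = delta * np.eye(p)
W = X + rng.normal(scale=np.sqrt(delta), size=(n, p))

# Naive least squares: attenuated (here Sigma_XX = I, so by a factor 1/1.5).
beta_naive = np.linalg.solve(W.T @ W, W.T @ y)

# Corrected (unpenalized) estimator: zero of the corrected score
# W'(y - W beta) + n * Delta @ beta, i.e. (W'W - n Delta)^{-1} W'y.
beta_corr = np.linalg.solve(W.T @ W - n * Delta, W.T @ y)

print(np.round(beta_naive, 2))  # attenuated toward zero
print(np.round(beta_corr, 2))   # approximately beta_true
```

Note that $W'W - n\Delta$ need not be positive semi-definite in general, which is why the penalized problem (4) is non-convex and requires the radius constraint.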