Big data, causality and unrealistic correlation

Priyantha Wijayatunga
Senior Lecturer, Department of Statistics
Umeå School of Business and Economics, Umeå University, Umeå 901 87, Sweden
URL: http://www.usbe.umu.se/om/personal/prwi0001
Email: [email protected]

Summary of the talk
• Types of data analysis and models for data with non-linearities
• Correlations and measures of dependence for complex data
• Can we update such measures as new data arise, without redoing the analysis?
• A statistical modeling problem – we need subject domain knowledge

Data and Analysts
• Statisticians – Long tradition of modeling data samples from large populations with mathematical functions and methods. Their work supports improving subject-matter knowledge
• Probabilists – Use complex mathematics to estimate probabilities of events of interest (probability theory)
• Probabilisticians – Avoid strict assumptions and explicit mathematical functions and relations; combinatorial thinking is preferred
• Data scientists – Mine large amounts of data to find useful patterns

Massive Data
• Better for statistical, probability, and probabilistic theory
• Statistical significance may not imply practical significance – subject-matter knowledge is needed to identify (not to quantify) the size of the effect that is predictive or causal
• Large samples are computationally heavy – need to balance this with statistical accuracy
• High dimensionality of data brings more noise accumulation, spurious correlations, heterogeneity, and incidental endogeneity (many unrelated variables may incidentally be correlated with residual noises)
• If the data arise from different experimental settings, this should be taken into consideration first: the data are generated from a mixture of distributions $f(x) = \alpha_1 f_1(x) + \cdots + \alpha_k f_k(x)$, as sketched below
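As a small illustration of the mixture point, here is a sketch that pools data from k = 3 hypothetical experimental settings; all weights, means, and spreads below are made up for illustration.

```python
# Illustration: data pooled from k settings behave like draws from
# f(x) = a1*f1(x) + ... + ak*fk(x)  (all parameters are hypothetical).
import numpy as np

rng = np.random.default_rng(3)
weights = [0.5, 0.3, 0.2]                  # alpha_1, ..., alpha_3
means, sds = [0.0, 3.0, -2.0], [1.0, 0.5, 2.0]

k = rng.choice(3, size=10_000, p=weights)  # which setting generated each point
x = rng.normal(np.array(means)[k], np.array(sds)[k])
print(x.mean(), x.std())                   # pooled data mix the three settings
```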

Non-linearities in data: What are the best models?
• At Google it is claimed that a re-sample as small as 5% of the massive data can give highly accurate models
• But the sampling must be done carefully
• Tree-based models have been shown to be very effective for modeling non-linear effects and interactions (Varian 2014)
• E.g.: classifying credit applications
• Problem: the model can be over-fitted to the data
• We need to prune the tree: reduce the complexity of the model
• A better model: a random forest (see the sketch after the figure below)

German Credit Data

[Figure: conditional inference tree fitted to the German credit data. Node 1 splits on Account.Balance (p < 0.001, {1, 2} vs {3, 4}); nodes 2, 3 and 9 split on Payment.Status.of.Previous.Credit (p < 0.001, p = 0.029, p = 0.035); node 5 splits on Account.Balance (p = 0.006). Terminal nodes 4, 6, 7, 8, 10, 11 (n = 122, 172, 181, 68, 395, 62) show bar charts of the credit-rating proportions.]
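The figure above is an R conditional-inference-tree plot. As a rough, hypothetical analogue – not the author's code – here is a Python sketch using scikit-learn that fits a pruned single tree and a random forest to the same German credit data; the OpenML dataset name credit-g and all parameter values are assumptions, and network access is required to fetch the data.

```python
# Sketch: tree-based classification of the German credit data
# (hypothetical Python analogue of the R conditional inference tree above).
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# German credit data from OpenML (assumes network access).
credit = fetch_openml(name="credit-g", version=1, as_frame=True)
X = pd.get_dummies(credit.data)     # one-hot encode categorical predictors
y = credit.target                   # "good" / "bad" credit rating

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single tree; ccp_alpha > 0 applies cost-complexity pruning against over-fitting.
tree = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X_tr, y_tr)

# A random forest averages many de-correlated trees, usually a better model.
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

print("pruned tree  :", tree.score(X_te, y_te))
print("random forest:", forest.score(X_te, y_te))
```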

Spurious Correlation
• If the population correlation is zero, then the sample correlation is very small (in absolute value) for large sample size n
• But with a large number of variables p we may get large sample correlations: simulations are given in Fan (2014)
• In some simulations, if we generate a large number of variables independently, some of them turn out to be strongly correlated (see the sketch below)!!!
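A minimal simulation in the spirit of Fan (2014): generate p independent variables and a response, all with zero population correlation, and look at the largest sample correlation; the sample size and dimension below are arbitrary choices.

```python
# Spurious correlation: with many independent variables, the largest
# absolute sample correlation can be far from zero (cf. Fan, 2014).
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5000                       # small sample, many variables
X = rng.standard_normal((n, p))       # all columns generated independently
y = rng.standard_normal(n)

# Sample correlation of y with each column, via standardised scores.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
corrs = Xs.T @ ys / n

print("max |sample correlation|:", np.abs(corrs).max())  # large despite independence
```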

A meaningful relationship?
Sales of organic foods and the number of individuals diagnosed with autism: if two series both have trends, then they are correlated!! One may not cause the other. But is there any common cause?

Source: https://www.buzzfeed.com/ ..

Measuring Dependence: MIC
• Correlation is only a measure of linear association
• With Big Data one may need to measure any dependence among many variables: the Maximal Information Coefficient (MIC), Reshef et al. (2011)
• There are algorithms to handle high-dimensional data with MIC
• But it can have low statistical power
• Can we estimate/update a dependence measure as new data arise, for example with online data?
• We can do this for the correlation coefficient, for MIC (a little subtle), and for a new measure we have proposed (a simplified MIC-style sketch follows below)
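As an illustration only, here is a simplified grid-based measure in the spirit of MIC: maximize normalized mutual information over a few grid resolutions. The reference MIC algorithm of Reshef et al. (2011) searches grids far more cleverly; the bin limits here are arbitrary assumptions.

```python
# A simplified, MIC-style dependence score: maximum over small grids of
# mutual information normalised by log(min(kx, ky)). Illustrative only.
import numpy as np

def grid_mi(x, y, kx, ky):
    """Mutual information (nats) of x, y discretised on a kx-by-ky grid."""
    joint, _, _ = np.histogram2d(x, y, bins=(kx, ky))
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def mic_like(x, y, max_bins=8):
    """Maximise normalised MI over grid resolutions 2..max_bins."""
    return max(grid_mi(x, y, kx, ky) / np.log(min(kx, ky))
               for kx in range(2, max_bins + 1)
               for ky in range(2, max_bins + 1))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 1000)
print(mic_like(x, x**2 + 0.1 * rng.standard_normal(1000)))  # strong non-linear dependence
print(mic_like(x, rng.standard_normal(1000)))               # near independence
```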

Correlation Coefficient

• A measure of dependence is a normalized distance between dependence and independence. For binary X and Y:

$$P(X,Y) = \begin{pmatrix} p(0,0) & p(0,1) \\ p(1,0) & p(1,1) \end{pmatrix}, \qquad P_I(X,Y) = \begin{pmatrix} p(0)q(0) & p(0)q(1) \\ p(1)q(0) & p(1)q(1) \end{pmatrix}$$

$$P_{Mx}(X,Y) = \begin{pmatrix} p(0) & 0 \\ 0 & p(1) \end{pmatrix} = P(X), \qquad P_{My}(X,Y) = \begin{pmatrix} q(0) & 0 \\ 0 & q(1) \end{pmatrix} = Q(Y)$$

$$\rho = \frac{|p(x,y) - p(x)q(y)|}{\sqrt{p(x)\big(1-p(x)\big)\,q(y)\big(1-q(y)\big)}} \quad \text{for any } x, y = 0, 1$$
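A quick numerical check of the binary formula, using a made-up 2×2 joint table: the normalized distance between P(X,Y) and P(X)P(Y) equals the usual correlation coefficient (in absolute value), whichever cell (x, y) is used.

```python
# Worked check of the binary formula above with a hypothetical joint table.
import numpy as np

P = np.array([[0.30, 0.20],    # p(0,0), p(0,1)
              [0.10, 0.40]])   # p(1,0), p(1,1)
p = P.sum(axis=1)              # marginal of X: p(0), p(1)
q = P.sum(axis=0)              # marginal of Y: q(0), q(1)

rho = abs(P[1, 1] - p[1] * q[1]) / np.sqrt(p[1] * (1 - p[1]) * q[1] * (1 - q[1]))
print(rho)   # ~0.408; the same value results from any cell (x, y)
```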

Correlation Coefficient
• A measure of dependence is a normalized distance between dependence and independence
• Distances were used to measure dependencies in Wijayatunga et al. (2006) and Granger et al. (2004) independently
• But they did not do any normalization
• E.g.: X is the gender of the student and Y is passing/failing the exam
• Variables with only 2 values can only be linearly related or independent
• So the correlation is meaningful
• The odds ratio is another measure of dependence

Correlation Coefficient
• A measure of dependence is a normalized distance between dependence and independence
• For n-ary X and Y,

$$\rho(X,Y) = \frac{E(XY) - E(X)E(Y)}{\sqrt{Var(X)\,Var(Y)}}$$

• The terms are just moments
• We can update the measure as new data arrive, because it is calculated from probability distributions; the same is true for the odds ratio (see the sketch below)

Measure of Dependence of X and Y, DD(X,Y)
• D(P(·), Q(·)) is the distance between two probability distributions P(·) and Q(·), e.g. the Hellinger distance
• $P_X^i(X,Y)$ is the i-th probability distribution representing maximal dependence with the marginal of X unchanged (and similarly $P_Y^i(X,Y)$ for Y)

$$DD(X,Y) = \frac{D\big(P(X,Y),\,P(X)P(Y)\big)}{\Big(\prod_{i=1}^{a} D\big(P_X^i(X,Y),\,P(X)P(Y)\big)\Big)^{1/2a}\,\Big(\prod_{i=1}^{b} D\big(P_Y^i(X,Y),\,P(X)P(Y)\big)\Big)^{1/2b}}$$

• It can measure any non-linear dependence!!!
• Finding a maximal-dependence distribution of X and Y with the marginal distribution of X unchanged, $P_X^1(X,Y)$, etc., manually is time consuming!!
• Calculation of DD(X,Y) is computational: we need to program it (a sketch follows below)!!
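A minimal computational sketch of DD(X,Y), under stated assumptions: the Hellinger distance is used for D, and for binary variables the diagonal tables P_Mx and P_My from the earlier slide are taken as the maximal-dependence distributions (so a = b = 1). The author's full construction of the P_X^i may differ; this only illustrates the shape of the computation.

```python
# Sketch of DD(X,Y) with Hellinger distance and user-supplied
# maximal-dependence tables (the choice of tables is an assumption).
import numpy as np

def hellinger(P, Q):
    return np.sqrt(0.5 * ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum())

def dd(P, maxdep_x, maxdep_y):
    """Distance to independence, normalised by the geometric means of the
    distances from each maximal-dependence table to independence."""
    PI = np.outer(P.sum(axis=1), P.sum(axis=0))   # P(X)P(Y)
    num = hellinger(P, PI)
    den_x = np.prod([hellinger(M, PI) for M in maxdep_x]) ** (1 / (2 * len(maxdep_x)))
    den_y = np.prod([hellinger(M, PI) for M in maxdep_y]) ** (1 / (2 * len(maxdep_y)))
    return num / (den_x * den_y)

P = np.array([[0.30, 0.20], [0.10, 0.40]])
p, q = P.sum(axis=1), P.sum(axis=0)
P_Mx = np.diag(p)   # maximal dependence keeping the marginal of X
P_My = np.diag(q)   # maximal dependence keeping the marginal of Y
print(dd(P, [P_Mx], [P_My]))
```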

Measures of Dependence of X and Y
• They are based on the probability distribution of X and Y, P(X,Y)
• Data are used to estimate P(X,Y), P_I(X,Y) = P(X)P(Y), etc.
• Therefore also E(XY), E(X), etc.
• Some online updating of the estimates is possible
• Update the probability distributions using, e.g., Bayes' theorem
• Distributions from earlier data are the prior distributions for the new data
• Updated distributions are the posterior distributions for the whole data:

$$P(X, Y \mid Data) \propto Likelihood \times Prior$$

• Suitably updated probability distributions can then be used to estimate the updated measures: no need to use raw data from the whole past (see the sketch below)
• Updating can be done in different ways – select the best one!!
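One possible implementation of such updating, as a sketch: a Dirichlet prior over the cells of a 2×2 joint table is conjugate to multinomial sampling, so the posterior counts from earlier batches serve directly as the prior for new batches. The prior and the batches below are made up.

```python
# Bayesian updating of P(X, Y) for two binary variables with a Dirichlet
# prior: old posterior counts become the prior for the next batch, so past
# raw data need not be stored.
import numpy as np

alpha = np.ones((2, 2))            # uniform Dirichlet prior over the 2x2 cells

def update(alpha, batch):
    """Add cell counts from a new batch of (x, y) pairs to the Dirichlet counts."""
    for x, y in batch:
        alpha[x, y] += 1
    return alpha

alpha = update(alpha, [(0, 0), (1, 1), (1, 1), (0, 1)])   # first batch
alpha = update(alpha, [(1, 1), (0, 0)])                   # later batch; prior = old posterior

P = alpha / alpha.sum()            # posterior-mean estimate of P(X, Y)
print(P)                           # feed into any of the dependence measures above
```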

Lord's Paradox
The initial weight distribution and the final weight distribution after a diet program are the same for each sex. Is there an effect of sex on the weight gain?

Graphs: http://m-clark.github.io/docs/lord/ and https://www.r-bloggers.com/lords-paradox-in-r/

Lord's Paradox
Weight gain of the j-th boy in the i-th subgroup: $D_{ij}^1$ for $j = 1, \ldots, n_i$; $i = 1, \ldots, a$
Weight gain of the j-th girl in the i-th subgroup: $D_{ij}^0$ for $j = 1, \ldots, m_i$; $i = 1, \ldots, a$

$$f_{i1} = \frac{n_i}{\sum_{i=1}^{a} n_i}, \qquad f_{i0} = \frac{m_i}{\sum_{i=1}^{a} m_i}, \qquad f_i = \frac{n_i + m_i}{\sum_{i=1}^{a} (n_i + m_i)}$$

$$\bar{D}_{i1} = \frac{1}{n_i} \sum_{j=1}^{n_i} D_{ij}^1, \qquad \bar{D}_{i0} = \frac{1}{m_i} \sum_{j=1}^{m_i} D_{ij}^0$$

Let's identify each subgroup by initial weight WI (a numerical illustration follows below).
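A small numerical illustration of the notation above, with made-up weight gains for a = 3 initial-weight subgroups; it computes only the frequencies and subgroup means defined on the slide, not the effect quantities A1 and A2.

```python
# Subgroup frequencies and mean weight gains (hypothetical data, a = 3).
import numpy as np

# D1[i], D0[i] = weight gains in subgroup i for boys (1) and girls (0)
D1 = [np.array([2.0, 1.5]), np.array([0.5, 1.0, 1.5]), np.array([1.0])]
D0 = [np.array([1.0, 2.0, 1.5]), np.array([1.0]), np.array([0.5, 1.5])]

n = np.array([len(d) for d in D1])          # n_i
m = np.array([len(d) for d in D0])          # m_i
f1, f0 = n / n.sum(), m / m.sum()           # f_i1, f_i0
f = (n + m) / (n + m).sum()                 # f_i

Dbar1 = np.array([d.mean() for d in D1])    # mean gain of boys in subgroup i
Dbar0 = np.array([d.mean() for d in D0])    # mean gain of girls in subgroup i
print(f1, f0, f)
print(Dbar1 - Dbar0)                        # subgroup-wise sex differences in gain
```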

Lord's Paradox

[Figure slide: weight-gain plots; see the graph sources above.]

Lord's Paradox
• There is no confounding by WI if the initial weight (WI) distributions are the same for both sexes S = 0 and S = 1: that is, if f1 = f0, then A2 = A1
• An absolute measure of the confounding effect is the distance between the two distributions f1 and f0, possibly normalized
• What if WI is an effect of S, and WF is an effect of both WI and S, so that WI is a mediator? Then the causal effect is A1
• We argue differently: the causal effect is A2
• Who knows the truth? Data alone are of little help
• We need subject-matter knowledge based on some form of experiment!!! This can be quite expensive

Causal Lord's Paradox
• In the literature, sex (S) and initial weight (WI) are regarded as causal factors for the final weight (WF)
• But then S is taken as the cause of WI!!
• If genetic information decides both one's sex and one's weight, then in a causal model for WF, S is not a cause of WI; the two are merely associated
• Data cannot decide this: we need subject domain knowledge

Causal Lord's Paradox
• We can turn the smaller regression into a larger one
• The larger one has a smaller error (residual) variance: a better model!!
• But the errors should be symmetrically distributed: the WF values should be symmetric about the model prediction
• Then the unknown factors associated with WF can be assumed to be random

Which model is better: the model in the first line or the one in the last? The behavior of the residuals (symmetric or not) is important. If the residuals are symmetric, then the smaller the conditional variance the better. If the two models contradict each other, then we cannot conclude (see the sketch below).
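A sketch of this comparison on simulated data (the data-generating model below is an assumption, not Lord's original data): fit the smaller regression WF ~ S and the larger one WF ~ S + WI with statsmodels, then compare residual variance and symmetry via skewness.

```python
# Compare the smaller and larger regressions on simulated Lord-style data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
S = rng.integers(0, 2, n)                      # sex
WI = 60 + 5 * rng.standard_normal(n)           # initial weight (assumed model)
WF = WI + 2 * S + rng.standard_normal(n)       # final weight (assumed model)

small = sm.OLS(WF, sm.add_constant(np.column_stack([S]))).fit()
large = sm.OLS(WF, sm.add_constant(np.column_stack([S, WI]))).fit()

for name, fit in [("WF ~ S", small), ("WF ~ S + WI", large)]:
    r = fit.resid
    skew = ((r - r.mean()) ** 3).mean() / r.std() ** 3   # symmetry check
    print(name, "residual variance:", r.var().round(3), "skewness:", skew.round(3))
```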

References
Fan, J. (2014). Features of Big Data and sparsest solution in high confidence set. In Past, Present and Future of Statistical Science (X. Lin, C. Genest, D. L. Banks, G. Molenberghs, D. W. Scott, J.-L. Wang, Eds.), Chapman & Hall, New York, 507–523.
Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society, Series B, 74(1), 37–65.
Granger, C. W., Maasoumi, E. and Racine, J. (2004). A dependence metric for possibly nonlinear processes. Journal of Time Series Analysis, 25(5), 649–669.
Reshef, D. et al. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518–1524.
Varian, H. R. (2014). Big Data: new tricks for econometrics. Journal of Economic Perspectives, 28(2), 3–28.
Wijayatunga, P., Mase, S. and Nakamura, M. (2006). Appraisal of companies with Bayesian networks. International Journal of Business Intelligence and Data Mining, 1(3), 329–346.
Wijayatunga, P. (2016). A geometric view on Pearson's correlation coefficient and a generalization of it to non-linear dependencies. Ratio Mathematica, 30, 3–21.
Wijayatunga, P. (2017). Resolving the Lord's paradox. Manuscript submitted.