Dimension Reduction and Variable Selection

2/13/2018

Leonardo Auslender –Ch. 1 Copyright 2004


Contents
  Definition
  Data sets and Examples
  UEDA: Univariate EDA
    Definitions of central tendency and variability.
    Data set descriptions.
    Transformations.
    Statistical Inference – Probability.
    Univariate Data Distributions – Normality.
    CLT – Statistical Tests.
    Statistical Puzzles
    Bayesian Inference (under construction).
  BEDA
    Continuous Variables – Correlations – Causation
    Nominal Variables – Chi Square – Odds



Contents (cont.)
  MEDA
    Principal Components
    Factor Analysis
    Clustering and Segmentation
    Canonical Discrimination Analysis
    Missing Value imputation
    Outliers and Variable Transformations.


EDA: definition, purpose and usefulness. EDA is typically (or should be) the initial step for viewing and trying to comprehend a data set, assumed to be of rectangular form: columns are called variables, rows observations. The aim is to comprehend each variable individually and the relationships across variables.

Given the size of present databases, with hundreds if not thousands of variables or attributes, it is hard, and sometimes impossible, to obtain a full conceptual understanding of the informational content of the data. Since many applications lead to (some) modeling, it is also possible and desirable to perform EDA on the outcome of these procedures. In this sense, we can envision EDA applied twice, before and after model creation, possibly restarting the modeling effort.

Variables are random when their values cannot be known with certainty, i.e., they are not deterministic. For instance, it is not possible to know the next outcome of a roulette bet, but we do know the probability of success. When this probability is known, we talk about risk; when it is not knowable, we talk about uncertainty.
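The roulette example can be made concrete. The sketch below assumes a European wheel with 37 pockets and a bet on a single number (the text specifies neither); the names POCKETS and simulate_win_rate are illustrative. The probability of success is known exactly, so this is a situation of risk, although each individual outcome remains random.

```python
import random
from fractions import Fraction

POCKETS = 37  # assumption: European wheel, pockets numbered 0..36

# The probability of winning a single-number bet is known in advance (risk).
p_win = Fraction(1, POCKETS)


def simulate_win_rate(n_spins, bet=17, seed=0):
    """Return the observed fraction of spins that land on the chosen number."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    wins = sum(1 for _ in range(n_spins) if rng.randrange(POCKETS) == bet)
    return wins / n_spins


observed = simulate_win_rate(100_000)
print(f"known probability:  {float(p_win):.4f}")
print(f"observed frequency: {observed:.4f}")
```

No single spin can be predicted, yet the long-run frequency settles close to the known probability.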



Data sets in a large data set setting contain hundreds of variables, if not more, with the following taxonomy:
1) Numeric.
2) Character, considered nominal, i.e., no cardinality.
3) Ordinal.
4) Ratio: ratio of numeric variables.
5) Id:
   a. Id proper, nominal variables: patient id, SSN, etc.
   b. Indices: time index, patient visit number, etc.

Can compute                                                Nominal  Ordinal  Interval  Ratio
frequency distribution.                                    Yes      Yes      Yes       Yes
median and percentiles.                                    No       Yes      Yes       Yes
add or subtract.                                           No       No       Yes       Yes
mean, standard deviation, standard error of the mean.      No       No       Yes       Yes
ratio, or coefficient of variation.                        No       No       No        Yes
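The table of permitted statistics can be encoded as a simple lookup. This is only a sketch: the names LEVELS, MIN_LEVEL, can_compute and permitted are illustrative, and the rule it implements is the one the table expresses, namely that a statistic is allowed at a given measurement level and at every stronger level.

```python
# Measurement levels ordered from weakest to strongest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Weakest level at which each statistic from the table is meaningful.
MIN_LEVEL = {
    "frequency distribution": "nominal",
    "median and percentiles": "ordinal",
    "add or subtract": "interval",
    "mean, standard deviation, standard error of the mean": "interval",
    "ratio, or coefficient of variation": "ratio",
}


def can_compute(statistic, level):
    """True if the statistic is meaningful for a variable of this level."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[statistic])


def permitted(level):
    """List every statistic from the table allowed at this level."""
    return [s for s in MIN_LEVEL if can_compute(s, level)]


for level in LEVELS:
    print(level, "->", permitted(level))
```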


Proposed EDA steps
We can envision 5 steps in EDA:
1) Data set definition: at least the number of observations, the variables and their taxonomy.
2) UEDA: Univariate.
3) BEDA: Bivariate.
4) MEDA: Multivariate: missing values analysis and imputation, principal components and clustering.
5) Outliers and variable transformation or engineering, which involve UEDA, BEDA and MEDA. E.g., log(Var1) is a univariate transformation; PCA is a multivariate transformation.
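As an illustration of a univariate transformation such as log(Var1), the sketch below compares skewness before and after taking logs. The values in var1 are hypothetical and chosen to be strictly positive and right-skewed; the skewness helper uses the population definition (mean of cubed z-scores), which is one of several conventions.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed variable; values must be > 0 for the log.
var1 = [1, 2, 2, 3, 4, 5, 8, 13, 40, 120]
log_var1 = [math.log(v) for v in var1]

print(f"skewness of Var1:      {skewness(var1):.2f}")
print(f"skewness of log(Var1): {skewness(log_var1):.2f}")
```

The log pulls in the long right tail, so the transformed variable is less skewed than the original.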



Statistical Inference
Statistical inference is part of both EDA and modeling. It allows for comparisons and for statements about similarity (or lack of it) in many different situations. All data sets are assumed to be representative samples from an infinite population, which can be unrealistic. If the interest is in the heights of first graders in a specific school at a specific time, the data is the entire population, and there is no uncertainty about the recorded heights.

If the interest is in height changes across time (and we thus work with samples), then statistical inference is used to infer information about the overall population (perhaps ideal, not real).
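When the data are a sample rather than the whole population, the sample mean comes with a standard error. A minimal sketch, assuming a hypothetical sample of eight heights in centimetres and a rough normal-approximation interval (the multiplier 1.96 is an assumption; for a sample this small a t-based multiplier would be wider):

```python
import math
import statistics


def mean_with_standard_error(sample):
    """Return the sample mean and the standard error of the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    return mean, sem


def approx_95_interval(sample):
    """Rough 95% interval for the population mean (normal approximation)."""
    mean, sem = mean_with_standard_error(sample)
    return mean - 1.96 * sem, mean + 1.96 * sem


# Hypothetical sample of first-grader heights, in cm.
heights_cm = [112.0, 115.5, 118.0, 120.5, 117.0, 114.0, 121.0, 116.0]

mean, sem = mean_with_standard_error(heights_cm)
low, high = approx_95_interval(heights_cm)
print(f"mean = {mean:.2f} cm, standard error = {sem:.2f} cm")
print(f"approximate 95% interval: ({low:.2f}, {high:.2f})")
```

If the eight children were the entire population of interest, the mean would simply be a fact about that population and no interval would be needed.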



Data set 1: Definition by way of Example
• Health insurance company
• Ophthalmologic insurance claims
• Is claim valid?
• Present operation:
  • Manual review of history and circumstances
Alternative: Scoring analytical system.


Data Mining Solution
• Data set 1 (DS1): Use data on past claims to verify fraud
• Data set 2: Investigate babies' deaths.

[Diagram: Historical Data, Predictive Model, New Applicants]


SAS EXAMPLE for fraud data set:

    ods html;
    proc contents data = fraud.fraud;
    run;
    ods html close;

Alphabetic List of Variables and Attributes
#  Variable         Type  Len  Format   Informat  Label
3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
5  MEMBER_DURATION  Num   8    BEST12.  F12.      Membership duration
4  NO_CLAIMS        Num   8                       No of claims made recently
7  NUM_MEMBERS      Num   8                       Number of members covered
6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals



Data set 2 (DS2): Babies' deaths (health care example).
• Babies' death or survivability.
• Determine basic statistics for mostly binary variables.
• Data anomalies?
• Is present data representative or anomalous?
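For a binary (0/1) variable the basic statistics reduce to a count and a proportion, and a first anomaly check is simply whether any value falls outside {0, 1}. The sketch below uses two hypothetical columns; the values are made up for illustration and are not taken from DS2, and binary_summary is an illustrative helper name.

```python
def binary_summary(values):
    """Summarize a column that is expected to hold only 0 and 1."""
    valid = [v for v in values if v in (0, 1)]
    n_invalid = len(values) - len(valid)  # anything else is flagged
    p = sum(valid) / len(valid) if valid else float("nan")
    return {
        "n": len(values),
        "n_invalid": n_invalid,
        "proportion": p,          # the mean of a 0/1 variable
        "variance": p * (1 - p),  # population variance of a 0/1 variable
    }


# Hypothetical columns for illustration only.
death = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
twint = [0, 0, 0, 1, 0, 0, 0, 0, 2, 0]  # the 2 is a deliberate anomaly

print("death:", binary_summary(death))
print("twint:", binary_summary(twint))
```

A nonzero n_invalid is a prompt to go back to the data definition before computing anything else.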



Alphabetic List of Variables and Attributes
#   Variable    Type  Len  Label
16  Const       Num   8
18  H           Num   8
17  M1          Num   8
7   abort       Num   8    past abortion
1   death       Num   8    Death
9   dyslab      Num   8    Labor progress
5   gestage     Num   8    Gestational AGe
8   hydramnios  Num   8    Too much amniotic fluid
6   isoimm      Num   8    Iso immunization
15  malpres     Num   8    Mal Presented
11  nomonit     Num   8    No Monitor
2   nonwhite    Num   8    Non-White
4   nullip      Num   8    Null Parity
10  placord     Num   8    Placental - cord anomaly
14  prerupt     Num   8    PROM
3   teenages    Num   8    Early Age
12  twint       Num   8    Twin, Triplet
13  ward        Num   8    Public Ward



Data Set HMEQ (DS 3)
Reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral. The data set has the following characteristics:
◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
◾ LOAN: Amount of the loan request
◾ MORTDUE: Amount due on existing mortgage
◾ VALUE: Value of current property
◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
◾ JOB: Occupational categories
◾ YOJ: Years at present job
◾ DEROG: Number of major derogatory reports
◾ DELINQ: Number of delinquent credit lines
◾ CLAGE: Age of oldest credit line in months
◾ NINQ: Number of recent credit inquiries
◾ CLNO: Number of credit lines
◾ DEBTINC: Debt-to-income ratio


Homework Questions:


Statistical Inference.



BEDA
Analysis of pairs of variables, typically correlation analysis (continuous variables) and chi-square analysis (nominal variables).
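The two typical bivariate tools can be sketched in a few lines: the Pearson correlation for a pair of continuous variables, and the chi-square statistic with the odds ratio for a 2x2 table of nominal variables. The function names and the example numbers are illustrative; the chi-square shown is the plain Pearson statistic without continuity correction, and no p-value is computed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def odds_ratio(table):
    """Odds ratio (a*d)/(b*c) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts: rows are two groups, columns are outcome yes/no.
counts = [[20, 10], [10, 20]]
print(f"r    = {pearson_r([1, 2, 3, 4], [2.0, 4.1, 5.9, 8.2]):.3f}")
print(f"chi2 = {chi_square_2x2(counts):.3f}")
print(f"OR   = {odds_ratio(counts):.1f}")
```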

MEDA
Focuses on dimension reduction. Studied dimensions are variables and observations:
• Methods to group variables (PCA, FA)
• Methods to group observations (cluster analysis).
Plus:
• Missing value imputation
• Multivariate transformations
• Multivariate outlier detection.
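To show what grouping variables by PCA means in the smallest possible case, the sketch below works with two variables only, where the eigenvalues of the 2x2 covariance matrix have a closed form. This is an illustration under simplifying assumptions (two hypothetical variables, covariance rather than correlation matrix), not a general PCA routine; pca_two_variables is an illustrative name.

```python
import math
import statistics


def pca_two_variables(x, y):
    """Eigenvalues (largest first) and the first component's direction
    for the 2x2 sample covariance matrix [[a, b], [b, c]] of x and y."""
    a = statistics.variance(x)
    c = statistics.variance(y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / (len(x) - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half_gap, mid - half_gap

    # Eigenvector for lam1; fall back to an axis when covariance is zero.
    if b != 0:
        vx, vy = lam1 - c, b
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


# Hypothetical, strongly correlated pair of variables.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

(lam1, lam2), direction = pca_two_variables(x, y)
print(f"share of variance on first component: {lam1 / (lam1 + lam2):.4f}")
print(f"first component direction: ({direction[0]:.3f}, {direction[1]:.3f})")
```

When two variables are this strongly correlated, nearly all of the variance lies along one direction, which is the sense in which PCA reduces dimension.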


