Dimension Reduction and Variable Selection

2/13/2018

Leonardo Auslender –Ch. 1 Copyright 2004

Ch. 1.1-1

Contents
• Definition
• Data sets and Examples
• UEDA: Univariate EDA
• Definitions of central tendency and variability.
• Data set descriptions.
• Transformations.
• Statistical Inference – Probability.
• Univariate Data Distributions – Normality.
• CLT – Statistical Tests.
• Statistical Puzzles
• Bayesian Inference (under construction).
• BEDA
• Continuous Variables – Correlations – Causation
• Nominal Variables – Chi Square – Odds


Contents (cont.)
• MEDA
• Principal Components
• Factor Analysis
• Clustering and Segmentation
• Canonical Discrimination Analysis
• Missing Value imputation
• Outliers and Variable Transformations.


EDA: definition, purpose and usefulness.
EDA is typically (or should be) the initial step taken to view and try to comprehend a data set, which is assumed to be rectangular: columns are called variables and rows observations. The aim is to comprehend both individual variables and the relationships across variables.

Present-day databases hold hundreds, if not thousands or more, variables or attributes, so it is hard, if not impossible, to obtain a full conceptual understanding of the informational content of the data. Since many applications lead to (some) modeling, it is also possible and desirable to perform EDA on the outcome of these procedures. In this sense, we can envision EDA applied twice, before and after model creation, possibly restarting the modeling effort.

Variables are random when their values cannot be known with certainty, i.e., they are not deterministic. For instance, it is not possible to know the next outcome of a roulette bet; we know only the probability of success. When this probability is known, we talk about risk; when it is not knowable, about uncertainty.


Data sets in a large-data setting contain hundreds of variables, if not more, with the following taxonomy:
1) Numeric.
2) Character, considered nominal, i.e., no cardinality.
3) Ordinal.
4) Ratio: ratio of numeric variables.
5) Id:
   a. Id proper, nominal variables: patient id, SSN, etc.
   b. Indices: time index, patient visit number, etc.

Can compute                                            Nominal  Ordinal  Interval  Ratio
frequency distribution                                 Yes      Yes      Yes       Yes
median and percentiles                                 No       Yes      Yes       Yes
add or subtract                                        No       No       Yes       Yes
mean, standard deviation, standard error of the mean   No       No       Yes       Yes
ratio, or coefficient of variation                     No       No       No        Yes
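As a rough illustration of the "can compute" rules, the sketch below applies one permitted summary to each measurement level. It is written in Python rather than the deck's SAS, and every variable name and value in it is hypothetical, invented only for the example.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical values, invented only to illustrate the measurement levels.
job = ["Mgr", "Office", "Mgr", "Sales", "Mgr"]   # nominal
rating = [1, 3, 2, 3, 3]                         # ordinal (1 = low, 3 = high)
temp_c = [18.0, 21.0, 19.5, 22.5, 20.0]          # interval (no true zero)
spend = [120.0, 80.0, 100.0, 140.0, 60.0]        # ratio (true zero)

freq = Counter(job)                     # nominal: frequency distribution only
med = median(rating)                    # ordinal: median and percentiles
avg, sd = mean(temp_c), stdev(temp_c)   # interval: mean and standard deviation
cv = stdev(spend) / mean(spend)         # ratio: coefficient of variation

print(freq.most_common(1), med, avg, round(sd, 3), round(cv, 3))
```

The coefficient of variation is computed only for `spend` because dividing by the mean is meaningful only when zero is a true zero, which is the reason the last row of the table is "Yes" for ratio variables alone.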


Proposed EDA steps
We can envision 5 steps in EDA:
1. Data set definition: at least the number of observations, the variables and their taxonomy.
2. UEDA: Univariate.
3. BEDA: Bivariate.
4. MEDA: Multivariate. Missing values analysis and imputation, principal components and clustering.
5. Outliers and variable transformation or engineering, which involve UEDA, BEDA and MEDA. E.g., log(Var1) is a univariate transformation; PCA is a multivariate transformation.
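The univariate transformation log(Var1) mentioned in the steps can be sketched as follows. This is a minimal Python illustration, not the deck's SAS code; the values of `var1` are hypothetical, and Pearson's second skewness coefficient is used here only as a simple scale-free measure of asymmetry.

```python
import math
from statistics import mean, median, stdev

# Hypothetical right-skewed variable; values are invented for illustration.
var1 = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 400.0]

# Univariate transformation: natural log (requires strictly positive values).
log_var1 = [math.log(x) for x in var1]

def pearson_skew(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean(values) - median(values)) / stdev(values)

skew_raw = pearson_skew(var1)
skew_log = pearson_skew(log_var1)
print(round(skew_raw, 3), round(skew_log, 3))
```

On data like these, where one large value dominates, the log pulls the long right tail in, so the skewness measure of the transformed variable is smaller than that of the raw one.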


Statistical Inference
It is part of both EDA and modeling, and allows for comparisons and statements about similarity (or lack of it) in many different situations. All data sets are assumed to be representative samples from an infinite population, which can be unrealistic. If the interest is in the heights of first graders in a specific school at a specific time, the data are the entire population, and there is no uncertainty about the recorded heights.

If the interest is in height changes across time (and thus we work with samples), then we use statistical inference to infer information about the overall population (perhaps ideal, not real).
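When the data are a sample rather than the whole population, the uncertainty about the population mean can be summarized with a standard error and an interval. The Python sketch below assumes a hypothetical sample of heights (values invented) and uses a normal approximation; with a sample this small, a t quantile would give a somewhat wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample of first graders' heights in cm (invented values).
heights_cm = [112.0, 118.5, 115.0, 120.0, 117.5, 114.0, 119.0, 116.0]

n = len(heights_cm)
xbar = mean(heights_cm)                  # sample mean
se = stdev(heights_cm) / math.sqrt(n)    # standard error of the mean

# Normal-approximation 95% interval; z is about 1.96.
z = NormalDist().inv_cdf(0.975)
ci = (xbar - z * se, xbar + z * se)
print(round(xbar, 2), round(se, 3), tuple(round(b, 2) for b in ci))
```

If the eight children were the entire population of interest, `xbar` would simply be the population mean and the interval would be unnecessary, which is the distinction drawn in the text.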


Data set 1: Definition by way of Example
• Health insurance company
• Ophthalmologic insurance claims
• Is claim valid?
• Present operation: manual review of history and circumstances
• Alternative: scoring analytical system.

Data Mining Solution
• Data set 1 (DS1): Use data on past claims to verify fraud
• Data set 2: Investigate babies’ deaths.

[Diagram: Historical Data; Predictive Model; New Applicants]

SAS EXAMPLE for fraud data set:

   /* List the variables and attributes of the fraud data set as HTML output */
   ods html;
   proc contents data = fraud.fraud;
   run;
   ods html close;

Alphabetic List of Variables and Attributes
#  Variable         Type  Len  Format   Informat  Label
3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
5  MEMBER_DURATION  Num   8                       Membership duration
4  NO_CLAIMS        Num   8    BEST12.  F12.      No of claims made recently
7  NUM_MEMBERS      Num   8                       Number of members covered
6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals


Data set 2 (DS2): Babies’ deaths (health care example).
• Babies’ death or survivability.
• Determine basic statistics for mostly binary variables.
• Data anomalies?
• Is present data representative or anomalous?


Alphabetic List of Variables and Attributes
#   Variable    Type  Len  Label
16  Const       Num   8
18  H           Num   8
17  M1          Num   8
7   abort       Num   8    past abortion
1   death       Num   8    Death
9   dyslab      Num   8    Labor progress
5   gestage     Num   8    Gestational AGe
8   hydramnios  Num   8    Too much amniotic fluid
6   isoimm      Num   8    Iso immunization
15  malpres     Num   8    Mal Presented
11  nomonit     Num   8    No Monitor
2   nonwhite    Num   8    Non-White
4   nullip      Num   8    Null Parity
10  placord     Num   8    Placental - cord anomaly
14  prerupt     Num   8    PROM
3   teenages    Num   8    Early Age
12  twint       Num   8    Twin, Triplet
13  ward        Num   8    Public Ward


Data Set HMEQ (DS 3)
Reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral. The data set has the following characteristics:
◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
◾ LOAN: Amount of the loan request
◾ MORTDUE: Amount due on existing mortgage
◾ VALUE: Value of current property
◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
◾ JOB: Occupational categories
◾ YOJ: Years at present job
◾ DEROG: Number of major derogatory reports
◾ DELINQ: Number of delinquent credit lines
◾ CLAGE: Age of oldest credit line in months
◾ NINQ: Number of recent credit inquiries
◾ CLNO: Number of credit lines
◾ DEBTINC: Debt-to-income ratio


Home Work Questions:


Statistical Inference.

BEDA
Analysis of pairs of variables, typically correlation and chi-square analysis.
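The two bivariate tools named above can be sketched in a few lines. The Python below (not the deck's SAS) computes a Pearson correlation for two continuous variables and a Pearson chi-square statistic for a two-way table of counts; the helper names `pearson_r` and `chi_square` and all data values are hypothetical, chosen only for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total  # under independence
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
counts = [[30, 10], [20, 40]]   # rows: group A/B, columns: outcome yes/no

r = pearson_r(x, y)
chi2 = chi_square(counts)
print(round(r, 4), round(chi2, 4))
```

The chi-square statistic would then be compared with a chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom, here 1; that lookup is left out to keep the sketch within the standard library.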

MEDA
Focuses on dimension reduction. The dimensions studied are variables and observations:
• Methods to group variables (PCA, FA).
• Methods to group observations (cluster analysis).
Plus:
• Missing value imputation.
• Multivariate transformations.
• Multivariate outlier detection.
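To make "grouping variables" concrete, here is a minimal PCA sketch for just two variables, where the eigenvalues of the 2 × 2 covariance matrix have a closed form. It is Python, not the deck's SAS; the data are hypothetical, and working on the raw covariance matrix (rather than on standardized variables) is an assumption made for brevity.

```python
import math
from statistics import mean

# Hypothetical pair of correlated variables (invented values).
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]

def cov(u, v):
    """Sample covariance (n - 1 denominator)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Covariance matrix [[a, b], [b, c]].
a, b, c = cov(x1, x1), cov(x1, x2), cov(x2, x2)

# Closed-form eigenvalues of a symmetric 2x2 matrix.
mid = (a + c) / 2
half = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + half, mid - half

# Direction of the first principal component (valid because b != 0 here).
norm = math.hypot(b, lam1 - a)
pc1 = (b / norm, (lam1 - a) / norm)

share = lam1 / (lam1 + lam2)   # proportion of total variance on PC1
print(round(lam1, 3), round(lam2, 3), round(share, 3))
```

When `share` is close to 1, a single component carries most of the variation of the two variables, which is the sense in which PCA reduces the variable dimension.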
