2/13/2018
Leonardo Auslender –Ch. 1 Copyright 2004
Contents
Definition
Data sets and Examples
UEDA: Univariate EDA
  Definitions of central tendency and variability.
  Data set descriptions.
  Transformations.
  Statistical Inference – Probability.
  Univariate Data Distributions – Normality.
  CLT – Statistical Tests.
  Statistical Puzzles
  Bayesian Inference (under construction).
BEDA
  Continuous Variables – Correlations – Causation
  Nominal Variables – Chi Square – Odds
Contents (cont.)
MEDA
  Principal Components
  Factor Analysis
  Clustering and Segmentation
  Canonical Discrimination Analysis
  Missing Value imputation
  Outliers and Variable Transformations.
EDA: definition, purpose and usefulness.
EDA is typically (or should be) the initial step taken to view and try to comprehend a data set, assumed to be of rectangular form: columns are called variables, rows observations. The aim is comprehension of individual variables and of relationships across variables.
Given the size of present databases, with hundreds if not thousands of variables or attributes, it is hard or impossible to obtain a full conceptual understanding of the informational content of the data. Since many applications lead to (some) modeling, it is also possible and desirable to perform EDA on the outcome of these procedures. In this sense, we can envision EDA applied twice, before and after model creation, possibly restarting the modeling effort.
Variables are random when their values cannot be known with certainty, i.e., they are not deterministic. For instance, it is not possible to know the next outcome of a roulette bet, but we know the probability of success. When this probability is not knowable, we talk about uncertainty; otherwise, risk.
Data sets in a large data set setting contain hundreds of variables, if not more, with the following taxonomy:
1) Numeric.
2) Character, considered nominal, i.e., no cardinality.
3) Ordinal.
4) Ratio: ratio of numeric variables.
5) Id:
   a. Id proper, nominal variables: patient id, SSN, etc.
   b. Indices: time index, patient visit number, etc.
Can compute                                               Nominal  Ordinal  Interval  Ratio
frequency distribution.                                   Yes      Yes      Yes       Yes
median and percentiles.                                   No       Yes      Yes       Yes
add or subtract.                                          No       No       Yes       Yes
mean, standard deviation, standard error of the mean.     No       No       Yes       Yes
ratio, or coefficient of variation.                       No       No       No        Yes
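The table of permissible computations can be read as an ordering of scales: each operation needs at least a certain scale. A minimal Python sketch of that reading follows; the names `SCALES`, `MIN_SCALE`, `can_compute` and `allowed_operations` are hypothetical and are not part of the original notes.

```python
# Measurement scales ordered from least to most structure.
SCALES = ("nominal", "ordinal", "interval", "ratio")

# Lowest scale on which each operation is meaningful (transcribed from the table).
MIN_SCALE = {
    "frequency distribution": "nominal",
    "median and percentiles": "ordinal",
    "add or subtract": "interval",
    "mean, standard deviation, standard error": "interval",
    "ratio or coefficient of variation": "ratio",
}


def can_compute(operation, scale):
    """Return True if `operation` is meaningful for a variable measured on `scale`."""
    return SCALES.index(scale) >= SCALES.index(MIN_SCALE[operation])


def allowed_operations(scale):
    """List every operation from the table that is permitted on `scale`."""
    return [op for op in MIN_SCALE if can_compute(op, scale)]


nominal_ops = allowed_operations("nominal")
ratio_ops = allowed_operations("ratio")
```

For a nominal variable such as an Id, only the frequency distribution is returned; for a ratio variable all five operations are.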
Proposed EDA steps
We can envision 5 steps in EDA:
1) Data set definition: at least number of observations, variables and taxonomy.
2) UEDA: Univariate.
3) BEDA: Bivariate.
4) MEDA: Multivariate: missing values analysis and imputation, principal components and clustering.
5) Outliers and variable transformation or engineering (which involve UEDA, BEDA and MEDA; e.g., log(Var1) is a univariate transformation, PCA is a multivariate transformation).
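As an illustration of the data set definition step and of a univariate transformation such as log(Var1), here is a small Python sketch. The rows are made-up toy values, and the helper names `describe` and `log_transform` are hypothetical; the type check is a crude numeric/character split, not a full taxonomy.

```python
import math

# Made-up toy rows for illustration only; not taken from any data set in these notes.
rows = [
    {"id": "a1", "spend": 100.0, "visits": 2},
    {"id": "a2", "spend": 1000.0, "visits": 5},
    {"id": "a3", "spend": 10.0, "visits": 1},
]


def describe(rows):
    """Data set definition: number of observations, number of variables, crude taxonomy."""
    variables = list(rows[0].keys())
    taxonomy = {}
    for v in variables:
        numeric = all(isinstance(r[v], (int, float)) for r in rows)
        taxonomy[v] = "numeric" if numeric else "character"
    return len(rows), len(variables), taxonomy


def log_transform(rows, var):
    """Univariate transformation: natural log of one strictly positive variable."""
    return [math.log(r[var]) for r in rows]


n_obs, n_vars, taxonomy = describe(rows)
log_spend = log_transform(rows, "spend")
```

The log is only defined for strictly positive values, so in practice zeros or negatives would need to be handled before applying it.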
Statistical Inference
Statistical inference is part of both EDA and modeling. It allows for comparisons and for statements about similarity (or lack of it) in many different situations. All data sets are assumed to be representative samples from an infinite population, which can be unrealistic. If we are interested in inferences on the heights of first graders in a specific school at a specific time, the data is the entire population, and there is no uncertainty about the recorded heights.
If the interest is in height changes across time (and we thus work with samples), then we use statistical inference to infer information about the overall population (perhaps ideal, not real).
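To make the sample-versus-population distinction concrete, the sketch below treats a handful of made-up heights as a sample and computes the sample mean, the standard error of the mean, and a normal-approximation 95% interval for the population mean. The numbers are invented for illustration, and the normal quantile is an assumption; with a sample this small a t quantile would give a wider interval.

```python
import math
import statistics

# Made-up heights in cm, standing in for a sample drawn from a larger population.
sample = [112.0, 118.5, 115.0, 120.0, 117.5, 113.0, 119.0, 116.0]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
sem = sd / math.sqrt(n)         # standard error of the mean

# Normal-approximation 95% interval for the population mean.
z = statistics.NormalDist().inv_cdf(0.975)
ci = (mean - z * sem, mean + z * sem)
```

If the eight values were the entire population of interest, the mean would simply be a known quantity and no interval would be needed.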
Data set 1: Definition by way of Example
• Health insurance company
• Ophthalmologic insurance claims
• Is claim valid?
• Present operation:
  • Manual review of history and circumstances
• Alternative: Scoring analytical system.
Data Mining Solution
• Data set 1 (DS1): Use data on past claims to verify fraud
• Data set 2: Investigate babies’ deaths.
[Diagram with boxes labeled: Historical Data, Predictive Model, New Applicants]
SAS EXAMPLE for fraud data set:

    ods html;
    proc contents data = fraud.fraud;
    run;
    ods html close;

Alphabetic List of Variables and Attributes
#  Variable         Type  Len  Format   Informat  Label
3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
5  MEMBER_DURATION  Num   8    BEST12.  F12.      Membership duration
4  NO_CLAIMS        Num   8                       No of claims made recently
7  NUM_MEMBERS      Num   8                       Number of members covered
6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals
Data set 2 (DS2): Babies’ deaths (health care example).
• Babies’ death or survivability.
• Determine basic statistics for mostly binary variables.
• Data anomalies?
• Is present data representative or anomalous?
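For a binary (0/1) variable, the basic statistics reduce to a few quantities: the frequency distribution, the proportion of ones (which equals the mean), and the variance p(1 - p). The Python sketch below uses made-up indicator values, not values from DS2, and also shows one simple anomaly check for a column that is supposed to be binary.

```python
from collections import Counter

# Made-up 0/1 indicator values, for illustration only.
flag = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]

counts = Counter(flag)      # frequency distribution
n = len(flag)
p = sum(flag) / n           # mean of a 0/1 variable = proportion of ones
variance = p * (1 - p)      # population variance of a 0/1 variable

# A value outside {0, 1} in a supposedly binary column is a data anomaly.
anomalies = [x for x in flag if x not in (0, 1)]
```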
Alphabetic List of Variables and Attributes
#   Variable    Type  Len  Label
16  Const       Num   8
18  H           Num   8
17  M1          Num   8
7   abort       Num   8    past abortion
1   death       Num   8    Death
9   dyslab      Num   8    Labor progress
5   gestage     Num   8    Gestational AGe
8   hydramnios  Num   8    Too much amniotic fluid
6   isoimm      Num   8    Iso immunization
15  malpres     Num   8    Mal Presented
11  nomonit     Num   8    No Monitor
2   nonwhite    Num   8    Non-White
4   nullip      Num   8    Null Parity
10  placord     Num   8    Placental - cord anomaly
14  prerupt     Num   8    PROM
3   teenages    Num   8    Early Age
12  twint       Num   8    Twin, Triplet
13  ward        Num   8    Public Ward
Data Set HMEQ (DS 3)
Reports characteristics and delinquency information for 5,960 home equity loans, i.e., loans in which the obligor uses the equity of his or her home as the underlying collateral. The data set has the following characteristics:
◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
◾ LOAN: Amount of the loan request
◾ MORTDUE: Amount due on existing mortgage
◾ VALUE: Value of current property
◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
◾ JOB: Occupational categories
◾ YOJ: Years at present job
◾ DEROG: Number of major derogatory reports
◾ DELINQ: Number of delinquent credit lines
◾ CLAGE: Age of oldest credit line in months
◾ NINQ: Number of recent credit inquiries
◾ CLNO: Number of credit lines
◾ DEBTINC: Debt-to-income ratio
Home Work Questions:
Statistical Inference.
BEDA
Analysis of pairs of variables, typically correlation and chi-square analysis.
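The two bivariate tools named here can be sketched in a few lines of Python: the Pearson correlation for a pair of continuous variables, and the Pearson chi-square statistic for a contingency table of two nominal variables. The function names and the input values are made up for illustration; this is a bare-bones sketch that returns the statistics only, without p-values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chi_square(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat


# Made-up inputs: a perfectly linear pair and a 2 x 2 table of counts.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
chi2 = chi_square([[10, 20], [20, 10]])
```

In the 2 x 2 example every expected count is 15, so each cell contributes 25/15 to the statistic.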
MEDA
Focuses on dimension reduction. Studied dimensions are variables and observations:
• Methods to group variables (PCA, FA)
• Methods to group observations (cluster analysis).
Plus:
• Missing value imputation
• Multivariate transformations
• Multivariate outlier detection.