Journal of the Chinese Statistical Association Vol. 45, (2007)1–17

TREE-BASED ENSEMBLE CLASSIFIERS FOR HIGH-DIMENSIONAL DATA James J. Chen, Hojin Moon, and Songjoon Baek Division of Personalized Nutrition and Medicine, National Center for Toxicological Research, Food and Drug Administration

ABSTRACT Building a classification model from thousands of available predictor variables with a relatively small sample size presents challenges for most traditional classification algorithms. When the number of samples is much smaller than the number of predictors, there can be a multiplicity of good classification models. An ensemble classifier combines multiple single classifiers to improve classification accuracy. This paper reviews tree-based classifiers and compares the performance of three ensemble classifiers, random forest (RF), classification by ensembles from random partitions (CERP), and adaptive boosting (AdaBoost), and of three single-tree algorithms: classification tree (CTree), classification rule with unbiased interaction selection and estimation (CRUISE), and quick, unbiased and efficient statistical tree (QUEST). The six tree-based classifiers are applied to five high-dimensional datasets. In all datasets, the three ensemble classifiers show much higher classification accuracy than the three single-tree algorithms, with the exception of the AdaBoost ensemble classifier in one dataset. RF and CERP are comparable in terms of accuracy, and both bagging classifiers show higher accuracy than the AdaBoost boosting classifier. Among the three single-tree classifiers, QUEST generally shows higher accuracy than CTree and CRUISE. Key words and phrases: adaptive boosting (AdaBoost), bagging, boosting, classification by ensembles from random partitions (CERP), classification tree, random forest

Journal of the Chinese Statistical Association Vol. 45, (2007)18–37

ASSESSING SOFTWARE RELIABILITY BY UNREVEALED PROPORTION ESTIMATION IN STRATIFIED SAMPLING Mark C. K. Yang1, Anne Chao2 and Y. C. Chen3
1 Department of Statistics, University of Florida
2 Institute of Statistics, National Tsing Hua University
3 Chia-Nan University of Pharmacy and Science

ABSTRACT Suppose there is an unknown number of species in an area of interest. A sample is collected and all the animals are identified according to their species. The purpose is to estimate the proportion θ of the animals that belong to undiscovered species. Point and interval estimators for θ are derived when the area is stratified into K strata and observations are taken from each stratum. Proportional allocation is shown to be more efficient than simple random sampling. This model fits software reliability estimation under beta testing very well, i.e., settings in which software faults are detected through complaints from users. Many difficult situations in software testing and reliability are overcome by this model. Key words and phrases: undiscovered species; proportional allocation; software reliability; software maintenance; beta testing. JEL classification: C13, C42, C88.
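The stratified estimator derived in the paper is not reproduced here, but the underlying idea can be illustrated with the classic Turing estimator of unseen probability mass (the share of singleton species), pooled across strata. This is a generic sketch under that assumption, not the authors' estimator:

```python
from collections import Counter

def turing_unseen_proportion(sample):
    """Turing's estimator of the probability mass of undiscovered
    species: the fraction of the sample made up of singletons
    (species observed exactly once)."""
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return f1 / len(sample)

def stratified_unseen_proportion(strata_samples):
    """Pool per-stratum estimates, weighting each stratum by its
    sample share (mirroring proportional allocation)."""
    total = sum(len(s) for s in strata_samples)
    return sum(len(s) / total * turing_unseen_proportion(s)
               for s in strata_samples)

# Toy species labels observed in two strata:
strata = [["a", "a", "b"], ["c", "c", "d", "d", "e"]]
print(stratified_unseen_proportion(strata))
```

In a beta-testing reading, "species" are distinct software faults and singletons are faults reported exactly once.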

1. Introduction Improving software reliability by testing is one of the most important components in software development. It has been estimated that in many projects, the efforts used


(RF), voting classifier, tree algorithm. JEL classification: C14; C45.

1. Introduction Classification or class prediction (a machine learning technique) is widely used in data mining across many areas of research and application. Classification uses a supervised learning method in which the classification algorithm learns from a training set, a set of predictors with known class labels, and establishes a prediction rule to classify new samples. Classification has been used to predict the activity or toxicological properties of untested chemicals, for instance, to predict rodent carcinogenicity (Helma and Kramer, 2003), Salmonella mutagenicity (e.g., Rosenkranz and Cunningham, 2001), or the estrogen receptor binding activity of chemicals (Chen et al., 2005, 2006) using structure-activity relationship models. Recently, classification models have been developed to discriminate different biological or clinical phenotypes, or to predict the diagnostic category, prognostic stage, or treatment response of a patient in personalized medicine (e.g., Alon et al., 1999; Golub et al., 1999; Moon et al., 2007). Classification involving a large number of predictors presents a challenge to the development of accurate classifiers. Standard classification model building, such as logistic regression and Fisher's linear discriminant analysis, has relied on an a priori collection of predictors and on model selection limited to a few variables of interest. The classification tree (CTree) is also known as binary recursive partitioning (Breiman et al., 1984). A tree is constructed iteratively, splitting so as to make the data purer according to a splitting criterion, until it is fully grown. The support vector machine (SVM) is a kernel-based machine-learning predictor (Vapnik, 1995). In two-class classification, an SVM classifier finds a hyperplane between the two classes that minimizes the error probability. Both CTree and SVM incorporate important predictor variables into model building. These algorithms find a subset of predictors and evaluate its relevance for the classification; the classification rules are built from an optimal predictor subset.
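The regime the paper studies (far more predictors than samples) can be sketched with scikit-learn's single tree, random forest, and AdaBoost on simulated data. The dataset, parameters, and library are assumptions for illustration, not the paper's setup; CERP, CRUISE, and QUEST have no scikit-learn counterparts:

```python
# Compare a single classification tree with two ensemble methods on
# synthetic high-dimensional data (1000 predictors, 100 samples).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Many more predictors than samples, as in the paper's datasets.
X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=20, random_state=0)

for name, clf in [
    ("single tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("AdaBoost", AdaBoostClassifier(n_estimators=100, random_state=0)),
]:
    acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold CV accuracy
    print(f"{name}: {acc:.2f}")
```

On simulated data of this kind the ensembles typically outperform the single tree, consistent with the paper's comparison, though the exact numbers depend on the simulation.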

Journal of the Chinese Statistical Association Vol. 45, (2007)38–54

CHARACTERIZATIONS OF THE ORDER STATISTICS POINT PROCESS Wen-Jang Huang1, Nan-Cheng Su2 and Jyh-Cherng Su3
1 National University of Kaohsiung
2 National Sun Yat-sen University
3 R.O.C. Military Academy

ABSTRACT Let A ≡ {A(t), t ≥ 0} be an order statistics point process, with E(A(t)) = m(t) being the mean value function of A(t), t ≥ 0. It is known that m(t) determines the distribution of the process A. In this work, we give some characterizations of m(t) by using certain relations between the conditional moments of the last jump time or current life of A at time t. It is interesting that some results parallel the characterizations of the Poisson process as a renewal process. Finally, we present some extensions of the results on record values given in Abu-Youssef (2003). Key words and phrases: characterization; nonhomogeneous Poisson process; order statistics; order statistics property; record values.

1. Introduction Let {A(t), t ≥ 0} with A(0) = 0, A(t) < ∞, ∀t ≥ 0, be a point process with right continuous sample paths having successive unit steps at times S1 , S2 ,.... The process {A(t), t ≥ 0} is said to have the order statistics property or called an order statistics point process if for every t > 0 and integer n ≥ 1, whenever P (A(t) = n) > 0, given
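The order statistics property just described is often written in display form. The following is one standard formulation (exact in the nonhomogeneous Poisson case), using the notation above; it is included for orientation and is not taken from the paper:

```latex
% Given A(t) = n, the jump times are distributed as the order
% statistics of n i.i.d. random variables on [0, t]:
\[
  \bigl(S_1,\dots,S_n \mid A(t)=n\bigr)
  \;\stackrel{d}{=}\; \bigl(U_{(1)},\dots,U_{(n)}\bigr),
  \qquad U_1,\dots,U_n \ \text{i.i.d.\ with d.f.}\ F_t,
\]
% where the common distribution can be expressed through the mean
% value function m(t) = E(A(t)):
\[
  F_t(x) \;=\; \frac{m(x)}{m(t)}, \qquad 0 \le x \le t .
\]
```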

Journal of the Chinese Statistical Association Vol. 45, (2007)55–73

AN EMPIRICAL APPLICATION OF MARKOV MODEL FOR THE TERM STRUCTURE OF CREDIT RISK SPREADS Alan T. Wang1 and Wei-Chen Lee2
1 Department of Accounting, National Cheng Kung University
2 Research Department, Tachung Bank

ABSTRACT The Markov models of Jarrow, Lando and Turnbull (JLT) (1997) and Kijima and Komoribayashi (KK) (1998) provide an important alternative for pricing credit-risky financial instruments and for risk management. Although some extensions have already been developed, empirical analysis of the JLT-KK model is less well documented. This article implements the KK model and reports empirical results. Using the credit spread term structure observed in the market, the unconstrained risk premium adjustments for riskier bonds with longer maturities are shown to easily exceed their upper bounds when default-dependent recovery rates are used. Key words and phrases: credit risk, default probability, Markov. JEL classification: G12, G13, G32.

1. Introduction The credit derivatives market has grown rapidly in recent years: outstanding contracts reached $8.4 trillion at the end of 2004, compared with $919 billion in 2001, according to the International Swaps and Derivatives Association. Credit derivatives are linked to the probability that debts will be paid off, which is in turn linked to the credit status of the issuing firms and the value of the reference asset. For
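The link between credit spreads and default probability mentioned above is often illustrated with the textbook constant-hazard approximation. The sketch below is that generic approximation, not the JLT/KK Markov model, and the numbers are illustrative:

```python
import math

def implied_hazard_rate(spread, recovery):
    """Textbook first-order relation: credit spread ~ (1 - recovery)
    times the default intensity, so the risk-neutral hazard rate is
    spread / (1 - recovery). A rough approximation only."""
    return spread / (1.0 - recovery)

def default_probability(spread, recovery, horizon):
    """Risk-neutral probability of default before `horizon` years,
    assuming a constant hazard rate (exponential default time)."""
    lam = implied_hazard_rate(spread, recovery)
    return 1.0 - math.exp(-lam * horizon)

# e.g. a 200 bp spread with a 40% recovery rate over 5 years:
p = default_probability(0.02, 0.40, 5.0)
print(f"{p:.3f}")
```

The JLT/KK approach replaces this single hazard rate with a rating-transition matrix adjusted by risk premiums, which is what the article calibrates to observed spread term structures.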

Journal of the Chinese Statistical Association Vol. 45, (2007)74–98

ACCOUNTING FOR TAIWAN GDP GROWTH: PARAMETRIC AND NONPARAMETRIC ESTIMATES Ling Sun1 and Lilyan E. Fulginiti2
1 Department of Accounting, Providence University
2 Department of Agricultural Economics, University of Nebraska-Lincoln

ABSTRACT

The purpose of this paper is to study the impact of changes in the prices of tradables on economic growth in a highly open economy, Taiwan. We do so by measuring productivity growth with both index number and parametric approaches, and by identifying the sources of output growth using a methodology that accounts for the impacts of changes in the terms of trade. The results show that Taiwan's economic growth depends on input accumulation as well as technical progress, with the terms-of-trade effect being negligible. Key words and phrases: productivity change; SUR (seemingly unrelated regressions); nonparametric approach; stochastic approach; terms of trade. JEL classification: O30, C01, C14, C30.

1. Introduction Productivity is defined as output per unit of input. The study of productivity is intimately related to the study of economic growth, as productivity increases induce an increase in output in perpetuity, while this might not be true of input use. In fact, estimated aggregate supply elasticities have been known to be very small. It is shifts in this aggregate supply due to innovation that have reversed Malthusian predictions and allowed higher standards of living.
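The index-number approach mentioned in the abstract reduces, in its simplest growth-accounting form, to output growth minus cost-share-weighted input growth. A minimal sketch with made-up numbers (not the paper's estimates for Taiwan):

```python
def tfp_growth(output_growth, input_growths, cost_shares):
    """Growth-accounting residual: output growth minus the
    cost-share-weighted sum of input growth rates (all rates in
    log differences). Shares are assumed to sum to one."""
    assert abs(sum(cost_shares) - 1.0) < 1e-9
    weighted_inputs = sum(s * g for s, g in zip(cost_shares, input_growths))
    return output_growth - weighted_inputs

# Illustrative only: 8% output growth, capital and labor growing
# 10% and 3%, with cost shares 0.4 and 0.6.
print(tfp_growth(0.08, [0.10, 0.03], [0.4, 0.6]))
```

A positive residual is read as technical progress; the paper's contribution is to separate this residual from terms-of-trade effects, which the simple formula above does not do.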

Journal of the Chinese Statistical Association Vol. 45, (2007)99–105

On Adding Statistics to the High School Curriculum 楊照崑 Department of Statistics, University of Florida





Teaching statistics in high school should focus on basic concepts, real examples, and common misuses; given the time constraints, derivations can be treated as secondary. Statistics therefore should not be placed within the mathematics curriculum. It should be placed where it is most useful: in fields with large amounts of noise, especially data in which responses vary considerably from person to person. The behavioral sciences, social sciences, life sciences, education, and medicine need statistics most, and statistics should be a major topic in the "methodology" of the social sciences. Recently (October 2006) I heard that Taiwan's academic community intends to include statistics in the high school curriculum. This is great news for the statistics community, and no small matter. This article offers my personal views in the hope of prompting broader discussion and pooling collective wisdom, so that this undertaking is done well from the very start.

1. Why Learn Statistics in High School

Modern statistics, in a word, tells a speaker: "Given your data, how much confidence can you have in the conclusion you wish to draw?" We usually call a conclusion that is not held with one hundred percent certainty an inference. The purpose of statistics is to use numbers to describe the "credibility" of that inference precisely, rather than settling for a vague "quite confident" or "fairly reliable." This is, of course, not something people without statistical knowledge can do. As times progress, information plays an ever larger role in daily life. Among the information we hear and see, some is fact, but some is inference, and we rarely hear the credibility of those inferences. Statistical knowledge enables us to recognize which conclusions should be statistical inferences, and perhaps to gauge their credibility from their source, so that we neither follow blindly nor are deceived. More importantly, competition in the modern world rests on innovation. A product without innovation (patents) can be produced by anyone; such work will eventually fall to cheap labor, and those who do it will earn only minimum wage, living at the edge of subsistence. To innovate, one must
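The numeric "credibility" statement described above is exactly what a confidence interval provides. A minimal illustration with made-up survey numbers, using the normal approximation for a proportion:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion:
    a numeric statement of how much confidence the data support."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# e.g. 60 of 100 surveyed people favor a proposal (illustrative data)
lo, hi = proportion_ci(60, 100)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```

Rather than saying the majority "probably" favors the proposal, the interval states the precision of that inference in numbers, which is the article's point.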
