Statistical Discretization of Continuous Attributes Using Kolmogorov-Smirnov Test

Hadi Mohammadzadeh Abachi1(B), Saeid Hosseini1,2, Mojtaba Amiri Maskouni1, Mohammadreza Kangavari1, and Ngai-Man Cheung2

1 Iran University of Science and Technology, Tehran, Iran
{hadi mohammadzadeh, amiri m, kangavari}@comp.iust.ac.ir, [email protected]
2 Singapore University of Technology and Design, Singapore, Singapore
[email protected]
Abstract. Unlike unsupervised discretization methods, which apply simple rules to discretize continuous attributes with a low time complexity dominated by the sorting procedure, supervised discretization algorithms take the class labels into consideration to achieve high accuracy. Supervised discretization of continuous features encounters two significant challenges. First, noisy class labels reduce the effectiveness of discretization. Second, because of the high computational cost of supervised algorithms on large-scale datasets, the time complexity is dominated by the discretizing stage rather than the sorting procedure. Accordingly, to address these challenges, we devise a statistical unsupervised method named SUFDA. SUFDA aims to produce discrete intervals by decreasing the differential entropy of the normal distribution, with a low temporal complexity and high accuracy. The results show that our unsupervised system obtains better effectiveness than other discretization baselines on large-scale datasets.

Keywords: Discretization · Kolmogorov-Smirnov · Data mining · Data reduction · Naïve Bayes
1 Introduction
Data preprocessing, as a main stage in knowledge extraction and data mining, includes components such as data transformation, cleaning, and data reduction. Discretization, one of the basic data reduction procedures, maps data from a huge range of continuous values to a greatly shrunk subset of discrete values. A discretization scheme partitions a sorted numerical attribute into k discrete and disjoint intervals denoted as η = {[s_0, s_1], (s_1, s_2], ..., (s_{k-1}, s_k]}, where s_0 and s_k are respectively the minimum and the maximum values of the numerical attribute. Subject to different applications, the discretization methods can be
classified into various twin categories, including global vs. local, top-down vs. bottom-up, direct vs. incremental, static vs. dynamic, and supervised vs. unsupervised [3]. In fact, prior supervised as well as unsupervised approaches face certain challenges. Most unsupervised methods are so naïve that they may not lead to effective results. Additionally, where irrelevant data exist, outliers may not be handled appropriately and can consequently affect the accuracy negatively. To address these challenges, supervised models employ class information. However, the required class information is at times noisy or even inaccessible. Moreover, due to their high computational complexity and additional statistical criteria, the efficiency of supervised algorithms on large-scale datasets and in complex environments drops significantly. To this end, in this paper, we devise a new unsupervised approach that not only promotes effectiveness, but also avoids the complexities involved in supervised approaches.
2 Related Work
Discretization. Many discretization techniques have been proposed in the literature and can be used in several applications, such as association mining algorithms, rule induction, clinical datasets, and recommendation systems [4]. Equal-width discretization (EWD) and equal-frequency discretization (EFD) are unsupervised methods that divide the observed samples between s_min and s_max into k equally sized intervals, or into k intervals so that each interval includes approximately the same number of instances; here k is a user-predefined value. The complexity of such methods is O(n log n), which stems solely from the sorting step. Entropy minimization discretization (Fayyad) [2] is the essential supervised method. This greedy algorithm uses the class information entropy of candidate intervals to select the boundaries for discretization. Considering all possible cut points, it starts by finding the single cut point that minimizes the entropy of the resulting intervals. The complexity of Fayyad's method, assuming m classes, is O(m · n · log(n)), which dominates the sorting complexity. Supervised algorithms therefore require a high computational time to discretize numerical attributes, which is tangible on large-scale datasets. Our work relates to this literature by proposing a non-parametric unsupervised discretization algorithm that provides accurate and effective results.
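To make the two unsupervised baselines concrete, the following is a minimal Python sketch of EWD and EFD, assuming NumPy is available; the function names and the demonstration data are ours and are not part of the original algorithms.

```python
import numpy as np

def equal_width_discretize(x, k):
    """EWD: split [min(x), max(x)] into k equally wide intervals."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    # Interior edges only; np.digitize returns bin indices in 0..k-1.
    return np.digitize(x, edges[1:-1], right=True)

def equal_frequency_discretize(x, k):
    """EFD: split into k intervals holding roughly equal numbers of instances."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, k + 1))
    return np.searchsorted(edges[1:-1], x, side="left")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=1_000)
    print(np.bincount(equal_width_discretize(x, 5)))      # uneven bin counts
    print(np.bincount(equal_frequency_discretize(x, 5)))  # roughly 200 per bin
```

Both routines cost O(n log n) because of the internal sorting/quantile step, matching the complexity stated above.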
3 Problem Statement

3.1 Preliminary Concepts
Definition 1. In statistics, the Kolmogorov-Smirnov test is a non-parametric test used to compare a group of samples with a reference probability distribution (one-sample KS test). The Kolmogorov-Smirnov test statistic is the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution:

D_n = \sup_x |F_n(x) - F(x)|    (1)
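As an illustration of Definition 1, the snippet below runs a one-sample KS test of a sample against a normal reference; the use of SciPy and the sample parameters are our assumptions, not prescribed by the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=2.0, scale=1.5, size=500)

# Reference distribution: a normal whose parameters are estimated from the sample.
mu, sigma = sample.mean(), sample.std()
statistic, p_value = stats.kstest(sample, "norm", args=(mu, sigma))
print(f"D_n = {statistic:.4f}, p-value = {p_value:.4f}")
```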
3.2 Problem Definition
Given a dataset S consisting of N examples and M attributes (features), where the attributes can be either numerical (M_c) or nominal (M_d), we try to discretize the numerical attributes (M_c) based on the one-sample Kolmogorov-Smirnov test. Assuming that the numerical attributes are normally distributed, we define an initial normal distribution as a reference for every feature and set its parameters, namely the mean and standard deviation. The next step is to define a null hypothesis and try to reject it by comparing the calculated p-value to a predefined threshold. Note that the p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. In the discretization procedure, splitting a feature, viewed as an interval, into sub-intervals is associated with rejecting or accepting the null hypothesis.
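A minimal sketch of that split decision is shown below, assuming a hypothetical significance threshold alpha and SciPy's one-sample KS test; the threshold value and function name are illustrative and not taken from the paper.

```python
from scipy import stats

def should_split(interval_samples, alpha=0.05):
    """Reject H0 (the interval follows its reference normal distribution)
    when the one-sample KS p-value falls below the predefined threshold."""
    mu, sigma = interval_samples.mean(), interval_samples.std()
    if sigma == 0.0:                 # a constant interval cannot be split further
        return False
    _, p_value = stats.kstest(interval_samples, "norm", args=(mu, sigma))
    return p_value < alpha           # reject H0 -> split into sub-intervals
```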
4 Proposed Algorithm
In this section, we elucidate our proposed Statistical Unsupervised Feature Discretization Algorithm (SUFDA). The proposed model is local, top-down, static, incremental, and unsupervised. Our model independently considers each individual feature as an initial interval and, subject to statistical analysis, splits it into sub-intervals recursively until it meets the stopping criterion. Aiming to reduce the algorithm's complexity on large-scale datasets, our algorithm utilizes an optimized and approximate approach for the Kolmogorov-Smirnov (KS) assessment, which can lead to a linear temporal complexity [7,9]. Additionally, unlike other unsupervised algorithms, the number of intervals can be inferred from the distribution as well as the frequency of the feature samples. At times, the discretization of numerical attributes can negatively affect the accuracy. Therefore, during the splitting stage, our model places samples with high frequencies into distinguished intervals; where a numerical value is repeated in more than half of the interval size, it is recorded as having a high frequency. The significant factors of our model are twofold. First, the distribution of the samples matters, as they may follow a non-normal distribution (e.g., skewed or uniform) or a Gaussian (normal) distribution. Second, samples belonging to the same interval during discretization should follow the interval's normal distribution with a low deviation from the mean. To enforce a low standard deviation when the samples are normally distributed, we adjust a compactness level (ϕ) that imposes further restrictions. In this paper, we utilize the KS assessment on numerical features to find the discretized intervals comprising compacted samples. The KS statistic can be rewritten as formulated in Eq. 2:

D_n = \max_{i=1}^{n} \left( \phi(Z_i) - \frac{i-1}{n}, \; \frac{i}{n} - \phi(Z_i) \right)    (2)
Here, n is the number of samples, Z denotes the sorted samples, and φ represents the Cumulative Distribution Function (CDF) of the theoretical distribution used for
assessment. Intuitively, the CDF value for the normal distribution can be calculated as follows (Eq. 3):

\phi(x \mid \mu_{s_k}, \delta_{s_k}) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu_{s_k}}{\delta_{s_k}\sqrt{2}}\right)\right]    (3)
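The following sketch implements Eq. 3 with Python's math.erf and plugs it into Eq. 2 for an already sorted interval; it assumes δ > 0 and is only meant to show how the two equations interact, not to reproduce the full SUFDA procedure.

```python
import math

def normal_cdf(x, mu, sigma):
    """Eq. 3: CDF of N(mu, sigma) evaluated through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sorted_sample, mu, sigma):
    """Eq. 2: D_n = max_i( phi(Z_i) - (i-1)/n, i/n - phi(Z_i) )
    for a sorted sample Z_1 <= ... <= Z_n."""
    n = len(sorted_sample)
    d_n = 0.0
    for i, z in enumerate(sorted_sample, start=1):
        phi = normal_cdf(z, mu, sigma)
        d_n = max(d_n, phi - (i - 1) / n, i / n - phi)
    return d_n
```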
The mean and standard deviation are formulated in Eq. 4. Note that μ_{s_k} and δ_{s_k} are the mean and standard deviation of the interval k that is being processed at the time.

\mu_{s_k} = \frac{1}{N_k}\sum_{i} x_i ; \qquad \delta_{s_k} = \sqrt{\frac{1}{N_k}\sum_{i} (x_i - \mu_{s_k})^2}    (4)

The error function (erf) used in Eq. 3 is the probability that a random variable drawn from a normal distribution with mean 0 and variance 1/2 falls within the interval [−x, x]. The error function is defined in Eq. 5; we can approximate the integral by a partial sum of an alternating infinite series.

\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} \, dt    (5)

The null hypothesis (H_0) states that the interval samples follow the normal distribution N(μ, δ), which serves as a reference for every numerical feature. The mean and standard deviation parameters of the current interval, which is being scanned, are denoted by μ = μ_s and δ = ϕ · δ_s, respectively. Note that the initial interval comprises all samples of the numerical feature. Also, the 0 < ϕ
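Since the integral in Eq. 5 is approximated by a partial sum of an alternating series, a sketch of the standard alternating Maclaurin series for erf is given below; the number of retained terms is our choice, as the paper does not state how many terms SUFDA keeps.

```python
import math

def erf_series(x, terms=20):
    """Partial sum of erf(x) = (2/sqrt(pi)) * sum_{k>=0} (-1)^k x^(2k+1) / (k! (2k+1))."""
    total = 0.0
    for k in range(terms):
        total += (-1) ** k * x ** (2 * k + 1) / (math.factorial(k) * (2 * k + 1))
    return 2.0 / math.sqrt(math.pi) * total

print(erf_series(1.0), math.erf(1.0))   # both ~0.8427; the series converges quickly for moderate |x|
```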