Statistical Discretization of Continuous Attributes Using Kolmogorov-Smirnov Test

Hadi Mohammadzadeh Abachi1(B), Saeid Hosseini1,2, Mojtaba Amiri Maskouni1, Mohammadreza Kangavari1, and Ngai-Man Cheung2

1 Iran University of Science and Technology, Tehran, Iran
{hadi mohammadzadeh, amiri m, kangavari}@comp.iust.ac.ir, [email protected]
2 Singapore University of Technology and Design, Singapore, Singapore
[email protected]
Abstract. Unlike unsupervised discretization methods, which apply simple rules to discretize continuous attributes with a low time complexity dominated by the sorting procedure, supervised discretization algorithms take the class labels into consideration to achieve high accuracy. Supervised discretization of continuous features encounters two significant challenges. First, noisy class labels reduce the effectiveness of discretization. Second, because of the high computational cost of supervised algorithms on large-scale datasets, the time complexity is dominated by the discretizing stage rather than the sorting procedure. Accordingly, to address these challenges, we devise a statistical unsupervised method named SUFDA. SUFDA aims to produce discrete intervals by decreasing the differential entropy of the normal distribution, with a low temporal complexity and high accuracy. The results show that our unsupervised system obtains better effectiveness than other discretization baselines on large-scale datasets.

Keywords: Discretization · Kolmogorov-Smirnov · Data mining · Data reduction · Naïve Bayes
1 Introduction
Data preprocessing, as a main stage in knowledge extraction and data mining, includes components such as data transformation, cleaning, and data reduction. Discretization, one of the basic data reduction procedures, maps data from a huge range of continuous values to a greatly shrunk subset of discrete values. A discretization scheme partitions a sorted numerical attribute into k discrete and disjoint intervals denoted as η = {[s_0, s_1], (s_1, s_2], ..., (s_{k-1}, s_k]}, where s_0 and s_k are respectively the minimum and the maximum values of the numerical attribute. Subject to different applications, the discretization methods can be
classified into various twin categories, including global vs. local, top-down vs. bottom-up, direct vs. incremental, static vs. dynamic, and supervised vs. unsupervised [3]. In fact, prior supervised as well as unsupervised approaches face certain challenges. Most unsupervised methods are so naïve that they may not lead to effective results. Additionally, where irrelevant data exist, outliers may not be handled appropriately and can consequently affect the accuracy negatively. To address these challenges, supervised models employ class information. However, the required class information is at times noisy or even inaccessible. Moreover, due to their high computational complexity and additional statistical criteria, the efficiency of supervised algorithms on large-scale datasets and in complex environments drops significantly. To this end, in this paper, we devise a new unsupervised approach that not only promotes effectiveness, but also avoids the complexities involved in supervised approaches.
2 Related Work
Discretization. Many discretization techniques have been proposed in the literature and can be used in several applications, such as association mining algorithms, rule induction, clinical datasets, and recommendation systems [4]. Equal-width discretization (EWD) and equal-frequency discretization (EFD) are unsupervised methods that divide the observed samples between s_min and s_max into k equally sized intervals, or into k intervals so that each interval includes approximately the same number of instances; here k is a user-predefined value. The complexity of such methods is O(n log n), which stems solely from the sorting step. Entropy minimization discretization (Fayyad) [2] is the essential supervised method. This greedy algorithm uses the class information entropy of candidate intervals to select the boundaries for discretization. Considering all possible cut points, it starts by finding the single cut point that minimizes the entropy of the resulting intervals. The complexity of Fayyad's method, assuming m classes, is O(m · n · log(n)), which dominates the sorting complexity. Supervised algorithms therefore require a high computational time to discretize numerical attributes, which is tangible on large-scale datasets. Our work relates to this literature by proposing a non-parametric unsupervised discretization algorithm that provides accurate and effective results.
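To make the two unsupervised baselines concrete, the following is a minimal Python sketch of EWD and EFD, assuming NumPy is available; the function names and the demonstration data are ours and are not part of the original algorithms.

```python
import numpy as np

def equal_width_discretize(x, k):
    """EWD: split [min(x), max(x)] into k equally wide intervals."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    # Interior edges only; np.digitize returns bin indices in 0..k-1.
    return np.digitize(x, edges[1:-1], right=True)

def equal_frequency_discretize(x, k):
    """EFD: split into k intervals holding roughly equal numbers of instances."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, k + 1))
    return np.searchsorted(edges[1:-1], x, side="left")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=1_000)
    print(np.bincount(equal_width_discretize(x, 5)))      # uneven bin counts
    print(np.bincount(equal_frequency_discretize(x, 5)))  # roughly 200 per bin
```

Both routines cost O(n log n) because of the internal sorting/quantile step, matching the complexity stated above.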
3 Problem Statement

3.1 Preliminary Concepts
Definition 1. In statistics, the Kolmogorov-Smirnov test is a non-parametric test used to compare a group of samples with a reference probability distribution (one-sample KS test). The Kolmogorov-Smirnov test statistic is the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution:

D_n = \sup_x |F_n(x) - F(x)|    (1)
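As an illustration of Definition 1, the snippet below runs a one-sample KS test of a sample against a normal reference; the use of SciPy and the sample parameters are our assumptions, not prescribed by the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=2.0, scale=1.5, size=500)

# Reference distribution: a normal whose parameters are estimated from the sample.
mu, sigma = sample.mean(), sample.std()
statistic, p_value = stats.kstest(sample, "norm", args=(mu, sigma))
print(f"D_n = {statistic:.4f}, p-value = {p_value:.4f}")
```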
3.2 Problem Definition
Given a dataset S consisting of N examples and M attributes (features), where the attributes can be either numerical (M_c) or nominal (M_d), we try to discretize the numerical attributes (M_c) based on the one-sample Kolmogorov-Smirnov test. Assuming that the numerical attributes are normally distributed, we define an initial normal distribution as a reference for every feature and set its parameters, namely the mean and standard deviation. The next step is to define a null hypothesis and try to reject it by comparing the calculated p-value to a predefined threshold. Note that the p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. In the discretization procedure, splitting a feature, viewed as an interval, into sub-intervals is associated with rejecting or accepting the null hypothesis.
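A minimal sketch of that split decision is shown below, assuming a hypothetical significance threshold alpha and SciPy's one-sample KS test; the threshold value and function name are illustrative and not taken from the paper.

```python
from scipy import stats

def should_split(interval_samples, alpha=0.05):
    """Reject H0 (the interval follows its reference normal distribution)
    when the one-sample KS p-value falls below the predefined threshold."""
    mu, sigma = interval_samples.mean(), interval_samples.std()
    if sigma == 0.0:                 # a constant interval cannot be split further
        return False
    _, p_value = stats.kstest(interval_samples, "norm", args=(mu, sigma))
    return p_value < alpha           # reject H0 -> split into sub-intervals
```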
4 Proposed Algorithm
In this section, we elucidate our proposed Statistical Unsupervised Feature Discretization Algorithm (SUFDA). The proposed model is local, top-down, static, incremental, and unsupervised. Our model independently considers each individual feature as an initial interval and, subject to statistical analysis, splits it into sub-intervals recursively until it meets the stopping criterion. Aiming to reduce the algorithm's complexity on large-scale datasets, our algorithm utilizes an optimized and approximate approach for the Kolmogorov-Smirnov (KS) assessment, which can lead to a linear temporal complexity [7,9]. Additionally, unlike other unsupervised algorithms, the number of intervals can be inferred from the distribution as well as the frequency of the feature samples. At times, the discretization of numerical attributes can negatively affect the accuracy. Therefore, during the splitting stage, our model places samples with high frequencies into distinguished intervals; where a numerical value is repeated in more than half of the interval size, it is recorded as having a high frequency. The significant factors of our model are twofold. First, the distribution of the samples matters, as they may follow a non-normal distribution (e.g., skewed or uniform) or a Gaussian (normal) distribution. Second, samples belonging to the same interval during discretization should follow the interval's normal distribution with a low deviation from the mean. To enforce a low standard deviation when the samples are normally distributed, we adjust a compactness level (ϕ) that imposes further restrictions. In this paper, we utilize the KS assessment on numerical features to find the discretized intervals comprising compacted samples. The KS statistic can be rewritten as formulated in Eq. 2:

D_n = \max_{i=1}^{n} \left( \phi(Z_i) - \frac{i-1}{n}, \; \frac{i}{n} - \phi(Z_i) \right)    (2)
Here, n is the number of samples, Z denotes the sorted samples, and φ represents the Cumulative Distribution Function (CDF) of the theoretical distribution used for
assessment. Intuitively, the CDF value for the normal distribution can be calculated as follows (Eq. 3):

\phi(x \mid \mu_{s_k}, \delta_{s_k}) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu_{s_k}}{\delta_{s_k}\sqrt{2}}\right)\right]    (3)
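The following sketch implements Eq. 3 with Python's math.erf and plugs it into Eq. 2 for an already sorted interval; it assumes δ > 0 and is only meant to show how the two equations interact, not to reproduce the full SUFDA procedure.

```python
import math

def normal_cdf(x, mu, sigma):
    """Eq. 3: CDF of N(mu, sigma) evaluated through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sorted_sample, mu, sigma):
    """Eq. 2: D_n = max_i( phi(Z_i) - (i-1)/n, i/n - phi(Z_i) )
    for a sorted sample Z_1 <= ... <= Z_n."""
    n = len(sorted_sample)
    d_n = 0.0
    for i, z in enumerate(sorted_sample, start=1):
        phi = normal_cdf(z, mu, sigma)
        d_n = max(d_n, phi - (i - 1) / n, i / n - phi)
    return d_n
```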
The mean and standard deviation are formulated in Eq. 4. Note that μ_{s_k} and δ_{s_k} are the mean and standard deviation of the interval k that is being processed at the time.

\mu_{s_k} = \frac{1}{N_k}\sum_{i} x_i ; \qquad \delta_{s_k} = \sqrt{\frac{1}{N_k}\sum_{i} (x_i - \mu_{s_k})^2}    (4)

The error function (erf) used in Eq. 3 is the probability that a random variable drawn from a normal distribution with mean 0 and variance 1/2 falls within the interval [−x, x]. The error function is defined in Eq. 5; we can approximate the integral by a partial sum of an alternating infinite series.

\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} \, dt    (5)

The null hypothesis (H_0) states that the interval samples follow the normal distribution N(μ, δ), which serves as a reference for every numerical feature. The mean and standard deviation parameters of the current interval, which is being scanned, are denoted by μ = μ_s and δ = ϕ · δ_s, respectively. Note that the initial interval comprises all samples of the numerical feature. Also, the 0 < ϕ
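Since the integral in Eq. 5 is approximated by a partial sum of an alternating series, a sketch of the standard alternating Maclaurin series for erf is given below; the number of retained terms is our choice, as the paper does not state how many terms SUFDA keeps.

```python
import math

def erf_series(x, terms=20):
    """Partial sum of erf(x) = (2/sqrt(pi)) * sum_{k>=0} (-1)^k x^(2k+1) / (k! (2k+1))."""
    total = 0.0
    for k in range(terms):
        total += (-1) ** k * x ** (2 * k + 1) / (math.factorial(k) * (2 * k + 1))
    return 2.0 / math.sqrt(math.pi) * total

print(erf_series(1.0), math.erf(1.0))   # both ~0.8427; the series converges quickly for moderate |x|
```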