2013 26th IEEE Canadian Conference Of Electrical And Computer Engineering (CCECE)
OPTIMIZING GINI COEFFICIENT OF PROBABILISTIC ROUGH SET REGIONS USING GAME-THEORETIC ROUGH SETS
Yan Zhang
Department of Computer Science, University of Regina
Regina, Saskatchewan, Canada S4S 0A2
[email protected]

ABSTRACT

Probabilistic rough sets define positive, negative and boundary regions, and the probabilistic thresholds determine the Gini coefficient of the three regions. We use game-theoretic rough sets to investigate the relationship between changes in the probabilistic thresholds and their impact on the Gini coefficient of the rough set regions. An example shows that effective probabilistic threshold values, corresponding to a suitable Gini coefficient of the probabilistic regions, can be derived by repeatedly tuning the thresholds.

Index Terms: Game-theoretic rough sets, probabilistic rough sets, Gini coefficient

1. INTRODUCTION

Probabilistic rough sets extend the Pawlak rough set model [1] by using a pair of probabilistic thresholds (α, β) to define upper and lower approximations or, equivalently, three probabilistic regions [2] [3]. One of the important issues in probabilistic rough sets is the interpretation and estimation of the required threshold values [4]. For instance, the decision-theoretic rough set model determines the pair of thresholds by minimizing the overall classification cost [5]. In the information-theoretic interpretation, effective thresholds are calculated by minimizing the uncertainty of the three probabilistic regions [6].

Game-theoretic rough sets (GTRS) is an approach that determines effective probabilistic thresholds by formulating competition or cooperation among multiple criteria [7]. Herbert and Yao constructed two types of games in which the region parameters and the classification approximation measures, respectively, were defined as game players [7]. Azam and Yao extended GTRS by treating the determination of thresholds as a decision-making problem and constructed games to formulate and analyze multiple criteria [8]. GTRS can also be applied to rule mining [8] [9], feature selection [10], classification [7], uncertainty analysis [11] and three-way decisions [12].

The objectives of the users influence the formulation of a game in GTRS. Generally, a game is formulated as a set of players, a set of actions or strategies for each player, and the respective payoff functions for each action. Each player chooses actions according to the expected payoff, usually actions that maximize its own payoff while minimizing the other players' payoffs [7].

In this paper, we use the Gini coefficient of rough set regions to interpret and estimate the probabilistic thresholds, and we use GTRS to analyze the Gini coefficient of the regions with the aim of obtaining effective probabilistic thresholds. A competitive game is formulated between regions, and the thresholds are modified in order to improve the players' payoff values. The game is repeated, and the starting thresholds of the next game are obtained from the equilibrium of the previous game. This iterative learning mechanism automatically tunes the thresholds until the Gini coefficient of the regions satisfies the user's requirement. The results of this study may enhance our understanding of GTRS and its applicability.

The work is conducted under the guidance of my Ph.D. supervisor Dr. Jingtao Yao and is partially supported by a discovery grant from NSERC Canada awarded to him. This work is also supported by the Department of Computer Science, the Faculty of Science and the Faculty of Graduate Studies and Research at the University of Regina.

978-1-4799-0033-6/13/$31.00 ©2013 IEEE
2. GINI COEFFICIENT OF PROBABILISTIC ROUGH SET REGIONS

The Gini coefficient is a summary statistic of the Lorenz curve and a measure of inequality in a population. Mathematically, the Gini coefficient is most easily calculated from unordered size data as the "relative mean difference", i.e., the mean of the difference between every possible pair of individuals, divided by the mean size [13] [14]. In data mining, the Gini coefficient is used to measure the impurity of a node when determining the best split attribute [15].

Suppose $\pi = \{p_1, p_2, \ldots, p_n\}$ is a partition of the universe $U$, that is, $\bigcup_{i=1}^{n} p_i = U$ and $p_i \cap p_j = \emptyset$ for $i \neq j$. A probability distribution can be defined as:

$$P_\pi = \left( \frac{|p_1|}{|U|}, \frac{|p_2|}{|U|}, \ldots, \frac{|p_n|}{|U|} \right), \tag{1}$$

where $\Pr(p_i) = |p_i| / |U|$ denotes the probability of the block $p_i$. The Gini coefficient of $P_\pi$ is defined as:

$$G(P_\pi) = 1 - \sum_{i=1}^{n} (\Pr(p_i))^2 = 1 - \sum_{i=1}^{n} \left( \frac{|p_i|}{|U|} \right)^2. \tag{2}$$
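Equation 2 can be illustrated with a few lines of Python. This is a minimal sketch, not code from the paper; the function name `gini_of_partition` and the block sizes are illustrative assumptions.

```python
from typing import Sequence


def gini_of_partition(block_sizes: Sequence[int]) -> float:
    """Gini coefficient of Equation 2: 1 - sum_i (|p_i| / |U|)^2.

    block_sizes holds the cardinalities |p_i| of the blocks of a partition.
    """
    total = sum(block_sizes)  # |U|, since the blocks are disjoint and cover U
    if total <= 0:
        raise ValueError("the universe must contain at least one object")
    return 1.0 - sum((size / total) ** 2 for size in block_sizes)


# A partition with a single block has Gini coefficient 0;
# two blocks of equal size give 0.5.
print(gini_of_partition([10]))    # 0.0
print(gini_of_partition([5, 5]))  # 0.5
```

The value grows as the objects are spread more evenly over more blocks, which is why it can serve as an impurity or uncertainty measure.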
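In the probabilistic rough set model, for a pair of thresholds $(\alpha, \beta)$ with $0 \le \beta < \alpha \le 1$, the probabilistic positive, negative and boundary regions of a concept $C$ are defined as [2]:

$$\begin{aligned}
\mathrm{POS}_{(\alpha,\beta)}(C) &= \{x \in U \mid \Pr(C \mid [x]) \ge \alpha\},\\
\mathrm{NEG}_{(\alpha,\beta)}(C) &= \{x \in U \mid \Pr(C \mid [x]) \le \beta\},\\
\mathrm{BND}_{(\alpha,\beta)}(C) &= \{x \in U \mid \beta < \Pr(C \mid [x]) < \alpha\}.
\end{aligned} \tag{3}$$

The three probabilistic regions are pairwise disjoint and their union is the entire universe $U$, so we get the partition:

$$\pi_{(\alpha,\beta)} = \{\mathrm{POS}_{(\alpha,\beta)}(C), \mathrm{NEG}_{(\alpha,\beta)}(C), \mathrm{BND}_{(\alpha,\beta)}(C)\}. \tag{4}$$

To determine the values of the pair of thresholds, Deng and Yao [6] use Shannon entropy to measure the uncertainty of the three regions. In this paper, we adopt the same framework but use the Gini coefficient to measure it. The Gini coefficient of $\pi_C = \{C, C^c\}$ within each region is computed as:

$$\begin{aligned}
G(\pi_C \mid \mathrm{POS}_{(\alpha,\beta)}(C)) &= 1 - (\Pr(C \mid \mathrm{POS}_{(\alpha,\beta)}(C)))^2 - (\Pr(C^c \mid \mathrm{POS}_{(\alpha,\beta)}(C)))^2,\\
G(\pi_C \mid \mathrm{NEG}_{(\alpha,\beta)}(C)) &= 1 - (\Pr(C \mid \mathrm{NEG}_{(\alpha,\beta)}(C)))^2 - (\Pr(C^c \mid \mathrm{NEG}_{(\alpha,\beta)}(C)))^2,\\
G(\pi_C \mid \mathrm{BND}_{(\alpha,\beta)}(C)) &= 1 - (\Pr(C \mid \mathrm{BND}_{(\alpha,\beta)}(C)))^2 - (\Pr(C^c \mid \mathrm{BND}_{(\alpha,\beta)}(C)))^2.
\end{aligned} \tag{5}$$

The Gini coefficient of the three regions is computed as the probability-weighted average of the regional Gini coefficients, which is called the conditional Gini coefficient of $\pi_C$ given $\pi_{(\alpha,\beta)}$:

$$\begin{aligned}
G(\pi_C \mid \pi_{(\alpha,\beta)}) ={}& \Pr(\mathrm{POS}_{(\alpha,\beta)}(C))\, G(\pi_C \mid \mathrm{POS}_{(\alpha,\beta)}(C))\\
&+ \Pr(\mathrm{NEG}_{(\alpha,\beta)}(C))\, G(\pi_C \mid \mathrm{NEG}_{(\alpha,\beta)}(C))\\
&+ \Pr(\mathrm{BND}_{(\alpha,\beta)}(C))\, G(\pi_C \mid \mathrm{BND}_{(\alpha,\beta)}(C)).
\end{aligned} \tag{6}$$

The probability $\Pr(C \mid \mathrm{POS}_{(\alpha,\beta)}(C))$ in Equation 5 denotes the conditional probability that an object $x$ is in $C$ given that the object is in the positive probabilistic region $\mathrm{POS}_{(\alpha,\beta)}(C)$. The conditional probabilities can be computed by:

$$\begin{aligned}
\Pr(C \mid \mathrm{POS}_{(\alpha,\beta)}(C)) &= \frac{|C \cap \mathrm{POS}_{(\alpha,\beta)}(C)|}{|\mathrm{POS}_{(\alpha,\beta)}(C)|},\\
\Pr(C \mid \mathrm{NEG}_{(\alpha,\beta)}(C)) &= \frac{|C \cap \mathrm{NEG}_{(\alpha,\beta)}(C)|}{|\mathrm{NEG}_{(\alpha,\beta)}(C)|},\\
\Pr(C \mid \mathrm{BND}_{(\alpha,\beta)}(C)) &= \frac{|C \cap \mathrm{BND}_{(\alpha,\beta)}(C)|}{|\mathrm{BND}_{(\alpha,\beta)}(C)|}.
\end{aligned} \tag{7}$$

The conditional probabilities of $C^c$ can be computed in a similar way. The probabilities of the three regions can be calculated by:

$$\begin{aligned}
\Pr(\mathrm{POS}_{(\alpha,\beta)}(C)) &= \frac{|\mathrm{POS}_{(\alpha,\beta)}(C)|}{|U|},\\
\Pr(\mathrm{NEG}_{(\alpha,\beta)}(C)) &= \frac{|\mathrm{NEG}_{(\alpha,\beta)}(C)|}{|U|},\\
\Pr(\mathrm{BND}_{(\alpha,\beta)}(C)) &= \frac{|\mathrm{BND}_{(\alpha,\beta)}(C)|}{|U|}.
\end{aligned} \tag{8}$$

3. ANALYZING GINI COEFFICIENT WITH GTRS

In game-theoretic rough sets, a game is formulated from three components: a set of players $O$, sets of strategies $S$ for the players, and a set of payoff functions $T$ resulting from the players performing actions, that is, $G = \{O, S, T\}$ [7]. For a two-player game, the players are $O = \{o_1, o_2\}$. The strategies are $S = \{S_1, S_2\}$, where for player $o_i$, $S_i = \{a_{i1}, \ldots, a_{ik}\}$ and $a_{ij} = f_{ij}(\alpha, \beta)$, that is, action $a_{ij}$ is a change of the probabilistic thresholds. The payoff functions are $T = \{u_1, u_2\}$, evaluated as $u_1(\alpha_{ij}, \beta_{ij})$ and $u_2(\alpha_{ij}, \beta_{ij})$, where $(\alpha_{ij}, \beta_{ij}) = F(f_{1i}(\alpha, \beta), f_{2j}(\alpha, \beta))$ and $F$ is a user-defined function that combines the two players' actions into one threshold pair. Table 1 shows the payoff table.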
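Table 1. Payoff table (rows: actions of $o_1$; columns: actions of $o_2$)

| $o_1$ \ $o_2$ | $a_{21} = f_{21}(\alpha, \beta)$ | $a_{22} = f_{22}(\alpha, \beta)$ | ... |
| --- | --- | --- | --- |
| $a_{11} = f_{11}(\alpha, \beta)$ | $u_1(\alpha_{11}, \beta_{11}),\ u_2(\alpha_{11}, \beta_{11})$ | $u_1(\alpha_{12}, \beta_{12}),\ u_2(\alpha_{12}, \beta_{12})$ | ... |
| $a_{12} = f_{12}(\alpha, \beta)$ | $u_1(\alpha_{21}, \beta_{21}),\ u_2(\alpha_{21}, \beta_{21})$ | $u_1(\alpha_{22}, \beta_{22}),\ u_2(\alpha_{22}, \beta_{22})$ | ... |
| ... | ... | ... | ... |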
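The structure of Table 1 can be sketched in code. The following is an illustration under stated assumptions rather than the paper's implementation: the names `build_payoff_table` and `average_thresholds` are hypothetical, averaging is only one possible choice for the user-defined function $F$, and the toy payoff functions exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

Thresholds = Tuple[float, float]          # a threshold pair (alpha, beta)
Payoff = Callable[[float, float], float]  # a payoff function u(alpha, beta)
Combine = Callable[[Thresholds, Thresholds], Thresholds]  # the function F


def average_thresholds(a1: Thresholds, a2: Thresholds) -> Thresholds:
    """One possible F (an assumption): the component-wise mean of two actions."""
    return ((a1[0] + a2[0]) / 2.0, (a1[1] + a2[1]) / 2.0)


def build_payoff_table(
    actions_1: List[Thresholds],
    actions_2: List[Thresholds],
    combine: Combine,
    u1: Payoff,
    u2: Payoff,
) -> List[List[Tuple[float, float]]]:
    """Cell (i, j) holds (u1, u2) evaluated at F(a_1i, a_2j)."""
    table = []
    for a1 in actions_1:
        row = []
        for a2 in actions_2:
            alpha, beta = combine(a1, a2)
            row.append((u1(alpha, beta), u2(alpha, beta)))
        table.append(row)
    return table


# Toy payoffs, chosen only for illustration.
actions = [(1.0, 0.0), (0.5, 0.25)]
table = build_payoff_table(
    actions, actions, average_thresholds,
    u1=lambda alpha, beta: alpha - beta,
    u2=lambda alpha, beta: beta,
)
print(table[0][1])  # (0.625, 0.125)
```

Passing $F$ and the payoff functions as parameters keeps the table construction independent of any particular measure, so the same routine can be reused when the criteria change.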
We use $G_P(\alpha, \beta)$, $G_N(\alpha, \beta)$ and $G_B(\alpha, \beta)$ to represent the Gini coefficients of the positive, negative and boundary regions, respectively. According to Equation 6, they can be calculated as:

$$\begin{aligned}
G_P(\alpha, \beta) &= \Pr(\mathrm{POS}_{(\alpha,\beta)}(C))\, G(\pi_C \mid \mathrm{POS}_{(\alpha,\beta)}(C)),\\
G_N(\alpha, \beta) &= \Pr(\mathrm{NEG}_{(\alpha,\beta)}(C))\, G(\pi_C \mid \mathrm{NEG}_{(\alpha,\beta)}(C)),\\
G_B(\alpha, \beta) &= \Pr(\mathrm{BND}_{(\alpha,\beta)}(C))\, G(\pi_C \mid \mathrm{BND}_{(\alpha,\beta)}(C)).
\end{aligned} \tag{9}$$

When $(\alpha, \beta) = (1, 0)$, $G_P(\alpha, \beta) = 0$, $G_N(\alpha, \beta) = 0$ and $G_B(\alpha, \beta)$ takes its maximal value. Decreasing $\alpha$ can increase the Gini coefficient of the positive region while decreasing that of the boundary region. Similarly, increasing $\beta$ can increase the Gini coefficient of the negative region while decreasing that of the boundary region.

In order to analyze the Gini coefficients of the regions, the game can be formulated as a competition between regions. There are two players in the game, the positive and negative regions versus the boundary region, i.e., $O = \{P, B\}$. The strategies of the players are formulated as changes of the probabilistic thresholds, for example, $S_P = S_B = \{(\alpha_0, \beta_0), (\alpha_0(1 - c), \beta_0(1 + c)), (\alpha_0(1 - 2c), \beta_0(1 + 2c))\}$. The payoff functions of players $P$ and $B$ are:

$$u_P(\alpha, \beta) = \frac{GC_P(\alpha, \beta) + GC_N(\alpha, \beta)}{2}, \qquad u_B(\alpha, \beta) = GC_B(\alpha, \beta), \tag{10}$$

where $GC_P(\alpha, \beta)$, $GC_N(\alpha, \beta)$ and $GC_B(\alpha, \beta)$ are defined as:

$$\begin{aligned}
GC_P(\alpha, \beta) &= 1 - G_P(\alpha, \beta),\\
GC_N(\alpha, \beta) &= 1 - G_N(\alpha, \beta),\\
GC_B(\alpha, \beta) &= 1 - G_B(\alpha, \beta).
\end{aligned} \tag{11}$$

The payoff table of the game has the form of Table 1, where $f_{1i} = f_{2i} = (\alpha(1 - i \times c), \beta(1 + i \times c))$. The thresholds in each cell are half of the sum of the two players' changes, i.e., $(\alpha_{ij}, \beta_{ij}) = F_{ij} = \frac{f_{1i} + f_{2j}}{2}$. In the competition, we can find equilibria of the payoff table. The cell $\langle u_P(\alpha_{kl}, \beta_{kl}), u_B(\alpha_{kl}, \beta_{kl}) \rangle$ is an equilibrium if actions $a_{1k}$ and $a_{2l}$ satisfy:

$$\begin{aligned}
u_P(a_{1k}, a_{2j}) &> u_P(a_{1i}, a_{2j}), \quad \text{for all } i \neq k, \text{ for all } j,\\
u_B(a_{1i}, a_{2l}) &> u_B(a_{1i}, a_{2j}), \quad \text{for all } j \neq l, \text{ for all } i.
\end{aligned} \tag{12}$$

The probabilistic thresholds $(\alpha_{kl}, \beta_{kl})$ corresponding to the equilibrium can then be obtained. If they do not satisfy the stop conditions, the process of finding a new pair of thresholds is repeated until the stop conditions are satisfied. In each iteration, the players are still $P$ and $B$, the new initial threshold pair $(\alpha_0, \beta_0)$ is set to $(\alpha_{kl}, \beta_{kl})$, and each player has access to the same actions. The stop conditions can be formulated according to the objectives of the users or the requirements of probabilistic rough sets, for example, that the payoff of player $P$ should be greater than that of player $B$, or that the payoff of player $P$ should exceed a specific value.

4. AN EXAMPLE

In this section, we present a demonstrative example to illustrate that effective probabilistic threshold values, corresponding to a suitable Gini coefficient of the probabilistic regions, can be derived with GTRS. Table 2 summarizes probabilistic data about a concept $C$. There are 10 equivalence classes $X_i$, $i = 1, 2, \ldots, 10$, which are listed in decreasing order of the conditional probabilities $\Pr(C \mid X_i)$.

Table 2. Summary of experimental data

| | $X_1$ | $X_2$ | $X_3$ | $X_4$ | $X_5$ | $X_6$ | $X_7$ | $X_8$ | $X_9$ | $X_{10}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\Pr(X_i)$ | 0.097 | 0.106 | 0.098 | 0.134 | 0.076 | 0.116 | 0.089 | 0.115 | 0.069 | 0.1 |
| $\Pr(C \mid X_i)$ | 1 | 0.98 | 0.93 | 0.91 | 0.81 | 0.19 | 0.06 | 0.04 | 0.03 | 0 |

When $(\alpha, \beta) = (1, 0)$, the three regions are $\mathrm{POS}_{(1,0)}(C) = X_1$, $\mathrm{BND}_{(1,0)}(C) = X_2 \cup X_3 \cup \ldots \cup X_9$ and $\mathrm{NEG}_{(1,0)}(C) = X_{10}$. The Gini coefficient of the positive region is

$$G(\pi_C \mid \mathrm{POS}_{(1,0)}(C)) = 1 - 1^2 - 0^2 = 0.$$

The Gini coefficient of the negative region is

$$G(\pi_C \mid \mathrm{NEG}_{(1,0)}(C)) = 1 - 0^2 - 1^2 = 0.$$

For the boundary region, the probability is

$$\Pr(\mathrm{BND}_{(1,0)}(C)) = \sum_{i=2}^{9} \Pr(X_i) = 0.803.$$

The conditional probability of $C$ is

$$\Pr(C \mid \mathrm{BND}_{(1,0)}(C)) = \frac{\sum_{i=2}^{9} \Pr(C \mid X_i) \Pr(X_i)}{\sum_{i=2}^{9} \Pr(X_i)} = \frac{0.4126}{0.803} = 0.5138.$$

The Gini coefficient of the boundary region is

$$G(\pi_C \mid \mathrm{BND}_{(1,0)}(C)) = 1 - (0.5138)^2 - (1 - 0.5138)^2 = 0.4996.$$

In the competition, the initial values of $(\alpha, \beta)$ are $(1, 0)$. The actions of both players are $S_P = S_B = \{(1, 0), (0.95, 0.05), (0.9, 0.1)\}$. The payoff table is shown as Table 3. Referring to Table 1, the thresholds in the cells of the first row of Table 3 are calculated as:

$$\begin{aligned}
(\alpha_{11}, \beta_{11}) &= \left( \frac{1 + 1}{2}, \frac{0 + 0}{2} \right) = (1, 0),\\
(\alpha_{12}, \beta_{12}) &= \left( \frac{1 + 0.95}{2}, \frac{0 + 0.05}{2} \right) = (0.975, 0.025),\\
(\alpha_{13}, \beta_{13}) &= \left( \frac{1 + 0.9}{2}, \frac{0 + 0.1}{2} \right) = (0.95, 0.05).
\end{aligned} \tag{13}$$
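The first game of this example can be checked with a short script. The sketch below is not the author's implementation. It assumes the data of Table 2, takes the boundary region as $\beta < \Pr(C \mid X_i) < \alpha$, averages the two players' actions to obtain the thresholds of each cell, and uses the standard best-response test for a pure-strategy Nash equilibrium, which is weaker than the strict inequalities of Equation 12. All function and variable names are illustrative.

```python
from typing import List, Tuple

# Table 2: (Pr(X_i), Pr(C | X_i)) for the ten equivalence classes.
CLASSES: List[Tuple[float, float]] = [
    (0.097, 1.00), (0.106, 0.98), (0.098, 0.93), (0.134, 0.91), (0.076, 0.81),
    (0.116, 0.19), (0.089, 0.06), (0.115, 0.04), (0.069, 0.03), (0.100, 0.00),
]


def weighted_gini(members: List[Tuple[float, float]]) -> float:
    """Pr(region) * G(pi_C | region), i.e. one line of Equation 9."""
    p_region = sum(p for p, _ in members)
    if p_region == 0.0:
        return 0.0  # an empty region contributes nothing
    p_c = sum(p * pc for p, pc in members) / p_region  # Pr(C | region)
    return p_region * (1.0 - p_c ** 2 - (1.0 - p_c) ** 2)


def payoffs(alpha: float, beta: float) -> Tuple[float, float]:
    """Return (u_P, u_B) of Equations 10 and 11 for one threshold pair."""
    pos = [c for c in CLASSES if c[1] >= alpha]
    neg = [c for c in CLASSES if c[1] <= beta]
    bnd = [c for c in CLASSES if beta < c[1] < alpha]
    u_p = ((1.0 - weighted_gini(pos)) + (1.0 - weighted_gini(neg))) / 2.0
    u_b = 1.0 - weighted_gini(bnd)
    return u_p, u_b


def pure_nash(table: List[List[Tuple[float, float]]]) -> List[Tuple[int, int]]:
    """Cells (k, l) where neither player gains by deviating alone."""
    n_rows, n_cols = len(table), len(table[0])
    return [
        (k, l)
        for k in range(n_rows)
        for l in range(n_cols)
        if all(table[k][l][0] >= table[i][l][0] for i in range(n_rows))
        and all(table[k][l][1] >= table[k][j][1] for j in range(n_cols))
    ]


ACTIONS = [(1.0, 0.0), (0.95, 0.05), (0.90, 0.10)]
# Rows are actions of P, columns are actions of B; thresholds are averaged.
TABLE = [
    [payoffs((a1[0] + a2[0]) / 2.0, (a1[1] + a2[1]) / 2.0) for a2 in ACTIONS]
    for a1 in ACTIONS
]

u_p, u_b = TABLE[0][0]
print(round(u_p, 4), round(u_b, 4))  # 1.0 0.5988
print(pure_nash(TABLE))              # [(0, 2)]
```

With zero-based indices, the cell (0, 2) pairs the action (1, 0) of player $P$ with the action (0.90, 0.1) of player $B$, whose average is the threshold pair (0.95, 0.05).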
According to the above calculation, we have $u_P(1, 0) = 1 - 0 = 1$ and $u_B(1, 0) = 1 - 0.803 \times 0.4996 = 0.5988$. The utilities of the other cells can be computed in the same way.

Table 3. Payoff table of the exemplary game (rows: actions of $P$; columns: actions of $B$; each cell is ⟨$u_P$, $u_B$⟩)

| $P$ \ $B$ | (1, 0) | (0.95, 0.05) | (0.90, 0.1) |
| --- | --- | --- | --- |
| (1, 0) | ⟨1.0000, 0.5988⟩ | ⟨0.9979, 0.6560⟩ | **⟨0.9914, 0.7516⟩** |
| (0.95, 0.05) | ⟨0.9979, 0.6560⟩ | ⟨0.9914, 0.7516⟩ | ⟨0.9797, 0.8481⟩ |
| (0.90, 0.1) | ⟨0.9914, 0.7516⟩ | ⟨0.9797, 0.8481⟩ | ⟨0.9684, 0.9056⟩ |

The stop condition is that the utilities of both players are greater than 0.95, i.e., $u_P(\alpha, \beta) > 0.95$ and $u_B(\alpha, \beta) > 0.95$. In Table 3, no cell satisfies the stop condition, so we need to find the equilibrium of the payoff table and then repeat the competition. The Nash equilibrium of the payoff table can be calculated using Equation 12. The table cell with bold numbers is the equilibrium of Table 3. The thresholds of the equilibrium are $(\alpha, \beta) = (0.95, 0.05)$. In the second iteration, the initial thresholds are $(\alpha, \beta) = (0.95, 0.05)$. The actions of both players are $S_P = S_B =$
$\{(0.95, 0.05), (0.855, 0.055), (0.76, 0.06)\}$. A payoff table can be formulated according to Table 1, and we then obtain the thresholds $(\alpha, \beta) = (0.8075, 0.0575)$, at which the Gini coefficients of the regions satisfy the user's requirement, i.e.:

$$u_P(0.8075, 0.0575) = 0.9605 > 0.95, \qquad u_B(0.8075, 0.0575) = 0.9526 > 0.95.$$

The result shows that when $(\alpha, \beta) = (0.8075, 0.0575)$, the average Gini coefficient of the positive and negative regions is less than 0.05 and the Gini coefficient of the boundary region is less than 0.05.

5. CONCLUSION

Game-theoretic rough sets is a recent approach that determines an effective pair of probabilistic thresholds by formulating competition or cooperation among multiple criteria. In this paper, we use GTRS to analyze the Gini coefficient of probabilistic regions. The relationship between changes of the probabilistic thresholds and the Gini coefficients of the regions is investigated. A competitive game between regions is formulated and an iterative learning mechanism is adopted so that effective thresholds can be obtained. In future research, the proposed approach can be extended to real-world applications.

6. REFERENCES

[1] Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, Boston, USA, 1991.
[2] Y. Y. Yao, "Probabilistic rough set approximations," International Journal of Approximate Reasoning, vol. 49, no. 2, pp. 255–271, 2008.
[3] Y. Y. Yao, "Probabilistic approaches to rough sets," Expert Systems, vol. 20, no. 5, pp. 287–297, 2003.
[4] Y. Y. Yao, "Two semantic issues in a probabilistic rough set model," Fundamenta Informaticae, vol. 108, no. 3, pp. 249–265, 2011.
[5] Y. Y. Yao, "Decision-theoretic rough set models," Rough Sets and Knowledge Technology, vol. 4481, pp. 1–12.
[6] X. F. Deng and Y. Y. Yao, "An information-theoretic interpretation of thresholds in probabilistic rough sets," in 7th International Conference on Rough Sets and Knowledge Technology (RSKT'12), 2012, vol. 7413, pp. 232–241.
[7] J. P. Herbert and J. T. Yao, "Game-theoretic rough sets," Fundamenta Informaticae, vol. 108, no. 3-4, pp. 267–286, 2011.
[8] N. Azam and J. T. Yao, "Multiple criteria decision analysis with game-theoretic rough sets," in 7th International Conference on Rough Sets and Knowledge Technology (RSKT'12), 2012, vol. 7414, pp. 399–408.
[9] Y. Zhang and J. T. Yao, "Rule measures tradeoff using game-theoretic rough sets," in Proceedings of Brain Informatics 2012, 2012, vol. 7670, pp. 348–359.
[10] N. Azam and J. T. Yao, "Classifying attributes with game-theoretic rough sets," in Proceedings of the 4th International Conference on Intelligent Decision Technologies (IDT2012), 2012, vol. 15, pp. 175–184.
[11] N. Azam and J. T. Yao, "Analyzing uncertainties of probabilistic rough set regions with game-theoretic rough sets," submitted to International Journal of Approximate Reasoning, 2012.
[12] N. Azam, "Formulating three-way decisions with game-theoretic rough sets," in Proceedings of Canadian Conference on Electrical and Computer Engineering (CCECE13), 2013.
[13] C. Damgaard and J. Weiner, "Describing inequality in plant size or fecundity," Ecology, vol. 81, no. 4, pp. 1139–1142, 2000.
[14] P. M. Dixon, J. Weiner, T. Mitchell-Olds, and R. Woodley, "Bootstrapping the Gini coefficient of inequality," Ecology, pp. 1548–1551, 1987.
[15] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006.