
Multitask TSK Fuzzy System Modeling by Mining Intertask Common Hidden Structure

Yizhang Jiang, Member, IEEE, Fu-Lai Chung, Member, IEEE, Hisao Ishibuchi, Fellow, IEEE, Zhaohong Deng, Senior Member, IEEE, and Shitong Wang

Abstract—The classical fuzzy system modeling methods implicitly assume data generated from a single task, which is not in accordance with many practical scenarios where data can be acquired from the perspective of multiple tasks. Although one can build an individual fuzzy system model for each task, such an individual modeling approach tends to have poor generalization ability because it ignores the intertask hidden correlation. In order to circumvent this shortcoming, we consider a general framework that preserves the independent information of the different tasks while mining the hidden correlation information among all tasks in multitask fuzzy modeling. In this framework, a low-dimensional subspace (structure) is assumed to be shared among all tasks and hence serves as the hidden correlation information among them. Under this framework, a multitask Takagi–Sugeno–Kang (TSK) fuzzy system model called MTCS-TSK-FS (TSK-FS for multiple tasks with common hidden structure), based on the classical L2-norm TSK fuzzy system, is proposed in this paper. The proposed model can not only take advantage of the independent sample information of each task in the original space, but also effectively use the intertask common hidden structure among multiple tasks to enhance the generalization performance of the built fuzzy systems. Experiments on synthetic and real-world datasets demonstrate the applicability and distinctive performance of the proposed multitask fuzzy system model in multitask regression learning scenarios.

Index Terms—Common hidden structure, fuzzy modeling, multitask learning, Takagi–Sugeno–Kang (TSK) fuzzy systems.

Manuscript received September 19, 2013; revised April 11, 2014; accepted June 9, 2014. This work was supported in part by the Hong Kong Polytechnic University under Grant G-UA68, in part by the National Natural Science Foundation of China under Grant 61170122 and Grant 61272210, in part by the Natural Science Foundation of Jiangsu Province under Grant BK201221834, in part by JiangSu 333 Expert Engineering under Grant BRA2011142, in part by the Fundamental Research Funds for the Central Universities under Grant JUDCF13030, in part by the Ministry of Education Program for New Century Excellent Talents under Grant NCET-120882, and in part by 2013 Postgraduate Student’s Creative Research Fund of Jiangsu Province under Grant CXZZ13_0760. This paper was recommended by Associate Editor H. M. Schwartz. Y. Jiang, Z. Deng, and S. Wang are with the School of Digital Media, Jiangnan University, Wuxi 214122, China (e-mail: [email protected]; [email protected]; [email protected]). F.-L. Chung is with the Department of Computing, Hong Kong Polytechnic University, Hong Kong (e-mail: [email protected]). H. Ishibuchi is with the Department of Computer Science and Intelligent Systems, Osaka Prefecture University, Osaka 599-8531, Japan (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2330844

I. INTRODUCTION

A recent principle tells us that multitask learning, i.e., learning multiple related tasks simultaneously, performs better than learning these tasks independently [1]–[5]. The principal goal of multitask learning is to improve the generalization performance of learners by leveraging the domain-specific information contained in the related tasks [1]. One way to reach this goal is to learn multiple related tasks simultaneously while using a common representation; in fact, the training signals of the extra tasks serve as an inductive bias that helps to learn multiple complex tasks together [1]. Empirical and theoretical studies on multitask learning have been actively performed in three areas: multitask classification learning [2]–[13], multitask clustering [14]–[21], and multitask regression learning [22]–[28]. It has been shown in those studies that, when there are relations between the multiple tasks to learn, it is beneficial to learn them simultaneously instead of learning each task independently. Although those studies have indicated the significance of multitask learning and demonstrated its effectiveness in different real-world applications, the current multitask learning methods are still very limited and cannot keep up with real-world requirements, particularly in fuzzy modeling. This paper therefore focuses on multitask fuzzy modeling.

With the rapid development of data collection technologies, regression datasets are collected in different formats. Usually, the datasets obtained can be divided into four types: 1) single-input single-output (SISO); 2) single-input multi-output (SIMO); 3) multiple-input single-output (MISO); and 4) multi-input multi-output (MIMO). Except for the first and third types, i.e., SISO and MISO datasets, the other two types can be transformed into multitask regression learning scenarios. Taking a MIMO dataset as an example, we can always divide it into multiple MISO datasets, each of which is simpler to process than the original MIMO dataset; this corresponds to a typical multitask (regression) learning scenario. While each resulting MISO dataset can be modeled individually to preserve the independence of each task (MISO function), the intertask hidden correlation is then lost. In order to solve this problem, this paper proposes a novel modeling technique that effectively takes into consideration the hidden correlation information between the MISO datasets. The proposed multitask regression learning method thus makes good use of both the task-independent information and the intertask hidden correlation information of a given MIMO dataset.
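The column-wise split of a MIMO dataset into MISO tasks is mechanical; the following minimal sketch (ours, not part of the paper) illustrates it:

```python
import numpy as np

def mimo_to_miso_tasks(X, Y):
    """Split a MIMO dataset (X: N x d inputs, Y: N x q outputs) into q MISO tasks.
    All tasks share the input matrix X; each keeps one output column."""
    return [(X, Y[:, j]) for j in range(Y.shape[1])]

# Example: a 3-output MIMO dataset becomes three related MISO regression tasks.
X = np.random.randn(100, 5)
Y = np.random.randn(100, 3)
tasks = mimo_to_miso_tasks(X, Y)   # [(X, y1), (X, y2), (X, y3)]
```

The resulting MISO tasks are exactly the "different but related" tasks that the multitask learning framework of this paper is designed to exploit.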



Because of their interpretability and learning capability, fuzzy systems have been widely used in many fields such as intelligent control, signal processing, and pattern recognition [29]–[35], [57]. Like most state-of-the-art machine learning methods, however, current fuzzy systems have been designed for a single task. For multitask learning, one can build an individual fuzzy system for the dataset of each task, but the intertask hidden correlation information is then ignored, and the generalization performance of the overall system cannot be upgraded by mining such hidden correlation information. Hidden structure learning mechanisms have been attracting increasing attention in machine learning, and their effectiveness has been verified [36]–[41]. In this paper, we attempt to mine the common low-dimensional structure among all tasks to enhance the generalization performance of the built fuzzy systems. By exploiting both the task-independent information in the original space and the intertask common low-dimensional hidden structure, a new multitask fuzzy system modeling method is proposed. It is rooted in the classical Takagi–Sugeno–Kang (TSK) type fuzzy systems and makes use of a novel objective function constructed to synthesize task independence and intertask hidden correlation. Our experimental results indicate that the proposed fuzzy system modeling method indeed enhances the generalization performance of the trained system and hence is very promising for multitask regression learning.

The contributions of this paper can be highlighted as follows.
1) Fuzzy system modeling is studied for the first time from the viewpoint of multitask learning among all tasks, and a multitask fuzzy system learning framework is developed accordingly.
2) Within the multitask learning framework, a learning mechanism based on the common hidden structure is proposed for the TSK fuzzy system. In our approach, a new objective function for multitask learning is formulated to integrate the task-independent information and the task correlation information, the latter being captured by the common low-dimensional hidden structure among all tasks.
3) Experimental results on synthetic and real-world multitask regression datasets demonstrate that the proposed method outperforms standard single-task TSK-type fuzzy system modeling methods in multitask scenarios.

The rest of this paper is organized as follows. In Section II, the concept and principle of the classical TSK-FS is briefly reviewed, and a novel multitask TSK fuzzy system based on an intertask common hidden structure is then proposed. In Section III, the learning algorithm of the classical TSK-FS is briefly introduced, and a novel multitask TSK fuzzy system model called MTCS-TSK-FS (TSK-FS for multiple tasks with common hidden structure), based on the ε-insensitive criterion and L2-norm penalty terms, is then proposed. The proposed system is extensively evaluated with the experimental results reported in Section IV. Conclusions are given in Section V.


II. TSK FUZZY SYSTEMS FOR MULTIPLE TASKS WITH COMMON HIDDEN STRUCTURE

A. Problem Definition

Let us first describe our multitask learning problem as follows. Given a set of K regression tasks $\{T_1, T_2, \ldots, T_K\}$, for each individual task $T_k$ ($1 \le k \le K$) we have a set of $N_k$ training data samples $D_k = [\mathbf{X}_k, \mathbf{Y}_k]$ with an input dataset $\mathbf{X}_k = \{\mathbf{x}_{i,k},\ i = 1,\ldots,N_k\}$, $\mathbf{x}_{i,k} \in \mathbb{R}^d$, and an output dataset $\mathbf{Y}_k = \{y_{i,k},\ i = 1,\ldots,N_k\}$. In this multitask scenario, our main purpose is to design a novel TSK fuzzy model that achieves the following two goals: 1) preserving the independence information to take full advantage of the characteristics of each task and 2) mining the intertask hidden correlation information to enhance the generalization performance of the built fuzzy systems. To achieve these goals, we propose a novel multitask TSK fuzzy system based on an intertask common hidden structure in this section; its training algorithm will be introduced in the next section.

B. Classical Single-Task TSK Fuzzy System

Let us first review the classical single-task TSK fuzzy system [42]. For the classical TSK fuzzy system, the most commonly used fuzzy inference rules are of the following form:

TSK Fuzzy Rule $R^m$: If $x_1$ is $A_1^m \wedge x_2$ is $A_2^m \wedge \cdots \wedge x_d$ is $A_d^m$,
$$\text{then } f^m(\mathbf{x}) = p_0^m + p_1^m x_1 + \cdots + p_d^m x_d,\qquad m = 1,\ldots,M. \qquad (1)$$

In (1), $A_i^m$ is a fuzzy subset on the input variable $x_i$ for the mth rule, which represents a linguistic variable in the corresponding subspace; M is the number of fuzzy rules, and ∧ is a fuzzy conjunction operator. Each rule is premised on the input vector $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$ and maps a fuzzy subspace $A^m \subset \mathbb{R}^d$ in the input space to a varying singleton denoted by $f^m(\mathbf{x})$. When multiplicative conjunction is employed as the conjunction operator, multiplicative implication as the implication operator, and additive disjunction as the disjunction operator, the output of the TSK fuzzy model can be formulated as
$$y^o = \sum_{m=1}^{M}\frac{\mu^m(\mathbf{x})}{\sum_{m'=1}^{M}\mu^{m'}(\mathbf{x})}\cdot f^m(\mathbf{x}) = \sum_{m=1}^{M}\tilde{\mu}^m(\mathbf{x})\cdot f^m(\mathbf{x}) \qquad (2.a)$$
where $\mu^m(\mathbf{x})$ and $\tilde{\mu}^m(\mathbf{x})$ denote the compatibility grade and the normalized compatibility grade with the antecedent part (i.e., fuzzy subspace) $A^m$, respectively. These two functions can be calculated as
$$\mu^m(\mathbf{x}) = \prod_{i=1}^{d}\mu_{A_i^m}(x_i) \qquad (2.b)$$
$$\tilde{\mu}^m(\mathbf{x}) = \mu^m(\mathbf{x})\Big/\sum_{m'=1}^{M}\mu^{m'}(\mathbf{x}). \qquad (2.c)$$
A commonly used membership function is the Gaussian membership function, which can be expressed as
$$\mu_{A_i^m}(x_i) = \exp\!\left(-\left(x_i - c_i^m\right)^2\big/2\delta_i^m\right) \qquad (2.d)$$


where the parameters $c_i^m$ and $\delta_i^m$ can be estimated by clustering techniques or other partition methods. For example, with fuzzy c-means (FCM) clustering, $c_i^m$ and $\delta_i^m$ can be estimated as follows:
$$c_i^m = \sum_{j=1}^{N}u_{jm}\,x_{ji}\Big/\sum_{j=1}^{N}u_{jm} \qquad (2.e)$$
$$\delta_i^m = h\cdot\sum_{j=1}^{N}u_{jm}\left(x_{ji}-c_i^m\right)^2\Big/\sum_{j=1}^{N}u_{jm} \qquad (2.f)$$

where $u_{jm}$ denotes the membership degree of the jth input datum $\mathbf{x}_j = (x_{j1},\ldots,x_{jd})^T$ to the mth cluster obtained by FCM clustering [43], [44] or other partition methods. Here h is a scalar parameter that can be adjusted manually. When the premise of the TSK fuzzy model is determined, let
$$\mathbf{x}_e = (1, \mathbf{x}^T)^T \qquad (3.a)$$
$$\tilde{\mathbf{x}}^m = \tilde{\mu}^m(\mathbf{x})\,\mathbf{x}_e \qquad (3.b)$$
$$\mathbf{x}_g = ((\tilde{\mathbf{x}}^1)^T, (\tilde{\mathbf{x}}^2)^T, \ldots, (\tilde{\mathbf{x}}^M)^T)^T \qquad (3.c)$$
$$\mathbf{p}^m = (p_0^m, p_1^m, \ldots, p_d^m)^T \qquad (3.d)$$
$$\mathbf{p}_g = ((\mathbf{p}^1)^T, (\mathbf{p}^2)^T, \ldots, (\mathbf{p}^M)^T)^T \qquad (3.e)$$
then (2.a) can be formulated as the following linear regression problem:
$$y^o = \mathbf{p}_g^T\,\mathbf{x}_g. \qquad (3.f)$$
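To make the premise construction and the mapping of (2.d)–(3.c) concrete, here is a minimal numerical sketch (ours, not the authors' code; it assumes Gaussian membership functions with FCM-estimated centers and widths, and uses a hand-rolled FCM in place of a library implementation):

```python
import numpy as np

def fcm(X, M, m_exp=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means; returns memberships U (N x M) and centers C (M x d)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, M)); U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m_exp
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]              # cluster centers
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1) + 1e-12
        U = 1.0 / (d2 ** (1.0 / (m_exp - 1)))                 # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return U, C

def antecedents(X, M, h=1.0):
    """Estimate centers c and widths delta per rule via (2.e)-(2.f)."""
    U, C = fcm(X, M)
    delta = np.stack([(U[:, m:m+1] * (X - C[m]) ** 2).sum(0) / U[:, m].sum()
                      for m in range(M)]) * h + 1e-12
    return C, delta

def tsk_inputs(X, C, delta):
    """Map raw inputs X (N x d) to the regression vectors x_g of (3.c)."""
    mu = np.exp(-((X[:, None, :] - C[None, :, :]) ** 2) / (2 * delta[None])).prod(-1)  # (2.b),(2.d)
    mu_t = mu / (mu.sum(axis=1, keepdims=True) + 1e-12)       # (2.c)
    Xe = np.hstack([np.ones((X.shape[0], 1)), X])             # (3.a)
    return (mu_t[:, :, None] * Xe[:, None, :]).reshape(X.shape[0], -1)  # (3.b)-(3.c)
```

With $\mathbf{x}_g$ assembled this way, (3.f) reduces consequent learning to ordinary linear regression in $\mathbf{p}_g$.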

Thus, the training problem of the above TSK model can be transformed into the learning of the parameters in the corresponding linear regression model [33], [34], [45].

1) Discussion (Classical TSK Fuzzy Systems for Multitask Scenarios): The classical fuzzy system modeling strategies were designed for a single-task scenario, and hence they suffer certain deficiencies in multitask learning. For example, a straightforward modeling strategy is to construct individual fuzzy systems for the different tasks from the samples of the corresponding tasks, as shown in Fig. 1. Although the resulting framework provides a viable solution to various multitask learning application scenarios, individual modeling in fact ignores the relationship between tasks, which is important for boosting the generalization ability of the fuzzy systems of the different tasks.

C. Multitask TSK Fuzzy System With Intertask Common Hidden Structure

In order to empower the classical single-task TSK fuzzy system modeling methods with multitask learning ability, a multitask TSK fuzzy system based on an intertask common hidden structure (MTCS-TSK-FS) is proposed in this paper. As mentioned at the beginning of this section, the proposed MTCS-TSK-FS has two goals: it should preserve the independence information of each task and mine the intertask hidden correlation information simultaneously. To achieve these two goals, we propose to divide the output $f_k^m(\mathbf{x}_k)$ for the mth rule of task k into two parts, i.e., a common part and an individual part. For the common part, a common output $S^m(\mathbf{x}_k)$ is defined.

Fig. 1. Fuzzy systems for multiple tasks: Independent learning strategy.

In order to mine the intertask hidden correlation information among all tasks, the corresponding parameters $\mathbf{U}$ of $S^m(\cdot)$ and the common low-dimensional (r-dimensional) hidden structure $\mathbf{H}^m$ are defined. Meanwhile, for the individual part, we define an individual output $g_k^m(\mathbf{x}_k)$, and the corresponding parameters $\boldsymbol{\theta}_k^m$ are also defined to preserve the independence information of each task in the original space. Overall, the novel multitask fuzzy inference rules are defined as follows:

TSK Fuzzy Rule $R_k^m$: If $x_{1,k}$ is $A_{1,k}^m \wedge x_{2,k}$ is $A_{2,k}^m \wedge \cdots \wedge x_{d,k}$ is $A_{d,k}^m$, then
$$f_k^m(\mathbf{x}_k) = S^m(\mathbf{x}_k) + g_k^m(\mathbf{x}_k) = (\mathbf{U})^T\mathbf{H}^m\mathbf{x}_k + (\boldsymbol{\theta}_k^m)^T\mathbf{x}_k$$
$$= (\mathbf{U})^T\mathbf{h}_0^m + (\mathbf{U})^T\mathbf{h}_1^m x_{1,k} + \cdots + (\mathbf{U})^T\mathbf{h}_d^m x_{d,k} + \theta_{0,k}^m + \theta_{1,k}^m x_{1,k} + \cdots + \theta_{d,k}^m x_{d,k}$$
$$= \left((\mathbf{U})^T\mathbf{h}_0^m + \theta_{0,k}^m\right) + \left((\mathbf{U})^T\mathbf{h}_1^m + \theta_{1,k}^m\right)x_{1,k} + \cdots + \left((\mathbf{U})^T\mathbf{h}_d^m + \theta_{d,k}^m\right)x_{d,k}$$
$$m = 1,\ldots,M,\quad k = 1,\ldots,K. \qquad (4)$$

In (4), M is the number of fuzzy rules, K is the number of tasks, and the meanings of the other symbols are the same as in classical TSK fuzzy systems. Each rule is premised on the input vector $\mathbf{x}_k = (x_{1,k}, x_{2,k}, \ldots, x_{d,k})^T \in \mathbb{R}^{d\times 1}$ for each task k and maps a fuzzy subspace $A_k^m \subset \mathbb{R}^d$ in the input space to a varying singleton denoted by $f_k^m(\mathbf{x}_k)$, which consists of two parts: 1) the individual part $g_k^m(\cdot)$ for each task k, characterized by the corresponding parameters $\boldsymbol{\theta}_k^m = (\theta_{0,k}^m, \theta_{1,k}^m, \theta_{2,k}^m, \ldots, \theta_{d,k}^m)^T \in \mathbb{R}^{(d+1)\times 1}$ generated from the independence information of each task in the original, (d+1)-dimensional space; and 2) the common part $S^m(\cdot)$ among all tasks, characterized by the corresponding parameters $\mathbf{U} = (u_1,\ldots,u_r)^T \in \mathbb{R}^{r\times 1}$ of $S^m(\cdot)$ and the common low-dimensional (r-dimensional) hidden structure $\mathbf{H}^m = (\mathbf{h}_0^m, \mathbf{h}_1^m, \ldots, \mathbf{h}_d^m) \in \mathbb{R}^{r\times(d+1)}$, $\mathbf{h}_i^m = (h_1,\ldots,h_r)^T \in \mathbb{R}^{r\times 1}$, $i = 0, 1, \ldots, d$, obtained by mining the intertask hidden correlation information among all tasks. As in the classical single-task TSK model, the output of the multitask TSK fuzzy model for task k can be formulated as

$$y_k^0 = \sum_{m=1}^{M}\frac{\mu_k^m(\mathbf{x})}{\sum_{m'=1}^{M}\mu_k^{m'}(\mathbf{x})}\cdot f_k^m(\mathbf{x}) = \sum_{m=1}^{M}\tilde{\mu}_k^m(\mathbf{x})\cdot f_k^m(\mathbf{x}) \qquad (5.a)$$


where $\mu_k^m(\mathbf{x})$ and $\tilde{\mu}_k^m(\mathbf{x})$ denote the fuzzy membership function and the normalized fuzzy membership associated with the fuzzy set $A_k^m$ for each task k, respectively. These two functions can be calculated by using
$$\mu_k^m(\mathbf{x}) = \prod_{i=1}^{d}\mu_{A_{i,k}^m}(x_{i,k}) \qquad (5.b)$$
$$\tilde{\mu}_k^m(\mathbf{x}) = \mu_k^m(\mathbf{x})\Big/\sum_{m'=1}^{M}\mu_k^{m'}(\mathbf{x}). \qquad (5.c)$$

Similar to the classical TSK fuzzy system, for each task k we use the Gaussian membership function as the fuzzy membership function
$$\mu_{A_{i,k}^m}(x_{i,k}) = \exp\!\left(\frac{-\left(x_{i,k}-c_{i,k}^m\right)^2}{2\delta_{i,k}^m}\right) \qquad (5.d)$$
where FCM clustering is again used; $c_{i,k}^m$ and $\delta_{i,k}^m$ for each task k can be estimated as follows:
$$c_{i,k}^m = \sum_{j=1}^{N}u_{jm,k}\,x_{ji,k}\Big/\sum_{j=1}^{N}u_{jm,k} \qquad (5.e)$$
$$\delta_{i,k}^m = h_k\cdot\sum_{j=1}^{N}u_{jm,k}\left(x_{ji,k}-c_{i,k}^m\right)^2\Big/\sum_{j=1}^{N}u_{jm,k} \qquad (5.f)$$

where $u_{jm,k}$ denotes the fuzzy membership of the jth input datum $\mathbf{x}_{j,k} = (x_{j1,k},\ldots,x_{jd,k})^T$ of task k to the mth cluster obtained by FCM clustering. Here $h_k$ is a scalar parameter that can be adjusted manually. Similar to the classical TSK fuzzy system (see the above subsection), when the premise of the multitask TSK fuzzy model for each task k is determined, let
$$\mathbf{x}_{e,k} = (1, \mathbf{x}_k^T)^T \qquad (6.a)$$
$$\tilde{\mathbf{x}}_k^m = \tilde{\mu}_k^m(\mathbf{x}_k)\,\mathbf{x}_{e,k} \qquad (6.b)$$
$$\mathbf{x}_{g,k} = ((\tilde{\mathbf{x}}_k^1)^T, (\tilde{\mathbf{x}}_k^2)^T, \ldots, (\tilde{\mathbf{x}}_k^M)^T)^T \qquad (6.c)$$
$$\tilde{\mathbf{p}}_k^m = \left((\mathbf{U})^T\mathbf{h}_0^m + \theta_{0,k}^m,\ (\mathbf{U})^T\mathbf{h}_1^m + \theta_{1,k}^m,\ \ldots,\ (\mathbf{U})^T\mathbf{h}_d^m + \theta_{d,k}^m\right)^T \qquad (6.d)$$
$$\tilde{\mathbf{p}}_{g,k} = \left(((\mathbf{H}^1)^T\mathbf{U} + \boldsymbol{\theta}_{g,k}^1)^T,\ ((\mathbf{H}^2)^T\mathbf{U} + \boldsymbol{\theta}_{g,k}^2)^T,\ \ldots,\ ((\mathbf{H}^M)^T\mathbf{U} + \boldsymbol{\theta}_{g,k}^M)^T\right)^T \qquad (6.e)$$
$$\tilde{\mathbf{p}}_{g,k} = ((\tilde{\mathbf{p}}_k^1)^T, (\tilde{\mathbf{p}}_k^2)^T, \ldots, (\tilde{\mathbf{p}}_k^M)^T)^T = \tilde{\mathbf{H}}^T\mathbf{U} + \tilde{\boldsymbol{\theta}}_k \qquad (6.f)$$
where $\tilde{\mathbf{H}} = (\mathbf{H}^1,\ldots,\mathbf{H}^M)^T \in \mathbb{R}^{r\times(M\cdot(d+1))}$ and $\tilde{\boldsymbol{\theta}}_k = ((\boldsymbol{\theta}_{g,k}^1)^T,\ldots,(\boldsymbol{\theta}_{g,k}^M)^T)^T \in \mathbb{R}^{(M\cdot(d+1))\times 1}$; then (5.a) can be formulated as the following linear regression problem:
$$y_k^0 = \tilde{\mathbf{p}}_{g,k}^T\,\mathbf{x}_{g,k}. \qquad (6.g)$$

Thus, the training problem of the above multitask TSK model for each task k can also be transformed into the learning of the parameters in the corresponding linear regression model. The framework for the construction of MTCS-TSK-FS is depicted in Fig. 2, which shows its modeling strategy: each fuzzy system is trained in a multitask learning manner on the multitask dataset and thereby makes full use of the intertask hidden correlation information. In the following section, the learning algorithm of MTCS-TSK-FS based on the ε-insensitive criterion and L2-norm penalty terms will be elaborated (Algorithm 1).

Fig. 2. Fuzzy systems for multiple tasks: Multitask learning strategy.

III. TSK FUZZY MODEL LEARNING FOR MULTIPLE TASKS WITH COMMON HIDDEN STRUCTURE

In this section, the ε-insensitive criterion and L2-norm penalty-based TSK fuzzy system (L2-TSK-FS) learning algorithm is first reviewed briefly. Then the learning algorithm of MTCS-TSK-FS based on the ε-insensitive criterion and L2-norm penalty terms is introduced in detail (Algorithm 1).

A. Classical Single-Task TSK Fuzzy Model Learning

Given a training dataset $D_{tr} = \{\mathbf{x}_i, y_i \mid \mathbf{x}_i \in \mathbb{R}^d, y_i \in \mathbb{R}, i = 1,\ldots,N\}$, for fixed antecedents obtained via clustering of the input space (or by other partition techniques), the least squares (LS) solution of the consequent parameters minimizes the following LS criterion function [46]:

$$\min_{\mathbf{p}_g} E = \sum_{i=1}^{N}\left(y_i^o - y_i\right)^2 = \sum_{i=1}^{N}\left(\mathbf{p}_g^T\mathbf{x}_{gi} - y_i\right)^2 = (\mathbf{y} - \mathbf{X}_g\mathbf{p}_g)^T(\mathbf{y} - \mathbf{X}_g\mathbf{p}_g) \qquad (7)$$

where $\mathbf{X}_g = [\mathbf{x}_{g1},\ldots,\mathbf{x}_{gN}]^T \in \mathbb{R}^{N\times(M\cdot(d+1))}$ and $\mathbf{y} = [y_1,\ldots,y_N]^T \in \mathbb{R}^N$. The most popular LS-criterion-based TSK fuzzy system learning algorithm is the one used in the adaptive-network-based fuzzy inference system (ANFIS) [46]. A major shortcoming of this type of algorithm is its weak robustness for modeling tasks involving noisy and/or small datasets.
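As a reference point for (7), the LS consequents admit the usual closed form; here is a minimal sketch (ours; the small ridge term is our addition for numerical stability, not part of the criterion):

```python
import numpy as np

def ls_consequents(Xg, y, ridge=1e-8):
    """Closed-form minimizer of (7): pg = argmin ||y - Xg pg||^2.
    Xg: N x (M*(d+1)) regression inputs from (3.c); y: N targets.
    The tiny ridge term guards against a singular Xg^T Xg (our addition)."""
    A = Xg.T @ Xg + ridge * np.eye(Xg.shape[1])
    return np.linalg.solve(A, Xg.T @ y)
```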


In addition to the LS-criterion-based TSK fuzzy system learning algorithm, another important representative, the ε-insensitive-criterion-based TSK-FS learning method, has been developed by employing L1-norm penalty terms [45] and L2-norm penalty terms [33], [34]. Compared with the L1-norm penalty-based TSK fuzzy system (L1-TSK-FS) learning algorithms, the L2-norm penalty-based algorithms (L2-TSK-FS) have shown further advantages in their improved versions, such as the MEB-based L2-TSK-FS for very large datasets [33] and the knowledge-leverage-based transfer learning L2-TSK-FS for missing data [34]. Here, we mainly focus on the L2-TSK-FS learning algorithm in [33] since it is most related to our work in this paper.

For TSK fuzzy system training, the ε-insensitive objective function is defined as follows. Given a scalar g and a vector $\mathbf{g} = [g_1,\ldots,g_d]^T$, the corresponding ε-insensitive loss functions take the following forms, respectively: $|g|_\varepsilon = g - \varepsilon$ for $g > \varepsilon$, $|g|_\varepsilon = 0$ for $g \le \varepsilon$, and $|\mathbf{g}|_\varepsilon = \sum_{i=1}^{d}|g_i|_\varepsilon$. For the linear regression problem of the TSK fuzzy model in (3.f), the corresponding ε-insensitive-loss-based criterion function [45] is defined as
$$\min_{\mathbf{p}_g} E = \sum_{i=1}^{N}\left|y_i^o - y_i\right|_\varepsilon = \sum_{i=1}^{N}\left|\mathbf{p}_g^T\mathbf{x}_{gi} - y_i\right|_\varepsilon. \qquad (8.a)$$

In general, the inequalities $y_i - \mathbf{p}_g^T\mathbf{x}_{gi} < \varepsilon$ and $\mathbf{p}_g^T\mathbf{x}_{gi} - y_i < \varepsilon$ are not satisfied for all data pairs $(\mathbf{x}_{gi}, y_i)$. Further, by introducing the regularization term [45], (8.a) is modified to
$$\min_{\mathbf{p}_g}\ g(\mathbf{p}_g, \varepsilon) = \frac{1}{N\tau}\sum_{i=1}^{N}\left|\mathbf{p}_g^T\mathbf{x}_{gi} - y_i\right|_\varepsilon + \frac{1}{2}\mathbf{p}_g^T\mathbf{p}_g \qquad (8.b)$$
where $|\cdot|_\varepsilon$ is the ε-insensitive measure and the balance parameter $\tau\,(>0)$ controls the tradeoff between the complexity of the regression model and the tolerance of the errors. When L2-norm penalty terms with slack variables $\xi_i^+$ and $\xi_i^-$ are introduced, the corresponding objective function of the L2-TSK-FS can be formulated as follows [33]:
$$\min_{\mathbf{p}_g,\boldsymbol{\xi}^+,\boldsymbol{\xi}^-,\varepsilon}\ g(\mathbf{p}_g,\boldsymbol{\xi}^+,\boldsymbol{\xi}^-,\varepsilon) = \frac{1}{\tau}\cdot\frac{1}{N}\sum_{i=1}^{N}\left[(\xi_i^+)^2 + (\xi_i^-)^2\right] + \frac{1}{2}\mathbf{p}_g^T\mathbf{p}_g + \frac{2}{\tau}\cdot\varepsilon$$
$$\text{s.t.}\quad \begin{cases} y_i - \mathbf{p}_g^T\mathbf{x}_{gi} < \varepsilon + \xi_i^+ \\ \mathbf{p}_g^T\mathbf{x}_{gi} - y_i < \varepsilon + \xi_i^- \end{cases}\quad \forall i. \qquad (9.a)$$

Compared with the L1-norm penalty-based ε-insensitive criterion function [45], the L2-norm penalty-based criterion [33] is advantageous because of the following characteristics: 1) the constraints $\xi_i^+ \ge 0$ and $\xi_i^- \ge 0$ in the objective function of L1-TSK-FS [45] are not needed for the optimization and 2) the insensitivity parameter ε can be obtained automatically by optimization without the need for manual setting. Similar properties can also be found in other L2-norm penalty-based machine learning algorithms, such as L2-SVR [47]. Based on optimization theory, the dual of (9.a) can be formulated as the following QP problem:
$$\max_{\boldsymbol{\alpha}^+,\boldsymbol{\alpha}^-}\ -\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\,\mathbf{x}_{gi}^T\mathbf{x}_{gj} - \sum_{i=1}^{N}\frac{N\tau}{2}(\alpha_i^+)^2 - \sum_{i=1}^{N}\frac{N\tau}{2}(\alpha_i^-)^2 + \sum_{i=1}^{N}\alpha_i^+ y_i\tau - \sum_{i=1}^{N}\alpha_i^- y_i\tau$$
$$\text{s.t.}\quad \sum_{i=1}^{N}(\alpha_i^+ + \alpha_i^-) = 1,\quad \alpha_i^+, \alpha_i^- \ge 0\quad \forall i. \qquad (9.b)$$

Notably, the structure of the QP problem in (9.b) enables the use of the coreset-based minimal enclosing ball (MEB) approximation technique to solve problems involving very large datasets [47]; the scalable L2-TSK-FS learning algorithm (STSK) has been proposed in this regard [33]. Equation (9.b) can also be used to develop a transfer learning version of the TSK fuzzy system to solve the problem of missing data [34].

B. TSK Fuzzy Model Learning for Multiple Tasks With Common Hidden Structure

Given the advantages of L2-TSK-FS, we mainly focus on the L2-TSK-FS learning algorithm, which is most related to our work in this paper. As distinguished from the single-task learning algorithm, when designing the objective function of the multitask common-hidden-structure TSK fuzzy system based on the classic ε-insensitive criterion and L2-norm penalty terms, we must consider how to maintain a balance between the unique characteristics of the data samples of the different tasks (independence information) and the correlation information (the intertask hidden correlation obtained by mining the common hidden structure among all tasks), and how to integrate the two. Accordingly, the following objective function is defined:
$$\min_{\tilde{\boldsymbol{\theta}},\mathbf{U},\tilde{\mathbf{H}},\boldsymbol{\xi},\boldsymbol{\varepsilon}}\ J(\tilde{\boldsymbol{\theta}},\mathbf{U},\tilde{\mathbf{H}},\boldsymbol{\xi},\boldsymbol{\varepsilon}) = \sum_{k=1}^{K} g_k(\tilde{\boldsymbol{\theta}}_k,\boldsymbol{\xi}_k,\varepsilon_k) + S(\tilde{\mathbf{H}}^T\mathbf{U})$$
$$\text{s.t.}\quad \begin{cases} y_{i,k} - (\tilde{\mathbf{H}}^T\mathbf{U} + \tilde{\boldsymbol{\theta}}_k)^T\mathbf{x}_{gi,k} < \varepsilon_k + \xi_{i,k}^+ \\ (\tilde{\mathbf{H}}^T\mathbf{U} + \tilde{\boldsymbol{\theta}}_k)^T\mathbf{x}_{gi,k} - y_{i,k} < \varepsilon_k + \xi_{i,k}^- \\ \tilde{\mathbf{H}}\tilde{\mathbf{H}}^T = \mathbf{I}_{r\times r} \end{cases}\quad \forall i,k \qquad (10)$$
where
$$g_k(\tilde{\boldsymbol{\theta}}_k,\boldsymbol{\xi}_k,\varepsilon_k) = \frac{\lambda}{K^2}\tilde{\boldsymbol{\theta}}_k^T\tilde{\boldsymbol{\theta}}_k + \frac{1}{N_k\tau_k}\sum_{i=1}^{N_k}\left[(\xi_{i,k}^+)^2 + (\xi_{i,k}^-)^2\right] + \frac{2}{\tau_k}\varepsilon_k \qquad (10.a)$$
$$S(\tilde{\mathbf{H}}^T\mathbf{U}) = \frac{1}{2}(\tilde{\mathbf{H}}^T\mathbf{U})^T(\tilde{\mathbf{H}}^T\mathbf{U}). \qquad (10.b)$$
It can be seen that (10.a) and (10.b) play different roles in (10): (10.a) represents the independence information in the independent regression sample space (the $M\cdot(d+1)$-dimensional space) of each task, while (10.b) represents the correlation information in the common hidden structure among all tasks. Specifically, the vector $\tilde{\boldsymbol{\theta}}_k \in \mathbb{R}^{(M\cdot(d+1))\times 1}$ represents the individual consequents of each task; it tends to zero when the different tasks are similar to each other, and otherwise the common hidden structure consequents $\mathbf{U} \in \mathbb{R}^{r\times 1}$ tend to zero. We can then obtain the model parameters $\tilde{\mathbf{p}}_{g,k}$ for task k via $\tilde{\mathbf{p}}_{g,k} = \tilde{\mathbf{H}}^T\mathbf{U} + \tilde{\boldsymbol{\theta}}_k$. In other words, the common structure $\tilde{\mathbf{H}} \in \mathbb{R}^{r\times(M\cdot(d+1))}$ maps the original regression sample space (the $M\cdot(d+1)$-dimensional space) into a low-dimensional hidden structure (an r-dimensional space, $r < M\cdot(d+1)$).
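The dimensions involved in this decomposition can be sanity-checked in a few lines (our illustration; all sizes below are arbitrary):

```python
import numpy as np

# Shapes in the decomposition p_gk = H^T u + theta_k of (6.f):
M, d, r, K = 5, 4, 2, 3                          # rules, input dim, hidden dim, tasks
D = M * (d + 1)                                  # per-task consequent dimension

H = np.linalg.qr(np.random.randn(D, r))[0].T     # shared structure with H H^T = I_r (r x D)
u = np.random.randn(r)                           # shared consequents in hidden space
thetas = [np.random.randn(D) for _ in range(K)]  # task-specific consequents

p_g = [H.T @ u + theta_k for theta_k in thetas]  # per-task model parameters, as in (19)
assert np.allclose(H @ H.T, np.eye(r))           # orthonormality constraint of (10)
```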


Note here that the regularization parameters $\tau_k > 0$ control the tradeoff between the complexity of the regression model and the tolerance of errors, and the parameter λ has an impact on $\tilde{\boldsymbol{\theta}}_k$: when $\lambda \to +\infty$, each $\tilde{\boldsymbol{\theta}}_k$ tends to $\mathbf{0}$, denoting strong correlation and weak independence; when $\lambda \to 0$, each $\tilde{\boldsymbol{\theta}}_k$ is essentially unconstrained, denoting strong independence and weak correlation. The values of the vector τ and the constant λ can be set manually or determined by the cross-validation strategy [48].

C. Parameter Solution for MTCS-TSK-FS

In (10), all of the variables $\tilde{\boldsymbol{\theta}}$, $\mathbf{U}$, and $\tilde{\mathbf{H}}$ are required to be optimized, and solving this problem directly is not a trivial task. In this paper, an iterative method is adopted, as in our previous work [35]. The optimization procedure contains three main steps.

Step 1: The computation of $\tilde{\boldsymbol{\theta}}^*$ and $\mathbf{U}^*$. When the common structural parameter $\tilde{\mathbf{H}}$ is fixed, (10) becomes the typical quadratic programming (QP) problem in (11)–(14.c):
$$L(\boldsymbol{\lambda}_1^+,\ldots,\boldsymbol{\lambda}_K^+,\boldsymbol{\lambda}_1^-,\ldots,\boldsymbol{\lambda}_K^-) = -\frac{1}{2}\Bigg(\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_l}(\lambda_{j,l}^+ - \lambda_{j,l}^-)(\lambda_{i,k}^+ - \lambda_{i,k}^-)\mathbf{x}_{gj,l}^T\mathbf{x}_{gi,k} + \frac{K}{\lambda}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}(\lambda_{i,k}^+ - \lambda_{i,k}^-)(\lambda_{j,k}^+ - \lambda_{j,k}^-)\mathbf{x}_{gj,k}^T\mathbf{x}_{gi,k} + \sum_{k=1}^{K}\frac{N_k\tau_k}{2}\sum_{i=1}^{N_k}\left[(\lambda_{i,k}^+)^2 + (\lambda_{i,k}^-)^2\right]\Bigg) + \sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^+ - \lambda_{i,k}^-)y_{i,k}$$
$$\text{s.t.}\quad \boldsymbol{\lambda}_k^+ \ge \mathbf{0},\ \boldsymbol{\lambda}_k^- \ge \mathbf{0},\quad \sum_{i=1}^{N_k}(\lambda_{i,k}^+ + \lambda_{i,k}^-) = \frac{2}{\tau_k}\quad \forall k,\ k = 1,\ldots,K. \qquad (11)$$
Let
$$\boldsymbol{\upsilon} = (\tilde{\lambda}_{1,1}^+,\ldots,\tilde{\lambda}_{N_1,1}^+,\tilde{\lambda}_{1,1}^-,\ldots,\tilde{\lambda}_{N_1,1}^-,\ldots,\tilde{\lambda}_{1,K}^+,\ldots,\tilde{\lambda}_{N_K,K}^+,\tilde{\lambda}_{1,K}^-,\ldots,\tilde{\lambda}_{N_K,K}^-)^T = ((\boldsymbol{\lambda}_1^+)^T,(\boldsymbol{\lambda}_1^-)^T,\ldots,(\boldsymbol{\lambda}_K^+)^T,(\boldsymbol{\lambda}_K^-)^T)^T \qquad (12.a)$$
$$\mathbf{z}_{i,k} = \begin{cases}\mathbf{x}_{gi,k}, & i = 1,\ldots,N_k\\ -\mathbf{x}_{g(i-N_k),k}, & i = N_k+1,\ldots,2N_k\end{cases} \qquad (12.b)$$
$$\boldsymbol{\beta} = (\mathbf{y}_1^T,-\mathbf{y}_1^T,\ldots,\mathbf{y}_K^T,-\mathbf{y}_K^T)^T,\quad \mathbf{y}_k = (y_{1,k},\ldots,y_{N_k,k})^T. \qquad (12.c)$$
Equation (11) can then be reformulated as
$$\arg\max_{\boldsymbol{\upsilon}}\ -\frac{1}{2}\boldsymbol{\upsilon}^T\mathbf{K}\boldsymbol{\upsilon} + \boldsymbol{\upsilon}^T\boldsymbol{\beta}\qquad \text{s.t.}\quad \boldsymbol{\upsilon}_k^T\mathbf{1} = \frac{2}{\tau_k},\quad \upsilon_{i,k} \ge 0\quad \forall i,k,\ k = 1,\ldots,K \qquad (13)$$
where
$$\tilde{\mathbf{K}}_k = \left[\tilde{k}_{ij}\right]_{2N_k\times 2N_k},\quad \tilde{k}_{ij} = \frac{K}{\lambda}\mathbf{z}_{gj,k}^T\mathbf{z}_{gi,k} + \frac{N_k\tau_k}{2}\delta_{ij},\quad \delta_{ij} = \begin{cases}1, & i = j\\ 0, & i \ne j\end{cases} \qquad (14.a)$$
$$\hat{\mathbf{K}}_{k,l} = \left[\hat{k}_{ij}\right]_{2N_l\times 2N_k},\quad \hat{k}_{ij} = \mathbf{z}_{gj,l}^T\mathbf{z}_{gi,k} \qquad (14.b)$$
$$\mathbf{K} = \begin{pmatrix}\tilde{\mathbf{K}}_1 + \hat{\mathbf{K}}_{1,1} & \hat{\mathbf{K}}_{2,1} & \cdots & \hat{\mathbf{K}}_{K,1}\\ \hat{\mathbf{K}}_{1,2} & \tilde{\mathbf{K}}_2 + \hat{\mathbf{K}}_{2,2} & \cdots & \hat{\mathbf{K}}_{K,2}\\ \vdots & \vdots & \ddots & \vdots\\ \hat{\mathbf{K}}_{1,K} & \hat{\mathbf{K}}_{2,K} & \cdots & \tilde{\mathbf{K}}_K + \hat{\mathbf{K}}_{K,K}\end{pmatrix}. \qquad (14.c)$$
With the optimal solution $((\boldsymbol{\lambda}_1^+)^*,(\boldsymbol{\lambda}_1^-)^*,\ldots,(\boldsymbol{\lambda}_K^+)^*,(\boldsymbol{\lambda}_K^-)^*)$ of the dual in (11) or (13), we can get the optimal solution of the primal in (10) based on the relations in (15.a) and (15.b):
$$\tilde{\boldsymbol{\theta}}_k^* = \frac{K}{\lambda}\sum_{i=1}^{N_k}\left((\lambda_{i,k}^+)^* - (\lambda_{i,k}^-)^*\right)\mathbf{x}_{gi,k} \qquad (15.a)$$
$$\mathbf{U}^* = \tilde{\mathbf{H}}\left(\sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^+)^*\,\mathbf{x}_{gi,k} - \sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^-)^*\,\mathbf{x}_{gi,k}\right). \qquad (15.b)$$
Please note that the detailed derivations of (11), (15.a), and (15.b) can be found in the Appendix.

Step 2: The computation of $\tilde{\mathbf{H}}^*$. When the parameters $\tilde{\boldsymbol{\theta}}$ and $\mathbf{U}$ are fixed and the parameters $\boldsymbol{\lambda}_1^+,\ldots,\boldsymbol{\lambda}_K^+,\boldsymbol{\lambda}_1^-,\ldots,\boldsymbol{\lambda}_K^-$ have been obtained in Step 1, the optimal $\tilde{\mathbf{H}}$ that solves the optimization problem in (10) minimizes
$$G(\tilde{\mathbf{H}}) = \frac{1}{2}(\tilde{\mathbf{H}}^T\mathbf{U})^T(\tilde{\mathbf{H}}^T\mathbf{U}) + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\left(-(\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k}\right) + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-(\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k}\qquad \text{s.t.}\quad \tilde{\mathbf{H}}\tilde{\mathbf{H}}^T = \mathbf{I}_{r\times r}. \qquad (16.a)$$
The gradient with respect to $\tilde{\mathbf{H}}$ is
$$\frac{\partial G}{\partial\tilde{\mathbf{H}}} = \mathbf{U}\mathbf{U}^T\tilde{\mathbf{H}} - \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\,\mathbf{U}\mathbf{x}_{gi,k}^T + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-\,\mathbf{U}\mathbf{x}_{gi,k}^T. \qquad (16.b)$$


Then, the variable $\tilde{\mathbf{H}}$ can be learned via a gradient descent algorithm on the Grassmann manifold [49]–[51]
$$\tilde{\mathbf{H}} \leftarrow \tilde{\mathbf{H}} - \eta\frac{\partial G}{\partial\tilde{\mathbf{H}}}\left(\mathbf{I}_{r\times r} - \tilde{\mathbf{H}}\tilde{\mathbf{H}}^T\right) = \tilde{\mathbf{H}} - \eta\nabla\tilde{\mathbf{H}}. \qquad (16.c)$$
That is,
$$\tilde{\mathbf{H}}^{(t+1)} = \tilde{\mathbf{H}}^{(t)} - \eta\left(\mathbf{U}\mathbf{U}^T\tilde{\mathbf{H}}^{(t)} - \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\mathbf{U}\mathbf{x}_{gi,k}^T + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-\mathbf{U}\mathbf{x}_{gi,k}^T\right)\left(\mathbf{I}_{r\times r} - \tilde{\mathbf{H}}^{(t)}\tilde{\mathbf{H}}^{(t)T}\right). \qquad (16.d)$$
Please note that the step size η can be set manually, and it can also be obtained analytically as follows. First, substituting the update rule in (16.c) into the objective function in (16.a), we have
$$f(\eta) = -\frac{\eta}{2}(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\tilde{\mathbf{H}}^T\mathbf{U} - \frac{\eta}{2}(\tilde{\mathbf{H}}^T\mathbf{U})^T\nabla\tilde{\mathbf{H}}^T\mathbf{U} + \frac{\eta^2}{2}(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\nabla\tilde{\mathbf{H}}^T\mathbf{U} + \eta\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k} - \eta\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k} \qquad (17.a)$$
and the gradient with respect to the step size η is
$$\frac{\partial f}{\partial\eta} = -\frac{1}{2}(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\tilde{\mathbf{H}}^T\mathbf{U} - \frac{1}{2}(\tilde{\mathbf{H}}^T\mathbf{U})^T\nabla\tilde{\mathbf{H}}^T\mathbf{U} + \eta(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\nabla\tilde{\mathbf{H}}^T\mathbf{U} + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k} - \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k}. \qquad (17.b)$$
Setting $\partial f/\partial\eta = 0$, we obtain
$$\eta = \frac{\frac{1}{2}(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\tilde{\mathbf{H}}^T\mathbf{U} + \frac{1}{2}(\tilde{\mathbf{H}}^T\mathbf{U})^T\nabla\tilde{\mathbf{H}}^T\mathbf{U} - \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k} + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\mathbf{x}_{gi,k}}{(\nabla\tilde{\mathbf{H}}^T\mathbf{U})^T\nabla\tilde{\mathbf{H}}^T\mathbf{U}}. \qquad (17.c)$$
Above all, we have the update rule for the common structural parameter $\tilde{\mathbf{H}}$ by using (16.d) and (17.c), that is,
$$\tilde{\mathbf{H}} \leftarrow \tilde{\mathbf{H}} - \eta\nabla\tilde{\mathbf{H}} \qquad (18)$$
where $\nabla\tilde{\mathbf{H}} = \frac{\partial G}{\partial\tilde{\mathbf{H}}}(\mathbf{I}_{r\times r} - \tilde{\mathbf{H}}\tilde{\mathbf{H}}^T)$ and $\frac{\partial G}{\partial\tilde{\mathbf{H}}} = \mathbf{U}\mathbf{U}^T\tilde{\mathbf{H}} - \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\mathbf{U}\mathbf{x}_{gi,k}^T + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-\mathbf{U}\mathbf{x}_{gi,k}^T$.

Step 3: The computation of $\tilde{\mathbf{p}}_{g,k}^*$. With the optimal parameters $\tilde{\boldsymbol{\theta}}^*$ and $\mathbf{U}^*$ of the dual in (11) or (13) from Step 1 and the optimal common structural parameter $\tilde{\mathbf{H}}^*$ from Step 2, the optimal model parameter of the trained MTCS-TSK-FS for task k, i.e., $\tilde{\mathbf{p}}_{g,k}^*$, is given by
$$\tilde{\mathbf{p}}_{g,k}^* = (\tilde{\mathbf{H}}^*)^T\mathbf{U}^* + \tilde{\boldsymbol{\theta}}_k^*. \qquad (19)$$
Finally, we can use the obtained optimal parameter $\tilde{\mathbf{p}}_{g,k}^*$ to construct the TSK fuzzy system for each task by using (6.g). Note here that the joint optimization over $\tilde{\boldsymbol{\theta}}$, $\mathbf{U}$, and $\tilde{\mathbf{H}}$ makes the problem nonconvex; therefore, only a locally optimal solution can be obtained. This usually does not lead to a serious problem, as the local optimum is effective enough in most practical applications.

D. Learning Algorithm for MTCS-TSK-FS

Based on the above update rules, the learning algorithm of the proposed MTCS-TSK-FS is presented in Algorithm 1.

Algorithm 1: Learning Algorithm for MTCS-TSK-FS
Stage 1: Initialization
  Initialize t = 0 and H̃(0); compute J(0); set the maximum number of external iterations t_max, the maximum number of internal iterations l_max, the regularization parameters τ_k and λ, the number of fuzzy rules M_k for each task k = 1,...,K (set M_1 = M_2 = ··· = M_K), and the threshold δ.
Stage 2: Constructing the Multitask Dataset for Linear Regression
  Step 1: Use FCM or other partition methods to generate the regression datasets D_k = {x_gi,k, y_i,k}, i = 1,...,N_k, k = 1,...,K for each task.
Stage 3: Optimizing the Objective Function of MTCS-TSK-FS
  Repeat: t = t + 1;
    Step 2: Use (11) or (13) to update θ̃_k(t) and U(t);
    Step 3: l = 0; H̃(t,l) = H̃(t−1); compute G(0);
      Repeat: l = l + 1;
        Use (17.c) to update the step size η;
        Use (18) to update H̃;
      Until: G(t,l) − G(t,l−1) ≤ δ or l ≥ l_max;
      H̃(t) = H̃(t,l);
    Step 4: Use (19) to update the consequent parameters p̃_g,k of each task; compute J(t);
  Until: J(t) − J(t−1) ≤ δ or t ≥ t_max
Stage 4: Generating the MTCS-TSK Fuzzy System for Each Task
  Step 5: Generate the desired MTCS-TSK-FS for each task by using the final optimal consequent parameters p̃_g,k and (6.g).
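To make the Step-2 update concrete, here is a minimal sketch of one projected-gradient iteration (ours, not the authors' code; note that we apply the tangent-space projection as (I − H̃ᵀH̃) on the right so that the matrix shapes conform):

```python
import numpy as np

def grassmann_step(H, U, lam_plus, lam_minus, Xg_list, eta):
    """One projected-gradient update of the shared structure H (r x D), per (16.b)-(18).
    U: (r,) shared consequents; lam_plus/lam_minus: per-task multiplier arrays (N_k,);
    Xg_list: per-task regression-input matrices (N_k x D)."""
    # s = sum_k sum_i (lambda^+ - lambda^-)_{i,k} x_{gi,k}, a D-vector
    s = sum((lp - lm) @ Xg for lp, lm, Xg in zip(lam_plus, lam_minus, Xg_list))
    dG = np.outer(U, U) @ H - np.outer(U, s)        # gradient (16.b)
    grad = dG @ (np.eye(H.shape[1]) - H.T @ H)      # tangent-space projection, cf. (16.c)
    return H - eta * grad                           # gradient step (18)
```

The step size `eta` can be a small constant or the closed-form value of (17.c).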


TABLE I. NOTATIONS OF THE ADOPTED DATASETS AND THEIR DEFINITIONS

IV. EXPERIMENTAL RESULTS

A. Setup

1) Methods Adopted for Comparison: In this section, we evaluate the effectiveness of the proposed MTCS-TSK-FS in comparison with three representative methods on synthetic and real-world datasets: 1) the L2-norm penalty-based ε-insensitive TSK fuzzy model (L2-TSK-FS) [33]; 2) the TS-fuzzy-system-based support vector regression (TSFS-SVR) [52]; and 3) the fuzzy system learned through fuzzy clustering and support vector machine (FS-FCSVM) [53].

2) Parameter Setting: In our experiments, the parameters of the above three methods and the proposed method are set as follows.
1) The number of fuzzy rules: For all the algorithms except FS-FCSVM (which determines its number of fuzzy rules automatically from the number of support vectors), the number of fuzzy rules is determined, according to the scale of the dataset, by the fivefold cross-validation strategy over the parameter set {5, 10, 15, 20, 25, 30}.
2) The regularization parameter τ of the L2-norm-based TSK-FS methods (L2-TSK-FS and MTCS-TSK-FS) and C of the SVM-based TSK-FS methods (TSFS-SVR and FS-FCSVM): these regularization parameters are determined by fivefold cross-validation over the parameter sets $\{2^{-6}, 2^{-5}, \ldots, 2^{5}, 2^{6}\}$ and $\{10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}\}$, respectively.
3) The kernel function of the SVM-based TSK-FS methods (TSFS-SVR and FS-FCSVM): we use the Gaussian kernel function $K(\mathbf{x},\mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2/\sigma^2}$ as in [52] and [53]; its kernel parameter σ is determined by fivefold cross-validation over the parameter set $\{2^{-6}, 2^{-5}, \ldots, 2^{5}, 2^{6}\}$.
4) The dimension r of the common hidden structure of the proposed MTCS-TSK-FS: we tested different values of r and found that r = d/2 seems to give the best results, where d denotes the dimension of the regression dataset. When r is too large, the computation cost is high; when r is too small, performance may suffer from information loss.

3) Performance Index: The index
$$J = \frac{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2}$$
is adopted to evaluate the training and test performance, where N is the number of samples in a dataset, $y_i$ is the desired output of the ith data pair, $\hat{y}_i$ is the fuzzy model output for the ith input datum, and $\bar{y} = \frac{1}{N}\sum_{i=1}^{N}y_i$. The smaller the value of J obtained on a test set, the better the generalization performance.
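In code, the index reads as follows (a direct transcription of the formula above):

```python
import numpy as np

def j_index(y_true, y_pred):
    """Performance index J of Section IV: mean squared error normalized by
    the output variance. Smaller J on a test set means better generalization."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mse = np.mean((y_true - y_pred) ** 2)
    var = np.mean((y_true - y_true.mean()) ** 2)
    return mse / var
```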

For clarity, the notations for the datasets and their definitions are listed in Table I. In the experiments, the parameters of all the compared methods are determined by the fivefold cross-validation strategy on the training datasets. All the algorithms are implemented in 64-bit MATLAB on a computer with two Intel Xeon E5-2620 2.0 GHz CPUs and 32 GB RAM.

B. Synthetic Datasets

1) Generation of Synthetic Datasets: Synthetic datasets were generated to simulate different scenes. The following assumptions need to be satisfied by these synthetic datasets: 1) there should exist different tasks (independence information) and 2) these tasks should be related (correlation information). In other words, our synthetic datasets satisfy the assumption that the multiple tasks are different but related. Based on the above assumptions, we define four different scenes, as described in Table II, to simulate real-world ones as follows; a hypothetical generator for the first scene is sketched after this list.
a) Same Input–Different Output [SI-DO(DN)] scene: This scene contains three tasks that have the same input dataset, i.e., X1 = X2 = X3, but different outputs, produced by the same mapping function with different noise added. Thus, a multitask dataset {T1, T2, T3}, where T1 = [X, Y1], T2 = [X, Y2], and T3 = [X, Y3], is obtained.
b) Same Input–Different Output [SI-DO(DF)] scene: This scene contains three tasks, again with the same input dataset, i.e., X1 = X2 = X3. However, their outputs differ, being produced by different mapping functions with the same noise-adding function applied.
c) Different Input–Different Output [DI-DO(SN&SF)] scene: This scene contains three tasks with different input datasets generated from different input ranges and different output values, generated by the same mapping function and the same noise-adding function.
d) Different Input–Different Output [DI-DO(DN&SF)] scene: This scene contains three tasks with different input datasets generated from different input ranges and different output values, generated by the same mapping function but with different noise added.
The settings for generating the above synthetic datasets are described in Table II. In this experiment, ten noisy training datasets and one noise-free test dataset were generated for each multitask scene. The average performance of each algorithm under the different multitask scenes is reported.
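The following sketch illustrates what an SI-DO(DN)-style generator could look like. It is hypothetical: the mapping $f(x) = x^2\cos(x)$ is borrowed from the paper's second synthetic function, but the input range and noise scales are our own assumptions, since the contents of Table II did not survive extraction:

```python
import numpy as np

def make_si_do_dn_scene(n=200, noise_levels=(0.1, 0.3, 0.5), seed=0):
    """Hypothetical SI-DO(DN) generator: three tasks share the same inputs and
    mapping but carry different noise levels (all settings here are assumptions)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=(n, 1))
    f = (x ** 2 * np.cos(x)).ravel()
    return [(x, f + rng.normal(0.0, s * f.std(), size=n)) for s in noise_levels]

tasks = make_si_do_dn_scene()   # [(X, y_task1), (X, y_task2), (X, y_task3)]
```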



TABLE II. DETAILS OF THE SYNTHETIC DATASETS

2) Comparison With Related TSK-FS Modeling Methods: The proposed MTCS-TSK-FS is compared with the three related TSK-based fuzzy system methods, and the results are shown in Table III and Fig. 3 [to save space, Fig. 3 only shows the experimental results on the dataset of the DI-DO(SF&DN) scene]. From the experimental results on these four different scenes, the following observations can be made.
1) For the SI-DO(DN) scene, it can be seen from Table III that as the degree of noise increases, the performance of the compared methods, i.e., L2-TSK-FS, TSFS-SVR, and FS-FCSVM, becomes progressively worse. An obvious reason is that the training data of each task are noisy, which degrades the generalization capability of these three methods. The proposed MTCS-TSK-FS, however, considerably improves not only the generalization performance but also the fit to each individual task in this scene, indicating that the use of common information (i.e., the common hidden structure) has positive effects both on the fitting of each individual fuzzy model to its training dataset and on the generalization of all models to unseen test datasets.
2) For the SI-DO(DF) scene, the three data-generating functions have different levels of complexity. Table III shows that the modeling results of the three compared methods are not ideal in this scene. In particular, on the second function $f_2(x_2, N_2) = x_2^2\cdot\cos(x_2) + N_2$, both the generalization and fitting performances of these three methods are much weaker. Although the compared methods produce undesirable results in this scene, the proposed MTCS-TSK-FS still obtains an acceptable generalization capability.
3) For the DI-DO(SN&SF) scene, the results in Table III show that by expanding the interval of the input data,

the complexity of the output data increases and the compared methods cannot achieve ideal results. By exploiting the independence information of each task and mining the shared hidden correlation information among all tasks, MTCS-TSK-FS achieves the best performance.
4) For the DI-DO(DN&SF) scene, Table III and Fig. 3 show the modeling results of L2-TSK-FS, TSFS-SVR, FS-FCSVM, and our MTCS-TSK-FS. Both the generalization and fitting performances of the proposed MTCS-TSK-FS are better than those of the related methods in this scene. Expanding the input interval while simultaneously increasing the degree of noise makes the output data exceptionally complex across all tasks; the single-task methods cannot make full use of the information among all tasks, in particular the common information, and thus still exhibit poor performance. In contrast, the multitask MTCS-TSK-FS achieves the best performance by exploiting the independence information and the shared hidden correlation information.
Overall, these experimental results confirm that the proposed multitask common-hidden-structure TSK-FS outperforms the existing state-of-the-art TSK-based fuzzy system methods in multitask learning scenes.

C. Real-World Datasets

1) Glutamic Acid Fermentation Process Modeling: The glutamic acid fermentation process [33], [54] has strongly nonlinear and time-varying characteristics, so it is difficult to model mechanistically; moreover, an overly complex model would be of little use for controlling and optimizing the process. It is therefore significant and practical to model the glutamic acid fermentation process with fuzzy modeling technology. In this subsection, to further evaluate the performance of the proposed multitask TSK fuzzy system learning method, an experiment is conducted in which the proposed method models a biochemical process with real-world datasets [33]–[35].


TABLE III. GENERALIZATION AND FITTING PERFORMANCES (J) OF THE PROPOSED METHOD MTCS-TSK-FS AND SEVERAL RELATED TSK-BASED FUZZY SYSTEM METHODS ON THE DATASETS OF THE DIFFERENT SCENES

The datasets adopted originate from the glutamic acid fermentation process, which is a MIMO system. The input variables of the dataset include the fermentation time h, glucose concentration S(h), thalli concentration X(h), glutamic acid concentration P(h), stirring speed R(h), and ventilation Q(h), where h = 0, 2, ..., 28. The output variables are the glucose concentration S(h+2), thalli concentration X(h+2), and glutamic acid concentration P(h+2) at the future time h+2. The TSK-FS-based biochemical process prediction model is illustrated in Fig. 4. The data in this experiment were collected from 21 batches of fermentation processes, with each batch containing 14 effective data samples. We randomly selected 20 batches as the training dataset (D1-train) and the remaining batch as the test dataset (D2-test). The above procedure was repeated ten times to obtain the average performance of each algorithm. To match the situation discussed in this paper, the dataset is divided into three tasks as follows.
1) Task 1 (Glucose Concentration-FS): The main objective of this fuzzy system is to predict the glucose concentration at the next time step, i.e., S(h+2).
2) Task 2 (Thalli Concentration-FS): The main objective of this fuzzy system is to predict the thalli concentration at the next time step, i.e., X(h+2).
3) Task 3 (Glutamic Acid Concentration-FS): The main objective of this fuzzy system is to predict the glutamic acid concentration at the next time step,

i.e., P(h+2).

2) Polymer Test Plant Modeling: This dataset was taken from a polymer test plant.¹ There are ten input variables, measurements of controlled variables in a polymer processing plant (temperatures, feed rates, etc.), $\mathbf{X} \in \mathbb{R}^d$, d = 10, and four output variables ([Y1, Y2, Y3, Y4]) that are measures of the output of the plant. This dataset is claimed to be particularly good for testing the robustness of nonlinear modeling methods to irregularly spaced data. It is also a MIMO system. According to its outputs, we divide this dataset into four tasks, i.e., [X, Y1] for task 1, [X, Y2] for task 2, [X, Y3] for task 3, and [X, Y4] for task 4. In this experiment, the dataset was randomly partitioned with ratio 3:1 for training and testing. This procedure was repeated ten times, and the average performance of each algorithm over the ten runs is reported.

¹The polymer dataset is available from ftp://ftp.cis.upenn.edu/pub/ungar/chemdata

3) Wine Preferences Modeling: This dataset is adopted from the wine quality dataset [55], [56]. It contains two subdatasets that measure the physicochemical properties of red wine (1599 samples) and white wine (4898 samples), with 11 conditional attributes (input) based on physicochemical tests (e.g., pH values), $\mathbf{X} \in \mathbb{R}^d$, d = 11, and one decision attribute (output) based on sensory data (a quality score between 0 and 10 given by wine experts). In this experiment, this dataset is divided into two tasks, i.e., red wine $[\mathbf{X}^{red}, Y^{red}]$ for task 1 and white wine $[\mathbf{X}^{white}, Y^{white}]$ for task 2. The dataset was also randomly partitioned with ratio 3:1 for training and testing, and this procedure was repeated ten times to obtain the average performance of each algorithm.

4) Concrete Slump Modeling: This dataset was taken from ready-mix concrete batching plants [58], [59]. Concrete is a highly complex material, which makes modeling its behavior a very difficult task. The dataset includes 103 data points with seven input variables (cement, slag, fly ash, water, SP, coarse aggregate, and fine aggregate), giving $\mathbf{X} \in \mathbb{R}^d$, d = 7, and three output variables [slump, flow, and compressive strength (MPa)], [Y1, Y2, Y3]. It is a MIMO system. According to its outputs, we divide this dataset into three tasks, i.e., [X, Y1] for task 1, [X, Y2] for task 2, and [X, Y3] for task 3. In our experiment, this dataset was also randomly partitioned with ratio 3:1 for training and testing, and the procedure was repeated ten times to obtain the average performance of each algorithm.

5) Multivalued (MV) Data Modeling: This is a benchmark artificial dataset [61] with dependency between the attribute values.² It is a large-scale dataset containing 40 768 data points, with ten input variables and one output variable. According to its eighth input variable, we divide this dataset into two tasks, i.e., normal-type for task 1 and large-type for task 2. In this experiment, the dataset was also randomly partitioned with ratio 3:1 for training and testing, and the procedure was repeated ten times to obtain the average performance of each algorithm.

²The MV dataset is available from http://www.keel.es/

6) Comparison With Related TSK-FS Modeling Methods: These five real-world multitask regression datasets were used to test the proposed MTCS-TSK-FS and the three related TSK-based fuzzy system methods, i.e., L2-TSK-FS, TSFS-SVR, and FS-FCSVM; the results are given in Table IV. The findings are similar to those presented in Section IV-B for the experiments performed on the synthetic datasets. The results show that the proposed MTCS-TSK-FS obtains better performance than the other three methods on the four small-scale datasets (the fermentation, polymer, wine, and concrete datasets) and on the large-scale dataset (the MV dataset). This can again be explained by the fact that the proposed method effectively exploits not only the independent information in the original data space of each task but also the useful intertask hidden correlation information obtained by mining the common hidden structure among all tasks. Both the generalization and fitting capabilities of the TSK-FS obtained by the proposed MTCS-TSK-FS for each task are therefore promising on all five adopted datasets, including the large-scale one.

Fig. 3. Experimental results of the proposed MTCS-TSK-FS method, the L2-TSK-FS method, the TSFS-SVR method, and the FS-FCSVM method on the testing dataset of the DI-DO(SF&DN) scene in a certain run: (a-1)–(a-3) L2-TSK-FS for each task; (b-1)–(b-3) TSFS-SVR for each task; (c-1)–(c-3) FS-FCSVM for each task; and (d-1)–(d-3) MTCS-TSK-FS for each task.

Fig. 4. Illustration of the glutamic acid fermentation process prediction model based on TSK-FSs.

V. CONCLUSION

In this paper, a multitask fuzzy system modeling method based on mining the intertask common hidden structure has been proposed to overcome the weaknesses of classical TSK-based fuzzy modeling methods in multitask learning. When the classical (single-task) fuzzy modeling methods are applied to multitask datasets, they focus only on the task-independence information and ignore the correlation between the different tasks. Here we mine the common hidden structure among multiple tasks to realize multitask TSK fuzzy system learning, making good use of both the independence information of each task and the correlation information captured by the common hidden structure among all tasks. The proposed learning algorithm can thus effectively improve both the generalization and fitting performances of the learned fuzzy system for each task. Our experimental results demonstrate that the proposed MTCS-TSK-FS has better modeling performance and adaptability than the existing TSK-based fuzzy modeling methods on multitask datasets. Although the performance of the proposed multitask fuzzy system is very promising, there is still room for further study. For example, a fast learning algorithm for MTCS-TSK-FS is needed in order to


TABLE IV. GENERALIZATION PERFORMANCE (J) OF THE PROPOSED MTCS-TSK-FS AND SEVERAL RELATED TSK-BASED FUZZY SYSTEM METHODS ON REAL-WORLD MULTITASK DATASETS

make it more efficient on large-scale datasets. For this purpose, the minimal enclosing ball approximation technique [33] and the stochastic dual coordinate descent algorithm [60] can be considered for developing a corresponding fast version of the proposed method. Another direction is how to determine an appropriate size of the common hidden structure, and how to explain and leverage the knowledge contained in the common hidden structure mined by the proposed modeling. Besides, the development of transfer learning mechanisms for MTCS-TSK-FS by mining the knowledge in the common hidden structure is also very important for dealing with applications where data are missing for some tasks. Future work will be devoted to these issues.

APPENDIX

For (10), the corresponding Lagrangian function is
$$L(\mathbf{U},\tilde{\mathbf{H}},\tilde{\boldsymbol{\theta}}_1,\ldots,\tilde{\boldsymbol{\theta}}_K,\boldsymbol{\xi}_1^+,\ldots,\boldsymbol{\xi}_K^+,\boldsymbol{\xi}_1^-,\ldots,\boldsymbol{\xi}_K^-,\varepsilon_1,\ldots,\varepsilon_K,\boldsymbol{\lambda}_1^+,\ldots,\boldsymbol{\lambda}_K^+,\boldsymbol{\lambda}_1^-,\ldots,\boldsymbol{\lambda}_K^-)$$
$$= \frac{1}{2}(\tilde{\mathbf{H}}^T\mathbf{U})^T(\tilde{\mathbf{H}}^T\mathbf{U}) + \frac{\lambda}{2K}\sum_{k=1}^{K}\tilde{\boldsymbol{\theta}}_k^T\tilde{\boldsymbol{\theta}}_k + \sum_{k=1}^{K}\frac{1}{N_k\tau_k}\sum_{i=1}^{N_k}\left[(\xi_{i,k}^+)^2 + (\xi_{i,k}^-)^2\right] + \sum_{k=1}^{K}\frac{2}{\tau_k}\varepsilon_k$$
$$+ \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\left[y_{i,k} - (\tilde{\mathbf{H}}^T\mathbf{U} + \tilde{\boldsymbol{\theta}}_k)^T\mathbf{x}_{gi,k} - \varepsilon_k - \xi_{i,k}^+\right] + \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-\left[(\tilde{\mathbf{H}}^T\mathbf{U} + \tilde{\boldsymbol{\theta}}_k)^T\mathbf{x}_{gi,k} - y_{i,k} - \varepsilon_k - \xi_{i,k}^-\right]. \qquad (A1)$$

From this Lagrangian, the optimal values can be computed by setting the derivatives of L(·) with respect to $\mathbf{U}$, $\tilde{\boldsymbol{\theta}}_k$, $\boldsymbol{\xi}_k^+$, $\boldsymbol{\xi}_k^-$, and $\varepsilon_k$ to zero, respectively, that is
$$\frac{\partial L}{\partial\mathbf{U}} = \tilde{\mathbf{H}}\tilde{\mathbf{H}}^T\mathbf{U} - \tilde{\mathbf{H}}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\mathbf{x}_{gi,k} + \tilde{\mathbf{H}}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-\mathbf{x}_{gi,k} = \mathbf{0} \qquad (A2.a)$$
$$\tilde{\mathbf{H}}\tilde{\mathbf{H}}^T = \mathbf{I}_{r\times r} \ \Rightarrow\ \frac{\partial L}{\partial\mathbf{U}} = \mathbf{U} - \tilde{\mathbf{H}}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\mathbf{x}_{gi,k} + \tilde{\mathbf{H}}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-\mathbf{x}_{gi,k} = \mathbf{0} \qquad (A2.b)$$
$$\frac{\partial L}{\partial\tilde{\boldsymbol{\theta}}_k} = \frac{\lambda}{K}\tilde{\boldsymbol{\theta}}_k - \sum_{i=1}^{N_k}\lambda_{i,k}^+\mathbf{x}_{gi,k} + \sum_{i=1}^{N_k}\lambda_{i,k}^-\mathbf{x}_{gi,k} = \mathbf{0} \qquad (A3)$$
$$\frac{\partial L}{\partial\xi_{i,k}^+} = \frac{2}{N_k\tau_k}\xi_{i,k}^+ - \lambda_{i,k}^+ = 0 \qquad (A4)$$
$$\frac{\partial L}{\partial\xi_{i,k}^-} = \frac{2}{N_k\tau_k}\xi_{i,k}^- - \lambda_{i,k}^- = 0 \qquad (A5)$$
$$\frac{\partial L}{\partial\varepsilon_k} = \frac{2}{\tau_k} - \sum_{i=1}^{N_k}\lambda_{i,k}^+ - \sum_{i=1}^{N_k}\lambda_{i,k}^- = 0. \qquad (A6)$$

From (A2) to (A6), we obtain
$$\mathbf{U} = \tilde{\mathbf{H}}\left(\sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^+\mathbf{x}_{gi,k} - \sum_{k=1}^{K}\sum_{i=1}^{N_k}\lambda_{i,k}^-\mathbf{x}_{gi,k}\right) \qquad (A7)$$
$$\tilde{\boldsymbol{\theta}}_k = \frac{K}{\lambda}\sum_{i=1}^{N_k}(\lambda_{i,k}^+ - \lambda_{i,k}^-)\mathbf{x}_{gi,k} \qquad (A8)$$
$$\xi_{i,k}^+ = \frac{N_k\tau_k}{2}\lambda_{i,k}^+ \qquad (A9)$$
$$\xi_{i,k}^- = \frac{N_k\tau_k}{2}\lambda_{i,k}^- \qquad (A10)$$
$$\frac{2}{\tau_k} = \sum_{i=1}^{N_k}(\lambda_{i,k}^+ + \lambda_{i,k}^-). \qquad (A11)$$


Substituting (A7)–(A11) into (A1), the optimization problem (A12) is obtained:
$$L(\boldsymbol{\lambda}_1^+,\ldots,\boldsymbol{\lambda}_K^+,\boldsymbol{\lambda}_1^-,\ldots,\boldsymbol{\lambda}_K^-) = -\frac{1}{2}\Bigg(\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_l}(\lambda_{j,l}^+ - \lambda_{j,l}^-)(\lambda_{i,k}^+ - \lambda_{i,k}^-)\mathbf{x}_{gj,l}^T\mathbf{x}_{gi,k} + \frac{K}{\lambda}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_k}(\lambda_{i,k}^+ - \lambda_{i,k}^-)(\lambda_{j,k}^+ - \lambda_{j,k}^-)\mathbf{x}_{gj,k}^T\mathbf{x}_{gi,k} + \sum_{k=1}^{K}\frac{N_k\tau_k}{2}\sum_{i=1}^{N_k}\left[(\lambda_{i,k}^+)^2 + (\lambda_{i,k}^-)^2\right]\Bigg) + \sum_{k=1}^{K}\sum_{i=1}^{N_k}(\lambda_{i,k}^+ - \lambda_{i,k}^-)y_{i,k}$$
$$\text{s.t.}\quad \boldsymbol{\lambda}_k^+ \ge \mathbf{0},\ \boldsymbol{\lambda}_k^- \ge \mathbf{0},\quad \sum_{i=1}^{N_k}(\lambda_{i,k}^+ + \lambda_{i,k}^-) = \frac{2}{\tau_k}\quad \forall k,\ k = 1,\ldots,K. \qquad (A12)$$
It is clear that (A7), (A8), and (A12) here are equivalent to (15.b), (15.a), and (11) in the main text, respectively.

REFERENCES

[1] R. Caruana, "Multitask learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
[2] S. Sun, "Multitask learning for EEG-based biometrics," in Proc. 19th Int. Conf. Pattern Recognit., Tampa, FL, USA, 2008, pp. 1–4.
[3] X. T. Yuan and S. Yan, "Visual classification with multi-task joint sparse representation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, USA, 2010, pp. 3493–3500.
[4] S. Parameswaran and K. Q. Weinberger, "Large margin multi-task metric learning," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1867–1875.
[5] Y. Ji and S. Sun, "Multitask multiclass support vector machines: Model and experiments," Pattern Recognit., vol. 46, no. 3, pp. 914–924, 2013.
[6] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, Seattle, WA, USA, 2004, pp. 109–117.
[7] T. Evgeniou, C. Micchelli, and M. Pontil, "Learning multiple tasks with kernel methods," J. Mach. Learn. Res., vol. 6, no. 1, pp. 615–637, 2005.
[8] A. Argyriou, T. Evgeniou, and M. Pontil, "Convex multi-task feature learning," Mach. Learn., vol. 73, no. 3, pp. 243–272, 2008.
[9] T. Jebara, "Multi-task feature and kernel selection for SVMs," in Proc. 21st Int. Conf. Mach. Learn., Banff, AB, Canada, Jul. 2004.
[10] A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying, "Universal multi-task kernels," J. Mach. Learn. Res., vol. 68, pp. 1615–1646, Jul. 2008.

[11] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile, "Linear algorithms for online multitask classification," in Proc. COLT, 2008.
[12] J. Fang, S. Ji, Y. Xue, and L. Carin, "Multitask classification by learning the task relevance," IEEE Signal Process. Lett., vol. 15, pp. 593–596, Oct. 2008.
[13] J. Chen, L. Tang, J. Liu, and J. Ye, "A convex formulation for learning shared structures from multiple tasks," in Proc. ICML, Montreal, QC, Canada, 2009, p. 18.
[14] Q. Gu and J. Zhou, "Learning the shared subspace for multi-task clustering and transductive transfer classification," in Proc. ICDM, Miami, FL, USA, 2009, pp. 159–168.
[15] Q. Gu, Z. Li, and J. Han, "Learning a kernel for multi-task clustering," in Proc. AAAI, 2011.
[16] Z. Zhang and J. Zhou, "Multi-task clustering via domain adaptation," Pattern Recognit., vol. 45, no. 1, pp. 465–473, 2012.
[17] L. Jacob, F. Bach, and J.-P. Vert, "Clustered multi-task learning: A convex formulation," in Proc. Adv. Neural Inf. Process. Syst., 2008, pp. 745–752.
[18] S. Xie, H. Lu, and Y. He, "Multi-task co-clustering via nonnegative matrix factorization," in Proc. ICPR, Tsukuba, Japan, 2012, pp. 2954–2958.
[19] J. Zhou, J. Chen, and J. Ye, "Clustered multi-task learning via alternating structure optimization," in Proc. NIPS, 2011, pp. 702–710.
[20] S. Kong and D. Wang, "A multi-task learning strategy for unsupervised clustering via explicitly separating the commonality," in Proc. ICPR, Tsukuba, Japan, 2012, pp. 771–774.
[21] B. Bakker and T. Heskes, "Task clustering and gating for Bayesian multitask learning," J. Mach. Learn. Res., vol. 4, pp. 83–99, May 2003.
[22] F. Cai and V. Cherkassky, "SVM+ regression and multi-task learning," in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, USA, 2009, pp. 418–424.
[23] H. Wang et al., "A new sparse multi-task regression and feature selection method to identify brain imaging predictors for memory performance," in Proc. ICCV, 2011, pp. 557–562.
[24] M. Solnon, S. Arlot, and F. Bach, "Multi-task regression using minimal penalties," J. Mach. Learn. Res., vol. 13, pp. 2773–2812, Sep. 2012.
[25] Y. Zhang and D. Y. Yeung, "Semi-supervised multi-task regression," in Proc. Eur. Conf. Mach. Learn. Knowl. Discov., Bled, Slovenia, 2009, pp. 617–631.
[26] J. Zhou, L. Yuan, J. Liu, and J. Ye, "A multi-task learning formulation for predicting disease progression," in Proc. KDD, San Diego, CA, USA, 2011, pp. 814–822.
[27] S. Kim and E. P. Xing, "Tree-guided group lasso for multi-task regression with structured sparsity," in Proc. 27th Int. Conf. Mach. Learn., Haifa, Israel, 2010, pp. 543–550.
[28] K. Puniyani, S. Kim, and E. P. Xing, "Multi-population GWA mapping via multi-task regularized regression," Bioinformatics, vol. 26, no. 12, pp. 208–216, 2010.
[29] K. J. Astrom and T. J. McAvoy, "Intelligent control," J. Process Control, vol. 2, no. 3, pp. 115–127, 1993.
[30] F. L. Chung, Z. H. Deng, and S. T. Wang, "An adaptive fuzzy-inference-rule-based flexible model for automatic elastic image registration," IEEE Trans. Fuzzy Syst., vol. 17, no. 5, pp. 995–1010, Oct. 2009.
[31] A. H. Sonbol, M. S. Fadali, and S. Jafarzadeh, "TSK fuzzy function approximators: Design and accuracy analysis," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 3, pp. 702–712, Jun. 2012.
[32] Q. Gao, X. J. Zeng, G. Feng, Y. Wang, and J. B. Qiu, "T-S-fuzzy-model-based approximation and controller design for general nonlinear systems," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 4, pp. 1143–1154, Aug. 2012.
[33] Z. H. Deng, K. S. Choi, F. L. Chung, and S. T. Wang, "Scalable TSK fuzzy modeling for very large datasets using minimal-enclosing-ball approximation," IEEE Trans. Fuzzy Syst., vol. 19, no. 2, pp. 210–226, Apr. 2011.




Yizhang Jiang (M'12) received the B.S. degree in computer science from Nanjing University of Science and Technology, Nanjing, China, in 2010, and the M.S. degree in computer science from Jiangnan University, Wuxi, China, in 2012, where he is currently pursuing the Ph.D. degree with the School of Digital Media. He was a Research Assistant with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, from May 2013 to January 2014. His current research interests include pattern recognition, intelligent computation, and their applications. Mr. Jiang has published several papers in international journals, including the IEEE Transactions on Fuzzy Systems and the IEEE Transactions on Neural Networks and Learning Systems.

Fu-Lai Chung (M'95) received the B.Sc. degree from the University of Manitoba, Winnipeg, MB, Canada, in 1987, and the M.Phil. and Ph.D. degrees from the Chinese University of Hong Kong, Hong Kong, in 1991 and 1995, respectively. In 1994, he joined the Department of Computing, Hong Kong Polytechnic University, Hong Kong, where he is currently an Associate Professor. His current research interests include transfer learning, social network analysis and mining, kernel learning, dimensionality reduction, and big data learning. He has authored or coauthored over 80 journal papers in the areas of soft computing, data mining, machine intelligence, and multimedia.

Hisao Ishibuchi (M'93–F'14) received the B.S. and M.S. degrees in precision mechanics from Kyoto University, Kyoto, Japan, in 1985 and 1987, respectively, and the Ph.D. degree from Osaka Prefecture University, Osaka, Japan, in 1992. Since 1999, he has been a Full Professor with Osaka Prefecture University. His current research interests include artificial intelligence, neural fuzzy systems, and data mining. Dr. Ishibuchi is on the editorial boards of several journals, including the IEEE Transactions on Fuzzy Systems and the IEEE Transactions on Systems, Man, and Cybernetics, Part B.

Zhaohong Deng (M'12–SM'14) received the B.S. degree in physics from Fuyang Normal College, Fuyang, China, in 2002, and the Ph.D. degree in information technology and engineering from Jiangnan University, Wuxi, China, in 2008. He is currently an Associate Professor with the School of Digital Media, Jiangnan University. He has visited the University of California, Davis, Davis, CA, USA, and the Hong Kong Polytechnic University, Hong Kong, for over two years in total. His current research interests include neuro-fuzzy systems, pattern recognition, and their applications. He has authored or coauthored over 50 research papers in international/national journals.

Shitong Wang received the M.S. degree in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1987. He has visited London University, London, U.K.; Bristol University, Bristol, U.K.; Hiroshima International University, Hiroshima, Japan; Osaka Prefecture University, Osaka, Japan; the Hong Kong University of Science and Technology, Hong Kong; and the Hong Kong Polytechnic University, Hong Kong, as a Research Scientist for over six years in total. He is currently a Full Professor with the School of Digital Media, Jiangnan University, Wuxi, China. His current research interests include artificial intelligence, neuro-fuzzy systems, pattern recognition, and image processing. He has published over 100 papers in international/national journals and has authored seven books.
