Finding Semantically Valid and Relevant Topics by Association-Based Topic Selection Model

YANG GAO, Beijing Engineering Research Center of Massive Language Information Processing and Cloud Computing Application; Beijing Institute of Technology; and Beijing Advanced Innovation Center for Imaging Technology
YUEFENG LI, School of Electrical Engineering and Computer Science, Queensland University of Technology (QUT)
RAYMOND Y. K. LAU, City University of Hong Kong
YUE XU and MD ABUL BASHAR, School of Electrical Engineering and Computer Science, Queensland University of Technology (QUT)

Topic modelling methods such as Latent Dirichlet Allocation (LDA) have been successfully applied to various fields, since these methods can effectively characterize document collections by using a mixture of semantically rich topics. So far, many models have been proposed. However, the existing models typically perform a full analysis over the whole collection to find all topics, yet find it difficult to capture coherent and specifically meaningful topic representations. Furthermore, it is very challenging to incorporate user preferences into existing topic modelling methods to extract relevant topics. To address these problems, we develop a novel personalized Association-based Topic Selection (ATS) model, which can identify semantically valid and relevant topics from a set of raw topics based on the semantic relatedness between users' preferences and the structured patterns captured in topics. The advantage of the proposed ATS model is that it enables an interactive topic modelling process driven by users' specific interests. Based on three benchmark datasets, namely RCV1, R8, and WT10G, under the context of information filtering (IF) and information retrieval (IR), our rigorous experiments show that the proposed ATS model can effectively identify relevant topics with respect to users' specific interests and hence improve the performance of IF and IR.

CCS Concepts: • Information systems → Association rules; Personalization; Document topic models; Content analysis and feature selection; • Computing methodologies → Topic modeling;

Additional Key Words and Phrases: Topic selection, topic evaluation, topic components, information filtering

ACM Reference format:
Yang Gao, Yuefeng Li, Raymond Y. K. Lau, Yue Xu, and Md Abul Bashar. 2017. Finding Semantically Valid and Relevant Topics by Association-Based Topic Selection Model. ACM Trans. Intell. Syst. Technol. 9, 1, Article 3 (August 2017), 22 pages. https://doi.org/10.1145/3094786

This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000902), Grant DP140103157 from the Australian Research Council (ARC), and the National Natural Science Foundation of China (Grant No. 61602036). Lau's work was partly supported by grants from the Research Grants Council, Hong Kong (Projects CityU 11502115 and CityU 11525716). This work was also supported by the Beijing Advanced Innovation Center for Imaging Technology (Grant No. BAICIT-2016007).
Authors' addresses: Y. Gao; email: [email protected]; Y. Li; email: [email protected]; R. Y. K. Lau; email: [email protected]; Y. Xu; email: [email protected]; Md A. Bashar; email: [email protected].
1 INTRODUCTION

Topic modelling is an unsupervised technique that provides a powerful tool for discovering and exploiting the hidden thematic structure in large archives of texts (Blei et al. 2003). It can automatically characterize a collection of documents by a mixture of topics, each of which is represented by a topic-word distribution. Topic modelling methods have been extended in many ways, such as extending the basic statistical assumptions to uncover more sophisticated latent structures in texts (Teh et al. 2006; Mccallum et al. 2009), or incorporating metadata (Barbieri et al. 2013) or external sources (Andrzejewski et al. 2009) to guide the sampling process, so that topics with higher quality are uncovered. By extending existing topic modelling methods, there is great potential to enhance various applications such as text mining, signal processing, human-machine interaction, and so on.

One fundamental problem of existing topic modelling methods is that users may find it difficult to interpret the uncovered topics and leverage relevant topics to facilitate various applications. Accordingly, there is a pressing need to develop novel methods for identifying the most semantically valid and relevant topics so that users can apply relevant topics to support their specific tasks. A topic is considered semantically valid if its topical words are related to the underlying domain. For example, for documents of the finance domain in the AP news dataset, the topical words of a semantically valid topic could include "merger, acquisition, takeover, consolidation, corporate, deal, transaction," which together represent the concept of merger and acquisition in the finance domain. But some invalid topics are often included in the same topic model, which are represented by irrelevant words such as "method, takeover, shutter, screen, camera." Much current research work is confined to modelling distributions over topics instead of identifying relevant semantic structures that can really facilitate real-world applications. For instance, various distributions (e.g., Dirichlet, Gaussian, Indian Buffet, Chinese Restaurant, etc.) are incorporated into topic models with respect to different applications. However, these heuristically predefined distributions have their underlying assumptions and characteristics, which may not fit well with different kinds of applications.

Another drawback of existing topic modelling methods is that topics are represented by a set of words with distributed probabilities. Though it is easy for machines to represent topics in such a format, humans may find it difficult to interpret such a topic representation. Recently, a pattern-based representation scheme has been applied to represent topics in a more meaningful way to enhance information filtering (IF) performance (Gao et al. 2014, 2015). Different from the traditional topic representation, the main benefit of the pattern-based representation is that it can capture the rich semantics of the underlying corpus, so it is easier for humans to interpret the uncovered topics. In this article, we further define a component space to facilitate the transformation of the word space of a topic model into a transaction space, which can represent topics by using semantic components induced from these topics. Essentially, each topical component contains groups of patterns and the associations among words. We believe that the components induced from topics can overcome the limitations of a simple "bag-of-words" representation.
When compared to the classical topic representation that is characterized by a probability distribution over words, the semantically enriched topic components can facilitate the identification of semantically valid and relevant topics with respect to users' specific information requirements. Even if a number of semantically valid topics are presented to a user, it is extremely difficult for the user to manually read through all of these topics to verify whether they are relevant, especially when there are a huge number of topics and the user is not very familiar with the problem domain. Therefore, it is essential to develop an automated method to identify relevant topics from an unlabelled textual corpus. To address this problem, we develop the novel Association-based Topic Selection (ATS) model, which can automatically identify relevant topics by considering users' specific
information preferences. More specifically, the ATS model can detect inherent semantic associations between keywords provided by users and the structural components induced from topics. According to these semantic associations, the proposed ATS model can identify relevant and semantically valid topics with respect to specific users' needs and remove irrelevant topics, so that a set of optimized topics can be applied to support real-world applications (e.g., IF).

The proposed ATS model operates in the following way. First, the classical LDA model and a pattern mining method are applied to uncover pattern-based topics from a corpus. Second, the model accepts a user's input, which we call a query, so that the semantic components of topics can be matched against the query and relevant topics identified. Third, the set of relevant topics is divided into a subset of relatively certain topics and a subset of relatively uncertain topics, according to how the structural patterns appearing in the user query match each topic. Finally, we develop an evaluation framework to verify the quality of the selected topics, which are then applied to IF tasks.

In sum, the main contributions of our research are as follows.
(1) Topical Components: Existing topic modelling methods represent topics by various distributions of words. However, such a topic representation is not human interpretable. In this article, we propose a semantically rich pattern-based topic representation. More specifically, topical components consist of primary components (i.e., patterns) and a set of relations between components (i.e., association rules).
(2) Topic Selection: Topics are structurally represented by associative patterns, which provide different levels of abstraction to facilitate user access. The proposed model enables the effective application of semantically valid and relevant topics to real-world applications with minimal user involvement.
(3) Topic-based Ranking: The relevance of topics is estimated based on a novel topic performance function, which is underpinned by different levels of topic certainty and the notion of topic significance. For the document ranking task, the proposed algorithm for exploiting semantically valid topics can enhance task-based analysis of topics.

Experimental results show the outstanding performance of the ATS model, which is verified as an effective and flexible way to systematically utilise topic models in real tasks.

2 RECONSTRUCTING COMPONENTS OF UNSUPERVISED TOPIC MODELLING

2.1 Background

Topic modelling algorithms are used to discover a set of hidden topics from collections of documents, where a topic is represented as a distribution over words. Topic models provide an interpretable low-dimensional representation of documents (i.e., with a limited and manageable number of topics). Latent Dirichlet Allocation (LDA) (Blei et al. 2003) is a typical statistical topic modelling technique and the most common topic modelling tool currently in use.

Let D = {d_1, d_2, . . . , d_M} be a collection of documents, where M is the total number of documents in the collection. For the ith word in document d, denoted as w_{d,i}, z_{d,i} is the topic assignment for w_{d,i}; z_{d,i} = Z_j means that the word w_{d,i} is assigned to topic j, and V represents the total number of topics. Let ϕ_j be the multinomial distribution over words for Z_j, ϕ_j = (φ_{j,1}, φ_{j,2}, . . . , φ_{j,n}) with Σ_{k=1}^{n} φ_{j,k} = 1. θ_d refers to the multinomial distribution of topics in document d, θ_d = (ϑ_{d,1}, ϑ_{d,2}, . . . , ϑ_{d,V}) with Σ_{j=1}^{V} ϑ_{d,j} = 1, where ϑ_{d,j} indicates the proportion of topic j in document d. LDA is a generative model in which the only observed variable is w_{d,i}, while the others are all latent variables that need to be estimated. Gibbs sampling is an effective strategy for estimating the hidden parameters (Steyvers and Griffiths 2007) and is the one used in this article. The resulting representations of the LDA model are at two levels, the document level and the collection level. Apart from these, the LDA model also generates word-topic assignments, that is, each word occurrence is associated with a topic.
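To make the notation above concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in Python; it returns the per-word topic assignments z_{d,i} that the rest of the pipeline builds on. The hyperparameters alpha and beta, the function name, and the data layout are illustrative assumptions, not the authors' released implementation.

import numpy as np

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=500, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of integer word ids.
    Returns the word-topic assignments z and the count matrices."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), num_topics))   # topic counts per document
    n_kw = np.zeros((num_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(num_topics)                 # total word count per topic
    z = [[0] * len(doc) for doc in docs]

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = rng.integers(num_topics)
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # Iteratively resample each z_{d,i} from its full conditional.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(num_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw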


Fig. 1. Visualisation of the whole procedure of generating topical components. The heatmap on the left represents an alignment between words and topics in the LDA model; the columns represent the distribution over the topics by colours. In the middle, the topic representation is extended into three dimensions, which forms a word-document-topic triple representation as the component space. In this space, each coloured dot represents a word assignment for each topic over all documents. As a result, topical components are generated, shown on the right of the figure. The size of the circle for each word w_i represents the statistical importance of this word in topic T_i. The coloured links between words indicate the associations between them.

Pattern-based representations (Gao et al. 2014) were considered more meaningful and more accurate for representing topics than word-based representations (Blei et al. 2003). In the previously proposed model, patterns can uncover groups of word combinations in a topic, but cannot find word associations and more complex relations within the topic. Therefore, in this article, we reconstruct the components of each topic so that they contain structural information that can reveal the associations between words. In order to discover these underlying structures in topics and documents, we transform word-topic pairs into word-document-topic triples. Based on this triple relationship, we construct a component space to facilitate the extraction of associations for topics. Generally, two steps are involved: first, construct a new component space from the LDA results for the document collection D; second, generate topical components from the component space, as shown in Figure 1.

2.2 Construct Component Space

In addition to the standard results of topic models, such as the topic distribution over documents and the word distribution over topics, word assignments are also available after the iterative topic learning process. Therefore, each word occurrence is assigned to a specific topic. Specifically, let R_{d_i,Z_j} represent the word-topic assignment to topic Z_j in document d_i; R_{d_i,Z_j} is a sequence of words assigned to topic Z_j. Let I_{ij} be the set of words that occur in R_{d_i,Z_j}, I_{ij} = {w | w ∈ R_{d_i,Z_j}}, that is, I_{ij} contains the words that are in document d_i and assigned to topic Z_j by LDA. I_{ij}, called a topical document transaction of document i for topic j, is a set of words without any duplicates. Let D = {d_1, . . . , d_M} be the original document collection; from all the word-topic assignments R_{d_i,Z_j} to Z_j, i = 1, . . . , M, we can construct a component space Γ_j, where the component space Γ_j for topic Z_j is defined as Γ_j = {I_{1j}, I_{2j}, . . . , I_{Mj}}. From this component space of the topic, we can easily find the word occurrences in documents as well as within the same topic.
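As a small illustration of this construction (under the same assumptions as the sampler sketch above; the function name is hypothetical), the topical document transactions I_{ij} and the component spaces Γ_j can be derived directly from the word-topic assignments:

def build_component_spaces(docs, z, num_topics):
    """Group word-topic assignments into component spaces.
    Returns gamma[j] = [I_1j, ..., I_Mj], where I_ij is the set of words in
    document d_i that were assigned to topic Z_j (duplicates removed)."""
    gamma = {j: [] for j in range(num_topics)}
    for d, doc in enumerate(docs):
        transactions = {j: set() for j in range(num_topics)}
        for w, k in zip(doc, z[d]):
            transactions[k].add(w)   # word w of d_i was assigned to topic k
        for j in range(num_topics):
            gamma[j].append(transactions[j])
    return gamma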


Table 1. Component Space of Topic 7 in Collection 125 of RCV1

Doc.   growth   scottish   scotland   power   nationalist   · · ·   country   british   independ
d1       1         0          0          1          1                   0         0         0
d2       0         1          1          1          0                   0         0         0
d3       0         1          1          1          1                   1         0         1
d4       0         1          1          0          0                   0         1         0
d5       1         0          1          1          1                   0         0         1

For the topics in the collection D, we can construct V component spaces (Γ_1, Γ_2, . . . , Γ_V). For example, in collection 125 of the RCV1 dataset, we create a component space for topic 7 as shown in Table 1. The occurrences of words assigned to topic 7 in each document are recorded in the transactions.

2.3 Generate Topical Components

Based on the discovered component space, we intend to discover more complex relations and word associations in the topic, using pattern mining techniques. In this article, association rule mining is leveraged to uncover the hidden structures in topics. Here, patterns and association rules are defined as follows:
— A pattern X is a combination of individual terms, that is, X = {x_1, . . . , x_b} is a set of terms; b is the length of the pattern X, where b = 1 is the special case in which X is an individual word.
— An association rule is an implication of the form X ⇒ Y, where X and Y are disjoint itemsets (we also call them patterns), that is, X ∩ Y = ∅. X is called the antecedent and Y the consequent; the rule means that X implies Y. The strength of an association rule can be measured in terms of its support and confidence.
— Support determines how often the rule (X ⇒ Y) is applicable to a given component space, denoted as supp; the relative support of the rule is the percentage of transactions that contain both X and Y.
— Confidence determines the degree of interest, or the strength, of the association of the rule, denoted as conf; the confidence of the rule is the ratio between the support of the rule (X ⇒ Y) and the support of X.
We need to mention that confidence measures the reliability of the inference made by a rule, which provides the theoretical foundation for the proposed topic selection model. For a given rule X ⇒ Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X, and the stronger the relation from X to Y. An association rule suggests a co-occurrence relationship between items in the antecedent and consequent of the rule (Tan and Kumar 2005). For a topic Z_j, we define L_j to represent all the existing complex and compound association rules in the topic as follows.
— L_j, a set of strengths of discovered association rules for topic Z_j: From each component space Γ_j for topic Z_j, we can generate a set of association rules that satisfy the predefined minimum support σ and confidence η for the given component space. Here, for the given minimal support threshold σ, an itemset X in Γ_j is frequent if supp(X) >= σ, where supp(X) is the support of X, that is, the number of transactions in Γ_j that contain X (X is then a


Fig. 2. Samples of components in topic 7 of the collection 125.

frequent pattern (Han et al. 2007; Tan and Kumar 2005) in Z_j). The frequency (also called relative support) of the itemset X_{ij} is defined as f_{ij} = supp(X_{ij}) / |Γ_j|. For simplicity, topic Z_j can be represented by a set of closed patterns (Han et al. 2007; Tan and Kumar 2005), denoted as XZ_j = {X_{1j}, X_{2j}, . . . , X_{m_j j}}, in which m_j is the total number of patterns in XZ_j and each X_{ij} in XZ_j is a unique pattern with corresponding weight f_{ij}. For the given minimal confidence threshold η, an association X_{kj} ⇒ X_{hj} in Γ_j is accepted if supp(X_{kj} ∪ X_{hj}) / supp(X_{kj}) >= η, and the strength of the association is denoted as l_{kh}.
Based on the above definitions, a topic consists of a set of primary components (i.e., closed patterns (Han et al. 2007; Tan and Kumar 2005)) with corresponding weightings, and also a set of relations between components (i.e., association rules X ⇒ Y, where X and Y are frequent patterns). These primary components and the rules are called the Topical Components of this topic. In summary, we use R_j to represent the components of topic j: R_j = <XZ_j, F_j, L_j>, where XZ_j is the set of patterns in topic j, F_j is the set of pattern weightings, and L_j is the set of associations between patterns in the topic.

We can see that the components in a topic are rich and of different types. After the two-step operation, the reconstructed topic model has a richer and more in-depth structural representation for each topic. Therefore, we call R = <X, F, L> the topical components of a topic. For example, the topical components of topic 7 in collection 125 are displayed in Figure 2, following the example from Table 1.

3 TOPIC SELECTION MODEL

In the reconstructed model, the components of every topic are enriched with underlying associative structures, compared with traditional simple word spaces. These structures can, to some extent, remove the random noise that derives from the distribution-based assumptions of the LDA model. However, in most topic models, latent topics rarely exist explicitly but are created over a pre-defined number of dimensions. The topics can be biased, and not all of them are necessarily of good quality or interesting to users. In this section, we propose the ATS model to identify the most relevant topics by incorporating users' light inputs. In this approach, the topical components build trustworthy and strong relations between the users' inputs and the inside of each topic; the relevant topics can then be chosen from the originally discovered topics and differentiated by varied certainties with respect to the users' interests. In real cases, users' light inputs, such as tags, click preferences, or chosen categories/titles, are easily obtained. However, it is expensive and difficult to fully understand users' preferences. The gap is that users' inputs can be dramatically diverse; as a result, the real user preference aspect can hardly be specified by such minimal priors.


Therefore, in this article, we intend to utilise the user's inputs (i.e., queries) and further select the truly relevant topics, leveraging the topical components.

3.1 Relevant Topic

In this article, the query is formally represented by a set of terms that can be user tags, categories, or summaries, denoted as S = {s_1, s_2, . . . , s_n}, where s_k is one of the terms in the query S, k = 1, 2, . . . , n. Based on the rules discovered from the pattern-based topics {XZ_1, XZ_2, . . . , XZ_V} (i.e., X ⇒ Y, the relations between components in a topic introduced in Section 2.3), for a given query S = {s_1, s_2, . . . , s_n} we can discover the connections between the query terms and the patterns in topics, and thus find the relevant topics. The rationale behind using pattern-based topic models is that topical patterns contain stronger relations among words, and these associations create reliable links between the original query terms and their relevant topics. The detailed process is described as follows.
As mentioned above, the pattern-based topic is XZ_i = {X_{i1}, X_{i2}, . . . , X_{im_i}} for topic Z_i, in which the pattern X_{ij} = {x_{ij}^1, x_{ij}^2, . . . , x_{ij}^{l_{ij}}} is a set of terms, l_{ij} is the length of the pattern X_{ij}, and supp(X_{ij}) is the support of X_{ij}. The relevant topics of the query S can be discovered in the following steps:
(1) If a term x_{ij}^k ∈ X_{ij}, k ∈ {1, . . . , l_{ij}}, satisfies s_k = x_{ij}^k, that is, s_k ∈ X_{ij}, and s_k ⇒ X_{ij}\s_k is a rule in R_j as defined in Section 2.3, then topic Z_j is considered a relevant topic of s_k, and the pattern X_{ij} is a relevant candidate for s_k. The set of relevant topics of the term s_k, denoted as RT_{s_k}, is defined as

RT_{s_k} = {Z_j | ∃(s_k ⇒ X_{ij}\s_k) ∈ R_j, s_k ∈ X_{ij}}.   (1)

(2) The set of relevant topics for a query S is defined as

RT_S = ∪_{k=1}^{n} RT_{s_k}.   (2)

For a term s_k, there can be many relevant candidate patterns in XZ_j that make topic Z_j a relevant topic of s_k. Let X_j^{s_k} be the set of relevant candidates in XZ_j for s_k; X_j^{s_k} can be used to represent topic Z_j in terms of s_k and is defined below:

X_j^{s_k} = {X | X ∈ XZ_j, ∃(s_k ⇒ X\s_k) ∈ R_j, s_k ∈ X}.   (3)

For each pattern X_{ij} ∈ X_j^{s_k}, X_{ij} is a relevant pattern, and the relevance of X_{ij} to s_k with respect to topic Z_j is defined as

f_{s_k}^j(X_{ij}) = f_{ij},   (4)

where f_{ij} is the weighting of the pattern X_{ij} in topic Z_j. The relevance f_{s_k}^j(X_{ij}) will be used to determine the relevance of a document to a query in the evaluation stage, which is discussed in Section 5.

3.2 Topical Relatedness

The structural components in topics create a reliable "word-association-topic" triple relation. Strong associations between words can extend the user-specific information into a more meaningful and complete structure; in this way, they help users find useful topics. However, when associated with different words, the same word can often represent different topics. For example, "south" in "south Africa" refers to a country name, but in "south west" the word "south" refers to a direction. As a result, not all the relevant topics generated using Equation (2) can "equally" represent


the user's real interests simply based on word-topic relations. This is also the main problem of traditional topic models, in which each topic is represented by individual words. Normally, the more prior words the user provides, the more concrete the meaning available to the selection process, and thus the more certain the meaning expressed by the selected topics. Therefore, in this article, we select relevant topics not only by detecting relations between query terms and topics, but also by finding more certain relations between query patterns and topics. Topic relatedness analysis is a process that helps differentiate relevant topics with different certainties. In this section, we will optimise the selected relevant topics, RT_S, for the query S, and define a level of certainty for these relevant topics: certain topics and uncertain topics.
— Certain Topics. A relevant topic Z_j is considered one of the user's certain topics if it meets the following condition:
— Z_j is a common relevant topic of a pattern in S, that is, there exists a pattern X' = {s_h, s_k} that satisfies ∃s_h ∈ S, Z_j ∈ RT_{s_h}, and ∃s_k ∈ S, Z_j ∈ RT_{s_k}, k ≠ h. Formally, the set of certain topics of the query S, denoted as T_S^c, is defined by Equation (5):

T_S^c = {Z_j | Z_j ∈ RT_{s_k} ∩ RT_{s_h}, ∃k, h ∈ {1, . . . , n}, and k ≠ h}.   (5)

The set of certain topics can be considered the topics closest to the user's interests, because they are related to a pattern from the query in the user-specified information. This feature is very important, because two or more words can form stronger patterns than single words. A pattern consisting of multiple words can be considered a "user-specific pattern." It is because of the "user-specific pattern" that the topics in T_S^c represent the user's interests with more certainty.
— Uncertain Topics. The relevant topics other than the certain topics in T_S^c are considered a set of uncertain topics, T_S^u:

T_S^u = RT_S \ T_S^c.   (6)

A relevant topic in the set of uncertain topics contains only one original term of the query; that is, it satisfies the following condition:
— Z_j is a relevant topic of exactly one term in S, that is, ∃!s_k ∈ S, Z_j ∈ RT_{s_k}.
Let RT_{s_k}^c be the set of certain topics of s_k and RT_{s_k}^u be the set of uncertain topics of s_k:

RT_{s_k}^c = {Z | Z ∈ RT_{s_k}, Z ∈ T_S^c},   (7)

RT_{s_k}^u = {Z | Z ∈ RT_{s_k}, Z ∈ T_S^u}.   (8)

Topics other than those in RT_S are considered irrelevant topics in the topic model for the document collection. Term-based topic models have been extended with domain-related knowledge (Chen and Liu 2014), lexical priors (Jagarlamudi et al. 2012), or other supervised information. However, all of the useful topics they select are based on the "word-topic" relationship between the prior information and the topics. In contrast, the proposed model enables patterns in the query to connect with patterns in the selected topics, which establishes a stronger link between the query and the chosen topics. This is also the main reason why it is convincing to divide the relevant topics into certain topics and uncertain topics in the proposed ATS model. To describe this process clearly, we formalize it in two algorithms: the Reconstructing Components Algorithm and the ATS Algorithm.

Finding Semantically Valid and Relevant Topics by ATS Model

3:9

ALGORITHM 1: Reconstructing Components
Input: a collection of training documents D; minimum support σ_j and minimum confidence η_j as thresholds for topic Z_j; the number of topics V
Output: Topical components in the collection, R_1, . . . , R_V
1: Generate the topic representation ϕ and word-topic assignments z_{d,j} by applying LDA to D
2: for each topic Z_j ∈ {Z_1, . . . , Z_V} do
3:   Construct the component space Γ_j based on ϕ and z_{d,j}
4:   Generate the pattern-based topic representation XZ_j; for each pattern X ∈ XZ_j, f_{ij} = supp(X) / |Γ_j| > σ_j
5:   for each pattern X_{ij} ∈ XZ_j do
6:     for each sub-pattern X_k ⊂ X_{ij} do
7:       an association X_k ⇒ X_{ij}\X_k is accepted if supp(X_{ij}) / supp(X_k) >= η_j, with the strength of the association denoted as l_{ikj}
8:     end for
9:   end for
10:  Generate the topical components R_j = <XZ_j, F_j, L_j> for topic Z_j, where F_j is the set of the patterns' relative supports and L_j is the set of associations between patterns in the topic
11: end for

ALGORITHM 2: ATS Algorithm
Input: Topical components in the collection, R_1, . . . , R_V; a query S = {s_1, s_2, . . . , s_n}
Output: Set of certain topics for query S, T_S^c; set of all related topics for the query S, RT_S; set of relevant candidate patterns in XZ_i for s_k, X_i^{s_k}
1: for each s_k ∈ S do
2:   RT_{s_k} := ∅
3:   for each topic Z_i ∈ {Z_1, . . . , Z_V} do
4:     X_i^{s_k} := ∅
5:     for each pattern X_{ij} ∈ XZ_i do
6:       scan the pattern and find s_k = x_{ij} for some x_{ij} ∈ X_{ij}
7:       if (s_k ⇒ X_{ij}\s_k) ∈ R_i then
8:         RT_{s_k} = {Z_i} ∪ RT_{s_k}
9:         X_i^{s_k} = {X_{ij}\s_k} ∪ X_i^{s_k}
10:      end if
11:    end for
12:  end for
13: end for
14: RT_S = ∪_{k=1}^{n} RT_{s_k}
15: T_S^c = {Z_i | Z_i ∈ RT_{s_k} ∩ RT_{s_h}, ∃k, h ∈ {1, . . . , n}, k ≠ h}
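The two algorithms translate almost directly into Python. The sketch below mirrors Algorithm 2 together with the certain/uncertain split of Section 3.2; it assumes the accepted single-antecedent rules of each topic have already been extracted from R_1, . . . , R_V, and the data layout and names are illustrative rather than the authors' code.

from itertools import combinations

def ats_select(topic_rules, query):
    """topic_rules: {topic_id: list of (antecedent_term, consequent_pattern) pairs},
    one pair per accepted rule s => X_ij \ {s} in R_j.
    query: the list of query terms S.
    Returns RT_S, the certain topics T_S^c, the uncertain topics T_S^u,
    and the relevant candidate patterns X_j^{s_k} per term."""
    rt = {s: set() for s in query}           # RT_{s_k}
    candidates = {s: {} for s in query}      # X_j^{s_k}
    for s in query:
        for j, rules in topic_rules.items():
            cons = [c for a, c in rules if a == s]
            if cons:                         # a rule s => X_ij \ {s} exists in R_j
                rt[s].add(j)
                candidates[s][j] = cons
    rt_all = set().union(*rt.values()) if rt else set()   # Equation (2)
    certain = set()                                        # Equation (5)
    for sk, sh in combinations(query, 2):
        certain |= rt[sk] & rt[sh]
    uncertain = rt_all - certain                           # Equation (6)
    return rt_all, certain, uncertain, candidates

For the example discussed in Section 3.3, a call such as ats_select(components, ["scottish", "independ"]) would be expected to return topic 7 as a certain topic and topic 1 as an uncertain topic.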


Fig. 3. An example of topic selection and topic relatedness in collection 125 of the RCV1.

3.3 Example

The query in collection 125 of RCV1 is "Scottish Independence," and the collection discusses how the Scottish people have been pushing for independence from Great Britain. In the training dataset, a collection of 36 documents without any relevance judgements is used to train the LDA model and generate a topic model with 10 topics. The experimental settings are introduced in Section 5.5. Figure 3 presents the process of selecting certain topics from all topic candidates and determining semantically related patterns. The association rules in R satisfy minimum support σ = 0.2 and minimum confidence η = 0.3. According to Equation (1), take the pattern "Scotland Scottish" in topic 7 as an example: its support is 0.40625 and the confidence of "Scottish ⇒ Scotland" is 0.812 (as shown in Figure 2), which satisfies the minimum support and confidence. In this way, the related topics RT_{Scottish} = {Z_7} and RT_{Independence} = {Z_7, Z_1} are found, and patterns such as "Scottish Scotland British" and "Scotland nationalist independence" are chosen as the related patterns. Since topic 7 contains patterns that are related to two different query terms, topic 7 is a certain topic and topic 1 is an uncertain topic; the other topics uncovered by LDA are irrelevant. As shown in Figure 3, topic 1 conveys the notions of conferring and opposition, while topic 7 expresses a meaning closer to the title "Scottish Independence." Instead of using all patterns generated in the topics, we select the more relevant patterns within topics as the topic representation.

4 TOPIC QUALITY METRICS

We use the following two measures, coherence and randomness, to evaluate the quality of topics from different perspectives. With topics represented by their top words, topic quality is evaluated both in terms of how coherently these words convey a concentrated topic (i.e., coherence) and how stably a specific topic is represented under different iteration settings (i.e., randomness). To measure topic coherence quantitatively, we adapt the measure proposed by Mimno et al. (2011), an intrinsic measure that computes the coherence of topics using the document frequency of the top M most probable terms of a specific topic. Specifically, the topic coherence


Fig. 4. Coherence (left): The coherence value in 10 topics that are derived from the LDA model and the pattern enhanced topic model, respectively, both in collection 125 of RCV1. Randomness (right): The average changes of topic representations between two iterations in the collection 125 of RCV1, using the LDA and pattern enhanced topic representations. The topic number is 10 in this dataset.

for topic k is given by

coherence_k = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(w_m^k, w_l^k) + 1}{D(w_l^k)},   (9)

where w_m^k and w_l^k are the mth and lth most probable terms within topic k, D(w_m^k, w_l^k) is the co-document frequency of terms w_m^k and w_l^k, and D(w_l^k) is the number of documents containing term w_l^k. A smoothing count of 1 is included to avoid taking the logarithm of zero. We calculate the coherence of the 10 topics in collection 125 of the RCV1 dataset and show the results for the LDA model and our proposed topic model in Figure 4.
Statistics in Choo et al. (2013) showed that the word-topic and topic-document assignments of the LDA model do not fully converge or stabilise, due to the nature of the sampling-based LDA algorithm; we denote this as randomness, meaning randomly differing results on the same collection of data. To be specific, in this article we formulate the randomness of a topic by averaging the changes of the top words in the topic representations between the results of different iteration settings on the semantically same topic from one collection of documents (i.e., collection 125 of RCV1; the details of the RCV1 dataset are described in Section 5.3). The formulation is as follows:

randomness = \frac{1}{V} \sum_{k=1}^{V} \text{different}(k),   (10)

where different(k) is the number of words that differ in the semantically similar topic between the results of two consecutive iteration settings. In the figure, we run 500 to 4500 iterations on collection 125 of RCV1 and calculate the randomness of the LDA model and the pattern-enhanced topic model for each iteration setting. As the figure shows, the coherence of the topical components is superior to that of the LDA topics, and their randomness is relatively more stable than that of the corresponding LDA topics across the different iteration settings.
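Both metrics are straightforward to compute from the collection and the top-word lists. The following is a minimal sketch of Equations (9) and (10) with illustrative names; it assumes topics from consecutive runs have already been matched to one another.

import math

def topic_coherence(top_words, docs):
    """Equation (9). top_words: the M most probable terms of one topic.
    docs: the collection, each document given as a set of terms."""
    def df(*terms):                      # (co-)document frequency
        return sum(1 for d in docs if all(t in d for t in terms))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((df(top_words[m], top_words[l]) + 1) / df(top_words[l]))
    return score

def randomness(prev_topics, curr_topics):
    """Equation (10). Average number of top words that differ between the
    matched topics of two consecutive runs."""
    diffs = [len(set(p) - set(c)) for p, c in zip(prev_topics, curr_topics)]
    return sum(diffs) / len(prev_topics)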


5 EVALUATION

In topic models, developing an effective evaluation method is one of the most important tasks, as highlighted in Blei (2012). Open questions are how to evaluate the quality of topics and how these topics are used. To answer these questions, in this article we utilise the topic model by selecting the topics most interesting to users, instead of accepting all topic distributions. The evaluation verifies the quality of topics after applying the ATS model and whether users are interested in the selected topics. Topic significance is defined to estimate topical quality by calculating the importance of all the "user-specific patterns" in a certain topic. In order to effectively apply topic significance when estimating the relevance between documents and the users' interests in the selected topics, only the most representative patterns (maximum matched patterns) from the topic are used. Therefore, the more relevant documents are retrieved for the users, the higher the quality of the selected topics. The basic idea is that the topic that contains the most specific, coherent, and relevant patterns is the semantically valid topic. In the following sections, we statistically compute topic performance in the real applications of information filtering (IF) and information retrieval (IR), which provides an application-oriented way to evaluate the quality of the chosen topics.

5.1 Hypothesis

In order to investigate the effectiveness of the proposed topic selection model in selecting semantically valid and relevant topics, we conduct comprehensive experiments in an IF scenario. The proposed model is examined under the following three hypotheses:
— H1: The ATS model is effective in identifying semantically valid and relevant topics.
— H2: Topics further differentiated by level of certainty can enhance the accuracy of topic selection.
— H3: Unsupervised topic modelling underpinned by the proposed ATS model can be an effective solution for IF and IR systems.

5.2 Document Ranking Based on Topical Component Structures

Topic significance was proposed and successfully applied in Gao et al. (2015); it considers both pattern specificity and patterns' statistical significance. Let d be a document, X_i^d be one of the matched patterns such that X_i^d ∈ X_j^{s_k} for topic Z_j in document d, i = 1, . . . , n_i, and f_{i1}, . . . , f_{in_i} be the corresponding supports of the matched patterns; then the topic significance of Z_j to d is defined as

sig(Z_j, d) = \sum_{i=1}^{n_i} spe(X_i^d) \times f_{s_k}^j(X_{ij}) = \sum_{i=1}^{n_i} a |X_i^d|^m \times f_{ij},   (11)

where m is the scale of pattern specificity (we set m = 0.5) and a is a constant real number (in this article, we set a = 1). X_j^{s_k} can be used to represent topic Z_j in terms of s_k, as illustrated in Equation (3), and n_i is the number of patterns in X_j^{s_k}. In an IF environment, incoming documents can be modelled by document relevance ranking, which represents their relevance to the user's interests. The relevance is dominantly determined by topic significance, and correspondingly the document relevance r(d, S) is defined as r(d, S) ∼ Σ_{Z_j ∈ RT_S} sig(Z_j, d). Combined with Equation (11), the document relevance is formulated in terms of the more specific components in topics. In addition, a constant parameter λ_s balances the impact of the original query terms and the associated closed patterns in all selected relevant topics. Specifically, the more related topics a query term has, the more diverse the term is, and thus the higher the diversity of the specific


query-term is. The number of relevant topics in RT_{s_k} is defined as the diversity of s_k, denoted as div(s_k). For the set of certain topics, div_{s_k}^c = |RT_{s_k}^c|. If a word has high diversity, then it is not discriminative for relevance; therefore, its importance weight should be lower than that of a word with low diversity. In detail, the relevance r(d, S) is formulated as Equation (12). If the model does not find any certain topics among the related topics (i.e., T_S^c = ∅), then the chosen topics are all the relevant topics; thus, in Equation (12), T_S^c is replaced by RT_S and RT_{s_k}^c is replaced by RT_{s_k}. The higher r(d, S) is, the more likely the document is relevant to the user's interests:

r(d, S) = \sum_{s_k \in S,\, s_k \in d} \left\{ (1 + \lambda_s \delta(s_k, d)) \, \frac{1}{div_{s_k}^c} \sum_{Z_j \in T_S^c} sig(Z_j, d) \right\}
        = \sum_{s_k \in S,\, s_k \in d} \left\{ (1 + \lambda_s \delta(s_k, d)) \, \frac{1}{|RT_{s_k}^c|} \sum_{Z_j \in T_S^c} \sum_{X_{ij} \setminus s_k \in X_j^{s_k}} |X_{ij} \setminus s_k|^m \times f_{s_k}^j(X_{ij}) \right\},   (12)

where

\delta(s_k, d) = \begin{cases} 1 & \text{if } s_k \in d \\ 0 & \text{otherwise.} \end{cases}
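A simplified sketch of the ranking function of Equations (11) and (12) is given below. It assumes the relevant candidate patterns and their weights f_ij have already been produced by the ATS step; the data layout, parameter defaults, and names are illustrative assumptions rather than the authors' implementation.

def topic_significance(matched_patterns, a=1.0, m=0.5):
    """Equation (11). matched_patterns: list of (pattern, f_ij) pairs matched in d."""
    return sum(a * (len(pattern) ** m) * f for pattern, f in matched_patterns)

def rank_document(doc_terms, query, certain, candidates, supports, lam=0.5, m=0.5):
    """Equation (12), simplified. candidates[s][j] lists the candidate patterns
    X_ij \ {s} of topic j for term s (hashable, e.g. frozensets),
    supports[(j, pattern)] gives f_ij, and `certain` is T_S^c
    (pass RT_S instead when no certain topic was found)."""
    score = 0.0
    for s in query:
        if s not in doc_terms:                 # delta(s, d) = 0: term contributes nothing
            continue
        topics = [j for j in candidates.get(s, {}) if j in certain]
        if not topics:
            continue
        inner = sum((len(p) ** m) * supports[(j, p)]
                    for j in topics for p in candidates[s][j])
        score += (1.0 + lam) * inner / len(topics)   # 1/div^c normalisation
    return score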

5.3 Data

Datasets: We used three popular datasets to test the proposed model: Reuters Corpus Volume 1 (RCV1), R8 of Reuters-21578, and WT10G from TREC data. RCV1 contains 100 collections of documents, which were developed for the TREC filtering track. In the TREC track, a collection is also referred to as a "topic." To differentiate it from a "topic" in the LDA model, "collection" is used to refer to a collection of documents in the TREC dataset. The first 50 collections were composed by human assessors and the other 50 collections were constructed artificially from intersections of collections. In this article, only the first 50 collections are used for the experiments. The R8 dataset is a widely used collection for text mining. The data was originally collected and labelled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.1 In this experiment, we picked the set of 10 classes. Following Sebastiani's convention (Debole and Sebastiani 2005), it is also called "R8," because two classes, corn and wheat, are intimately related to the class grain and were appended to class grain. WT10G consists of around 1.7 million documents, totalling 10 gigabytes, which is a relatively large test collection. WT10G is a TREC test collection in the Terabyte Track, and it contains 100 collections (topics) of documents.
User queries and document format: Documents in both RCV1 and R8 are described in XML. Documents are treated as plain text after a pre-processing step, which includes removing stop words according to a given stop-word list and stemming terms by applying the Porter stemmer. The titles in the "Topic Statements" file of RCV1 (e.g., "Economic espionage," "Scottish Independence") and the class names in R8 (e.g., "acq," "crude") are used as the users' specified queries. For WT10G, only the title portion of the TREC topics (from topic 451 to topic 550) is used to construct queries. The documents are indexed and extracted by a modified version of Indri, which is part of the Lemur Toolkit.2

1 Reuters-21578, http://www.daviddlewis.com/resources/.

2 http://www.lemurproject.org/lemur.php.
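For reference, the pre-processing described above (stop-word removal and Porter stemming) can be reproduced with NLTK along the following lines; NLTK's English stop-word list is used here as a stand-in for the list actually used by the authors.

import re
from nltk.corpus import stopwords          # requires the NLTK 'stopwords' data package
from nltk.stem import PorterStemmer

_stop = set(stopwords.words("english"))
_stemmer = PorterStemmer()

def preprocess(text):
    """Lower-case, tokenize, drop stop words, and apply the Porter stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [_stemmer.stem(t) for t in tokens if t not in _stop]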


Fig. 5. Filtering results (evaluated by MAP) and the number of selected closed patterns on RCV1 with different values of minimum confidence (minimum support σ = 0.2).

5.4 Evaluation Metrics

The effectiveness is assessed by four different measures: average precision of the top K (K = 5, 10, 20) documents, the F_β (β = 1) measure, Mean Average Precision (MAP), and 11-points. MAP measures the precision at each relevant document first and then obtains the average precision over all collections; it combines precision and overall recall to measure the performance of the models. The F-beta (F_β) measure adjusts the assessment standard of both Recall (R) and Precision (P) through a parameter β. The parameter β = 1 is used in this article, which means that precision and recall are weighted equally; therefore, F_β reduces to F_1 = 2PR / (P + R). 11-points measures the performance of different models by averaging the precisions at 11 standard recall levels (recall = 0.0, 0.1, . . . , 1.0, where "0.0" means cut-off = 1 in this article). We also used a statistical method, the paired two-tailed t-test, to analyse the experimental results and verify their significance. If the p-value associated with t is significantly low (e.g., below 0.05), the difference between the compared models is considered statistically significant.
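A minimal sketch of the ranking-based measures (MAP and F_1) used here, with illustrative names:

def average_precision(ranked_docs, relevant):
    """Average precision for one collection: ranked_docs is the ranked list of
    document ids, relevant the set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant) pairs, one per collection."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def f_beta(precision, recall, beta=1.0):
    """F_beta; with beta = 1 this is F1 = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)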