Article

Extended information inference model for unsupervised categorization of web short texts

Journal of Information Science 38(6) 512–531
© The Author(s) 2012
Reprints and permission: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165551512448985
jis.sagepub.com

Tao Xu Systems Engineering Institute, Xi’an Jiaotong University, China

Qinke Peng Systems Engineering Institute, Xi’an Jiaotong University, China

Abstract
Traditional text-processing methods suffer significant performance degradation when applied to web short texts, owing to inherent characteristics of such texts: feature sparseness, a lack of sufficient hand-labelled training examples, domain dependence, and asyntactic expression. In this paper we propose a modified information inference model that can mimic human cognitive behaviour to categorize various web short texts in an unsupervised manner. The model is based on conceptual space theory and the hyperspace analogue to language (HAL) model, and it is novel in that it combines domain-specific knowledge and universal knowledge via a fusion mechanism for multiple HAL spaces. Moreover, in the realization of conceptual space, a concept is represented geometrically by a two-tuple of property sets, which effectively improves the representation accuracy of the information contained in combined concepts. Two measurements of the relationship between concepts are used to implement information inference for web short texts. The experimental evaluation of our model is conducted via three different tasks on web short-text categorization, and the results indicate the applicability and usefulness of the proposed method.

Keywords domain knowledge; information inference; short text; text categorization; unsupervised learning

1. Introduction
With the web pervading our daily lives, we spend a great deal of time every day reading and dealing with a variety of web short texts. These are typically 5–30 words long and are not required to be strictly grammatical – for example, online consumer reviews, online comments on news events, captions from search engine results, and most microblogging posts. Such web short texts usually contain a large amount of useful knowledge. Despite our efforts to cope with this information, there is still a huge gap between personal knowledge (and time resources) and the explosive growth of web short-text information. The demand for new tools that automatically categorize such massive numbers of web short texts has therefore become increasingly urgent.

Some inherent characteristics of web short texts, such as feature sparseness [1–4], labelling problems [5, 6], domain dependence [2, 7] and asyntactic expression [1], cause several difficulties for research into categorization methods. Because their short length leads to feature sparseness, commonly used text categorization methods based on a bag of words suffer significant performance degradation when applied to web short texts. In most cases, web short texts are derived from a variety of domains and possess real-time characteristics. This not only limits the application of supervised learning methods, in which a sufficient number of labelled examples is needed to train a reliable classifier, but also makes it necessary to seriously consider domain-specific background knowledge in web short-text categorization. In addition, because web short texts contain many asyntactic expressions and much noise (as they are generally posted by ordinary web users, with less expertise and a free writing style), improving the accuracy of text representation is an important step in web short-text categorization.

Corresponding author: Qinke Peng, Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, 710049, P.R. China. Email: [email protected]; [email protected]

Because web short texts are essentially a type of text data, all techniques for categorizing traditional text data can be applied to them. As stated earlier, such techniques work well on traditional long texts, but usually suffer a decline in performance when used on web short texts. This degradation can be attributed to the inherent characteristics of web short texts. To date, short-text categorization remains a tough issue that perplexes researchers, and explicitly described methods for short-text categorization are very limited in the published literature.

In order to overcome the difficulties brought about by the characteristics of web short texts, some researchers have recently proposed several new methods for short-text categorization. Zelikovitz et al. proposed an approach to short-text categorization based on latent semantic indexing (LSI), which combines training data with unlabelled test examples to improve classification performance in the process of creating a reduced vector space [8, 9]. Zelikovitz et al. also presented a WHIRL-technique-based short-text categorization method that adopts an unlabelled corpus as background knowledge to enhance classification performance [10, 11]. Turney et al. reported an unsupervised classification method for online consumer reviews based on mutual information, which uses an internet search engine to calculate the semantic relation between two phrase-level short-text segments [12, 13]. Healy et al. presented a case-based reasoning approach to short-text message classification, and investigated the effect of different features and feature representations in this classification [14]. The research of Cormack et al.
showed that the performance of a bag-of-words-based short message classifier could be markedly improved by expanding the set of features to include orthogonal sparse word bigrams, as well as character bigrams and trigrams [15]. Yan et al. proposed a dynamic assembly classification method for short-text classification that could reduce the impact of insufficient hand-labelled training data by constructing a tree-like assembly classifier [16]. In summary, by adapting to some of the inherent characteristics of short texts, these methods have improved the performance of short-text categorization to a certain degree. However, a more effective unsupervised categorization method for web short texts that comprehensively considers their various inherent characteristics is still needed.

Because understanding text is a kind of cognitive behaviour, it is essential to build a text categorization method upon the human cognitive system, as this may be closer to completely solving the problem [17]. Human beings possess a special ability to make hasty but reliable judgements about the subject (or non-subject) of terse text fragments [18]. For example, in the web page titles 'Welcome to Penguin Books' and 'Antarctic penguins', the term 'penguin' refers to two rather different concepts: the publisher Penguin Books, and the short, black bird living in the ice-cold Antarctic. The process of making such 'about' judgements has been referred to as information inference by Song and Bruza [19]. To mimic this particular ability of human beings, and assist us in managing the explosion in text information, Song and Bruza have been engaged in intensive work over the past decade [19–23]. They used the HAL model to realize a conceptual space (the middle level of a three-level cognitive model proposed by Gardenfors [25]), and they proposed an information inference model (referred to below as the 'SB model') that combines ideas of information flow [19].
The SB model was used in the fields of information retrieval and knowledge discovery, and an empirical study showed that it could successfully imitate the ability of human beings to interpret and infer rough information [19, 22, 23]. Applying an information inference mechanism at the conceptual level to categorize web short texts has several advantages. The mechanism could eliminate the negative impact caused by the feature sparseness of short texts, because it is not based on the bag-of-words model or on feature occurrence frequency. More importantly, it contributes to the realization of unsupervised short-text categorization by inferring the semantic relation between a given concept and predefined paradigm concepts. However, the direct use of the SB model to handle web short-text categorization tasks covering a wide range of domains still has some limitations, owing to some particular characteristics of such tasks. First, it is quite useful to consider domain knowledge and universal knowledge simultaneously in the information inference process [7–11, 26, 27], because there is no guarantee that domain knowledge is available or adequate for a given task. Second, the idea of information flow is a unidirectional heuristic method that aims to discover the implicit information carried by the flow, but it is still valuable to modify the SB model to enable it to measure the conceptual similarity between web short texts. Lastly, according to the definition of a concept in the SB model, a combined concept will contain a growing number of attributes as the number of concept combinations increases. Therefore improving the representation accuracy of the information contained in a combined concept is an important issue, especially in cases where concept combination must be repeated many times for relatively long web short texts. 
This paper presents a modified information inference model, and an information inference-based web short-text unsupervised categorization method that can mimic human information processing to categorize various web short texts at the level of conceptual inference. The main work of this paper is as follows:



• Because of the significant role of domain knowledge in natural-language processing, we present a fusion mechanism that combines the HAL space derived from a domain corpus with that derived from a universal corpus.
• We propose a new definition of a concept and a corresponding concept combination heuristic that enables combined concepts to contain more structured information. According to our new definition, two approaches to measuring the semantic relationship between concepts, namely the concept inclusion degree and the concept similarity degree, are redefined based on fuzzy set theory, such that the most suitable measurement can be chosen flexibly according to the specific application.
• For web short-text categorization tasks, we realize corresponding unsupervised categorization algorithms by learning from predefined paradigm concepts. This is similar to Turney's work on the classification of online consumer reviews [12, 13].

The rest of this paper is organized as follows. Section 2 gives a brief introduction to related research that constitutes the basis of our work. Section 3 describes the extended information inference model that is developed from the SB model. In Section 4, we empirically study the effectiveness of the web short-text unsupervised categorization method based on the information inference model, and Section 5 concludes this paper with ideas for future work.

2. Related work
2.1. Hyperspace analogue to language
HAL is an implementation of semantic space proposed by Lund and Burgess [24, 28]. A semantic space is one in which words are represented by points, often in a large number of dimensions; the position of each point along each axis is somehow related to the meaning of the word [29]. HAL is inspired by an intuition about the human cognitive process: a human encountering a new concept derives its meaning from accumulated experience of the contexts in which the concept appears. Following this idea, Lund and Burgess proposed a procedure by which high-dimensional semantic spaces, i.e. HAL spaces, may be automatically constructed from a text corpus. In a HAL space, every word in the vocabulary can be represented as a vector of length 2N, where N denotes the vocabulary size. In follow-up research on HAL, Azzopardi transformed the HAL model into a probabilistic framework, proposing the probabilistic hyperspace analogue to language (pHAL), which was applied in information retrieval [30]. Empirical results showed that pHAL is a competitive alternative to the original HAL. More recently, Yeh improved HAL by adding close temporal association, and used this modified model in the information retrieval system of a mental disease forum [31]. Another typical implementation of semantic space is latent semantic analysis (LSA), which could be substituted for HAL in our work [32, 33]. As our previous experiments showed that using LSA to realize a conceptual space was less effective than HAL, we do not discuss the details of LSA here.

2.2. Conceptual space Gardenfors (2000) proposed a cognition model that separates the cognitive process into three levels: symbolic, conceptual, and associationist [25]. Figure 1 presents the three-level model of cognition. The symbolic and associationist levels

Figure 1. Gardenfors' three-level model of cognition.


correspond to the symbolic and connectionist approaches respectively, which currently dominate cognitive science, and the conceptual level is a bridge connecting the symbolic level with the associationist level. Consequently, the conceptual level, as a meso level, is the key to Gardenfors' three-level model of cognition. The representation of information varies greatly across the different levels. The theory of conceptual space is introduced as a knowledge representation scheme to support reasoning at the conceptual level. As an alternative to both symbolic and connectionist knowledge representation, conceptual space represents knowledge using a geometric structure. A conceptual space is a set of quality dimensions with a geometric or topological structure for one or more domains. Domains are represented by sets of integral dimensions, which are distinguishable from all other dimensions. Every instance of the corresponding category is represented as a point in conceptual space, and the semantic relationship between two concepts is measured naturally as an inverse function of the distance between them. This computability is one of the major advantages of the conceptual space representation. In addition, well-defined concept operations can create new combined concepts, which provide a basis for mimicking concept learning and concept inference in the actual human cognitive process.

2.3. Information inference model (SB model)
Inspired by the conceptual space model, Song and Bruza proposed an information inference model. This has allowed the realization of some mechanisms in conceptual space theory, such as representing knowledge via geometric structures, concept combination to derive new, complex concepts, and measuring the semantic relationships between concepts [19]. The SB model initializes the conceptual space by employing HAL spaces, which derive a vector representation of each concept. In addition, the SB model features a concept combination heuristic and an information flow calculation for measuring the semantic relationship between concepts. The former imitates the human process of learning new knowledge, and the latter underpins information inference at the symbolic level.

2.3.1. HAL-based concept geometrical representation. The meaning of each word in HAL space is represented by a weighted vector over other words. Analogous to the form of HAL, the SB model formally defines concept $c_i$ as a weighted vector $c_i = \langle w_{c_i p_1}, w_{c_i p_2}, \ldots, w_{c_i p_n} \rangle$, where $p_1, p_2, \ldots, p_n$ denote the attributes of concept $c_i$, $P(c_i) = \{p_1, p_2, \ldots, p_n\}$ denotes the attribute set of $c_i$, $w_{c_i p_j}$ denotes the weight of attribute $p_j$ in the vector representation of $c_i$, and $n$ is the dimension of the HAL space. Obviously, the vector representation of concept $c_i$ in conceptual space can be derived directly from the vector representation of word $c_i$ in HAL space.

2.3.2. Concept combination. The ability to combine multiple concepts and understand new combined concepts is a remarkable feature of human cognitive behaviour. The SB model developed a heuristic method of concept combination that generates combined concepts from the respective geometric representations. Consider two concepts $c_1 = \langle w_{c_1 p_1}, w_{c_1 p_2}, \ldots, w_{c_1 p_n} \rangle$ and $c_2 = \langle w_{c_2 p_1}, w_{c_2 p_2}, \ldots, w_{c_2 p_n} \rangle$, where concept $c_1$ is assumed to dominate concept $c_2$. The resulting combined concept is denoted by $c_1 \oplus c_2$. The heuristic for concept combination proposed by the SB model can be described as follows.

Step 1: Reweight $c_1$ and $c_2$ so that the attributes of the dominant concept $c_1$ receive higher weights:

$$w_{c_1 p_i} = \ell_1 + \frac{\ell_1 \, w_{c_1 p_i}}{\max_{1 \le k \le n} w_{c_1 p_k}}, \qquad w_{c_2 p_i} = \ell_2 + \frac{\ell_2 \, w_{c_2 p_i}}{\max_{1 \le k \le n} w_{c_2 p_k}}, \qquad \ell_1, \ell_2 \in (0.0, 1.0] \text{ and } \ell_1 > \ell_2.$$

Step 2: Strengthen the weights of attributes appearing simultaneously in $c_1$ and $c_2$ via multiplication by a factor $\alpha > 1.0$:

$$\forall (p_i \in P(c_1) \wedge p_i \in P(c_2)): \quad w_{c_1 p_i} = \alpha \, w_{c_1 p_i}, \quad w_{c_2 p_i} = \alpha \, w_{c_2 p_i}.$$

Step 3: Calculate the weight of each attribute in the combined concept $c_1 \oplus c_2$ via vector addition:

$$w_{(c_1 \oplus c_2) p_i} = w_{c_1 p_i} + w_{c_2 p_i}.$$

Step 4: Normalize the vector representation of $c_1 \oplus c_2$. The resulting vector can then be considered a new concept and can, in turn, be combined with other concepts by applying the same heuristic.
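The four steps above can be sketched in Python, representing each concept as a sparse dict from attribute to weight. The parameter values ℓ1 = 0.5, ℓ2 = 0.3 and α = 2.0 are illustrative choices only; the SB model requires merely ℓ1, ℓ2 ∈ (0, 1] with ℓ1 > ℓ2 and α > 1. The penguin example concepts are likewise hypothetical.

```python
import math

def sb_combine(c1, c2, l1=0.5, l2=0.3, alpha=2.0):
    """Combine concept c1 (dominant) with c2 using the SB heuristic.
    Concepts are sparse dicts mapping attribute -> weight."""
    # Step 1: reweight, boosting the dominant concept's attributes more.
    m1, m2 = max(c1.values()), max(c2.values())
    w1 = {p: l1 + l1 * w / m1 for p, w in c1.items()}
    w2 = {p: l2 + l2 * w / m2 for p, w in c2.items()}
    # Step 2: strengthen attributes shared by both concepts.
    for p in w1.keys() & w2.keys():
        w1[p] *= alpha
        w2[p] *= alpha
    # Step 3: vector addition over the union of attributes.
    combined = {p: w1.get(p, 0.0) + w2.get(p, 0.0) for p in w1.keys() | w2.keys()}
    # Step 4: normalize to unit length.
    norm = math.sqrt(sum(v * v for v in combined.values()))
    return {p: v / norm for p, v in combined.items()}

# A shared attribute such as 'bird' ends up dominating the combination.
penguin_animal = {"bird": 0.9, "antarctic": 0.5, "ice": 0.4}
penguin_books = {"bird": 0.2, "publisher": 0.8, "books": 0.7}
combined = sb_combine(penguin_animal, penguin_books)
```

Note how step 2 makes the shared attribute the strongest dimension of the result, which is the intended behaviour: attributes supported by both constituent concepts carry the combined meaning.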


2.3.3. Information inference via information flow. The SB model realizes an information inference mechanism at the symbolic level by employing the theory of information flow (proposed by Barwise and Seligman [34]) and a HAL-based geometric representation of concepts. The information inference between two concepts is formalized as

$$c_i \vdash c_j \quad \text{iff} \quad \mathrm{degree}(c_i \triangleleft c_j) \ge \lambda$$

where $c_i$ and $c_j$ denote the conceptual representations of tokens $i$ and $j$, and $\lambda$ is a threshold value. The left-hand side of this formula states that the information contained in concept $c_i$ carries the information contained in concept $c_j$. The $\mathrm{degree}(\cdot)$ function on the right-hand side computes the semantic entailment relation between concepts as the ratio of the weight of the quality properties shared by $c_i$ and $c_j$ to the total weight of the quality properties in the source $c_i$:

$$\mathrm{degree}(c_i \triangleleft c_j) = \frac{\sum_{p_l \in (QP_\mu(c_i) \cap QP_0(c_j))} w_{c_i p_l}}{\sum_{p_k \in QP_\mu(c_i)} w_{c_i p_k}}$$

where $QP_\mu(c) = \{p_i \mid w_{c p_i} > \mu\}$, $\mu \in [0, 1]$, denotes the quality-property set of concept $c$.



3. Extended information inference model The information inference model presented in this paper is also based on the conceptual space theory presented by Gardenfors, and HAL is used to initialize the conceptual space. Similar to the SB model, the present model focuses on methods for conceptual space initialization, concept combination, and semantic measurements between concepts. In order to take into account both domain knowledge and universal knowledge in the initialization process, we propose a fusion mechanism for HAL space that integrates a range of knowledge contained in multiple HAL spaces. Unlike the SB model, the formal definition of a concept in the present model consists of two parts: the upper bound of a property set and the lower bound of a property set. The new concept definition allows us to capture the structured information contained in a combined concept more accurately. Under this new definition, we define two measurements of the semantic relationship between concepts according to the method of processing uncertain information in fuzzy set theory. Our measurements are the concept inclusion degree and the concept similarity degree. By defining two different measurements, we ensure that a suitable measurement can be selected according to the specific categorization task. Our method captures the essence of the human cognitive process, which has various complementary ways to infer the relation between concepts.

3.1. HAL space construction and fusion
3.1.1. HAL construction. From Burgess and Lund's research [24, 28], HAL is known to be a high-dimensional semantic space derived from a corpus of text. The HAL space represents each word in a vocabulary using a vector in which each dimension denotes the weight of a word appearing in the context of the target word. The process of automatically constructing a HAL space from a given text corpus can be described as follows. A sliding window of length $K$ is moved across the text corpus in one-word increments, ignoring punctuation, sentence and paragraph boundaries. All words $w_{i+1}, w_{i+2}, \ldots, w_{i+K-1}$ within the window are considered to co-occur with the first word $w_i$, with strengths inversely proportional to the distance between them: that is, the weight between $w_i$ and $w_{i+j}$ is calculated as $K - j$. After the window has been moved in one-word increments over the whole corpus, the HAL space, an accumulated co-occurrence matrix for all words in the target vocabulary, is produced. The resulting HAL space is an $N \times N$ matrix, where $N$ denotes the vocabulary size. Figure 2 presents the HAL space for the example text 'If I get opportunity, I will work hard.'

Figure 2. An example of a HAL space (K = 5).
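The sliding-window construction, the row/column folding used by the SB model, and the cosine normalization of equation (1) in Section 3.1 can be sketched as follows; tokenization is reduced to whitespace splitting for brevity, and all names are our own.

```python
import math
from collections import defaultdict

def build_hal(tokens, K=5):
    """Accumulate the directional co-occurrence matrix: each word inside
    the window co-occurs with the window's first word w_i, with strength
    K - j for the word at distance j."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, wi in enumerate(tokens):
        for j in range(1, K):
            if i + j >= len(tokens):
                break
            hal[wi][tokens[i + j]] += K - j
    return hal

def hal_vector(hal, word):
    """Fold the row and column for `word` into a single vector of length
    N, as the SB model does when ignoring directional sensitivity."""
    vec = defaultdict(float)
    for after, w in hal[word].items():
        vec[after] += w
    for before in hal:
        vec[before] += hal[before].get(word, 0.0)
    return dict(vec)

def cosine_normalize(vec):
    """Equation (1): scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

tokens = "if i get opportunity i will work hard".split()
hal = build_hal(tokens, K=5)
```

For the example text, 'i' occurs at distances 1 and 4 from 'if' within the first window, so its folded weight for 'if' is (5 − 1) + (5 − 4) = 5, matching the accumulation rule described above.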


As can be seen from Figure 2, the HAL space is direction-sensitive: for every word in the target vocabulary, the co-occurrence information for words appearing before/after it is recorded separately by the row/column of the HAL matrix. The row/column pair may be concatenated so that, given a vocabulary $T$ of size $N$, a word can be represented in HAL space via a vector of length $2N$. In this research we followed the approach of the SB model, which does not consider directional sensitivity, and added the row and column together to form one vector for every row/column pair. Consequently, the dimension of the representation vector for each word is reduced to the vocabulary size $N$. In the rest of this paper, we denote the HAL vector representation of a word $i$ by $V_i = \langle w_{i t_1}, w_{i t_2}, \ldots, w_{i t_N} \rangle$, where $w_{i t_k}$ represents the weight of property $t_k$ and $t_k \in T$.

Once the HAL space is constructed, context information for each word in vocabulary $T$ is captured and stored in a HAL vector. As an example, part of the HAL vector for the word 'compete' is: $V_{compete}$ = < industry: 105, business: 105, firms: 109, market: 336, effectively: 141, markets: 167, ability: 130, world: 214, better: 117, international: 102, … >.

The weighting scheme proposed by Lund and Burgess for HAL construction is frequency-based. In follow-up research, Azzopardi and Yeh transformed the HAL model into a probabilistic framework so that it could be seamlessly integrated into other models based on probability theory [30]. Here, we transform the HAL model into a fuzzy set framework so that another effective tool for processing uncertain information, fuzzy set theory, can be applied to the semantic knowledge in the HAL model. Each dimension of a frequency-based HAL vector is first normalized using cosine normalization:

$$w_{i t_k} = \frac{w_{i t_k}}{\left[ \sum_{j=1}^{N} w_{i t_j}^2 \right]^{1/2}} \qquad (1)$$

Normalization allows the HAL vectors of two words to be compared or combined on the same scale. The normalized HAL vectors can be transformed into representations under a fuzzy set framework as follows. Let vocabulary $T$ be a universe of discourse. The HAL set representation of word $i$, denoted by $FS_i$, is written as

$$FS_i = \{ (t_k, w_{i t_k}) \mid t_k \in T \} \qquad (2)$$

where $w_{i t_k}$ is a measure of the importance of the neighbouring word $t_k$ to the target word $i$.

3.1.2. Multi-HAL fusion. Although there is no explicit distinguishing criterion, we deem that, depending on the training corpus, HAL models can be classified into two categories: universal and domain-specific. A universal HAL model is constructed from a massive corpus covering a wide range of domains and topics; in comparison, a domain-specific HAL model is derived from a corpus related to a specific domain or issue. Figure 3 shows the HAL set representation of the word 'compete' in both a universal and a domain-specific model. (The universal HAL model is constructed from the Reuters English Language Corpus, and the domain-specific HAL model from the 'Sport' subset of this corpus.) From Figure 3, we can see significant differences between the two HAL set representations of the word 'compete'.

If the corpus used for building the universal HAL model contains abundant information from a variety of domains, the resulting universal HAL model will be domain-independent and well suited to various newly encountered tasks. However, in some specific applications, adopting a targeted corpus to construct a domain-specific HAL model may improve performance. Obviously, this raises the question: could the universal model be completely replaced if the domain-specific corpus is adequate? The prerequisite that the domain-specific corpus is adequate is not only difficult to meet, but also difficult to judge. In addition, it takes a subjective judgement to determine which domain an encountered task belongs to, and which is the corresponding domain-specific

Figure 3. HAL set representations for the word 'compete' in different HAL models.


Figure 4. Framework for the fusion of multiple HAL models.

corpus. In many cases, this process of determination is difficult, and selecting an appropriate domain-specific corpus is also difficult. If there is a deviation in determining the domain or selecting the domain-specific corpus, the performance of the domain-specific HAL model may degrade significantly. Hence, in order to avoid this deviation, it is useful to integrate the domain-specific HAL model with a universal HAL model. Following the framework shown in Figure 4, a fusion mechanism for multiple HAL models is proposed based on the theory of fuzzy sets.

Let $FS_i^m = \{ (t_k, w_{i t_k}^m) \mid t_k \in T \}$ be the HAL set representation of word $i$ in the $m$th HAL model. The weight $\tilde{w}_{i t_k}$ of the $t_k$th dimension of word $i$ in the resulting fused HAL model can be calculated as

$$\tilde{w}_{i t_k} = \left[ \frac{1}{M} \sum_{m=1}^{M} \left( w_{i t_k}^m \right)^q \right]^{1/q}, \qquad q > 0 \qquad (3)$$

where $M$ is the total number of HAL models used for the fusion.
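Equation (3) is a generalized power mean of the per-model weights. A minimal sketch, assuming each model is represented as a dict from term to normalized weight; the value q = 2 is an illustrative choice only (the model requires merely q > 0), and the two toy models are hypothetical:

```python
def fuse_hal_models(models, q=2.0):
    """Equation (3): fuse M HAL set representations of the same word into
    one by taking the power mean of each dimension's weights across models.
    `models` is a list of dicts mapping term -> normalized weight; a term
    absent from a model contributes weight 0 for that model."""
    M = len(models)
    terms = set().union(*(m.keys() for m in models))
    return {
        t: (sum(m.get(t, 0.0) ** q for m in models) / M) ** (1.0 / q)
        for t in terms
    }

universal = {"market": 0.6, "industry": 0.8}
domain = {"market": 0.2, "athletes": 0.9}
fused = fuse_hal_models([universal, domain], q=2.0)
```

Because every model contributes to every dimension, a term strongly weighted in only one model (such as 'athletes' above) is retained but attenuated, rather than dominating or vanishing.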

3.2. Concept and concept combination
3.2.1. Concept. In the SB model, a concept is represented by an attribute vector in which each element denotes the importance of that attribute to the concept. These concepts can be directly initialized by the HAL vector representation of the corresponding word. In our study, we transform the HAL vectors of words into set representations under a fuzzy set framework. A new definition of a concept can be described as follows.

Definition 3.1 (Property set). The property set (PS) of a concept is defined as

$$PS \triangleq \{ (x, \mu(x)) \mid x \in X \} \qquad (4)$$

where $x$ is a property of the concept, $X$ consists of all $x$, and $\mu(x)$ is the weight of property $x$ (representing the importance of property $x$), with $0 \le \mu(x) \le 1$.

Definition 3.2 (Concept). The concept $c$ is defined by a two-tuple of property sets:

$$c \triangleq \langle L, U \rangle \qquad (5)$$

where the property set $L$ is referred to as the lower bound of PS, and the property set $U$ is referred to as the upper bound of PS. Figure 5 illustrates the concept definition schematically.

Unlike the representation of a concept via a single attribute vector in the SB model, this paper expands the concept into two property sets. This is because our model has a different application from the SB model, which is applied mainly to information retrieval and knowledge discovery from short texts containing only a few words and with relatively simple combined concepts. However, the web short texts that are the focus of this paper often contain several dozen words. According to the definition of a concept in the SB model, combined concepts would contain many

Xu and Peng

519

Figure 5. Schematic illustration of concept definition.

attributes with the increasing number of concept combinations. Hence it is extremely important to identify key information among the immense number of attributes of a combined concept. After the concept is expanded into the lower and upper bounds of PS, key information and general information are stored separately during concept combination. Accordingly, the precision of the representation of knowledge contained in concepts may be improved.

Definition 3.3 (Atomic concept). An atomic concept is a concept in conceptual space that corresponds to a word in natural language. It may be viewed as the basic unit of conceptual space, because a word is the basic unit of natural language.

In the HAL model, a word $i$ is represented by a HAL vector or a HAL set. In conceptual space, word $i$ is regarded as an atomic concept. Thereby, the lower and upper bounds of PS of atomic concept $i$ can be initialized from the HAL set representation of word $i$ in the HAL model. This process can be formalized as follows:

$$L_i = U_i = \{ (t_k, w_{i t_k}) \mid t_k \in T \wedge w_{i t_k} > \delta \} \cup \{ (t_k, 0) \mid t_k \in T \wedge w_{i t_k} \le \delta \}, \qquad \delta > 0 \qquad (6)$$

where $\delta$ denotes the threshold for determining the quality properties of a concept. Because each word $i$ is represented by a high-dimensional vector in the HAL model, computational complexity is high. Lund and Burgess found that the performance of the HAL model relies on only the 100–200 most important vector elements. Accordingly, by setting the value of $\delta$, we select some of the most important elements in the HAL vector as the properties of a concept. Note that, in the experiments in this paper, we set the value of $\delta$ manually so that about 200 important elements are selected to initialize an atomic concept. Figure 6 shows an example of the concept definition for 'university'.

As described earlier, the idea of expanding a concept into the two-tuple of the lower and upper bounds of PS is aimed mainly at improving the effectiveness of concept combination (see the next section). By equation (6), the lower and upper bounds of PS of an atomic concept are identical. However, a difference between the lower and upper bounds will emerge during concept combination. In this paper, this difference is viewed as a type of structured information contained in combined concepts.

Figure 6. An example for the concept 'university'.
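Initializing an atomic concept from a normalized HAL set then amounts to thresholding, as in equation (6). The helper below, which derives δ from a target of roughly k retained elements, mirrors the paper's manual setting of δ but is our own convenience, not part of the model; the toy HAL set is hypothetical.

```python
def delta_for_top_k(hal_set, k=200):
    """Pick a threshold delta so that roughly the k highest-weighted
    elements survive the w > delta test (ties may keep fewer)."""
    weights = sorted(hal_set.values(), reverse=True)
    return weights[k] if len(weights) > k else 0.0

def atomic_concept(hal_set, delta):
    """Equation (6): L = U, keeping weights above delta, zeroing the rest."""
    ps = {t: (w if w > delta else 0.0) for t, w in hal_set.items()}
    return {"L": dict(ps), "U": dict(ps)}

hal_set = {"students": 0.7, "campus": 0.5, "professor": 0.4, "the": 0.01}
concept = atomic_concept(hal_set, delta=0.1)
```

Zeroed properties are kept in the dict (rather than dropped) to match equation (6), which retains every $t_k \in T$ with weight 0 below the threshold.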


3.2.2. Concept combination. Understanding a new combination of concepts is a remarkable feature of human thinking, so an appropriate algorithm for concept combination is crucial for the implementation of conceptual space. Song and Bruza introduced a concept combination heuristic on the basis of HAL vectors. In our study, the definition of a concept is similar to a fuzzy set, and hence we propose a method of concept combination based on the following set operations.

Definition 3.4 (Operators of a PS).
Empty: a property set A is empty, A = ∅, if and only if ∀xi ∈ X, μA(xi) = 0.
Containment: a property set A is contained in a property set B, A ⊆ B, if and only if ∀xi ∈ X, μA(xi) ≤ μB(xi).
Intersection: the intersection of two property sets A and B is a property set C, written as C = A ∩ B, with ∀xi ∈ X, μC(xi) = min(μA(xi), μB(xi)).
Union: the union of two property sets A and B is a property set C, written as C = A ∪ B, with ∀xi ∈ X, μC(xi) = max(μA(xi), μB(xi)).

Definition 3.5 (Concept combination). The result of combining two concepts ci and cj is a concept, written as ci ⊕ cj and defined as

ci ⊕ cj ≝ <Li ∩ Lj, Ui ∪ Uj>   (7)

where Li and Lj are the lower bounds of PS of concepts ci and cj, and Ui and Uj are the upper bounds of PS of concepts ci and cj. Obviously, according to equation (7), the resulting concept ci ⊕ cj is a new concept that, in turn, can be composed with other concepts via concept combination.
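Under this fuzzy-set reading of Definition 3.4, equation (7) reduces to an element-wise min over the lower bounds and an element-wise max over the upper bounds. A minimal sketch, assuming property sets are stored as dicts mapping property → membership degree (absent keys have degree 0); the representation is an illustrative assumption:

```python
def ps_intersection(a, b):
    # Element-wise min of membership degrees (Definition 3.4); a property
    # absent from either set has degree 0, so it drops out of the result.
    return {x: min(a[x], b[x]) for x in a.keys() & b.keys()}

def ps_union(a, b):
    # Element-wise max of membership degrees (Definition 3.4).
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def combine(ci, cj):
    # Equation (7): ci (+) cj = <Li ∩ Lj, Ui ∪ Uj>
    return {"lower": ps_intersection(ci["lower"], cj["lower"]),
            "upper": ps_union(ci["upper"], cj["upper"])}

# Toy atomic concepts (lower and upper bounds coincide before combination).
university = {"lower": {"student": 0.6, "campus": 0.4},
              "upper": {"student": 0.6, "campus": 0.4}}
hospital = {"lower": {"patient": 0.7, "campus": 0.2},
            "upper": {"patient": 0.7, "campus": 0.2}}
combined = combine(university, hospital)
# The lower bound shrinks to the shared properties, while the upper
# bound accumulates the properties of both concepts.
```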

3.3. Measurements of semantic relationship between concepts

Accurately measuring the semantic relationship between two concepts is of great importance in many applications. The SB model adopted information flow to measure the inferential relation between concepts. In addition, many researchers have studied similarity measurements between short texts [35–39]. It is therefore vital to define a set of reasonable semantic measurements between concepts, based on the definition of a concept proposed in Section 3.2. In the following, we present two such measurements, based on fuzzy set theory.

3.3.1. Concept inclusion degree. Similar to the information flow mechanism in the SB model, the concept inclusion degree measures the semantic entailment relationship between two concepts, aiming to discover the latent entailment relationship between them. Inclusion is an important measurement in fuzzy set theory [40]. In this paper, the concept inclusion degree is based on the inclusion degree for the property set of a concept.

Definition 3.6 (Property set inclusion degree). Given two concept property sets A and B, a real function I: PS(X) × PS(X) → [0, 1] is called the property set inclusion degree if I has the following properties:

1. if A ⊆ B, then I(A, B) = 1;
2. if A ≠ ∅ and A ∩ B = ∅, then I(A, B) = 0;
3. if A ⊆ B ⊆ C, then I(C, A) ≤ I(B, A) and I(C, A) ≤ I(C, B).

From the definition of the property set inclusion degree, the following theorem is easily proved.

Theorem 3.1. For all A, B ∈ PS(X), let M(A) = Σ_{xi ∈ X} μA(xi) be the cardinal number of the property set A. The following function is then a property set inclusion degree:

I(A, B) = 1 if A = ∅, and I(A, B) = M(A ∩ B)/M(A) if A ≠ ∅   (8)

According to the definition of the property set inclusion degree, we can calculate the inclusion degree between property sets. Because a concept contains the lower and upper bounds of PS, the concept inclusion degree needs to take two kinds of property set into account simultaneously. Accordingly, we propose a definition of the concept inclusion degree on the basis of the property set inclusion degree.


Definition 3.7 (Concept inclusion degree). Given two concepts ci and cj, the concept inclusion degree between ci and cj, denoted by I(ci, cj), is defined as follows:

I(ci, cj) = α × I(Ui, Uj) + β × I(Li, Uj) + γ × I(Li, Lj),  with α + β + γ = 1.0   (9)
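Equations (8) and (9) together give a directly computable inclusion degree. A minimal sketch, assuming property sets are dicts of membership degrees and a concept is a <lower, upper> pair (this representation and all names are illustrative, not the paper's implementation):

```python
def cardinal(ps):
    # M(A): sum of membership degrees of the property set.
    return sum(ps.values())

def ps_inclusion(a, b):
    # Equation (8): I(A, B) = 1 if A is empty, else M(A ∩ B) / M(A).
    m_a = cardinal(a)
    if m_a == 0:
        return 1.0
    intersection = {x: min(a[x], b[x]) for x in a.keys() & b.keys()}
    return cardinal(intersection) / m_a

def concept_inclusion(ci, cj, alpha=0.8, beta=0.0, gamma=0.2):
    # Equation (9); the defaults are the best-performing weights in Experiment 1.
    return (alpha * ps_inclusion(ci["upper"], cj["upper"])
            + beta * ps_inclusion(ci["lower"], cj["upper"])
            + gamma * ps_inclusion(ci["lower"], cj["lower"]))

# Toy concepts: every property of `rain` is covered (with at least the same
# membership degree) by `weather`, so the inclusion degree is maximal.
rain = {"lower": {"wet": 0.5}, "upper": {"wet": 0.5, "storm": 0.2}}
weather = {"lower": {"wet": 0.6}, "upper": {"wet": 0.6, "storm": 0.3}}
degree = concept_inclusion(rain, weather)
```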

By employing the concept inclusion degree, the semantic entailment relationship between two concepts can be measured. This is similar to the information inference mechanism based on information flow in the SB model.

3.3.2. Concept similarity degree. In addition to the semantic entailment relationship, the semantic similarity relationship between concepts also needs to be measured in many applications. Following the definition of the concept inclusion degree, we can easily derive the definition of the concept similarity degree.

Definition 3.8 (Property set similarity degree). Given two concept property sets A and B, a real function S: PS(X) × PS(X) → [0, 1] is called the property set similarity degree if S has the following properties:

1. S(A, A) = 1 for all A ∈ PS(X);
2. S(A, B) = S(B, A) for all A, B ∈ PS(X);
3. for all A, B, C, D ∈ PS(X), if A ⊆ B ⊆ C ⊆ D, then S(A, D) ≤ S(B, C).

From the definition of the property set similarity degree, the following theorem is easily proved.

Theorem 3.2. For all A, B ∈ PS(X), let I be a property set inclusion degree. The following function is then a property set similarity degree:

S(A, B) = I(A, B) × I(B, A)   (10)

According to the definition of the property set similarity degree, we can calculate the similarity between property sets. The concept similarity degree needs to take both kinds of property set into account simultaneously. Accordingly, we propose a definition of the concept similarity degree based on the property set similarity degree and the property set inclusion degree.

Definition 3.9 (Concept similarity degree). Given two concepts ci and cj, the concept similarity degree between ci and cj, denoted by S(ci, cj), is defined as follows:

S(ci, cj) = α × S(Ui, Uj) + β × I(Li, Uj) × I(Lj, Ui) + γ × S(Li, Lj),  with α + β + γ = 1.0   (11)

By employing the concept similarity degree, the similarity between two concepts can be measured, which is different from the information inference mechanism in the SB model.
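Equations (10) and (11) can be sketched in the same dict-based representation used above (an illustrative assumption, not the paper's implementation). Note that the β term of equation (11) keeps the measure symmetric, and that a concept is maximally similar to itself:

```python
def cardinal(ps):
    return sum(ps.values())

def ps_inclusion(a, b):
    # Equation (8): I(A, B) = 1 if A is empty, else M(A ∩ B) / M(A).
    m_a = cardinal(a)
    if m_a == 0:
        return 1.0
    return cardinal({x: min(a[x], b[x]) for x in a.keys() & b.keys()}) / m_a

def ps_similarity(a, b):
    # Equation (10): S(A, B) = I(A, B) × I(B, A).
    return ps_inclusion(a, b) * ps_inclusion(b, a)

def concept_similarity(ci, cj, alpha=0.8, beta=0.2, gamma=0.0):
    # Equation (11); the defaults are the best-performing weights in Experiment 2.
    return (alpha * ps_similarity(ci["upper"], cj["upper"])
            + beta * ps_inclusion(ci["lower"], cj["upper"])
                   * ps_inclusion(cj["lower"], ci["upper"])
            + gamma * ps_similarity(ci["lower"], cj["lower"]))

# Two toy concepts with partly overlapping properties.
a = {"lower": {"x": 0.4}, "upper": {"x": 0.4, "y": 0.6}}
b = {"lower": {"z": 0.5}, "upper": {"x": 0.2, "z": 0.5}}
```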

4. Evaluation of unsupervised categorization method based on extended information inference model

This section empirically evaluates the performance of an unsupervised categorization method for web short texts based on the information inference model proposed in Section 3. The evaluation was performed through three different experiments on web short-text categorization. The first experiment, which was used to evaluate the SB model in [23], was conducted to categorize news titles by topic. In the second experiment, web news titles were automatically classified according to related events. The final experiment evaluated semantic orientation classification for consumer reviews, which is similar to the work reported in [12, 13]. The overall framework of the short-text unsupervised categorization method based on information inference is shown in Figure 7. The paradigm concepts are used to define each category label, and our algorithm realizes the categorization process by unsupervised learning from this descriptive definition of the category label.

Figure 7. Framework for short-text unsupervised categorization based on an information inference mechanism.

4.1. Experiment 1: categorizing news titles by topic

4.1.1. Experimental set-up. The first experiment, which was also used to examine the SB model, focused on the categorization of Reuters news titles. The primary purpose of the experiment was to evaluate the ability of the concept inclusion degree to discover implicit knowledge from short texts. In addition, this experiment allowed us to compare differences between the proposed information inference model and the SB model. We adopted the Reuters English Language Corpus RCV1 (1996-08-20 to 1997-08-19) to conduct the news title categorization experiment (the Reuters Corpus is available at http://trec.nist.gov/data/reuters/reuters.html). The corpus contains 787,801 documents, and each document consists of a title, content, topic labels, etc. Human indexers have identified 126 topics, which are used to label the documents. For this experiment, the 10 topic categories shown in Table 1 were selected from the 126 topics; we refer to these 10 topics as the topic set (TS). The corpus was split into three collections: a training set, a domain-specific training set, and a test set. The training set, containing documents published before 1 August 1997, was used to construct a universal HAL space. The domain-specific training set, containing documents published from 1 August 1997 to 10 August 1997, each of which belonged to one of the TS topics, was used to construct a domain-specific HAL space. A collection of 2876 titles of documents published from 11 August 1997 to 19 August 1997, each of which belonged to one of the TS topics, formed the test set used to evaluate the short-text categorization algorithm.

Table 1. Test topics used in Experiment 1

Topic | Number of relevant documents | Average length of titles | Paradigm concepts set for corresponding topic category
GWEA | 90 | 5.90 | {weather, atmospheric, rain, sunny, cloudy, hail, shower, …}
GVOTE | 61 | 5.67 | {vote, elect, candidate, party, voter, Democratic, poll, …}
GVIO | 501 | 5.65 | {war, conflict, soldier, attack, peace, combat, martial, …}
GSPO | 728 | 5.79 | {sport, championship, scorer, athlete, Olympics, football, …}
GSCI | 107 | 5.57 | {scientific, technical, gene, electricity, engineer, physical, …}
GDIS | 166 | 5.52 | {disaster, accident, quake, typhoon, wreckage, flood, …}
GCRIM | 629 | 5.59 | {crime, policeman, accused, law, testify, prisoner, jail, …}
GTOUR | 13 | 5.92 | {tour, holiday, journey, travel, vacation, tourism, trip, …}
GJOB | 422 | 5.65 | {job, labour, employ, unemployed, jobless, engagement, …}
E11 | 159 | 5.86 | {economic, inflation, deflation, GDP, deficit, CPI, PPI, …}

Note that our short-text classification method is not entirely the same as the method proposed in [23]. The principal difference is that we expanded each topic label into a set of paradigm concepts, whereas the method reported in the literature used only the topic label as the classification paradigm. In our experiment, each topic label i of TS was expanded into a set of paradigm words, denoted by PWi, containing 20 paradigm words relevant to i (based on WordNet).
The idea of using more paradigms comes from the work reported in [12, 13], wherein 14 labelled paradigm words were used to define the semantic orientation. In addition, the classification decision rule in our method could be termed a K-NN (K-nearest-neighbour) rule, which is slightly different from the decision rule adopted in [23]. In document preprocessing, we removed all punctuation marks, numerical symbols, and stop words, and converted all words to lowercase. A universal HAL space was constructed from the training set using a window size of eight words, and a domain-specific HAL space was constructed from the domain-specific training set, again using a window size of eight words. In the experiment, the multiple HAL model fusion parameter in equation (3) was set to q = 2. Each title of the test set was categorized based on Algorithm 1. For comparison, we implemented the SB-model-based unsupervised categorization, wherein the SB model was substituted for the role of computing the concept inclusion degree in Algorithm 1. In addition, although our method is unsupervised, we also compared it with several baseline supervised text categorization algorithms, including the naive Bayes (NB) and K-NN classifiers (using the vector-space cosine similarity measure). For the supervised algorithms, we used fivefold cross-validation to obtain categorization results. Specifically, all samples in the test set (containing 2876 titles) were split equally into five folds; each time, four folds were used to train the supervised classifier and the remaining fold was used to test the resulting classifier. The entire test set was classified via five such repetitions.

Algorithm 1. News title categorization based on concept inclusion degree
Input: news title H, topic set TS, set of paradigm words PW, and parameter K
Output: the topic category into which H falls
Pseudo-Code:
  Remove all stop words from news title H, thereby obtaining the word sequence h1, h2, …, hn
  Compute the combined concept cH = ch1 ⊕ ch2 ⊕ … ⊕ chn
  FOR each i ∈ TS
    FOR each w ∈ PWi
      Calculate the concept inclusion degree I(cH, cw)
    END FOR
    Select the K paradigm words from PWi that correspond to the K largest concept inclusion degrees, and use them to form the K-nearest-neighbours set KNNi
    Calculate the 'probability' that H falls into the ith category: SOCi(cH) = Σ_{cj ∈ KNNi} I(cH, cj)
  END FOR
  RETURN the classification result r = arg max_{i ∈ TS} SOCi(cH)
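The K-NN decision rule of Algorithm 1 can be sketched as follows. The `inclusion` argument stands for the concept inclusion degree of equation (9); here it is stubbed with a toy word-overlap scorer so the sketch runs standalone (all names and data are illustrative):

```python
def categorize_title(title_concept, paradigm_words, topic_set, inclusion, k=5):
    # For each topic, keep the K paradigm words with the largest inclusion
    # degrees and sum them as the topic score SOC_i(c_H).
    scores = {}
    for topic in topic_set:
        degrees = sorted((inclusion(title_concept, w)
                          for w in paradigm_words[topic]), reverse=True)
        scores[topic] = sum(degrees[:k])
    # arg max over topics.
    return max(scores, key=scores.get)

# Toy stand-in for the concept inclusion degree of equation (9):
# does the paradigm word appear among the title's words?
def toy_inclusion(title_words, paradigm):
    return 1.0 if paradigm in title_words else 0.0

paradigms = {"GWEA": ["weather", "rain", "sunny"],
             "GSPO": ["sport", "football", "athlete"]}
title = {"heavy", "rain", "hits", "sunny", "coast"}
result = categorize_title(title, paradigms, ["GWEA", "GSPO"], toy_inclusion, k=2)
```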

4.1.2. Experimental results. Precision, Recall, and F-measure are traditional measures that have been widely used for the performance evaluation of text categorization algorithms. For each category Ci, they are defined as follows:

Precision = tp/(tp + fp)   (12)

Recall = tp/(tp + fn)   (13)

F-measure = (2 × Precision × Recall)/(Precision + Recall)   (14)

where tp (true positive) is the number of documents that are correctly categorized into Ci, fp (false positive) is the number of documents that are wrongly categorized into Ci, and fn (false negative) is the number of documents that actually belong to Ci but are categorized into non-Ci. F-measure is a combination of Precision and Recall. The results of Experiment 1 are illustrated in Table 2 and Figure 8. In this experiment, the parameter K in Algorithm 1 was set to 5, and the property set inclusion degree (I) was calculated using equation (8). Note that the results shown in Table 2 are the macro-averaged results across 10 categories. The results displayed in Figure 8 correspond to the concept inclusion degree (parameter settings: α = 0.8, β = 0.0, γ = 0.2). Table 2 and Figure 8 show that the performance of the model proposed in Section 3 is superior to the SB model for the unsupervised categorization of news titles. In the proposed model, a concept consists of both an upper bound of PS and a lower bound of PS: therefore the knowledge contained in the combined concept is more accurate and structured. This may be the reason why the performance of the short-text classification algorithm based on the concept inclusion degree was better than that based on the information flow inference of the SB model. From Table 2, we can see that the performance of the categorization algorithm reached a maximum when α = 0.8, β = 0.0, and γ = 0.2. It appears that the semantic information contained in the upper and lower bounds of PS may be well-fused with these parameter settings.


Table 2. Categorization performance with traditional measures

Method | Average Precision | Average Recall | Average F-measure
SB model* | 0.781 | 0.734 | 0.732
Concept inclusion degree (I*), no fusion with domain-specific knowledge:
  α = 1.0, β = 0.0, γ = 0.0 | 0.780 | 0.726 | 0.726
  α = 0.8, β = 0.0, γ = 0.2 | 0.811 | 0.764 | 0.761
  α = 0.6, β = 0.0, γ = 0.4 | 0.802 | 0.744 | 0.743
  α = 0.4, β = 0.0, γ = 0.6 | 0.793 | 0.744 | 0.740
  α = 0.2, β = 0.0, γ = 0.8 | 0.782 | 0.724 | 0.721
  α = 0.0, β = 0.0, γ = 1.0 | 0.755 | 0.694 | 0.683
Concept inclusion degree (I*), fusion with domain-specific knowledge:
  α = 1.0, β = 0.0, γ = 0.0 | 0.837 | 0.807 | 0.796
  α = 0.8, β = 0.0, γ = 0.2 | 0.863 | 0.837 | 0.833
  α = 0.6, β = 0.0, γ = 0.4 | 0.847 | 0.827 | 0.826
  α = 0.4, β = 0.0, γ = 0.6 | 0.856 | 0.827 | 0.822
  α = 0.2, β = 0.0, γ = 0.8 | 0.832 | 0.786 | 0.778
  α = 0.0, β = 0.0, γ = 1.0 | 0.789 | 0.706 | 0.700
Naive Bayes | 0.755 | 0.740 | 0.742
K-nearest neighbour | 0.745 | 0.728 | 0.718

*SB model was substituted for the role of I* in Algorithm 1.

In addition, the results in Table 2 show that the integration of domain-specific knowledge could significantly improve the performance of the model. Compared with the baseline supervised text categorization algorithms (NB and K-NN), the present unsupervised categorization algorithms are more effective. We believe that the degradation of supervised categorization algorithms is due to the feature sparseness of short texts.

4.2. Experiment 2: categorizing news titles by event

Unlike Experiment 1, wherein the concept inclusion degree was empirically evaluated, this experiment evaluated the concept similarity degree by categorizing a dataset of web news titles about different events.

4.2.1. Experimental set-up. The purpose of this experiment was to evaluate the ability of the concept similarity degree to discover implicit semantic similarities between short texts. To accomplish this objective, we considered the categorization of the titles of news reports. We adopted the TDT Pilot Study Corpus to conduct this experiment (the TDT Pilot Study Corpus is available at http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25). The TDT Pilot Study Corpus collects 15,863 news reports from the period 1 July 1994 to 30 July 1995, with about half taken from Reuters newswire and half from CNN broadcast news transcripts. A set of 25 target events was defined that spans a variety of event types and covers a subset of the events discussed in the corpus. From all 15,863 reports, researchers manually labelled the news reports related to these 25 target events. In our experiment, five target events were filtered out because the number of news reports related to them was less than 10. Table 3 gives a breakdown of the remaining 20 target events. For the experimental evaluation, we organized the corpus as follows: the first 10 events, i.e. Event 1 to Event 11, and the titles of news reports related to these events were used to construct test set 1; all 20 events and the related news report titles were used to construct test set 2. In this experiment, we adopted the name of each event as the paradigm concept of the corresponding category. All paradigm concepts constitute the set of paradigm concepts, denoted by SP. The universal HAL space was constructed from the training set introduced in Section 4.1.
In addition, 200 news reports related to the 20 target events were randomly selected to form the domain-specific training set. A domain-specific HAL space was constructed from this domain-specific training set using a window size of eight words. In our experiment, the multiple HAL model fusion algorithm parameter was set to q = 2. For the collection of news report titles HS = {H1, H2, …, Hm} and the set of paradigm concepts SP = {P1, P2, …, Pl}, HS was categorized based on Algorithm 2. As a comparison test, we substituted the vector-space cosine similarity measure for the role of computing the concept similarity degree in Algorithm 2. In addition, two baseline supervised text categorization algorithms, the NB and K-NN classifiers (using the vector-space cosine similarity measure), were employed for comparison with our unsupervised method. For the supervised algorithms, we again used fivefold cross-validation to obtain categorization results.

Algorithm 2. News title categorization based on concept similarity degree
Input: news report title set HS = {H1, H2, …, Hm}, paradigm concepts set SP = {P1, P2, …, Pl}, parameter K1, and parameter K2
Output: the set of categorization results R
Pseudo-Code:
  FOR each Pi ∈ SP
    Remove all stop words from paradigm concept Pi, thereby obtaining the word sequence pi1, pi2, …, pin
    Compute the combined concept cPi = cpi1 ⊕ cpi2 ⊕ … ⊕ cpin
  END FOR
  FOR each Hi ∈ HS
    Remove all stop words from news title Hi, thereby obtaining the word sequence hi1, hi2, …, hin
    Compute the combined concept cHi = chi1 ⊕ chi2 ⊕ … ⊕ chin
  END FOR
  FOR each Pi ∈ SP
    FOR each Hj ∈ HS
      Calculate the concept similarity degree S(cPi, cHj)
    END FOR
    Select the K1 titles from HS that correspond to the K1 largest concept similarity degrees, and use these K1 titles to form an expanded paradigm concepts set for class i, denoted by EPSi
  END FOR
  FOR each Hj ∈ HS
    FOR each Pk ∈ ∪_{i=1..l} EPSi
      Calculate the concept similarity degree S(cPk, cHj)
    END FOR
    Select the K2 paradigm concepts from ∪_{i=1..l} EPSi that correspond to the K2 largest concept similarity degrees, and use them to form the K-nearest-neighbours set KNNj of title Hj
    Add the categorization result rj of Hj to R, where rj = arg max over classes i of Σ_{ct ∈ KNNj ∩ EPSi} S(cHj, ct)
  END FOR
  RETURN R
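The two stages of Algorithm 2 (paradigm expansion, then a K-NN vote) can be sketched as follows. The `similarity` argument stands for the concept similarity degree of equation (11); here it is stubbed with a toy Jaccard word-overlap measure so the sketch runs standalone (all names and data are illustrative):

```python
def expand_paradigms(titles, paradigms, similarity, k1=10):
    # Stage 1: expand each paradigm concept with its K1 most similar
    # titles, forming the expanded paradigm set EPS_i for each class.
    eps = {}
    for label, p in paradigms.items():
        ranked = sorted(titles, key=lambda t: similarity(p, t), reverse=True)
        eps[label] = set(ranked[:k1])
    return eps

def classify(title, eps, similarity, k2=5):
    # Stage 2: a K2-nearest-neighbour vote over the union of expanded sets;
    # each class scores the neighbours that belong to its own EPS_i.
    pool = set().union(*eps.values())
    knn = sorted(pool, key=lambda t: similarity(title, t), reverse=True)[:k2]
    scores = {label: sum(similarity(title, t) for t in knn if t in members)
              for label, members in eps.items()}
    return max(scores, key=scores.get)

# Toy stand-in for the concept similarity degree: Jaccard word overlap.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

paradigms = {"quake": frozenset({"kobe", "quake"}),
             "crash": frozenset({"usair", "crash"})}
titles = [frozenset({"kobe", "quake", "damage"}), frozenset({"japan", "quake"}),
          frozenset({"usair", "crash", "inquiry"}), frozenset({"plane", "crash"})]
eps = expand_paradigms(titles, paradigms, jaccard, k1=2)
label = classify(frozenset({"quake", "damage"}), eps, jaccard, k2=2)
```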

4.2.2. Experimental results. The traditional measures given by equations (12)–(14) were used for performance evaluation. The results of Experiment 2 are listed in Table 4. In the experiment, the parameters of Algorithm 2 were K1 = 10 and K2 = 5, and the property set similarity degree (S) was calculated using equation (10). The results shown in Table 4 are the macro-averaged results across all categories. Table 4 shows the following results. First, the performance of the proposed unsupervised categorization algorithm reached a maximum when α = 0.8, β = 0.2, and γ = 0.0, probably because the semantic information contained in the upper and lower bounds of PS was well balanced by these parameter settings. Second, the method based on the concept similarity degree performed better than the vector-space cosine similarity measure at these parameter values. Feature sparseness, caused by the short length of the text, limited the performance of the VSM-based method, whereas the concept similarity degree of the information inference model overcame this problem. Third, the integration of domain-specific knowledge improved the performance of the information inference model. Fourth, the performance of our algorithm showed a clear gap between the two test sets. Test set 2 contained three events related to 'bombing', i.e. Events 17, 18 and 25; we believe that this added more challenges and resulted in performance degradation of the categorization algorithm. Finally, the performance of the present unsupervised categorization algorithm was worse than that of the supervised text categorization algorithms (NB and K-NN). We believe that this is caused by two factors. First, only one paradigm concept was adopted by our unsupervised categorization algorithm. Second, the task of this experiment was to categorize the short texts according to the topic of news events, and thus the feature sparseness of the short-text dataset was not as prominent as in the previous task.

4.3. Experiment 3: categorizing online reviews by sentiment orientation

The third experiment aimed to evaluate the proposed method through a non-thematic categorization of web short texts. To accomplish this objective, we considered categorization of the sentiment orientation of online consumer reviews. Relative to the thematic information contained in short texts, sentiment orientation information is a more obscure type of


Figure 8. Precision, Recall and F-measure chart for each category (series: SB model, I* without fusion, I* with fusion; categories: GWEA, GVOTE, GVIO, GSPO, GSCI, GDIS, GCRIM, GTOUR, GJOB, E11).


Table 3. Summary of target events in TDT Pilot Study Corpus

Event ID | Name of event | Number of news reports | Average length of titles
Event 1 | Aldrich Ames | 22 | 7.95
Event 2 | Carlos the Jackal | 14 | 7.71
Event 3 | Carter in Bosnia | 40 | 7.58
Event 4 | Cessna on White House | 16 | 8.13
Event 5 | Clinic murders (Salvi) | 41 | 7.68
Event 6 | Comet into Jupiter | 48 | 7.54
Event 8 | Death of Kim Jong Il (N. Korea) | 74 | 8.05
Event 9 | DNA in OJ trial | 206 | 7.85
Event 10 | Haiti ousts observers | 19 | 7.89
Event 11 | Hall's copter (N. Korea) | 95 | 7.87
Event 12 | Humble, TX, flooding | 23 | 7.61
Event 15 | Kobe Japan quake | 89 | 7.53
Event 16 | Lost in Iraq | 43 | 7.79
Event 17 | NYC subway bombing | 24 | 7.58
Event 18 | OK-City bombing | 282 | 7.58
Event 20 | Quayle lung clot | 12 | 8.08
Event 21 | Serbians down F-16 | 77 | 7.73
Event 22 | Serbs violate Bihac | 135 | 7.96
Event 24 | USAir 427 crash | 43 | 7.88
Event 25 | WTC bombing trial | 24 | 7.96

Table 4. Categorization performance with traditional measures

Method | Test set 1: Average Precision | Average Recall | Average F-measure | Test set 2: Average Precision | Average Recall | Average F-measure
VSM-based cosine similarity* | 0.797 | 0.789 | 0.769 | 0.737 | 0.718 | 0.683
Concept similarity degree (S*), no fusion with domain-specific knowledge:
  α = 1.0, β = 0.0, γ = 0.0 | 0.827 | 0.839 | 0.807 | 0.771 | 0.778 | 0.747
  α = 0.8, β = 0.2, γ = 0.0 | 0.842 | 0.851 | 0.828 | 0.779 | 0.797 | 0.762
  α = 0.6, β = 0.4, γ = 0.0 | 0.826 | 0.852 | 0.818 | 0.776 | 0.785 | 0.749
  α = 0.4, β = 0.6, γ = 0.0 | 0.819 | 0.819 | 0.791 | 0.766 | 0.772 | 0.742
  α = 0.2, β = 0.8, γ = 0.0 | 0.814 | 0.811 | 0.786 | 0.763 | 0.765 | 0.731
  α = 0.0, β = 1.0, γ = 0.0 | 0.818 | 0.812 | 0.790 | 0.771 | 0.756 | 0.734
Concept similarity degree (S*), fusion with domain-specific knowledge:
  α = 1.0, β = 0.0, γ = 0.0 | 0.834 | 0.864 | 0.834 | 0.815 | 0.835 | 0.810
  α = 0.8, β = 0.2, γ = 0.0 | 0.847 | 0.868 | 0.848 | 0.821 | 0.840 | 0.821
  α = 0.6, β = 0.4, γ = 0.0 | 0.838 | 0.864 | 0.833 | 0.826 | 0.840 | 0.819
  α = 0.4, β = 0.6, γ = 0.0 | 0.844 | 0.866 | 0.833 | 0.804 | 0.840 | 0.807
  α = 0.2, β = 0.8, γ = 0.0 | 0.828 | 0.848 | 0.818 | 0.812 | 0.837 | 0.808
  α = 0.0, β = 1.0, γ = 0.0 | 0.834 | 0.854 | 0.821 | 0.805 | 0.825 | 0.798
Naive Bayes | 0.912 | 0.831 | 0.853 | 0.886 | 0.819 | 0.837
K-nearest neighbour | 0.933 | 0.826 | 0.857 | 0.863 | 0.814 | 0.823

*Vector-space-model-based cosine similarity was substituted for the role of S* in Algorithm 2.

knowledge. Accordingly, the objective of this experiment could be regarded as an evaluation of the concept inclusion degree's ability to discover more implicit knowledge from short texts.

4.3.1. Experimental set-up. Table 5 describes the 2000 consumer reviews from the Epinions website (www.epinions.com) that were used as the experimental test set. These reviews cover five product domains, and were divided into two categories according to the 'recommended' or 'not recommended' label selected by the reviewer. When submitting a review, each reviewer is required by the Epinions website to summarize their views in a one-sentence 'bottom line'. Because the work in this paper focuses on the analysis of short texts, we used the 'bottom line' instead of the whole review. A universal HAL space was constructed from the training set introduced in Section 4.1. In addition, 3500 reviews related to the five domains listed in Table 5 were randomly collected from the Epinions website and used as a domain-specific training set. A domain-specific HAL space was then constructed from this domain-specific training set using a window size of eight words. Our experiment is closely related to Turney's work on classifying the sentiment orientation of reviews [12, 13]. Turney used pointwise mutual information (PMI) to calculate the sentiment orientation of each word, and classified a review as positive or negative based on the average sentiment orientation of all words in the review. Note that our method


Table 5. Summary of corpus of reviews

Domain of review | Recommended | Not recommended | Total | Average phrases per review
Automobiles | 200 | 200 | 400 | 20.46
Banks | 200 | 200 | 400 | 18.08
Movies | 200 | 200 | 400 | 18.86
Travel | 200 | 200 | 400 | 22.06
Digital cameras | 200 | 200 | 400 | 20.19

differs from Turney’s method. The first difference is that we classify the sentiment orientation of each sentence directly, rather than the phrase-level classification of Turney’s work. In addition, we utilize a larger paradigm set. Instead of the seven positive and seven negative paradigm words used by Turney, we select a set of 20 positive paradigm concepts (denoted by PWpos) and a set of 20 negative paradigm concepts (denoted by PWneg) to measure the sentiment orientation of sentence-level text. These 40 paradigm concepts come from the work in [41]. The sentiment orientation of each review in the test set may be classified based on Algorithm 3.

Algorithm 3. Sentiment orientation classification of online reviews based on concept inclusion degree
Input: web consumer review R, parameter K, paradigm concepts sets PWpos and PWneg
Output: the sentiment orientation of R
Pseudo-Code:
  Remove all stop words from web consumer review R, thereby obtaining the word sequence r1, r2, …, rn
  Compute the combined concept cR = cr1 ⊕ cr2 ⊕ … ⊕ crn
  FOR each i ∈ {pos, neg}
    FOR each w ∈ PWi
      Calculate the concept inclusion degree I(cR, cw)
    END FOR
    Select the K paradigm words from PWi that correspond to the K largest concept inclusion degrees, and use them to form the K-nearest-neighbours set KNNi
  END FOR
  Calculate the sentiment orientation degree of R: SO(R) = Σ_{cj ∈ KNNpos} I(cR, cj) − Σ_{cj ∈ KNNneg} I(cR, cj)
  IF SO(R) > 0 RETURN 'Recommended' ELSE RETURN 'Not Recommended'
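The decision rule of Algorithm 3 can be sketched as follows. The `inclusion` argument stands for the concept inclusion degree of equation (9); here it is stubbed with a toy word-membership scorer so the sketch runs standalone (all names and data are illustrative):

```python
def sentiment_orientation(review_concept, pos_paradigms, neg_paradigms,
                          inclusion, k=5):
    # SO(R): the sum of the K largest inclusion degrees over the positive
    # paradigms minus the same sum over the negative paradigms.
    def top_k_sum(paradigms):
        return sum(sorted((inclusion(review_concept, p) for p in paradigms),
                          reverse=True)[:k])
    so = top_k_sum(pos_paradigms) - top_k_sum(neg_paradigms)
    return "Recommended" if so > 0 else "Not Recommended"

# Toy stand-in for the concept inclusion degree: word membership.
def toy_inclusion(review_words, paradigm):
    return 1.0 if paradigm in review_words else 0.0

review = {"great", "camera", "love", "terrible", "battery"}
verdict = sentiment_orientation(review, ["great", "love", "excellent"],
                                ["terrible", "awful", "poor"],
                                toy_inclusion, k=2)
```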

For comparison, we implemented an SB-model-based unsupervised categorization, wherein the SB model was substituted for the role of computing the concept inclusion degree in Algorithm 3. The PMI-based unsupervised categorization method in [12, 13] was also used as a comparison method. In addition, although our method is unsupervised, we again compared our method with two baseline supervised text categorization algorithms, the NB and K-NN classifiers (using the vector-space cosine similarity measure). For the supervised algorithms, we used fivefold cross-validation to obtain categorization results.

4.3.2. Experimental results. The traditional measures given by equations (12)–(14) were used for performance evaluation. In this experiment, the parameter of Algorithm 3 was set to K = 5, and the property set inclusion degree (I) was calculated using equation (8). The results of Experiment 3 are presented in Table 6. From Table 6, we can see that the performance of the classification algorithm reached a maximum when α = 0.6, β = 0.0, and γ = 0.4. In this case, the method based on the concept inclusion degree exhibited a better performance than both the PMI-based and the SB-model-based methods. There are two reasons for these results. First, the larger set of paradigm words used in our method provides more knowledge than Turney's PMI method. Second, relative to the PMI model itself, the concept inclusion degree may possess some advantages in discovering the implicit knowledge contained in short texts.


Table 6. Categorization performance with traditional measures

Method | Not recommended: Precision | Recall | F-measure | Recommended: Precision | Recall | F-measure
SB model* | 0.864 | 0.667 | 0.771 | 0.543 | 0.914 | 0.667
Concept inclusion degree (I*), no fusion with domain-specific knowledge:
  α = 1.0, β = 0.0, γ = 0.0 | 0.864 | 0.667 | 0.771 | 0.543 | 0.914 | 0.667
  α = 0.8, β = 0.0, γ = 0.2 | 0.875 | 0.712 | 0.790 | 0.600 | 0.914 | 0.696
  α = 0.6, β = 0.0, γ = 0.4 | 0.880 | 0.733 | 0.800 | 0.629 | 0.914 | 0.711
  α = 0.4, β = 0.0, γ = 0.6 | 0.808 | 0.689 | 0.760 | 0.600 | 0.857 | 0.682
  α = 0.2, β = 0.0, γ = 0.8 | 0.773 | 0.597 | 0.723 | 0.486 | 0.857 | 0.625
  α = 0.0, β = 0.0, γ = 1.0 | 0.762 | 0.571 | 0.714 | 0.457 | 0.857 | 0.612
Concept inclusion degree (I*), fusion with domain-specific knowledge:
  α = 1.0, β = 0.0, γ = 0.0 | 0.857 | 0.762 | 0.805 | 0.686 | 0.886 | 0.738
  α = 0.8, β = 0.0, γ = 0.2 | 0.862 | 0.781 | 0.816 | 0.714 | 0.886 | 0.756
  α = 0.6, β = 0.0, γ = 0.4 | 0.839 | 0.788 | 0.811 | 0.743 | 0.857 | 0.769
  α = 0.4, β = 0.0, γ = 0.6 | 0.767 | 0.708 | 0.747 | 0.657 | 0.800 | 0.700
  α = 0.2, β = 0.0, γ = 0.8 | 0.690 | 0.625 | 0.684 | 0.571 | 0.743 | 0.634
  α = 0.0, β = 0.0, γ = 1.0 | 0.720 | 0.600 | 0.700 | 0.514 | 0.800 | 0.622
PMI model in [12, 13] | 0.840 | 0.700 | 0.775 | 0.600 | 0.886 | 0.689
Naive Bayes | 0.687 | 0.636 | 0.687 | 0.591 | 0.734 | 0.645
K-nearest neighbour | 0.704 | 0.670 | 0.702 | 0.639 | 0.734 | 0.672

*SB model was substituted for the role of I* in Algorithm 3.

In addition, the results in Table 6 show that the integration of domain-specific knowledge can slightly improve the performance of the model. However, the role of domain-specific knowledge in this experiment was not as remarkable as in the previous experiments, probably because a corpus of related reviews is not the best source of domain-specific knowledge for sentiment orientation classification. Finally, as shown in Table 6, the present unsupervised categorization algorithm showed more promising performance than the baseline supervised algorithms (NB and K-NN). It seems that our approach is better suited to deeply mining the implicit knowledge contained in web short texts, such as sentiment orientation.

5. Conclusion

This paper has presented an extended information inference model and several unsupervised categorization algorithms for web short texts based on an information inference mechanism. The information inference model integrates the HAL model, the conceptual space model, and fuzzy set theory, enabling it to mimic human cognitive behaviour effectively and infer the semantic information contained in short text fragments from the web.

To evaluate the effectiveness of the extended information inference model on web short-text categorization tasks, three different experiments were performed. The experimental results indicated that the unsupervised categorization methods built upon the information inference mechanism can overcome the difficulties caused by feature sparseness and the lack of labelled examples in web short-text categorization tasks. Moreover, the multi-HAL fusion mechanism in the initialization of a conceptual space provides an effective way to use domain-specific background knowledge and universal knowledge simultaneously, which improves the performance of information-inference-based short-text categorization. Furthermore, by transforming HAL vectors into representations under a fuzzy set framework, this work provides a referential theory for further research on optimizing the realization of conceptual space and information inference.

Finally, it is important to point out that we tested the parameters α, β, and γ in only a small range, according to our prior experience; the optimization of these parameter settings is therefore a subject for further study. In addition, in the experiments presented in Sections 4.1 and 4.3, all applications depend on manually set paradigm concepts; accordingly, future work will also include research into a method for automatically generating paradigm concepts.
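The multi-HAL fusion mechanism mentioned above can be pictured as a weighted combination of a term's vectors drawn from different HAL spaces. The sketch below is our illustrative assumption only: the function names and the simple linear form are not the paper's exact formulation, and the beta weight is fixed at 0 to mirror the β = 0.0 settings used in the experiments.

```python
import math

def normalize(vec):
    """Scale a HAL co-occurrence vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(universal_vec, domain_vec, alpha, gamma):
    """Hypothetical linear fusion of one term's vectors from two HAL spaces.

    alpha weights the universal-knowledge space and gamma the
    domain-specific space; a third weight (beta) is omitted, i.e. fixed at 0.
    """
    u, d = normalize(universal_vec), normalize(domain_vec)
    return [alpha * ui + gamma * di for ui, di in zip(u, d)]

# Toy co-occurrence counts for one term in two corpora.
universal = [3.0, 0.0, 4.0]
domain = [0.0, 1.0, 0.0]
print(fuse(universal, domain, alpha=0.6, gamma=0.4))
```

Normalizing each space's vector before combining keeps corpora of very different sizes from dominating the fused representation, which is one plausible way to balance universal and domain-specific knowledge.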
Acknowledgements

This work was jointly supported by the National Natural Science Foundation of China (grant numbers 61173111, 60774086) and the PhD Programmes Foundation of the Ministry of Education of China (grant number 20090201110027).

Journal of Information Science, 38 (6) 2012, pp. 512–531 Ó The Author(s), DOI: 10.1177/0165551512448985



References

[1] Li YH, McLean D, Bandar ZA, O'Shea JD, Crockett K. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering 2006; 18(8): 1138–1150.
[2] Pinto D. On clustering and evaluation of narrow domain short-text corpora. PhD dissertation, Universidad Politécnica de Valencia, Spain, 2008.
[3] Ingaramo D, Pinto D, Rosso P, Errecalde M, Gelbukh A. Evaluation of internal validity measures in short-text corpora. In: Proceedings of CICLing '08, 2008, pp. 555–567.
[4] Ni XL, Quan XJ, Lu Z, Liu WY, Hua B. Short text clustering by finding core terms. Knowledge and Information Systems 2010; 27(3): 345–365.
[5] Phan XH, Nguyen LM, Horiguchi S. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on World Wide Web (WWW '08), Beijing, China, 2008, pp. 91–100.
[6] Sriram B, Fuhry D, Demir E, Ferhatosmanoglu H, Demirbas M. Short text classification in Twitter to improve information filtering. In: Proceedings of the 33rd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR '10), Geneva, Switzerland, 2010, pp. 841–842.
[7] Pinto D, Rosso P, Jiménez-Salazar H. A self-enriching methodology for clustering narrow domain short texts. Computer Journal 2011; 54(7): 1148–1165.
[8] Zelikovitz S, Hirsh H. Transductive LSI for short text classification problems. In: Proceedings of the 19th conference on artificial intelligence (AAAI '04), San Jose, CA, USA, 2004, pp. 556–561.
[9] Zelikovitz S, Marquez F. Transductive learning for short-text classification problems using latent semantic indexing. International Journal of Pattern Recognition and Artificial Intelligence 2005; 19(2): 143–163.
[10] Zelikovitz S, Hirsh H. Improving short text classification using unlabeled background knowledge to assess document similarity. In: Proceedings of the 17th international conference on machine learning (ICML '00), San Francisco, CA, USA, 2000, pp. 1183–1190.
[11] Zelikovitz S, Cohen WW, Hirsh H. Extending WHIRL with background knowledge for improved text classification. Information Retrieval 2007; 10(1): 34–67.
[12] Turney PD, Littman ML. Measuring praise and criticism: inference of semantic orientation from association. ACM Transactions on Information Systems 2003; 21(4): 315–346.
[13] Turney PD. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL '02), Philadelphia, PA, USA, 2002, pp. 417–424.
[14] Healy M, Delany SJ, Zamolotskikh A. An assessment of case base reasoning for short text message classification. In: Proceedings of the 16th Irish conference on artificial intelligence and cognitive science (AICS '05), 2005, pp. 257–266.
[15] Cormack GV, Gómez Hidalgo JM, Puertas Sánz E. Spam filtering for short messages. In: Proceedings of the 16th ACM conference on information and knowledge management (CIKM '07), Lisbon, Portugal, 2007, pp. 313–319.
[16] Yan R, Cao XB, Li K. Dynamic assembly classification algorithm for short text. Acta Electronica Sinica 2009; 37(5): 1019–1024.
[17] Budd JM. Revisiting the importance of cognition in information science. Journal of Information Science 2011; 37(4): 360–368.
[18] Bruza PD, Song DW, Wong KF. Aboutness from a commonsense perspective. Journal of the American Society for Information Science 2000; 51(12): 1090–1105.
[19] Song DW, Bruza PD. Towards context sensitive information inference. Journal of the American Society for Information Science and Technology 2003; 54(4): 321–334.
[20] Song DW, Bruza PD. Discovering information flow using a high dimensional conceptual space. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR '01), New Orleans, LA, USA, 2001, pp. 327–333.
[21] Cheong P, Song DW, Bruza P, Wong KF. Information flow analysis with Chinese text. In: Proceedings of the 1st international joint conference on natural language processing (IJCNLP '04), Sanya, China, 2004, pp. 91–98.
[22] Song DW, Bruza PD. Text based knowledge discovery with information flow analysis. In: Proceedings of the 8th Asia-Pacific web conference (APWeb '06), Harbin, China, 2006, pp. 692–701.
[23] Song DW, Lau RYK, Bruza PD, Wong KF, Chen DY. An intelligent information agent for document title classification and filtering in document-intensive domains. Decision Support Systems 2007; 44(1): 251–265.
[24] Gärdenfors P. Conceptual spaces: the geometry of thought. Cambridge, MA: MIT Press, 2000.
[25] Lund K, Burgess C. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers 1996; 28(2): 203–208.
[26] Guzmán-Cabrera R, Montes-y-Gómez M, Rosso P, Villaseñor-Pineda L. Using the web as corpus for self-training text categorization. Information Retrieval 2009; 12(3): 400–415.
[27] Lee JW, Lee SG, Kim HJ. A probabilistic approach to semantic collaborative filtering using world knowledge. Journal of Information Science 2011; 37(1): 49–66.
[28] Burgess C, Livesay K, Lund K. Explorations in context space: words, sentences, discourse. Discourse Processes 1998; 25(2–3): 211–257.
[29] Osgood CE, Suci GJ, Tannenbaum PH. The measurement of meaning. Urbana, IL: University of Illinois Press, 1957.
[30] Azzopardi L, Girolami M, Crowe M. Probabilistic hyperspace analogue to language. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR '05), Salvador, Brazil, 2005, pp. 575–576.
[31] Yeh JF, Wu CH, Yu LC, Lai YS. Extended probabilistic HAL with close temporal association for psychiatric query document retrieval. ACM Transactions on Information Systems 2009; 27(1): 1–30.
[32] Landauer TK, Foltz PW, Laham D. An introduction to latent semantic analysis. Discourse Processes 1998; 25(2–3): 259–284.
[33] Hofmann T. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 2001; 42(1–2): 177–196.
[34] Barwise J, Seligman J. Information flow: the logic of distributed systems. Cambridge Tracts in Theoretical Computer Science 1998; 49(3): 397–401.
[35] Sahami M, Heilman T. A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th international conference on World Wide Web (WWW '06), Edinburgh, UK, 2006, pp. 377–386.
[36] Pinto D, Jiménez-Salazar H, Rosso P, Gelbukh A. Clustering abstracts of scientific texts using the transition point technique. In: Proceedings of CICLing '06, 2006, pp. 536–546.
[37] Yih WT, Meek C. Improving similarity measures for short segments of text. In: Proceedings of the 22nd conference on artificial intelligence (AAAI '07), Vancouver, Canada, 2007, pp. 1489–1494.
[38] Pinto D, Benedí J-M, Rosso P, Gelbukh A. Clustering narrow-domain short texts by using the Kullback–Leibler distance. In: Proceedings of CICLing '07, 2007, pp. 611–622.
[39] Quan XJ, Liu G, Lu Z, Ni XL, Liu WY. Short text similarity based on probabilistic topics. Knowledge and Information Systems 2010; 25(3): 473–491.
[40] Zhang HY, Zhang WX. Hybrid monotonic inclusion measure and its use in measuring similarity and distance between fuzzy sets. Fuzzy Sets and Systems 2009; 160(1): 107–118.
[41] Xu T, Peng Q, Cheng Y. Identifying the semantic orientation of terms using S-HAL for sentiment analysis. Knowledge-Based Systems 2012, http://dx.doi.org/10.1016/j.knosys.2012.04.011.

