Text Classification for a Large-Scale Taxonomy Using Dynamically Mixed Local and Global Models for a Node

Heung-Seon Oh, Yoonjung Choi, and Sung-Hyon Myaeng

Department of Computer Science, Korea Advanced Institute of Science and Technology
{ohs,choiyj35,myaeng}@kaist.ac.kr

Abstract. Hierarchical text classification for a large-scale Web taxonomy is challenging because the number of hierarchically organized categories is large and the training data for deep categories are usually sparse. It has been shown that a narrow-down approach, involving a search of the taxonomical tree, is an effective method for the problem. A recent study showed that both local and global information for a node is useful for further improvement. This paper introduces two methods for dynamically mixing local and global models for individual nodes and shows that they improve classification effectiveness by 5% and 30%, respectively, over the state-of-the-art method.

Keywords: Web Taxonomy, Hierarchical Text Classification, ODP.

1 Introduction

A Web taxonomy is a large-scale hierarchy of categories in which each category is represented by a set of web documents. Utilizing web taxonomies has been shown to be useful for several tasks such as rare query classification, contextual advertising, and web search improvement [1,2,3]. For example, classifying a query to a web taxonomy can help detect a specific topic of interest, which can lead to more relevant online ads than those retrieved by the query directly. Despite its usefulness, text classification for a large-scale web taxonomy is challenging because the number of categories organized in a big hierarchy is large and the training data for deep categories are usually sparse. As such, traditional text classification methods relying on statistical machine learning alone are not satisfactory, especially when the classification task involves web taxonomies such as ODP and the Yahoo! Directory that have hundreds of thousands of categories and millions of documents distributed unevenly.

Many methods have been developed and studied to deal with the hierarchical text classification problem [4,5,6,7,8,9,10,11,12,13,14,15]. Compared to previous research focusing on machine learning algorithms alone, a narrow-down approach was proposed by introducing a deep classification algorithm in [8]. This algorithm consists of two stages: search and classification. In the search stage, several candidate categories that are highly related to the input document are retrieved from the entire hierarchy using a search technique. Then training data are collected for each candidate by considering the documents associated with the node and those with its ancestor category nodes. Finally, classification is performed after training a classifier for the candidate categories using the training documents. By focusing on highly relevant categories from the entire hierarchy, this approach can alleviate the error propagation and time complexity problems of a pure machine-learning based algorithm.

P. Clough et al. (Eds.): ECIR 2011, LNCS 6611, pp. 7–18, 2011. © Springer-Verlag Berlin Heidelberg 2011

Despite the improved performance, the aforementioned approach suffers from the relatively small number of training documents that are local to a category in a huge hierarchy. As a remedy, an enhanced algorithm was proposed, which makes use of the path from the candidate category node to the root of the hierarchy. Not only the documents associated with the candidate category (local information) but also the documents associated with the top-level categories (global information) connected to the candidate node are used to enrich the training data. Experimental results showed a remarkable performance improvement [15]. In that work, a fixed mixture weight was applied universally to each category node in modulating the relative contributions of the local and global models. However, we note that the role of global information may vary depending on the richness of local information. Suppose there are two candidate categories D and D' that are highly similar to each other based on the associated documents but have entirely different paths: A/B/C/D and E/F/G/D', where A and E are the roots of two sub-trees in the hierarchy, respectively. If a fixed mixture weight is applied to combine the information associated with A (global) and D (local), and with E (global) and D' (local), it ignores the relative importance of global and local information in each case. This paper proposes efficient and effective methods for determining the mixture weight automatically in this context, which avoid EM-like time-consuming updates.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. The details of the proposed methods are explained in Sections 3 and 4. Section 5 presents the experimental results and an error analysis.
Finally, we conclude this paper with discussion and future work in Section 6.

2 Related Work

Four types of approaches have been proposed to deal with hierarchical text classification: big-bang, shrinkage, top-down, and narrow-down approaches. In big-bang approaches, a single classifier is trained for the entire set of categories in a hierarchy. To build such a classifier, various algorithms have been utilized, including SVM [10], centroid-based [11], rule-based [12], and association-rule-based [13] approaches. It has been shown that a big-bang approach takes much more time than a top-down one even when a hierarchy contains only thousands of categories [14]. With hundreds of thousands of categories, moreover, building a single classifier for a large-scale hierarchy is intractable [6].

In a top-down approach, a classifier is trained for each level of a category hierarchy. Several studies adopted a top-down approach with various algorithms such as multiple Bayesian classifiers [16] and SVM [6,7]. The study in [6] reports that a top-down SVM classifier suffers big performance drops on the Yahoo! Directory as it goes to deeper levels. This is caused by the propagation of errors made at early stages, i.e., in the classifications involving the ancestors. In [7], two refinement methods using cross-validation and meta-features, which are the predictions for the children of a target node, are introduced to deal with error propagation and the nonlinearity of general concepts at high levels.


In [9], a shrinkage approach was proposed for hierarchical classification. The probabilities of words for a class corresponding to a leaf node are calculated for its parents up to the root and combined with a set of parameters. This method of dealing with the data sparseness problem is similar to ours in that it uses the data on the ancestor nodes. However, our method stops gathering training data right before a shared ancestor, while the shrinkage approach uses all the parent data up to the root of the hierarchy. The shrinkage method requires heavy computation, not only because of the need to consider all the data on the path to the root but also because of the time for parameter estimation with the EM algorithm. By focusing on the top- and leaf-level information, our method reduces time complexity considerably.

A recent study proposed the narrow-down approach [8]. The algorithm consists of search and classification stages. In classification, the algorithm focused on local information only, utilizing a trigram language model classifier. In [15], an enhanced deep classification algorithm is proposed, introducing neighbor-assistant training data selection and the idea of combining local and global information based on a naïve Bayes classifier. While it showed the possibility of augmenting local with global information, it used a fixed mixture weight for all candidate categories.

3 Search Stage

The enhanced deep classification approach consists of search and classification stages. For a document to be classified, top-k candidate categories are retrieved at the search stage. In order to select training data, a local model is constructed with the documents associated with each candidate category. Similarly, a global model is constructed for the top-level category connected to the candidate node in the given hierarchy. Finally, classification is performed using the classifier trained with the local and global models, combined with a dynamically computed mixture weight.

The aim of the search stage is to narrow down the entire hierarchy to several highly related categories by using a search technique. Two strategies are proposed in [8]: document-based and category-based search strategies. In the document-based strategy, the retrieval unit is a document. We simply index each document and obtain the top-k relevant documents based on relevance scores. Then, the corresponding categories of the retrieved documents are taken as candidate categories. In the category-based strategy, we first make a mega document by merging the documents belonging to a category and build an inverted index over the set of mega documents. This ensures a one-to-one relation between categories and mega documents. The search is then performed with respect to the given document and produces the top-k relevant mega documents. The corresponding categories are taken as candidates, since each mega document links to a unique category. We chose Lucene (http://lucene.apache.org) among several publicly available search engines, with the category-based strategy, which outperforms the document-based strategy [8].
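The category-based strategy can be sketched as follows. This is a minimal illustration rather than the actual implementation: it ranks mega documents with a simple TF-IDF cosine score instead of Lucene's scoring, and the function names are our own.

```python
from collections import Counter
import math

def build_mega_documents(docs_by_category):
    """Merge the documents of each category into one 'mega document' so that
    categories and retrieval units are in a one-to-one relation."""
    return {cat: Counter(t for doc in docs for t in doc)
            for cat, docs in docs_by_category.items()}

def top_k_categories(doc_terms, mega_docs, k=5):
    """Return the k candidate categories whose mega documents are most
    similar to the input document (TF-IDF cosine similarity)."""
    n = len(mega_docs)
    df = Counter()
    for bag in mega_docs.values():
        df.update(bag.keys())
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # +1 keeps common terms usable

    def tfidf(bag):
        return {t: tf * idf.get(t, 0.0) for t, tf in bag.items()}

    q = tfidf(Counter(doc_terms))
    q_norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    scores = {}
    for cat, bag in mega_docs.items():
        v = tfidf(bag)
        dot = sum(q[t] * v[t] for t in q if t in v)
        v_norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        scores[cat] = dot / (q_norm * v_norm)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The retrieved categories then become the candidates passed to the classification stage.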

4 Classification Stage

At the classification stage, a final category among the candidate categories is assigned to the given document. Since a classifier is trained for the candidate categories, a set of training data must be collected from the hierarchy. This section explains how we dynamically collect the training data, i.e., the documents associated with the candidate categories.

4.1 Training Data Selection

For the narrow-down approach, two training data selection strategies have been proposed in [8] and [15]: the ancestor-assistant and neighbor-assistant strategies. The common purpose of these strategies is to alleviate data sparseness in deep categories. The ancestor-assistant strategy utilizes the structure of the hierarchy. For each candidate, it borrows the documents of its ancestors all the way up until a common ancestor shared with another candidate node appears in the hierarchy. The neighbor-assistant strategy, an extension of the ancestor-assistant strategy, borrows documents not only from the ancestors but also from the children of the ancestors. The neighbor-assistant strategy produces slightly better performance than the ancestor-assistant strategy but requires more time [15]. We opted for the ancestor-assistant strategy in this paper for the sake of reduced time complexity.

4.2 Classifier

Given that training must be done for each document to be classified, a lightweight classifier is preferred to those that require longer training time. For this reason, we opted for a naïve Bayes classifier (NBC) in this work. NBC estimates the posterior probability of a test or input document as follows:

P(c_i \mid d) = \frac{P(d \mid c_i)\, P(c_i)}{P(d)} \propto P(c_i) \prod_{j=1}^{N} P(t_j \mid c_i)^{v_j}    (1)

where c_i is category i, d is a test document, N is the vocabulary size, t_j is term j in d, and v_j is the term frequency of t_j. NBC assigns a category to a document as follows:

c^* = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{N} P(t_j \mid c_i)^{v_j}    (2)
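Equations (1) and (2) translate directly into a log-space implementation, which is how NBC is typically computed in practice to avoid floating-point underflow. A minimal sketch with our own function names; `cond_probs` is assumed to be already smoothed so that no probability is zero:

```python
import math
from collections import Counter

def nb_classify(doc_terms, priors, cond_probs, floor=1e-12):
    """Eq. (2): c* = argmax_c P(c) * prod_j P(t_j | c)^{v_j}, in log space.
    v_j is the term frequency of t_j in the test document."""
    tf = Counter(doc_terms)                       # the v_j counts
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for t, v in tf.items():
            score += v * math.log(cond_probs[c].get(t, floor))
        if score > best_score:
            best, best_score = c, score
    return best
```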

In general, NBC estimates P(c_i) and P(t_j \mid c_i) from the entire collection D of training data. In deep classification, however, D denotes the training data from the candidate categories. Since both local and global documents are used for training, the probabilities are computed separately and then combined with a mixture weight as follows:

P(t_j \mid c_i) = (1 - \lambda_i) P(t_j \mid c_i^{global}) + \lambda_i P(t_j \mid c_i^{local})    (3)

P(c_i) = (1 - \lambda_i) P(c_i^{global}) + \lambda_i P(c_i^{local})    (4)

where c_i^{global} is the top-level category of c_i and c_i^{local} is the same as c_i, rephrased for explicit explanation; e.g., c_i^{global} = A and c_i^{local} = A/B/C for c_i = A/B/C.
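A sketch of the mixing step in Eqs. (3)-(4), applied to word distributions represented as dictionaries (illustrative names; both inputs are assumed to be proper probability distributions):

```python
def mix_models(p_local, p_global, lam):
    """Eq. (3): P(t|c_i) = (1 - lambda_i)*P(t|c_i^global) + lambda_i*P(t|c_i^local).
    The same convex combination is applied to the priors in Eq. (4)."""
    terms = set(p_local) | set(p_global)
    return {t: (1.0 - lam) * p_global.get(t, 0.0) + lam * p_local.get(t, 0.0)
            for t in terms}
```

Because the combination is convex, the result remains a valid probability distribution whenever both inputs are.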


4.3 Global and Local Models

The aim of the global model is to utilize information in the top-level categories under the root node. Because of the topical diversity of the categories in a hierarchy, the vocabulary size is very large although the amount of training data is usually small. To alleviate this problem in representing top-level categories, 15,000 terms are selected using the chi-square feature selection method, which is known as the best-performing one in text classification [17]. The prior probability is estimated as follows:

P(c_i^{global}) = \frac{|D_i|}{|D|}    (5)

where D is the entire document collection and D_i is the sub-collection in c_i^{global}. The conditional probability is estimated as a mixture of P_{global}(t_j \mid c_i^{global}) and P_{global}(t_j) to avoid zero probabilities:

P(t_j \mid c_i^{global}) = (1 - \alpha) P_{global}(t_j \mid c_i^{global}) + \alpha P_{global}(t_j)    (6)

where \alpha is a mixture weight (0 \le \alpha \le 1). The two parameters are estimated as follows:

P_{global}(t_j \mid c_i^{global}) = \frac{\sum_{d_k \in c_i^{global}} tf_{jk}}{\sum_{t_u \in V^{global}} \sum_{d_k \in c_i^{global}} tf_{uk}}    (7)

P_{global}(t_j) = \frac{\sum_{d_k \in D} tf_{jk}}{\sum_{t_u \in V^{global}} \sum_{d_k \in D} tf_{uk}}    (8)

where tf_{jk} is the term frequency of term t_j in document d_k and V^{global} is the set of terms selected by the chi-square feature selection method over the entire document collection.

The role of our local model is to estimate the parameters of the candidate categories directly from the set of documents associated with them. Unlike the global model, we do not limit the feature space using a feature selection method: the time required for feature selection is not tolerable given that generating a local model cannot be done off-line. The terms selected for the global model cannot be used for the local model either, because a limited number of terms concentrated in a semantic category is used in deep categories. Parameter estimation is similar to that of the global model except that it is done on the selected training data corresponding to the candidate categories.

4.4 Dynamic Determination of a Mixture Weight

The main step in classification is to properly combine the local and global models for each candidate category node by determining a proper mixture weight. The shrinkage approach proposed in the past finds a mixture weight using a simple form of the expectation-maximization (EM) algorithm [9]. Unfortunately, this method is not applicable to a large-scale web taxonomy. Too many parameters are involved in estimating a mixture weight because a word is generated conditioned on different categories and levels. For example, ODP has hundreds of thousands of terms, millions of categories, and 15 levels of depth. As a result, heavy computation is required if the EM algorithm is used for training that must be done for individual documents. Even though only a small number of candidate categories are chosen, the cost is prohibitive as we must dynamically compute mixture weights for different input documents.

As an alternative, we propose two methods that determine the mixture weights dynamically for individual documents to be classified. They are used to optimize the weights for individual documents, instead of using a "one-for-all" type of weight.

Content-Based Optimization (CBO). The main idea is to utilize the difference between the local and global models in terms of their semantic contents, which can be estimated based on word frequencies. The similarity can be computed based on the probability distributions of words in the two models. Given a candidate category, we do not need to give a high weight to the global model if it is similar to the local model. For example, when a classification decision is to be made between D and D' having paths A/B/C/D and E/F/G/D', respectively, there is no point in giving a high weight to the top-level categories, A and E, if D and D' are similar to them, respectively. In this case, it is reasonable to focus on local information for classification. We capture this intuition with the following equation:

\lambda_i = 1 - \frac{\sum_j P(t_j \mid c_i^{local}) \cdot P(t_j \mid c_i^{global})}{\sqrt{\sum_j P(t_j \mid c_i^{local})^2} \sqrt{\sum_j P(t_j \mid c_i^{global})^2}}    (9)
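Eq. (9) is one minus the cosine similarity of the two word distributions, so it can be computed in a single pass over the shared vocabulary. A sketch under the assumption that both models are given as term-to-probability dictionaries:

```python
import math

def cbo_weight(p_local, p_global):
    """lambda_i = 1 - cos(P(.|c_i^local), P(.|c_i^global)), Eq. (9).
    Identical models give lambda ~ 0; disjoint models give lambda = 1."""
    dot = sum(p * p_global.get(t, 0.0) for t, p in p_local.items())
    n_local = math.sqrt(sum(p * p for p in p_local.values()))
    n_global = math.sqrt(sum(p * p for p in p_global.values()))
    if n_local == 0.0 or n_global == 0.0:
        return 1.0          # degenerate model: fall back to local information
    return 1.0 - dot / (n_local * n_global)
```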

Relevance-Based Optimization (RBO). The main idea here is to utilize the relevance scores of the candidate categories obtained from the search stage. The higher the relevance score of a category is in comparison with the others, the higher the weight it is given. Note that relevance scores are calculated with no regard to the hierarchy information, since all the documents under a category are treated as a mega document regardless of its position in the hierarchy. In this work, we use the cosine similarity between the input document to be classified and the set of documents belonging to the candidate category at hand. Based on this interpretation, a mixture weight is calculated as follows:

\lambda_i = \frac{RelScore(c_i^{local})}{\sum_k RelScore(c_k^{local})}    (10)

where k ranges over the candidate categories in the initial search result.
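Eq. (10) is a plain normalization of the search-stage scores, which makes RBO essentially free at classification time. A sketch with illustrative names; `rel_scores` maps each of the k candidate categories to its relevance score from the search stage:

```python
def rbo_weights(rel_scores):
    """lambda_i = RelScore(c_i^local) / sum_k RelScore(c_k^local), Eq. (10)."""
    total = sum(rel_scores.values())
    if total == 0.0:                      # no signal: weight candidates equally
        return {c: 1.0 / len(rel_scores) for c in rel_scores}
    return {c: s / total for c, s in rel_scores.items()}
```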

5 Experiments

5.1 Experimental Set-Up

Data. ODP has about 700K categories and 4.8M documents with 17 top-level categories: Arts, Business, Computers, Games, Health, Home, Kids_and_Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports, World, and Adult. Among them, we filtered out the World and Regional categories, since documents in the Regional category exist in other categories and those in the World category are not written in English [15]. In addition, the categories having non-English documents were also discarded. As a result, the total number of top-level categories is 15 in our experiments. For leaf categories whose names are just an enumeration of the alphabet, such as "A", "B", …, and "Z", we merged them into their parent category. For example, documents that belong to the …/Fan_Works/Fan_Art/M category are merged into the …/Fan_Works/Fan_Art category. Finally, the categories with fewer than two documents were discarded.

Fig. 1. Category Distributions at Different Levels in the Filtered ODP


Fig. 2. Document Distributions at Different Levels in Filtered ODP

Figures 1 and 2 show the category and document distributions, respectively, at different levels after filtering. The filtered ODP contains 81,349 categories and 1,358,649 documents. About 89% of the documents belong to the categories at the top 7 levels, and about 67% of them belong to the categories at the 4th, 5th, and 6th levels. Most of the documents (96.46%) belong to only one category. In this experiment, 20,000 documents that belong to a unique category are used for testing, and the rest are used for training. In selecting the test data, we considered the proportions of the numbers of documents and categories at each level to those in the entire hierarchy, as in [15].


Evaluation Criteria. To show the efficacy of our methods, evaluations were conducted for different levels and categories. For level-based evaluation, we tested the sensitivity of the methods to different abstraction levels, as in [8,15], by looking at the micro-averaged F1 values at different levels. For example, when a document is classified to Science/Biology/Ecology/Ecosystems, we first check whether Science matches the first level of the correct category. If it passes the first test, it is tested for the second level by checking whether Science/Biology matches the correct category, and so on. These progressive tests stop at level 9 because the number of documents at deeper levels is very small. For category-based evaluation, we tested whether the methods depend on different topic categories. Only the categories at the top level were used.

5.2 Evaluation

We compared our methods with the deep classification (DC) in [8] and the enhanced deep classification (EDC) in [15], as summarized in Table 1. DC employs the category-based search, ancestor-assistant training data selection, and a trigram language model classifier. EDC utilizes the same search strategy but a different method for training data selection and a different classifier.

Table 1. The summary of algorithm settings

             Deep Classification (DC)   Enhanced DC (EDC)
Search       Category-based             Category-based
Training     Ancestor-assistant         Neighbor-assistant
Classifier   Trigram Language Model     Naïve Bayes combining Local and Global Information
Smoothing    No smoothing               Add-One in Global Model

We implemented these methods and experimented with our dataset. The top-5 categories were considered as the candidates for classification. For the implementation of EDC, the ancestor-assistant strategy was chosen to avoid the time complexity of the neighbor-assistant strategy. Our basic model is the same as EDC but differs in two aspects: the smoothing and mixture weight computation methods. Compared to the add-one smoothing applied only to the global model in EDC, we utilize interpolation for both the local and global models. More importantly, the two optimization methods were used to determine the appropriate mixture weights dynamically in the proposed approach, whereas the mixture weight between the local and global models was set empirically in EDC.

Overall Performance. Figure 3 shows that our proposed deep classification with CBO (PDC-CBO) and RBO (PDC-RBO) outperforms both DC and EDC at all levels, with only one exception (PDC-CBO is slightly worse than EDC at level 1). PDC-RBO attains 0.8302 at the top level, for example, while DC and EDC reach 0.6448 and 0.7868, respectively. Overall, PDC-CBO and PDC-RBO obtained 5% and 30% improvements respectively over EDC, and 77% and 119% over DC.
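The level-based evaluation described in Section 5.1, which underlies these comparisons, can be sketched as a progressive prefix match on category paths (our own function name; paths are assumed to be slash-delimited as in the ODP examples above):

```python
def level_matches(predicted_path, correct_path, max_level=9):
    """Check the prediction level by level, as in the Science/Biology/...
    example: each level is correct only if the full path prefix matches, and
    the test stops at the first mismatch (or at max_level)."""
    pred = predicted_path.split("/")
    correct = correct_path.split("/")
    results = []
    for level in range(1, min(len(correct), max_level) + 1):
        ok = pred[:level] == correct[:level]
        results.append(ok)
        if not ok:
            break
    return results
```

Micro-averaged F1 at level L is then computed over the level-L outcomes of all test documents whose correct category reaches that depth.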

Since the global model provides information about the top-level category, it is most useful when two or more sub-categories (i.e., local models) contain similar information. With PDC-CBO, for example, Kids_and_Teens/Arts/Dance and Arts/Performing_Arts/Dance may contain similar document contents at the Dance level. In this case, the top-level category information can provide a guide to the right path. One possible drawback of CBO is that the information content at the top level may be overly general. On the other hand, RBO uses relative information, i.e., relevance scores that effectively reflect the local information. It attempts to compute the degree to which the local information can be trusted and then fills in the rest with the global information.

Fig. 3. Comparisons among Four Different Methods

Role of Optimization. We conducted experiments to investigate the role of the optimization methods. The EDC method with no optimization and add-one smoothing in the global model is regarded as the baseline to which each proposed optimization method is applied. As shown in Figure 4, the RBO method achieved a 23.4% improvement over the baseline. On the other hand, the CBO method somewhat decreases performance, even though it gains some improvement at the 9th level. While the main thrust of the CBO method is to utilize the difference between the local and global models, the add-one smoothing applied to the global model makes that advantage disappear.

Effects of Interpolation. Figures 5 and 6 show the effects of the modified classifier with interpolation aimed at avoiding zero probabilities. The value of alpha is set to 0.7 since it achieved the best performance. As shown in Figure 5, CBO with the modified classifier obtains better performance than the EDC algorithm, although CBO with add-one smoothing results in lower performance than EDC. The improvement becomes larger at deeper levels. Moreover, the performance improves by 7% over the CBO method with add-one smoothing. As shown in Figure 6, RBO with the modified classifier also shows a 5% improvement compared to RBO with add-one smoothing. The interpolation method is thus valuable in estimating the global and local models.
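The two smoothing schemes being compared can be sketched as follows. This is our own illustration: the add-one variant follows the usual Laplace formula over the seen vocabulary, and the interpolation variant follows the shape of Eq. (6), with the background distribution weighted by alpha.

```python
def add_one(term_freqs, vocab_size):
    """Add-one (Laplace) smoothing: P(t|c) = (tf_t + 1) / (total_tf + |V|).
    (Only seen terms are returned; an unseen term gets 1 / (total_tf + |V|).)"""
    denom = sum(term_freqs.values()) + vocab_size
    return {t: (f + 1) / denom for t, f in term_freqs.items()}

def interpolate(p_model, p_background, alpha=0.7):
    """Eq. (6)-style interpolation: P(t|c) = (1-alpha)*P_model(t|c) + alpha*P_bg(t).
    Unseen terms receive probability mass from the background distribution."""
    terms = set(p_model) | set(p_background)
    return {t: (1.0 - alpha) * p_model.get(t, 0.0)
               + alpha * p_background.get(t, 0.0) for t in terms}
```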

Fig. 4. Comparisons among Different Optimization Methods

Fig. 5. Comparisons among EDC, CBO with Add-One Smoothing, and CBO with Interpolation

Fig. 6. Comparisons among EDC, RBO with Add-One Smoothing, and RBO with Interpolation

Performance with Different Categories. Given the limited space and the resources required to run the experiments for all the categories in the hierarchy, we show the performance for the top-level categories only. As shown in Table 2, PDC-CBO and PDC-RBO show remarkable improvements over DC and EDC across the different categories. The macro-F1 scores of PDC-CBO and PDC-RBO show 25.05% and 77.83% improvements, respectively, compared to EDC. Our method with RBO shows the best performance on the Adult category, where micro-F1 is 0.4428. The other categories, except for Kids_and_Teens and Reference, also achieved micro-F1 scores above 0.3.


Table 2. Performance on Top-Level Categories

                      DC                  EDC                 PDC-CBO             PDC-RBO
Category         Macro-F1 Micro-F1   Macro-F1 Micro-F1   Macro-F1 Micro-F1   Macro-F1 Micro-F1
Adult             0.0148   0.0183     0.1819   0.2012     0.2224   0.2407     0.4010   0.4428
Arts              0.0202   0.0194     0.1884   0.1895     0.2307   0.2204     0.4315   0.4036
Society           0.0108   0.0152     0.1652   0.1951     0.2187   0.2291     0.3799   0.3963
Recreation        0.0072   0.0097     0.1757   0.2007     0.2453   0.2648     0.4009   0.4279
Home              0.0077   0.0143     0.1711   0.1972     0.2386   0.2350     0.4104   0.4222
Kids_and_Teens    0.0121   0.0115     0.1015   0.1368     0.1007   0.1265     0.1861   0.2170
Reference         0.0031   0.0063     0.1123   0.1316     0.1268   0.1638     0.2248   0.2602
Computers         0.0073   0.0064     0.1555   0.1914     0.2062   0.2201     0.3267   0.3582
Business          0.0062   0.0069     0.1601   0.2003     0.1968   0.2287     0.3556   0.4036
Games             0.0085   0.0142     0.1339   0.1483     0.1721   0.1878     0.2818   0.3045
Sports            0.0137   0.0185     0.1580   0.1844     0.2306   0.2411     0.3855   0.4177
Shopping          0.0079   0.0076     0.1737   0.2110     0.2091   0.2394     0.3678   0.4073
Health            0.0064   0.0077     0.1613   0.1853     0.2199   0.2416     0.3884   0.4221
Science           0.0143   0.0170     0.1974   0.2229     0.2438   0.2569     0.4015   0.4202
News              0.0031   0.0074     0.1401   0.1849     0.1096   0.1357     0.3420   0.3563
Average           0.0096   0.0120     0.1584   0.1854     0.1981   0.2154     0.3523   0.3773
Improvement                                               25.05%   16.22%     77.83%   75.14%

6 Conclusion

In this paper, we proposed a modified version of deep classification, which introduces two optimization methods for the purpose of properly combining local and global models. The experimental results show that our approach with the two optimization methods achieves 5% and 30% improvements in micro-F1, respectively, over the state-of-the-art methods in level-oriented evaluations. For category-oriented evaluations, we also achieved 77.83% and 75.14% improvements in macro-F1 and micro-F1, respectively. Even though our optimization methods perform quite well, there are some remaining issues that need to be investigated further. One is that RBO may be very sensitive to the number of candidate categories, and therefore further work is needed to investigate ways to make the method invariant to it. The other is what needs to be done when the difference between the local and global models is significantly large. These issues will be studied with more extensive experiments.

Acknowledgement

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0028079), and Global RFP program funded by Microsoft Research.


References

1. Broder, A.Z., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., Zhang, T.: Robust classification of rare queries using web knowledge. In: 30th ACM SIGIR, pp. 231–238 (2007)
2. Broder, A., Fontoura, M., Josifovski, V., Riedel, L.: A semantic approach to contextual advertising. In: 30th ACM SIGIR, pp. 559–566 (2007)
3. Zhang, B., Li, H., Liu, Y., Ji, L., Xi, W., Fan, W., Chen, Z., Ma, W.Y.: Improving web search results using affinity graph. In: 28th ACM SIGIR, pp. 504–511 (2005)
4. Kosmopoulos, A., Gaussier, E., Paliouras, G., Aseervatham, S.: The ECIR 2010 large scale hierarchical classification workshop. SIGIR Forum 44, 23–32 (2010)
5. Sun, A., Lim, E.P.: Hierarchical text classification and evaluation. In: IEEE ICDM, pp. 521–528 (2001)
6. Liu, T.Y., Yang, Y., Wan, H., Zeng, H.J., Chen, Z., Ma, W.Y.: Support vector machines classification with a very large-scale taxonomy. ACM SIGKDD Explorations Newsletter 7, 36–43 (2005)
7. Bennett, P.N., Nguyen, N.: Refined experts: improving classification in large taxonomies. In: 32nd ACM SIGIR, pp. 11–18 (2009)
8. Xue, G.R., Xing, D., Yang, Q., Yu, Y.: Deep classification in large-scale text hierarchies. In: 31st ACM SIGIR, pp. 619–626 (2008)
9. McCallum, A., Rosenfeld, R., Mitchell, T., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: 15th ICML, pp. 359–367 (1998)
10. Cai, L., Hofmann, T.: Hierarchical document categorization with support vector machines. In: 13th ACM CIKM, pp. 78–87 (2004)
11. Labrou, Y., Finin, T.: Yahoo! as an ontology: using Yahoo! categories to describe documents. In: 8th ACM CIKM, pp. 180–187 (1999)
12. Sasaki, M., Kita, K.: Rule-based text categorization using hierarchical categories. In: IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, pp. 2827–2830 (1998)
13. Wang, K., Zhou, S., He, Y.: Hierarchical classification of real life documents. In: 1st SIAM International Conference on Data Mining, pp. 1–16 (2001)
14. Yang, Y., Zhang, J., Kisiel, B.: A scalability analysis of classifiers in text categorization. In: 26th ACM SIGIR, pp. 96–103 (2003)
15. Oh, H.S., Choi, Y., Myaeng, S.H.: Combining global and local information for enhanced deep classification. In: 2010 ACM SAC, pp. 1760–1767 (2010)
16. Koller, D., Sahami, M.: Hierarchically classifying documents using very few words. In: ICML, pp. 170–178 (1997)
17. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML, pp. 412–420 (1997)
