2-way Text Classification for Harmful Web Documents

Youngsoo Kim1,2, Taekyong Nam1, and Dongho Won2

1 Network Security Group, Electronics and Telecommunications Research Institute (ETRI), 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-350, Korea
{blitzkrieg, tynam}@etri.re.kr

2 Information Security Group, School of Information and Communication Engineering, Sungkyunkwan University, 300 Cheoncheon-dong, Jangan-gu, Suwon, Gyeonggi-do, 440-746, Korea
[email protected]

Corresponding author: Dongho Won ([email protected])
Abstract. The openness of the Web allows any user to access almost any type of information. However, some information, such as adult content, is not appropriate for all users, notably children. Moreover, some content found on abnormal porn sites can harm the mental health even of ordinary adults. In this paper, we propose an efficient 2-way text filter for blocking harmful web documents and present a new criterion for clear classification. The filter removes 0-grade web texts, which contain no harmful words, by pattern matching against harmful-words dictionaries, and classifies 1-grade, 2-grade, and 3-grade web texts using a machine learning algorithm.
1 Introduction
The World Wide Web is growing ever more rapidly. More and richer information sources and services, such as news, advertisements, consumer information, and adult content, become available on the Web every day. Simultaneously, user communities are becoming increasingly diverse. The openness of the Web allows any user to access almost any type of information. However, some information, such as adult content, is not appropriate for all users, notably children. Moreover, some content found on abnormal porn sites can harm the mental health even of ordinary adults. Some companies offer partial solutions to this problem. Their products concentrate on IP-based filtering, and their classification of web sites is mostly manual. But, as we know, the Web is a highly dynamic information source. Not only do many web sites appear every day while others disappear, but site content (including linkage information) is also updated frequently. Thus, manual classification and filtering systems are largely impractical. The highly dynamic character of the Web calls for new techniques designed to classify and filter web sites and URLs automatically.
Additionally, most conventional filtering products focus only on adult web sites in order to protect children. In the real world, some non-adult sites can contain adult content such as adult education, sex consultation, or sex-related gossip. On the other hand, some abnormal adult sites include a large amount of objectionable content that can harm the mental health of ordinary adults. A new criterion for classifying the contents of web pages is therefore needed. In this paper, we focus on the texts contained in web documents. We propose an efficient 2-way text filter for blocking harmful web documents and present a new criterion for clear classification. The filter removes 0-grade web texts, which contain no harmful words, by pattern matching against harmful-words dictionaries, and classifies 1-grade, 2-grade, and 3-grade web texts using a machine learning algorithm. Since every document without adult-related content is 0-grade, 0-grade documents vastly outnumber the others; this is why the filter removes 0-grade web texts first. This 2-step filtering helps generate more accurate learning models.

The paper is organized as follows. Section 1 is this introduction. Section 2 discusses content rating services in the real world and presents our new criterion for rating web documents. Section 3 shows the system framework and explains the classification processes. Sections 4 and 5 describe our system implementation and experimental results, and Section 6 closes with further work and the conclusion.
2 Contents Rating Services

2.1 Content rating standards
There are several standards for content rating and filtering in various countries. ICRA (Internet Content Rating Association) is the representative standards body; it develops and operates an international content rating system [1]. ICRA announced a new content rating standard, ICRAsafe, in December 2000. In Korea, there is a Korean content rating system for the internet, SafeNet [2], developed by the ICEC (Information Communication Ethics Committee) and in operation since December 1999 [3]. It defines five categories and five levels for harmful internet content and interoperates with ICRA. SafeNet's categories include nudity, sexual intercourse, violence, brutal language usage, drugs, weapon usage, and gambling; they do not focus only on pornographic content.

2.2 A new criterion for rating texts of web documents
We define four grades for our system. These grades can be applied in a filtering system according to the user's age. Harmful texts can involve violence, brutal language usage, nudity, sexual intercourse, drugs, and so on, but in this paper we focus on pornographic content.

The 0-grade is for non-harmful web documents. Any web document without adult-related content can be 0-grade; such pages deal with politics, economy, society, culture, and a huge variety of other topics. The 1-grade is for web documents that contain adult content such as adult education, sex consultation, or sex-related gossip. If a text filter uses only simple pattern matching, 1-grade documents are easily confused with 2-grade and 3-grade documents, because they contain harmful words even though their overall context is not harmful. The 2-grade is for web documents whose contents are normal pornography or erotic stories, including straight or gay sex, fetish, bestiality, bisexual, BDSM, voyeurism, and so on. The 3-grade is for web documents containing abnormal pornographic content such as incest, snuff films, or child porn. These sites, obviously unpleasant to normal adults, are not only immoral but also illegal.

We collected about 30,000 harmful texts using web robots and manually graded them with this four-grade criterion. We purposely collected many web documents that are not easy to grade correctly.
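As an illustration, the four grades map naturally onto an age-based filtering policy. The following minimal sketch is in Python; the age thresholds are our own illustrative assumptions (the paper does not prescribe specific ages), and 3-grade content is blocked unconditionally since it is described as illegal.

```python
# Minimal sketch of an age-based policy over the four grades.
# The thresholds below are illustrative assumptions, not values from the paper;
# 3-grade content is always blocked because it is described as illegal.
GRADE_MIN_AGE = {0: 0, 1: 15, 2: 18, 3: None}  # None means always blocked

def is_accessible(grade: int, user_age: int) -> bool:
    min_age = GRADE_MIN_AGE[grade]
    return min_age is not None and user_age >= min_age
```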
3 A System Framework and Operation Processes
Fig. 1 shows our system framework and the rating processes for web documents. It consists of five parts: web-documents collection, preprocessing (morphological analysis), rule-based text classification, learn-based text classification, and harmful URL management.
Fig. 1. System framework and web documents rating processes
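To make the flow of Fig. 1 concrete, the following minimal sketch traces one document through the pipeline. The helpers preprocess and contains_harmful_word are defined in the sketches accompanying Sections 3.2 and 3.3 below; svm_rate stands for the learn-based stage of Section 3.4. None of these names is the system's actual API.

```python
# Minimal sketch of the two-way rating flow of Fig. 1. The three helpers are
# placeholders for the components of Sections 3.2-3.4, not the real system API.
def rate_document(html_text: str) -> int:
    morphemes = preprocess(html_text)         # Sec. 3.2: HTML parsing + morphological analysis
    if not contains_harmful_word(morphemes):  # Sec. 3.3: rule-based pattern matching
        return 0                              # 0-grade documents are filtered off early
    return svm_rate(morphemes)                # Sec. 3.4: SVM assigns grade 1, 2, or 3
```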
3.1 Web-documents collection
This part gathers web documents that can be used as learning samples and for establishing the harmful URL database. Web robots visit internet sites and gather web pages, which are stored in a database after being classified into one of the four grades according to the new rating criterion.

3.2 Preprocessor (Morphological Analysis)
The main part of this function is the morphological analysis. All web documents contain HTML tags, so they are HTML-parsed before being divided into morphemes by a morphological analyzer. This process includes the deletion of symbols and stop-words to help the morphological analysis. All web documents in the databases need this process, both for learning and for rating. Several dictionaries support this process: the stop-words dictionary is used for deleting symbols and stop-words; to improve performance, morphologically analyzed words are stored in the pre-analysis dictionary; and words that cannot be analyzed morphologically can be recorded in the user dictionary.
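A minimal sketch of this step, assuming Python's standard html.parser for tag removal and a toy stop-word set in place of the stop-words dictionary, is shown below. A real deployment would delegate Korean morpheme segmentation to a dedicated morphological analyzer, which we elide here.

```python
from html.parser import HTMLParser

# Minimal preprocessing sketch: strip HTML tags, then drop symbols and
# stop-words. STOP_WORDS is a toy stand-in for the stop-words dictionary,
# and real Korean morpheme segmentation is left to an external analyzer.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

STOP_WORDS = {"the", "a", "an", "of", "and"}  # placeholder entries

def preprocess(html_text: str) -> list[str]:
    parser = TextExtractor()
    parser.feed(html_text)
    tokens = " ".join(parser.chunks).split()
    # Keeping alphanumeric tokens deletes symbols; stop-words are dropped.
    return [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]
```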
3.3 Rule-based text classification
This function extracts non-harmful documents (0-grade web documents) from all collected web documents using a pattern matching algorithm. It decides that a document is non-harmful if the document does not contain any harmful words. The function uses two dictionaries: the harmful-words dictionary and the homonymic-words dictionary. Documents containing homonyms that merely resemble lewd words could otherwise be misjudged as harmful. We built these dictionaries by checking approximately 30,000 documents, extracting word bags from them, and selecting the harmful words with high term frequency.
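A minimal sketch of this filter is given below. The two dictionaries are represented as sets with placeholder entries; treating every homonym hit as non-harmful is a simplification of the paper's homonym handling, assumed here for brevity.

```python
# Minimal sketch of the rule-based 0-grade filter. HARMFUL_WORDS and HOMONYMS
# stand in for the harmful-words and homonymic-words dictionaries; the entries
# are placeholders. Discounting all homonym hits is a simplifying assumption.
HARMFUL_WORDS = {"lewd_word_a", "lewd_word_b"}  # high term-frequency harmful words
HOMONYMS = {"lewd_word_b"}                      # words with an innocent second sense

def is_zero_grade(morphemes: list[str]) -> bool:
    hits = {m for m in morphemes if m in HARMFUL_WORDS}
    # A document goes on to the SVM stage only if at least one hit
    # is a genuinely harmful word rather than a mere homonym.
    return not (hits - HOMONYMS)

def contains_harmful_word(morphemes: list[str]) -> bool:
    return not is_zero_grade(morphemes)
```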
3.4 Learn-based text classification
This function adopts the SVM learning algorithm to classify harmful documents (1-grade, 2-grade, and 3-grade). It is divided into two main processes: a learning process and a rating process. The learning process consists of feature selection, indexing, SVM preprocessing [4][5], and generation of learning models. It calculates feature vectors from the result of the morphological analysis. There are several algorithms for selecting features, such as TF (Term Frequency), MI (Mutual Information), IG (Information Gain), and CHI (χ² statistics) [6]. Indexing assigns weights to the features, because features have different degrees of importance in each web document; we adopted the TFIDF (Term Frequency Inverse Document Frequency) method to weight each feature [7]. The SVM preprocessing part includes normalization of the feature vectors and a grid search for finding optimal SVM parameters. Finally, a learning model is generated using the optimal SVM parameters.
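The learning and rating processes can be sketched with scikit-learn; this library choice is our assumption, since the paper does not name its SVM implementation. The variables train_texts, train_grades, and test_texts are hypothetical inputs holding the morphologically analyzed documents and their grades, and the parameter grid includes the C and g values reported in Section 5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Sketch of the learning process, assuming scikit-learn; train_texts,
# train_grades, and test_texts are hypothetical inputs (grades 1-3 only,
# since 0-grade documents were already filtered off by the rule-based stage).
vectorizer = TfidfVectorizer()              # indexing: TFIDF weight per feature
X = vectorizer.fit_transform(train_texts)   # rows come out L2-normalized

# SVM preprocessing: grid search for optimal parameters. The grid includes the
# values reported in Section 5 (C=2, g=0.5 for IG; C=0.8, g=0.125 for log-TF).
param_grid = {"C": [0.5, 0.8, 1, 2, 4], "gamma": [0.125, 0.25, 0.5]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, train_grades)
model = search.best_estimator_              # the generated learning model

# Rating process: index and normalize new documents identically, then rate.
predicted_grades = model.predict(vectorizer.transform(test_texts))
```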
The rating process consists of indexing, normalization, and rating, and is applied to the web documents to be rated. These documents need indexing and normalization so that they can be compared against the learning model generated in the learning process. After that, the harmful grades 1, 2, and 3 are assigned to them.
3.5 Harmful URL management
This function establishes the harmful URL database and manages it. The database includes fields such as serial number, domain name, IP address, path, title, document grade, the date and time the grade was assigned, the date of the last availability check, and so on. This function periodically checks the validity of the stored URLs, since they can go dead or their contents can change. The harmful URL database is distributed to other filtering tools that operate on harmful URL lists.
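A minimal sketch of the database schema, using sqlite3 purely for illustration (the paper does not specify a DBMS), might look as follows; the column set mirrors the fields listed above.

```python
import sqlite3

# Illustrative harmful-URL schema; the DBMS choice (sqlite3) is an assumption,
# and the columns mirror the fields enumerated in Section 3.5.
conn = sqlite3.connect("harmful_urls.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS harmful_url (
        serial_no     INTEGER PRIMARY KEY,
        domain_name   TEXT,
        ip_address    TEXT,
        path          TEXT,
        title         TEXT,
        grade         INTEGER,   -- 1, 2, or 3
        graded_date   TEXT,
        graded_time   TEXT,
        last_checked  TEXT       -- date of the most recent availability check
    )
""")
conn.commit()
```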
4 System Implementation
Our system consists of five modules: web-documents collection, preprocessing (morphological analysis), rule-based text classification, learn-based text classification, and harmful URL management.

First, the web-documents collection module provides functions for collecting web documents using meta search engines and web robot agents. It drives a meta search engine specialized for finding harmful sites and obtains a list of harmful sites as a result. This list serves as seed URLs for the web robots, so that they retrieve harmful web documents more precisely.

Second, the preprocessing module prepares web documents and hands them to the rule-based and learn-based text classification modules. Because an HTML document contains many tags that describe its structure, this module must extract only the content words. The module then identifies the language and character encoding: our system can process two languages, Korean and English, and although there are several encodings for representing Korean, such as KSC 5601 and Unicode, our system can process KSC 5601 only (a sketch of this check follows below). After that, it performs the morphological analysis, supported by the dictionaries described in Section 3.2: the stop-words dictionary for deleting symbols and stop-words, the pre-analysis dictionary that caches morphologically analyzed words to improve performance, and the user dictionary for words that cannot be analyzed morphologically.
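One way to realize the KSC 5601-only restriction is a decode-and-fallback check; the sketch below uses Python's built-in euc-kr codec, which covers KSC 5601. This is a crude heuristic of our own, not the system's actual identification routine.

```python
# Crude sketch of encoding identification: KSC 5601 pages decode with the
# euc-kr codec, while Unicode-only pages fail and are rejected. This is an
# illustrative heuristic, not the system's actual identification routine.
def decode_ksc5601(raw: bytes) -> str | None:
    try:
        return raw.decode("euc-kr")  # KSC 5601 (EUC-KR) content succeeds here
    except UnicodeDecodeError:
        return None                  # not processed by our system
```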
Third, the rule-based text classification module quickly detects non-harmful documents using a pattern matching algorithm, deciding that a document is non-harmful if it contains no harmful words. As described in Section 3.3, it uses the harmful-words dictionary and the homonymic-words dictionary, built from word bags extracted from approximately 30,000 documents by selecting harmful words with high term frequency.

Fourth, the learn-based text classification module grades harmful documents using the SVM learning algorithm. It consists of five units: feature selection, indexing, SVM preprocessing, learning-model generation, and rating. The feature selection unit calculates feature vectors from the result of the morphological analysis. The indexing unit assigns weights to features, because features have different degrees of importance in each web document. The SVM preprocessing unit normalizes the feature vectors and performs a grid search for optimal SVM parameters. The learning-model generation unit produces the learning model, and the rating unit assigns the harmful grades 1, 2, and 3 to the web documents being rated.

Finally, the harmful URL management module establishes and manages the harmful URL database, whose fields are listed in Section 3.5. This module also compresses URLs using hash functions and distributes them to other filtering tools based on harmful URL lists. For validity checks, it calls on the web-documents collection module to recollect the same URLs periodically.
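The URL compression step can be illustrated with a truncated cryptographic digest; SHA-256 here is our assumption, since the paper does not name the hash function.

```python
import hashlib

# Sketch of URL compression for distribution. SHA-256 truncated to 8 bytes is
# an assumption; the paper only says URLs are compressed with hash functions.
def compress_url(url: str, digest_bytes: int = 8) -> bytes:
    return hashlib.sha256(url.encode("utf-8")).digest()[:digest_bytes]

# A receiving filter tool can then test membership against the digests alone.
blocked = {compress_url("http://example.com/some/path.html")}
print(compress_url("http://example.com/some/path.html") in blocked)  # True
```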
5 Experimental Results
We experimented with two feature selection algorithms: IG and log-TF. The learning model was generated from 12,000 harmful web documents, which we graded after collecting them with web robots; 16,000 web documents were used for testing. The TFIDF indexing algorithm and the SVM machine learning algorithm were used throughout. Tables 1 and 2 show the results. For example, when we input 4,000 grade-2 web documents into our system, 3,103 of them are rated as grade 2 and 897 are rated as other grades. Both tables show better performance with the log-TF algorithm. With the IG algorithm, the number of support vectors is 9,115 and the SVM parameters C and g are 2 and 0.5; with the log-TF algorithm, the number of support vectors is 8,182 and the parameters are 0.8 and 0.125. The rule-based text filtering performs well in both cases. In the learn-based text filtering, the 2-grade test documents have relatively low accuracy, meaning that some 2-grade test documents were misclassified into other grades.

Table 1. IG performance (columns: input documents by grade; rows: assigned grade)

                  grade0 (4,000)  grade1 (4,000)  grade2 (4,000)  grade3 (4,000)
rated grade0      3,821           226             434             232
rated grade1      67              3,551           135             9
rated grade2      101             208             3,103           501
rated grade3      11              15              328             3,258
Accuracy          95.53%          88.78%          77.57%          81.45%
Overall accuracy: 85.83%

Table 2. log-TF performance (columns: input documents by grade; rows: assigned grade)

                  grade0 (4,000)  grade1 (4,000)  grade2 (4,000)  grade3 (4,000)
rated grade0      3,897           227             322             177
rated grade1      41              3,644           101             16
rated grade2      46              115             2,976           392
rated grade3      16              14              601             3,415
Accuracy          97.43%          91.10%          74.40%          85.38%
Overall accuracy: 87.07%
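The accuracy figures follow directly from the confusion counts; the short snippet below reproduces the Table 2 (log-TF) numbers from its diagonal.

```python
# Reproducing Table 2's accuracies from its diagonal counts: each grade had
# 4,000 test documents, and 16,000 documents were tested overall.
diagonal = [3897, 3644, 2976, 3415]       # correctly rated documents per grade
per_grade = [d / 4000 for d in diagonal]  # 0.9743, 0.9110, 0.7440, 0.8538
overall = sum(diagonal) / 16000           # 0.8707..., the 87.07% in Table 2
print(per_grade, overall)
```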
6 Conclusion
In this paper, we designed and implemented a hybrid text filtering system for adult web documents. We proposed a four-grade criterion for classifying harmful and non-harmful web documents. The system was evaluated using the SVM machine learning algorithm, the IG and log-TF feature selection algorithms, and the TFIDF indexing algorithm. We found that the log-TF algorithm outperforms the IG algorithm in our setting. However, this result is not conclusive, because there are other algorithms for learning, feature selection, and indexing. In the near future, we will run further experiments to compare their performance. Other feature selection algorithms will be adopted, including MI, CHI, and log-TF variants such as double log-TF, root-TF, and Okapi, and their results will be compared with those of IG and log-TF. We will also check whether the number of feature words affects the quality of the generated learning model.

Acknowledgements. The authors are deeply grateful to the anonymous reviewers for their valuable suggestions and comments on the first version of this paper.
References

1. Internet Content Rating Association, http://www.icra.org
2. SafeNet, http://www.safenet.ne.kr/english/intro/overview.html
3. Information Communication Ethics Committee, http://www.icec.or.kr
4. G. Siolas, "Support Vector Machines based on a semantic kernel for text categorization", IJCNN 2000, vol. 5, 2000, pp. 205-209
5. Support vector machine - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/SVM
6. Y. Yang and J. Pedersen, "A comparative study on feature selection in text categorization", Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 412-420
7. T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization", Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 143-151