
E-Newspaper Classification and Distribution Based on User Profiles and Thesaurus

Yousef ABUZIR and Fernand VANDAMME

Abstract-- Electronic mail offers the promise of rapid and primary communication of essential information. It is therefore important to use this technology correctly as a tool for sending an electronic version of the daily newspaper to subscribers, taking their profiles into account. In this paper, we describe how a thesaurus-based system can simplify this task. The system provides functions to index and retrieve a collection of E-mail messages on the basis of a thesaurus in order to create user profiles. The thesaurus is used not only for indexing and retrieving messages, but also for classifying E-Newspaper articles. By automatically indexing the daily newspaper articles with a thesaurus, the system can easily select the E-Newspaper articles that are relevant to a user profile and send them to the reader in an E-mail message.

Index Terms-- Information Retrieval, E-Newspaper Classification, E-Newspaper Delivery, E-Newspaper, User Profile, Thesaurus, Indexing.

Y. Abuzir is a PhD student at the University of Ghent, BIKIT, Plateaustraat 22, B-9000 Gent, Belgium (email: [email protected]). F. Vandamme is with the Laboratory of Applied Epistemology, Ghent University, Blandijnberg 2, B-9000 Gent, Belgium (email: [email protected]).

I. INTRODUCTION

With the advent of the World Wide Web, access to newspapers throughout the world has been revolutionized. For instance, international newspapers that may take weeks to arrive at the university, office or home can now be read over the Web the same day they are published. In addition, a far greater variety of newspapers is available over the Web than any one library can offer. Electronic newspapers accessible over the Web do have limitations, however: the entire print issue is rarely available, and advertisements, classified pages and items of local interest are often omitted; usually only the most recent issue (or the most recent week) can be accessed electronically, while previous weeks, months and years of the paper are usually not retained on the Web; and the trend seems to be that more and more digital newspapers can be accessed only through a paid subscription, similar to that of a print newspaper.

That we live in the age of information is clearly evident in newspapers as an information source. Nearly everywhere you can have newspapers delivered to your doorstep, each with hundreds of new articles every day. Nearly everywhere in the world you can access hundreds of daily newspapers through their Web sites. The NAA Web site [1] provides tens of hundreds of news articles of potential interest. We need information retrieval to help us organize newspaper articles so that we can spend more of our time reading articles of interest. Newspaper classification offers users an additional opportunity that may have a strong appeal to newspaper readers.

The most widespread computer application used today for person-to-person communication is electronic mail (E-mail). E-mail facilitates communication through its high speed, by reducing the number of telephone calls, and by providing possibilities for automatic documentation. It is a rich source of high-quality, up-to-date information. Users may use E-mail as an advanced tool for receiving newspaper articles of interest.

Roughly speaking, a thesaurus is used in information retrieval to improve the effectiveness of document retrieval [2]. One of the major aspects of the research within our project has been the development of the press thesaurus, which is used for indexing and classifying electronic newspaper articles and for extracting user profiles from E-mail messages.

In this paper we focus on a case study involving the classification of E-Newspaper articles. Newspaper articles are helpful for businesses, company managers and other decision-makers. A newspaper has two social functions to perform: education and entertainment. However, because of the sheer number of newspaper articles published, selecting the most interesting ones for a subscriber is a time-consuming task. A method of E-Newspaper article categorization is therefore useful for obtaining relevant information quickly.
We have been researching methods of automated newspaper-article classification into pre-defined categories, with the aim of developing an E-Newspaper article delivery service that sends the latest headline news in real time to users who are interested in particular topics.

The paper is divided into seven sections (I. Introduction, II. An Overview of TDOCS, III. Thesaurus Technology, IV. E-Newspaper Classification and Distribution, V. Related Work, VI. An Experiment and Evaluation, and VII. Conclusion and Future Work). Section II is a general introduction to the Thesaurus based DOcuments Classification System (TDOCS) Toolkit. Section III, on thesaurus technology, discusses the role of the thesaurus in information retrieval and describes how the thesaurus was created and used to support the classification and user profiling functions. Section IV describes user profile creation and the basic functionality of E-Newspaper article classification and delivery. Section V reviews related work. In Section VI, an experiment and evaluation, we test the classification and delivery functions using an example collection of E-Newspaper articles. The last section presents the conclusions.

II. AN OVERVIEW OF TDOCS

Electronic thesauri have been identified as strategic instruments for indexing electronic documents [3]. However, one of the main problems in using electronic thesauri remains their creation and maintenance [4]. In this section, we provide some basic information on a solution to this maintenance problem through the TDOCS-Toolkit (see footnote 1) and ThesWB. The concept of the TDOCS document management system [5], [6] is shown in Fig. 1. Electronic documents are imported into the system via the indexing process. This process automatically generates various index terms, such as keywords, concepts and relations, which are assigned to documents as a function of the document content.

[Fig. 1 diagram. Components: DocDB, Indexing, Thesaurus DB, Query, Retrieval, Index DB (keyword DB, concept DB), ThesWB maintenance. Labels: keywords, concepts]

Fig. 1: TDOCS system model: document indexing and retrieval.

The indexing and retrieval processes used in TDOCS are based in part on the hierarchical structure of the thesaurus. The thesaurus hierarchy is used to create associations between documents and concepts. The use of the thesaurus therefore implies that the associations can also be expressed by terms not explicitly present in the analyzed text. Such terms, which we call concepts, refer to broader terms of keywords that are found in the document. The indexing process usually identifies in documents not only the keywords, but also high-frequency terms that are not present in the thesaurus. In spite of their relevance as index terms, those terms are not marked as keywords because of their absence from the thesaurus. To incorporate those terms in the thesaurus, an additional iteration is needed, which is an update of the thesaurus with those high-frequency terms using the ThesWB Tool. This iteration needs to be followed by reindexing of the documents in order to maintain the consistency of the index and thesaurus databases.

Footnote 1: TDOCS is based on the IKEM method (pat. pend.).

III. THESAURUS TECHNOLOGY

A. Thesauri and Information Retrieval

Thesaurus technology has been used to assist in the retrieval of information for more than four decades [3]. Although free-text retrieval has become quite popular during the last decade, the use of thesauri as a component of information retrieval systems remains challenging. Thesauri are tools aimed at improving the effectiveness of information retrieval systems. First, they can assist term selection through the semantic roadmap they provide [7]; second, they can be used for automatic query expansion. Thesaurus technology represents an innovative approach to document management, especially in the field of document indexing and retrieval [2], [7]. Every second of every day, millions of people search for information stored in documents somewhere locally, on an intranet or on the Internet, in order to find the specific information they want at a specific time.

In general, indexing documents on the basis of a thesaurus means looking for matches between thesaurus terms and terms occurring in the document to be indexed. Each time there is a match, the thesaurus term is assigned to the document as a key term. Another characteristic feature of thesaurus-based indexing is that using concepts for indexing allows a document to be ranked as relevant to a query even if the query term itself does not occur in the text, but only a related term that denotes the same concept. Applied to document retrieval, this means that thesaurus-based indexing allows documents to be retrieved even if one or more of the words in a search string do not match any word or combination of words in the documents.
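Thesaurus-based indexing and concept-based retrieval, as described in this paper, can be illustrated with a minimal Python sketch. It is not the TDOCS implementation: the miniature thesaurus, the function names `index_document` and `retrieve`, and the simple word-boundary matching are all assumptions made for illustration. The point it shows is that a document mentioning only "giant slalom" is still retrieved by a query for the broader concept "sport".

```python
import re

# Hypothetical miniature thesaurus: each term maps to its broader term
# (None marks a root term). The real thesaurus content is not reproduced here.
BROADER = {
    "sport": None,
    "alpine skiing": "sport",
    "giant slalom": "alpine skiing",
    "education": None,
}

def index_document(text, broader=BROADER):
    """Return (keywords, concepts) for one document.

    keywords: thesaurus terms that literally occur in the text.
    concepts: broader terms reached by walking up the hierarchy.
    """
    lowered = text.lower()
    keywords = {
        term for term in broader
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }
    concepts = set()
    for term in keywords:
        parent = broader[term]
        while parent is not None:
            concepts.add(parent)
            parent = broader[parent]
    return keywords, concepts

def retrieve(query_term, index):
    """Return ids of documents whose keywords or concepts contain the query term."""
    query = query_term.lower()
    return sorted(
        doc_id for doc_id, (keywords, concepts) in index.items()
        if query in keywords or query in concepts
    )

docs = {
    "a1": "She won the giant slalom by two seconds.",
    "a2": "The ministry announced a new education budget.",
}
index = {doc_id: index_document(text) for doc_id, text in docs.items()}
```

Calling `retrieve("sport", index)` returns article `a1` although the word "sport" never occurs in its text, which is the behaviour attributed to concept-based indexing above.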

B. International Press Thesaurus (IPT) Construction

Various techniques and approaches have been used to construct thesauri. Building thesauri manually requires a lot of human labour from linguists or domain experts and is therefore expensive. For this reason, many researchers have attempted to construct thesauri automatically [8]. There are different approaches to constructing a thesaurus. The first approach, designing a thesaurus from a document collection, is a standard one [9], [10], [11], [12], [13], [14]. Here the idea is to use a collection of documents as the source for thesaurus construction. This assumes that a representative body of text is available. By applying statistical or linguistic procedures we can identify important terms as well as their significant relationships. The second approach is merging existing thesauri [15], [16], [17], [18]. It is appropriate when two or more thesauri for a given subject exist and need to be merged into a single unit. If a new thesaurus can indeed be served by merging two or more existing thesauri, then a merger is likely to be more efficient than producing the thesaurus from scratch. The simplest approach is to reuse existing online lexicographic databases, such as WordNet [19], [20] or Longman's subject codes [21], or to use a pre-arranged list of terms, such as a table of contents or a book index [22], to aid in the construction.

In constructing a system that has to carry out a fair amount of semantic understanding, a thesaurus is known to be a very useful tool. The International Press Thesaurus (IPT) was constructed as part of this research. We constructed the first draft version of the International Press Thesaurus using the ThesWB Tool [23]. The International Press Thesaurus has been designed especially for the classification of newspaper material. There are 17 root terms in the International Press Thesaurus. These roots comprise 1288 terms, which are hierarchically structured in lower levels. The terms are available in English and Dutch.

We used the TDOCS Thesaurus Manager to test the International Press Thesaurus. The TDOCS Thesaurus Manager can administer monolingual and multilingual thesauri. It offers an easy, user-friendly interface. Besides the standard features, it provides administrative functions that keep track of changes made to the thesaurus (who changed a term and when). An HTML file generator creates all the files that are necessary to put a thesaurus on the Internet. A wide range of reports and displays can be created with this thesaurus tool. It offers all the features necessary for thesaurus management, including support for notation. The TDOCS Thesaurus Manager is a state-of-the-art thesaurus management system, expressly designed to handle the creation, maintenance and printing of automated vocabularies. The thesaurus can be displayed in alphabetical, hierarchical (tree structure) and HTML form.
The semantic hierarchy encoded in a thesaurus is extremely important for automatic classification because the hierarchy can be exploited by the classifier using inheritance.

1) Choosing the Root Terms

For the International Press Thesaurus, the following seventeen terms have been defined as root terms (Table I). The thesaurus constructor tool ThesWB was used to create the first draft version of the International Press Thesaurus.

TABLE I
ROOT TERMS FOR INTERNATIONAL PRESS THESAURUS

English Term                     Dutch Term
Arts, Culture & Entertainment    Kunst, Cultuur en Ontspanning
Crime, Law & Justice             Misdaad, Wetgeving en Rechtspraak
Disasters & Accidents            Rampen en Ongevallen
Economy, Business & Finance      Economie, Zakenleven en Financiën
Education                        Onderwijs
Environmental Issues             Milieu aspecten
Health                           Gezondheid
Human Interest                   Human Interest
Labour                           Arbeid
Lifestyle & Leisure              Lifestyle en Ontspanning
Politics                         Politiek
Religion & Belief                Geloof en Godsdienst
Science & Technology             Wetenschap en Technologie
Social Issues                    Sociale Aspecten
Sport                            Sport
Unrest, Conflicts & War          Onrusten, Conflicten en Oorlogen
Weather                          Weer

2) Constructing the International Press Thesaurus

In this section we focus on building the International Press Thesaurus. To build this thesaurus we used a pre-arranged list of terms. A sample of this list is presented in Table II. The ThesWB tool was used to build the thesaurus automatically from this list. A reference number (RefNum) is a unique number used to identify each term. This number is used to extract the hierarchical relationships between the terms. Numeric functions as well as some heuristic rules are used to find the hierarchical structure between these terms. For example, the number 15001002 can be divided into three parts (Fig. 2). A change in the left part indicates a new root term. Broader terms related to root terms can be found in the middle part. The right part shows low-level terms.

TABLE II
A SAMPLE OF THE INPUT LIST USED TO CREATE INTERNATIONAL PRESS THESAURUS

RefNum    Code  English Term                                Dutch Term
15000000  SPO   Sport                                       sport
15001000        Aero and Aviation Sports                    zweef- en vliegsporten
15001001        Parachuting                                 parachutisme
15001002        Sky Diving                                  sky diving
15002000        Alpine Skiing                               alpine ski
15002001        downhill                                    afdaling
15002002        Giant slalom                                reuze slalom
15002003        Super G                                     super G
15002004        Slalom                                      slalom
15002005        Combined                                    combiné
15003000        American Football                           amerikaans voetbal
15003001        (US) National Football League (NFL)         ...
15003002        (North American) Canadian Football League   ...
…               ...                                         ...
…               ...                                         ...
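The way hierarchical relationships can be read off a RefNum can be sketched as follows. This is only an illustration under stated assumptions: the 2-3-3 digit split (root, middle, low-level part) is inferred from the Table II sample and is not given explicitly in the text, and the function names `split_refnum`, `parent_refnum` and `build_hierarchy` are hypothetical, not part of ThesWB.

```python
def split_refnum(refnum):
    """Split an 8-digit RefNum into (root, middle, low) parts.

    The 2-3-3 digit split is an assumption inferred from the Table II sample.
    """
    digits = f"{int(refnum):08d}"
    return digits[:2], digits[2:5], digits[5:]

def parent_refnum(refnum):
    """Return the RefNum of the broader term, or None for a root term."""
    root, middle, low = split_refnum(refnum)
    if low != "000":
        return root + middle + "000"   # low-level term: parent is its broader term
    if middle != "000":
        return root + "000" + "000"    # broader term: parent is its root term
    return None                        # root term: no parent

def build_hierarchy(rows):
    """Map each term to its broader term (BT) using only the RefNums."""
    by_refnum = {refnum: term for refnum, term in rows}
    hierarchy = {}
    for refnum, term in rows:
        parent = parent_refnum(refnum)
        hierarchy[term] = by_refnum.get(parent) if parent else None
    return hierarchy

sample = [
    ("15000000", "Sport"),
    ("15001000", "Aero and Aviation Sports"),
    ("15001001", "Parachuting"),
    ("15001002", "Sky Diving"),
]
hierarchy = build_hierarchy(sample)
```

With this sample, "Sky Diving" is placed under "Aero and Aviation Sports", which in turn is placed under the root term "Sport". The heuristic rules mentioned in the text are not modelled here.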

C. Resulting Thesaurus Structure

The International Press Thesaurus has been designed primarily for indexing and classifying the E-Newspaper articles generated for a daily newspaper. The thesaurus provides a core terminology in the field of the press. In addition to indexing and classification, it can be used as a terminological guide for the standardization of descriptors, as a search aid together with other subject-related vocabularies in the databases of the electronic newspaper publisher, or as a source of linguistic equivalences for translation purposes. The International Press Thesaurus contains three types of relationship, namely hierarchical relationships (BT/NT), equivalence relationships (U/UF) and associative relationships (RT), as well as scope notes (SN).

Fig. 2 Finding the relationships between terms using RefNum

Fig. 3 shows the current version of the International Press Thesaurus. The International Press Thesaurus can be updated and extended using the ThesWB tool. The thesaurus may be viewed in a hierarchical or an alphabetical display. The alphabetical version presents all preferred and non-preferred terms in a single alphabetical sequence. Next to the alphabetical list with relations, there is a graphical tree viewer. This viewer is mainly intended to show a term's path to the root, that is, the position of the term in relation to the root of the thesaurus, including all of its parent terms.

Fig. 3 International Press Thesaurus, a multilingual (English and Dutch) thesaurus. The first version contains 1288 terms.

IV. E-NEWSPAPER CLASSIFICATION AND DISTRIBUTION

For many years, newspapers have recorded the rise and fall of presidents, the dramatic and banal events of everyday life, and many of the small stories that contribute to the larger saga that shapes a town, a state, and an entire nation. Newspapers are also good places to look for current information or to get an overview of the day-to-day coverage of a particular topic.

The electronic newspaper needs some kind of structuring so that readers can get a better overview of the information it contains. Readers also need better ways of retrieving that information. The amount of effort required to index information is related to the amount of information stored. Among the major reasons for the difficulty of information retrieval are the lack of explicit semantic classification of (or linkages between) relevant information and the limits of conventional keyword-based search techniques (either full-text or index-based). In particular, the organization of new articles becomes more critical as the number of E-Newspaper articles in the system grows. A system that supports classification of the information helps in the task of selecting relevant articles.

The classification mechanism in our approach is based on an information retrieval thesaurus. The TDOCS Toolkit parses the new E-Newspaper articles and indexes them using a thesaurus. The Delivery System then matches the indexing result against the user profile database to select the right articles for each user. In the following sections, we describe how we use the International Press Thesaurus for classification.

A. User Profiles

A user profile is a collection of information that describes a user [13]. A user profile may be defined as a set of keywords that describe the information in which the user is interested. Similarly, some approaches base the user profile on the user's likes and dislikes [14]. With a user profile, the user can set certain criteria of preference and ask for E-Newspaper articles of specific interest to be presented, instead of being overwhelmed by everything the publication has to offer.
Profiling (the synthesis of a user profile according to the user's preferences and interests) can be done on the basis of user-defined criteria, in our case by collecting keywords from the E-mail messages that the user sends to the newspaper publisher. With thesaurus-based classification techniques, the selection of E-Newspaper articles adapts to the user's needs and interests according to his/her profile. The system seeks to match the indexed E-Newspaper articles against the user's interests in order to choose the subset of articles that best reflects those interests. The user's interests are represented as a profile, as described above.

1) User Profile Creation

The user profiles of the subscribers are constructed as shown in Fig. 4. First, the E-mail address of the sender and the other fields are extracted from the E-mail header lines. The body of the E-mail message is then extracted and parsed by the TDOCS system. As a result of the indexing process, all keywords from the message and the concepts derived from the thesaurus are stored in a database that reflects the user's interests. The thesaurus is thus used to create the user profile from the received E-mail messages. These messages include many common terms that appear in the thesaurus, and these terms reflect the interests of the users.

User profiles store the interests of a given user. They are generated automatically from the users' E-mail messages. A profile is hierarchically structured; it is not just a flat list of keywords. The user profile can be updated through the end-user system.

[Fig. 4 diagram. Components: E-Mail Messages, TDOCS Indexing, IPT DB, keyword DB, concept DB, User Profile DB (User, Keyword, Concept), ThesWB maintenance]

Fig. 4 User profile creation from the E-Mail messages of subscribers
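The profile creation steps (read the sender from the header, parse the body, keep the matching keywords and the concepts derived from the thesaurus) can be sketched with Python's standard `email` package. This is a sketch under assumptions rather than the TDOCS procedure: the small thesaurus, the function name `profile_from_email`, the example address and the restriction to single-part plain-text messages are all hypothetical.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical miniature thesaurus: term maps to its broader term (None for a root).
BROADER = {
    "sport": None,
    "alpine skiing": "sport",
    "slalom": "alpine skiing",
    "education": None,
}

def profile_from_email(raw_message, broader=BROADER):
    """Return (sender address, profile) for one plain-text E-mail message."""
    message = message_from_string(raw_message)
    # Header step: extract the sender's address from the "From" line.
    _, address = parseaddr(message.get("From", ""))
    # Body step: a single-part text message is assumed here.
    body = message.get_payload().lower()
    keywords = {
        term for term in broader
        if re.search(r"\b" + re.escape(term) + r"\b", body)
    }
    # Concepts are the broader terms of the keywords found in the body.
    concepts = set()
    for term in keywords:
        parent = broader[term]
        while parent is not None:
            concepts.add(parent)
            parent = broader[parent]
    return address, {"keywords": keywords, "concepts": concepts}

raw = (
    "From: reader@example.org\n"
    "Subject: subscription\n"
    "\n"
    "Please send me news about slalom and education.\n"
)
address, profile = profile_from_email(raw)
```

Keeping keywords and concepts in separate sets mirrors the separate keyword and concept stores shown in Fig. 4; a persistent database would replace the in-memory dictionary in practice.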

2) Updating User Profile

The end-user system allows users to browse and read the newspaper articles. It consists of a TDOCS system with an interface to the E-mail reader application. The end-user system can also be used to maintain user profiles. It does so by recording the user's interaction with the Document Search (DocSearch) tool through the concept tree window. An article is requested by clicking on a concept node or a low-level term in the tree. When the user performs this operation, it is taken as an indication of the user's interest in the articles that match the request. The keywords or concepts are stored in a log file. This file is sent to the newspaper server to update the subscriber's user profile whenever new keywords or concepts are detected.

A more fine-grained and specific approach is thus proposed, in which the user profile is updated using the newspaper articles that are classified by the system. The user shows an interest in articles through his/her browsing behavior. The system keeps the current version of the user profile until the user has revealed new information. Once this new information is available, the system updates the user profile. The user profile is therefore updated automatically on the basis of the user's browsing and content selection in the concept tree.

B. Classification

With the exponentially growing number of electronic documents, the task of organizing and retrieving documents of interest has become increasingly difficult. In our daily life, we are very often faced with classification tasks. The tasks vary from one domain to another. A simple example is keeping an address book. Organizing an address book involves classification in the sense that we not only write down phone numbers and names in alphabetical order, but also group certain numbers as "hotels", "services", "restaurants", etc. A more complex example is Web page classification. In manual Web page classification (e.g., Yahoo), the Web pages are placed in the subject hierarchy by hand. In contrast, automatic classification associates a Web page with the subject hierarchy automatically.

In this paper we use the thesaurus to classify E-Newspaper articles into concepts (the subject hierarchy). In our application, all E-Newspaper articles are classified into concepts. Our classification approach uses the International Press Thesaurus as the reference thesaurus. Each E-Newspaper article is automatically classified into the best-matching concept in the International Press Thesaurus. The TDOCS system assigns a weight to each concept. This weight can be used as the selection criterion for the best-matching concepts for an article.

To create the user profile that reflects the user's interests, we used TDOCS to parse and index the user's E-mail messages. Later, the system determines which E-Newspaper articles in the input set are relevant and which are not, by comparing the E-Newspaper articles with the list of keywords that describes the user (Fig. 5). TDOCS thus uses the user profile to decide which E-Newspaper articles are relevant and which are not.

[Fig. 5 diagram. Components: E-Newspaper articles, TDOCS Indexing, IPT DB, Index DB (keyword DB, concept DB), User, User Profile DB, Delivery System]

Fig. 5 An overview of the E-Newspaper article delivery service
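Selecting the best-matching concept by weight, and then judging relevance against a user profile, can be sketched as below. The text does not say how TDOCS computes concept weights, so the weighting used here (the number of keyword occurrences under each root concept) is purely an assumption, as are the tiny thesaurus fragment and the function names.

```python
import re
from collections import Counter

# Hypothetical thesaurus fragment: keyword maps to the root concept it belongs to.
ROOT_OF = {
    "slalom": "Sport",
    "football": "Sport",
    "school": "Education",
    "teacher": "Education",
}

def concept_weights(text, root_of=ROOT_OF):
    """Weight each root concept by the number of keyword occurrences under it."""
    weights = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        root = root_of.get(word)
        if root is not None:
            weights[root] += 1
    return weights

def classify(text, root_of=ROOT_OF):
    """Return the best-matching concept, or None when no thesaurus term occurs."""
    weights = concept_weights(text, root_of)
    if not weights:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(weights, key=lambda concept: (-weights[concept], concept))

def is_relevant(text, profile_concepts, root_of=ROOT_OF):
    """An article counts as relevant when its best concept is in the user profile."""
    return classify(text, root_of) in profile_concepts

article = "The teacher took the school team to a football match at another school."
best = classify(article)
```

In this example the article mentions one Sport keyword and three Education keywords, so it is classified under Education and is relevant only to profiles that contain that concept.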

C. Thesaurus Maintenance

There are several reasons why the contents of, and the relationships within, a thesaurus might change. These changes may not occur frequently, but they need to be managed. Updates to the thesaurus, such as changing the status of terms or adding new terms and relationships, are needed to keep up with the growth of the terminology. We used the TDOCS Batch Parser to index E-Newspaper articles on the basis of the International Press Thesaurus. The indexing process generates an update list that shows the terms that are not present in the thesaurus, together with their frequencies in the articles (see Fig. 6). This list can be used to update the thesaurus with the ThesWB tool. Fig. 7 shows the general procedure for updating the thesaurus. It is an iterative approach that consists of repeatedly parsing new E-Newspaper articles and inserting new terms into the thesaurus. The process consists of the following steps. We start by gathering the term list after parsing the articles. The second step eliminates noisy terms. The ThesWB tool can then be used to find the relationships among the new candidate terms and between them and the existing terms in the thesaurus.
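The first two steps of this update procedure (gathering frequent terms that are missing from the thesaurus, then removing noisy terms) can be sketched as follows. The sketch makes several assumptions that are not in the text: terms are single lower-case words, noise elimination is reduced to a stop list plus a minimum frequency, and the function name `update_list` and its `min_frequency` parameter are hypothetical. Relation extraction, which the text assigns to ThesWB, is not modelled.

```python
import re
from collections import Counter

def update_list(articles, thesaurus_terms, stop_list, min_frequency=2):
    """Return (term, frequency) pairs for frequent terms missing from the thesaurus."""
    counts = Counter()
    for text in articles:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    known = {term.lower() for term in thesaurus_terms}
    candidates = [
        (term, freq) for term, freq in counts.items()
        if term not in stop_list and term not in known and freq >= min_frequency
    ]
    # Most frequent first; alphabetical order breaks ties.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))

articles = [
    "The snowboard final followed the slalom.",
    "Snowboard fans filled the slalom stadium.",
]
candidates = update_list(articles, thesaurus_terms={"slalom"}, stop_list={"the"})
```

Here "snowboard" is proposed as a candidate because it occurs twice and is not yet in the thesaurus, whereas "slalom" is skipped because it is already known. After candidates are added to the thesaurus, the articles would be reindexed to keep the index and thesaurus consistent, as described in Section II.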

Fig. 6 User interface for Document Search in TDOCS, used to browse and query the indexing result. The update list is in the bottom right part of the window. It shows important terms that can be added to the thesaurus.

[Fig. 7 diagram. Components: E-Newspaper articles, Parsing, Stop List, Eliminating Noisy Terms, Relations Extraction, New List, Thesaurus DB]

Fig. 7 International Press Thesaurus updating process

V. RELATED WORK

One of the key issues in developing e-commerce applications is the problem of constructing accurate and comprehensive profiles of individual users that provide the most important information describing who the users are and how they behave [25], [26]. An overview of what user profiles and user behavior are, and how they can be implemented in various Internet-based information systems, can be found in [25]. User profiles have been discussed in many contexts [26], [27], [28], [29]. Furthermore, there are various implementations of user profiles in online newspaper personalization.

Today, many newspapers are available on the Internet. Kamba et al. [30] developed Anatagonomy, a system that composes a personalized newspaper on the Web. The system monitors user operations on the articles and reflects them in the user profile. The layout of the composed newspaper is based on the scores given to the articles, which reflect the degree to which each article matches the user profile. For example, articles with higher scores are placed nearer the top of the newspaper.

Online newspapers hold great promise for restructuring newspaper layout according to individual preferences. Online newspaper presentation can be personalized in terms of content, layout, media, advertisements and more. Rich possibilities include personalizing the number of articles per page, the inclusion and size of pictures, and even the shape and depth of the newspaper tree. With accurate predictions of a user's level of interest in unread articles, P-Tango [31] seeks to deliver a personalized front page, containing only the articles of highest interest, created individually every day for each user who accesses the P-Tango site. Billsus et al. [32] present an intelligent agent designed to compile a daily news program for individual users. Based on feedback from the user, the system automatically adapts to the user's preferences and interests. Miranda et al. [33] present a new filtering approach that combines the coverage and speed of content filters with the depth of collaborative filtering. They apply their approach to an online newspaper, an as yet untapped opportunity for filters useful to the widespread news-reading populace.

VI. AN EXPERIMENT AND EVALUATION

Words are often used as features in classification. One way to extend this is to use phrases [34] and trigrams [35]. However, such approaches are limited, since these features are generated only from the documents themselves. The use of a thesaurus is expected to solve this problem, since semantic knowledge, which is more general than keywords, can be used as classification features. A domain-specific thesaurus can be used to help identify important concepts in documents automatically.

A. Indexing and Searching

While there are many uses for thesauri, this work is aimed at exploring their application to E-Newspaper article classification. Electronic thesauri have been identified as a strategic instrument for indexing electronic documents [3]. The indexing and retrieval processes used by TDOCS are based on the hierarchical structure of the thesaurus. The thesaurus hierarchy is used to create associations between the E-Newspaper collection and concepts. The use of the thesaurus therefore implies that the associations can be expressed by terms not explicitly present in the analyzed text. Such terms, which we call concepts, refer to broader terms of keywords that are found in the E-Newspaper articles.

After the International Press Thesaurus was constructed, its performance was tested. Our article collection was indexed by the TDOCS thesaurus-based indexing engine, which automatically created keywords and concepts. As we are only interested in indexing the current edition of the daily newspaper, TDOCS indexes only the articles that arrive in the system as new articles.

B. Classification

A thesaurus is used to classify E-Newspaper articles; it can reflect the interests of a user as well as the main topic of an article. The thesaurus is used not only for indexing and retrieving messages, but also for classifying E-Newspaper articles. The TDOCS system is a tool that can be used to reduce the cognitive effort and the time required to classify E-Newspaper articles and to facilitate their retrieval. To achieve this, users only need to use the TDOCS Batch Parser and Document Search. The Toolkit parses the E-Newspaper articles and indexes them using a thesaurus. The user can then use the Document Search environment to retrieve the related articles.

C. Evaluating the Classification Methods

We evaluated this classification method using the TDOCS Toolkit and a collection of E-Newspaper articles. The collection consists of daily E-Newspaper articles. In our experiment we first constructed a thesaurus, the International Press Thesaurus, which was then used in the indexing process. The classification parameters in our experiment were the seventeen root terms that represent the main topics. To classify E-Newspaper articles according to these topics, which represent the main concepts, we parsed the articles with the TDOCS Toolkit. The TDOCS tool Document Search (DocSearch) was used to search and retrieve the indexed E-Newspaper articles. For example, the articles can be classified according to Education: all the articles related to Education can be retrieved by selecting Education from the "concepts" rectangle in Document Search.
The selected concept appears at the top of the right part of the window. Clicking on the down arrow next to the term lists all the E-Newspaper articles related to that concept (see Fig. 6). To browse these E-Newspaper articles, double-click on the name of an article. The TDOCS Document Viewer (DocViewer) displays the contents of the article in a special editor.

The TDOCS system described in this paper offers tools for monolingual as well as multilingual automatic indexing and classification by taking advantage of thesauri. The International Press Thesaurus is a multilingual thesaurus that contains a Dutch translation of the terms of the original thesaurus. Multilingual indexing requires that the bilingual thesaurus as well as the translation be processed. Our experiment, using Dutch E-Newspaper articles, has shown promising results.

The classification mechanism in our approach is based on an information retrieval thesaurus. The TDOCS Toolkit parses the E-Newspaper articles and indexes them using a thesaurus. The user or editor can then use the Document Search environment to retrieve the related E-Newspaper articles.

D. E-Newspaper Articles Distribution

The task of the delivery system is to obtain the collection of E-Newspaper articles to be delivered to a user. The ultimate goal of the system is to select the E-Newspaper articles that best reflect the user's interests. The articles first undergo pre-processing. We used TDOCS as the tool to index these articles. The indexing process used by TDOCS is based on the hierarchical structure of the thesaurus. The thesaurus hierarchy is used to create associations between E-Newspaper articles and the concepts in the user profile. A user profile consists of one or more topics. Topics represent the user's information and interests. We can use the result of indexing to classify the E-Newspaper articles according to the main root terms or other concepts that reflect the different topics. The user's interests are matched against these database results to select the articles that reflect them. We used a VC++ API application to map the interests of each user to the indexing result and obtain the articles that best reflect his/her interests. The system then delivers these articles to that user by electronic mail.

E. Performance

The proposed approach has already been put into practice. About 100 E-Newspaper articles were automatically indexed using the International Press Thesaurus. The results were evaluated manually. The test results showed that good indexing quality was achieved. TDOCS fetches each day's articles in XML format, translates them into plain text and indexes them. Currently we get the articles from a cache directory. We get about 100 articles a day, and the size of each article ranges from 1 to 14 KB. The batch process to index all the articles takes about one minute. It takes about 0.6 seconds to classify each article, compared with the 1-3 minutes that a human indexer needs.
We used TDOCS to classify 100 articles into 17 concepts. The results show that the thesaurus is effective in improving classification. TDOCS thesaurus-based indexing provides fully automatic creation of structured user profiles. It explores ways of incorporating users' interests into the parsing process to improve the results. The user profiles are structured as a concept hierarchy. The profiles are shown to converge and to reflect the actual interests of the user.

On the other side, the newspaper publisher can benefit in various ways from such an approach. It can increase its reader base with services that attract previously indifferent or unreachable readers. By collecting profile information and doing some marketing research, the organization can get a clearer picture of its readers' profiles, which allows it to offer more appropriate content. It can also make use of free advertisement and related exploitation opportunities: the electronic newspaper not only promotes its printed copy but can also serve as a new advertisement place and medium.

In this paper, we describe an approach that uses a thesaurus for E-Newspaper classification and distribution. Analysis of the experimental results shows excellent performance and good management of the electronic newspaper. The use of a thesaurus is effective for the classification problem.

VII. CONCLUSION AND FUTURE WORK

The growing domain of online newspapers presents a rich area that can benefit immensely from an automatic classification approach. We present an approach to the automated classification of E-Newspaper articles through a thesaurus, for the purpose of developing an E-Newspaper article delivery service. This paper describes an experimental trial to test the feasibility of a thesaurus-based E-Newspaper article classification and distribution system. The system seeks to identify keywords and concepts that characterize E-Newspaper articles and to classify these articles into pre-defined categories based on the hierarchical structure of the thesaurus. The explicit interests of the user in his/her profile enable the system to predict articles of interest.

The experimental results of our approach show that the use of a thesaurus helps improve the accuracy of the classification method. The International Press Thesaurus is useful and effective for indexing and retrieval of electronic newspaper articles. The concept hierarchies in the International Press Thesaurus were used to capture user profiles and to classify the content of E-Newspaper articles. Our experiment shows that the TDOCS indexing platform is not only an indexing tool but also a classification tool. The ThesWB tool is useful for updating or creating a thesaurus.

User-profiling technologies enable companies to gain a greater understanding of who their customers are and what they want. This understanding results in users getting more of what they want and getting to it faster. In return, companies get the loyalty of these users. This give-and-take is at the heart of the new, customer-centric economy.

Finding newspaper coverage of events can be complicated. Different indexing and searching resources are needed depending on the date and geographic location of the event and the perspective desired. Public indexing of many newspapers is a relatively recent phenomenon, complicating matters further. More effort is needed to focus on these issues.

To summarize, automatic E-Newspaper classification is an important problem nowadays. This paper proposes a thesaurus-based approach to classify and distribute E-Newspaper articles. The experimental results indicate that the approach is accurate.

ACKNOWLEDGMENT

I would like to thank my promoters Prof. Dr. Raymond Boute and Prof. Dr. Vandamme for introducing me to the exciting field of thesaurus technology and providing invaluable advice, support and encouragement. I am also grateful to my colleagues Dr. Vervenne and Dr. Kaczmarski for giving insightful comments and suggestions.

REFERENCES

[1] Newspaper Association of America (NAA), Facts About Newspapers, Internet site: http://www.naa.org/IndexList.cfm?SID=168.
[2] Y. Jing and W. B. Croft, "An association thesaurus for information retrieval", in RIAO 94 Conference Proceedings, pp. 146-160, New York, Oct. 1994.
[3] G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill Book Company, 1968.
[4] G. Grefenstette, Explorations in Automatic Thesaurus Discovery, Kluwer Academic Publishers, 1994.
[5] D. Vervenne, "Advanced document management through thesaurus-based indexing: the IKEM platform", R&D Report BIKIT, BIKIT – LAE, University of Ghent, 1998.
[6] Y. Abuzir, D. Vervenne, P. Kaczmarski and F. Vandamme, "TDOCS Thesauri for Concept-Based Document Retrieval", R&D Report BIKIT, BIKIT – LAE, University of Ghent, 1999.
[7] D. Soergel, "Multilingual thesauri in cross-language text and speech retrieval", in AAAI Symposium on Cross-Language Text and Speech Retrieval, American Association for Artificial Intelligence, March 1997. http://www.clis.umd.edu/dlrg/filter/sss/papers/
[8] J. Aitchison, A. Gilchrist and D. Bawden, Thesaurus Construction, 3rd ed., ASLIB, 1997.
[9] Y. Abuzir, D. Vervenne, P. Kaczmarski and F. Vandamme, "Extracting Semantic Relationships between Terms using IKEM Tool", KIM/KIT NEWS, Vol. 15, nr. 3, Nov. 2000. Available from: http://www.ewebtec.com/kim/docs/kimnl_nov_2000.pdf
[10] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[11] C. J. Crouch, "An approach to the automatic construction of global thesauri", Information Processing & Management, 26(5):629-640, 1990.
[12] Y. Qiu and H. P. Frei, "Concept Based Query Expansion", Proc. of the 16th Int. ACM SIGIR Conf. on R&D in Information Retrieval, Pittsburgh, SIGIR Forum, ACM Press, June 1993.
[13] G. Grefenstette, "Use of syntactic context to produce term association lists for text retrieval", in SIGIR'92, pp. 89-97, 1992.
[14] G. Ruge, "Experiments on linguistically based term associations", in RIAO'91, pp. 528-545, 1991.
[15] R. Rada and B. K. Martin, "Augmenting Thesauri for Information Systems", ACM Transactions on Office Information Systems, 5(4), 1987.
[16] H. Mili and R. Rada, "Merging Thesauri: Principles and Evaluation", IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(2):204-220, 1988.
[17] M. V. Mannino, S. B. Navathe and W. Effelsberg, "A Rule-Based Approach for Merging Generalization Hierarchies", Information Systems, 13(3):257-272, 1988.
[18] M. Sintichakis and P. Constantopoulos, "A Method for Monolingual Thesauri Merging", in Proc. 20th International Conference on Research and Development in Information Retrieval, ACM SIGIR, Philadelphia, PA, USA, July 1997.
[19] E. M. Voorhees, "Using WordNet to Disambiguate Word Sense for Text Retrieval", Proc. ACM SIGIR'93, Pittsburgh, pp. 171-180, 1993.
[20] G. A. Miller, "WordNet: a lexical database for English", Communications of the ACM, 38(11):39-41, 1995.
[21] E. D. Liddy and W. Paik, "Statistically-guided word sense disambiguation", in Proceedings of the AAAI Fall '92 Symposium on Probabilistic Approaches to Natural Language (Boston, Mass.), AAAI, Menlo Park, Calif., 1992.
[22] Y. Abuzir, "Constructing FOLIA-Medical Thesaurus using ThesWB Tool", R&D Report BIKIT, BIKIT – LAE, University of Ghent, 2001.
[23] Y. Abuzir, "ThesWB: Work Bench Tool for Automatic Thesaurus Construction", R&D Report BIKIT, BIKIT – LAE, University of Ghent, 2001.
[24] D. Jovanovic, "A Survey of Internet Oriented Information Systems Based on Customer Profile and Customer Behavior", SSGRR 2001, L'Aquila, Italy, Aug. 6-12, 2001.
[25] B. Krulwich, "Lifestyle Finder: Intelligent User Profiling Using Large-Scale Demographic Data", AI Magazine, Summer 1997.
[26] Q. Lu, M. Eichstaedt and D. Ford, "Efficient Profile Matching for Large Scale Web Casting", Computer Networks and ISDN Systems, Vol. 30, 1998.
[27] Y. Abuzir, D. Vervenne, P. Kaczmarski and F. Vandamme, "E-mail messages classification and user profiling by the use of semantic thesauri", Proceedings of CIDE2001 - 4e Colloque International sur le Document Électronique, Toulouse, France, 24-26 October 2001.
[28] S. Laine-Cruzel, T. Lafouge, J. P. Lardy and N. B. Abdallah, "Improving Information Retrieval by Combining User Profile and Document Segmentation", Information Processing and Management, Vol. 32, No. 3, pp. 305-315, May 1996.
[29] P. K. Chan, "A Non-Invasive Learning Approach to Building Web User Profiles", in Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), August 1999.
[30] T. Kamba, H. Sakagami and Y. Koseki, "ANATAGONOMY: A Personalized Newspaper on the WWW", International Journal of Human-Computer Studies, Special Issue on Innovative Applications of the World Wide Web (to appear in March 1997).
[31] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes and M. Sartin, "Combining Content-Based and Collaborative Filters in an Online Newspaper", ACM SIGIR Workshop on Recommender Systems, Berkeley, CA, August 19, 1999.
[32] D. Billsus and M. Pazzani, "A Hybrid User Model for News Story Classification", Proceedings of the Seventh International Conference on User Modeling (UM '99), Banff, Canada, June 20-24, 1999.
[33] T. Miranda, P. Murnikov, D. Netes and M. Sartin, "Collaborative Filtering of an Online Newspaper", Major Qualifying Project MQPMLC-0002, Computer Science Department, Worcester Polytechnic Institute, Spring 1999.
[34] D. D. Lewis, "An evaluation of phrasal and clustered representations on a text categorization task", in Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 37-50, 1992.
[35] W. W. Cohen and Y. Singer, "Context-sensitive learning methods for text categorization", Proc. of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pp. 307-315, 1996.

Yousef S. Abuzir has been a PhD student at the University of Ghent, Belgium, since 1997. He graduated in Computer Engineering from the Middle East Technical University, Turkey. Since 1992 he has been an instructor in the Department of Computer and Information Science at ALQUDS OPEN University, Palestine. His areas of interest comprise knowledge representation and acquisition, information retrieval and indexing, automatic thesaurus construction and expert systems.
