
InLinx for Document Classification, Sharing and Recommendation

Clara Bighini, Antonella Carbonaro*, Giorgio Casadei
Department of Computer Science, University of Bologna
Mura Anteo Zamboni 7, I-40127 Bologna, Italy
tel +39 0547 642830, fax +39 0547 610054
e_mail: [email protected]

Abstract
This paper proposes a hybrid recommender system, InLinx, that combines content analysis with the development of virtual clusters of students and of didactical sources, providing facilities to use the huge amount of digital information according to the student’s personal requirements and interests. The paper proposes novel methods for information management, with special focus on the development of new algorithms and intelligent applications for personalized information sharing, filtering and retrieval. InLinx helps the student to classify domain-specific information found on the Web and saved as bookmarks, to recommend these documents to other students with similar interests, and to periodically notify the student of new, potentially interesting documents.

INTRODUCTION
As web technologies grow and mature, learning through the World Wide Web will become increasingly popular, particularly in distance education systems. Teachers can distribute lecture notes and other required materials via the Web, so learners get the opportunity to use learning materials freely and autonomously, collecting other related materials on the Web as well. As the use of digital libraries increases, users expect more than being able to filter, retrieve and refer to library materials. They prefer personalized access to library materials, which they can customize according to their personal requirements and interests. Therefore, new tools should allow learners to integrate their selections from digital information sources to create their own reference sources. Moreover, to provide intelligent support for active learning, it is also necessary to provide techniques to locate suitable materials. These mechanisms should extend beyond the traditional facilities of browsing and searching, supporting active learning by integrating the user’s personal library and a remote digital library. Users will be able to carry out learning activities when browsing both the personal and the remote digital libraries; that is, they can build personalized views on those materials while turning them into an accessible reference collection. Furthermore, filtering and retrieval tools based on the user’s requirements and interests should be developed to improve the usage of these materials.

Information filtering systems can help learners by eliminating irrelevant information, operating as mediators between the sources of information and the learners. Personalized filtering should be based not only on long-term interests but also on short-term requirements.

Web browser bookmarks predominate as the current approach to managing URLs. Hierarchical bookmarking schemes represent the state of the art in the most widely available Web browsers; however, these schemes exhibit some significant deficiencies that hinder effective organization of and access to URLs. First, the hierarchical folder organization forces users to think in terms of a neatly decomposable structure consisting of disjoint clusters of related URLs. However, a single piece of information is often relevant in multiple ways, and thus is not easily categorized within a single folder. That is, when it comes time to retrieve a relevant URL, the predefined hierarchical structure may not be appropriate for the current task or context. Secondly, current browser bookmarking schemes are oriented toward a single user, and provide few facilities for sharing URLs within a workgroup. Furthermore, for folders with large numbers of URLs, finding the most useful information often involves scanning a long list of URLs. In these cases, ordering URLs based on previous experience can facilitate rapid location of high-quality information. Finally, with hierarchical schemes, navigational access to information can be tedious and frustrating when information is nested several layers deep.
Viewing URLs contained within multiple folders simultaneously involves navigating the tree structure at multiple levels and consolidating the results within a single view. From our point of view, the most noteworthy shortcoming of this scheme is the lack of immediate portability and of visibility from different locations.

In this paper we introduce InLinx, a Web-based hybrid recommender system that combines content analysis with the development of virtual clusters of students and of didactical sources. It provides facilities to use the huge amount of digital information according to the student’s personal requirements and interests, proposing novel methods for information management, with special focus on the development of new algorithms and intelligent applications for personalized information sharing, filtering and retrieval. InLinx aims to combine content-based and collaborative information filtering, helping the student to classify domain-specific information found on the Web and saved as bookmarks, to recommend these documents to other students with similar interests, and to periodically notify the student of new, potentially interesting documents. InLinx brings together three approaches:
• a content-based information filtering system for bookmark classification,
• a collaborative recommender system for bookmark sharing,
• a content-based recommender system for recommending new sources.

Proceedings of the 3rd IEEE International Conference on Advanced Learning Technologies (ICALT’03) 0-7695-1967-9/03 $17.00 © 2003 IEEE

DOCUMENT CLASSIFICATION
InLinx is a system developed to support the student in classifying documents (bookmarks) retrieved from the World Wide Web, to automatically recommend them to other users of the system with similar interests, and to periodically notify the student of new, potentially interesting documents. To fully exploit the potential of the Web in the instructional field and to facilitate a student-based approach to the provided educational material, modern Web-based learning environments must be able to respond dynamically to students’ different and personal learning styles, goals, knowledge backgrounds and abilities. To this aim, User Modeling (UM) techniques have emerged as an important technology to adapt and personalize the instructional material to different students on the basis of personal interactions. The user model represents the system’s belief about each learner’s knowledge and is updated dynamically, based on the student’s interactions with the system. The selected information can be structured using ad hoc criteria and delivered to interested users to guarantee a personalized and technologically innovative service to the students.

A search for documents (e.g. multimedia educational material sources such as Web pages) uses queries containing words or describing concepts that are desired in the returned documents. Most content retrieval methodologies use some type of similarity score to match a query describing the content, and then present the user with a ranked list of suggestions. These methodologies address the information filtering (IF) problem [Belkin and Croft 1992]. In our system each document is represented as a vector whose components are values reflecting the relevance of the different arguments considered in the resource. Suppose that we have n arguments by which the resources can be described; more formally, let $A = \{a_1, a_2, \ldots, a_n\}$ be the finite set of arguments. The resource profile is then described as a vector in $\mathbb{R}^n$.

To improve information filtering performance we provide students with ways of finding morphological variants of search terms. A question that arises here is how to remove non-informative and irrelevant terms, both to improve the quality of document clustering and to reduce the clustering cost. We have therefore implemented and tested an approach to stemming, in particular the Porter stemmer, which is more compact than other approaches and seems, on the basis of experimentation, to give retrieval performance comparable to that of larger algorithms [Porter 1980]. The Porter algorithm is an affix removal stemmer and consists of a set of condition/action rules that specify, for example, how to remove plurals from terms.

To filter information resources according to student interests we must have a common representation for both the users and the resources. This knowledge representation model must be expressive enough to describe the information content concisely and meaningfully [Kurki et al. 1999]. In the proposed system we use the vector space model, a popular IF model for textual material [Salton 1989]. We chose this model for document representation because it has been widely tested and is general enough to support other computational requirements of the filtering environment. The vector space model also makes it possible to update the user profile according to the information resources consulted.

Once the student has registered an interesting page (Fig. 1), InLinx suggests a classification among some predefined categories, based on the document’s content and the student’s profile. The user can then confirm the suggestion or change the classification to another category (Fig. 2), and InLinx updates the user profile (Fig. 3). Fig. 4 shows the personal bookmarks page displayed when the user logs in to the system: the bookmarks are organized in the categories automatically proposed or chosen during the registration of an interesting page, and the student can monitor the recommendations.
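The classification flow described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions, not the InLinx implementation: the argument set `ARGUMENTS`, the category prototypes in `CATEGORIES`, the way the document and profile vectors are blended (`profile_weight`) and the moving-average profile update (`rate`) are all hypothetical. The function `stem_plural` covers only the plural-removal rules (step 1a) of the Porter algorithm, not the full stemmer.

```python
import math
import re
from collections import Counter

# Hypothetical argument set A and category prototypes (illustrative values only).
ARGUMENTS = ["agent", "learn", "gene", "network", "protein"]
CATEGORIES = {
    "Machine Learning": [0.2, 1.0, 0.0, 0.6, 0.0],
    "Bio Computing": [0.0, 0.2, 1.0, 0.3, 1.0],
}


def stem_plural(word):
    """Step 1a of the Porter stemmer: condition/action rules that strip plurals."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


def to_vector(text):
    """Represent a document as a vector in R^n, one component per argument."""
    tokens = [stem_plural(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    return [float(counts[a]) for a in ARGUMENTS]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def suggest_category(doc_vec, profile, profile_weight=0.5):
    """Suggest the category closest to the document blended with the profile (assumed rule)."""
    blended = [d + profile_weight * p for d, p in zip(doc_vec, profile)]
    return max(CATEGORIES, key=lambda name: cosine(blended, CATEGORIES[name]))


def update_profile(profile, doc_vec, rate=0.1):
    """Move the user profile towards a confirmed document (assumed moving average)."""
    return [(1 - rate) * p + rate * d for p, d in zip(profile, doc_vec)]


doc = to_vector("Genes and proteins interact in regulatory networks")
profile = [0.0] * len(ARGUMENTS)
category = suggest_category(doc, profile)
profile = update_profile(profile, doc)
```

Because users and resources share the same vector representation, the update step is a simple component-wise operation; any other update rule over the same vectors would fit the model equally well.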

DOCUMENT SHARING
During document classification, personal documents are categorized, according to their contents, as belonging to some main topic, for example Artificial Intelligence, Bio Computing or Machine Learning. In the second phase, collection personalization is designed to provide facilities such as personalized retrieval and personalized filtering based on both the user’s interests and the user’s working context. The personal documents gathered during material personalization are used to capture the user’s interests and working context for collection personalization.

In particular, the system is intended to support the activities of students interested in courseware such as Bio Computing, Internet Programming or Machine Learning. For this kind of course the student must be actively involved in acquiring didactical material that integrates the lecture notes specified and released by the teacher. The level of integration basically depends both on the student’s prior knowledge of the particular subject and on the level of comprehension the student wants to acquire. Furthermore, for the mentioned courses it is necessary to continuously update the acquired knowledge by integrating recent information available from remote digital libraries, for example scientific journals and high-level conferences. InLinx checks for newly classified bookmarks and recommends them to other users, who are notified of the recommendations when they log in to the system and can either accept or reject them (Fig. 5).
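A minimal sketch of this sharing step is given below, assuming that "similar interests" is measured by cosine similarity between user profile vectors and that a fixed threshold decides who is notified. The `User` class, the `threshold` value and the function names are hypothetical and serve only to illustrate the accept/reject cycle described above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    profile: list                                 # interest vector over the argument set
    bookmarks: set = field(default_factory=set)
    pending: list = field(default_factory=list)   # recommendations awaiting a decision


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def share_bookmark(owner, url, users, threshold=0.7):
    """Queue a newly classified bookmark for every other user with a similar profile."""
    owner.bookmarks.add(url)
    notified = []
    for other in users:
        if other is owner or url in other.bookmarks or url in other.pending:
            continue
        if cosine(owner.profile, other.profile) >= threshold:
            other.pending.append(url)
            notified.append(other.name)
    return notified


def review_on_login(user, accept):
    """At login, keep the recommendations the user accepts and drop the rest."""
    for url in user.pending:
        if accept(url):
            user.bookmarks.add(url)
    user.pending = []


alice = User("alice", [1.0, 0.0, 1.0])
bob = User("bob", [0.9, 0.1, 0.8])
carol = User("carol", [0.0, 1.0, 0.0])
notified = share_bookmark(alice, "http://example.org/paper", [alice, bob, carol])
review_on_login(bob, accept=lambda url: True)
```

Here only `bob` is notified, since his profile points in nearly the same direction as `alice`'s, while `carol`'s profile is orthogonal to it.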

Fig. 1. Bookmark classification (1 of 3)

Fig. 2. Bookmark classification (2 of 3)

Fig. 3. Bookmark classification (3 of 3)


Fig. 4. Bookmarks home page

Fig. 5. Bookmarks recommendation

Fig. 6. Journal content

Fig. 7. Journal papers

DOCUMENT RECOMMENDATION
To fully exploit the potential of IF in the instructional field and to facilitate a student-centred approach to the provided educational material, modern learning environments must be able to build automatically the vector space model, that is, the IF model in use. To this aim, we have introduced a pre-processing step in the InLinx system. This processing also allows the system to work automatically in any knowledge domain, without the need to specify the set of meaningful terms for each argument and their relevance values. During this step, the system receives as input different documents that correspond to the potentially interesting contents of the current course, for example the contents of an international journal (Fig. 6). All the documents (in our example, the papers of the “Artificial Intelligence Review” journal) are lexically analysed to obtain the attributes that describe the input. From this educational material the system extracts a set of meaningful terms, which are the attributes used in the subsequent filtering phase. The terms are characterized by their frequency in the document, by the number of documents that contain the word and by their specificity. Furthermore, it is necessary to consider the role of the term inside the document, for example whether it appears in the title or represents a keyword.

The advantage of introducing an automatic vector space model definition module into our learning system has been twofold. On the one hand, this module has allowed greater personalization of the interaction during the use of the system. On the other hand, the instructor may be relieved of two tasks:
• designing and proposing to students with different learning goals lectures and documents for each new concept of the domain knowledge to be learnt, structured in the form required by the vector space model, that is, specifying for each entity-argument pair a value representing the affinity between the entity and the argument,
• managing and evaluating each new resource that the Web environment could continuously supply to the system.

To periodically notify the student of new documents, InLinx executes an ad hoc classification with respect to the user prototype to effectively manage the current document set of each student.
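The term characterization used in the pre-processing step can be illustrated with a standard TF-IDF weighting plus a title boost. This is a sketch under assumptions, not the weighting actually used by InLinx: inverse document frequency stands in for "specificity", and `TITLE_BOOST`, `min_weight` and the function names are hypothetical.

```python
import math
import re
from collections import Counter

TITLE_BOOST = 2.0  # assumed extra weight for terms that also appear in the title


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def term_weights(docs):
    """docs is a list of (title, body) pairs; returns one {term: weight} dict per document."""
    counts_per_doc = [Counter(tokenize(title + " " + body)) for title, body in docs]
    doc_freq = Counter()                      # number of documents containing each term
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    weights = []
    for (title, _), counts in zip(docs, counts_per_doc):
        title_terms = set(tokenize(title))
        total = sum(counts.values())
        doc_weights = {}
        for term, freq in counts.items():
            idf = math.log(n_docs / doc_freq[term])   # 0 for terms found in every document
            boost = TITLE_BOOST if term in title_terms else 1.0
            doc_weights[term] = (freq / total) * idf * boost
        weights.append(doc_weights)
    return weights


def select_attributes(docs, min_weight=0.05):
    """Keep the terms whose weight reaches min_weight in at least one document."""
    selected = set()
    for doc_weights in term_weights(docs):
        selected.update(t for t, w in doc_weights.items() if w >= min_weight)
    return sorted(selected)


papers = [
    ("Neural networks", "neural networks learn from data"),
    ("Genetic algorithms", "genetic algorithms evolve from data"),
]
weights = term_weights(papers)
attributes = select_attributes(papers)
```

Terms that occur in every input document, such as "data" in this toy collection, receive a zero weight and are dropped, while title terms are favoured over terms that appear only in the body.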

CONCLUSION
This paper presents the main characteristics and functionalities of InLinx, a hybrid recommender system that combines content analysis with the development of virtual clusters of students and of didactical sources, providing facilities to use the huge amount of digital information according to the student’s personal requirements and interests. The system satisfies the three main requirements of specialization, adaptation and exploration. In particular, we have proposed novel methods for information management, with special focus on the development of new algorithms for information sharing, filtering and recommendation. InLinx helps the student to classify domain-specific information found on the Web and saved as bookmarks, to recommend these documents to other students with similar interests, and to periodically notify the student of new, potentially interesting documents.

References

1. Belkin, N. J. and Croft, W. B., 1992, “Information Filtering and Information Retrieval: Two Sides of the Same Coin”, Commun. ACM, 35(12): 29-38.
2. Carbonaro, A., 2002, “User and Resource Models Definition and Adaptation to Personalize the Multimedia Instructional Material in a Web-Based Distance Learning System”, in Proc. of 2002 SCS Euromedia Conference (ed: M. Roccetti), The Society for Computer Simulation International, Delft, The Netherlands, pp. 23-27.
3. Carbonaro, A., 2002, “An Information Filtering Approach to Personalize the Multimedia Instructional Material in a Web-Based Multimedia Distance Learning System”, International Conference On Simulation and Multimedia in Engineering Education (ICSEE '02), January 23-27, Texas.
4. Carbonaro, A., 2001, “A Comprehensive Approach to the Personalized Information Filtering Problem”, Proceedings of 2001 SCS Euromedia Conference, The Society for Computer Simulation International, Spain.
5. Carbonaro, A., Roccetti, M. and Salomoni, P., 2001, “A Web-Based Didactical Environment for Learning Prolog Programming Abilities”, 2001 International Conference on Intelligent Multimedia and Distance Education, USA.
6. Delgado, J. and Ishii, N., 1999, “Online learning of User Preferences in Recommender Systems”, Proc. of the IJCAI-99 Workshop on ML for IF.
7. Delgado, J., 2000, “Agent-based Information Filtering and Recommender Systems on the Internet”, PhD Thesis, Nagoya Institute of Technology.
8. Edmunds, A. and Morris, A., 2000, “The problem of Information Overload in Business Organizations: a Review of the Literature”, International Journal of Information Management, 20: 17-28.
9. Hanani, U., Shapira, B. and Shoval, P., 2001, “Information Filtering: Overview of Issues, Research and Systems”, User Modeling and User-Adapted Interaction, 11: 203-259.
10. Krulwich, B. and Burkey, C., 1997, “The InfoFinder Agent: Learning User Interests through Heuristic Phrase Extraction”, IEEE Expert, September/October 1997.
11. Kurki, T., Jokela, S. and Sulonen, R., 1999, “Agents in Delivering Personalized Content Based on Semantic Metadata”, Proc. of 1999 AAAI Symposium Workshop on Intelligent Agents in Cyberspace, Stanford, USA.
12. Lang, K., 1995, “NewsWeeder: an Adaptive Multi-user Text Filter”, Tech. Rep., School of Computer Science, Carnegie Mellon Univ., PA.
13. Lee, J., 1999, “Interactive learning with a Web-based digital library system”, prepared for the Ninth DELOS Workshop on Digital Libraries for Distance Learning. http://courses.cs.vt.edu/~cs3604/DELOS.html
14. Mostafa, J., Mukhopadhyay, S., Lam, W. and Palaka, M., 1997, “A Multilevel Approach to Intelligent Information Filtering: Model, System, and Evaluation”, ACM Transactions on Information Systems, 15(4): 368-399.
15. Porter, M. F., 1980, “An Algorithm for Suffix Stripping”, Program, 14(3): 130-137.
16. Resnick, P. and Varian, H., 1997, “Recommender Systems”, Introduction to special section of Communications of the ACM, March, vol. 40(3).
17. Salton, G., 1989, “Automatic Text Processing: the Transformation, Analysis and Retrieval of Information by Computer”, Addison-Wesley, Reading, Mass.
18. Savia, E., 1999, “Mathematical Methods for a Personalized Information Service”, http://smartpush.cs.hut.fi web page.
19. Yan, T. and Garcia-Molina, H., 1995, “Sift – A Tool for Wide-area Information Dissemination”, in Proc. of the 1995 USENIX Technical Conference, USENIX Assoc., Berkeley, Calif., 177-186.

