A Specialization in Information and Knowledge Management Systems for the Undergraduate Computer Science Curriculum

S. Argamon, N. Goharian, D. Grossman, O. Frieder
Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60616
{argamon,goharian,grossman,frieder}@iit.edu

Abstract. We describe our progress extending the undergraduate Computer Science (CS) curriculum to include a deep understanding of techniques for information and knowledge management systems (IKMS). In a novel five-course sequence, students build and work with techniques for data mining, information retrieval, and text analysis, and develop a large-scale IKMS project. We teach in a hands-on lab setting where students use tools they have built, performing experiments that could extend the field. Hence, undergraduates gain firsthand knowledge of performing CS research using scientific methods. In addition, we use a rigorous set of evaluation criteria developed in our Psychology Institute to evaluate how well students learn using our approaches. Ultimately, we believe that this specialization warrants inclusion as an option in the standard undergraduate CS curriculum.

1 INTRODUCTION

Our objective is to increase the real-world relevance of the undergraduate CS curriculum by including a specialization option in Information and Knowledge Management Systems (IKMS). The ever-increasing availability of data of all types in corporate settings and on the web makes information and knowledge management a key need of today’s information society. The core areas of this emerging subdiscipline are text analysis, data mining, information retrieval, and database systems, all of which are connected through a need for understanding the fundamentals of algorithms, statistics, machine learning, and human-computer interaction. The CS specialization we have designed consists of five upper-level undergraduate courses in total (Fig. 1). Theory and applications of the four core areas required for IKMS are taught in Database Organization, a standard course in relational databases; Information Retrieval and Data Mining, two successful new courses we have developed over the last three years [12, 11]; and Intelligent Text Analysis, our newly developed course in applied natural language processing. The fifth, capstone course, Information and Knowledge Management Systems, is a project course in which student groups are guided in building systems for realistic knowledge management applications, combining development of new software systems with use of existing technologies. This course will integrate what students have learned in previous courses, providing systems development and project management experience.

N. Raju Institute of Psychology Illinois Institute of Technology Chicago, IL 60616 [email protected]

At the end of the course sequence, students will know fundamental algorithms and existing state-of-the-art in search engines, intranets, data mining, customer relationship management, information extraction, summarization, and text categorization. Moreover, students will have worked on a group project over the course of an entire semester, gaining experience in team-based software development. Such a project enables larger implementation achievements that deepen students’ understanding of the algorithms, scalability issues, implementation trade-offs, and software project management. In addition, such a group project is a critical experience that future employers and many graduate schools specifically look for. As in our previous educational innovation work, our curriculum enhancements will be evaluated using the latest proven techniques from psychology. We realize that our curriculum extensions will have to be continuously updated because the subfields of text analysis, information retrieval, and data mining are changing rapidly. Due to the novelty of the area, no textbook for information and knowledge management systems yet exists at either the graduate or undergraduate level, and no undergraduate text exists covering applied natural language processing methods. However, the topics are now mature enough that fundamentals can be taught. Essential theory and algorithms used in most current text analysis systems, such as part-of-speech tagging, probabilistic models, parsing algorithms, information extraction methods, and text classification and summarization are fairly well understood and likely to be of continuing use, although new and improved algorithms are constantly being developed. Similarly, we have found the principles of data mining and information retrieval sufficiently stable in our previous curriculum development efforts. 
An undergraduate who knows fundamental concepts and algorithms is well prepared to stay current in the field by learning future advancements and refinements.

2 BACKGROUND

2.1 Information and Knowledge Management Systems

Information and Knowledge Management (IKM) is a broad term, covering a wide range of technologies as well as organizational methods in business. As a subdiscipline of

Junior year
  Fall: Systems Programming; Information Retrieval*; Database Organization*; Science elective; Humanities elective
  Spring: Programming Languages & Compilers; Probability and Statistics; Data Mining*; IPRO I; Free elective

Senior year
  Fall: Intro. to Operating Systems; Software Engineering; Intelligent Text Analysis*; Technical Writing; IPRO II; Free elective
  Spring: Info. & Knowledge Management Systems*; Computers and Society; Social science elective; Free elective; Free elective

Figure 1. Suggested course of study for junior and senior years including Specialization in Information and Knowledge Management Systems. Courses specific to the specialization are shown in italics in the original and are marked here with an asterisk.

computer science, we focus on knowledge management systems, which are integrated enterprise systems with the goals of “facilitat[ing] development of synergy between the data and information processing capacity of information technologies and the innovative and creative capacity of human beings” [22]. The key role of technology in IKM, therefore, is to enable effective use of knowledge encoded in very large and complex sets of data sources, including at times the human resources of an organization. Key algorithmic topics involved in IKM systems are database systems, information retrieval, data mining, and text analysis, together with human-factors issues such as groupware design and organizational information flow analysis. Research in these topics is actively pursued and presented at annual conferences such as ACM CIKM, ACM SIGKDD, ACM SIGIR, and IEEE Int’l Conf. on NLP and Knowledge Engineering.

2.2 Intelligent Text Analysis

Intelligent text analysis is about processing unstructured natural language text collections to extract useful information. Such analysis is a natural part of many conceivable knowledge management applications. For example, consider the case of a database of medical research articles. Text analysis methods such as information extraction might be used to extract lists of symptoms associated with particular diseases, and then data mining methods could be used to connect diseases with similar symptoms, suggesting new research hypotheses. Similar potential applications exist anywhere large collections of meaningful texts are found. Some text analysis methods depend on specific models of a domain, while others are domain-independent, as in much text summarization work. Text analysis systems rely on extraction of natural language structure, as in part-of-speech tagging and various parsing techniques. Although linguistic information must be processed at many different levels using various methods, unifying principles are present in

many extant approaches to different text processing tasks, such as hidden Markov models, instance-based learning, lexical similarity measures, and semantic frame models. Clustering and machine learning techniques are also used. The key text analysis task for knowledge management systems is information extraction (IE) [9], which attempts to extract meaningful entities and relations from a text. A key component of IE is chunking, or shallow parsing, which identifies relevant subsequences of a text such as noun phrases, verbal groups, or prepositional phrases. This process relies on lower-level processing of text to determine each word’s part of speech, which depends critically on context (e.g., ‘book’ is likely a noun near the word ‘read’, but likely a verb near ‘ticket’). Based on chunking, then, semantic analysis using general lexical knowledge such as WordNet [23] and domain-dependent knowledge can be used for named-entity recognition, in which noun phrases are identified as companies, people, organizational roles, and so forth. This process can be aided by syntactic analysis as provided by various parsing methods. A second important text analysis task is text summarization [15], in which documents are automatically summarized for more efficient human consumption. The main issue is determining which words, phrases, and sentences are most central to the meaning of a document; a secondary problem is weaving those together to form a coherent summary. The third task we consider is text classification [19], in which topical and stylistic similarity between texts is determined, giving a high-level view of the contents of a text collection. The choice of textual features has many degrees of freedom, and can depend on part-of-speech and syntactic analysis as well as high-level discourse structure.
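The dependence of part-of-speech tagging on context can be illustrated with a minimal hidden Markov model sketch. The three-tag tag set, the probability tables, and the function name `viterbi` below are all hypothetical values chosen for illustration and are not taken from the course materials; a real tagger would estimate these probabilities from an annotated corpus.

```python
# Toy hidden Markov model tagger. Every probability below is invented
# for illustration; none of it comes from a real corpus.
TAGS = ["DET", "NOUN", "VERB"]

# P(tag at the start of a sentence)
START = {"DET": 0.3, "NOUN": 0.3, "VERB": 0.4}

# P(next tag | current tag)
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.60, "NOUN": 0.30, "VERB": 0.10},
}

# P(word | tag); 'book' is ambiguous between NOUN and VERB
EMIT = {
    "DET": {"the": 1.0},
    "NOUN": {"book": 0.4, "ticket": 0.5, "read": 0.1},
    "VERB": {"book": 0.3, "read": 0.6, "ticket": 0.1},
}
UNSEEN = 1e-6  # small floor for word/tag pairs missing from EMIT


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{t: (START[t] * EMIT[t].get(words[0], UNSEEN), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            # Pick the predecessor that maximizes the path probability.
            prev = max(TAGS, key=lambda p: best[-1][p][0] * TRANS[p][tag])
            score = best[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, UNSEEN)
            column[tag] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["read", "the", "book"]))
    print(viterbi(["book", "the", "ticket"]))
```

With these invented tables, 'book' is tagged as a noun when it follows a determiner and as a verb when it precedes one, which is the kind of contextual disambiguation described above. Changing the transition or emission values is a simple way to see how sensitive the result is to the model parameters.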
2.3 Connection to Research

In addition to the pedagogical improvements in adding two courses as part of a new specialization to our curriculum, there is the added benefit that this will assist with our research. We have already seen research benefits from the courses we have developed in data mining and information retrieval, and expect to see similar and enhanced payoffs from the new courses in text analysis and knowledge management systems. Much work on text analysis is well-suited to undergraduate contributions. Adapting and modifying known NLP techniques to new topic domains or textual genres requires experimental work that is often ignored in the undergraduate computer science curriculum. Furthermore, the project-based knowledge management systems course will directly support students in building novel systems that may address interesting research questions. We note from our experience that students often find great interest in systems integration and in tuning algorithms for their best performance, particularly when the data to be analyzed are relevant to them. We have already published several peer-reviewed research papers based on work done in large part by undergraduate research assistants [4, 24, 17], and expect the courses to accelerate this process.

2.4 State-of-the-art in IKMS

Numerous companies currently develop software products that support information and knowledge management. Commercial IKM technologies include products for language and text processing from Autonomy, ClearForest, Inxight, and TAI; for unstructured content management from 80-20 Software, BroadVision, and EMC Documentum; for information integration and visualization from Antarctica Systems, Business Objects, Citrix, and Informatica; and for decision support from AskMe, Cognos, and Kamoon. Many major players also have knowledge management related products, such as IBM, SAS, Oracle, Microsoft, and Sun Microsystems. The field also includes much active research in addition to continuing growth of new commercial products. Despite all this, we are unaware of any undergraduate courses of study focusing on IKM-related technology. Perhaps inverted index algorithms or word counting algorithms are embedded in a data structures course, or data mining is mentioned as a side note in a database course. Natural language text analysis might be covered briefly in an elective course in artificial intelligence, while issues of knowledge integration and human factors are not covered at all. We submit that the time is ripe to give this burgeoning area of computer science the attention it deserves. A student interviewing for a job in information and knowledge management systems would be laughed at if they claimed experience based on having done a word counting algorithm in data structures, or having listened to two or three lectures on natural language processing in an AI course.

2.5 Related research at IIT

Existing related research at IIT has focused on developing new methods for natural language processing, information retrieval, and data mining, and integrating them into systems which enable improved information access to heterogeneous data collections.
Our recent research focuses on broadening the scope of data analysis and retrieval techniques in several directions. In natural language processing, we are developing techniques for stylistic analysis of texts [3, 4], and integration of these methods with information retrieval. We are also working on combining structured data with text in information access [10, 14]. The idea is that a user really wants a unified view of data regardless of where or how they are stored [13]. In addition to unified access, we have worked on evaluating information fusion and information extraction for information retrieval [5, 18, 21], as well as on new methods for shallow parsing [2]. While these techniques are promising in the lab, we have not yet been able to fully measure their effect on actual users.

3 CURRICULUM DEVELOPMENT

As two of the four courses we have developed for the IKMS specialization have previously been described in the literature [12, 11], we describe here just the new courses, i.e. ITA and IKMS, being developed as part of the IKMS specialization, as well as how we are integrating our research

with educational activities in these courses and the rest of the specialization.

3.1 Intelligent Text Analysis

We have developed and are teaching a new course in intelligent text analysis this spring semester. It has as prerequisites basic knowledge of probability theory and algorithmic techniques such as depth-first search and dynamic programming, both of which are covered by the end of the sophomore year in the standard CS curriculum. The curriculum for this course is as follows:

Week    Topic
1       Overview, linguistic fundamentals
2       Part-of-speech tagging
3-4     Chunking and shallow parsing
5-6     Link grammars and parsing
7-8     Wordnet and latent semantic analysis
9       Named-entity recognition
10      Relational information extraction
11-12   Text summarization
13      Text classification

Thirteen out of fifteen weeks are planned for lectures, leaving two weeks free for exams and student presentations. The first eight weeks of the course plan are focussed on fundamental concepts and algorithms, while the last five weeks are focussed on higher-level integrative material. During the course, a variety of commercial NLP tools will be made available to students for experimentation. The idea is for students to implement key algorithms during the early part of the course and then to use commercial tools for higher-level tasks towards the end.

3.2 Information and Knowledge Management Systems

The capstone course is designed as a project course whose purpose is to enable students to see how the various algorithms and systems they have learned about in their previous courses can be used in context to create useful knowledge management tools. The course is still in the early phases of development, but an outline of the course organization is as follows. Students in the course will be divided into groups of 4-6, each of which will choose a project early in the semester whose results they will present at the end of the semester. Class periods will be divided among discussion of the design of information and knowledge management systems, lectures on effective project management techniques, and hands-on advising of student project group meetings. Students in this final Knowledge Management Systems class will have the opportunity to integrate all that they have learned previously in the specialization. Examples of projects that will be used to guide students are as follows (naturally, students will be encouraged to creatively augment these projects with other ideas).

• Students will choose a ‘home’ company and then build software to analyze ‘competing’ web pages, with the goal of finding information of interest to their home company. For instance, a project for ‘Coca-Cola’ may want to identify employee lists and new product offerings from Pepsi Corporation. A keyword search for “Pepsi” is only the start; data mining and text processing will be key to finding useful information.

• Students will be pointed to some complex web sites and instructed to ‘summarize’ the key elements of each site. This can require automated data aggregation of tables on the web site (structured data) as well as implementation of text processing and data mining algorithms for various data summarization techniques.

As these examples show, there is a wealth of interesting and meaningful knowledge management projects that can be addressed once students are equipped with an arsenal of search engine, data mining, and natural language processing techniques. We expect that the students will enjoy their projects and suspect further that many results may prove publishable; regardless, the experience will likely prove invaluable to students in their future graduate work or in industrial careers.

3.3 Incorporation of Research at IIT

Our goal here is not to frustrate students by asking them to do research for which they are ill prepared. Instead, we believe that enough tuning parameters and variations exist for current algorithms that undergraduates may well develop new means by which to improve accuracy in text analysis and data mining. In no case are we asking students to develop novel algorithms, as we believe that that is properly a graduate-level activity that often takes years to refine. Instead, we suspect that undergraduates will benefit from the quest of “playing with” these algorithms and tuning them to improve their accuracy. Our work in text analysis at IIT has focused on high-level application of natural language processing techniques for tasks such as stylistic text classification [3, 4] and improved information retrieval [18] and evaluation [6].
These tasks critically depend on lower-level linguistic processing of texts for part-of-speech tagging, chunking, and dependency analysis. In the context of the course in text analysis, we are setting up an environment where, as they learn different techniques, students can choose to vary grammatical models and adjust parameters for different algorithms. We believe that this sort of quest teaches more than reading any textbook. This also enables us to directly incorporate our newest research on text analysis into the undergraduate curriculum, thus involving students in activities directly related to the research. Similarly, we expect to involve students in our work on multiple data source query integration [13]. The general idea is for users to simply enter a natural language query and retrieve answers to their query regardless of whether the answers are in the form of structured or semistructured data, text, video, sound, etc. This area integrates text analysis, data mining, and information retrieval. We believe that project development in such a context, with real-world data,

is one of the most effective forms of teaching. Students at the undergraduate level may be expected to be caught by the same enthusiasm we have seen in our graduate students, learning a great deal about algorithms as well as IKM system integration by directly working on them.

4 EXPERIMENTAL METHODS

We track students’ progress in our courses with detailed evaluation forms for each assignment, each lecture, and following each test. Additionally, we will track the number of original research contributions obtained by undergraduates. Student projects early in the semester involve simple implementation of a known algorithm, while those later in the semester will require students to ‘play’ with existing algorithms they have implemented to try to improve effectiveness. At a minimum this may lead to new ideas for interaction with these systems. Additionally, we expect some students will develop new modifications to improve effectiveness. Projects will be geared towards clear learning objectives, and tests will be given to ensure that students have met these objectives. In teaching each course, we are conducting an ongoing formative evaluation of student learning on a biweekly basis. The goal of this formative evaluation is to track student learning, identify strengths and weaknesses of individual students and provide them with appropriate remedial help, and review content relevance and mode of instruction and revise content/instruction as needed. While formative evaluation will be an integral part of the specialization throughout our curriculum development effort, it is especially crucial during the first offerings of the courses, and so we will devote substantial resources to this activity during the first two years of the program. The goal is to refine a sequence of courses that is not so difficult that students will be frustrated but sufficiently challenging to motivate students to push themselves to learn more.
4.1 Evaluation of Results

Evaluation of the effectiveness of the undergraduate specialization in information and knowledge management systems will consist of three phases covering the multiple offerings of the two new courses in the specialization (similar evaluation activities are already well underway for our previously developed courses in information retrieval and data mining). Phase 1 covers the first year of instruction, including just the first offering of Intelligent Text Analysis (ITA). Phase 2 will cover the second year of instruction, including the first offering of Information and Knowledge Management Systems (IKMS), and the second and third offerings of ITA. Phase 3 will cover the third year of instruction, including offerings of both courses. Since the new undergraduate courses will be available for all eligible undergraduate students, the traditional control/experimental group paradigm will not be very useful for evaluating the effectiveness of the curriculum. Therefore, each group of undergraduates will be compared with itself with appropriately defined pre- and post-measures.

Phase 1 (Intelligent Text Analysis)

A. On the first day of class for the text analysis course (version 1), all students will be asked to fill out a two-part questionnaire. The first part of the questionnaire will be designed to gather demographic information (race/ethnicity, gender, etc.) as well as information about previous course work, especially in computer science and mathematics.

B. The second part of the questionnaire, which will consist of items with categorical response options (Likert-type [1]), will ask participants/students to indicate their current knowledge and understanding of the major topics (e.g., part-of-speech tagging, parsing, information extraction, text summarization, etc.) to be covered in the course. In addition, each participant will be asked to describe his/her expectations for the course. Each participant will also have the option to provide written comments.

C. At the conclusion of the text analysis course, the second part of the above-described questionnaire, with appropriate modifications to items on expectations and some new items on course content and instruction, will be readministered to the same students. A summative evaluation test [7], reflecting the content of the entire course, will be developed and administered to all students. The summative test will consist of a mixture of multiple-choice, written-response, and performance-based items to provide an accurate assessment of what the students know and are able to do.

D. Information from Part A will be used to describe the group of undergraduate students enrolled in the text analysis course. This information may also be used for additional statistical analyses to be described later.

E. Results from Parts B and C will form the basis for assessing the degree to which the students’ expectations for the course are met and how much they have learned of the contents of the text analysis course. Percentages of students showing mastery on each of the skill areas/objectives of the course will be computed and an overall profile of mastery for the class as a whole will be developed for use in program evaluation, including course content and instruction. Appropriate statistical analyses [20, 25] will be performed to identify and articulate the significant outcomes about students’ expectations and learning. It should be noted that the sample size for Phase 1 may be small and hence univariate/multivariate statistical tests will not have adequate power to detect significant differences [8]. Therefore, data about students’ expectations will also be reported in effect size units [16] to facilitate the interpretation of outcome measures.

Phase 2 (Info. and Knowledge Management Systems)

A. The information obtained about each student enrolled in this course will be similar to the information gathered in Phase 1 for the text analysis course. The first part of the questionnaire will be identical to the one described above. In the second part of the questionnaire, students will be asked to indicate their knowledge and understanding of the major topics in knowledge management systems (organizational decision making, groupware, human factors, etc.), as well as their understanding of how to run and participate in a larger group project. They will also be asked about their expectations for the course. While the general format of the summative test will remain the same, the content of the test will be different, reflecting appropriately the relevant topics for this course. The previously described statistical data analysis procedures will be adopted to provide information about how well the students’ expectations are met and how much they have learned from the course on knowledge management systems.

B. In addition, the students that have completed the specialization in knowledge management systems, or courses within it, will be monitored after graduation for a period of about six months. The purpose behind this monitoring is to gather information about time spent looking for a job, starting salary, position, current employer, and the positive effect the specialization may have had on these matters. A specially designed questionnaire will be used for gathering the needed information in this monitoring phase. Whenever possible, graduates who were not exposed to the courses in the specialization will also be monitored, and the resulting data will be analyzed for determining post-graduation benefits as a result of the students’ exposure to the various courses in the specialization.

C. Implementation and data analysis for the courses in this phase will be similar to those described in Phase 1.

D. In addition, a separate evaluation of the text analysis course will be conducted using pooled data from both Phases 1 and 2. For this segment of the data analysis, the total sample for text analysis will be larger, adding significantly to the power of the statistical analyses.

Phase 3 (both new courses)

A. Implementation and data analysis in this phase will be similar to the ones described in Phase 1.

B. A separate evaluation of the two courses will also be conducted with the questionnaire and summative test data from all three phases, using a much larger sample of students for both courses. As in Phase 2, the increased sample size will contribute significantly to the power of the statistical analyses.

The three-phase evaluation plan is designed to evaluate how well students enrolled in the specialization learn major topics and skill areas in text analysis and knowledge management systems. Students’ expectations for these courses and how such expectations are realized will also be documented and evaluated. To the extent possible, information about post-graduation benefits (i.e., time spent on searching for a job and starting salary) resulting from exposure to the specialization will be monitored and analyzed for trends.
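To make the effect-size and mastery-percentage reporting concrete, the following is a minimal sketch. It is not the evaluation team's procedure: the function names, the sample Likert scores, and the mastery cutoff are hypothetical, and the pre- and post-course responses are treated as two independent groups with a pooled standard deviation, which is a simplifying assumption (a paired analysis would use the per-student differences). The small-sample correction factor is the usual approximation 1 - 3/(4*df - 1) associated with Hedges' g.

```python
import math
from statistics import mean, stdev


def hedges_g(pre, post):
    """Standardized mean difference (post minus pre) with small-sample correction.

    Assumption: pre and post are treated as two independent samples.
    """
    n1, n2 = len(pre), len(post)
    df = n1 + n2 - 2
    # Pooled variance from the two sample variances.
    pooled_var = ((n1 - 1) * stdev(pre) ** 2 + (n2 - 1) * stdev(post) ** 2) / df
    d = (mean(post) - mean(pre)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * df - 1)  # approximate bias correction
    return d * correction


def mastery_rate(scores, cutoff):
    """Percentage of students whose score meets or exceeds the cutoff."""
    return 100 * sum(s >= cutoff for s in scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical self-rated knowledge on a 1-5 Likert scale for five students.
    pre_scores = [1, 2, 2, 3, 2]
    post_scores = [3, 4, 4, 5, 4]
    print(f"effect size g = {hedges_g(pre_scores, post_scores):.2f}")
    print(f"mastery at cutoff 4: {mastery_rate(post_scores, 4):.0f}%")
```

Reporting a standardized effect size alongside a significance test is useful with small classes because the effect size does not shrink toward non-significance simply because few students were enrolled.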

Finally, if sample sizes are deemed adequate, the questionnaire and test data from all three phases will be analyzed separately for males and females to identify different learning patterns, if any; that information will be used to redesign the mode of presentation in order to maximally benefit all students. A similar analysis may also be undertaken for various racial/ethnic groups, provided the sample sizes are considered adequate for the statistical analyses.

5 CURRENT STATUS

We have laid out a clear plan to extend our undergraduate CS curriculum to include a five-course sequence giving a specialization in information and knowledge management systems. We have shown that our plan has the potential to extend the state-of-the-art as well as provide a new pedagogical opportunity for CS students. We have already added a three-credit course in Intelligent Text Analysis to our undergraduate curriculum, being taught as an experimental course in Spring 2005. Software and lecture notes/slides have been prepared to supplement existing textbook materials in teaching this course. A detailed evaluation form specific to the goals and objectives of the course has been developed, for administration on the first day of the course. Partial results from the first running of the course will be presented at the conference (full evaluations from this offering will be available only in May, after the conference).

6 SUMMARY

There is strong pedagogical value in giving undergraduates educational opportunities to fruitfully integrate theoretical learning with practical experience, as described here. Furthermore, by developing new and exciting specializations in important areas of computer science, we increase the relevance and interest of CS to potential majors and thus expect to improve the quality of the field as a whole.

Acknowledgements

This work was supported in part by the National Science Foundation under contracts EIA-0119469 and IIS-0417528.

REFERENCES

[1] A. Anastasi. Psychological Testing. Macmillan Publishing Co., 1988.
[2] S. Argamon, I. Dagan, and Y. Krymolowski. A memory-based approach to learning shallow natural language patterns. Journal of Experimental and Theoretical Artificial Intelligence, 10:1–22, 1999.
[3] S. Argamon, M. Koppel, J. Fine, and A. R. Shimony. Gender, genre, and writing style in formal written texts. Text, 23(3), 2003.
[4] S. Argamon, M. Šarić, and S. S. Stein. Style mining of electronic messages for multiple author discrimination. In Proc. ACM Conference on Knowledge Discovery and Data Mining, 2003.
[5] S. Beitzel, E. Jensen, A. Chowdhury, D. Grossman, O. Frieder, and N. Goharian. On fusion of effective retrieval strategies in the same information retrieval system. Journal of the American Society of Information Science and Technology, 55(10), July 2004.
[6] Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, and David Grossman. Using titles and category names from editor-driven taxonomies for automatic evaluation. In Proceedings of the 2003 ACM Conference on Information and Knowledge Management (ACM-CIKM), New Orleans, LA, November 2003.
[7] B. S. Bloom, J. T. Hastings, and D. F. Madaus. Handbook of Formative and Summative Evaluation of Student Learning. McGraw-Hill Publishing, 1971.
[8] J. Cohen. Statistical Power Analysis for the Behavioral Sciences. Erlbaum Publishing, 1988.
[9] J. Cowie and W. Lehnert. Information extraction. Communications of the ACM, 39(1):80–91, 1996.
[10] O. Frieder, D. Grossman, and A. Chowdhury. On scalable information retrieval systems. In Proc. IEEE 2nd Int’l Symp. on Network Computing and Applications, April 2003. Keynote Address.
[11] N. Goharian, D. Grossman, and N. Raju. Extending the undergraduate computer science curriculum to include data mining. In IEEE International Conference on Information Techniques on: Coding & Computing (ITCC 2004), Las Vegas, Nevada, April 2004.
[12] N. Goharian, D. Grossman, N. Raju, and O. Frieder. Migrating information retrieval from the graduate to the undergraduate curriculum. Journal of Information Systems Education, 15(1), April 2004.
[13] D. Grossman, S. Beitzel, E. Jensen, and O. Frieder. IIT intranet mediator: Bringing data together on a corporate intranet. IEEE IT PRO, January/February 2002.
[14] D. Grossman, O. Frieder, D. Holmes, and D. Roberts. Integrating structured data and text: A relational approach. Journal of the American Society of Information Science, 48(2), February 1997.
[15] Udo Hahn and Donna Harman, editors. Proceedings of the Workshop on Text Summarization at the 40th Meeting of the Association for Computational Linguistics. Philadelphia, PA, July 11–12 2002.
[16] L. V. Hedges and I. Olkin. Statistical Methods for Meta-Analysis. Academic Press, 1985.
[17] Eric C. Jensen, Steven M. Beitzel, Angelo J. Pilotto, Nazli Goharian, and Ophir Frieder. Parallelizing the buckshot algorithm for efficient document clustering. In Proceedings of the 2002 ACM International Conference on Information and Knowledge Management (CIKM-2002), Washington D.C., November 2002.
[18] M. Jiang, E. Jensen, S. Beitzel, and S. Argamon. Effective use of bigrams in language models for information retrieval. In Proc. Eighth Symposium on Artificial Intelligence and Mathematics, January 2004.
[19] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137–142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
[20] R. E. Kirk. Experimental Design: Procedures for the Behavioral Sciences. Brooks/Cole Publishing Co., 1995.
[21] L. Ma, N. Goharian, A. Chowdhury, and M. Chung. Extracting unstructured data from template generated web documents. In Proc. ACM 12th Conference on Information and Knowledge Management (CIKM), November 2003.
[22] Yogesh Malhotra. Knowledge management for organizational whitewaters: An ecological framework. Knowledge Management (UK), pages 18–21, March 1999.
[23] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K.J. Miller. Wordnet: An on-line lexical database. International Journal of Lexicography, 3(4):235–312, 1990.
[24] S. Stein and N. Goharian. On the mapping of index compression techniques on csr information retrieval. In IEEE Int’l Conf. on Info Techn: Coding & Computing (ITCC 2003), Las Vegas, Nevada, April 2003.
[25] M. M. Tatsuoka. Multivariate Analysis: Techniques for Educational and Psychological Research. Macmillan Publishing Co., 1988.