Submitted in partial fulfillment of the requirements for the Master's degree in the Department of Mathematics and Computer Science, Bar-Ilan University, 2001.
Bar-Ilan University
Intelligent Multimedia Authoring Tools For Electronic Publishing
Lea Tsaban
Submitted in partial fulfillment of the requirements for the Master’s degree in the Department of Mathematics and Computer Science, Bar-Ilan University
2001
Ramat-Gan, Israel
This work has been done under the supervision of
Professor Sarit Kraus
© All rights reserved to the author. Parts of this work may be used for non-commercial purposes only, with an explicit reference to this work.
Contents

1 Electronic multimedia authoring
  1.1 Introduction
  1.2 Related Work
    1.2.1 Characteristics of online newspapers
    1.2.2 Related Approaches to Online newspapers
  1.3 Overview
    1.3.1 Author’s policy
    1.3.2 The system modules
    1.3.3 Optimizations
    1.3.4 Realization of Author’s policy
  1.4 Working phases
    1.4.1 High level design phase
    1.4.2 Simulation phase
    1.4.3 Experiments

2 The Content Manager
  2.1 Input
    2.1.1 Documents
    2.1.2 Keywords and their score
    2.1.3 Constraints
    2.1.4 The User Profile
    2.1.5 The number K
  2.2 Evaluation of a Set of Documents
    2.2.1 Normalization of the weights of the constraints
    2.2.2 Normalization of the weights of the documents
    2.2.3 The evaluation
  2.3 Constraint satisfaction
    2.3.1 Constraints on subject and sub-subject
    2.3.2 Constraints on keywords
    2.3.3 Constraints on the specifications of the document
  2.4 Score Maximization
    2.4.1 Conservative substitution
    2.4.2 Quick Replace substitution
    2.4.3 Least-contributing substitution

3 Real-Time Annealing
  3.1 Simulated annealing
  3.2 Adopting simulated annealing for real-time applications
    3.2.1 Finding T0
    3.2.2 The polishing phase and ε
    3.2.3 Find C
  3.3 Results of the real-time annealing algorithm

4 Simulation of the system
  4.1 Brief description
  4.2 The parameters of the simulation
  4.3 Measuring the satisfaction of the users
  4.4 Experiments with partial information user profiles
    4.4.1 The set of constraints
    4.4.2 The results of the experiments with partial information user profiles
  4.5 Experiments with full information user profiles
    4.5.1 The set of constraints
    4.5.2 Parameters
    4.5.3 The results of the experiments with full information user profiles
    4.5.4 Conclusions from the results

5 Conclusions and future work

A The Layout Manager
    A.0.5 The Extensible Markup Language (XML)
    A.0.6 The Layout Manager

B Description of the code
  B.1 Keywords extraction
    B.1.1 Extractor
    B.1.2 Sent-morph
  B.2 The parser
  B.3 Building the user profiles
    B.3.1 Profiles with random scored keywords
    B.3.2 Profiles fully scored by users
  B.4 Building the constraints
  B.5 Content manager

C The Questionnaire

D Choosing relevancy-threshold
List of Figures

2.1 The input and the output of the Content Manager
2.2 The structure of the document profile
2.3 parsing to XML files
2.4 Building the constraints
2.5 Building a user profile
2.6 The structure of a user profile
4.1 Flow of the CM
4.2 Average time it takes to reach the desired utility
4.3 recall
4.4 precision
4.5 failures
4.6 Average time it takes to reach the desired utility
4.7 The time it takes the Quick Replace algorithm to reach the desired utility, as a function of the number of the documents
4.8 on Quick Replace, the portion of the users for which the software did not reach the desired utility
4.9 The time it takes to reach the desired utility, as a function of the number of the documents
4.10 The portion of the users for which the software couldn’t reach the desired utility
4.11 The recall achieved by the different algorithms
4.12 The precision achieved by the different algorithms
4.13 recall, as a function of the number of the documents
4.14 precision, as a function of the number of the documents
4.15 Average time it takes to reach the desired utility on the different values of Par
4.16 The recall achieved with the different values of Par
4.17 The precision achieved with the different values of Par
4.18 The portion of the users for which the software couldn’t reach the desired utility, for the different values of Par
4.19 Average time it takes to reach the desired utility on the different values of Par and different number of documents
4.20 Average time it takes to reach the desired utility in the different values of Par and different number of documents, without the extreme values
4.21 The portion of the users for which the software couldn’t reach the desired utility, for the different values of Par and different number of documents
4.22 The average score of the constraints in the utility function, for the different values of Par, desired utility is 0.98
4.23 The portion of the users for which the software couldn’t reach the desired utility (0.98), for different values of Par
4.24 The average score of the constraints in the utility function, for different groups of constraints, Par=0.5
4.25 The portion of the users for which the software couldn’t reach the desired utility (0.93), for different groups of constraints, Par=0.5
4.26 The average score of the constraints in the utility function, for different groups of constraints, Par=0
4.27 The portion of the users for which the software couldn’t reach the desired utility (0.93), for different groups of constraints, Par=0
4.28 The average score of the constraints in the utility function, for different groups of constraints, desired utility is 0.93
4.29 The average score of the constraints in the utility function, for different groups of constraints, desired utility is 0.98
4.30 The portion of the users for which the software couldn’t reach the desired utility (0.93), for different values of Par
4.31 The portion of the users for which the software couldn’t reach the desired utility (0.98), for different values of Par
4.32 The time it takes to reach the desired utility, as a function of the groups of constraints, for different number of documents
4.33 The time it takes to reach the desired utility, as a function of the number of the documents, for different groups of constraints
4.34 The portion of the users for which the software couldn’t reach the desired utility, as a function of the different number of the documents, for different groups of constraints
4.35 The time it takes to reach the desired utility, as a function of the number of the documents, for different types of constraints
4.36 The time it takes to reach the desired utility, as a function of the number of the documents, for different types of constraints
4.37 The portion of the users for which the software couldn’t reach the desired utility, as a function of the number of the documents, for different types of constraints
4.38 Average of the scores the newspapers got, experiments with real users
4.39 The average portion of the documents that were interesting to the user, experiments with real users
A.1 One Content - Varying Layout
A.2 The Data Flow in the system
A.3 Output of the LM
B.1 Flow of the software that builds the user profiles according to the poll
B.2 Flow of the software that builds the user profiles interacting with the user
B.3 Flow of the software that builds the constraints
B.4 Flow of the CM
B.5 Flow of the utility function
Preface

This thesis describes the results of research on intelligent content selection for electronic publishing. Our research was not only of theoretical importance: we implemented a massive software tool in C++, as well as scripts allowing the use of existing advanced AI software tools. This implementation allowed us to test and statistically prove our conjectures. From the practical point of view, this resulted in a ready-to-use software tool for electronic content selection, whose effectiveness is supported by theoretical as well as experimental evidence.

The research was carried out under the supervision of Professor Sarit Kraus. This research was partially supported by the G.I.F.

Complementary research on this subject was carried out by Alexander Kröner [21] as his Ph.D. thesis work. Kröner’s work was on the layout design of the articles, whereas our work was on the content selection. We will explain our work, as well as its interactions with Kröner’s work, in the coming chapters.
Publication

As explained in the abstract, we have made contributions to several important areas in the field of Artificial Intelligence. We plan to derive two papers from this work: one on the major part of this work, which deals with intelligent content selection, and the other on our algorithm for real-time annealing, which is interesting beyond its usage in the current work (this will require testing this algorithm on problems beyond the scope of this work).
Acknowledgments

I would like to thank the G.I.F. for the financial support of this research. I would also like to thank all the people who helped and encouraged me: Sarit Kraus, my advisor; Eti David, who introduced me to the field; Roy Ben-Ary, who was ready to devote his time to help me whenever I asked him; my brother Uzi Vishne, for his help with the statistical analysis; and his wife Tali Vishne, for her great emotional support. Most of all, I would like to thank my husband Boaz, without whose help, support, and encouragement I would never have arrived at the moment of finishing my M.Sc.
Abstract

The problem of electronically choosing the contents of a newspaper lies at the core of artificial intelligence. This problem involves virtually all aspects of that field:

Satisfaction
The intuitive or psychological notion of “satisfaction” (of the reader) must be approximated by a rigorously defined measure, or score, defined on the space of all possible contents for the newspaper. Several issues are raised in this context, such as whether the contents should be completely determined by the user’s claimed preferences (the user profile).

Real-time methods
As electronic newspapers are distributed to a very large number of people, and a personal newspaper must be constructed for each reader, maximization functions must be applied a large number of times in a limited time (newspapers must be up-to-date). Therefore, the maximization of the score function must be made in real time. As the space of all possible contents for the newspaper is infeasibly large, efficient approximation methods (global optimizations) and specific implementation speedups (local optimizations) must be used.

Experiments
The model developed for the user or reader satisfaction must be tested with real users, that is, human beings.

We believe that we made contributions to each of these issues; briefly:
1. We introduced a new model for content selection, where not only the user profile determines the quality of a given newspaper, but also constraints supplied by the editor (or author) of that newspaper. We approximated the mentioned term “quality” by a rigorous score function combining the user preferences with the author’s constraints. Our thesis was: taking the author’s constraints into consideration not only plays a moral or educational role, but also improves the satisfaction of the user from the resulting newspaper. We claim that the fact that the user is unbiased when filling in the user profile can play against him or her not only from the educational point of view, but also with respect to his or her satisfaction. This claim may sound surprising, but our thesis is experimentally proved by comparing the level of satisfaction of users from newspapers chosen according to their wishes only with their level of satisfaction from newspapers chosen by our function, which combines their wishes with the author’s constraints. We also introduce a new measure, called demo-precision, which combines the standard measures called recall and precision, and prove that this measure is more appropriate than the classical measures for estimating the user’s satisfaction from a given newspaper.

2. We have implemented several hill-climbing maximization algorithms. In order to check whether a better algorithm can be introduced along the lines of simulated annealing, we introduced a novel variant of annealing which is suitable for real-time searching. The efficiency of this algorithm turned out to be close to that of the best hill-climbing algorithm among the ones that we have implemented, despite the fact that the search space is “discrete” in nature (in the sense that a small change in the contents may cause a large change in the score). We also made several local optimizations which were not observed in earlier works. E.g., we used a simple linear-algebraic fact to get a speedup factor of almost 2 for the most time-consuming part of our score function.

3. For the first time, we made experiments where the users not only scored each subject and sub-subject, but also scored a list of keywords for each subject and sub-subject. This yielded a more accurate user profile for the users who took part in our experiments. The results of our experiments support our thesis in a statistically significant manner.
Chapter 1
Electronic multimedia authoring

1.1 Introduction
For an increasing number of applications, the manual creation of multimedia presentations is no longer feasible. To meet the specific requirements of the individual presentation consumer, information has to be communicated fast and flexibly. The available authoring tools do not solve this problem, since they facilitate only the editing of the presentation. Among the many commercial applications on the World Wide Web, there are also online newspapers [44, 45, 46]. In addition to the wide distribution of documents in electronic form, online newspapers have another quality: personalization (e.g., [48, 28]). Personalization means enabling each user to get information which is closely related to his fields of interest. Some existing content retrieval tools search for individual multimedia objects in response to specific user queries (e.g., [12]), while others try to filter information for a particular user [40, 13, 26]. An agent such as “PinaWeb” (Personalized Intelligent Newspaper Agent based on the World-Wide Web) [34] generates daily a new personal HTML page for its user, containing links to news (or the actual articles) which may be of interest to that user. Research in another direction aims at the development of supporting tools for the editorial work of online publishing [43].
1.2 Related Work

1.2.1 Characteristics of online newspapers
Online newspapers have several characteristics that differentiate them from the conventional hardcopy newspapers. These characteristics should be taken into consideration when developing an online newspaper:
Wide distribution
While the distribution of hardcopy newspapers depends on people who deliver them from one place to another, and is therefore limited, online newspapers can be placed on the Internet, and therefore have a worldwide distribution.

Personalization
While in a hardcopy newspaper the editor chooses the articles that will be given to the readers (according to some criteria, e.g. some “average” of the fields of interest of all potential readers), in an online newspaper each reader can get a different version of the newspaper. Therefore, an online newspaper can better fit the individual reader. Personalization has become a main issue in recent years, and there is much research in this field ([1], [3], [6], [16], [20], [27], [30], [31], [40], [42], [35], [47]). Some of this research concentrates on electronic access to news (e.g. [2], [4], [5], [17], [18], [22], [25], [34]).

The educational function
Newspapers have two important social functions: education and entertainment. Personalization of a newspaper enhances the latter but loses the former almost completely. A partial solution to this problem is to aggregate the user’s personal profile with the community’s profile [17].

Gap between time of event and publishing
With hardcopy newspapers, it takes a relatively long time to print and publish a new edition, and in the era of television, news reaching the readers only after several hours may become obsolete. With an online newspaper, the publication of a new edition is almost immediate. In [25] the reader can get news updates while reading his newspaper. [31] suggests a method to update the newspaper several times a day.
1.2.2 Related Approaches to Online newspapers
To build a personalized newspaper we need to know the user’s fields of interest, i.e. to have a user profile. Morin and Konstantas [25] suggest that the first time the user connects to the system, he or she specifies his or her fields of interest, and thus builds a user profile. The approach of [18] is to let the user choose whether to build a user profile the first time he or she connects. If the user chooses not to, all the articles are assigned the same score; after the user has read the newspaper once, the score of the articles is calculated by an AI agent according to the level of interest the user shows in each article. The user can also score the articles explicitly. In [17] the user is also asked to build the profile the first time he or she connects to the system, but from then on the profile is built implicitly, similarly to [18]. According to an experiment described in [18], the profile closest to what the reader really wants is the explicit profile, but sometimes it is very uncomfortable for the reader to answer a long set of questions, state his fields of interest, or state the keywords of the subject he wants (or does not want) to read about. Kamba et al. [18] suggest combining the explicit and the implicit scoring of the articles. All existing methods refer only to the wishes of the user, and not to the demands of the authors.
1.3 Overview

1.3.1 Author’s policy
We developed a new model for content selection for multimedia products, based both on author specifications and on user interests, where the interrelations among objects play an important role in the evaluation of the set of objects. Our work can be contrasted with [17], where the score given to the articles by the user is combined with the community’s score of the article. The mechanisms that we developed in this work are flexible enough to allow almost trivial adoption of any of the approaches mentioned in Section 1.2.1. Another problem mentioned in [18] and [31] is that the articles which the users want to read are not necessarily only the ones that match their interests, as these appear in their user profile. We prove this claim statistically in Section 4.5.3. Also, we want to make sure that even when there is a lot of information about the user’s most interesting field, he will get information from other fields as well. The advantages of our approach are:

1. We do not lose the educational value of the newspaper [17].
2. We give the reader some information in fields that do not exactly match his or her interests as they appear in the user profile, but which are most likely to interest him or her.
3. The resulting newspaper will not be focused on a specific field.
1.3.2 The system modules
We developed a software tool that simplifies and accelerates the authoring process for creating electronic newspapers by automating the complex, time-consuming tasks of multimedia content selection, formatting and layout design. Our system consists of two modules: a Content Manager (CM) and a Layout Manager (LM).

• The CM receives the raw, media-rich data input to the system. This input takes the form of separate or linked objects containing text, graphics, audio and video. It evaluates the content of each object based on the author’s specifications and the interests of the consumers or users of the electronic presentation. The CM screens and rates content, and issues recommendations to the LM.

• The LM analyzes the spatial and (in the case of time-dependent material such as audio and video) temporal characteristics of each object, formats it, and produces a near-optimal layout of the various objects, by reconciling the CM’s recommendations with the provider’s layout specifications and spatial/temporal constraints.

The output of the system is the finished multimedia product: an electronic magazine, training program, presentation, etc., that is functional, relevant and esthetically pleasing. We focus on the development of the CM, while the LM is developed by our research partners (see Appendix A and [21]).
1.3.3 Optimizations
The optimization problems involved in creating multimedia products are intractable. For example, within a set of multimedia objects, the choice of a subset that maximizes the satisfaction of users, given a set of constraints, is intractable. The difficulty is compounded by the amount of information available to the system, which is large and changing over time, and by the need to create the product within a short time period. The system should be able to generate time-critical multimedia products on the fly, i.e. in real time. Given these requirements, it is necessary to develop sub-optimal methods (heuristics) that solve the problems in a short time with a good outcome. Faster authoring alone, however, will only aggravate the problem of information overload; understanding and accommodating users’ preferences is critical for success. This is why the system also addresses content selection based on user interests, thus ensuring that multimedia products are both visually attractive and relevant in content. By applying artificial intelligence (AI) techniques to characterize the interests of users, the system meets these needs.
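To give a feeling for why exhaustive search over all possible contents is not an option, the following sketch simply counts the K-document subsets of an N-document bank. The sizes used (20 and 1000 documents, newspapers of 5 and 20 documents) are illustrative assumptions, not figures taken from our experiments, and the function name is hypothetical.

```python
import math


def count_candidate_newspapers(n_documents: int, k: int) -> int:
    """Number of distinct K-document subsets of an N-document bank."""
    return math.comb(n_documents, k)


# Illustrative sizes only (assumptions, not the sizes used in the thesis).
small = count_candidate_newspapers(20, 5)
large = count_candidate_newspapers(1000, 20)

print(small)            # 15504
print(len(str(large)))  # number of decimal digits of the larger count
```

Even before constraints are taken into account, the number of candidate subsets grows far beyond what can be enumerated for every reader in real time, which is what motivates the heuristic methods.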
1.3.4 Realization of Author’s policy
Given a set of multimedia objects, we use soft constraints to assign a score to the set, which is combined dynamically with the user-profile score of the objects included in the set. The combination of the scores is done according to a policy provided by the author or according to a default function. Our algorithm tries to maximize the combined score. Three types of algorithms have already been implemented for similar purposes [36]: the first was a deterministic backtracking algorithm, the second a heuristic repair method [24] based on a random-restart hill-climbing mechanism, and the third a genetic algorithm [41]. We implemented these algorithms in our system, replacing the genetic algorithm with a simulated annealing algorithm (described in Chapter 3).
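The idea of maximizing a combined score by random-restart hill climbing can be sketched as follows. This is a generic textbook version under stated assumptions, not the implementation used in our system: the convex combination in `combined_score`, its `weight` parameter, the single-document swap neighbourhood and the toy data are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, FrozenSet, Tuple


def combined_score(profile_score: float, constraint_score: float,
                   weight: float = 0.5) -> float:
    """Hypothetical default policy: convex combination of the two scores."""
    return weight * constraint_score + (1.0 - weight) * profile_score


def hill_climb(n_docs: int, k: int,
               score: Callable[[FrozenSet[int]], float],
               rng: random.Random,
               restarts: int = 5) -> Tuple[FrozenSet[int], float]:
    """Random-restart hill climbing over K-subsets of documents 0..n_docs-1."""
    best_set: FrozenSet[int] = frozenset()
    best_val = float("-inf")
    for _ in range(restarts):
        current = frozenset(rng.sample(range(n_docs), k))
        current_val = score(current)
        improved = True
        while improved:
            improved = False
            # Try swapping one chosen document for one unchosen document.
            for out_doc in sorted(current):
                for in_doc in range(n_docs):
                    if in_doc in current:
                        continue
                    candidate = (current - {out_doc}) | {in_doc}
                    candidate_val = score(candidate)
                    if candidate_val > current_val:
                        current, current_val = candidate, candidate_val
                        improved = True
                        break
                if improved:
                    break
        if current_val > best_val:
            best_set, best_val = current, current_val
    return best_set, best_val


# Toy data (assumed): per-document interest and subject.
INTEREST = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
SUBJECT = ["sports", "sports", "politics", "sports", "politics", "science"]


def toy_score(chosen: FrozenSet[int]) -> float:
    profile = sum(INTEREST[d] for d in chosen) / len(chosen)
    # Toy soft constraint: at least one science document should appear.
    constraint = 1.0 if any(SUBJECT[d] == "science" for d in chosen) else 0.0
    return combined_score(profile, constraint)


best_set, best_val = hill_climb(len(INTEREST), 3, toy_score, random.Random(0))
print(sorted(best_set), round(best_val, 4))  # [0, 2, 5] 0.8167
```

In this toy instance the three documents with the highest interest (0, 2 and 4) are not selected together: the low-interest science document replaces one of them because the constraint score outweighs the small loss in profile score.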
1.4 Working phases

1.4.1 High level design phase
We carried out the following steps:

1. Defining the scope and specifications of the Content Manager (CM) and the user profile (UP).
2. Implementation of stand-alone models for the CM. The stand-alone model of the CM operates on simulated input from the UP and the content specification: the CM rates each object by measuring its “closeness” to the user profile, then tries to find a subset of objects that maximizes both this rating and the satisfaction of the author’s content specification.
3. Definition of the interface between the CM and the LM, and their integration.
4. Testing of the system:
   (a) Automatic simulation.
   (b) Experiments with real users.
   (c) Statistical analysis of the results.
1.4.2 Simulation phase
We have designed and developed a simulation for the CM. The simulation enabled us to test the properties of the system defined in the first phase and to evaluate the user’s satisfaction. The simulation model was implemented using the C++ programming language.
1.4.3 Experiments
We made experiments with real users, who supplied their user profile by ranking a tree of subjects, sub-subjects, and for each sub-subject an appropriate list of keywords. We built three newspapers for each user, of the following types:

1. Based completely on the user profile;
2. Based on the user profile as well as a “good” set of author constraints; and
3. Based on the user profile as well as a “bad” set of author constraints.

Each user scored the newspapers as well as each of the documents. We then analyzed the results of these experiments statistically.
Chapter 2
The Content Manager

[Figure 2.1: The input and the output of the Content Manager. Inputs: a bank of candidate articles, a set of constraints, a user profile, and K. Output: a list of chosen documents.]

Our system matches a personalized newspaper to each reader. Figure 2.1 demonstrates the input and the output of the system. The input of the system is:

1. A bank of candidate documents for the newspaper, which we will call documents.
2. A set of constraints on the selection of documents, provided by the author (editor) of the newspaper.
3. A user profile: the profile of preferences supplied by the reader.
4. A number K: the number of documents the newspaper should include.

The output of the system is a list of K documents for the newspaper, ranked according to the level of the reader’s interest. We now give a detailed description of the above items, and of the way the output is created.
2.1 Input

2.1.1 Documents
In our system, for each document we generate a document profile containing the following information about the document (see Figure 2.2):

1. date,
2. subject,
3. sub-subject,
4. serial number,
5. size (in bytes),
6. path to the original file (an HTML file),
7. path to the XML file parsed from the original file (there is a more detailed description in Section A.0.5),
8. a vector of keywords for the document and their weights (the computation of which is described in Section 2.1.2 below),
9. a vector of pointers to images (pictures: JPG or GIF files).

[Figure 2.2: The structure of the document profile]

Our system works with XML files, since the Layout Manager (Appendix A) can present only XML files. HTML files are parsed into XML, as described in Section B.2.
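The nine fields of the document profile listed above could be held in a record such as the following. This is only a sketch: the field names, the use of a dictionary for the keyword vector, and the example values (including the file paths) are hypothetical and do not describe the layout used in our C++ implementation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class DocumentProfile:
    """One record per candidate document; field names are illustrative."""
    published: date                 # 1. date
    subject: str                    # 2. subject
    sub_subject: str                # 3. sub-subject
    serial_number: int              # 4. serial number
    size_bytes: int                 # 5. size (in bytes)
    html_path: str                  # 6. path to the original HTML file
    xml_path: str                   # 7. path to the parsed XML file
    keywords: Dict[str, float] = field(default_factory=dict)  # 8. keyword -> weight
    image_paths: List[str] = field(default_factory=list)      # 9. JPG or GIF files


# Made-up example values, for illustration only.
doc = DocumentProfile(
    published=date(2001, 1, 15),
    subject="politics",
    sub_subject="elections",
    serial_number=42,
    size_bytes=5120,
    html_path="articles/42.html",
    xml_path="articles/42.xml",
    keywords={"election": 0.8, "budget": 0.6},
    image_paths=["articles/42_photo.jpg"],
)
print(doc.subject, doc.sub_subject, sorted(doc.keywords))
```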
2.1.2 Keywords and their score
The keywords and their scores were calculated for each document using the Extractor program [14] (for details, see Section B.1.1). After calculating the scores of the keywords, we normalize them according to the $l_2$ norm

\[ \|V\| = \sqrt{\sum_i V_i^2}. \]

Remark 2.1.1. It is common to use normalization in $l_1$ (to project the scores to the interval [0,1]) [9, p. 63]. This is achieved by dividing each keyword weight by the total sum of weights. In our case, such a normalization would be meaningless, since in the computation of similarity below, the vectors are renormalized in $l_2$.

[Figure 2.3: parsing to XML files. An HTML file is passed through PARSE to produce an XML file.]

The weights are used to calculate the level of interest of the user in the document. Suppose V is the vector of weights of the keywords in the user profile, and W is the vector of weights of the keywords in the document profile. The level of interest of the user in the document is then calculated using the following formula:

\[ \mathrm{Similarity}(\mathrm{user}, \mathrm{document}) = \frac{\langle V, W \rangle}{\|V\| \cdot \|W\|}. \]

This formula measures the cosine of the angle between the two vectors V and W.

Remark 2.1.2 (local optimization). The norm of a vector is a fixed value for each vector, therefore we can calculate it offline, while preparing the input for the Content Manager, and not during the run of the program. The mathematical reason allowing us to do this is that for each pair of vectors V and W we have:

\[ \frac{\langle V, W \rangle}{\|V\| \cdot \|W\|} = \left\langle \frac{V}{\|V\|}, \frac{W}{\|W\|} \right\rangle. \]

Therefore, if we set in the offline phase $\tilde{V} = V / \|V\|$ for each vector V, the calculation of the similarity amounts to

\[ \langle \tilde{V}, \tilde{W} \rangle, \]

which is faster by a factor of 3! This simple algebraic fact was not observed in earlier works.
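The similarity formula and the offline normalization of Remark 2.1.2 can be illustrated with a short sketch. It assumes that keyword vectors are stored as sparse keyword-to-weight dictionaries and that the similarity of a zero vector is defined as 0; both are assumptions of this example, as are the function names and the toy weights.

```python
import math
from typing import Dict

Vector = Dict[str, float]  # keyword -> weight (sparse representation, assumed)


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v.values()))


def dot(v: Vector, w: Vector) -> float:
    """Inner product <V, W>, iterating over the shorter vector."""
    if len(v) > len(w):
        v, w = w, v
    return sum(x * w.get(key, 0.0) for key, x in v.items())


def similarity(v: Vector, w: Vector) -> float:
    """Cosine similarity computed online: one inner product and two norms."""
    denom = l2_norm(v) * l2_norm(w)
    return dot(v, w) / denom if denom > 0.0 else 0.0  # zero-vector case assumed


def normalize(v: Vector) -> Vector:
    """Offline step: replace V by V / ||V||."""
    norm = l2_norm(v)
    return {key: x / norm for key, x in v.items()} if norm > 0.0 else dict(v)


user = {"election": 3.0, "budget": 4.0}
document = {"election": 1.0, "football": 1.0}

online = similarity(user, document)

# Offline phase: normalize once; online phase: a single inner product.
user_unit, document_unit = normalize(user), normalize(document)
offline = dot(user_unit, document_unit)

# Remark 2.1.1: an l1 rescaling of the document does not change the result.
total = sum(document.values())
document_l1 = {key: x / total for key, x in document.items()}

print(round(online, 6), round(offline, 6), round(similarity(user, document_l1), 6))
```

The online version evaluates three sums of products per pair of vectors (the inner product and the two norms), whereas the prenormalized version evaluates only one.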
2.1.3 Constraints
[Figure 2.4: Building the constraints. The author uses an interface for building constraints, whose output is a file of constraints.]

Hard constraints are constraints that any solution of the problem must satisfy. The satisfaction of soft constraints is not mandatory: an output may be considered a good solution to the problem even if it does not fully satisfy the soft constraints. The author supplies a set of soft constraints that should be satisfied. Each constraint has a weight, a measure of its importance, in the range 1-10. We allow constraints of the following types:
Constraints on subject and sub-subject.

1. At least a certain percent of the documents should be from a certain subject and sub-subject.
2. At least a certain number of documents from a certain subject and sub-subject should appear.
3. No more than a certain percent of the documents will be from a certain subject and sub-subject.
4. Only the specified number of documents will appear from a certain subject and sub-subject.
5. At least one document from each subject shall appear.
6. At least one document from each subject that the user is interested in shall appear.

Constraints on keywords.

1. At least a certain percent of the documents should include at least some given number of the keywords from a given list.
2. At least a certain number of documents should include some given number of the keywords from a given list.
3. No more than a certain percent of the documents will include a given number of the keywords from a given list.
4. Only the specified number of documents will appear with a given number of the keywords from a given list.

Constraints on the specifications of the document.

1. From a certain subject and sub-subject, only the documents not older than the specified number of days shall appear.
2. The documents that appear will not exceed the specified level of similarity.
3. The documents that appear will not go lower than a specified level of relevancy.
4. The documents that appear will not exceed the specified number of bytes.

The system we developed is universal in the sense that constraints can be changed, added or replaced according to the author’s wishes, without any need to change the other components of the system.
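One way to keep constraints interchangeable, in the spirit of the universality just described, is to represent each soft constraint as a weighted predicate over the chosen set. The sketch below shows two of the subject and sub-subject constraint types. The class and function names are hypothetical, and the weighted fraction computed by `weighted_satisfaction` is a simplified stand-in chosen for this example; it is not the evaluation function of our system.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, str]  # (subject, sub-subject); a simplification for this sketch


@dataclass
class SoftConstraint:
    description: str
    weight: int  # importance in the range 1-10
    is_satisfied: Callable[[Sequence[Doc]], bool]

    def __post_init__(self) -> None:
        if not 1 <= self.weight <= 10:
            raise ValueError("weight must be in the range 1-10")


def share(chosen: Sequence[Doc], subject: str, sub_subject: str) -> float:
    """Percent of the chosen documents that belong to the given sub-subject."""
    hits = sum(1 for doc in chosen if doc == (subject, sub_subject))
    return 100.0 * hits / len(chosen)


def at_least_percent(subject: str, sub_subject: str,
                     percent: float, weight: int) -> SoftConstraint:
    return SoftConstraint(
        f"at least {percent}% from {subject}/{sub_subject}", weight,
        lambda chosen: share(chosen, subject, sub_subject) >= percent)


def at_most_percent(subject: str, sub_subject: str,
                    percent: float, weight: int) -> SoftConstraint:
    return SoftConstraint(
        f"no more than {percent}% from {subject}/{sub_subject}", weight,
        lambda chosen: share(chosen, subject, sub_subject) <= percent)


def weighted_satisfaction(constraints: List[SoftConstraint],
                          chosen: Sequence[Doc]) -> float:
    """Illustrative score: weight of satisfied constraints over total weight."""
    total = sum(c.weight for c in constraints)
    met = sum(c.weight for c in constraints if c.is_satisfied(chosen))
    return met / total


chosen = [("sports", "football"), ("sports", "football"),
          ("politics", "elections"), ("science", "physics")]
constraints = [
    at_least_percent("politics", "elections", 25, weight=8),  # 25% -> satisfied
    at_most_percent("sports", "football", 25, weight=2),      # 50% -> violated
]
constraint_score = weighted_satisfaction(constraints, chosen)
print(constraint_score)  # 0.8
```

Adding a new constraint type then amounts to writing one more factory function, without touching the code that evaluates a set of documents.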
2.1.4 The User Profile
[Figure 2.5: Building a user profile. An interface for building a user profile produces a user profile as its output.]

The user profile is a hierarchy representing the reader’s preferences. Each user is assigned a unique user profile according to a questionnaire describing his or her interests. (Here we could also use any of the other methods for building a user profile; see Section 1.2.2.) We use a hierarchic structure of subjects and sub-subjects: each subject has several sub-subjects, and each sub-subject has