Evolving Systems DOI 10.1007/s12530-013-9096-3

REVIEW

Online and interactive self-adaptive learning of user profile using incremental evolutionary algorithms

Abdelhamid Bouchachia · Arthur Lena · Charlie Vanaret



Received: 16 April 2013 / Accepted: 10 September 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract In this contribution, we explore the application of evolutionary algorithms for information filtering. There are two crucial issues we consider in this study: (1) the generation of the user’s profile which is the central task of any information filtering or routing system; (2) self-adaptation and self-evolving of the user’s profile given the dynamic nature of information filtering. Basically the problem is to find the set of weighted terms that best describe the interests of the user. Thus, the problem of user profile generation can be perceived as an optimization problem. Moreover, because the user’s interests are obtained implicitly and continuously over time from the relevance feedback of the user, the optimization process must be incremental and interactive. To meet these requirements, an incremental evolutionary algorithm that updates the profile over time as new feedback becomes available is introduced. New genetic operators (crossover and mutation) fitting the application at hand are proposed. Moreover, methods for feature selection, incremental update of the profile and multiprofiling are devised. The experimental investigations show that the proposed approach including the individual methods for the different aspects is suitable and provides high performance rates on real-world data sets.

A. Bouchachia (✉)
School of Design, Engineering and Computing, Bournemouth University, Dorset, UK
e-mail: [email protected]

A. Lena · C. Vanaret
University of Toulouse, Toulouse, France
e-mail: [email protected]

C. Vanaret
e-mail: [email protected]

Keywords Online optimization · Incremental evolutionary algorithms · User profile learning · Information filtering · Self-adaptation · Self-evolving

1 Introduction

1.1 Context of the work

Over recent years, an increasing amount of information in different forms, particularly in textual form, has been published and delivered online. The challenge is then how to facilitate access to such information, which is usually highly unstructured and dynamic in nature. Although incremental classification and clustering techniques can be applied to deal with these dynamics, information seekers still need to be actively involved in expressing their needs regularly through queries. Another method, dedicated to long-term information retrieval, is information filtering¹. In particular, text filtering (TF) aims at reducing the user's search burden. It deals with more dynamic and unstructured sources of texts, while the user's information need, known as the user's profile, is specific and stable (Voorhees and Harman 2005). Information retrieval (IR), on the other hand, is the process of identifying all and ideally only those documents, assumed to be stable and unstructured, in response to a dynamic and specific user's information need expressed in terms of a query. Thus, the tasks of text filtering and retrieval are complementary with respect to aspects like stability, the structure of the document repository and the user's queries. In the context of information retrieval, an IR system relies on the notion of relevance feedback to improve the retrieval

¹ In the context of this study, by "information filtering" (IF) we refer to "text filtering" (TF), and we will use the two terms interchangeably.



accuracy. Relevance feedback is the task of reformulating the user's query according to the feedback provided by the user after examining the retrieval results delivered by the system in response to a previous version of the query. Being a learning and adaptation mechanism, relevance feedback aims at improving the query representation so that it better reflects the needs of the user. It has been shown that equipping an IR system with relevance feedback improves the system's performance (Lv and Zhai 2009; Pickens et al. 2010; Algarni et al. 2010). This idea of feedback can be applied in the framework of TF too. The goal is then to learn the profile of the user based on the queries he formulates. Various computational models have been used to build filtering systems, for instance neural networks (Kuflik et al. 2006), rule-based algorithms (Hannani et al. 2001), Bayesian/inference networks (Schiaffino and Amandi 2000), case bases (Schiaffino and Amandi 2000), genetic and immune algorithms (Nanas et al. 2010; Boughanem et al. 1999), and reinforcement learning techniques (Tebri et al. 2005). Because of the learning and incremental adaptation mechanisms needed by TF, we are interested in applying optimizing search techniques that deal with dynamic environments like that of TF. For this purpose, we propose in this paper an online self-adaptive profile learning agent based on an incremental genetic algorithm. The agent should be able to continuously tune the profiles of users in an online manner. As new documents are tagged by the users as relevant (or possibly irrelevant), the learning agent adapts the profiles in light of the new evidence. Using evolutionary algorithms for incremental TF can be considered an instance of dynamic optimization (DO). Basically, DO targets optimization problems in which the locations of the global and local optima of a state space change over time.
Among the real-world applications studied in the context of DO are scheduling, fault diagnosis, vehicle routing, inventory management and control. The application of evolutionary optimization in dynamic environments is motivated mainly by two goals:

– Incrementally adapt the current solution when changes in the environment occur. Change refers to the fact that the location of the optimum moves deterministically or stochastically during the optimization process.
– In noisy environments (often known as noisy fitness problems), search for robust solutions that are insensitive to change.

The authors in (Kyamakya et al. 2010) classified dynamic optimization problems with a moving optimum as follows:

– The location of the optimum moves linearly in parameter space with time
– The location of the optimum moves nonlinearly in parameter space with time






– The location of the optimum oscillates periodically and deterministically among a given number of points in parameter space
– The location of the optimum moves randomly in parameter space with time

Recently, comprehensive coverage of evolutionary optimization in dynamic environments has been provided. There exist mainly three classes of dynamic function optimization approaches (Cruz et al. 2011; Jin and Branke 2005; Woldesenbet and Yen 2009) to help track and face change in the environment:

– Use of change in the fitness (known as mutation and self-adaptation approaches)
– Use of population diversity (known as multiple population and diversity preserving approaches)
– Use of good past solutions (known as memory-based approaches)

For an extensive overview of DO methods, the reader may refer to recent reviews in (Cruz et al. 2011; Jin and Branke 2005; Woldesenbet and Yen 2009). The approach presented in this paper to handle the problem of incremental user profile learning can be considered a memory-based approach, since whenever a new batch of data arrives, the previous profile(s), i.e. the up-to-date best solution(s) obtained so far, is subject to an online update. Specifically, if the accuracy of the IF system (or agent) when using the current active profile against the new documents is less than a threshold, the agent generates a new profile from scratch based only on the new relevant documents. If its accuracy is higher than the given threshold, the current profile is updated by injecting the new documents as new chromosomes, removing the worst chromosomes and keeping the best chromosomes in the population (which stands for the memory). Of course, if the user does not provide any feedback to the system, it is assumed that the current profile is a satisfying one and it thus remains unchanged.

1.2 Contributions

In this study we propose and investigate the following:

– Incremental online approach for information filtering: the state of the art shows that only a few online learning methods have been applied in the context of information filtering (Kapp et al. 2011; Reitter and Lebiere 2012; Tebri et al. 2005; Yang et al. 2005). Most of the literature relates to offline learning methods, according to which the development is a one-shot experiment: once the learning stage has been exhausted, the filtering system can no longer be adapted unless it is retrained from scratch. In the present work, we pursue an evolutionary rather than a revolutionary approach (Bouchachia 2009, 2011). According to the evolutionary strategy, the system should enable learning to take place continuously over long periods of time as new data become available (in the form of feedback from the user after sending queries to the system). The difficulty of this approach is that learning is not a priori biased by any structural or statistical knowledge about the incoming data. In dynamically changing environments, the challenge is even more crucial, since the system may change drastically over time due to the problem of concept drift (Sahel et al. 2007). In a nutshell, we envision the following generic incremental methodology:

1. Learn knowledge from existing data
2. Store knowledge, discard data
3. Use knowledge to predict
4. When new data arrive, learn new knowledge using old knowledge and the new examples
5. Go to step 2
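As an illustration, the five-step methodology above can be sketched in code. The `IncrementalModel` class and its simple term-counting scheme are illustrative assumptions, not the evolutionary system developed later in the paper:

```python
# Sketch of the generic incremental methodology (steps 1-5 above).
# The model keeps only learned knowledge (term weights) and discards
# the raw documents after each batch.

class IncrementalModel:
    def __init__(self):
        self.weights = {}                    # step 2: stored knowledge

    def learn(self, documents):
        # Steps 1 and 4: learn new knowledge using the old knowledge
        # (existing weights) and the new examples; then discard the data.
        for doc in documents:
            for term in doc:
                self.weights[term] = self.weights.get(term, 0.0) + 1.0
        total = sum(self.weights.values())   # normalize across batches
        self.weights = {t: w / total for t, w in self.weights.items()}

    def predict(self, doc):
        # Step 3: score a new document against the stored knowledge.
        return sum(self.weights.get(term, 0.0) for term in doc)

model = IncrementalModel()
model.learn([["ga", "profile"], ["ga", "filtering"]])   # initial batch
s1 = model.predict(["ga"])
model.learn([["profile", "filtering"]])                  # new batch arrives
s2 = model.predict(["ga"])
```

Note that each call to `learn` updates the stored weights in place, so earlier batches influence later predictions without being retained.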

In this sense, the system becomes self-adaptive and its underlying computational model becomes self-corrective.
– Proposal of an incremental evolutionary approach: since the problem of constructing profiles can be transformed into a search problem, evolutionary algorithms seem to be a suitable approach. The problem with conventional evolutionary algorithms is that they follow a progressive but offline method to find a solution to the problem at hand. This means that the evolutionary algorithm requires a full description of the problem being solved. In contrast, the notion of incrementality considered here is to enable the evolutionary algorithm to operate in an online context. Hence, the aim is to apply an evolving evolutionary approach. To this end, we propose IGA, an incremental genetic algorithm capable of coping with new data, as an instance of the memory-based class of dynamic optimization.
– Multi-profiling: very often a user is interested in many topics, such as sport, computers, cars, etc. To capture such diversity, we suggest in this work a multi-profiling strategy that generates several distinct profiles for the same user. The goal is to enhance the accuracy of the filtering system.
– Broad evaluation: the approach presented here is evaluated from different perspectives to demonstrate its high performance. In particular, the approach is tested in an online setting, allowing the system to generate and adapt profiles continuously as new evidence is obtained. Moreover, a sensitivity analysis is conducted to show the impact of the various parameters on the system.

1.3 Structure of the paper

In the following sections, the problem of information filtering is outlined (Sect. 2). Next, the proposed incremental genetic algorithm is presented (Sect. 3). In Sect. 4, we show how the incremental genetic algorithm is actually applied to profile generation in the framework of information filtering. Section 5 introduces the notion of multi-profiling. Section 6 provides a detailed empirical evaluation of the approach, shedding light particularly on the efficiency of IGA and the multi-profile problem. Finally, Sect. 7 concludes the paper.

2 Information filtering and profile generation

A filtering system deals with the specific interests of the user. It must select those arriving documents deemed interesting to the user and ignore all the rest. Due to known difficulties in the representation of documents, in understanding the user's interests, and in the matching process between the two, a filtering system might not be able to perfectly distinguish the documents that are actually of interest to the user (relevant) from those which are not. The aim is then to minimize both the proportion of irrelevant documents sent to the user and that of relevant documents ignored. To ease the task of a filtering system, the users could assist it, at least during a preliminary phase called the learning phase, by indicating some documents or theme (topic) descriptors that reflect their interests. After constructing such a profile, the system will be able to automatically filter the stream of newly arriving documents and send the user those that have a strong similarity with the profile. Note that this process of filtering may be viewed as a binary classification problem where each new document has to be classified into one of two classes: relevant and not relevant. Depending on the way the documents are selected, IF systems can be classified into three categories. Content-based systems select documents based on their (semantic) content. Social systems select documents based on the recommendations of other users (collaborative filtering) (Ricci et al. 2011). Economic systems select documents on a cost-benefit basis, e.g. based on marketing prices. Our primary interest in this study is the first category. A text filtering system relies on three basic elements:





– A representation model for the user's interests, expertise and behavior (the profile). In particular, the model indicates which features accurately describe the user's profile.
– A representation model for the items to be filtered.
– A matching function that measures the similarity of the items against the user's profile.

Filtering as an activity of selecting relevant documents according to some user's interests can be performed in different ways, as defined in the 2002 TREC conference:

– Adaptive filtering: the system starts with a set of topics, a document stream, and a small number of examples of relevant documents. For each topic a profile is created, and incoming documents are assigned to the closest profiles. This filtering is called adaptive because, whenever a document with a known relevance judgement is retrieved for a profile, the system uses this information to update the profile.
– Batch filtering: the system starts with a set of topics and a set of labelled training documents for each topic. A filtering profile must be created for each topic, and a binary classification rule to be applied to an incoming stream of new documents must be induced. The final output is an unranked set of documents for each topic.
– Routing: similar to batch filtering, except that for each topic the system creates a routing profile which assigns retrieval scores to the incoming documents. The output is a fixed number of top-ranked documents for each topic.

The basic problem in information filtering systems is to compute the profile. Formally, we can provide the following definitions:

Definition 1 (Description language T): Given a language L, the description language T is a sub-language of L (T ⊆ L) defined as a finite set of terms used to represent the semantic content of documents. In addition to single terms, this language might be extended to also include phrases, word proximity pairs, etc.

Definition 2 (Profile): Given a user U, a profile P of U is a vector of terms t_i, where each term is represented by a weight w_i reflecting the contribution of the corresponding term in defining that profile. Thus, a profile has the following structure:

P = ⟨t_i, w_i⟩, i = 1…M   (1)

where M is the size (or cardinality) of the description language T.

Once the profiles are generated, the filtering system classifies incoming documents d_i using, for instance, a similarity measure φ(P, d_i). Note that documents have the same representation as that of the profile (see Definition 2). Because the profile plays the same role as a query from the perspective of information retrieval, with the difference that a profile is repeatedly updated over a long time, filtering can be defined as a retrieval operation. The filtering process consists of finding documents which are similar to the profile and selecting the top-scoring documents for presentation.

Definition 3 (Filtering): Let P ∈ 𝒫 be a user's profile and ρ a real threshold value (ρ ∈ [0, 1]). The set D′ of retrieved documents d_i from a collection D is defined as follows:

D′ = {d_i ∈ D | φ(P, d_i) > ρ}

The crucial problem in information filtering is therefore to find the profile. In the present paper, we use meta-heuristics to optimize the profile of the users in an online and dynamic way.

Definition 4 (Adaptive profiling): An adaptive profile is obtained by incremental update as new data become available, following these operations (see Fig. 1):

1. Filter: when new documents {d_i} arrive, those satisfying φ(P, d_i) > ρ are returned to the user.
2. Label: the user identifies the returned documents that are relevant, as a list L.
3. Update: the profile is updated using the documents on the list L. The goal is to identify an accurate representation of the profile such that all relevant documents in L are reflected by the profile and appear at the top of the displayed documents if {d_i} are presented to the filtering system again.

Formally, we can formulate the problem as follows. Assume we have N documents in the collection D and the system returns at time t a set of documents D′ ⊆ D of size N′ (N′ ≤ N) to the user, who judges their relevance:
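Definition 3 can be illustrated with a short sketch. The cosine similarity used here as φ is an assumption for illustration only; the paper itself adopts the Okapi measure (Sect. 4.2):

```python
# Sketch of Definition 3: D' = {d in D | phi(P, d) > rho}.
import math

def phi(profile, doc):
    """Cosine similarity between two {term: weight} vectors (assumed phi)."""
    common = set(profile) & set(doc)
    num = sum(profile[t] * doc[t] for t in common)
    den = (math.sqrt(sum(w * w for w in profile.values()))
           * math.sqrt(sum(w * w for w in doc.values())))
    return num / den if den else 0.0

def filter_documents(profile, docs, rho=0.3):
    """Return the documents whose similarity to the profile exceeds rho."""
    return [d for d in docs if phi(profile, d) > rho]

profile = {"genetic": 0.8, "profile": 0.6}
docs = [{"genetic": 1.0, "profile": 1.0},   # similar to the profile
        {"cooking": 1.0, "recipe": 1.0}]    # dissimilar
selected = filter_documents(profile, docs)
```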


Fig. 1 Process of profile update


p_i = 1 if document d_i is judged relevant, and p_i = 0 otherwise, for i = 1…N′.   (2)

Let ψ be a utility function that ranks its second argument based on its similarity with the first argument (the current profile). Given the current profile P = [w_i], i = 1…T, the new profile P_new is computed as follows:

P_new = argmax_{P ∈ 𝒫} { Σ_{d_i ∈ D′} ψ(P, d_i) · p_i }   (3)
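A minimal sketch of the selection rule in Eq. (3), where the dot-product utility ψ is an assumed stand-in for the paper's ranking function:

```python
# Sketch of Eq. (3): among candidate profiles, pick the one maximizing
# sum_i psi(P, d_i) * p_i over the returned documents, where p_i is the
# user's binary relevance judgement (Eq. 2).

def psi(profile, doc):
    """Assumed utility: dot product between profile and document vectors."""
    return sum(w * doc.get(t, 0.0) for t, w in profile.items())

def best_profile(candidates, docs, judgements):
    """Return argmax_P sum_i psi(P, d_i) * p_i."""
    return max(candidates,
               key=lambda P: sum(psi(P, d) * p
                                 for d, p in zip(docs, judgements)))

docs = [{"ga": 1.0}, {"cars": 1.0}]
judgements = [1, 0]                       # only the first document is relevant
candidates = [{"ga": 0.9}, {"cars": 0.9}]
P_new = best_profile(candidates, docs, judgements)
```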

The goal is to find the representation of the profile that best ranks the relevant documents, by updating the current profile representation. In general, to describe the user's profile there are mainly two classes of methods: traditional information retrieval methods and feature selection methods from the machine learning literature. The former class includes Rocchio's formulation (Voorhees and Harman 2005), which has attracted extensive attention (Schapire et al. 1998; Algarni et al. 2010; Singhal et al. 1997; Efron 2008) despite some doubt raised in a number of studies (Fan et al. 2005; Schütze et al. 1995; Tebri et al. 2005), the Robertson selection value (RSV) (Robertson 1986), and pivoted document and relevance correlation (DRC) (Singhal et al. 1996). The machine learning class contains latent semantic indexing (Dumais et al. 1988), χ² (Ng et al. 1999) and information gain (Yang and Pedersen 1997). In addition to the structure of the profiles in terms of indexing words, the notion of matching also plays a central role. Given a new document and the query (profile), measuring the similarity between them may involve various matching functions, such as those of the Okapi system (Robertson et al. 1996), the INQUERY system (Callan et al. 1992), the SMART system (Xu and Croft 1996), and pivoted TFIDF (Singhal et al. 1997). Other matching functions are also possible, including the Cosine, Jaccard and Dice measures. To understand the strength of each of these representation and matching methods, many comparative studies have been conducted, as in (Fan et al. 2005; Schütze et al. 1995; Tebri et al. 2005; Yang and Pedersen 1997). Some of these studies view filtering and routing as a binary classification problem (relevance vs. irrelevance). The main outcomes of these studies are that:

– Rocchio's remains one of the most successful formulations of relevance feedback modeling (Singhal et al. 1997), and all comparative studies seem to consider it a baseline.
– Both relevant and irrelevant documents are important for learning the user profile (Singhal et al. 1997).
– Classification algorithms outperform Rocchio's formulation (Schütze et al. 1995).
– Reinforcement learning performs better than Rocchio's formulation (Tebri et al. 2005).
– RSV combined with the query term frequency indexing method offers better results compared with several other combined indexing and matching functions.

Note that genetic programming has been applied mainly to design matching (ranking) functions (Borji and Jahromi 2008; Fan et al. 2005; Yeh et al. 2007). The effects of different fitness functions used in GP have also been examined.

3 Incremental genetic algorithms

Genetic algorithms (GAs) purport to mimic the behavior of natural selection (Goldberg 1989) based on Darwin's theory. They have been successfully applied to many complex applications involving hard optimization and search problems. GAs start off with a population of randomly generated individuals (chromosomes), each representing a candidate solution to the optimization problem at hand. By applying genetic operators to the chromosomes in an iterative process relying on fitness quantification, selection, combination and mutation, the average fitness of the population improves over time, yielding better solutions to the given problem. Given the dynamic context of information filtering, it is more adequate to adopt a version of the GA capable of incrementally learning the user profile as new documents arrive over time. In such an adaptive version, the GA is not required to retain the documents seen in previous optimization sessions, hence the name incremental GA (IGA). The algorithm proposed here is capable of handling concept drift, since users may change their interests over time; it is therefore very important to take this aspect into consideration. From this perspective, IGA mimics natural evolution more closely, in the sense that living organisms evolve their structure and personality over time as they face new experiences and new living conditions. The steps of IGA are shown in Algorithm 1.
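Algorithm 1 itself is not reproduced in this extraction. Based on the memory-based update policy described in Sect. 1 (threshold test on the current profile, injection of new documents as chromosomes, retention of the best chromosomes), one incremental step might be sketched as follows; the threshold value and all function names are assumptions:

```python
def iga_step(population, new_docs, fitness, threshold=0.5, n_keep=None):
    """One incremental step of an IGA-style memory-based update (sketch).

    If the current best chromosome still performs well on the new batch,
    inject the new documents as chromosomes and drop the worst ones (the
    retained best chromosomes act as the memory); otherwise restart the
    population from the new relevant documents only.
    """
    best = max(population, key=fitness)
    if fitness(best) < threshold:
        return list(new_docs)                  # regenerate from scratch
    merged = population + list(new_docs)
    merged.sort(key=fitness, reverse=True)
    n_keep = n_keep or len(population)
    return merged[:n_keep]                     # keep only the fittest

# Toy run: chromosomes are plain numbers and fitness is the identity.
pop = [0.2, 0.6, 0.4]
pop = iga_step(pop, [0.9, 0.1], fitness=lambda c: c)
```

In the paper, a chromosome is of course a weighted term vector and the fitness is a retrieval measure (Sect. 4.2); the toy fitness here only demonstrates the control flow.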



4 Incremental genetic algorithms for adaptive profile generation

As defined earlier, a profile is a set of terms that describes the likes (and possibly the dislikes) of a given user. The whole task in profile learning is then reduced to the problem of identifying the best set of terms that reflect the interests of the user. As proposed in this study, this in turn may be viewed as an optimization problem that can be described as the process of finding the

Fig. 2 Information filtering and retrieval


optimal (i.e. best) subset of terms, among all possible sets of terms, that best reflects the user's profile. This is simply a search problem that needs an optimizing search technique. However, the search operation must be done online and in an incremental way. Figure 2 shows how information retrieval and filtering can be integrated to enhance the level of satisfaction of users seeking online information. While the IR agent retrieves the results based on the query entered by the user on some search engine, the learning


agent maintains the profile of the user using both the user's queries and the user's feedback when interacting with the system. It is also interesting to note that the process of profile generation is incremental and adaptive, since the task is not a one-shot experiment but recurrent and lifelong. Specifically, the representation of the profile is updated incrementally according to the relevance information on retrieved documents returned by the user. When the filtering procedure is triggered (new documents become available), the last version of the profile is compared against the incoming documents, and those documents judged relevant (similar to the profile) are presented to the user. To apply GAs to profile generation, we need to identify the following central elements, which are in general application-dependent:

– Representation of the problem: how should solutions be encoded as chromosomes?
– Initialization: how is the initial population created?
– Optimization (fitness) function: how are chromosomes qualitatively evaluated?
– Genetic operators: how will chromosomes be altered and composed?
– Parameters of the algorithm: which values are set for the population size, the various probabilities, the stopping criteria, etc.?

In the sequel, details on the systematic application of genetic algorithms to profile generation are introduced.

4.1 Representation

An individual of the population is to be considered as one possible profile. An individual, or chromosome, consists of a set of genes; each gene represents a term. To determine the list of terms, we rely on a non-controlled vocabulary. In other words, the terms forming the vocabulary (or the index) are identified automatically using a natural language parser that consists of a set of steps well known in the framework of information retrieval, namely:

1. Tagging: the tagger assigns grammatical tags to words. Only words tagged as verbs or nouns are included in the representation space; such words constitute the profile vocabulary.
2. Stemming: the stemmer reduces the words to their roots (mainly by eliminating the endings of the words). It aims at making the similarity matching well focused through the condensation of words, thus reducing the size of the representation space.
3. Tokenization: this step separates words. Each sentence is transformed into a list where each element is a two-field record: the first field indicates the stemmed word, whereas the second represents the tag of that word.
4. Filtering: based on the tagging information, only verbs and nouns are selected in a first stage. In a second stage, further undesirable words are removed. Hence, only meaningful words are retained.
5. Weighing: the aim here is to assign weights to the retained terms based on some weighing formula.
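The indexing steps above can be sketched as a toy pipeline; the tiny tag dictionary, suffix-stripping stemmer and stop list below are crude stand-ins for a real natural language parser, used only to show the flow tagging → stemming → tokenization → filtering:

```python
import re

# Assumed toy resources: a real system would use a POS tagger, a proper
# stemmer and a full stop-word list.
NOUN_VERB = {"filtering": "NOUN", "profiles": "NOUN", "learn": "VERB",
             "learning": "VERB", "the": "DET", "quickly": "ADV"}
STOPWORDS = {"thing"}

def stem(word):
    # Crude suffix stripping in place of a real stemmer.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())           # tokenization
    tagged = [(w, NOUN_VERB.get(w, "OTHER")) for w in tokens]  # tagging
    return [(stem(w), t) for w, t in tagged                    # stemming +
            if t in ("NOUN", "VERB") and w not in STOPWORDS]   # filtering

records = preprocess("The learning profiles quickly")
```

Each record is the two-field (stemmed word, tag) pair described in the tokenization step.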

In this paper, the weighing step takes two important aspects into account: (1) the selection of terms and (2) the initialization of the terms' weights. To achieve this twofold goal, we rely on the Robertson selection value (RSV). This weighing method selects terms based on their occurrence in the relevant and irrelevant documents returned upon presentation of a query. The top S terms with the highest RSV values in the relevant documents are used to create a vocabulary for the GA's population. RSV is expressed as follows:

RSV_i = (p_i − q_i) · RW_i   (4)

where p_i = P(w_i = 1 | R) is the probability of the presence of term i in relevant documents, q_i = P(w_i = 1 | R̄) is the probability of the presence of term i in non-relevant documents, and

RW_i = log [ p_i (1 − q_i) / (q_i (1 − p_i)) ]   (5)

Clearly, RW_i can also be used for initializing the weights of the terms. It is, however, important to note that the weights will then evolve through the GA generations. Moreover, to increase the diversity of the profile candidates, we also use a second selection mechanism, namely the Jaccard similarity measure. The idea here is to compute the correlation between the distribution of the terms in the set of all documents and that in the relevant documents only. Those terms for which the similarity is high are weighted, again using the RSV method. The Jaccard coefficient is given as:

J_i = Jac(t_i, D) = Σ_{j=1}^{N} e_ij δ_j / (Σ_{j=1}^{N} e_ij² + Σ_{j=1}^{N} δ_j² − Σ_{j=1}^{N} e_ij δ_j)   (6)

where

e_ij = 1 if term t_i appears in document d_j, and 0 otherwise   (7)

D is the set of documents returned by the system, of size N, and

δ_j = 1 if document d_j is relevant, and 0 otherwise   (8)

Note that selecting, say, the first 100 terms using both RSV and the Jaccard coefficient may not give the same set of terms. Hence, the lists of terms resulting from the two selection methods are merged. The goal of this double selection is to increase the quality of the indexing operation on the one hand, and on the other hand to reduce, for efficiency purposes, the size of the vocabulary used to construct the profiles while ensuring some freedom in its extension.

4.1.1 Profile encoding

Now that the representation space is defined, we have to look at the chromosome representation. Each gene represents a term and contains the weight of that term. Hence, an individual is viewed as P_k = ⟨w_1, w_2, …, w_S⟩, where S is the number of terms selected and weighted (w_i) according to the aforementioned procedure. A GA needs an initial population of individuals to start with, and there are various ways to initialize the population. A first option consists of generating the individuals by random initialization based on the first user query. The alternative, which we adopt, is instead constructive and consists of the following steps: the first individual is the first user query submitted to the retrieval system; the rest of the initial population is generated based on the relevance information of the retrieved documents. Further details follow in Sect. 4.3.

4.2 Fitness function

The fitness function serves to quantify the quality of the individuals in the current population using the notion of relevance. That is, the accuracy of a profile is measured as a function of the number of relevant documents returned. In the retrieval process, the documents closest to the current profile are found using a similarity measure. Known measures are the inner product, the Dice coefficient, the cosine coefficient and the Jaccard coefficient, but also the specific similarity measures used in systems like Okapi, INQUERY and SMART. Based on the results in (Fan et al. 2005) comparing these similarity measures, we adopt the Okapi measure, which is expressed as follows:
Sim(p, d) = Σ_{i∈p} [ 3 · tf_i / (0.5 + 1.5 · (l / l_avg) + tf_i) ] · QTW_i · log [ (N − df_i + 0.5) / (df_i + 0.5) ]   (9)

where p indicates a profile, tf_i is the term frequency of the ith word, N is the total number of documents in the collection, df_i is the number of documents in the collection in which the ith term is present, l is the length of the document, l_avg is the average document length in the collection, and QTW_i is the weight of the term in the profile.
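The RSV weight (Eqs. 4, 5) and the Okapi score (Eq. 9) might be sketched as follows; the 0.5 smoothing added to the probability estimates in `rsv` is an assumption to avoid degenerate logarithms and is not specified in the text:

```python
import math

def rsv(term, relevant, non_relevant):
    """RSV_i = (p_i - q_i) * RW_i, RW_i = log(p_i(1-q_i)/(q_i(1-p_i)))."""
    p = (sum(term in d for d in relevant) + 0.5) / (len(relevant) + 1)
    q = (sum(term in d for d in non_relevant) + 0.5) / (len(non_relevant) + 1)
    rw = math.log(p * (1 - q) / (q * (1 - p)))
    return (p - q) * rw

def okapi(profile, doc, collection, l_avg):
    """Eq. 9: sum over profile terms of tf factor * QTW * idf factor."""
    n, l = len(collection), len(doc)
    score = 0.0
    for term, qtw in profile.items():
        tf = doc.count(term)
        if tf == 0:
            continue
        df = sum(term in d for d in collection)
        tf_factor = 3 * tf / (0.5 + 1.5 * l / l_avg + tf)
        idf_factor = math.log((n - df + 0.5) / (df + 0.5))
        score += tf_factor * qtw * idf_factor
    return score

w = rsv("ga", [{"ga"}, {"ga"}], [{"x"}])          # term frequent in relevant docs
score = okapi({"ga": 1.0}, ["ga", "ga", "x"],
              [["ga", "ga", "x"], ["y"], ["z"], ["w"]], l_avg=2.0)
```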


If the similarity exceeds a specific threshold value, the document is marked as relevant. Once all documents have been processed, some performance measure serving as the fitness function is computed, such as precision or the F-measure (van Rijsbergen 1979). When using precision as a fitness function, we measure the proportion of relevant documents within the fixed top-ranked retrieved documents. The fitness of a chromosome in the population increases as the number of relevant documents retrieved with respect to that chromosome increases. The F-measure, on the other hand, is an averaging measure that combines recall and precision; one can place importance on either of them by adjusting a weighing coefficient. However, in the context of filtering, the average precision P_avg is often used. It is given as follows:

P_avg = (1 / |R|) Σ_{i=1}^{|R|} P_i   (10)

where P_i = i / Rank_i, R is the set of relevant documents for a given query and Rank_i is the ranking position of the ith relevant document.
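Eq. (10) is a direct transcription into code:

```python
# Average precision (Eq. 10): P_i = i / Rank_i, averaged over the set
# of relevant documents of a query.

def average_precision(ranked, relevant):
    """`ranked` is the system's ordered list of doc ids; `relevant` a set."""
    precisions = []
    hits = 0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)   # P_i = i / Rank_i
    return sum(precisions) / len(relevant) if relevant else 0.0

p = average_precision(["d1", "d3", "d2"], {"d1", "d2"})
```

Here the two relevant documents appear at ranks 1 and 3, giving (1/1 + 2/3) / 2.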

4.3 Genetic algorithm at work

The proposed IGA relies on the traditional GA. However, it differs from the algorithms mentioned earlier in Sect. 1 in various aspects, such as initialization, crossover and mutation. In the following, each of these operations is briefly described:

– Initialization: in the first run of the algorithm, as a new query arrives, the system retrieves documents in ranked order. The user then provides relevance information by categorizing the documents into relevant and non-relevant ones. The initial population consists of the set of documents marked as relevant.

– Selection: we rely on a linear rank selection method, according to which individuals are ranked based on their fitness values. Each individual is then assigned a selection probability depending on its rank in the population (the higher the rank, the higher the probability of being selected), given as:

f_i = (N − r_i − 1)(max − min) / (N − 1) + min   (11)

where N is the population size, r_i is the rank of the ith individual, and max and min are the fitness values assigned to the best and worst individuals, respectively.

Recombination:

A specific combination operator is proposed here, generating a single offspring. We define the crossover operator as follows: the weight of a term in the offspring is the maximum of the values of that term in the parents whenever the weight in a parent reaches the average weight of that term in the set of relevant documents; if the weights of the term in both parents are below that average, the minimum is assigned to the offspring. We model the maximum and minimum using fuzzy logic operations, namely the t-norms and t-conorms. Since the weights all lie in the unit interval [0,1], one can see each parent as a fuzzy set whose elements are the terms. Let parent_1 = ⟨w_11, w_12, ..., w_1M⟩ and parent_2 = ⟨w_21, w_22, ..., w_2M⟩ be the parents subject to crossover. One offspring is generated using the probabilistic norms as follows:

O(w_k) = max(w_1k, w_2k)  if ∃i such that w_ik ≥ (1/|R|) Σ_{d_j∈R} w_jk
O(w_k) = min(w_1k, w_2k)  otherwise    (12)

where R indicates the set of relevant documents obtained by submitting the parents as queries, and the max and min operations over the weights, visualized in Fig. 3, are defined as follows:

max(w_1k, w_2k) = w_1k + w_2k − w_1k · w_2k
min(w_1k, w_2k) = w_1k · w_2k    (13)

[Fig. 3 Computing the genes of the new offspring: (a) w_1k · w_2k, (b) w_1k + w_2k − w_1k · w_2k]

– Mutation:

The gene to be mutated is assigned the average weight of the corresponding term in the set of relevant documents as follows:

Offspring(w_k) = (1/|R|) Σ_{d_j∈R} w_jk    (14)
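The selection, crossover, and mutation operators above can be sketched in Python as follows. This is a minimal illustration under our own assumptions (function names, list-of-weights encoding, and a 0-based rank convention where rank 0 is the best individual); the crossover follows the existential condition of Eq. 12.

```python
import random

def rank_fitness(n, rank, f_max=1.0, f_min=0.0):
    """Linear rank value (Eq. 11); 0-based ranks, rank 0 being the best."""
    return (n - rank - 1) * (f_max - f_min) / (n - 1) + f_min

def select_parent(population, fitnesses):
    """Linear rank selection: probability proportional to the rank value."""
    n = len(population)
    order = sorted(range(n), key=lambda i: fitnesses[i], reverse=True)
    weights = [rank_fitness(n, r) for r in range(n)]
    idx = random.choices(range(n), weights=weights, k=1)[0]
    return population[order[idx]]

def crossover(parent1, parent2, avg_weights):
    """Probabilistic-norm crossover (Eqs. 12-13) producing one offspring.
    avg_weights[k] is the average weight of term k over the relevant set R."""
    offspring = []
    for w1, w2, avg in zip(parent1, parent2, avg_weights):
        if w1 >= avg or w2 >= avg:
            offspring.append(w1 + w2 - w1 * w2)   # probabilistic t-conorm ("max")
        else:
            offspring.append(w1 * w2)             # probabilistic t-norm ("min")
    return offspring

def mutate(offspring, k, avg_weights):
    """Mutation (Eq. 14): gene k takes the term's average relevant weight."""
    offspring[k] = avg_weights[k]
    return offspring
```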

5 Multiprofiling A very important concept in information filtering and routing is multi-profiling. Obviously an efficient filtering system should keep track of the likes and dislikes of a user. However, a user can be interested in different subjects (i.e., sport, politics, science); hence multiple profiles must be kept to track a user. This problem is approached in this study by three operations: compare, create, update. When the user submits a request to the system for the first time, a new profile is created. Then sequentially whenever a new query is entered, the system computes the similarity

between the query and the existing profiles. If the newly submitted request (respectively, the new feedback) is not related to the existing profiles (i.e., the similarity between the two profiles does not exceed a certain threshold), a new profile is created and added to the list of existing profiles. Because the number P of allowed profiles per user is restricted, a simple rule is applied: if the current number of existing profiles is P, the new profile is accommodated by removing one of the existing profiles. Specifically, this is done as follows:

– If the similarity between the new profile and one of the existing profiles exceeds a particular threshold, the closest existing profile is discarded and the new profile is inserted at the end of the array.
– Otherwise, if there is no spot left, the oldest profile is discarded. The new profile is inserted into the list of active profiles.
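The replacement policy above can be sketched as follows. This is an illustrative implementation under our own assumptions (function names, dict-based profiles, cosine as the similarity used in Sect. 6.1.5), not the system's actual code.

```python
import math

def cosine(p, q):
    """Cosine similarity between two profiles stored as term -> weight dicts."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def accommodate(profiles, new_profile, max_profiles=4, threshold=0.5):
    """Insert new_profile into the bounded array: replace the closest
    existing profile if it is similar enough, otherwise drop the oldest."""
    if len(profiles) < max_profiles:
        profiles.append(new_profile)   # free spot: plain insertion
        return profiles
    sims = [cosine(p, new_profile) for p in profiles]
    best = max(range(len(profiles)), key=sims.__getitem__)
    # discard the closest profile if similar enough, else the oldest (front)
    profiles.pop(best if sims[best] >= threshold else 0)
    profiles.append(new_profile)
    return profiles
```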

6 Empirical evaluation

In order to evaluate the approach presented in this contribution, we rely on a medical data set, MED, that consists of three files:

– .dat file: contains a collection of 1,033 textual documents from the domain of medicine.
– .qry file: contains 30 different queries submitted by a user, dedicated to the collection.
– .rel file: contains references to the documents in the .dat file that are marked as relevant to the queries. It represents the feedback from the user with respect to each query.

The choice of this collection is motivated by the fact that all necessary information for building a user profile is available, since the true relevant documents for each query are known. Precisely, a query serves to initialize the user profile that will later be incrementally updated as shown in Algorithm 2.

As explained in Sect. 4.1, all documents of the collection (.dat file) undergo stemming, tagging, tokenization, filtering, and weighting. Terms to be considered are selected using RSV (Eq. 4) and the Jaccard coefficient (Eq. 6). It is important to recall that the profile learner works in an online, real-time mode, meaning that new documents keep arriving and populating the database of the system. Therefore, new documents are compared against the current user's profile, and those judged relevant are automatically forwarded to the user. The latter evaluates their quality and sends feedback. Then, the system updates the profile based on the new feedback. This procedure is realized by means of the IGA (Algorithm 1).
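One cycle of this online loop can be summarized by the following sketch. All names are hypothetical placeholders; the actual behavior is given by Algorithm 1, with the similarity being, e.g., the Okapi measure of Eq. 9.

```python
def filtering_cycle(profile, new_documents, user_feedback, iga_update,
                    similarity, threshold):
    """One cycle of the online filtering loop (illustrative only).

    user_feedback : callable labeling the forwarded documents
    iga_update    : callable revising the profile from the feedback (the IGA)
    similarity    : callable scoring a document against the profile
    """
    # forward to the user only the documents the profile judges relevant
    forwarded = [d for d in new_documents if similarity(profile, d) >= threshold]
    # the user marks the forwarded documents as relevant or not
    feedback = user_feedback(forwarded)
    # the IGA revises the profile with the new relevance information
    return iga_update(profile, feedback)
```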


In order to simulate the incremental arrival of new documents in the online real-time setting, the MED collection (.dat file) is uniformly split into several batches. Each batch contains at least one relevant document for each query. Upon arrival of new documents, they are processed (an updated representation, i.e., vocabulary, is derived) and used to update the profile gradually using the GA. The rest of this section describes the findings about various facets of the approach. The following experimental aspects are considered:

– Convergence of both IGA and the GA involved within IGA
– Evolution of the index (vocabulary) to show how new terms are accommodated
– Sensitivity to the different IGA parameters
– Generation of multi-profiles over time
– Performance of IGA in terms of returned documents over time
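A batch-splitting setup in this spirit can be simulated as follows. This is our own illustrative variant (names and the random assignment are assumptions; the paper's actual scheme places the first relevant document of each query in the first batch).

```python
import random

def split_into_batches(documents, relevant_by_query, n_batches=10, seed=0):
    """Split the collection into batches so that every batch contains at
    least one document relevant to each query (simulation setup only)."""
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    placed = set()
    # seed each batch with one relevant document per query (when possible)
    for rel_docs in relevant_by_query.values():
        pool = [d for d in rel_docs if d not in placed]
        rng.shuffle(pool)
        for batch, doc in zip(batches, pool):
            batch.append(doc)
            placed.add(doc)
    # distribute the remaining documents uniformly, round-robin
    rest = [d for d in documents if d not in placed]
    for i, doc in enumerate(rest):
        batches[i % n_batches].append(doc)
    return batches
```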

6.1 Experimental results Before presenting the various experiments conducted in the context of this study, the different parameters are set to the appropriate values after some initial experiments. Further details will follow in Sect. 6.1.4. Table 1 highlights the parameter settings.

6.1.1 Convergence of GA

As an initial experiment, the behavior and the performance of the conventional GA (step 12 of Algorithm 1) are observed when finding the optimal profile in an offline mode, given a query (query 0 in the .rel file) over the collection of 1,033 documents. The genetic algorithm has a satisfactory convergence behavior. We can see that if the stopping criterion of the GA is set on the best individual's performance, a few elite candidates quickly reach an excellent fitness value, but many individuals of the population require longer to reach a high accuracy. Moreover, if the stopping criterion is based on the population's average performance, many good candidates are generated by the GA. Indeed, the average fitness value of the population exceeds 82 % at the end of the simulation (Fig. 4).

[Fig. 4 Evolution of the fitness values when applying the conventional GA: best vs. average]

Following the spirit of incremental learning, the collection is split into batches as mentioned earlier. For the sake of simulation, 10 batches are generated for use in the experimental evaluation. Note that the documents are split among the batches in such a way that:

– for each query, the first relevant document is inserted into the first batch
– the other documents are successively inserted into the batches

Table 1 Parameter setting

Parameter | Value
Population size | 100
Maximum length of individuals | 100 terms
Okapi fitness threshold | 2
Mutation probability | 0.3
Maximum number of iterations | 1,000
Threshold of population's renewal | 0.4
Number of batches | 10

6.1.2 IGA: dynamic evolution of the vocabulary

The first aspect of dynamics is the evolution of the lexicon as new batches of documents become available. Thanks to the genuine application of RSV (Eq. 4) and the Jaccard coefficient (Eq. 6), the vocabulary can be updated systematically in real time. Indeed, for each GA run (step 6 of Algorithm 2), a representation based on the feedback given by the user is computed. Table 2 describes the evolution of the vocabulary according to the different batches. For the sake of clarity, the maximum length of individuals (number of words) is set to 7. The overlap in terms of vocabulary entries between batches is indicated in the last column of the table. Pairs of successive batches overlap substantially, but only 2 terms are common to all batches.

Table 2 Vocabulary evolution at each batch arrival

Batch num | Terms | Common terms vs. previous batch (%)
Batch 1 | cold, len, seen, electrophoresis, adjuv, form, crm | –
Batch 2 | cold, len, seen, fraction, adjuv, protein, crm | 71.4
Batch 3 | cold, len, seen, fraction, a-crystallin, adjuv, electrophoresis | 71.4
Batch 4 | len, cold, fraction, seen, a-crystallin, insoluble, adjuv | 85.7
Batch 5 | len, cold, fraction, seen, lens, a-crystallin, adjuv | 85.7
Batch 6 | len, cold, fraction, seen, lens, adjuv, acid | 85.7
Batch 7 | len, air, fraction, cold, acid, duct, lens | 71.4
Batch 8 | len, air, fraction, cold, lens, duct, acid | 100
Batch 9 | len, air, fraction, cold, lens, duct, a-crystallin | 85.7
Common words | cold, len |

6.1.3 Convergence of IGA

By injecting the 10 batches sequentially into the filtering system, the capacity of self-adaptation can be tracked over the 10 cycles of GA runs. Figure 5 portrays the evolution of the incremental genetic algorithm. One can observe the good convergence of the single/average performance after presenting the document batches over time. The drop in performance stems from the fact that the relevant documents in a new batch do not match the current population of candidates of the IGA; thus, the algorithm tends to force a renewal of the population. However, it is worth noticing that the algorithm ends by learning the profile that fits all batches seen so far. In particular, the population always contains members that achieve the optimal profile before the average

fitness value of the population reaches the threshold (i.e., the stopping criterion of GA). In the first batches, there are not enough relevant documents, and it is easy to see in Eq. 9 that the value of the similarity is low. Therefore, the fitness value (the percentage of documents whose similarity with the candidate profile at hand is above the Okapi threshold) is close to 0. In the case where there are enough relevant documents and the Okapi threshold is not too high, the GA converges as expected for each batch. Moreover, we can choose a stopping criterion based on either the fitness of the best individual, the average fitness of the population, or the number of iterations. Here the criterion retained for the renewal of the population is the average fitness of the current population (whether it is above the threshold of renewal). This criterion seemed to be more interesting, since we are sure to obtain a large number of individuals that fit the user's profile and not only one individual, as shown in Figs. 5 and 6.

[Fig. 5 Best and average fitness values during a run of GA. Stopping criterion: the average fitness value of the population above 0.9]

[Fig. 6 Best and average fitness values during a run of IGA. Stopping criterion: the fitness of the best individual above 0.9]

6.1.4 Sensitivity

IGA depends on many parameters. Their effect on the performance of the algorithm may be crucial, hence the importance of the sensitivity analysis. The different parameters are discussed in the following. Note that the experiments have been repeated 30 times and the results are averaged.

– The mutation rate: as noticed in the experiments (Fig. 7), small values tend to provide better performance of the algorithm. Hence, we set the rate to 0.3.

[Fig. 7 Performance of the system as a function of the mutation rate in GA]

– The Okapi threshold, i.e., the relevance threshold (Eq. 9, Sect. 4.2): its value must be neither too low nor too high. In the first case, all the individuals would quickly converge while not being accurate at all, and further improvement of the profile quality would be prevented. In the second case, during the very first steps of the algorithm, all individuals would have a mediocre fitness value and few parents would be selected, leading to stagnation in the evolution of the population. Figure 8 illustrates the proportion of documents judged relevant for various values of the Okapi threshold.

[Fig. 8 Evolution of the performance as a function of the Okapi threshold value (query 28)]

– The threshold s above which the population is renewed (step 10 of Algorithm 1): it determines whether the population has to be entirely or partially renewed when a new batch of documents arrives. According to Fig. 9, this threshold does not appear to be a crucial parameter, although the performance reaches a maximum for both best and average fitness values when it is set to 0.4. This value has thus been used for the experiments.

[Fig. 9 Query 28: evolution of the performance according to the threshold of population renewal]

– The number of words of the representation has to be set so as to properly describe long queries, as shown in Fig. 10. The shortest query (query 0) and the longest query (query 28) in the MED collection have been considered in this experiment to check the effect of the representation length. From this diagram, we see that the number of words used in the representation must be at least the number of words of the query. For instance, for query 28, when the number of words is too small, the percentage of relevant documents returned is very low, which means that the computed profile is of low quality.

[Fig. 10 Number of relevant documents returned as a function of the representation length and the size of the query (query 0: 4 words, query 28: 15 words)]

6.1.5 Generation of multiple profiles

Our assumption in this filtering system is that a profile corresponds to one theme or topic (e.g., politics, sports, science, etc.). In other terms, in order to learn multiple profiles of the user, the system considers the distance between the queries submitted at different stages by the same user. Hence, for every user, an array of profiles of predefined length is created. As explained in Sect. 5, the process of creating new profiles and updating the existing ones is based on the similarity between the existing profiles in the array and the current candidate profile corresponding to the current query. The measure chosen in this multi-profiling context is the cosine function. It allows the computation of the similarity between two profiles that do not necessarily have the same representation, hence the motivation for this choice. The cosine function reads as follows:

Cosine(w_1, w_2) = ( Σ_{i=1}^{n} w_1i · w_2i ) / ( sqrt(Σ_{i=1}^{n} w_1i²) · sqrt(Σ_{i=1}^{n} w_2i²) )    (15)

where w_1i is the weight of the ith term of the first vector, w_2i is the weight of the ith term of the second vector, and n is the size of the representation. In order to show how this concept of multi-profiling works, we input to the IGA successively the following queries: 1, 2, 3, 2, 4, 1, 4, 5, 2. This sequence indicates that queries 1, 2, and 4 are submitted to the system repeatedly at different interaction stages. For the sake of illustration, the maximum number of profiles is set to 4. Table 3 illustrates the sequence of actions taken by the filtering system after submitting the indicated set of queries. Note that the semantics of deletion of p + insertion of p' is equivalent to an update of the profile p. When the same query is submitted for the nth time at a later stage, the newly created profile replaces the one


related to the same query in the array. When the new profile is totally different from the existing ones, it is inserted into the array. For instance, profiles generated for q2 (which consists of 10 words) are quite distant from those generated for q1 (which consists of 4 words); an insertion action is thus taken when p2 is generated after p1. Note that the query itself reflects the actual search intention of the user (expressed by means of a set of words) less faithfully than the relevant documents marked by the user during feedback.

Table 3 The process of multiple profile generation

Submission | Rank 0 | Rank 1 | Rank 2 | Rank 3 | Action
sub. of q1 | p1 | | | | insertion
sub. of q2 | p1 | p2 | | | insertion
sub. of q3 | p1 | p2 | p3 | | insertion
sub. of q2 | p1 | p3 | p2' | | del. p2, insert. p2'
sub. of q4 | p1 | p3 | p2' | p4 | insertion
sub. of q1 | p3 | p2' | p4 | p1' | del. p1, insert. p1'
sub. of q4 | p3 | p2' | p1' | p4' | del. p4, insert. p4'
sub. of q5 | p2' | p1' | p4' | p5 | del. p3, insert. p5
sub. of q2 | p1' | p4' | p5 | p2'' | del. p2', insert. p2''

6.1.6 Performance

The performance of the IGA algorithm on every request is very high, as portrayed in Fig. 11. Even though the performance for each query depends on the Okapi threshold, the system clearly shows great performance with a fixed value. Note also that the results shown are averaged over 30 runs. As an opportunity for improving the accuracy of the system, it would be interesting to adapt the length of the representation to the length of the query, as shown in

Fig. 10. Clearly, the representation of individuals must consist of more words than the length of the query.

[Fig. 11 Effect of the Okapi threshold on the performance (query 28)]

7 Conclusion

In this paper, we reported on the application of biologically inspired models to the problem of information filtering. Our primary aim in this research is to use genetic algorithms to generate the user's profile and to perform the task of filtering documents that match the user's interests. The problem of online real-time processing of documents and learning the user's profile over time in a lifelong mode is considered. Moreover, the problem of multiple profiles is discussed. The results obtained show that the proposed approach is appropriate, yielding high performance with respect to different aspects. The present study is a first step towards a multilingual filtering system. Facets related to the application of the proposed approach to other languages and their integration in an overall framework will be investigated in the future. We will particularly focus on financial news and stock market wires.

References

Algarni A, Li Y, Xu Y (2010) Selected new training documents to update user profile. In: CIKM, pp 799–808
Borji A, Jahromi M (2008) Evolving weighting functions for query expansion based on relevance feedback. In: Proceedings of the Asia-Pacific web conference, pp 233–238
Bouchachia A (2009) Incremental learning. In: Encyclopedia of data warehousing and mining. IGI Global, Hershey, pp 1006–1012
Bouchachia A (2011) Incremental learning with multi-level adaptation. Neurocomputing 74:1785–1799
Boughanem M, Chrisment C, Tamine L (1999) Genetic approach to query space exploration. Inf Retr 1:175–192
Callan J, Croft W, Harding S (1992) The INQUERY retrieval system. In: Proceedings of the third international conference on database and expert systems applications. Springer, Berlin, pp 78–83
Cruz C, Gonzalez J, Pelta D (2011) Optimization in dynamic environments: a survey on problems, methods and measures. Soft Comput 15:1427–1448
Dumais S, Furnas G, Landauer T, Deerwester S, Harshman R (1988) Using latent semantic analysis to improve access to textual information. In: Proceedings of the conference on human factors in computing systems, pp 281–286
Efron M (2008) Query expansion and dimensionality reduction: notions of optimality in Rocchio relevance feedback and latent semantic indexing. Inf Process Manag 44(1):163–180
Fan W, Gordon M, Pathak P (2005) Effective profiling of consumer information retrieval needs: a unified framework and empirical comparison. Decis Support Syst 40(2):213–233
Goldberg D (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Boston
Hannani U, Shapira B, Shoval P (2001) Information filtering: overview of issues, research and systems. User Model User-Adapt Interact 11(3):203–259
Jin Y, Branke J (2005) Evolutionary optimization in uncertain environments: a survey. IEEE Trans Evol Comput 9(3):303–317
Kapp M, Sabourin R, Maupin P (2011) A dynamic optimization approach for adaptive incremental learning. Int J Intell Syst 26(11):1101–1124
Kuflik T, Boger Z, Shoval P (2006) Filtering search results using an optimal set of terms identified by an artificial neural network. Inf Process Manag 42(2):469–483
Kyamakya K, Bouchachia A, Chedjou J (eds) (2010) Intelligence for nonlinear dynamics and synchronisation. Atlantis Press, Mermaid Waters
Lv Y, Zhai C (2009) Adaptive relevance feedback in information retrieval. In: CIKM, pp 255–264
Nanas N, Kodovas S, Vavalis M, Houstis E (2010) Immune inspired information filtering in a high dimensional space. In: Proceedings of the 9th international conference on artificial immune systems, pp 47–60
Ng H, Ang H, Soon W (1999) DSO at TREC-8: a hybrid algorithm for the routing task. In: Proceedings of the text retrieval conference
Pickens J, Cooper M, Golovchinsky G (2010) Reverted indexing for feedback and expansion. In: CIKM, pp 1049–1058
Reitter D, Lebiere C (2012) Social cognition: memory decay and adaptive information filtering for robust information maintenance. In: AAAI
Ricci F, Rokach L, Shapira B, Kantor P (eds) (2011) Recommender systems handbook. Springer, Berlin
Robertson S (1986) On relevance weight estimation and query expansion. J Doc 42:182–188
Robertson S, Walker S, Hancock-Beaulieu M, Gatford M, Payne A (1996) Okapi at TREC-4. In: Proceedings of the fourth text retrieval conference (TREC-4), pp 73–96
Sahel Z, Bouchachia A, Gabrys B (2007) Adaptive mechanisms for classification problems with drifting data. In: Proceedings of the 11th international conference on knowledge-based intelligent information and engineering systems (KES'07), LNCS 4693, pp 419–426
Schapire R, Singer Y, Mitra M (1998) Boosting and Rocchio applied to text filtering. In: Proceedings of the ACM SIGIR'98 conference on research and development in information retrieval, Melbourne, pp 215–223
Schiaffino S, Amandi A (2000) User profiling with case-based reasoning and Bayesian networks. In: International joint conference, 7th Ibero-American conference on AI, 15th Brazilian symposium on AI
Schütze H, Hull D, Pedersen J (1995) A comparison of classifiers and document representations for the routing problem. In: Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 229–237
Singhal A, Buckley C, Mitra M (1996) Pivoted document length normalization. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 21–29
Singhal A, Mitra M, Buckley C (1997) Learning routing queries in a query zone. In: Proceedings of the ACM SIGIR'97 conference on research and development in information retrieval, Philadelphia, pp 25–32
Tebri H, Boughanem M, Chrisment C (2005) Incremental profile learning based on a reinforcement method. In: Proceedings of the 2005 ACM symposium on applied computing. ACM, New York, pp 1096–1101
van Rijsbergen C (1979) Information retrieval. Butterworths, London
Voorhees E, Harman D (2005) TREC: experiment and evaluation in information retrieval. MIT Press, Cambridge
Woldesenbet Y, Yen G (2009) Dynamic evolutionary algorithm with variable relocation. IEEE Trans Evol Comput 13(3):500–513
Xu J, Croft W (1996) Query expansion using local and global document analysis. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 4–11
Yang Y, Pedersen J (1997) A comparative study on feature selection in text categorization. In: Proceedings of the 14th international conference on machine learning. Morgan Kaufmann, Burlington, pp 412–420
Yang Y, Yoo S, Zhang J, Kisiel B (2005) Robustness of adaptive filtering methods in a cross-benchmark evaluation. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR '05). ACM, New York, pp 98–105
Yeh J, Lin J, Ke H, Yang W (2007) Learning to rank for information retrieval using genetic programming. In: Proceedings of the SIGIR 2007 workshop on learning to rank for information retrieval, pp 233–238
