Research Group on Soft Computing and Intelligent Information Systems http://sci2s.ugr.es
Tuning the Matching Function for a Threshold Weighting Semantics in a Linguistic Information Retrieval System E. Herrera-Viedma, A.G. López-Herrera
Technical Report #SCI2S-2004-04 May, 2004
Dept. of Computer Science and Artificial Intelligence E.T.S. de Ingeniería Informática. University of Granada E-18071 Granada - Spain Fax +34 958 243317
1
Tuning the Matching Function for a Threshold Weighting Semantics in a Linguistic Information Retrieval System E. Herrera-Viedma*, A.G. López-Herrera** Dept. of Computer Science and Artificial Intelligence, Library Science Studies School, University of Granada, 18017 – Granada e-mails: *
[email protected],**
[email protected] Abstract Information retrieval is an activity that implies to achieve documents that better fulfil the user information needs. For achieving this activity an Information Retrieval System uses matching functions which specify the degree of relevance of a document with respect to a user query. Assuming linguistic weighted queries we present a new linguistic matching function for a threshold weighting semantics which is defined using a 2-tuple fuzzy linguistic approach [1]. This new 2-tuple linguistic matching function can be interpreted as a tuning of that defined in [2] using an ordinal linguistic approach. We show that it simplifies the processes of computing in the retrieval activity, avoids the loss of precision in final results, and consequently, can help to improve the users’ satisfaction. Keywords: Fuzzy Information Retrieval, Linguistic Modelling, Weighted Queries.
1. Introduction The main activity of an Information Retrieval System (IRS) is the gathering of pertinent archived documents that better satisfy the user queries. IRSs present three components to carry out this activity [2, 3]: 1. A database: which stores the documents and the representation of their information contents (index terms). 2. A query subsystem: which allows users to formulate their queries by means of a query language. 3. An evaluation subsystem: which evaluates the documents for a user query obtaining a Retrieval Status Value (RSV) form each document. The query subsystem supports the user-IRS interaction, and therefore, it should be able to account for the imprecision and vagueness typical of human communication. This aspect may be modelled by means of the introduction of weights in the query language. Many authors have proposed weighted IRS models using Fuzzy Set Theory [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. Usually, they assume
2
numeric weights associated with the queries (values in [0, 1]). However, the use of query languages based on numeric weights forces the user to quantify qualitative concepts (such as ”importance”), ignoring that many users are not able to provide their information needs precisely in a quantitative form but in a qualitative one. In fact, it seems more natural to characterize the contents of desired documents by explicitly associating a linguistic descriptor to a term in a query, like ”important” or ”very important”, instead of a numerical value. In this sense, some fuzzy linguistic IRS models [2, 3, 14, 15, 16, 17] have been proposed using a fuzzy linguistic approach [18] to model the query weights and document scores. A useful fuzzy linguistic approach which allows us to reduce the complexity of the design for the IRSs [2,3] is called the ordinal fuzzy linguistic approach [19, 20, 21]. In this approach, the query weights and document scores are ordered linguistic terms. On the other hand, we have to establish the semantics associated with the query weights to formalize fuzzy linguistic weighted querying. There are four semantic possibilities [2, 5, 15]: i) weights as a measure of the importance of a specific element in representing the query, ii) as a threshold to aid in matching a specific document to the query, iii) as a description of an ideal or perfect document, and iv) as a limit on the amount of documents to be retrieved for a specific element. Usually, in weighted queries the most query subsystems proposed in the literature use only one of the semantic possibilities. In particular the threshold semantics is very applied because it is easily understandable by the users. Assuming an ordinal fuzzy linguistic approach we define a variant for a threshold semantics, called symmetrical threshold semantics [2]. This semantics has a symmetric behaviour in both sides of the mid threshold value. It assumes that a user may use presence weights or absence weights in the formulation of weighted queries. Then, it is symmetrical with respect to the mid threshold value, i.e., it presents the usual behaviour for the threshold values which are on the right of the mid linguistic value (presence weights), and the opposite behaviour for the values which are on the left (absence weights or presence weights with low value). This semantics takes on that a user can search for documents with a minimally acceptable presence of one term in their representations, or documents with a maximally acceptable absence of one term in their representations. To evaluate this semantics, in [2] was defined a parameterized symmetrical linguistic matching function. This function has like main limitation the loss of precision in final results, i.e. in the computation of the linguistic RSVs of documents. The loss of precision appears like consequence of using a discrete representation for the linguistic terms in the ordinal fuzzy linguistic approach. In this contribution we present a new modelling of the symmetrical threshold semantics defined in [2] which overcomes its difficulties. We present a new and alternative definition of the symmetrical matching function that synthesizes the symmetrical threshold semantics and allows to achieve more precise RSVs, improving the results of the retrieval and consequently allowing to increase the users’ satisfaction. This new symmetrical matching function is defined by using the 2-tuple linguistic representation model [1] which improves the precision in the representation of linguistic information.
3
The paper is structured as follows. Section 2 presents the preliminaries, that is, the ordinal fuzzy linguistic approach and the 2-tuple fuzzy linguistic representation model together with its operational resources. Section 3 defines the new symmetrical matching function and accomplishes a study of its performance. Section 4 shows an example of the operation of a linguistic IRS with this new symmetrical matching function. Finally, some concluding remarks are pointed out. 2. Preliminaries In this section, we review some tools of fuzzy linguistic processing that will be used in the new modelling of the symmetrical threshold semantics. 2.1. The Ordinal Fuzzy Linguistic Approach The ordinal fuzzy linguistic approach is an approximate technique appropriate to deal with qualitative aspects of problems [20]. An ordinal fuzzy linguistic approach is defined by considering a finite and totally ordered label set S = {s0, …, sT}, T+1 is the cardinality of S in the usual sense, and with odd cardinality (7 or 9 labels). The mid term representing an assessment of "approximately 0.5" and the rest of the terms being placed symmetrically around it [22]. The semantics of the linguistic terms set is established from the ordered structure of the terms set by considering that each linguistic term for the pair (si, sT-i) is equally informative. For each label si is given a fuzzy number defined on the [0,1] interval, which is described by a membership function. The computational model to combine ordinal linguistic information is based on the following operators: 1. Negation operator: Neg(si) = sj, j = T - i. 2. Maximization operator: MAX(si, sj) = si if si >= sj . 3. Minimization operator: MIN(si, sj) = si if si α2 then (sk, α1) is bigger than (sl, α2).
5
3. Aggregation of 2-tuples: Using the functions ∆ and ∆-1 any numerical aggregation operator can be easily extended for dealing with linguistic 2-tuples. For example, the Ordered Weighted Averaging (OWA) [23] proposed by Yager is an aggregation operator of information which acts taking into account the order of the assessments to be aggregated. Definition 3. [23] Let A = {a1, …, am}, ak ∈ [0,1] be a set of assessments to aggregated, then the OWA operator, φ, is defined as φ( a1, …, am) = W·BT, where W = [w1, …, wm], is a weighting vector, such that wi ∈ [0,1] and Σiwi = 1, and B = {b1, …, bm} is a vector associated to A, such that, B = σ(A) = {aσ(1), …, aσ(m)}, with σ being a permutation over the set of assessments A, such that aσ(j) ≤ aσ(i) ∀i ≤ j. A 2-tuple linguistic extended definition of φ would be as follows: Definition 4. Let A = {(a1, α1),…, (am, αm)} be a set of assessments in the linguistic 2-tuple domain, then the 2-tuple linguistic OWA operator, φ2t is defined as φ 2t ((a1 , α 1 ),..., (a m , α m )) =∆(W ⋅ B T ) , B = σ ( A) = {(∆−1 (a1 , α 1 ))σ (1) ,..., (∆−1 (a m , α m ))σ ( m) }. 3.
A New Modelling of the Symmetrical Threshold Semantics
In this section we present a new proposal to model the symmetrical threshold semantics defined in [2] in order to improve its performance. Before presenting it we show the linguistic IRS assumed. 3.1. An Ordinal Linguistic Weighted IRS Based on a Symmetrical Threshold Semantics In this paper, we assume an ordinal linguistic weighted IRS that presents the following elements to carry out its activity: 1. Database We assume a database of a traditional fuzzy IRS as in [8, 11, 24]. The database stores the finite set of documents D = {d1,…, dm} represented by a finite set of index terms T = {t1,…,tl}, which describe the subject content of the documents. The representation of a document is a fuzzy set of terms characterized by a numeric indexing function F: D×T Æ [0, 1], which is called index term weight [11]: dj = F(dj, t1) /t1 + F(dj, t2) /t2 + …+ F(dj, tl) /tl. F weighs index terms according to their significance in describing the content of a document. Thus F(dj, ti) is a numerical weight that represents the degree of significance of ti in dj. 2. Query subsystem We use a query subsystem with a fuzzy linguistic weighted Boolean query language to express user information needs. With this language each query is expressed as a combination of the weighted index terms that are connected by
6
logical operators AND (∧), OR (∨), and NOT(¬). The weights are ordinal linguistic values taken from a label set S, and they are associated with a symmetrical threshold semantics [2, 3]. Formally, in [14] a fuzzy linguistic-weighted Boolean query with only one semantics was defined as any legitimate Boolean expression whose atomic components are pairs 〈ti, ci〉, where ti is an index term and ci is a value of the linguistic variable, Importance, qualifying the importance that the term ti must have in the desired documents. As in [14], our atomic components are pairs but defining the linguistic variable Importance with the ordinal linguistic approach and associating ci with a symmetrical threshold semantics. Accordingly, the set Q of the legitimate queries is defined by the following syntactic rules: 1. 2. 3. 4. 5.
∀q = 〈ti, ci〉 ∈ T × S → q ∈ Q. ∀q, p ∈ Q → q ∧ p ∈ Q. ∀q, p ∈ Q → q ∨ p ∈ Q. ∀q ∈ Q → ¬q ∈ Q.
All legitimate queries q ∈ Q are only those obtained by applying rules 1-4, inclusive.
3. Evaluation subsystem The evaluation subsystem for weighted Boolean queries acts by means of a constructive bottom-up process based on the criterion of separability [9, 11]. The RSVs of the documents are ordinal linguistic values whose linguistic components are taken from the linguistic variable Importance but representing the concept of relevance. Therefore, the set of linguistic terms S is also assumed to represent the relevance values. The evaluation subsystem acts in two steps: 1. Firstly, the documents are evaluated according to their relevance only to atoms of the query. In this step, the symmetrical threshold semantics is applied in the evaluation of atoms by means of a parameterized linguistic matching function g : D×T×S → S, which is defined as [2] ⎧s Min{a + β ,T } ⎪s ⎪ Max{0,a − β } g (d j ,t i ,ci ) = ⎨ ⎪ Neg ( s Max{0,a − β } ) ⎪⎩ Neg ( s Min{a + β ,T } )
sT / 2 ≤ sb ≤ s a sT / 2 ≤ sb ∧ s a < sb s a ≤ sb < sT / 2 sb < sT / 2 ∧ sb < sa
such that, (i) sb= ci; (ii) sa is the linguistic index weight obtained as sa = Label(F(dj,ti)), being Label:[0,1]ÆS a function that assigns a label in S to a numeric value r ∈ [0,1]; and iii) β is a bonus value that rewards/penalizes the relevance degrees of documents for the satisfaction/dissatisfaction of request , which can be defined depending on the closeness between Label(F(dj,ti)) and ci, for example as β=round(2|b-a|/T). We should point out that whereas the traditional threshold matching function are always non-decreasing [15], g is non-decreasing on the right of the mid term and
7
decreasing on the left of the mid term in order to be consistent with the meaning of the symmetrical threshold semantics. 2. Secondly, the documents are evaluated according to their relevance to Boolean combinations of atomic components, and so on, working in a bottom-up fashion until the whole query is processed. In this step, the logical connectives AND and OR are modelled by means of LOWA [20] operators with orness(W) < 0.5 and orness(W) ≥ 0.5 respectively, being orness(W) a orness measure introduced by Yager in [23] to classify the aggregation of the OWA operators: orness(W) = (1/m-1)(∑ m i=1 (m-i) wi). Remark 2: We should point out that if we have a negated query, or a negated subexpression, or a negated atom, their evaluation is obtained from the negation of the relevance results computed for the query, or the subexpression, or atom in a no-negated situation. 3.2. Problems of the Symmetrical Threshold Semantics Modelled by the Parameterized Linguistic Matching Function g According to the symmetrical threshold semantics the evaluation subsystem assumes that a user may search for documents with a minimally acceptable presence of one term in their representations (as in the classical interpretation happens [15]) or documents with a maximally acceptable presence of one term in their representations. Then, when a user asks for documents in which the concept(s) represented by a term ti is (are) with the value High Importance, the user would not reject a document with a F value greater than High; on the contrary, when a user asks for documents in which the concept(s) represented by a term ti is (are) with the value Low Importance, the user would not reject a document with a F value less than Low. Given a request ∈ T×S; this means that the query weights that imply the presence of a term in a document ci ≥ sT/2 (e.g. High, Very High,) they must be treated differently to the query weights that imply the absence of one term in a document ci < sT/2 (e.g. Low, Very Low). Then, if ci > sT/2 the request , is synonymous with the request , which expresses the fact that the desired documents are those having F values as high as possible; and if ci < sT/2 is synonymous with the request , which expresses the fact that the desired documents are those having F values as low as possible. The linguistic matching function g defined in [2] represents a possible modelling of the meaning of the symmetrical threshold semantics. However, such modelling or interpretation presents some problems: 1. The loss of precision: This problem is a consequence of ordinal linguistic framework which works with discrete linguistic expression domains and this implies to assume limitations in the representation domain of RSVs. Therefore, as linguistic term sets (S) assumed have a limited cardinality (5, 7 or 9 labels) to assess the linguistic RSVs, in consequence, it is difficult to distinguish or specify what documents really satisfy better the atomic weighted request . Although the system retrieves many documents the possible relevance assessments are limited by the cardinality of the label set considered.
8
2. The loss of information: This problem also is a consequence of the ordinal linguistic approach because it forces us to apply approximation operations in the definition of g, in particular, the rounding operation used to calculate the parameter β, and as it is known [1], in such a case almost always there exists a loss of information. Example 1: Let S= {s0 =Null (N), s1=Extremely_Low (EL), s2=Very_Low (VL), s3=Low (L), s4=Medium (M), s5=High (H), s6=Very_High (VH), s7=Extremely_High (EH), s8=Total (TO)} be a label set used to assess the linguistic information in a IRS and consider two documents d1 and d2, such that, Label(F(d1,ti))=EH and Label(F(d2,ti))=TO, respectively, then if the atomic request is we obtain the same relevance degree for both documents as a consequence of the loss of information, g(d1, ti, M)= TO and g(d2, ti, M)= TO. 3. g tends to overvalue the satisfaction/dissatisfaction of the requests: This problem is a consequence of the own definition of g. For example, if we analyze its definition we can observe that relevance degrees generated when the threshold value is satisfied, i.e. sMin{a+β ,T}, always are limited by the index term weight, sa. This shows a too optimistic evaluation of the satisfaction of threshold value and reduces the possibilities of discrimination among the documents that satisfy the threshold value. Similarly, it happens in the dissatisfaction case. In the following subsection, we try to overcome these problems by defining a new threshold matching function. 3.3. A 2-Tuple Linguistic Matching Function to Model the Symmetrical Threshold Semantics In this section, we present a new symmetrical matching function to model the symmetrical threshold semantics that overcomes the problems of the matching function g [2] aforementioned. We design it by using as base the 2-tuple fuzzy linguistic representation model [1] and we call it like 2-tuple linguistic matching function g2t. Firstly, we should point out that the simple fact to define the new matching function g2t in a 2-tuple linguistic approach allows us to solve the first problem of g, given that using the 2-tuple linguistic representation model in its definition g2t inherits its properties, and one of the main properties of the 2-tuple linguistic representation model is to eliminate the loss of precision of the ordinal linguistic model [1]. On the other hand, to overcome the second problem we have to avoid to include approximation operations in the definition of g2t, and to overcome the third problem we have to soften the relevance degrees generared by g2t when threshold value is minimally satisfied by the index term weight. As aforementioned, symmetrical threshold semantics has a symmetric behaviour in both sides of the mid threshold value because it is defined to distinguish two situations in the threshold interpretation: i) when the threshold
9
value is on the left of the mid term and ii) when it is on the right. It assumes that a user may use presence weights or absence weights in the formulation of weighted queries. Then, it is symmetrical with respect to the mid threshold value, i.e., it presents the usual behaviour for the threshold values which are on the right of the mid threshold value (presence weights), and the opposite behaviour for the values which are on the left (absence weights or presence weights with low value). Therefore, analyzing the case of presence weights, i.e. threshold values which are on the right of the mid threshold value, we rapidly derive the case of absence weights. When the linguistic threshold weight sb given by a user is higher, in the usual sense, than middle label of the term linguistic set, sT/2, the matching function g is non-decreasing. As aforesaid, in this case the problem of g is that it rewards excessively to those documents whose F values overcomes to the threshold weight sb and penalizes excessively to those documents whose F values do not overcome sb. We look for a non-decreasing matching function g2t that softens the behaviour of g. Concretely, to achieve this goal g2t should work as follows: the more the F values exceed the threshold values and the closer they are to the maximum RSV sT, the greater the RSVs of the documents. However, when the F values are below the threshold values and closer to s0, the lower the RSVs of the documents and the closer to s0 they are. These two circumstances are called in the literature oversatisfaction and undersatisfaction [15]. Assuming a continuous numeric domain [0, T], in Figure 1 we represent graphically the desired behaviour of g2t for three possible threshold values T/2, u and u´ , being values 0, T/2, and T the indexes of the following terms of S: bottom term, middle term and top term, respectively.
T
Relevance
T 2
0
T 2
u
u’
T
T·F(dj, ti)
Figure 1. Desired behaviour of the matching function g2t
10
T
Relevance β2 T 2
β1
a1
0
T 2
u
a2
T T·F(dj,ti)
Figure 2. Desired behaviour of g2t for a threshold value on the right of the mid term
If we focus on the case of threshold value u (see Figure 2), then given two possible values of index term weight a1< u and a2 > u, the relevance degrees obtained by a desired matching function should be β1 and (T/2) + β2. Assuming this hypothesis the definition of the 2-tuple linguistic matching function g2t on the right of the mid term would be as follows: g2t : D×T×(S×[-.5, .5)) → S×[-.5, .5)) T ⎧ ⎪∆(β + ) g2t (dj ,ti ,(sb,0))=⎨ 2 2 ⎪⎩∆(β1)
if(sa ,αa ) ≥(sb ,0) ∧(sb ,0) ≥(sT / 2,0) if(sa ,αa ) ( sb ,0) ∧ ( sb ,0) < ( sT / 2 ,0)
a ⋅T T ⋅ (u − a1 ) T ⋅ (T − a2 ) T ⋅ ( a2 − u ) , β1 = 1 , β 2* = , β1* = , 2 ⋅ (T − u ) 2 ⋅ (T − u ) 2⋅u 2⋅u
u =∆ ( sb ,0) , a1 =T ·F(dk , ti) and a2 =T ·F(dj , ti). −1
Assuming the label set S defined in Example 1, in Table 1 we show a comparison of the behaviour of both symmetrical matching functions, g and g2t (see fourth and sixth columns) when sb ≥ sT / 2 , that is, for sb ∈ {s4, s5, s6, s7, s8}. To compare better both functions we also show the behaviour of the symmetrical matching function g2t projected in an ordinal linguistic domain (see fifth column), that is considering the results of g2t in the 2-tuple linguistic domain (S×0) or ordinal linguistic domain S. Then, analysing the definition of g2t and the results shown in Table 1 we can point out the following considerations:
12
1. g2t is non-decreasing for threshold values higher than mid threshold value and decreasing for threshold values lower than mid threshold value, and therefore, it works like the ordinal linguistic matching function g, being consistent with the meaning of the symmetrical threshold semantics. 2. The problem of the loss of precision in the results is solved because using the 2-tuple fuzzy linguistic representation model g2t produces more complete and precise results than g, given that relevance results produced not only express the linguistic value obtained in the computing process of the RSVs, but also add a numeric measure of the difference of derived information, the called symbolic translation [1]. Additionally, we should point out that this improvement in the precision of the results can help to improve the ranking processes of documents in the output of linguistic IRS. For example, in the rows 38 and 39 of the Table 1 g returns the same ordinal linguistic RSVs, i.e. s0 and s0, while g2t returns 2-tuple linguistic RSVs (s1,-.5) and (s1,0), respectively. Therefore, in such a case g2t produces more precise results and furthermore, allows to rank better the documents evaluated in rows 38 and 39. 3. The problem of the loss of information in the results provided by g2t is also solved because we do not use approximation operations in its definition and the 2-tuple fuzzy linguistic representation allows to gather all information generated in the processes of computing with words carried out by the application of g2t. For example, In Table 1 we can observe that in many cases (rows 11-14, 20-24, 29-34, 38, 40, 42, 44) if we work with the function g2t in an ordinal linguistic context there exists a loss of information because the value of symbolic translation is not represented. 4. With regard to overvaluation problem of g, we can say that g2t get to soften that overvaluation behaviour of g. For example, if we compare the expressions of both functions in the case of a threshold value on the right of the mid linguistic value and in a satisfaction situation, the results returned by g2t are in the 2-tuple linguistic interval [(sT/2, 0), (sT, 0)] (using the projection of g2t on an ordinal linguistic domain S ( g 2t ( S ) ), this means they are assessed in the label set
{sT/2, sT/2+1, …, sT}, while the results returned by g are assessed in the label set {sp=Label(F(dj,ti)), sp+1,…, sT}, being sp=Label(F(dj,ti)) the ordinal linguistic weight of the index term ti representing the content of the document dj equal to the desired threshold value sb and maintaining the following relationship: g(dj ,ti, sb)≥ g 2t ( S ) (dj ,ti, sb) for all Label(F(dj,ti)≥ sp ≥ sT/2. This fact is easily observable in Table 1. Similarly, it happens in the dissatisfaction case.
13
D
F(dj,ti)
Sb
g
g 2t ( S )
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
S0 S1 S2 S3 S4 S5 S6 S7 S8 S0 S1 S2 S3 S4 S5 S6 S7 S8 S0 S1 S2 S3 S4 S5 S6 S7 S8 S0 S1 S2 S3 S4 S5 S6 S7 S8 S0 S1 S2 S3 S4 S5 S6 S7 S8
S4 S4 S4 S4 S4 S4 S4 S4 S4 S5 S5 S5 S5 S5 S5 S5 S5 S5 S6 S6 S6 S6 S6 S6 S6 S6 S6 S7 S7 S7 S7 S7 S7 S7 S7 S7 S8 S8 S8 S8 S8 S8 S8 S S8
S0 S0 S1 S3 S4 S5 S7 S8 S8 S0 S0 S1 S2 S4 S5 S6 S8 S8 S0 S0 S1 S2 S3 S5 S6 S7 S8 S0 S0 S1 S2 S3 S4 S6 S7 S8 S0 S0 S0 S2 S3 S4 S5 S7 S8
S0 S1 S2 S3 S4 S5 S6 S7 S8 S0 S1 S2 S2 S3 S4 S5 S7 S8 S0 S1 S1 S2 S3 S3 S4 S6 S8 S0 S1 S1 S2 S2 S3 S3 S4 S8 S0 S1 S1 S2 S2 S3 S3 S4 S4
g 2t (S0,0) (S1,0) (S2,0) (S3,0) (S4,0) (S5,0) (S6,0) (S7,0) (S8,0) (S0,0) (S1,-0.2) (S2,-0.4) (S2,0.4) (S3,0.2) (S4,0) (S5,0.33) (S7,-0.33) (S8,0) (S0,0) (S1,-0.33) (S1,0.33) (S2,0) (S3,-0.33) (S3,0.33) (S4,0) (S6,0) (S8,0) (S0,0) (S1,-0.43) (S1,0.14) (S2,0.29) (S2,0.29) (S3,0.14) (S3,0.43) (S4,0) (S8,0) (S0,0) (S1,-0.5) (S1,0) (S2,-0.5) (S2,0) (S3,-0.5) (S3,0) (S4,-0.5) (S4,0)
Table 1. Comparing linguistic matching functions
14
4. Operation of a Linguistic Weighted IRS based on the 2-Tuple Linguistic Matching Function g2t
In this section, we present an example of performance of the IRS defined in Subsection 3.1 under the 2-tuple linguistic symmetrical matching function g2t. This linguistic IRS was defined in an ordinal linguistic context. Then, to show the performance of g2t that IRS must be redefined in terms of the 2-tuple fuzzy linguistic representation model. To do that, we have to include the some modifications in the ordinal linguistic IRS model presented in Subsection 3.1. These modifications affect evaluation subsystem in particular, keeping database and query subsystem invariable. They are the following: •
• • •
The ordinal linguistic threshold weights of queries provided by the users have to be transformed to the linguistic 2-tuple domain S×[-.5, .5) to be processed by the evaluation subsystem. As we said in Subsection 2.2, this is carried out by adding the symbolic translation value 0. The numeric index term weights F(dj, ti) have to be transformed to the 2tuple linguistic domain, S×[-.5, .5), by means of the transformation function ∆, as ∆( T ·F(dj , ti)). In the IRS defined in Subsection 3.1 the Booleans connectives of the queries are modelled by means of the LOWA operator. Now, we substitute it by the 2-tuple linguistic OWA operator, φ2t, introduced in Definition 4. Similarly, in the case of the negated queries, we must substitute the ordinal linguistic negation operator by the 2-tuple linguistic negation operator.
Let us suppose a small database containing a set of seven documents D = {d1, ..., d7}, represented by means of a set of 10 index terms T = {t1, ..., t10}. Documents are indexed by means of an indexing function F, which represents them as follows: d1 = 0.7/t5 + 0.4/t6 +1/t7 d2 = 1/t4 + 0.6/t5 + 0.8/t6 + 0.9/t7 d3 = 0.5/t2 + 1/t3 + 0.8/t4 d4 = 0.9/t4 + 0.5/t6 + 1/t7 d5 = 0.7/t3 + 1/t4 + 0.4/t5 + 0.8/t9 + 0.6/t10 d6 = 0.8/t5 + 0.99/t6 + 0.8/t7 d7 = 0.8/t5 + 0.02/t6 + 0.8/t7 + 0.9/t8 Using the set of the nine labels given in Example 1 to provide the linguistic weighted queries, consider that a user formulates the following query: q = ((t5,VH)∨ (t7,H))∧((t6,L)∨ (t7,H)).
15
Then, the evaluation process of this query is developed in the following steps : 1. Evaluation of the atoms with respect to the symmetrical threshold semantics. In this step, firstly, we obtain the documents represented in a 2-tuple linguistic form applying the function ∆ over index term weights F(dj , ti): d1 = (VH,-.4)/t5 + (L,.2)/t6 +(TO,0)/t7 d2 = (TO,0)/t4 +(H,-.2)/t5 + (VH,.4)/t6 +(EH,.2)/t7 d3 = (M,0)/t2 + (TO,0)/t3 + (VH,.4)/t4 d4 = (EH,.2)/t4 + (M,0)/t6 + (TO,0)/t7 d5 = (VH,-.4)/t3 + (TO,0)/t4 + (L,.2)/t5 + (VH,.4)/t9 + (H,-.2)/t10 d6 = (VH,.4)/t5 + (TO,-.08)/t6 + (VH,.4)/t7 d7 = (VH,.4)/t5 + (N,.16)/t6 + (VH,.4)/t7 + (EH,.2)/t8. Then, we evaluate atoms according to the symmetrical threshold semantics by means of g2t : • (t5,VH) : 5 5 5 5 5 { RSV1 = (M,-.27), RSV2 = (L,.2), RSV5 = (VL,.13), RSV6 = (H,-.2), RSV7 = (H,-.2)} (t6,L): 6 6 6 6 { RSV = (M,-.16), RSV2 = (EL,.28), RSV4 = (L,.2), RSV6 = (N,.06), RSV7 = (TO,-.16)} •
6 1
(t7,H) : 7 7 7 7 { RSV = (TO,0), RSV2 = (EH,-.07), RSV4 = (TO,0), RSV6 = (VH,-.13), RSV7 = •
7 1
(VH,-.13)}
being RSV ji = g2t(dj ,ti,(ci,0)), and where, for example, the value RSV27 is calculated by means of g2t as follows : RSV27 =g2t(d2,t7,(H,0))= ∆(
8 ⋅ (7.2 − 5) 8 + ) = ∆(6.93) = ( s7 = EH ,−.07) . 2 ⋅ (8 − 5) 2
2. Evaluation of subexpressions. The query q has two subexpressions, q1= (t5,VH)∨ (t7,H) and q2 = (t6,L)∨ (t7,H). Each subexpression is in disjunctive form, and thus, we must use an operator φ 2t with orness(W) > 0.5 (for example, with W = [0.7, 0.3]) to process them. The results that we obtain are the following:
16
• q1= (t5,VH)∨ (t7,H): { RSV11 = (EH,-.28), RSV21 = (VH,-.19), RSV41 = (VH,-.4), RSV51 = (EL,.49), RSV61 = (VH,-.45), RSV71 = (VH,-.45)}, • q2 = (t6,L)∨ (t7,H): { RSV12 = (EH,-.25), RSV22 = (H,.24), RSV42 = (EH,-.44), RSV62 = (M,.13), RSV72 = (EH,.25)}, being RSV ji the evaluation result of the subexpression qi with respect to the document dj, where, for example, the RSV22 is calculated by means of the 2-tuple linguistic OWA operator φ2t as follows : RSV22 = φ2t ( RSV26 = (EL,.28), RSV27 = (EH,-.07))= ∆(6.93 ⋅ 0.7 + 1.28 ⋅ 0.3) = ∆(5,24) = ( H ,.24) , such that ∆-1(EL,.28) = 1.28 and ∆-1(EH,-.07) = 6.93. 3. Evaluation of the whole query. We evaluate the whole query using an operator φ 2t with orness(W) < 0.5 (e.g. with W = [0.3, 0.7]) given that it is in a conjunctive normal form, obtaining the following relevance results RSVj for each document dj: {RSV1 = (EH,-.27), RSV2 =(H,.41), RSV4 = (VH,-.11), RSV5=(N,.45), RSV6 =(H,-.44), RSV7= (VH,.06)}. To evaluate the impact of the 2-tuple linguistic matching function g2t on the performance of IRS we can compare it with the result obtained by the IRS in an ordinal linguistic framework and applying the linguistic matching function g: {RSV1 =EH, RSV2 =VH, RSV4 = VH, RSV5=EL, RSV6 =H, RSV7= H}. Analyzing these results we should point out the following: 1. Firstly, it is obvious the advantage of the use of the 2-tuple fuzzy linguistic representation model, given that if we use an ordinal linguistic representation it is impossible to distinguish the relevance difference between some documents, for example between d2 and d4 or between d6 and d7, and these facts are easily observable using the 2-tuple linguistic format. 2. On the other hand, we must point that the IRS based on the 2-tuple linguistic matching function g2t obtains results more consistent that reflect better the relevance degree of some documents with respect to the information need expressed by the user. For example:
17
•
•
If we see the representation of the document d5, this document does not satisfy any criteria expressed on the weighted query q, i.e., it does not contain terms t6 and t7, and although it contains the term t5, however, its index term weight is lower than the threshold value associated with t5 in the query, and therefore, it seems more reasonable and consistent to assess this satisfaction situation with a relevance value N (Null) than with a value EL (Extremadely_Low). If we see the representation of the documents d1 and d7, we can observe that both documents present a satisfaction level with respect to the query very similar, however, the IRS based on g returns for both relevance degrees which are more different than in the case of the IRS based on g2t.
4. Concluding Remarks In this paper we have described a new modelling of the symmetrical threshold semantics [2] in a linguistic framework. We have defined a new symmetrical linguistic matching function to model the meaning of the symmetrical threshold semantics that overcomes the problems found in the linguistic matching function defined in [2]. We have defined this new linguistic matching function in a 2-tuple fuzzy linguistic context [1] to take advantage of the usefulness of the 2-tuple fuzzy linguistic representation model with respect to avoid the problems of loss of precision and information in the results. In the future, we shall research the different threshold matching functions existing in the literature in order to define a general application framework that facilitates us their design and use in the IRSs. 5. References
1. F. Herrera and L. Martínez, A 2-tuple fuzzy linguistic representation model for computing with words, IEEE Transactions on Fuzzy Systems 8:6 (2000) 746-752. 2. E. Herrera-Viedma, Modelling the retrieval process for an information retrieval system using an ordinal fuzzy linguistic approach, Journal of the American Society for Information Science and Technology 52:6 (2001) 460475. 3. E. Herrera-Viedma, An information retrieval system with ordinal linguistic weighted queries based on two weighting elements, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems 9 (2001) 77-88. 4. A. Bookstein, Fuzzy request: An approach to weighted Boolean searches, J. of the American Society for Information Science 31 (1980) 240-247. 5. G. Bordogna, C. Carrara and G. Pasi, Query term weights as constraints in fuzzy information retrieval, Information Processing & Management 27 (1991) 15-26. 6. G. Bordogna and G. Pasi, Linguistic aggregation operators of selection criteria in fuzzy information retrieval, Int. J. of Intelligent Systems 10 (1995) 233-248. 7. D. Buell and D.H. Kraft, Threshold values and Boolean retrieval systems, Information Processing & Management 17 (1981) 127-136.
18
8. D. Buell and D.H. Kraft, A model for a weighted retrieval system, J. of the American Society for Information Science 32 (1981) 211-216. 9. C.S. Cater and D.H. Kraft, A generalization and clarification of the WallerKraft wish list, Information Processing & Management 25 (1989) 15-25. 10. D.H. Kraft and D.A. Buell, Fuzzy sets and generalized Boolean retrieval systems, Int.l J. of Man-Machine Studies 19 (1983) 45-56. 11. W.G. Waller and D.H. Kraft, A mathematical model of a weighted Boolean retrieval system, Information Processing & Management 15 (1979) 235-245. 12. R.R. Yager, A Hierarchical Document Retrieval Language, Information Retrieval 3 (2000) 357-377. 13. R.R. Yager, A note on weighted queries in information retrieval system, J. of American Society of Information Science 38 (1987) 23-24. 14. G. Bordogna and G. Pasi, A fuzzy linguistic approach generalizing Boolean information retrieval: A model and its evaluation, J. of the American Society for Information Science 44 (1993) 70-82. 15. D.H. Kraft, G. Bordogna and G. Pasi, An extended fuzzy linguistic approach to generalize Boolean information retrieval, Information Sciences 2 (1994) 119-134. 16. E. Herrera-Viedma, O. Cordón, M. Luque, A.G. Lopez, A.M. Muñoz, A model of fuzzy linguistic IRS based on multi-granular linguistic information, Int. J. of Approximate Reasoning 34 (2003) 221-239. 17. G. Bordogna and G. Pasi. An ordinal information retrieval model, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems 9 (2001) 63-76. 18. L.A. Zadeh, The concept of a linguistic variable and its applications to approximate reasoning. Part I, Information Sciences 8 (1975) 199-249, Part II, Information Sciences 8 (1975) 301-357, Part III, Information Sciences 9 (1975) 43-80. 19. F. Herrera and E. Herrera-Viedma, Aggregation operators for linguistic weighted information, IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 27 (1997) 646-656. 20. F. Herrera, E. Herrera-Viedma and J.L. Verdegay, Direct approach processes in group decision making using linguistic OWA operators, Fuzzy Sets and Systems, 79 (1996) 175-190. 21. R.R. Yager, An approach to ordinal decision making, Int. J. of Approximate Reasoning 12 (1995) 237-261. 22. P.P. Bonissone and K.S. Decker, Selecting uncertainty calculi and granularity: An experiment in trading-off precision and complexity, in: L.H. Kanal and J.F. Lemmer, Eds., Uncertainty in Artificial Intelligence (North-Holland, 1986) 217-247. 23. R.R. Yager, On ordered weighted averaging aggregation operators in multicriteria decision making, IEEE Transactions on Systems, Man, and Cybernetics 18 (1988) 183-190. 24. S. Miyamoto, Fuzzy Sets in Information Retrieval and Cluster Analysis (Kluwer Academic Publishers, 1990).
19