etc.). It is not obvious which textual tag-name is the most appropriate for a particular query. The question of relevance is more based on the text than the tag-name. It is therefore probably best to leave this problem to the retrieval systems to solve.
Vitaes Vitaes () are indeed textual elements, but their semantics is different from the layout semantics of the textual tags in the previous section.
Abstracts Abstracts () are also textual elements with a slightly different semantics, since they contain a condensed description of the content of an article, and no detail information.
Bibliographical entries Bibliographic entries are a different class of answers, since they contain only references to publications, but no real “content” like the textual elements. They represent information needs such as “find references to papers about compression” or “give me all bibliographic details of publications cited within papers about compression”. Note that for example queries such as
//article[about(.,’neural networks’)]//fm//au
which says something like ”give me authors of articles about neural networks” are not considered interesting. From an IR perspective this query is equivalent to the query ”give me articles about neural networks”. The problem of extracting the authors is trivial. Therefore author names is not considered here as a natural target. The above list is based on discussion in the working group and it is not necessary complete. If topic authors find other natural target elements they are encouraged to use them.
2.2
Structural conditions
The working group distinguished three natural types of structural conditions.
This list of natural constraints must be viewed in the context of the current collection. Different collections have different information needs. For collections that have a larger variety in tag-names it is probably easier to formulate natural structural queries.
2.3
Consider for example the query //sec[about(.,’solar powered robots’) and about(.//fig,’robot on mars’)]
Co-occurrences We want certain concepts to be covered in the same unit. Say, for example we would like to retrieve documents that discuss the use of handheld computers in health care. We would like to minimize the change of getting documents that discuss handheld computers and health care separately. We could try to express this in a query that asks for articles were handheld computers and health care are discussed in the same section. //article[about(.//sec,’handheld computers health care’)]
Separation of constraints and targets
It was discussed whether the structural constraints and targets needed to be expressed in the same expression. More precisely the question was whether we should go back to the INEX 2002 notation. The main reason behind abandoning the INEX 2002 notation, was that the semantics of that notation was unclear.
Where we want the retrieved sections to contain figures. Note however, that this is perhaps not a good example for the current LATEX–originated collection, since authors often use tricks to include figures.
3.
QUERY LANGUAGE FOR INEX 2004
This section will report on the discussion within the working group about requirements for a query language. We will then outline the syntax and semantics of a query language that is currently being constructed as a future language for INEX.
Note that since we are doing IR, we do not enforce term occurrence restrictions. By co-occurrences we are referring to the co-occurrence of concepts but not terms.
3.1 Requirements
Data-types
As a query language for CAS titles, the group considered an extension of a subset of XPath. The idea is to take the current syntax extension of XPath, used at INEX 2003, but restrict the usage to an ”IR minimum” as described in [4]. This restriction in functionality supports all the important features used in previous workshops. Some queries are known to contain deprecated features and are excluded from this compatibility requirement.
Data-types are interesting for retrieval in structured documents. For this particular collection they are of limited use. They should however be considered in retrieval from semantically richer collections which contain not only layout semantics. Examples are markup for chemical processes, financial market developments and geographical locations.
Roles We want to restrict our attention to XML elements that represent a certain role; such as article author, author affiliation, etc. For example if we want articles authored by Bruce Croft: //article[about(.//fm//au,’Bruce Croft’)]
Similarly if we want articles that cite Bruce Croft: //article[about(.//bb//au,’Bruce Croft’)]
We could also restrict our attention to articles where an author is affiliated in California: //article[about(.//fm//au//aff,’California’)]
The existing syntax of CO proved adequate. Any changes must maintain compatibility with the existing CO topics.
There already exist two data types, numeric and string. This is anticipated to expand in the future to include names, units of measure, and even geographic locations. The language must be extensible to include these at a future date. Tag instancing is to be deprecated. Restricting a search to a first paragraph (p[1]) was considered unnecessary and unlikely to be used. Query 13 already uses this feature, but this query was considered contrived. Furthermore no relevance assessments are available for this query. The use of XPath axis, the plethora of XPath syntax for discussing paths, is to be limited to the descendant axis. In particular, the child axis is to be outlawed. None of the queries used so-far, relied on the usage of the child axis. The child axis can be added at any time if a future collection calls for such information need. Path filtering is to remain. Application of multiple filters is to remain.
Use of the (not)-equal operator is to be deprecated for the string data-type. All textual queries are to be expressed in terms of the about predicate. For arithmetic qualification the operators are to be limited to >, =, = and = 1998]//sec[about(., ’"virtual reality"’)]
Return sec elements of documents where the yr tag under the fm tag under the article tag is numerically greater than or equal to 1998, and where a sec tag discusses ’virtual reality’. //article[(.//fm//yr = 2000 OR .//fm//yr = 1999) AND about(., ’"intelligent transportation system"’)] //sec[about(., ’automation +vehicle’)]
Return sec elements about vehicle automation from documents published in 1999 or 2000 that are about intelligent transportation systems. We are currently working on a more detailed description of the syntax and semantics of the future INEX query language.
4.
REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, 1999. [2] K. Hatano, H. Kinutani, M. Watanabe, Y. Mori, M. Yoshikawa, and S. Uemura. An evaluation of inex 2003 relevance assessments. In INEX 2003 Workshop Proceedings, pages 25–32, 2003. [3] M. Hearst. User Interfaces and Visualization, chapter 10. In [1], 1999. [4] R. A. O’Keefe and A. Trotman. The simplest query language that could possibly work. In INEX 2003 Workshop Proceedings, pages 117–124, 2003.