Martin Doerr
Reference Information Acquisition and Coordination
February, 1997
Martin Doerr e-mail:
[email protected]
Institute of Computer Science Foundation for Research and Technology - Hellas Submitted for ASIS ’97 Washington November 1997
February, 1997
1
ICS-FORTH
Reference Information Acquisition and Coordination
Abstract
Reference information, i.e. thesauri and any other kind of authority data, are critical for the intellectual access to the growing heterogeneous digital collections in wide area networks, be it to enhance precision and recall of information retrieval systems or for the manual classification of "non-verbose" objects as museum artifacts, images and music. These data sets become huge semantic networks, which are better produced by multiple experts world-wide in an appropriate CSCW environment. We present the core data structure of such a system, which defines a suitable granularity of information to handle simultaneously multiple versions, opinions and multilingual data structures. Building on the experience from previous work on thesaurus management systems and respective co-operations, we propose an architecture for the overall coordination system. Besides technical issues, questions on required international cooperation practices and standards emerge.
ICS-FORTH
2
February, 1997
Martin Doerr
Reference Information Acquisition and Coordination February, 1997 Martin Doerr e-mail:
[email protected] Institute of Computer Science Foundation for Research and Technology - Hellas
1. Introduction Reference information, i.e. thesauri and any other kind of authority data, are critical for the intellectual access to digital collections. It is more and more recognized, that the precision and recall of conventional statistics based information retrieval is considerably improved, if the search terms and words found in texts are enriched, normalized or interpreted by various algorithms through reference information [Kri93],[Spink94]. This was confirmed e.g. in recent user studies as [Spink96],[Braj96]. Spink stresses the importance of access to domain knowledge and the value of query expansion techniques. Brajnik experimented with on-line access to a thesaurus for query formulation. The opposite direction, i.e. using statistical methods to find classification terms from a controlled vocabulary, is followed e.g. by [Plau95]. Besides confirmation from research, industrial IR products and systems use thesauri (e.g. MISTRAL-x, product of Bull, France).
High quality reference information is even more critical for the manual classification of "non-verbose" objects as museum artifacts, images and music [Cons95],[Smea96]. They can be regarded as "raw data" [Oom93],[Chan92], which acquire their meaning by human interpretation. The meaning may vary strongly with the aspect of research, as people put stress on different features. Quite often the relevance of these objects lies not even in their nature, but in their relation to events in the past, which can hardly be derived by automated means. Images may illustrate parts of things or phases of processes hardly recognizable as a whole in any one of these. In these cases, we would like to capture formally the relevant context and background knowledge to correlate objects with for retrieval purposes [Cons95]. Semantic formal knowledge structures can be used as well to create surrogates of semantic contents of texts, images and video with domain knowledge for indexing, as e.g. [Megh95,96], [Oom93]. A slightly different approach based on text grammars, i.e. a linguistic analysis, is presented in [Rama93]. These formal structures can be matched with reference information available as equally developed knowledge structures [Smit89], most current systems are however still less advanced. Topical subject thesauri and data structures as the Thesaurus of Geographical Names [TGN] and Union List of Artist Names [ULAN94] are a good start for such domain knowledge or reference data sets, serving a multitude of retrieval paradigms simultaneously. We expect, that these data will grow considerably in size and be structurally refined and enriched with respect to current standards.
In distributed environments, three factors will from now on increase considerably the importance of the reference information:
the sheer amount of data the diversity and specialization of the logical organization of collections the differences in language and intellectual approach.
February, 1997
3
ICS-FORTH
Reference Information Acquisition and Coordination
The greater amount of data requires more specialized and precise terms in order to retrieve the same small and tractable answer sets. If we issue a query simultaneously against a heterogeneous federation of collection management systems, logical diversity and domain specialization will not allow us to find valid terms easily. It will be necessary to merge general and special indexes into large networks, which will allow to identify for each collection the terms or descriptors wich is compatible with the one in the initial query. The differences in language and approach require structures mediating between various sets of concepts and terms. A formalization of such structures is the [ISO5964]. See also on this issue [Doer96].
The respective necessary reference data sets become huge semantically interconnected networks, which should be better produced by multiple experts from distributed sites world-wide any more. A computer supported cooperative work (CSCW) environment is needed, to gather efficiently the information, to organize it, and to control the agreement process on hundreds of thousands of items in a distributed expert group [RFC95], which in turn should find the agreement of a wide community of users. Some of the more impressive data sets of this kind are the Art and Architecture Thesaurus [AAT94] and the Library of Congress Subject Headings (LCSH), which represent decades of data collection and semantic organization, hundreds of thousands of subjects and terms, which still do not cover specialized topics (The general coverage is however quite impressive). Larger European thesauri of the cultural domain are those developed by the RCHME (Royal Commission on the History Monuments of England), SHIC [SHIC94], the French Ministry of Culture, Direction du Patrimoine (Merimee), and the Italian Ministry of Culture, ICCD, with some ten thousand terms each. MERIMEE is one of the few larger thesauri providing multilingual linkage (to AAT, RCHME and ICCD). The need of such linkage is strikingly demonstrated, that by the fact, that many European libraries use LCSH due to lack of national subject hierarchies, which allow transparent queries on foreign libraries. The slow progress on this field motivates the utility of communication tools for that sake. Automatic translation tools are of course the ideal solution, but they depend on the existence of multilingual corpora or doubly indexed data records [Amba96], which are rare in many domains .
Whereas the difficulty to create topical subject hierarchies lies in defining uniquely fuzzy human concepts and their interrelations, reference data sets on past reality become fairly complex by gaps in our knowledge and the subsequent reasoning on truth and believe on secondary or tertiary sources, which are contradictory from time to time. Examples are the ULAN and TGN. Similar questions may emerge in other attempts to project with electronic means an image of the past, as e.g. the Survivors of the Shoah Visual History Foundation in Hollywood, CA, or the Foundation of the Whole Hellenic World (FWHW) in Athens, Greece. The associated problems do not question at all the sense and success of these undertakings to render the best possible knowledge of the past. Characteristically, source information and conclusions of the editors of reference information must be kept permanently in correlation for eventual reinspection. In the widest sense, this holds for all encyclopedia editors as well. [Stre89] for example provides a very detailed analysis of argumentation during the authoring process and its formal capturing in hypertext systems as means of knowledge production and communication.
These aspects widen the meaning and application of reference information. We can identify three main purposes:
to classify a collection record with agreed on, true or unique retrieval.
terms or identifiers for direct
to identify in an arbitrary source the possible identities of ambiguously referred concepts, names, items or events. to study the assembled knowledge as information source in its own right.
ICS-FORTH
4
February, 1997
Martin Doerr
A more detailed analysis of these aspects can be found in [Soer96].In the following we shall present an overview of conceptual models actually employed in real applications or discussed by our group, to capture these kinds of data and to cater for their maintenance within an integrated system. We present these models only by key examples, as they are necessary to understand the principle. We further propose an architecture for an overall coordination system with a process model driven control.
February, 1997
5
ICS-FORTH
Reference Information Acquisition and Coordination
2. Data Structures 2.1 The Semantic Index System We employ an object-oriented semantic network database called SIS (Semantic Index System, [Cons93]) for storage and maintenance of formal reference information and other KR tasks. It is a product of the Institute of Computer Science at FORTH, Greece (http://www.forth.gr). It implements an interpretation of the data model of the knowledge representation language TELOS [Mylo90], omitting logical rules evaluation. As a semantic network, it has set-valued "fields". All "fields" are actually implemented as links between the entity instance and a value. They are objects in their own right, and the “field name” is interpreted as the class they belong to. They can be classified exactly as the classes themselves. A comprehensive discussion of the differences between Relational, object-oriented and Semantic models can be found in [Schn93]. The distinct features of the SIS from most other semantic nets are multiple instantiation, multiple inheritance, an unbounded hierarchy of extensible schema and metaschemata, orthogonally for nodes and links. In particular it means, that a link can be simultaneously under more than one "field", and links may have links themselves.
These features allow to express the structures discussed here in a compact and elegant way. The SIS is an efficient DBMS highly optimized for these structures. Some major applications incorporating or dealing with reference information are so far:
SIB : A software static analyzer and repository of industrial strength [Cons95b] CLIO: A system to capture the cultural and historical context of museum objects for professional documentation purposes.[Cons94] The VCS Prototype: An experimental system for the AAT and ULAN data developed in cooperation with the Getty Information Institute [Doer96b]. The AQUARELLE folder server and multilingual thesaurus server [Aqua96], [Chri97].
We use in the following discussion the characteristic data modeling structures of TELOS. Nevertheless, equivalent schemata can be implemented in Relational and o-o systems, eventually loosing some integrity constraints and performance. Simply all links or labeled arcs in the figures displayed here have to be implemented by intermediate tables.
2.2 Thesaurus structures Monolingual thesauri are typically formatted following [ISO2788], which is the semantically richest standard. A similar format is defined in the [USMARC]. The [TEI] prepares for yet another standard. [Soer95] shows however in much detail, that users indexing manually for high precision may need considerably richer structures. We support this opinion, but for reasons of simplicity we use in the following examples typically the structural analysis as employed in the AAT.
Fig.1 shows the interpretation of the more detailed notions of the AAT as isA hierarchy ([Doer96c]). We distinguish “ThesaurusConcepts” from “ThesaurusExpression”. The latter are determinators of any kind for the concepts, typically linguistic forms. The practice, to name concepts with “good” linguistic descriptors, mixes both notions on the level of the AAT descriptors. Terms can be preferred or nonpreferred . The latter are not identified with concepts. "HierarchyTerm" are those carrying narrower terms. "Descriptor" are those to be used as controlled vocabulary. The prefix “UK” denotes the British English linguistic forms.
ICS-FORTH
6
February, 1997
Martin Doerr
Another approach could be, to replace Descriptors by neutral concept identifiers, regarding them as language-neutral concepts in the sense of Pinker's mentalese [Pink94], and to implement "guide terms" as a system of classes on top of the terms and outside of the hierarchy. This would result in a clearer separation between concepts ("mentalese"), language (terms and words) and organization (facets, hierarchies and guide terms).
ThesaurusNotion
ThesaurusExpression
ThesaurusConcept
Term
HierarchyTerm
Name
Non-PreferredTerm
PreferredTerm
GuideTerm
Descriptor
AATDescriptor
UKAlternativeTerm
AbandonedTerm
TopTerm
DeletedTerm
PlaceType
UKDescriptor
UsedForTerm
AlternativeTerm
SourceTerm
UKUsedForTerm
VariantTerm
Role
Figure 1 : Term Hierarchy
Fig.2 shows our interpretation of the links as used in the AAT. Narrower terms are understood as inverse of the broader term (BT). We have defined a metacategory "RT_types" on top of the related Term (RT), which allows for a controlled extension of “RT” into different kinds of RT-like relations, as required for example in [RFC95]. Typical examples are producer-product and product-production relations. In the same way, all other basic relations can be generalized and subsequently specialized in a controlled manner as e.g. BT into BTG, BTP, BT (see [ISO2788]), and others. This allows to smoothly develop a thesaurus into more complex KR models, without loosing compatibility with the standards. It means, if we want to exchange data with a system based on simpler standards, we flatten these relations down to the base types.
February, 1997
7
ICS-FORTH
Reference Information Acquisition and Coordination
related_to
ThesaurusNotionType M1_Class S_Class date
ThesaurusNotion
ThesaurusExpression
Telos_Time
scope_note
ThesaurusConcept
hyperText _note xing ource inde ted_in_s illustra ource preferred_in_s
Term
approved
source _term BT
Source
Telos_Time SourceTerm
AlternativeTerm
HierarchyTerm T AL RT
UF UK
AATDescriptor
UsedForTerm
UKDescriptor
UKA LT
UK UF
UKAlternativeTerm
SPLIT_into
UKUsedForTerm
AbandonedDescriptor
Figure 2 : AAT Term relations
The AAT accepts only one broader term (monohierarchy), used-for terms are unique and assigned to one descriptor only. Our models, as most new thesaurus support software, have no built-in cardinality constraints. Polyhierarchies are for instance required in [RFC95]. To our opinion, any kind of synonym or non-preferred term should apply to more than one concept, as in any dictionary. This separates better the conceptual from the linguistic level, as words apply to many concepts, depending on the context. Steven Pinker[Pink94] explains in more detail this “economy” of language. A realistic synonym representation is obviously useful for thesaurus supported full text retrieval techniques. Only the "controlled term", the identifier of the concept, should not apply to multiple concepts.
Multilingual thesauri are typically formatted following [ISO5964]. This standard is rather bilingual than multilingual, as it defines binary relations only. It basically links concepts between two thesauri, assuming a very narrow relationship between concepts and language. Actually, it applies to different thesauri in the same language as well, as these tend to illustrate similar facts from different perspectives using different concepts. See e.g. applications of the SHIC, which is based on human activities, and the AAT, which is more morphological oriented, on classifying museum artifacts . On the MDA Workshop on Terminology in Oxford, Sept 11-13, 1996 the audience agreed, that more detailed links have to be defined for interthesaurus linkage in order to enable transparent query processing across languages. A more detailed discussion of thesaurus linkage in distributed systems can be found in [Doer96].
2.3 Thesaurus structure implementation In order to keep in one logical database all thesauri and links distinguishable, we put each thesaurus in its own subschema by isA relations between the classes and link types of the generic schema as shown in fig.3. Due to the properties of the isA relation, the different thesauri can be accessed regardless of their different origin, or specifically. A metacategory of interthesaurus links (fig.4) is defined, which is specialized down into the [ISO5964] notions of equivalence between the language pairs represented in the database. In addition, we define "common sense" translations of concepts, i.e. a list of non-preferred synonyms in another language, which are independent from the existence of a respective concept definition in a thesaurus of that language. This model is implemented for the AQUARELLE project on the SIS as thesaurus server [Doer97]. At present, we have loaded the AAT and the French MERIMEE thesaurus with its links to the English AAT and the English RCHME. We intend to load other European cultural thesauri as well.
ICS-FORTH
8
February, 1997
Martin Doerr
This novel approach makes extensive use of isA relations between inherited fields and use of a schema, which can be extended at run-time. Furthermore, the multiple instantiation feature allows to put one term in more than one class without creating a new record, like an additional flag. In practice, we put a term, which appears in two thesauri of the same language, in both subschemata simultaneously, a kind of trivial merge. The merging can be improved by removing spelling variants, if wanted. As a consequence, we may see on a common concept identifier or on a common term, synonym or not, all links made by all thesauri. Each link is characterized by the respective subschema where it comes from, such that no confusion about its origin arises. This feature can be very helpful for development purposes, as it provides complete overview over the local situation. Terms of different languages are regarded as a priori alien and are kept separate. According to the subschema selected as view, thesauri can be seen isolated or linked with equal performance.
February, 1997
9
ICS-FORTH
Reference Information Acquisition and Coordination
A more "conservative" approach would be, to regard all terms and links in all thesauri as a priori alien. Then we can use one fixed generic schema. In that case, we loose the viewing mechanism, and the trivial merge. This is counterintuitive, as terms usually are not the invention of the thesaurus editor, but of the language group (or social subgroup), and hence are expected to reappear in different thesauri. In this model, we have still made no attempt to identify the common conceptual level between the languages. As a consequence, each new thesaurus must be linked with all others, a nearly intractable problem, if all European languages would have thesauri loaded.
In [Doer96] we have proposed an approach to merge the concepts, eventually by introducing artificial concepts to describe overlaps between partially equivalent terms, and leaving concepts that are not in use in some language simply unnamed in that language. These concepts may appear nevertheless in the form of scope notes, but they may be suppressed in the language specific view. As the BT relation is transitive, these gaps in the hierarchy can be bridged by the next not suppressed broader term. To our opinion, this comes closer to Pinker's mentalese [Pink94] and to the way in which humans understand multiple languages.
2.4 Modeling opinions In all our models of reference data sets, we characteristically abandon a model of an associated record of a term or a "term detail". The term is not regarded as a part of a greater information unit, a product of the thesaurus editor or proposer of a term. We regard terms as a priori meaningless linguistic identifiers and we regard concepts as "atoms" in our mind, without reasoning who made them and when. They just exist. Obviously we register in a thesaurus well established notions. They should be old relative to our current reasoning on them. The relevant knowledge lies in the claims that connect terms with concepts and people. These are expressed by links in our model. Links cannot exist without the nodes they connect. Nodes can exist alone. E.g. "dolls, AAT_BT : figurines" can be read as: "The AAT team claims, that the broader term of the concept dolls is figurines", whereas I may claim, that the broader term of the concept dolls is toys :"dolls, Doerr_BT : toys". (For simplicity, I take the AAT descriptor "dolls" as concept identifier).
ICS-FORTH
10
February, 1997
Martin Doerr
The reasoning presented here is very close to Toulmin’s microarguments, as interpreted in [Stre89,92], with the difference, that we use contentless nodes as primitives. The different types of arguments in [Stre89] as “so”, “contradicts” etc. could be used in our framework without any problem. I.e. we separate in our model entities from relations, and the entities have no inner state or values. About entities I can only argue if they exist or not. We may model that linking the names of those supporting the existence or non-existence to the entitity, or denote the fact by a class as in fig.1, “DeletedTerm”. All other information is expressed by linking the semantic entities with each other. Opinions of groups and versions are expressed by types of links( fig 5) . Individual opinions can also be expressed by linking the name of the individual to links instead of creating a link type, in case many individuals are involved. For concepts and terms, reasoning about existence is typically not needed.
Recreational Artifacts
Visual Works
IN
IN RT
Figural Toys
BT
Baby Dolls
BT
BT
RT BT
Recreational Dolls
BT
Figurines
NoBT
Dolls
BT
BT
Paper Dolls
Kachina Dolls
BT Fashion Dolls
AAT Opinion Someone’s Opinion
Figure 5 : Classifying links by opinion
This approach can be extended to statements about historical discrete entities: persons, organizations, names, events, dates, objects, places.[Doer96c] If we regard places as pure coordinates, no existence question on places, dates, and names arise. Rather, we deal with claims as "this person was called Leonardo; this person's birth was in Italy; this person was seen at the meeting etc.". For a more detailed discussion of modeling historical notions in the SIS see [Dion94], [Chri95].
Summarizing, we can model a large class of reference information as a network of links between a set of independent entity identifiers. Personal or groupwise opinions are modeled by marking links and in some cases entities with the claiming person or group. The use of link types is efficient for filtering on opinion groups and versions, individual links on links may be useful for handling massive numbers of different persons expressing their opinions in one database. Versions are just seen as a change of opinion of the database maintainer. This normalization allows for an approach to the management of opinions orthogonal to the objects of discourse. Given this model, we need only three rules to ensure semantic consistency in a cooperative environment:
Entities are not deleted. Instead, a link or class marks a person’s opinion about its non-existence.
All links are marked by a claiming person or group. Non-marked links are the opinion of the database maintainer, if such a “super role” is wanted.
Referential integrity is maintained.
Garbage collection is not excluded by these rules. Entities may be physically deleted, if no positive information in any supported version is around. In the sequence, we can embed the reference data set
February, 1997
11
ICS-FORTH
Reference Information Acquisition and Coordination
production in a cooperative environment. For recent solutions of concurrency and semantic integrity in such systems see [Waes96],.[Deco96], [Deco95].
2.5 Submissions The next question is, how to manage the incoming data. It may be single facts or larger parts of networks. Before incoming data can be added, even though there source can be sufficiently marked in the final product, several processing steps have to be done. These may be control of quality, transcriptions, identification of referred concepts with respect to the reference information base etc.
24.12.94
at
by
submission1
M. Doerr by
r _fo
es pos pro
submission2
dolls
as_ NT
for es_ pos o r p as_ UF
teddy armonicas bears
teddies
at
24.12.95
Figure 6 : Proposal of links and implicit nodes, example
Lets regard first the case of a single incoming fact. Organizations, that produce reference data sets typically maintain an open network of contributors, which submit facts as proposals for the reference data, as the Getty Information Institute [RFC95], ICCD and others. We capture a submission as an event, i.e. a node comprising a reference to the submitter, the proposition, and a date. As discussed above, the proposition refers two entities as being related by a certain kind (fig 6). We require one entity at least to preexist in the reference data, the other may be new. Subsequent submissions may refer to entities established by previous submissions. In that way, a submitter can propose completely new information chains. As a special case, a submission may propose the non-existence or non-use of an entity. This model keeps the proposed link out of the data until it is accepted by the editor. It allows submittors to enter the data directly in the reference database without disturbing the editor. The editor is relieved from finding the identity of the preexisting node, as the submitter makes a direct reference to it within the database. The final acceptance is expressed by the editor drawing a direct link between the referred entities. The characteristic triangle is shown in fig 7. This model is also convenient to attach atomic processing information for each fact proposed, as discussions, expert opinions, votes between alternatives and other procedures. We have demonstrated this approach in a prototype implementation for the AAT in cooperation with the Getty Information Institute. Collection management systems can be regarded as “clients” of a reference data set (or authority), if they embed data from it in their records as controlled vocabulary. Many of those like to enrich the given authority with “local terms”. The submission information could be quite well automatically generated from these local terms with a least overhead for all involved parties.
ICS-FORTH
12
February, 1997
Martin Doerr
date
at
Submitter
submitted_by
because_of
Citation
Submission
se po pro
AATDescriptor
for s_
as _
has UF
UF
(Link proposal)
Term
(Link confirmation)
Figure 7 : Sample schema of link submission and confirmation
An alternative approach is, to allow one submission to refer multiple facts. We model that by linking links to the submission event. I.e. instead of making a triangle around the intended semantic link as in paragraph 2.5, we introduces the new links right away (fig 8) and connect them to the submission event. The latter points to the proposer, whereas in paragraph 2.4 we discussed a direct link between the object of discourse and the proposer (see also [Stre89]). The model here allows however to do reasoning on the submission as a whole. This model reflects the logical structure of the ULAN database edited by the Getty Information Institute. In contrast to the AAT, which talks about concepts at present in our heads, historical information is believed in units from other sources, often historical themselves, as long as no contradiction arises. Editors of the ULAN point out the need to refer to sets of propositions as they were found together. For example, one source gives a birth date and place together, whereas another refers a date only, and a third a place only. The latter two may talk about controversal opinions indeed. We cannot simply put both information together. The latter issue raises interesting questions about mapping opinions on opinions on opinions etc. that exceed the scope of this paper [Schu97]. We can of course introduce more and more indirections between sources of information and the objects of discourse, introducing more and more overhead for the maintenance of simple cases. In practice, database maintainers must decide on an optimal balancing between the formal representation of reasoning in the database and the associated overhead. Databases are basically search engines. To our opinion, all information that will not be searched for should go better into free text fields.
February, 1997
13
ICS-FORTH
Reference Information Acquisition and Coordination
M. Doerr
Submission1
Theotokopoulos
Fodele
n_in bor
Submission2
at _th
s im cl a
n _i rn bo
agrees
C. Georgis
claims _that cla im s_ tha t confirms
has_name
Editor
El Greco
claims_that
Unknown Place
Figure 8 : Submissions on multiple facts, example
2.6 Fragment exchange More importance has the exchange of larger pieces of reference information well studied by different expert groups. Without going too much into detail here, we can easily see, that the above principles can be applied analogously. As we have achieved a model, were multiple opinions can coexist, until contradictions are resolved, and a merge can be created just as yet another opinion, the hard question that remains is the identity of the entities referred. In other words, do we talk about the same thing at all, or do we talk about different things and mean the same? Typically this process is time consuming and errorprone, when reference sets from isolated teams are to be merged. The same or similar concepts may be named differently and may be “buried” deep in alien hierarchiest, or terms and names are referred with different spelling or just inverted as “abbeys, byzantine” and “byzantine abbeys” (preor postcoordination).
Even if the teams cooperate and read the same shared resources in different versions, the identity of a concept may become ambiguous. A sensible procedure could be, that cooperating teams use global identifiers for their entities that survive renaming of concepts, but not redefinition of scope. For the human reader, the identity of an entity in a reference information network is given by the whole context, i.e. the neighbouring information in the net. Those editing new resources could read existing resources and correlate their entities directly with the identifiers there, thus preparing for an automatic merge afterwards. The SGML community currently discusses the static interchange of document fragments [TR9601], and a quite sensible identification scheme of the fragment environment has been defined. The question of updates through fragment interchange was however not addressed. As several groups plan for SGML formats for reference information interchange, these questions should be addressed. On the other side, the identification of concepts needed have by far less complexity than the issues discussed in TR9601.
Following further this line, we would ideally see, that information submission and distributed development of reference information fragments would occur in a CSCW environment right away. Various models of decision taking can be thought of in such an environment. One is that of several central and independent authorities which exchange only semantic links between their work, but decide on unique concepts within each ones scope. Another could be to hold multiple opinions in one reference set as outlined above and manage the complete system by a central committee. A third would be democratic voting by many users, realistic only for larger fragments regarding the number of
ICS-FORTH
14
February, 1997
Martin Doerr
decisions to be taken, and finally we could imagine to take directly statistics of use from client collection management systems, and interpret these as votes, as long as they are consistent. All these cases have enough in common to discuss a general architecture.
3. Architecture 3.1 Organizational environment The system as envisaged here, let’s call it Reference Information Coordination System (RICS), intends to connect three main actors. Those are: The end-users, which seek information on collection management systems of any kind, be it libraries, digital libraries, museum collections, product information systems. The collection management system maintainers, librarians and documentalists, and eventually authors themselves, which use reference information to create valid data records in the system. The reference information editors and maintainers, which continuously improve the reference data sets.
Fig. 9 symbolically shows their cooperation (see also [Doer96b], [RFC95]). End-users and collection maintainers identify in their daily work information missing in the reference data sets in use, or shortcomings in its internal organization. Missing information may or may not be stored for local use and be directly or at appropriate interval communicated to the editors. The editors provide new versions, which may be directly used by the end-users. The collection maintainers will in general need a certain time to update the data fields with the new or better notions in a semiautomatical process (for more details see [Doer96]). The editors will maintain communications with various domain experts to validate the proposed information, besides their own research in literature.
3.2 System components The overall system architecture is shown in fig.10 as a block diagram of the major components. Arrows denote data and control flow. Lightnings denote wide area communication. In the following, we give an overview of the RICS components in terms of basic tasks and communications. The system may become distributed, where different sites work on logical subsets of the reference data, but share the
February, 1997
15
ICS-FORTH
Reference Information Acquisition and Coordination
whole. In that case, the complete system is installed at each editors groups. The Alliance editor for instance provides an interesting cooperative data sharing paradigm [Deco96], principles of which could be applied here as well. The system components are :
Synchronization module
Discussion Communicator
Editors Client
Client Database Communication
Fragment Preprocessing
Discussion Communicatior
Process Controller
Discussion Log
Submission Control
Online Submission Interface
RICS
Publishing Client
database
Online Reference Tool
Off-line Submission
Off-line Reference Tool
submission control communication
Figure 10 : RICS System Architecture
(1) The RICS Database It consists of three logical parts : It holds and manages the reference data, the process configuration and data, and associated documents. The latter are archives of communications, text, images and graphics illustrating concepts. The three parts may be implemented on a single database, or in a heterogeneous system. As outlined above, the reference information is optimally maintained in a semantic net, whereas the documents are better held in specialized text/image bases. (2) Submission Control This module is responsible for the validation and registration of the submitted data to the RICS. It may provide statistics on the submission activities. It is responsible for rejecting inconsistent data. It also registers all submitters and checks their permission for the appropriate operations. In particular, it may enforce a cooperative synchronization protocol on data parts, if the system is going to be distributed. It is under the control of the Process Controller (see below). Several subcomponents undertake the actual data entry: (2.1) Online Submission Interface This subcomponent is responsible for interactive data entry of candidate information in small units. There could be two versions, one for a local area network, with rich prompting and display of relevant current database contents, and an optional WWW-version for low-bandwidth communication. The LAN version can also be used for typing in proposals send on paper. (2.2) Client Database Communication This is on one side a backend to receive directly local terms or other candidate reference information from client collection management system. The basic idea is, that this information is already linked in a valid way to the reference data set identifiers, and that it has undergone already a local quality control.
ICS-FORTH
16
February, 1997
Martin Doerr
On the other side, there should be a mechanism to update in an efficient way the used terms in these databases after a new reference data set release will be developed. This can be based on explicit semantic links between versions. (2.3) Off-line Submission Tool Such a tool may be distributed to end-users for remote submission. It will enable them to compose machine-readable submissions in a valid format to the Submission Control module of the RICS database. Information transfer is possible through e-mail and/or fax with optical character recognition (OCR). (2.4) Fragment Preprocessing This component is responsible for the pre-processing of larger data fragments before these are fed into the RICS database system. This will typically be necessary for bulk loading of external reference sets for merging. It should be able to accept data in a variety of formats, SGML, MARC, HTML, or other tagging systems A sensible procedure would be, to use tools for transforming structured documents into SGML format for that purpose, and do further processing in SGML form only. In the sequence, matching algorithms for terms and concept identifiers can be run between the candidate sets and the database. Matching can be by spelling variants, linguistic processing, and structural position in the reference set. As this is open-ended, an open interface has to be foreseen to attach new algorithms. (3) On-line Reference Tool This should be a GUI for the end-user to access directly the latest version of the reference data set, preferably as an HTML Web-browser. It can be used as assistance for proposal submission. (3.1) Off-line Reference Tool A usual browser for the electronic edition of a reference data set, typically on CD-ROM. (4) Editors Client The editors client consist of two subsystems: The first is a menu-guided system, which proposes the appropriate steps, actions and data options to deal with the incoming data stream. It is subordinated to the Process Controller. It takes over valid submissions, and by it approved concepts or terms for publication are created. It should be configurable to accommodate the editor team’s internal task distributions and permissions. This is the task of the Process Controller, which synchronizes its actions with the other tasks in the system, and with other cooperative installations. The second subsystem supports the structural evolution of the link types and logical data organization as hierarchies and other groupings and the overall quality maintenance. (5) Publishing Client This component may be regarded as a part of the editors client as well. We treat it separately, because its processes are asynchronous with the others. It allows the editors to define consolidated parts of the vocabularies to be published. This includes version and configuration control, quality and presentation style decisions. It should be possible to published products in various standard formats. Unfortunately, good world-wide standards are still missing, and moreover competing ISO standards exist. (6) Discussion Communicator This module is a server which is transparent to the end-user. It sets up messages, sends them according to the configuration and process parameters via e-mail or fax, receives messages and distributes them to the appropriate processes. It holds folders and log files of the communications. The messages may refer to discussion subjects between experts and the editor group, or may acknowledge submissions and
February, 1997
17
ICS-FORTH
Reference Information Acquisition and Coordination
inform partners about their acceptance. Appropriate partner servers could be installed at collaborating remote sites requesting such a service. This module is subordinated to the Process Controller . (7) Synchronization module This module is a central unit needed only, if the system works distributed as a cooperative environment, where multiple installations work half-autonomously on parts of one reference data set. In particular, this may mean no more than the cooperative creation of [ISO5964]-style links between two independent thesauri. This module exchanges data access control and update information between the collaborating teams. (8) Process Controller This module is responsible for the integration of the RICS processes, i.e. the workflow of Submission Control, Discussion and Editing processes, as well as the Publishing Control module. It supports the states of the RICS processes, asks for decisions, and invokes system components. It is the “head” of the system. It should follow a widely configurable process model stored in the central database. Its services are persistent, i.e. they recover from system shutdown or crash at the point they were interrupted based on states stored in the database. They must be “multithreaded”, i.e. they initiate processes at other sites in a controlled manner and wait for their termination at suitable points (eventually sending reminders or canceling overdue inputs) They must interleaved with and controlled by human decisions. In a distributed solution, the process controller communicates with the synchronization module as well. In the VCS prototype, we have installed a trial process model [Doer96b], the detailed description of which would exceed the framework of this paper. A similar model is referred in [Waes96]. The basic idea of our model is the integration of workflow control data, configuration data and the reference data in one semantic net, exploiting the metamodeling capability of the TELOS/SIS data model. [Maid95]. On the data level (token level in TELOS terminology), we hold the reference data, the process events and organization descriptions. The above described submissions are a characteristic example of process events, and their connections to reference data and description of actors in the system. The schema level (Simple Class level in TELOS) holds the reference information schema and the configuration of the processes, i.e. the plan of optional and necessary steps. It contains notions of subprocess (thread) creation, and synchronization wait states We exploit here the run-time extensibility of the SIS schema. The meta level (M1 Class level) contains the activity types and formalizes the general relation to the reference data. The use of the TELOS data model allows for a uniquely elegant and compact formulation of process models and data. Of course it can be implemented on other databases as well. A definite operational advantage is achieved, when organizational data , i.e. roles and actual persons filling out roles, are integrated with process data and the processed data, as the system can react by itself on changes of personal and roles in a subject-specific way. Such models are referred as “product-oriented” or “decision-oriented”, in contrast to plain “activity orient” process models [Schm93].
3.3 System compatibility The architecture proposed needs of course much refinement, depending on the more specific requirements of the respective user group. Nevertheless at the state of the art, the implementation of such a system can be regarded as a riskless development task. A more interesting question is the actual installation and acceptance in the perceived user group. As pointed out in the beginning, the system gets its value if it attracts a large contributor group. This means, that one should care for compatibility with existing and future systems by different providers. The envisaged international cooperations can hardly be based on single software solutions. Another problem is, that many organizations are not willing to do higher investments for reference data, as they are regarded as auxiliary, with a hard to define return of investment. We see so far three levels, on which open interfaces and standards could be defined: reference data formats, interoperability of collection and reference information systems and communication protocols on data contents. Whereas the latter requires some hands-on experience on running systems of that kind, enough effort is done already to specify SGML standards for reference data interchange. These
ICS-FORTH
18
February, 1997
Martin Doerr
efforts concentrate however mainly on complete thesauri of topical subjects, and the problem of consistent fragment interchange and partial update unfortunately is not yet regarded.
4. Conclusions We have presented a way to model a large class of reference information based on a semantic network technique with embedded argumentation of the information provider. We have shown by discussion of several alternatives, that the approach is flexible enough to be adapted to different user and system requirements. These models define a suitable granularity of argumentation, on which a consistent cooperative environment and workflow management can be built, one of the key issues of cooperative environments. [Deco95]. The granularity chosen is relatively high, but absolutely appropriate to the information density of reference information sets. In the larger communities to use such systems, global authoritative decisions on certain structures become more and more difficult. These models imply the chance, that more than one opinion coexists in the final product. This does not hinder the functionality as reference information, as long as the semantic relationships between the opinions are explicit in the system. Thus the system acquires a “democratic touch”, which may be crucial for its acceptance in the future. We have discussed an architecture in general terms, in order to assess the feasibility and complexity of a reference information coordination system. Given the state of the art of cooperative systems, the major problems seem to lie more in social questions of defining communication and agreement processes in large communities and the necessary standards, than on the technical side. We have discussed the importance of reference information in the future large federations of information systems, and the advantage of the software support as proposed to create these data at lower cost. As no such system is on the market, we would like to encourage other interested groups to join this effort on an international scale.
February, 1997
19
ICS-FORTH
Reference Information Acquisition and Coordination
5. References [AAT94] : Introduction to the Art & Architecture Thesaurus, Published on behalf of The Getty Art History Information Program, Oxford University Press,1994. [Amba96] S.Amba, N. Narasimhamurthi, K.C. O’Kane, P.M. Turner. Automatic Linking of Thesauri, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 18-22, pp181-187, Zurich, August 1996 [Aqua96] The AQUARELLE project ,TELEMATICS Application Program of the European Commission, Project IE-2005 1996, (http://.aqua.inria.fr) [Braj96] G.Brajnik, S.Mizzaro, C.Tasso. Evaluating User Interfaces to Information Retrieval Systems: A case study on User Support . Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 18-22, pp120-127, Zurich, 1996 [Chan92] S.K. Chang, Arding Hsu, Image Information Systems: Where Do We Go From Here?, IEEE Trans. on Knowledge and Data Engineering, Vol. 4, No. 5, October 1992. [Chri92], M. Christoforaki, P. Constantopoulos, M. Doerr. CLIO: An Object-Oriented Model of Cultural Data, Technical Report ICS-FORTH.MUIS.92.1, Information Systems and Software Technology Group, Institute of Computer Science, FORTH, Heraklion, Crete, Greece, November 1992. [Chri95], M. Christoforaki, P. Costantopoulos M. Doerr, "Modelling occurences in cultural documentation" , III Convegno Internazionale di Archeologia e Informatica, Roma 25 November 1995. (http://www.ics.forth.gr/proj/isst/Publications/articles.html) [Chri97], V. Christophidis, M. Doerr, I. Fundulaki. The specialist seeks expert views Managing digital folders in the AQUARELLE project. To be published: Museums and the Web, Los Angeles, March 15-19, 1997 [Cons93], P. Constantopoulos, Martin Doerr. The Semantic Index System: A brief presentation, Working Paper #6, Information Systems and Software Technology Group, Institute of Computer Science, FORTH, Heraklion, Crete, Greece, August 1993. (http://www.ics.forth.gr/proj/isst/Systems/SIS/index.html) [Cons94], P. Constantopoulos: Cultural Documentation: The CLIO System. Technical Report FORTH-ICS/TR-115, 12 pages, January 1994. (http://www.ics.forth.gr/proj/isst/Publications/TechnicalReports.html) [Cons95], P. Constantopoulos, M. Doerr. An Approach to Indexing Annotated Images, Multimedia Computing and Museums, Selected Papers from the Third International Conference on Hypermedia and Interactivity in Museums, by David Bearman, pp278—298, San Diego, CA, USA, October 1995. (http://www.ics.forth.gr/proj/isst/Publications/articles.html) [Cons95b] P. Constantopoulos and M. Doerr, Component Classification in the Software Information Base, in O. Nierstrasz and D. Tsichritzis, eds., Object-Oriented Software Composition, Prentice-Hall, 1995. [Deco95], D. Decouchant, M. R. Salcedo, Structured Cooperative Editing and Group Awareness, HCI International'95, 6th International Conference on Human-Computer Interaction, Y. Anzai, K. Ogawa and H. Mori, ed., pp. 403-408, Elsevier Science, Yokohama, 9-14 July 1995. [Deco96], D. Decouchant, M.R. Salcedo. Alliance: A Structured Cooperative Editor on the Web, Proceedings of the ERCIM workshop on CSCW and the Web, Sankt Augustin, Germany, February 7-9, 1996 [Dion94] I. Dionysiadou, Martin Doerr, Mapping of material culture to a semantic network,1994 JOINT ANNUAL MEETING of the CIDOC/ICOM and MCN, Washington USA, August 1994. (http://www.ics.forth.gr/proj/isst/Publications/articles.html)
ICS-FORTH
20
February, 1997
Martin Doerr
[Doer96], M. Doerr: Authority Services in Global Information Spaces. Technical Report FORTH-ICS/TR-163, February 1996. (http://www.ics.forth.gr/proj/isst/Publications/TechnicalReports.html) [Doer96b], M.Doerr, Terminology Management on the Semantic Index System, Presentation at the Getty Workshop on Semantic Systems, Los Angeles, CA, October 10-11 1996. Slides to be published at the Getty Information Institute Web site, (http://www.gii.getty.edu). [Doer96c], M.Doerr, M. Christoforaki. The Getty AHIP VCS Schema. Technical Report VCS.FORTH.96.1, Institute of Computer Science, FORTH, Heraklion, Crete, Greece, March 1996. [Doer97] M.Doerr, I. Fundulaki:A conceptual model for the representation of multiple interlinked thesauri, Working Paper #26, Institute of Computer Science, Foundation of Research and Technology-FORTH, March 1997. [ISO2788] : Documentation - Guidelines for the establishment and development of monolingual thesauri, International Organization for Standardization, Ref. No ISO 2788-1986, 1986. [ISO5964]: Documentation - Guidelines for the establishment and development of multilingual thesauri, International Organization for Standardization, Ref. No. ISO 5964-1985, 1985. [Kris93] Jaana Kristensen. Expanding end-user’s query statements for free text searching with a search-aid thesaurus. Information Processing & Management, 29(6):733-744, 1993 [Maid95] N.A.M. Maiden, P. Assenova, P. Constantopoulos, M. Jarke, Computational Mechanisms for Distributed Requirements Engineering ,Proc. 7th International Conference Software Engineering and Knowledge Engineering - SEKE '95, Maryland, USA, June 1995. [Megh95] C. Meghini, An Image Retrieval Model Based on Classical Logic, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995. [Megh96] C. Meghini, A Relevance Terminological Logic for Information Retrieval, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 18-22, pp197-205, Zurich, August 1996 [Mylo90] J. Mylopoulos, A. Borgida, M. Jarke, M. Koubarakis, Telos: Representing Knowledge About Information Systems, ACM Trans. on Information Systems, October 1990. [Oom93] E. Oomoto and K. Tanaka, OVID: Design and Implementation of a Video-Object Database System, IEEE Trans. on Knowledge and Data Engineering,Vol. 5, No. 4, August 1993. [Pink94] S.Pinker, The Language Instinct, New York:W.Morrow and Co,c1994 [Plau95] C. Plaunt, B.A. Norgard. An Association Based Method for Automatic Indexing with a Controlled Vocabulary, 2 October 1995, http://bliss.berkeley.edu/papers/assoc/assoc.html [Rama93], D.V. Rama, P.Srinivasan, An Investigation of Content Representation Using Text Grammars, ACM Transactions on Information Systems, Vol. 11, No1, pp51-75, January 1993 [RFC95] Request for Comment Issued for New Vocabulary Coordination System for Getty Information Institute Authorities: Art & Architecture Thesaurus, Union List of Artist Names, and Thesaurus of Geographic Names , http://www.gii.getty.edu/gii/newsarch.html#article6, Santa Monica, January 1995. [Schm93] J.R. Schmitt, Product Modelling for Requirements Engineering Process Modeling, IFIP WG 8.1 Conference on Information Systems Development Process, September 1993 [Schn93] J.Schnase, J.Leggett, D.Hicks, R.Szabo,Semantic Data Modeling of Hypermedia Associations,ACM Transactions on Information Systems, Vol.11,No1, 1993, pp27-50. [Schu96], F.Schutte, Correlating Documents with Common Knowledge by Using Attribute Links, Technical Report FORTH-ICS/TR-186, February 1997. [SHIC94] SHIC Working Party. Social History and Industrial Classification. Cambridge: MDA, 1994
February, 1997
21
ICS-FORTH
Reference Information Acquisition and Coordination
[Smea96] A.F.Smeaton, I.Quigley. Experiments on Using Semantic Distances between Words in Image Caption Retrieval. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 18-22, pp174-180, Zurich, August 1996 [Smit89], P.J. Smith, S.J. Shute, D. Galdes, M.H. Chignell. Knowledge-Based Search Tactics for an Intelligent Intermediary System, ACM Transactions on Information Systems, 7(3):246-270, 1989. [Soer95] D.Soergel. The Art and Architecture Thesaurus (AAT): A Critical Appraisal, Resources, Vol. X, pp369-400, Malaysia, 1995
Visual
[Soer96] D.Soergel. SemWeb: Proposal for an Open, Multifunctional, Multilingual System for Integrated Access to Knowledge about Concepts and Terminology. Advances in Knowledge Organization Vol.5(1996) pp165-173. [Spin94] A.Spink. Term Relevance Feedback and Query Expansion: Relation to Design. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp81-90, 1994 [Spin96] A.Spink, A.Goodrum, D.Robins, Mei Mei Wu. Elicitations during Information Retrieval: Implications for IR System Design. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 18-22, pp120-127, Zurich, August 1996 [Stre89] N.A.Streitz,J.M.Haake, J.Hannemann, A.C.Lemke, W.Schuler, H.A.Schütt, M.Thüring. SEPIA: A Cooperative Hypermedia Authoring Environment. Proceedings of the 2nd ACM Conference on Hypertext (Hypertext’89), pp. 343-364, 1989. [Stre92] N.A.Streitz,J.M.Haake, J.Hannemann, A.C.Lemke, W.Schuler, H.A.Schütt, M.Thüring. SEPIA: A Cooperative Hypermedia Authoring Environment. Proceedings of the 4th ACM Conference on Hypertext (ECHT '92), pp. 11 - 22 ,. Milan, Italy, November 30 - December 4, 1992. [TEI] : Text Encoding Initiative : ( http://etext.virginia.edu/TEI.html) [TGN] The Getty Information Institute: Thesaurus of Geographic Names (TGN) (a database of geographic place names organized into hierarchical clusters) http://www.gii.getty.edu/gii/projlist.html [TR9601] SGML Open Technical Resolution TR9601 on Fragment Interchange, ftp://ftp.exotica.com/sgmlopen/9601 [ULAN94], J.M. Bower, M. Baca et al., Union List of Artist Names – User’s Guide to the Authority Reference Tool, Version 1.0, Getty Art History Information Program, G.K. Hall, New York, 1994 [USMARC] Format for Authority Data Including Guidelines for Content Designation, 1993 Edition prepared by Network Development and MARC Standards Office, Library of Congress Cataloging Distribution Service Washington D.C. http://lcweb.loc.gov/marc/ [Waes96] J.Waesch, W. Klas. History Merging as a Mechanism for Concurrency Control in Cooperative Environments, Proceedings of the 6th International Workshop on Research Issues in Data Engineering - Interoperability of Nontraditional Database Systems (RIDE-NDS’96), pp.76-85, New Orleans, Louisiana, 1996.
ICS-FORTH
22
February, 1997