Understanding and Improving Content Markup for the Web: from the perspectives of formal linguistics, algebraic logic, and cognitive science

Ladislav J. Kohout
Department of Computer Science Florida State University Tallahassee, Florida 32306-4019, USA
[email protected]
Andreas Strotmann
Zentrum fuer Angewandte Informatik Universitaet zu Koeln, Rechenzentrum Robert-Koch-Str. 10, D-50931 Koeln, Germany
The paper (i) examines the issue of compositionality in symbolic computing from the point of view of distributed computing; (ii) proposes the use of an extended relational framework for combining mathematical and conceptual nonmathematical knowledge within the context of OpenMath protocols; (iii) presents an application of computational semiotics to manufacturing.¹
KEYWORDS: Computational semiotics, compositionality, knowledge networks, fuzzy knowledge representation, semiotic descriptors, relational analysis, BK-products of relations, syntactic/semantic categories.
1 INTRODUCTION

1.1 Overview and Objectives
In recent years, languages have been proposed in several domains of expertise for exchanging semantic information (meaning) via a distributed communication network across the globe. Among the more recent developments in this field are CML (Chemical Markup Language), MathML (Mathematical Markup Language), and OpenMath, all based on XML (eXtensible Markup Language). Exchange formats for bibliographic data (UniMARC) or for geographic and geologic data (SDTS) are perhaps better established. In addition, communities of specialized fields of computer science have defined information interchange languages for actors or computer programs working on similar tasks: e.g. KQML/KIF and its accompanying ontologies, a standard description of Artificial Life, and the much older XDR that defines the data exchange level for the Internet RPC.

There is a danger in these projects that they define their respective tasks both too narrowly (visual presentation of mathematical formulas, say) and too arbitrarily (putting mathematics on the web, whatever that means), and that they become so lost in the technical aspects of the problem that they fail to see the forest for the trees. For this reason a thorough analysis of the problem of designing good content markup languages, i.e. languages for representing meaning on the web, is overdue. In order to get the full perspective, one has to look at the forest of content markup from the outside, using insights and methods from algebraic logic and formal linguistics supplemented by semiotic notions, particularly from the field of pragmatics, namely the theory of actions.

We have the following long-term objectives: (1) to put the design of content markup languages on a sound theoretical and computational basis; (2) to extend relational computational methods based on BK-products of relations and fast fuzzy relational algorithms to encompass problems of computational semiotics; (3) to apply the computational semiotics of relational computations to knowledge engineering and the design of distributed intelligent systems.

In the present paper we address the following issues: compositionality in content markup; interpretations of content markup, i.e. the use of fuzzy BK-relational products as the basis for a semantics of relational computations; and the use of extended relational semiotic models in manufacturing.

[email protected]

¹ Acknowledgment: This work has been partially supported by the National Science Foundation grant DMI-9726027.

1.2 The Method of Our Approach
Initially we concentrate on the questions involved in designing a mathematical content markup language, because that is a non-trivial sub-problem of essentially all other projects trying to communicate scientific content across the globe, a sub-task that likely covers all the major pitfalls to be encountered in other fields of knowledge. At the same time it is somewhat simpler than other fields of knowledge, because the meaning of meaning has been studied intensively in the field of mathematics, and there are extant computer algebra and computational logic programs that can handle, to some extent, mathematical meaning. Concentrating on the field of mathematics also allows us to perform an empirical study of existing software systems to supplement our theoretical analysis.

Indeed, the last two decades of symbolic computing have seen substantial progress. Available mathematical software includes special and general purpose algebra packages, symbolic integration and differentiation packages, logic theorem provers and proof checkers, libraries of numerical algorithms, visualisation packages, statistical packages, etc. In the paper "OpenMath: Communicating Mathematical Information between Cooperating Agents in a Knowledge Network" (by Abbot, van Leeuwen and Strotmann; Int. J. of Intelligent Systems, in press), an important parallel is drawn between the structural decomposition of the task of transmitting mathematical information (meaning, content) and a linguistic communication model. Extending OpenMath for communication into the fields of Knowledge Engineering and Intelligent Knowledge-Based Systems crucially depends on preserving this linguistic parallel and on extending it by incorporating semiotic notions into it. In particular, extending OpenMath into the domain of relational computations will provide a useful testbed for experiments with distributed relational computations using "semiotic descriptors" [3].
1.3 What is a Markup Language

The term "markup" is used for annotations inserted into a document in order to make explicit its internal logical structure and/or its semantic content. A "markup language" is a particular system for adding markup to documents stored as computer files. One part of a markup language is a formal syntax that allows a computer both to distinguish between markup and the original text and to build an internal representation of the document's internal logical structure as described by the structure of the markup. Another part of a markup language is a vocabulary of logical and semantic structuring concepts (e.g. "book", "chapter", "section", "title", "author", but also "sine function", "number", "species", "DNA sequence", "building", "door", "wood", "concrete", "steel", "valve", "date", "nautical miles", "US Dollar", "interest rate", and so on) that are defined in it. Different markup languages may differ in the vocabularies they define while using the same underlying formal syntax to separate annotation from data, thus allowing them to co-exist and to be used side-by-side within a single document. In this case, they each ideally cover distinct conceptual spaces; in the case of MathML, for example, the MathML-Presentation markup language provides vocabulary to describe the notational structure of a mathematical formula, while the MathML-Content markup language provides a vocabulary for representing its notional (semantic) structure. Both markup languages may coexist in the same document; their common formal syntax is called XML.

An XML-based physics content markup language would perhaps add to these the notions of dimensions ("length", "duration", "speed"), units ("inch", "meter"), physical constants, etc., and a chemical content markup language might again add vocabulary for notions of atoms, molecules, reactions, and so forth. An engineering vocabulary would likely build on all of these. Each field would also extend the presentation markup languages of the fields whose notions it extends. In order to present to a reader a document containing structured markup, a set of rules mapping structural vocabulary to presentation styles is required (e.g. a section title to a particular font and a particular indentation, a mathematical concept to a particular notation, a representation of a chemical molecule to a particular notation or graph, an engineering concept to a chart component notation). A particular publisher will usually have its own house rules concerning the most aesthetic way to print section titles, mathematical formulas, or blueprints, but all publishers who need to print texts from these fields do have such rules. While the printed versions of such a document may differ considerably from publisher to publisher, the information content of the document will remain unchanged, and may thus be exchanged without undue loss of information.

2 WHAT IS COMPOSITIONALITY
The compositionality principle states that "the meaning of a compound expression is a function of the meaning of its parts and [the meaning of] the syntactic rule by which they are combined" (see [4], p. 462). The analysis of some current approaches to content markup indicates that the issue of compositionality is important not only from a theoretical but also from a practical point of view, as it helps to achieve correctness and scalability in communicating content. We have observed that most software systems in current use are compositional only to a limited extent. This raises an important question: What is the role of compositionality, and can it always be achieved?

2.1 Scalability of Design of a Mathematical Knowledge Network and Compositionality
Obviously, compositionality is not "needed" in the sense that one could not do without it to some extent, but from an engineering point of view it is certainly desirable.

Semantic information processing on a computer always means applying a semantic interpretation procedure to some concrete tree- or graph-like representation of an object that is being interpreted, and then representing the new interpretation given to the object thus treated in another concrete tree or graph. For a language for expressing semantics for exchange among computer programs, this means that it must perforce define a basic interpretation procedure: how to interpret the parts, and how to interpret the combinations of parts. Compositionality is important because it means that this interpretation algorithm can be defined completely with respect to how to interpret combinations of parts, once the parts themselves have been interpreted. Defining such a skeleton interpretation involves defining the structure of the lexicon and that of its entries, but no specific lexicon entries should need to be referred to in its definition. Thus, compositionality means that one is able to provide a firmly grounded scaffolding from which one can work to extend the scope of the language by adding only lexical entries, while keeping the scaffolding intact.

As the analysis of current approaches mentioned above indicates, the issue of compositionality is important not only from a theoretical but also from a practical point of view, namely for achieving the correctness and scalability of communication protocols. Some extant systems, however, have protocols that are not compositional. This raises an important question that has to be answered in order to achieve systems with a trustworthy interpretation of messages from other, different systems: What is the role of compositionality, and can it always be achieved?
When there is context dependency, full compositionality cannot always be achieved.
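The "skeleton interpretation" idea can be sketched concretely. The following Python fragment (our own illustration; the names are not taken from any of the systems discussed) shows an interpreter whose algorithm knows only one syntactic rule, n-ary application, while all growth of the language happens by adding lexicon entries:

```python
# Sketch of a compositional "skeleton" interpreter: the lexicon maps
# atomic symbols to meanings; the meaning of a compound depends only on
# the meanings of its parts and the syntactic rule combining them.
import math

LEXICON = {            # extending the language = adding entries here only
    'sin': math.sin,
    'plus': lambda a, b: a + b,
    'pi': math.pi,
}

def interpret(expr):
    if isinstance(expr, (int, float)):
        return expr                     # constants denote themselves
    if isinstance(expr, str):
        return LEXICON[expr]            # lexical lookup
    head, *args = expr                  # syntactic rule: n-ary apply
    return interpret(head)(*(interpret(a) for a in args))

print(interpret(('plus', 1, ('sin', 0))))   # 1.0
```

Note that `interpret` never mentions any specific lexicon entry: the scaffolding stays intact however large the lexicon grows, which is exactly the scalability property at stake.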
3 COMPOSITIONALITY, PRESENTATION AND MEANING IN MATHEMATICAL SYSTEMS

To avoid common confusion about the role of compositionality in the design of distributed mathematical networks, we have to distinguish languages (or their fragments) that handle presentation from those that handle mathematical meaning. It is thus only meaningful to ask whether a language obeys the compositionality principle if the language under scrutiny professes to handle "meaning". In the special case of languages for handling mathematics, this restriction rules out judging presentation languages like LaTeX, the presentation markup of MathML, or any formula editor's internal language for representing the two-dimensional layout of formulas. For these systems, the question is moot. There are, however, systems that do profess to handle mathematical meaning rather than presentation. General purpose computer algebra systems like REDUCE, Macsyma, Maple, Mathematica, and Axiom make this claim,
as do theorem provers and proof checkers. Recent "Content" or "Semantic" markup languages for mathematics also profess to provide a means for representing the meaning rather than the form of a mathematical formula. It is helpful to examine some specific mathematical operations, as implemented in current mathematical tools, for compositionality.

3.1 Compositional Treatment of Definite Integration
In this section we shall consider the different ways that definite integrals are represented in different languages or language proposals. The example of a definite integral that we will use for this purpose is commonly written as
$$\int_0^x \sin x \, dx.$$
Before considering how some systems express the meaning of this mathematical expression, let us work out a compositional answer to this question: What is the meaning of the expression presented above? For a compositional answer to that question, we must first consider what the "parts" of the compound expression are. In the most simplistic analysis, the five "parts" of this compound expression are:
- the integral operator
- the integrand
- the integration variable
- the lower and upper bounds of the domain of integration
However, the two bounds may obviously be grouped together to form a single compound part of the expression. This semantic grouping goes well with other, more advanced notions of integration that cover integration over n-dimensional domains, for example. A second grouping is less obvious, but will be crucial to our argument. The integration variable denotes the fact that the integrand is to be considered as a unary function in that variable, a concept that is variously expressed as $x \mapsto \sin x$ or $\lambda x.\, \sin x$. The latter interpretation of this grouping unveils its most crucial aspect, namely, that the integration variable is bound by the integration operator, and that its scope is the integrand. Notations occasionally encountered, like $\int f$, capture the same notion. This aspect is underscored by the choice of the upper bound in our example: the variable x occurs free in the upper bound of the integral, bound in the integrand, and as denoting the binding in the integration variable. Putting the pieces together, we conclude that a compositional interpretation of the example expression is facilitated by noting that
$$\int_0^x \sin x \, dx = \int_{[0,x]} (\lambda x.\, \sin x)$$
Viewed like this, the integration operator is interpreted as just one member of a large class of reduction operators that act on functional arguments over given sets or domains. The summation ($\sum_{i=0}^{n} \sin ix$) and product operators are two more operators of this very common kind of generalized quantifiers. Handling of the binding and scoping of variables, a crucial component of first and higher order expressions like this one, has been delegated away from the specific operator (integration) to a single general-purpose ingredient of the language (lambda abstraction) whose specific task is to express this notion, and this notion only. In summary, a compositional meaning assignment for our example might go like this:
- part 0: a numeric constant
- part x: a free variable
- part [.,.]: a binary operator
- syntactic rule "n-ary apply": [0, x] ("apply operator interval to arguments")
- part x: a free variable
- part sin: a unary operator
- syntactic rule "n-ary apply": sin x ("apply operator sine to arguments")
- part sin x: a compound expression
- part x: a free variable
- syntactic rule "abstraction" ($\lambda v.\, f(v)$): $\lambda x.\, \sin x$ ("bind one free variable, producing a unary operator in that variable")
- part $\lambda x.\, \sin x$: a compound expression
- part [0, x]: a compound expression
- syntactic rule "n-ary apply": $\int_{[0,x]} (\lambda x.\, \sin x)$ ("apply operator definite integral to arguments")
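The compositional reading above can be mirrored in code. The following Python sketch (illustrative names and a crude numerical quadrature, our own construction) encodes the example as a tree in which only `lambda` binds variables; a single generic, capture-avoiding substitution routine then works for every operator, with no operator-specific special cases:

```python
import math

# Compositional interpreter for the integral example.  'lambda' is the
# only construct that binds variables; 'integrate' merely applies a
# unary function over a domain, here via a midpoint Riemann sum.
def interpret(expr, env):
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):                    # free variable: environment lookup
        return env[expr]
    head, *args = expr
    if head == 'lambda':                         # abstraction binds a variable
        var, body = args
        return lambda v: interpret(body, {**env, var: v})
    if head == 'interval':
        return tuple(interpret(a, env) for a in args)
    if head == 'sin':
        return math.sin(interpret(args[0], env))
    if head == 'integrate':                      # "n-ary apply" of the operator
        (lo, hi), f = [interpret(a, env) for a in args]
        n = 10000
        h = (hi - lo) / n
        return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h
    raise ValueError(head)

# One generic substitution suffices, because binding is explicit:
def substitute(expr, var, value):
    if isinstance(expr, str):
        return value if expr == var else expr
    if isinstance(expr, tuple):
        if expr[0] == 'lambda' and expr[1] == var:
            return expr                          # var is bound here: stop
        return tuple(substitute(e, var, value) for e in expr)
    return expr

expr = ('integrate', ('interval', 0, 'x'), ('lambda', 'x', ('sin', 'x')))
# substitute pi for the *free* x only, then evaluate:
# the integral of sin over [0, pi] is 2
print(round(interpret(substitute(expr, 'x', math.pi), {}), 4))
```

Note that substitution replaces the free occurrence of x in the upper bound while leaving the bound occurrence in the integrand untouched, exactly as the free/bound analysis of the example requires.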
3.2 Definite Integration in Computer Algebra Systems
Here is a small table of representations of definite integration in some computer algebra systems:
- REDUCE uses defint(sin x,x,0,x)
- Maple uses int(sin(x), x=0..x)
- Mathematica uses Integrate[Sin[x], {x,0,x}]
A common semantic task in a general purpose CA system is the evaluation of a complex expression involving free variables at specific points within the range of those variables, usually performed by substitution followed by simplification. Now consider asking any one of the systems above to substitute a value for x in the above examples. They will likely give the right answers, but these answers come at a price. Each of these systems needs to tag their respective names for the integration operator as what LISP would call an "FEXPR": an operator that handles the interpretation of its arguments on its own rather than leaving their interpretation to the system. In REDUCE, for example, the integration operator is tagged with its very own substitution and simplification routines.

3.3 Systems in Which Compositionality Fails and the Consequence of this Failure
Strotmann has made a detailed analysis of the treatment of definite integration in several computer algebra systems (REDUCE, Maple, Mathematica) from the point of view of compositionality, along the lines adumbrated in Sec. 3.1 above. His analysis shows that "the operators in each system handle the interpretation of their arguments on their own rather than leaving their interpretation to the system", which is just another way of stating that they are not handled in a compositional manner by these systems. From a software engineering point of view this means that the design does not scale: each new operator (summation, product, differentiation, root extraction, limits, and so on ad infinitum) would need to be handled as its own syntactic rule if compositionality were to be claimed. Looking at MathML, "it is fair to say that while it is well on its way towards becoming a properly compositional mathematical markup language, there still remains some work to do to really make it so." In OpenMath, whether or not an operator is compositional depends on whether it is placed in the Basic content dictionary or the Meta content dictionary.

3.4 Compositionality in AI-Ontology Systems
In AI and Knowledge Engineering, practical schemes have been developed for the purpose of combining mathematical concepts with the concepts of other domains of expertise, yielding computer representations of the knowledge of these combined domains. One such knowledge representation language is KIF. The Knowledge Interchange Format (KIF) is a computer-oriented language for the interchange of knowledge among disparate programs. It has declarative semantics (i.e. the meaning of expressions in the representation can be understood without appeal to an interpreter for manipulating those expressions); it is logically comprehensive (i.e. it provides for the expression of arbitrary sentences in the first-order predicate calculus); it provides for the representation of knowledge about the representation of knowledge; it provides for the representation of nonmonotonic reasoning rules; and it provides for the definition of objects, functions, and relations. Most of KIF is compositional (or can easily be re-done compositionally). It does, however, contain the "setofall" and "the" operators, which fail to meet a strict definition of compositionality. The interpretation rule for "setofall" appears to be an instance of interpretation by substitution.
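Why a setofall-like operator resists a compositional treatment can be seen in a toy interpreter (Python; the finite domain and operator names are our own illustration, not KIF's actual semantics). Ordinary operators consume only the *meanings* of their parts, whereas the set-comprehension operator must re-interpret its raw, unevaluated body under varying substitutions:

```python
# Toy interpreter over a hypothetical finite universe of discourse.
DOMAIN = range(10)

def interpret(expr, env):
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):                 # variable lookup
        return env[expr]
    op, *args = expr
    if op == '+':                             # compositional: uses only
        return interpret(args[0], env) + interpret(args[1], env)
    if op == '<':                             # the meanings of its parts
        return interpret(args[0], env) < interpret(args[1], env)
    if op == 'setofall':
        var, body = args
        # NOT compositional in the strict sense: needs the raw body and
        # the variable name, re-interpreting the body per substitution.
        return {d for d in DOMAIN if interpret(body, {**env, var: d})}
    raise ValueError(op)

print(interpret(('setofall', 'x', ('<', 'x', 4)), {}))   # {0, 1, 2, 3}
```

Interpreting the body in advance would yield only a truth value for one environment; the operator needs the body itself, which is the "interpretation by substitution" pattern noted above.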
4 DISTRIBUTED COMPUTATIONS IN KNOWLEDGE NETWORKS AND COMPUTATIONAL SEMIOTICS

A number of computing systems have become available that effectively represent mathematical knowledge, and that enable the construction of databases of mathematical results and mechanically checked proofs in forms that are not only readable by people but also computationally re-usable. Different mathematical systems, however, have different competence, and networked cooperative communities of such systems are desirable. Ideally, users of such systems could shift among different but clear and unambiguous representational syntaxes and semantics that capture the same underlying mathematical knowledge in forms that are tailored for use in different contexts of mathematics as well as in different applications in many disciplines. The development of the OpenMath international standards/protocols (see http://www.openmath.org) for the sharing/exchange of mathematical information, which enables distributed co-operation of agents/systems with different kinds of competence, is indispensable for achieving such goals. The OpenMath scheme allows for distributed cooperative computing involving mathematical agents with different kinds of mathematical competence, e.g. the combined distributed action of numerical and symbolic mathematical systems (e.g. Mathematica, Matlab, Maple, Axiom, etc.).

4.1 Extending OpenMath with Linguistic and Extra-Mathematical Concepts
Practical problems of engineering design, decision making in business, architecture, environmental protection, medicine and social planning, however, require not only mathematics, but also information and knowledge of a nonmathematical kind to be incorporated. It is crucial that this extension does not eliminate the mathematical component but makes it an integral part of the extended scheme. The purpose of our MathIn project is to develop a scheme for knowledge networking which utilizes the OpenMath international protocol structures but at the same time extends these, incorporating linguistic and nonmathematical symbolic representations and communication schemes. In order to integrate the individual facets of the MathIn project, two mutually complementary tools are used: (i) the Activity Structures methodology and (ii) Relational Calculi exploring BK-relational products. These methodologies jointly provide the means for formal representation and manipulation of knowledge and metaknowledge that can unify the proposed extensions within a sound mathematical core, including relational mathematical representations of semiotic descriptors [3].
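For concreteness, this is what a minimal OpenMath XML object looks like for the expression sin(x); "transc1" is the standard OpenMath content dictionary for transcendental functions:

```xml
<!-- OpenMath object encoding sin(x): OMA = application,
     OMS = symbol from a content dictionary, OMV = variable -->
<OMOBJ xmlns="http://www.openmath.org/OpenMath">
  <OMA>
    <OMS cd="transc1" name="sin"/>
    <OMV name="x"/>
  </OMA>
</OMOBJ>
```

Extensions of the kind proposed here would add content dictionaries for nonmathematical, semiotic vocabulary while leaving this compositional application structure intact.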
4.2 Compositionality Revisited
Ontolingua, which uses KIF as its base language, assumes a completely sharable knowledge base that is used by all agents participating in a cooperative computational scheme. A truly distributed knowledge network, on the other hand, contains a conglomerate of partially sharable knowledge bases used in a highly parallel fashion. KIF uses first-order logic. Both of these assumptions of KIF and Ontolingua have to be carefully scrutinized when applied to distributed computing structures, because their violation may lead to the breakdown of compositionality and the subsequent failure of first-order logic. For this purpose, one has to view compositionality in a wider context. Jaakko Hintikka's analysis of the issue of compositionality [2] in mathematical and philosophical logic yielded the following dependency graph:

learnability ==> { parallelism, invariance, determinacy } ==> compositionality

These notions represent an important contribution to the investigation of compositionality. A logic system is compositional only if parallelism, invariance and determinacy hold simultaneously in it. We apply this insight in the analysis of the extant markup languages. Hintikka also demonstrated that when compositionality fails, Tarski semantics for first-order logic is not possible. Some knowledge representation languages, such as KIF and the related Ontolingua, however, crucially depend on first-order logic semantics. Hence a careful investigation of the issue of compositionality is extremely important for the construction of correct distributed knowledge networks.

4.3 Effective Computations with Distributed Knowledge
Both in human learning and in artificial intelligent systems, the management of information is becoming a major practical concern. It is the structuring of knowledge in an organized manner, the clustering and "chunking" of information, and the linking of contexts in linguistic representation that facilitate and speed up information retrieval. It is not only the adequate representation of knowledge but also its efficient localization, retrieval and transformation that are important in distributed intelligent systems. In the latter tasks, granulation of information plays an essential role. The notion of granularity was introduced into fuzzy sets by Zadeh. In his interpretation, granulation involves a decomposition of the whole into parts; conversely, organization involves an integration of parts into the whole. Both of these complementary activities, composition and decomposition, play a crucial role especially in distributed knowledge manipulation. To deal with granularity in intelligent systems, the potential of fuzzy logic to approximate is explored. Fuzzy relations provide a high-level specification language and computational tool for forming granules. Very special cases of granular structures are equivalences, similarities, congruences and hierarchies of various knowledge components. The contributions of Bandler and Kohout crucial for distributed multi-level knowledge representation were to provide an adequate definition of locality for both crisp (non-fuzzy) and fuzzy relations [1], and to develop software tools for computational testing of local properties and for comparing partial relational structures. Their relational methods can deal with the locality of symbolic structures (see below).

4.4 Manufacturing Application: A Relational Semiotic Model of Affordability
High technology companies often have a need to make technological and business decisions about products that are yet to be designed and manufactured. Scarcity of information concerning untried technologies and the lack of a historical database are the main characteristics of this problem. The industry needs affordability models applicable to such manufacturing problems. Fuzzy relational techniques provide a framework for working with incomplete and/or conflicting information, integrating linguistic information with uncertainty of a probabilistic as well as of a non-probabilistic nature. In order to integrate human and technological factors one has to take into account not only the technological design and production concepts and data, but also the psychological and linguistic constructs utilized by human participants. Management, financial and organizational activities also have to be captured in the model. This requires special semiotic techniques: semiotic descriptors that may have local relational properties. We have designed a set of repertory grids to capture the expertise of engineers, which is one of the most important sources of information in situations where no historical data on the manufactured product are available. Repertory grids utilize verbal descriptors, thus making it possible to assign different levels of accuracy, precision, or certainty to each part and process, such as cost, material input, or processing condition. The relational structures computed from the elicited information yield semiotic descriptors that supplement the available ontologies and extend these, capturing the locality of participation of descriptors within a family of contexts. Methodologically, the key elements of each subsystem and their interactions were identified using exploratory knowledge elicitation in two steps: (1) specification of a generic framework of knowledge structures; (2) construction of specific knowledge structures within the selected generic framework.
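The elicitation idea can be sketched as a data structure. The following Python fragment is a hypothetical illustration only (the components, constructs, and numeric scale are invented for exposition, not taken from the project data): a repertory grid rates elements against verbally described constructs, and the verbal descriptors are then mapped to fuzzy degrees in [0, 1] for relational processing:

```python
# Hypothetical repertory grid: engine components rated against
# constructs elicited from an engineer, with verbal ratings mapped
# to fuzzy membership degrees.  All names and values are illustrative.
VERBAL_SCALE = {'very low': 0.1, 'low': 0.3, 'medium': 0.5,
                'high': 0.7, 'very high': 0.9}

grid = {
    'machining cost':    {'turbine blade': 'very high', 'casing': 'medium'},
    'material scarcity': {'turbine blade': 'high',      'casing': 'low'},
}

def fuzzy_grid(grid):
    """Replace verbal descriptors by fuzzy membership degrees in [0, 1]."""
    return {c: {e: VERBAL_SCALE[v] for e, v in row.items()}
            for c, row in grid.items()}

print(fuzzy_grid(grid)['machining cost']['turbine blade'])   # 0.9
```

The resulting numeric grid is a fuzzy relation between constructs and elements, which is the form required as input by the relational (BK-product) machinery discussed above.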
The data collection and analysis have so far been performed at the following resolution levels: (1) the level of component parts of an aircraft jet engine; (2) a supplementary level adding non-engineering factors influencing a component's cost; (3) the level of integration of components into a subsystem; (4) the level of cost estimation of competing technologies. The relational model at each level may contain several or all of the following conceptual categories: objects, attributes, values, agents, perspectives, contexts, views. Relational products can unify quantitative and qualitative processing of data and knowledge representation, capturing also the interaction and communication of agents. Perspectives also contain the notion of communication, but only implicitly. AI-ontologies do not pay attention to communication. When incorporating the notion of communication explicitly, one has to address the following activities: (1) forming a question or specifying a query; (2) communicating information. The topic and focus notions of the Prague Linguistic Circle are relevant to both: to (2) directly and to (1) indirectly. The topic-focus paradigm provides a directed flow of information in communication between two participants, a sender and a receiver of a message. Across different resolution levels, topic and focus may not match; vertical shift re-interpretations are needed. For combining mathematical and linguistic descriptions, categorial type logics [4] (e.g. Lambek's calculus) are of considerable interest. There are mutual links between Lambek's \ and / syntactic operations and the semantics of BK-products [1], which are explored in our computational approach. In particular, the BK-triangle superproduct leads to a constructive definition of generalized morphisms, which we explore theoretically as well as computationally. Thus, BK-products are explored for their promise of serving as both a theoretical and a computational framework for a semantic and pragmatic interpretation of content markup.
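As an illustrative sketch of the relational machinery involved, the following Python/NumPy fragment computes a BK triangle subproduct of two fuzzy relations; the choice of the Lukasiewicz implication and the toy matrices are our own assumptions for exposition, not data or code from the project:

```python
# Sketch of the BK (Bandler-Kohout) triangle subproduct of fuzzy
# relations: (R <| S)[i,k] = inf over j of impl(R[i,j], S[j,k]).
import numpy as np

def luk_impl(a, b):
    """Lukasiewicz fuzzy implication: min(1, 1 - a + b), elementwise."""
    return np.minimum(1.0, 1.0 - a + b)

def bk_subproduct(R, S):
    n_i, n_j = R.shape
    n_j2, n_k = S.shape
    assert n_j == n_j2, "inner dimensions must agree"
    out = np.empty((n_i, n_k))
    for i in range(n_i):
        for k in range(n_k):
            # infimum (min) over the middle index j
            out[i, k] = luk_impl(R[i, :], S[:, k]).min()
    return out

R = np.array([[1.0, 0.5], [0.2, 0.8]])   # toy fuzzy relations
S = np.array([[0.9, 0.1], [0.6, 1.0]])
print(bk_subproduct(R, S))
```

Unlike the ordinary sup-min circle product, the subproduct expresses graded inclusion ("every j related to i by R is related to k by S"), which is what makes it useful for testing local relational properties of the kind discussed above.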
The unifying representation of relational computations by means of BK-products makes it possible to link various tools for relational computations, such as Trysis or Afreval, with other computational tools such as Guha, using extended OpenMath protocols incorporating semiotic descriptors.
REFERENCES
[1] W. Bandler and L.J. Kohout. Relations, mathematical. In Systems and Control Encyclopedia, pages 4000-4008. Pergamon Press, Oxford, 1987.
[2] J. Hintikka and J. Kulas. The Game of Language. D. Reidel, Dordrecht, 1983.
[3] L.J. Kohout. A Perspective on Intelligent Systems: A Framework for Analysis and Design. Chapman & Hall, London, 1990.
[4] J. van Benthem and A. ter Meulen (eds.). Handbook of Logic and Language. Elsevier, Amsterdam, 1997.