An Architecture for Musical Score Recognition using High-Level Domain Knowledge

Marc Vuilleumier Stückelberg    Christian Pellegrini    Mélanie Hilario

Computer Science Department, C.U.I. – University of Geneva, CH – 1211 Geneva 4

Abstract

This work proposes an original approach to musical score recognition, a particular case of high-level document analysis. In order to overcome the limitations of existing systems, we propose an architecture which allows for a continuous and bidirectional interaction between high-level knowledge and low-level data, and which is able to improve itself over time by learning. This architecture is made of three cooperating layers: a set of parameterized feature detectors, an object-oriented knowledge repository, and a supervising Bayesian metaprocessor. Although the implementation is still in progress, we show how this architecture is adequate for modeling and processing knowledge.

1 Introduction

This work proposes an original approach to musical score recognition, a particular case of high-level document analysis. We aim to solve the problem completely, but with simple means, i.e. a regular personal computer and a standard 300 dpi scanner, without heavy preprocessing. We shall make up for these real-world constraints by using more intelligence. In particular, we will take advantage of as much domain knowledge as possible, and of modern artificial intelligence techniques.

The innovation of our system should not be sought in its techniques of image analysis. It uses a combination of three widely known principles [1]: recognizing objects by their description in a feature space; recognizing objects by the description of their parts and/or by their structural description (syntactic pattern recognition); and recognizing objects by aligning a model, through simple transformations, so as to optimize a given measure. The originality of our contribution lies in the architecture for document analysis, which combines Bayesian reasoning, high-level structured knowledge and low-level image analysis tools in a continuous interaction.

There have already been several knowledge-based approaches to musical score recognition (see for instance [2] and [3]), mostly based on syntactic pattern recognition of segmented symbols. These systems work quite well, but have limitations inherent in the absence of interaction between symbol segmentation, recognition and syntactic analysis. In particular, such systems cannot recover from preprocessing mistakes and cannot improve themselves over time. Moreover, the abstraction level of the entities that can be recognized is often strongly constrained by the poverty of the language for expressing high-level domain knowledge.

It should be noted that our goal is to provide a model suitable for implementation. There are many psychological and philosophical models of active perception that would let an agent learn, discover new concepts and much more. Such models are so far from computational realities that they have little or no influence on real-world systems. To avoid this pitfall, we shall relate our model to existing knowledge processing paradigms and describe it at the computational level, even if it is inspired by higher-level considerations.

In the next section, we describe our architecture and its underlying principles. Section 3 describes its dynamic behavior. Section 4 provides an evaluation of the model from several points of view.

2 The architecture

We give an overview of the proposed model. First, we present it as a layered system, in order to clearly distinguish between the different kinds of information processing that take place. We then describe more precisely how runtime data is processed, how perceptions are represented and how uncertainty is handled.

2.1 A layered view

As shown in the figure below, the system is made of three layers, with bi-directional exchanges between them.

[Figure: the three layers of the system. The Metaprocessor (top layer) exchanges with the Conceptual System (intermediate layer): driving downwards, likelihood of perceptions upwards. The Conceptual System exchanges with the Feature Detectors (bottom layer): attention area and parameters downwards, measured features upwards. The Feature Detectors read the raw data (scanned document to be analyzed, ...).]
The only layer which is directly in contact with the runtime data is the so-called feature detector layer. Object recognition is solely based on the output of these feature detectors, as in the traditional feature-space approach. But in contrast to the conventional approach, the feature detection process is driven and parameterized by the intermediate layer instead of being a one-way, stand-alone process. The feature detectors have two main roles: selecting the task-relevant properties of the data, and providing redundant interfaces so that every component of the conceptual system, whether symbolic or connectionist, has the most adequate access to the data. From an algorithmic point of view, the feature detectors are mostly ad hoc representation-space transformers or statistical analyzers.

The core of the architecture is the conceptual system, which holds most of the domain knowledge about the task to be achieved. Concepts are defined in an object-oriented fashion, and relations between them express their structural relations. Domain knowledge is also provided in the form of inference methods, for validating already discovered objects and for deducing new ones. Such inference methods are either hand-crafted or inductively learned, and can use connectionist as well as symbolic algorithms.

On top of the system there is a metaprocessor, whose role is to orchestrate the perception task. From the metaprocessor's point of view, each discovered object is just a perception hypothesis, which is correct with a probability that depends on other perception hypotheses. By dynamically building a belief network, the metaprocessor tries to find the most probable set of entities for describing the input score.
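As a rough illustration of this division of labour, the following Python sketch shows one possible set of interfaces for the three layers. The class and method names are ours, not the paper's, and the bodies are deliberately left open; it is a sketch of the intended responsibilities, not an implementation.

```python
from abc import ABC, abstractmethod

class FeatureDetector(ABC):
    """Bottom layer: takes a measurement on a region of the raw data,
    returning it in the representation requested by the caller."""
    @abstractmethod
    def measure(self, bitmap, attention_area, **parameters):
        ...

class ConceptualSystem:
    """Intermediate layer: object-oriented domain concepts plus inference
    methods that call feature detectors and create or refine hypotheses."""
    def __init__(self, detectors):
        self.detectors = detectors
        self.hypotheses = []          # grows monotonically during recognition

    def refine(self, hypothesis):
        """Apply an applicable inference method and return new, more precise hypotheses."""
        raise NotImplementedError

class Metaprocessor:
    """Top layer: maintains a belief network over hypotheses and decides
    which hypothesis the conceptual system should explore next."""
    def __init__(self, conceptual_system):
        self.conceptual_system = conceptual_system

    def step(self, assumed_true):
        """Choose the most promising hypothesis and drive the conceptual system."""
        raise NotImplementedError
```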

2.2 The processing of runtime data

Runtime data comes to the system as a collection of bitmaps, each representing a page of the score. According to the needs of the conceptual system's inference methods, one of the feature detectors will take a specific measure on a given area of the data and return the result in the form expected by the caller.

[Figure: examples of feature detectors. Given an attention area, a feature selection with parameters and a type of representation (bitmap, region, document, ...), a detector such as a Hough transform or a blob matching returns its result in the expected form (maxima list, approximation, ...).]
Most feature detectors are specifically designed for particular tasks of the recognition process. It has been shown that this can greatly increase the robustness of the system [4]. In this sense, the feature detectors implicitly contain low-level domain knowledge. A feature detector can even encompass a parametric model of a specific object to recognize, its output then being a measure of adequacy between the model and the input data. We will not further detail the feature detectors, as our contribution is the overall architecture and not the feature detectors themselves. In fact, we will mainly use feature detectors which have proved to be efficient in existing systems for musical score recognition.
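As an illustration (not taken from the paper), a simple parameterized detector for staff lines could look as follows; the attention area and the threshold parameter would be chosen by the conceptual system, and the adequacy value gives the caller a crude measure of fit.

```python
import numpy as np

def staff_line_detector(bitmap, attention_area, min_black_ratio=0.5):
    """Hypothetical feature detector: horizontal projection profile on a sub-image.

    bitmap          -- 2-D uint8 array, 1 for black pixels
    attention_area  -- (top, left, bottom, right) rectangle chosen by the caller
    min_black_ratio -- parameter: minimal fraction of black pixels in a staff-line row
    Returns candidate row positions and a crude adequacy measure.
    """
    top, left, bottom, right = attention_area
    window = bitmap[top:bottom, left:right]
    width = max(1, right - left)
    profile = window.sum(axis=1) / width                 # black-pixel ratio per row
    candidates = [top + y for y, ratio in enumerate(profile) if ratio >= min_black_ratio]
    adequacy = float(profile.max()) if profile.size else 0.0
    return candidates, adequacy
```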

2.3 The representation of concepts

We represent concepts in a frame-like [5] (or schema-like [6]) manner, strongly influenced by object-oriented technology. We allow single inheritance between concepts, and relations between instances by means of instance variables. These can be represented graphically using the OMT [7] object model notation, as shown below for a small part of our domain model.

[Figure: OMT object model of a small part of the domain, with concepts Score, Page, StaffSystem, Measure, SystemBar, StaffGroup, BigBarline, GroupMeasure, Barline, Part, Instrument, InstrumentStaff, SingleStaff, DoubleStaff, InstrumentMeasure, Accolade, NoteHolder, RegularStaff, StaffSegment and PercussionStaff, linked by inheritance and 1-2 aggregation relations.]

In addition to its relations, each concept owns a set of inference methods which allow it to infer the presence of other objects, much as methods in traditional object-oriented technology allow it to retrieve instances of other classes. An inference method works by applying some feature detectors to selected areas of the score and processing the result with a mechanism specific to the inference to be performed. Any mechanism, whether symbolic or connectionist, can be used. For instance, an inductive regression algorithm as well as a fuzzy logic system could be used to guess the position of a clef given a staff.

In traditional object models, the result of a method is an unambiguous pointer to a well-defined instance. This is neither possible nor desirable in our system, since it would prevent us from using inexact inference methods, such as inductively learned estimators. Instead, the result of our inference methods is a range of areas within which the resulting concept instance probably lies. We call such an approximate answer an hypothesis, and a real instance of a concept a perception. The resulting meta-model of our conceptual system is sketched below, together with the traditional meta-model of an object-oriented system.

[Figure: a traditional object-oriented meta-model, relating Class, Attribute (role), Method (code), Instance (value), SlotValue and Execution (arguments, result).]

[Figure: the meta-model used in our conceptual system, relating Concept, Attribute (role), Inference (inferenceClass, contents), Perception, SlotValue, Hypothesis (arguments, result, context), Derivation (proposition) and Formulation (origin, value).]

Concretely, an hypothesis is defined by the set of classes it is supposed to belong to (usually a tree), a search area for the hot-point of the object, and a dimension range. For the sake of simplicity, we only consider rectangular dimensions (the bounding box of the object).

We can now show how the conceptual system works. A new hypothesis is guessed by an inference method. It is then further refined by other inference methods, until it becomes precise enough to build a true perception on it. At that time, the constructor (in the object-oriented meaning of the term) of this perception will create new hypotheses for each slot value, according to the domain object model. Each of these hypotheses will be processed in the same way, until the full document structure is recognized. Of course, this is a non-deterministic process, and this is why we have a metaprocessor on top of the conceptual system. Basically, the metaprocessor will have to decide at each step whether it is more profitable to refine one of the existing hypotheses or to propose it as a perception.

Which inference methods can be applied at any time to a given hypothesis in order to refine it depends on its context. First, when an hypothesis is created, instances of inferences can be attached that will help to refine it; this is what we called a formulation in the meta-model above. Second, if there are any subsuming (i.e. more general) hypotheses, we can also use their formulations, since what holds for a more general case also holds for a more specific case. For instance, if we have a formulation of how to find a clef at a given place, we can use it when we are looking for a treble clef at the same place. As will be shown in Section 3, the use of our conceptual system in conjunction with the mechanism of subsumption makes a powerful inference engine.
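A minimal sketch of these notions, with illustrative field names of our own (the paper only fixes the semantics: candidate classes, a search area for the hot-point, a dimension range, and attached formulations):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """Approximate result of an inference method."""
    candidate_classes: list        # e.g. ["Clef", "TrebleClef"], from general to specific
    search_area: tuple             # rectangle in which the hot-point probably lies
    size_range: tuple              # (min_w, min_h, max_w, max_h) for the bounding box
    formulations: list = field(default_factory=list)   # inference instances attached at creation

@dataclass
class Perception:
    """A real concept instance, built once a hypothesis is precise enough."""
    concept: str
    hot_point: tuple
    bounding_box: tuple
    slots: dict = field(default_factory=dict)           # each slot value spawns a new Hypothesis

def applicable_inferences(hypothesis, subsuming_hypotheses):
    """An hypothesis can be refined by its own formulations and by those
    of any subsuming (more general) hypotheses."""
    pool = list(hypothesis.formulations)
    for h in subsuming_hypotheses:
        pool.extend(h.formulations)
    return pool
```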

2.4 The handling of uncertainty

It would not make any sense to use the conceptual model described previously without handling its inherent uncertainty. We have chosen to use a Bayesian probabilistic model. Each hypothesis in the conceptual model (for instance, that there is a clef in the upper left corner of the score) is only true with a given probability. Moreover, this probability depends on which other hypotheses we believe in, since most hypotheses are interrelated. An elegant way of handling this situation is to bind all hypotheses together in a belief network [8][9]. In this way, the metaprocessor can compute at any time the posterior probability of any hypothesis, given the subset of hypotheses that it currently assumes to be true or false.

The belief network has to be built dynamically, since the set of nodes (of hypotheses) grows continuously. This cannot be done using the basic network construction algorithm, first because it is not easy to determine which of all known hypotheses depend on a new one, and second because it would in any case result in an excessively complex network. Instead, we insert each new node exactly between its direct causes and its direct consequences. In order to do that, we have to be able to determine which hypothesis is the direct cause of which other. Fortunately, this can be done quite easily, since the only possible causes of an hypothesis are the hypotheses from which it has been inferred and the hypotheses which it subsumes (for instance, the hypothesis that there is a treble clef at some place is a cause of the hypothesis that there is any clef at this place, which in turn is a cause of finding a clef somewhere on the score). Combined together, these two causality criteria bind all related hypotheses into a directed acyclic graph.

In order to transform this DAG into a belief network, it is necessary to label the edges with conditional probability values, and the root nodes with prior probability values. Edges coming from subsumption can easily be labeled with theoretical values, because a subsumed hypothesis is an unconditional cause. Edges coming from inferences must be labeled by the inference methods themselves. This means that all inference algorithms we use must provide their results with a measure of likelihood. This is very feasible, even for connectionist processing [10], and it is the natural condition for a rigorous handling of uncertainty. Finally, the prior probabilities of root nodes can be extracted from the prior probability distribution of each concept at each place of the score, which is part of the domain knowledge. The resulting belief network is likely to be severely multiply-connected. For this reason, we evaluate it using stochastic simulation. The only drawback of stochastic simulation is that it takes a long time to get accurate values for very unlikely events. This is not a problem for us, since we are mostly concerned with likely events.
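The paper does not say which stochastic simulation scheme is used. As one concrete possibility, a likelihood-weighting estimate of a posterior over binary hypothesis nodes could be sketched as follows; the network encoding, the function names and the toy example at the end are ours.

```python
import random

# net: node -> (parents, cpt), where cpt maps a tuple of parent truth values
# to P(node = True | parents); root nodes have parents == () and cpt[()] as prior.
def likelihood_weighting(net, evidence, query, n_samples=5000):
    order = topological_order(net)
    num = den = 0.0
    for _ in range(n_samples):
        sample, weight = {}, 1.0
        for node in order:
            parents, cpt = net[node]
            p_true = cpt[tuple(sample[p] for p in parents)]
            if node in evidence:                       # clamp evidence, accumulate its weight
                sample[node] = evidence[node]
                weight *= p_true if evidence[node] else 1.0 - p_true
            else:                                      # sample non-evidence nodes
                sample[node] = random.random() < p_true
        den += weight
        if sample[query]:
            num += weight
    return num / den if den else 0.0

def topological_order(net):
    seen, order = set(), []
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for p in net[n][0]:
            visit(p)
        order.append(n)
    for n in net:
        visit(n)
    return order

# Toy subsumption chain from the text: treble clef here -> any clef here -> clef somewhere.
net = {
    "treble_clef_here": ((), {(): 0.2}),
    "clef_here": (("treble_clef_here",), {(True,): 1.0, (False,): 0.1}),
    "clef_somewhere": (("clef_here",), {(True,): 1.0, (False,): 0.3}),
}
print(likelihood_weighting(net, {"clef_somewhere": True}, "treble_clef_here"))
```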

3 Dynamic behavior

In the previous section, we presented the main elements of our architecture and their relations to each other. This section shows how control flows from one component to another, and how the whole system works.

3.1 The control structure

Initially, the conceptual system is primed by creating, for each concept of the domain model, an hypothesis that will subsume all future hypotheses of that concept. This initial hypothesis has in its context the formulation of every inference method that can help, at any time, to refine an hypothesis of its class. Since we use single inheritance, these priming hypotheses are bound by subsumption into a tree. At the root of this tree is the most general hypothesis, stating that there is something somewhere. This node is also the ultimate consequence in our belief network, and its posterior probability will tell us, in the end, with what probability we have recognized anything at all.

The recognition process itself can be triggered in three ways, depending on the level of interaction the user wants to keep with the system. In the most interactive mode, the user can himself propose the hypothesis to analyze (for instance, the hypothesis that there is a staff in the upper part of the first page). In a semi-automated mode, the user can simply state which concept he is interested in, and the system will produce the initial hypothesis by refining the priming hypothesis of the selected class. In a fully automated mode, the system takes the most aggregating concept (in our case, the score) and refines it to get the initial hypothesis.

Once the initial hypothesis is given, the metaprocessor will consider it as true and explore it, until a stopping condition is reached. One way to explore an hypothesis is to let the conceptual system transform it into a perception, and create new hypotheses for all its attributes. If the initial hypothesis was precise enough, the resulting new hypotheses will be given a high measure of likelihood by the inference method, and will therefore be considered as true by the metaprocessor. If this is not the case, the metaprocessor will try another way of exploring the initial hypothesis: it will refine it into another hypothesis, more precise but less probable. The third way an hypothesis can be explored by the metaprocessor is by using a special kind of inference method, called a validation inference. A validation inference returns a boolean, telling whether a given measurable property of the concept has been verified on the hypothesis or not. By propagation in the belief network, the result of the validation will update the posterior probability of the target hypothesis, and indirectly that of all related hypotheses.

It should be noted that while the set of hypotheses in the conceptual system grows monotonically, the subset of hypotheses that the metaprocessor holds to be true changes non-monotonically. Typically, the metaprocessor deduces a number of new hypotheses from the initial one, and holds them to be true until it comes to a dead end. At this point, it will not remove any hypothesis from the conceptual system, but only backtrack in its assumptions about which hypotheses are true.

Several stopping conditions can be used, and even combined. First, the user can decide at any point to stop the search, and the posterior probability of the initial hypothesis given the most probable other hypotheses will tell him the confidence he can have in the result. Second, the metaprocessor can pursue the search until the posterior probability of the initial hypothesis is above or below given thresholds. Finally, the metaprocessor can be kept running until there is no hypothesis left that is worth refining or changing into a perception, because all possible inferences have already been applied to all fairly likely hypotheses.
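Put together, the metaprocessor's outer loop might look roughly like the following sketch. The method and attribute names (posterior, most_promising, explore, dead_end, likely_hypotheses) are our own assumptions; the paper leaves the exact scheduling policy open.

```python
def recognize(metaprocessor, initial_hypothesis, p_low=0.05, p_high=0.95, max_steps=10_000):
    """Explore hypotheses until a stopping condition is met."""
    assumed_true = {initial_hypothesis}                 # non-monotonic working set
    for _ in range(max_steps):
        posterior = metaprocessor.posterior(initial_hypothesis, assumed_true)
        if posterior >= p_high or posterior <= p_low:   # threshold stopping condition
            break
        candidate = metaprocessor.most_promising(assumed_true)
        if candidate is None:                           # nothing left worth exploring
            break
        outcome = metaprocessor.explore(candidate)      # perceive, refine, or validate
        if outcome.dead_end:
            assumed_true.discard(candidate)             # backtrack assumptions only;
                                                        # the hypothesis itself is kept
        else:
            assumed_true.update(outcome.likely_hypotheses)
    return metaprocessor.posterior(initial_hypothesis, assumed_true)
```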

3.2 Interactions between layers

As shown by the previous example, there is a strong interaction between the adjacent levels of the architecture.

Metaprocessor ⇒ conceptual system. The metaprocessor does not simply influence the conceptual system, it also drives it by chaining inferences and choosing which method to apply at each step. It could even change the behavior of the conceptual system by teaching it positive and negative examples during the reasoning process.

Conceptual system ⇒ metaprocessor. Reciprocally, the conceptual system affects the metaprocessor by creating the hypotheses themselves and assigning them conditional probabilities. These will be decisive for the metaprocessor's choice of the next inference method to apply.

Conceptual system ⇒ feature detectors. The conceptual system uses the feature detector parameters to get the most relevant data each time. This means that the conceptual system can make the feature detectors show what it wants to see.

Feature detectors ⇒ conceptual system. The feature detectors have an incontestable influence on the conceptual system, since they represent its interface to the data. They do not only feed it with the concrete material it needs to build up hypotheses; they can also show the absence of evidence, leading the whole system to revoke an hypothesis.

4 Preliminary evaluation

We propose an evaluation of the architecture from several points of view. We first evaluate the knowledge engineering process required to operate the architecture. We then compare the computational process resulting from our architecture to that of some classical models, in order to get a feeling for the kind of computation involved. Finally, we compare the potential task capabilities of our architecture to those of existing musical score recognition systems.

4.1 The knowledge engineering process

Our architecture offers a convenient and safe framework for making domain knowledge explicit and available to each layer. Since the domain knowledge is modeled in an object-oriented fashion, we benefit from the usual advantages of object-oriented design. First, elementary operations are simple, while the target task complexity is almost unlimited. Second, problems of arbitrary depth and dimension are known to be efficiently tractable by an object-oriented decomposition. Third, a decade of object-oriented software engineering with real-world and industrial applications has led to efficient methodologies for object-oriented modeling and design.

In addition to this reliable knowledge modeling technique, our architecture clearly separates different types of knowledge. The illustration below shows a commonly accepted hierarchy of knowledge forms, as proposed for instance in [11]. The actual distribution of these various forms of knowledge within our architecture precisely reflects this hierarchy: meta-knowledge is naturally located in the metaprocessor layer; knowledge about the target domain is given a priori and further refined in the intermediate layer; information is extracted by the feature detectors and made available for recognition and analysis; finally, data is given in raw form as input to the system. This effective separation of different knowledge types helps to model the domain correctly. Moreover, the resulting reasoning process can easily be explained at several levels of abstraction.

[Figure: hierarchy of knowledge forms, from the highest to the lowest level of abstraction — Meta-Knowledge (knowledge about knowledge), Knowledge (information items and their relationships), Information (processed data), Data (raw, unprocessed input).]

4.2 Comparison to similar architectures

The chart parser [12] is one of the most efficient syntactic parsers; it has been widely used in the field of natural language analysis. It is known to have time complexity O(n³), where n is the number of terminal symbols. Our perception instantiation mechanism produces a computational process which is strictly equivalent to that of a chart parser, under the following conditions (a small illustration is sketched at the end of this subsection):

• the conceptual model encodes a context-free grammar;
• the terminal symbols of the grammar are defined as concepts with no attributes but at least one validation inference;
• the non-terminal symbols of the grammar are defined as aggregates of their components.

This lets us intuitively believe that the complexity of the computational process produced by our system will be at least as good.

Anon [6] is a system for the interpretation of images of engineering drawings. Our three-layered architecture is somewhat similar to Anon's cycle of perception, where the control flow goes from a control system to a current schema, then to an image analysis library and back. An important similarity is that both architectures use a frame-like representation which not only models the domain objects but also organizes the strategies by which the components of a drawing can be recognized.
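To make the equivalence conditions listed above concrete, here is one purely illustrative way a context-free rule could be encoded as concepts in our sense — non-terminals as aggregates of their components, terminals as attribute-less concepts with a validation inference. The rule, the concept names and the validation names are invented for the example.

```python
# Grammar fragment:   Measure -> Barline NoteGroup Barline
concepts = {
    "Measure": {                   # non-terminal: aggregate of its components
        "attributes": [("left", "Barline"), ("notes", "NoteGroup"), ("right", "Barline")],
        "validations": [],
    },
    "NoteGroup": {                 # non-terminal
        "attributes": [("heads", "NoteHead")],
        "validations": [],
    },
    "Barline": {                   # terminal: no attributes, one validation inference
        "attributes": [],
        "validations": ["vertical_stroke_present"],
    },
    "NoteHead": {                  # terminal
        "attributes": [],
        "validations": ["blob_matches_notehead_model"],
    },
}
```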

4.3 Comparison of task capabilities

We shall end this evaluation by comparing the potential capabilities of our architecture to those of other existing systems for musical score recognition. Most of the existing systems perform quite well on high-quality, selected examples. The recognition rate is usually between 95 and 99 percent for the recognition of simple symbols, and usually lower when the output is a syntactic description of the music. We cannot yet compare the effective performance of our architecture, since its implementation is currently in progress. Nevertheless, the first observation that can be made is that we could obtain similar results without difficulty. Indeed, all of these strategies could easily fit into our architecture, by putting the low-level algorithms into feature detectors, the higher-level procedures into composite methods and the overall strategy into a trivial metaprocessor. In most cases, the resulting domain model would look sparse and incomplete, but this would just reflect the lack of a true domain model grounding these systems. In any case, they would work exactly as they currently do.

There are mainly four points where our architecture will help. First, most optical music recognition systems currently available decompose the job into the following sequential tasks: staff detection, staff removal, symbol finding, symbol matching and finally syntactic reconstruction of the score. This is essentially a bottom-up approach, often combined with a simple top-down selective attention mechanism to improve efficiency. This approach has the advantage of being simple, well known and relatively robust to the presence of unidentifiable symbols. But the drawback is that the absence of high-level musical knowledge during the first phases leads to resources being wasted in searching for symbols where they have no chance of being, as well as to possibly unrecoverable mistakes due to initial misinterpretations. The constant knowledge-data interaction in our architecture should overcome this limitation.

Second, in many systems, knowledge about the musical domain is hard-coded into smart ad hoc algorithms. When declarative knowledge is present, it is often in such a constrained form (for instance a context-free grammar) that it is only used for a very limited task, typically the syntactic reconstruction of the score. As a consequence, neither reasoning nor explanation can be performed on the recognition process itself. By contrast, our architecture allows explicit knowledge representation and uses a metaprocessor to reason about the progress of the recognition process.

Third, we have seen no system for musical score recognition that was able to deal rigorously with uncertainty. Instead, most systems proceed in a pipeline-like way, where uncertain results at one stage are either discarded or assumed to be true during the next phases.

Fourth, the most common disadvantage of all existing systems is the absence of improvement over time. In the rare cases where learning algorithms are used, they are limited to training symbol detectors prior to recognizing the score. By contrast, our architecture is open to the use of learning algorithms in every inference method, that is, at every stage of the recognition process. Moreover, the metaprocessor holds all the information necessary to teach these learning algorithms positive and negative examples, for instance when the result of one inference method is strongly disconfirmed by several other inference methods.

5 Conclusion and future work

We have proposed an innovative architecture for the optical recognition of musical scores. After explaining it in detail, we showed that it is adequate for modeling knowledge, that it behaves similarly to well-known paradigms, and that it should not only equal existing systems but also overcome some of their intrinsic deficiencies. Our next objective is to complete the implementation of the described system.

References

[1] S. Ullman. High-Level Vision. MIT Press, 1995.
[2] H.S. Baird, H. Bunke and K. Yamamoto, editors. Structured Document Image Analysis. Springer-Verlag, 1992.
[3] B. Couasnon and B. Rétif. Using a grammar for a reliable full score recognition system. In International Computer Music Conference, 1995.
[4] J.W. Roach and J.E. Tatem. Using domain knowledge in low-level visual processing to interpret handwritten music. Pattern Recognition, vol. 21, no. 1, 1988, pp. 33-44.
[5] M.L. Minsky. A framework for representing knowledge. In P.H. Winston, editor, The Psychology of Computer Vision, pages 211-277. McGraw-Hill, 1975.
[6] S. Joseph and T. Pridmore. Knowledge-directed interpretation of mechanical engineering drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, 1992, pp. 928-940.
[7] J. Rumbaugh et al. Object-Oriented Modeling and Design. Prentice Hall, 1991.
[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[9] D. Heckerman, D. Geiger and M. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft Research, Redmond, Washington, 1994.
[10] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[11] F. Kurfess. Autumn school on hybrid systems. 1996.
[12] R. Kaplan. A general syntactic processor. In R. Rustin, editor, Natural Language Processing. Algorithmics Press, 1973.