AUTOMATIC PARSING AND RECOGNITION OF HAND-DRAWN SKETCHES FOR PEN-BASED COMPUTER INTERFACES

A Dissertation Submitted to the Department of Mechanical Engineering and the Committee on Graduate Studies of Carnegie Mellon University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Levent Burak Kara

September 2004

© Copyright by Levent Burak Kara 2005. All Rights Reserved.


Abstract

Pen-based computer interaction is becoming increasingly ubiquitous, as evidenced by the growing interest in Tablet PC's, electronic whiteboards and PDA's. Many of these devices now come equipped with robust handwriting recognizers. However, a problem that remains largely unsolved is the recognition of graphical input such as schematic sketches and diagrams. When faced with such input, these devices either leave the pen strokes uninterpreted, or offer only limited support for recognition, while placing many unnatural constraints on the way the user draws. These constraints might include limitations to single-stroke objects, or the need for user involvement in separating different visual objects. In this work, we present a new approach for recognizing hand-drawn, diagrammatic sketches. The key advance is an integrated sketch parsing and recognition model designed to enable natural and fluid pen-based computer interaction. With this approach, the stream of pen strokes is first examined to identify certain delimiter patterns called "markers." These then anchor a spatial analysis which groups the remaining strokes into distinct clusters, each representing a single visual object. Finally, a shape recognizer is used to find the best interpretations of the clusters. This approach eliminates many of the unnatural constraints imposed by existing sketch understanding systems. To demonstrate our techniques, we have built SimuSketch, a sketch-based interface for Matlab's Simulink package, and VibroSketch, a sketch-based interface for analyzing vibratory mechanical systems. In both systems, users can construct functional engineering models by simply sketching them on a computer screen. Users can then interactively manipulate their sketches to change model parameters and run simulations. Our user studies have indicated that even novice users can effectively utilize these systems to solve real engineering problems, without having to know much about the underlying recognition techniques.


Acknowledgements

I would like to thank all the people who have given me encouragement and helped me in the completion of this study. First and foremost, I would like to express my gratitude to my advisor, Professor Thomas F. Stahovich, for his invaluable supervision throughout my M.S. and Ph.D. studies. In the past six years, he provided me with tremendous knowledge, motivation and encouragement. Without his support, I would never have been able to see the end of the tunnel. I would also like to thank him for the endless hours he has spent revising my paper drafts and presentations. I also wish to thank Professors Kenji Shimada, William Messner and Tsuhan Chen for agreeing to serve on my thesis committee; their suggestions and comments during the proposal gave me useful guidance in the research. In the last year of my studies, Professor Shimada provided me with a comfortable office space and helped me resolve some of my administrative issues. I would like to thank all of my colleagues in the Smart Tools Laboratory for all the times we spent together. I am especially grateful to Leslie Gennari for giving me a lot of useful feedback on my paper drafts, and for her willingness to discuss new research ideas. Members of the Computer Integrated Engineering Laboratory are also gratefully acknowledged. In particular, Soji Yamakawa and Tomotake Furuhata helped me tremendously with many of my computer-related questions and problems. Their skills have always amazed me. Many people participated in the user studies presented in this thesis. I would like to thank them all for agreeing to do so. Without the data they provided, I would not have been able to evaluate my work. Finally, I would like to thank my parents, Serap and Mustafa Kara, for all the love, support, and encouragement they have given me throughout my life. This thesis is dedicated to them.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Challenges in Sketch-Based Computer Interaction
  1.2 The Focus of the Work
  1.3 Contributions
  1.4 Organization of the Thesis

2 Literature Review
  2.1 Symbol Recognition
  2.2 Approaches to Parsing 2D Visual Scenes
  2.3 Sketch Interpretation Systems
  2.4 Commercial Pen-Based Products

3 Hierarchical Parsing and Recognition of Sketches
  3.1 Introduction
  3.2 Domains of Interest

4 SimuSketch: A Sketch-Based Interface for Simulink
  4.1 User Interaction
  4.2 Preliminary Recognition
  4.3 Stroke Clustering
  4.4 Generating Symbol Candidates
  4.5 Symbol Recognition
  4.6 Error Correction
  4.7 Digit Recognition in the Dialog Boxes
  4.8 Complexity Analysis of SimuSketch
  4.9 Evaluation of SimuSketch

5 VibroSketch: A Sketch-Based Interface for Vibratory Systems
  5.1 User Interaction
  5.2 Preliminary Recognition
  5.3 Stroke Clustering
  5.4 Symbol Recognition
  5.5 Connectivity Analysis
  5.6 Complexity Analysis of VibroSketch
  5.7 Evaluation of VibroSketch

6 Symbol Recognition: An Overview
  6.1 Comparison of the Recognizers

7 Image-Based Symbol Recognition
  7.1 Preprocessing and Representation
  7.2 Template Matching Using Multiple Classifiers
    7.2.1 Hausdorff Distance
    7.2.2 Modified Hausdorff Distance
    7.2.3 Tanimoto Coefficient
    7.2.4 Yule Coefficient
    7.2.5 Distance Transform
  7.3 Combining Classifiers
    7.3.1 Parallelization
    7.3.2 Normalization
    7.3.3 Combination Rule
  7.4 Handling Rotations
    7.4.1 Polar Transform
    7.4.2 Finding the Optimal Alignment Using Polar Transform
  7.5 Polar Transform as a Pre-recognizer
  7.6 User Studies
    7.6.1 Graphic Symbol Recognition
    7.6.2 Digit Recognition
  7.7 Discussion

8 Feature-Based Statistical Symbol Recognition
  8.1 Preprocessing and Segmentation
  8.2 Feature Set
  8.3 Training
  8.4 Recognition
  8.5 User Study

9 Graph-Based Symbol Recognition
  9.1 Background
  9.2 Error-Driven Stochastic Graph Matching
    9.2.1 Representation
    9.2.2 Training
    9.2.3 Recognition
    9.2.4 Combining Error Metrics
    9.2.5 Handling Different Drawing Orders Using Stochastic Primitive Swapping
  9.3 Evaluation

10 Limitations and Future Work

11 Conclusions

Bibliography

List of Tables

4.1 Average scores obtained from user questionnaire. Scale: 1-10, 10 being excellent.
7.1 Results from the graphic symbol recognition study. The first two columns show the top-one and top-two accuracy, respectively. All tests were conducted on a Pentium 4 machine at 2.0 GHz with 256 MB of RAM.
7.2 Results from the digit recognition study. All tests were conducted on a Pentium 4 machine at 2.0 GHz with 256 MB of RAM.
9.1 Error metrics and corresponding weights used in the graph-based symbol recognizer.

List of Figures

1.1 (a) Schematic of a vibratory system. (b) A computer model for the same system created using ADAMS, a commercial dynamic simulation package. (c) A hand-sketch of the same system drawn by a mechanical engineer.
2.1 Pen-and-Internet's riteShape can only recognize simple geometric shapes such as rectangles, triangles, circles and ellipses [http://www.penandinternet.com].
3.1 Parsing and Recognition Architecture.
4.1 (a) SimuSketch. (b) Automatically derived Simulink model.
4.2 SimuSketch is deployed on a Wacom Cintiq tablet with cordless stylus.
4.3 Users can bring up a scroll bar and instantly use it by drawing a line along the right border of the display. They can also annotate their sketches.
4.4 The user can interact with the system through sketch-based dialog boxes. In the instance shown, the user is editing a Sine Wave block. Simulation results are presented to the user through conventional Simulink displays, which pop up when the user clicks on Scope blocks.
4.5 A dialog window for training object shapes.
4.6 The user can view a beautified version of his model in which the original sketch is replaced by cleaned-up objects.
4.7 Arrow recognition. (a) A one-stroke arrow with the key points labelled. (b) Speed profile. Key points are speed minima.
4.8 Filtering used in detecting the key points. (a) Speed minima that occur in the first 40% (the dismissed region in the speed profile) are not considered, as they typically correspond to the bends in the arrow shaft. (b) Speed minima that occur too close to one another (encircled in the speed profile) are condensed into a single point.
4.9 Examples of (a) arrows and (b) arrowheads that are correctly recognized.
4.10 The five key points on a two-stroke arrow.
4.11 Single-stroke arrow recognition fails when the arrowhead is either too big or too small.
4.12 The speed profile of a single-stroke arrow when the user draws too slowly.
4.13 The speed profile of a single-stroke arrow when the user draws too fast. Only three of the four key points are determined toward the end of the stroke.
4.14 Resampling a raw stroke to obtain data points equally spaced along the stroke's trajectory.
4.15 (a) The inverse-curvature at a point is calculated as the cosine of the angle between the segments before and after the point. (b) The inverse-curvature is minimized at the sharp corners.
4.16 The structure of the neural network.
4.17 Illustration of the cluster analysis. (a) Each stroke is assigned to the nearest arrowhead or tail. (b) Strokes assigned to the same arrow are grouped into clusters. (c) Clusters with overlapping bounding boxes are merged. (d) Arrows that did not receive any strokes are attached to the nearest cluster.
4.18 Example of a branching arrow. The program must infer that the source object of the branching arrow is the Sine Wave and not the upper Scope.
4.19 The distance between a branching arrow's tail point and a primary arrow is the minimum of the five distances computed from the branching arrow to the five representative points on the primary arrow.
4.20 The number of legal input/output channels for the Simulink objects.
4.21 (a) A malformed arrowhead causes the bottom left arrow to be missed. (b) After detecting the abnormal cluster, the program finds the arrow and interprets the sketch correctly.
4.22 (a) Undetected arrows can be corrected with the 'o' gesture. (b) After the correction, the program rectifies the rest of the sketch.
4.23 Digit recognition.
4.24 The processing times of different modules of SimuSketch for three different cases. All times are in milliseconds.
4.25 Test problems employed in the user studies.
4.26 (a) Pairs of most frequently confused objects. (b) A misrecognition due to non-uniform scaling. Left: definition symbol. Right: one user's misrecognized symbol.
5.1 A typical vibratory system created in VibroSketch. The program interprets the sketch, performs a simulation of it, and displays the results in the form of live animations and graphical plots.
5.2 A stroke chain involves the original pen strokes and the hypothetical linkages between them. The stroke-linkage sequence on the right shows the resulting stroke chain. The numbers and arrows indicate the order and directions in which the strokes were drawn. The stroke chain does not assume a particular drawing order or direction.
5.3 (a) Examples of correctly recognized ground symbols. (b) For recognition, our program considers various features such as the length of the skeleton, the separation between hatches and the orientation of hatches.
5.4 (a) The remaining objects that need to be identified once the masses and the grounds in Figure 5.1 have been recognized. (b) The hierarchical clustering algorithm separates the scene into distinct clusters. In the configuration shown, the algorithm has been run until a single cluster was obtained. The marked clusters are later determined by analyzing the distance between the merged clusters at every iteration.
5.5 The dissimilarity score δ increases monotonically with the number of iterations. Sharp leaps, such as the one at iteration 17, usually correspond to forced merges and thus can be used to determine the number of natural clusters.
5.6 The clustering algorithm falls short when symbols overlap or when intra-symbol distances are comparable to inter-symbol distances.
5.7 The original (left) and segmented (right) versions of a spring, a damper and a force symbol.
5.8 The processing times of different modules of VibroSketch for three different cases. All times are in milliseconds. 'Number of Markers' includes both the ground and mass symbols.
5.9 Two successfully recognized sketches employed in our user study.
6.1 Two beams, one with two supports and the other with three.
6.2 A comparative illustration of the attributes and performance metrics of the three symbol recognizers. The performance metrics are based on a maximum of five stars, five being the best.
7.1 Examples of symbols correctly recognized by our system (at the time of the test, the database contained 104 definition symbols). The top row shows symbols used in training, and the bottom row shows correctly recognized test symbols. With our approach, over-tracing, missing/extra pen strokes, different line styles, or variations in angular orientation do not pose difficulty.
7.2 Recognition Architecture.
7.3 Examples of symbol templates. Left: a mechanical pivot. Middle: 'a'. Right: '8'. The templates are shown on 24x24 grids to better illustrate the quantization.
7.4 Illustration of directed Hausdorff distances. The Hausdorff distance is the maximum of the two directed distances, in this case h(A, B).
7.5 Schematic illustration of the overlap between two patterns A and B. The numbers of image pixels are denoted by n_a and n_b, respectively. The intersection denotes the number of overlapping black pixels n_ab. n_00 denotes the number of overlapping white pixels (i.e., background pixels) in the two patterns.
7.6 When determining the coincident pixels in the Tanimoto and Yule coefficients, a threshold of 4.5 pixels is used to take into account variations in the patterns. The figure shows the boundaries of the admissible region around an image pixel at the center.
7.7 (a) A checkmark symbol quantized into a 10x10 template. (b) The corresponding distance transform matrix.
7.8 (a) Left: Letter 'P' in screen coordinates. Right: in polar coordinates. (b) When the letter is rotated in the x-y plane, the corresponding polar transform shifts parallel to the θ axis.
7.9 Top: Initial polar image of the rotated 'P' from Figure 7.8. Bottom: Same image mapped to the range -π to +π. In effect, the portions overstepping the +π boundary are moved near the -π boundary. The two images are equivalent.
7.10 The θ coordinate of the polar transform is sensitive to the origin for points near the image center. (a) Letter 'T' and its polar transform. (b) Nearly the same letter except for the curl of the tail. The difference causes a noticeable difference in the polar transform at small values of r.
7.11 Weighting function used to suppress the sensitivity to the origin.
7.12 Symbols used in the graphic symbol recognition experiment.
8.1 Left: The unprocessed shape as drawn by the user. Right: The resulting shape after segmentation.
8.2 Left: A square drawn in a single stroke in the clockwise direction. Right: The corresponding speed profile of the pen tip. As shown, the corners of the square can be identified by locating the minima in the speed profile.
8.3 Illustration of the feature set.
8.4 Examples of a damper symbol used for training. Segmented ink is shown.
8.5 Examples of a square symbol that lead to a deficient covariance matrix.
8.6 The set of symbols used for testing the feature-based recognizer.
8.7 The confusion matrix of the feature-based recognizer. A number in the matrix indexed by (row, column) indicates how many times a row symbol is misclassified as a column symbol. Each row symbol was tested using 21 test cases.
8.8 Top: The recognition rate of the feature-based recognizer. Middle: The recognition rate when the Squareroot symbol is left out of the analysis. Bottom: The recognition rate when both the Squareroot and the Pivot symbols are left out.
8.9 The recognition rates when the correct symbol is identified in the top-2 and top-3 spots of the recognition lists.
9.1 The semantic network definition of a square. The links represent parallel and perpendicular relationships and intersections.
9.2 (a) A hypothetical symbol. (b) Its statistical graph representation. (c) The same graph represented in matrix form. Note that all examples used to construct the graph had four segments drawn in the order (1) Line, (2) Arc, (3) Line, (4) Line. 'RL' stands for relative length.
9.3 The probabilistic definition function P(x). Left: m = 30% and σ = 4%. Right: m = 80% and σ = 10%.
9.4 The squashing function S(x).
9.5 Illustration of the intersection angle information for the unknown and the definition.
9.6 Illustration of the intersection location information for the unknown and the definition.

Chapter 1 Introduction Today’s engineers are surrounded by a myriad of sophisticated software ready to help them in many of their professional tasks. Even in this high-tech computer era, however, the most elementary form of written communication – sketching via pen and paper – is central to many technological advancements and will likely continue to be so. There are a number of reasons for this. First and foremost, sketches often serve as the first concrete manifestation of the human thought process, solidifying the elusive bridge between the world of minds and the world of reality. In fact, it is not unusual to trace the history of many successful achievements, be they an advanced machine, an architectural masterpiece or a major business idea, to a back-of-the-envelope sketch. Second, sketches serve an important role as a problem solving tool, both by aiding short term memory and by helping to make abstract problems more concrete. They compactly and efficiently represent various kinds of relationships, such as functional, temporal and geometric relationships which are often very difficult to communicate with text alone. In many disciplines, sketches provide a medium for visualizing new concepts, critiquing existing ideas, nurturing new ones, recording elusive thoughts, emphasizing key points and communicating information with other people. In the realm of engineering and architecture, sketches greatly facilitate conceptual design activities by freeing the designer from worrying about intricate details such as precise size, shape, location and color, and instead enabling him or her to focus on more critical issues 1


that require creativity and abstraction [50]. Due to their minimalist nature, i.e., articulating only what is necessary, they enhance collaboration and communication efficiency. Indeed, it is common for engineers, architects, and other designers to spend a considerable amount of time laying down their initial concepts on sketches using pencil and paper. Typically, it is only after the main ideas have sufficiently matured that all of that work is transformed into electronic media in the form of technical drawings, flow charts and mathematical models. This obvious redundancy and inefficiency has propelled the idea of sketch-based user interfaces as a means of achieving more natural human-computer interaction. Starting with Sutherland's Sketchpad in 1963 [70], there has been a large and growing body of work devoted to the creation of computer software that works exclusively from freehand drawings. Early attempts to create such systems were often limited by insufficient technology. New insights into human perception, as well as advances in pattern recognition, machine intelligence, computer graphics, and hardware technology, have now made it feasible to create usable systems. Furthermore, the growing pressure to shift from computer-centric to human-centric interaction models [6] has made such systems essential. In fact, in many of today's mainstream computing devices, such as tablet PC's, electronic whiteboards, and personal digital assistants (PDA's), the pen is emerging as a standard tool for interaction. Many of these new computing devices now come equipped with robust handwriting recognition utilities. However, an important aspect of sketch-based computer interaction that remains largely unsolved is the recognition of graphical input, such as schematic sketches and diagrams containing geometric shapes, engineering symbols and glyphs. When faced with such input, these devices either leave the strokes uninterpreted, thus resulting in the equivalent of a drawing program, or offer only limited support for recognition while placing many unnatural constraints on the way the user draws. Hence, the electronic pen usually ends up being no more useful than a note-taking or pointing apparatus. Although researchers are beginning to make progress in handling graphical input, even the latest experimental systems are typically limited to basic shapes such as rectangles, triangles and circles.


Figure 1.1: (a) Schematic of a vibratory system. (b) A computer model for the same system created using ADAMS, a commercial dynamic simulation package. (c) A hand-sketch of the same system drawn by a mechanical engineer.

Furthermore, the few experimental systems that do provide shape recognition are often too restrictive because their recognizers are either special-purpose, hard-coded systems, or they require substantial amounts of training data, which makes them difficult to extend to domains with novel shapes and glyphs. The motivation behind this work was to create the techniques needed to handle the kind of graphical input that is an essential part of problem solving in many domains. With such techniques, computers will be able to work from the kinds of sketches and diagrams people ordinarily use when communicating and problem solving, rather than requiring people to adapt to software. As a case exemplifying the capability


we would like to achieve, consider the schematic of a vibratory system shown in Figure 1.1a. While such diagrams serve as handy visualization tools in human-to-human communication, current computational tools are not designed to work from such representations. Instead, even a simple analysis of such a system typically requires the use of precise computer models and software such as the one shown in Figure 1.1b. The use of such software, however, requires a significant amount of training and experience. Part c of the figure, on the other hand, shows a sketch of the same system drawn by a mechanical engineer. From a user’s perspective, this type of sketch embodies essentially the same information as the computer model, yet requires only a fraction of the effort to create it. This work aims to develop techniques that allow engineering software to work directly from such informal sketches.

1.1 Challenges in Sketch-Based Computer Interaction

The fundamental challenge in sketch-based computer interaction that distinguishes it from other types of interaction mechanisms has to do with the difficulty of interpreting hand drawings. The input to traditional text-based or WIMPy (Windows, Icons, Menus, Pointing) applications is unambiguous. In a text-based program, for instance, a character string typed at the command line will either match a valid internal command and thus be executed, or otherwise will be left unprocessed. Similarly, the interaction in a windows-based program is strictly regulated by the use of menus, icons and dialog boxes, thus making the input to the application definite and unambiguous. It is unrealistic to expect the same level of consistency, regularity and precision from hand drawings. Quite the contrary, hand-drawn input tends to be highly variable and inconsistent, both in intra-person and inter-person settings. Thus, for a sketch-based system to be of practical utility, it must robustly cope with the variations and ambiguities inherent in hand drawings so as to interpret the visual scene the way the user intended. We believe that among the many issues that remain to be solved, there are two


particular challenges that hinder the development of robust sketch understanding systems. The first is ink parsing, the task of grouping a user’s pen strokes into clusters representing intended symbols, without requiring the user to indicate when one symbol ends and the next one begins. This is a difficult problem as the strokes can be grouped in many different ways, and moreover, the number of stroke groups to consider increases exponentially with the number of strokes. The combinatorics thus clearly render approaches based on exhaustive search infeasible. To alleviate this difficulty, many of the current systems require the user to explicitly indicate the intended partitioning of the ink. This is often done by pressing a button on the stylus, or more commonly, by pausing between symbols [22, 58]. Alternatively, some systems avoid parsing by requiring each object to be drawn in a single pen stroke [50, 63, 43]. However, such constraints usually result in a less than natural drawing environment.
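To make the combinatorial argument concrete, note that the number of ways to partition n strokes into groups is the Bell number B(n). The following short calculation (purely illustrative, and not part of any system described in this thesis) computes B(n) with the standard Bell-triangle recurrence:

def bell(n):
    """Number of ways to partition n pen strokes into symbol groups."""
    if n == 0:
        return 1
    row = [1]  # Bell triangle row for n = 1
    for _ in range(n - 1):
        nxt = [row[-1]]  # each row begins with the last entry of the previous row
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]  # the last entry of row n is B(n)

# B(5) = 52, B(10) = 115,975, and B(20) is roughly 5.2 x 10^13, so even a
# modest sketch admits far too many groupings for exhaustive search.
print([bell(n) for n in (5, 10, 20)])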

The second issue is symbol recognition, the task of recognizing individual hand-drawn figures such as geometric shapes, glyphs and symbols. The task of differentiating between, say, a damper and a spring symbol is the focus of symbol recognition. While there has been significant recent progress in symbol recognition [63, 22, 8, 56], many recognizers are either hard-coded or require large sets of training data to reliably learn new symbol definitions. Such issues make it difficult to extend these systems to new domains with novel shapes and symbols. Some of the existing trainable systems rely on single-stroke methods, in which an entire symbol must be drawn in a single stroke [63, 43], or on constant drawing order methods, in which two similarly shaped patterns are considered different unless the pen strokes leading to those shapes follow the same sequence [60, 75]. Systems such as [5, 23] allow for multiple-stroke symbols; however, the recognizers are manually coded. Systems such as [22, 8, 56, 37] are trainable, but typically require a multitude of training examples. Additionally, many symbol recognizers have been built as stand-alone applications without addressing the issue of integration into high-level sketch understanding systems.


1.2 The Focus of the Work

This thesis addresses the issues of parsing and recognition. In particular, it puts forth a new computational model for the online recognition of hand-drawn, diagrammatic sketches. The key advance in this model is a hierarchical parsing and recognition approach that alleviates many of the unnatural drawing constraints imposed by other systems. For instance, unlike systems that require the user to explicitly indicate the separation between different visual symbols, our system allows users to draw continuously, as they would on paper. Likewise, unlike systems that allow only single-stroke symbols, our system allows symbols to be composed of multiple strokes. Symbols can be drawn in any order and, except for a few special cases, the strokes comprising a symbol need not be drawn consecutively or in any particular sequence. This allows the user, for example, to come back to a partially completed symbol and add more strokes to it. Furthermore, our system is trainable, which makes it easily extensible to new applications with novel symbols and shapes. Finally, our techniques are designed to provide interactive performance, meaning that they are fast enough to be used in real-time settings.

Our hierarchical parsing and recognition approach to sketch understanding is based on a novel mark-group-recognize architecture. With this approach, the stream of pen strokes is first examined to identify certain delimiter patterns called "markers." These are then used to efficiently cluster the remaining strokes into groups corresponding to individual symbols. In the last step, the identified stroke groups are evaluated using a symbol recognizer to determine which domain symbols they represent. Further details of this approach are discussed in Chapter 3.
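In code, the control flow of this architecture can be summarized as follows. This is only a schematic sketch of the three stages: the names find_markers, cluster_strokes, and recognize are placeholders for the components detailed in Chapters 4 and 5, not the actual implementation.

def interpret_sketch(strokes, find_markers, cluster_strokes, recognize):
    """Schematic mark-group-recognize pipeline (illustrative only).

    strokes:         the raw stream of pen strokes, in drawing order
    find_markers:    stage 1 - detect delimiter patterns ("markers"),
                     e.g., the arrows in SimuSketch
    cluster_strokes: stage 2 - group the remaining strokes into clusters,
                     one per intended symbol, anchored on the markers
    recognize:       stage 3 - map each cluster to its best domain symbol
    """
    markers, remaining = find_markers(strokes)
    clusters = cluster_strokes(remaining, markers)
    return markers, [recognize(cluster) for cluster in clusters]

Because the markers are identified first, the clustering and recognition stages operate on a scene that has already been partially structured, which is what keeps the overall search tractable.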

In addition to the hierarchical parsing and recognition approach outlined above, this thesis presents three new domain-independent symbol recognizers we have built. All three recognizers are trainable in that new symbols can be taught to the system without requiring the user to write any code. Hence, users can easily train their own set of symbols rather than having to learn a predefined set. While each recognizer has its own strengths and weaknesses, they all recognize shapes independent of size, location and orientation. The modular design and implementation of our parsing and recognition approach allows any of the three symbol recognizers to be easily incorporated into the sketch recognition engine.

While our techniques are designed to be useful in a variety of pen-based applications, the domain of engineering software was chosen to illustrate them. In particular, we demonstrate the utility of our approach with two implemented systems. The first is SimuSketch, a sketch-based interface for Matlab's Simulink package, and the second is VibroSketch, a sketch-based interface for analyzing vibratory mechanical systems. In both systems, users can construct functional engineering models by simply sketching them on a computer screen. Our system interprets the users' sketches and then operates the underlying engineering tool, without further user intervention. Users can then interactively manipulate their sketches to change model parameters and perform analyses. Our system then gathers the results and presents them to the user in the form of graphs and live animations.

Besides these technical advances, this work also explores part of the human side of the problem. During our studies, a considerable effort was spent on the design and implementation of the user interfaces of our prototype systems. The resulting applications make use of several concepts from traditional graphical user interfaces. For example, while free-form sketching is a natural form of input for geometry creation, dialog boxes are a convenient abstraction for various other kinds of interactions, such as editing the properties of visual objects or viewing simulation results. Our system allows users to combine both types of interaction. It is also designed so that familiar user interface components are readily accessible when they are needed. For example, it offers a number of gestural commands for several common tasks, such as object selection, deletion and repositioning, or for bringing up a sketchy scroll bar to gain more drawing space. Similarly, users have the choice of viewing their sketches in their original "sketchy" form, or in a "cleaned-up" form in which the user's objects are replaced by beautified graphical elements.

While useful for the practicing engineer, our techniques also have distinct advantages in engineering education. Students typically use only a subset of the capabilities of commercial engineering software; thus, sketch-based software for education is a readily achievable goal. Such tools will provide an enhanced educational experience


by allowing students to focus on problem solving rather than on how to use the software. Similarly, such tools can be directly integrated into the classroom environment to better illustrate concepts that would ordinarily require mental simulations. For instance, using our interface, an instructor can sketch out a mechanical system on an electronic whiteboard, just as he or she normally would on an ordinary blackboard, and directly animate its behavior. A recent survey of sophomore mechanical engineering students at Carnegie Mellon University revealed that they believed using SolidWorks (a commercial computer-aided design system) in their assignments to visualize mechanical motions greatly helped their understanding of key concepts. Our goal is to make such tools readily available in the classroom while students are learning a new concept for the first time. It is worthwhile to emphasize that the techniques in this thesis are tailored toward online sketch recognition for interactive pen-based systems. Our techniques assume that not only the final static image of the sketch, but also the drawing process leading to it, is available to the system. This is because our techniques exploit the interactive nature of sketching, such as the temporal ordering of the user's strokes and the speed at which they were drawn, to enhance both the system's accuracy and efficiency. The problem of sketch recognition from static images involves many other challenges and shall be the focus of future work.
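As a small illustration of the kind of temporal information this makes available, consider the pen-speed profile: corners and arrow key points appear as local speed minima, because the pen decelerates to change direction (this cue is used in Chapters 4 and 8). The sketch below is a minimal, hypothetical version of the idea; the 0.25 threshold is arbitrary.

import math

def speed_profile(points):
    """Pen-tip speed between consecutive time-stamped samples (x, y, t)."""
    speeds = []
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = max(t1 - t0, 1e-6)  # guard against repeated timestamps
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return speeds

def candidate_corners(points, ratio=0.25):
    """Indices of local speed minima below `ratio` times the mean speed.

    Such points tend to coincide with corners and other key points,
    since the pen must slow down to change direction sharply.
    """
    s = speed_profile(points)
    mean = sum(s) / len(s)
    return [i for i in range(1, len(s) - 1)
            if s[i] <= s[i - 1] and s[i] <= s[i + 1] and s[i] < ratio * mean]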

1.3 Contributions

The contributions of this thesis are:

Concepts and techniques:

• A computational model for automatically parsing and recognizing hand-drawn schematic sketches.
• Three domain-independent, trainable symbol recognizers that can be used in size-, position-, and orientation-independent recognition of multi-stroke symbols.

Software:


• SimuSketch: A sketch-based interface for Matlab's Simulink package.
• VibroSketch: A sketch-based interface for analyzing vibratory mechanical systems.
• A suite of techniques that allow users to interact with their sketches created in these systems.
• A custom-built, stand-alone digit recognizer.
• Code for integrating the two prototype systems with Matlab, thus allowing users to solve real problems using these systems.

User Studies:

• Results of a user evaluation of SimuSketch.
• Results of a user evaluation of VibroSketch.
• Evaluation of two of the three symbol recognizers.

1.4 Organization of the Thesis

The remainder of the thesis is organized as follows. Chapter 2 reviews previous and ongoing work in sketch recognition, with a particular focus on symbol recognition and sketch parsing techniques. Chapter 3 presents an overview of our hierarchical parsing and recognition model. Based on this model, Chapter 4 and Chapter 5 present our SimuSketch and VibroSketch systems, respectively. Chapter 6 gives an overview and a comparative evaluation of the three symbol recognizers we have developed. The details of these recognizers are then presented in the subsequent three chapters. Finally, future directions and conclusions are presented in Chapter 10 and Chapter 11.

Chapter 2

Literature Review

In recent years, there have been numerous efforts to create experimental sketch understanding systems. This chapter begins with a discussion of work focused on sketch parsing and recognition, and then surveys existing sketch-based applications.

2.1 Symbol Recognition

Graph-based methods have been one of the most prominent approaches to object representation and matching, and have recently been applied to hand-drawn pattern recognition problems. With these methods, sketched symbols are first decomposed into basic geometric primitives, such as lines and arcs, which are then assembled into a graph structure that encodes both the intrinsic attributes of the primitives and the geometric relationships between them. Pattern detection is then formulated as a graph-subgraph isomorphism problem, which has been extensively studied by computer vision practitioners [73, 15]. Lee [52] developed a graph-based approach in which the graph represents the precise geometry of the object, and thus the approach is suitable for precisely drawn symbols with uniform scaling. For example, the approach has been used to recognize machine-drawn symbols, symbols drawn using templates, and precise hand-drawn symbols. Lee's approach requires manual selection of key vertices during training. Calhoun et al. [8] developed an approach in which the graph encodes topology, rather


than geometry, so as to be more tolerant of typical variations in hand-drawn sketches, such as non-uniform scaling. Their method provides automatic training, although the drawing order must be consistent across the training examples. These sorts of graph-based approaches are sensitive to segmentation errors, and graph matching can be expensive. In this work, we extend Calhoun et al.'s graph-based symbol recognizer such that the attributes of the graph are described statistically, based on a set of training examples. This makes our approach less sensitive to segmentation errors and drawing variations.

As an alternative to graphical methods, some researchers have developed approaches based on aggregate features of the symbols. For example, Apte et al. [5] developed a hard-coded recognizer that examines the geometric properties of the convex hull of a symbol. The recognizer also makes use of special geometric properties of particular shapes. As it is hard-coded, this recognizer is not easily extended to new symbols. Fonseca and Jorge [22] developed a recognition method that uses a naive Bayesian classifier to recognize multi-stroke shapes. Each shape is described by four geometric features calculated from three special polygons defined by the convex hull of the shape. Because this recognizer works from convex hull properties, it cannot distinguish between different shapes with the same convex hull.

Hammond and Davis [33] describe a symbolic language called LADDER that allows designers to specify how shapes are drawn, displayed and edited in a given domain. The language consists of a set of predefined shapes, constraints, editing behaviors and a syntax for combining them. With this language, designers can create new domain objects by specifying shape descriptions. Based on LADDER, the authors developed a translator [32] that takes symbolic shape descriptions and transforms them into shape recognizers. While their approach enables domain-independent shape recognition, users are required to manually write new descriptions each time a new shape is introduced. In the symbol recognizers we have developed, the user simply sketches several examples of a symbol and the system learns a description of the symbol by considering the statistical properties of the training examples.

Rubine [63] developed a trainable recognizer for single-stroke gestures. A gesture (pen stroke) is characterized by a set of 11 geometric and 2 dynamic attributes. Training is accomplished by constructing a linear discriminant classifier with weights learned from training examples. As the attributes are aggregate features of the pen stroke, it is possible for different gestures to have the same features, resulting in confusion. The method is applicable only to single-stroke sketches and is sensitive to the drawing direction and orientation. The methods we have developed handle multi-stroke shapes drawn in any orientation.

Parametric methods such as polygon, B-spline, and Bezier curve fitting techniques have also been considered for shape representation and classification [39, 61]. A benefit of these approaches is that there is no need to segment the pen stroke into geometric primitives such as lines and arcs. Additionally, since only a few parameters are needed for shape description, these methods are computationally efficient. Like Rubine's method, however, these methods are primarily applicable to single-stroke symbols or gestural commands.

Gross [29] developed an approach for identifying low-level glyphs (symbols). The approach relies on a 3x3 grid inscribed in the glyph's bounding box. The sequence of grid cells the pen visits is used to distinguish each glyph. Higher-level object recognizers can be developed by examining the spatial relationships between glyphs. Based on this recognizer, his group has developed several applications in the field of architectural design [27, 47, 28]. His recognizer is trainable, but is sensitive to rotation: the recognizer must be trained with an example of each possible orientation of a symbol. Because of the coarse resolution of a 3x3 grid, this approach may not be able to handle glyphs with both large and small features.
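To make the aggregate-feature idea concrete, the sketch below implements the classification step of a generic linear discriminant over stroke features, in the spirit of Rubine's method discussed above. The feature set and the training procedure (estimating the weights from example gestures) are omitted, and all names and numbers here are illustrative.

import numpy as np

class LinearGestureClassifier:
    """Score a feature vector f against per-class weights:
    score_c = b_c + w_c . f, and predict the argmax class.
    This mirrors the general form of Rubine-style recognizers; the
    weights would normally be learned from training examples.
    """

    def __init__(self, labels, weights, biases):
        self.labels = list(labels)
        self.W = np.asarray(weights, dtype=float)  # (num_classes, num_features)
        self.b = np.asarray(biases, dtype=float)   # (num_classes,)

    def classify(self, features):
        scores = self.b + self.W @ np.asarray(features, dtype=float)
        return self.labels[int(np.argmax(scores))]

# Hypothetical usage: two gesture classes over three aggregate features
# (say, total path length, bounding-box diagonal, and initial angle).
clf = LinearGestureClassifier(["delete", "select"],
                              [[0.5, -1.0, 2.0], [-0.3, 0.8, 0.1]],
                              [0.0, 0.2])
print(clf.classify([1.2, 0.7, 0.4]))  # -> "delete" with these numbers

As the surrounding discussion notes, any classifier of this form can confuse gestures whose aggregate features happen to coincide.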

2.2 Approaches to Parsing 2D Visual Scenes

Inspired by the advances in speech recognition, some systems facilitate parsing by requiring visual objects to be drawn with a predefined sequence of pen strokes [66, 75]. These systems are typically implemented in the form of Hidden Markov Models (HMM’s) where the observed pattern is viewed as the result of a stochastic process that is governed by a hidden stochastic model. Each stochastic model represents a different class pattern capable of producing the observed output. The goal is to


identify the model that has the highest probability of generating the output. While useful for reducing computational complexity, the strong temporal dependency in these methods forces the user to remember the correct order in which to draw the strokes. For sketch recognition, this ordering has limited HMM's to problems that exhibit strong temporal structure, such as handwritten text recognition. In a general sketch-based system, however, one can, and typically does, go back to a previous spatial location and add new strokes. HMM-based methods become less appealing in such situations. Furthermore, the need for large training data sets may inhibit the use of HMM's when such data is scarce. Shilman et al. [67] present a statistical visual language model for ink parsing. Their approach requires the visual grammar to be encoded manually. A large corpus of training examples is then used to learn the statistical distributions of the geometric parameters used in the grammar, resulting in a statistical model. The grammar, and hence the statistical model, defines composite objects hierarchically in terms of lower-level objects. The lowest-level objects are single-stroke symbols recognized with Rubine's method. Thus, their method requires that the lowest-level objects can be recognized in isolation, although ambiguity is handled naturally by their Bayesian approach. Additionally, the authors suggest that their approach may not scale well to large sketches. Finally, their approach assumes that shapes are drawn in certain preferred orientations. This is a property of both their grammar and Rubine's method. Alvarado [4] has proposed a parsing approach based on dynamically constructed Bayesian networks. The approach is similar to that of Shilman et al. [67], but the lowest-level objects are geometric primitives, such as lines, rather than symbols that must be recognizable in isolation. Alvarado has constructed an early implementation of the approach, but formal evaluations have not yet been conducted. Costagliola and Deufemia [12] present an approach based on LR parsing for the construction of visual language editors. They employ "extended positional grammars" to encode the attributes of the graphical objects. Their method is intended for use with pre-recognized shapes (icons selected from a menu), and thus is not directly applicable to sketch understanding problems. Saund et al. [65] present a system that uses Gestalt principles to determine the salient objects represented in a line drawing.


Their work concerns only the grouping of the strokes and does not employ recognition to verify whether the identified groups are in fact the intended ones. In the domain of machine vision, Jacobs [41] describes a system to recognize objects with straight-line perimeter representations. The system uses a number of heuristic rules to group edges that likely come from a single object, and then uses simple recognizers to identify the objects represented by the edges. However, the rules rely on the presence of straight line segments and sharp corners, and thus are not well-suited to less structured patterns, such as sketches. Other vision techniques operate by generating a multitude of partial interpretations of the edges in the scene. Additional evidence is then used to support or refute the interpretations. Grimson [26] examined the combinatorics of such techniques and reported that they are often faced with the difficulty of non-optimal thresholds that either prematurely terminate a promising path or retain a futile one for too long.

2.3 Sketch Interpretation Systems

A few sketch-based interfaces have been developed for interpreting electrical circuit sketches. Narayanaswamy [58] developed a sketch-based interface for SPICE. It uses hard-coded recognizers that assume a fixed drawing order. Also, the system avoids issues of parsing by requiring the user to pause between symbols. Hong and Landay [36] demonstrated the capabilities of their SATIN system (described below) by creating SketchySPICE, a simple circuit CAD tool intended primarily as a proof of concept. SketchySPICE is limited to recognizing AND, OR, and NOT gates, and the wires connecting them. Gates must be drawn in either one or two strokes. Lee [52] describes a trainable recognizer for electrical circuit symbols. Symbols are classified by comparing a symbol’s attribute graph to that of a probabilistic model of each learned symbol. This recognizer was developed for use with scanned bitmap images rather than real-time sketches for which the pen trajectories are available as sequences of time-stamped coordinates. Additionally, Lee’s approach requires that each symbol be drawn using only one or two strokes.


In addition to electric circuits, recent years have seen the development of experimental sketch-based interfaces for a variety of other disciplines. Stahovich et al. [69] present a system called SketchIT that can transform a sketch of a mechanical device into working designs. The program's task is to determine what the geometry of the sketch should have been to make the sketched device behave as intended. To do this, the program employs a novel behavioral representation, called qualitative configuration space (qc-space), that captures the behavior suggested by a sketch while abstracting away the particular geometry used to suggest that behavior. The program is concerned only with the high-level interpretation of the sketch, and does not consider the low-level issues of parsing and symbol recognition. Alvarado [3] describes a system called ASSIST that can interpret and simulate a variety of hand-drawn mechanical devices. The system uses a number of heuristics to construct a recognition graph containing the likely interpretations of the sketch. The best interpretation is chosen using a scoring scheme that uses both contextual information and user feedback. A main limitation of ASSIST is that its shape recognizers are hard-coded. Kurtoglu and Stahovich [48] describe a program that augments sketch understanding with qualitative physical reasoning to understand schematic sketches of physical devices. Employing the shape recognizer described by Calhoun et al. [8], the program first identifies the geometric interpretation of an input shape and then uses constraint satisfaction techniques to efficiently construct physically consistent interpretations of the identified components. It then uses qualitative simulation to select the interpretation that produces an intended behavior. One key feature of their system is that it allows users to incorporate shapes from several different domains instead of limiting them to one particular domain. However, this program is not concerned with parsing; the user is required to signify the beginning and end of each component. Hammond and Davis [31] describe Tahuti, a sketch recognition environment for class diagrams in UML. Tahuti interprets diagrams using a series of steps: preprocessing each stroke to fit it to a series of geometric shapes, grouping the strokes into possible symbols, recognizing the most likely symbols, and identifying each of those symbols. Their method recognizes objects based on their geometric


properties and allows symbols to be drawn in multiple strokes at any orientation. However, Tahuti's recognizers are hard-coded and can recognize only a limited number of shapes, such as rectangles, ellipses, arrows, and editing gestures. Our recognizer, in contrast, is trainable and learns new definitions from sample sketches of a symbol. Matsakis [56] describes a system for converting handwritten mathematical expressions into a machine-interpretable typesetting command language. The recognition process begins with isolated symbol classification, where each input stroke is analyzed together with the earlier ones to find the mathematical symbol that best describes the set of strokes. Next, driven by the results of symbol recognition, the program attempts to find an optimal partitioning of the strokes to form the set of salient symbols. Finally, a minimum-spanning-tree algorithm is employed to examine the spatial arrangement of the symbols, and the result is combined with a geometric grammar to generate an interpretation of the expression in a typesetting language. For shape recognition, the program models each symbol as a Gaussian probability function whose features consist of the positions and gradients of the resampled pixel points. One limitation of this approach is that each symbol requires a large number of training samples, and each training sample must be drawn using the same number of strokes. Landay and Myers [50] present an interactive sketching tool called SILK that allows designers to quickly sketch out a user interface and transform it into a fully operational system. As the designer sketches, SILK's recognizer matches the pen strokes to symbols representing various user interface components, and returns the most likely interpretation. Hong and Landay [36] developed SATIN, a toolkit designed to support the creation of pen-based applications. SATIN consists of a set of mechanisms for manipulating, handling, interpreting and viewing strokes; a set of policies to distinguish between the type (gesture vs. symbol) of the input stroke; and a number of beautification techniques to organize and clean up sketches. Its modular architecture is designed to enable developers to easily incorporate the toolkit into new applications. In the same group, Lin et al. [54] describe a program called DENIM that helps web site designers in the early stages of the design process. All three systems employ Rubine's algorithm [63] as the primary recognition engine and
hence are limited to single-stroke objects. Forbus et al. [24] describe a system called nuSketch that uses qualitative spatial reasoning to understand the military tactics suggested in battlefield sketches. Their approach to sketch understanding is based on reasoning about the spatial relationships between strokes rather than on the recognition of patterns. Mankoff et al. [55] have explored methods for modelling and resolving ambiguity in recognition-based interfaces. Drawing on a survey of existing recognizers, they present a set of ambiguity resolution strategies, called mediation techniques, and demonstrate their ideas in a program called Burlap. Their resolution strategies are concerned with how ambiguity should be presented to the user and how the user should indicate his or her intention to the software. This work highlights a number of critical issues that must be addressed to achieve better interaction between the user and the software. Igarashi et al. [40] created an interactive beautification system whose task is to transform the user's pen strokes into cleaned-up line segments and to infer any intended connectedness, perpendicularity, congruence, and symmetry. The resulting image resembles a carefully drafted diagram despite the imprecision of the user's original sketch. However, theirs cannot be viewed as a sketch understanding system, because it is concerned only with improving the visual structure of users' strokes without attempting to uncover their semantic content.

2.4 Commercial Pen-Based Products

In 1991, Go Corporation introduced PenPoint, an operating system built from the ground up to support pen-based interaction on mobile computers. It was one of the earliest manifestations of the 'notebook' metaphor. Unlike today's application-centric windowing systems, however, it was document-centric, i.e., the system consisted of always-accessible open document pages similar to those in a paper notebook. Besides the common functionality provided by the pen, such as gestures and widgets, it introduced new concepts such as live embedding of documents into documents and integrated handwriting-gesture recognition. Unfortunately, this product did not
gain any market interest and died after a couple of years. The unexpected failure of this project has been attributed to a mixture of economic and business-strategy issues, such as the failure to distinguish between user need and technological opportunity, and technology-maturity issues, such as the market not being ready for a radically new interaction metaphor. The Apple Newton and the more recent PalmOS have provided pen support for mobile and hand-held computing similar to that in PenPoint. To minimize handwriting, the Newton provided a rich set of widgets that the user could point to in order to issue commands. PalmOS introduced a new alphabet, called the Unistroke, that is essentially a single-stroke version of the Latin alphabet. In these systems the stylus is used more for pointing, clicking and dragging than for drawing or sketching; hence, advanced shape recognition facilities were not needed. In late 2002, Microsoft introduced the TabletPC as the next-generation mobile personal computing system. Although the TabletPC can be used as a regular laptop with the usual keyboard and the familiar Windows operating system, the new pen-based interaction is what distinguishes it from its earlier peers. It includes a powerful handwriting recognition engine that operates under two new Windows applications: the Journal, for taking handwritten notes and creating drawings, and Sticky Notes, for annotating documents from other applications. The main merit of the TabletPC is its robust handwriting recognizer. There is gesture recognition capability for erasing, backspacing and tabbing while writing. Shape recognition, however, does not exist, and the recognizer gets confused when schematic drawings are embedded in normal text. Methods to separate graphics from text and to support diagrammatic sketch recognition (such as flowcharts and simple geometries) remain at the research level. 'Pen and Internet' has introduced a series of pen utility tools comprising five applications: (1) ritePen, a handwriting recognition utility for the TabletPC; (2) riteForm, similar to ritePen except designed for web-based form-filling applications; (3) riteScript, for recognizing different styles of handwriting including cursive, separate and mixed; (4) riteShape, for recognizing, cleaning and aligning geometric shapes and polylines; and (5) riteMail, for creating hand-drawn graphics (both text and schematics)
on mobile devices and transmitting the documents via email. The handwriting recognition engine in these utilities performs similarly to that of the TabletPC, except that it allows text to be written anywhere on the screen as opposed to dedicated application windows. The rite series also provides additional gestures for calling menus and drop-down lists. riteShape can additionally recognize geometric shapes that are left unrecognized on the TabletPC. However, the shape recognizer is basic in that it recognizes only simple shapes such as circles, ovals, squares, rectangles and triangles (Figure 2.1).

Figure 2.1: Pen-and-Internet's riteShape can only recognize simple geometric shapes such as rectangles, triangles, circles and ellipses [http://www.penandinternet.com].

There have been a number of pen-based drawing programs that enable users to draft geometrical, schematic or architectural forms (Geometer's Sketchpad, Corel Draw, Smart Sketch, etc.). Users' strokes are usually transformed into cleaned-up and beautified drawings, either in the form of simple lines and arcs or, more commonly, Bezier curves. Such applications attempt to uncover geometric relations among the strokes, such as end-point proximity, perpendicularity, parallelism, symmetry, continuity and repetition. Additional features such as automatic snapping, completion and closure are implemented to facilitate the drawing process. Recognition is usually unsupported or is limited to simple geometric objects such as rectangles and circles.

Chapter 3

Hierarchical Parsing and Recognition of Sketches

3.1 Introduction

This chapter presents a new computational approach to parsing and recognizing hand-drawn sketches. This approach allows users to sketch continuously without indicating when one symbol ends and a new one begins. Additionally, it does not restrict the number of strokes, or the order in which they must be drawn. The low computational cost of the approach enables fast interpretation of input sketches, resulting in an interactive system in which users can edit and modify their sketches (and run simulations), all in real time. Our techniques address two particularly challenging, and largely unsolved, issues in the development of robust sketch understanding systems. The first is ink parsing, the task of grouping a user's pen strokes into clusters representing intended symbols, without requiring the user to indicate when one symbol ends and the next one begins. This is a difficult problem, as the strokes can be grouped in many different ways; moreover, the number of stroke groups to consider increases exponentially with the number of strokes. The combinatorics thus clearly render approaches based on exhaustive search infeasible. To alleviate this difficulty, many current systems require the user to explicitly indicate the intended partitioning of the ink. This is
often done by pressing a button on the stylus, or more commonly, by pausing between symbols [22, 58]. Alternatively, some systems avoid parsing by requiring each object to be drawn in a single pen stroke [50, 63, 43]. However, such constraints usually result in a less than natural drawing environment. The second issue is symbol recognition, the task of recognizing individual hand-drawn figures such as geometric shapes, glyphs and symbols. While there has been significant recent progress in symbol recognition [63, 22, 8, 56], many recognizers are either hard-coded or require large sets of training data to reliably learn new symbol definitions. Such issues make it difficult to extend these systems to new domains with novel shapes and symbols. Additionally, most symbol recognizers have been built as stand-alone applications, without addressing the issue of integration into high-level sketch understanding systems. Our sketch understanding approach is based on the hierarchical mark-group-recognize architecture shown in Figure 3.1. The first step is preliminary recognition, which focuses on the identification of "markers," delimiter symbols that are easily and reliably extracted from a continuous stream of input. We refer to these symbols as "markers" because of two important properties: first, their geometric and kinematic characteristics enable them to be easily extracted from a continuous stream of strokes, and second, they serve as delimiters, which allow the remaining strokes to be efficiently clustered into distinct groups corresponding to individual symbols. In the domain of network diagrams, for instance, the arrows connecting the nodes of a network serve as useful markers. The key here is that stroke clustering is driven exclusively by the marker symbols identified in the first step, without need for exhaustive search. Next, informed by the result of the clustering algorithm, our approach can employ contextual knowledge to generate a set of candidate interpretations for each of the stroke groups. To determine which of these interpretations is correct, the next step involves symbol recognition, in which the groups are evaluated using one of the three symbol recognizers we have developed (details of these recognizers are presented in Chapters 6 through 9). In cases of misrecognition, the last step involves error correction, where the user
rectifies the problems. This approach has a number of distinct advantages. First, by focusing on marker symbols early on, it avoids unfruitful explorations and quickly directs the analysis in the right direction. Second, the approach provides a platform for encoding contextual knowledge. For example, at the conclusion of the clustering step (i.e., before recognition begins), the system can narrow down the set of possible interpretations for each symbol. This both increases recognition accuracy and reduces recognition cost. Finally, once the initial analysis is complete, our system can use domain knowledge to identify and correct errors that may have occurred during parsing and recognition.

Figure 3.1: Parsing and Recognition Architecture. (The pipeline proceeds from the raw sketch through preliminary recognition, stroke clustering, symbol candidate generation, symbol recognition, and error correction to the final interpretation; contextual knowledge informs candidate generation, and the user participates in error correction.)

While there are many different domains that can make use of the mark-group-recognize architecture outlined above, this thesis demonstrates its utility in two. The first involves network diagrams, in which a set of symbols (nodes) are connected by a set of arrows. The second involves vibratory mechanical systems, which typically contain objects such as masses, dampers, springs, external forces and grounds.
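To make the flow of Figure 3.1 concrete, the following is a minimal sketch of the mark-group-recognize loop. All helper names here are hypothetical placeholders, not the actual implementation; each stage is realized by the domain-specific algorithms described in the chapters that follow.

```python
# Illustrative sketch of the mark-group-recognize pipeline (Figure 3.1).
# find_markers, cluster_strokes, etc. are hypothetical helper names.
def interpret_sketch(strokes, context, recognizer):
    markers, rest = find_markers(strokes)         # preliminary recognition
    clusters = cluster_strokes(rest, markers)     # marker-anchored grouping
    interpretation = []
    for cluster in clusters:
        candidates = context.candidates_for(cluster)   # contextual pruning
        interpretation.append(recognizer.best_match(cluster, candidates))
    return correct_errors(interpretation, markers)     # automatic/interactive repair
```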

3.2 Domains of Interest

We demonstrate our approach in two different domains, namely network diagrams and vibratory mechanical systems. Network diagrams can be broadly characterized as a set of symbols connected by a set of arrows. The types of diagrams in this category include signal flow diagrams, organizational charts, algorithmic flowcharts, and various graphical models such as finite state machines, Markov models and Petri nets. In the second domain, we consider vibratory mechanical systems in which masses are connected to one another or to grounds via dampers and springs, and are subject to external forces that produce oscillatory motions. The two domains present markedly different challenges, and therefore it is quite difficult to employ precisely the same parsing and recognition techniques in both. However, the general mark-group-recognize principle does apply in both cases, as demonstrated in the following sections. One key advantage of our approach is that it encourages a modular design in which different components of the system architecture can be developed independently of one another. This allows pieces of the system to be easily customized to the needs of the domain without disrupting the remainder of the system. For example, although the preliminary recognition and stroke clustering algorithms are different in the two systems we developed, the assembly of those components and the interaction between them remain the same. Likewise, the modular architecture allows, for instance, different symbol recognizers to be easily incorporated at the symbol recognition step without affecting the rest of the system. Indeed, while an earlier version of the vibratory mechanical system application used the same symbol recognizer employed in the network diagram application, this was later replaced with another recognizer in a matter of minutes.

Chapter 4

SimuSketch: A Sketch-Based Interface for Simulink

As our first test bed, we have created SimuSketch, a prototype sketch-based front end to Matlab's Simulink package (Figure 4.1). Simulink is a block-oriented program used for modelling, simulating, and analyzing dynamic systems. It has a typical drag-and-drop interface in which the user navigates through a nested symbol palette to find, select and drag the components, one at a time, onto an empty canvas. With SimuSketch, on the other hand, the user can construct functional Simulink models by simply sketching them on a computer screen. The sketch interface restricts neither the order in which the symbols must be drawn nor the number of strokes used to draw them. Furthermore, it does not require the user to indicate when one symbol ends and the next begins. Likewise, the user need not complete one symbol before moving on to another, and thus may come back to a previous location to add more strokes at any time. The objects interpreted by SimuSketch are live from the moment they are recognized, enabling users to interact with them. For example, users can edit the objects through dialog boxes or alter their sketch using traditional means such as selection and deletion. Once the user's model is recognized, a simulation can be run and viewed directly in SimuSketch. At the end, users can save their work either in its original sketchy form or in a format compatible with Matlab, thus allowing users to
resume their work either in the SimuSketch or the conventional Matlab environment.

Figure 4.1: (a) SimuSketch; (b) automatically derived Simulink model.

Figure 4.2: SimuSketch is deployed on a Wacom Cintiq tablet with a cordless stylus.

In the following sections, the user interaction with SimuSketch is described first, followed by the details of the multi-level parsing and recognition algorithm outlined above.

4.1 User Interaction

SimuSketch is deployed on a 9 in x 12 in Wacom Cintiq digitizing tablet with a cordless stylus (Figure 4.2). The drawing surface of the tablet is an LCD display, which enables users to see virtual ink directly under the stylus. Data points are collected as time-sequenced (x, y) coordinates sampled along the stylus' trajectory. As shown in Figure 4.1a, SimuSketch's interface consists of a drawing region and a toolbar that contains buttons for commonly used commands. The user draws as he or she ordinarily would on paper. As the user is drawing, SimuSketch does not attempt to interpret the scene. Instead, it employs a recognize-on-demand (ROD) strategy in which the user taps the "Recognize" button in the toolbar whenever he or she wants the scene to be interpreted. This command invokes the sketch recognition engine, which then parses the current sketch, recognizes the objects,
and produces a Simulink model. As shown in Figure 4.1a, the program demonstrates its understanding by displaying a faint bounding box around each object, along with a text label indicating what the object is. Recognized arrows are delineated with small colored points at each of their two ends. The ROD strategy has a number of advantages over systems that try to interpret the scene after each input stroke. First, as users are not distracted by the display of potentially premature interpretation results, they can focus exclusively on sketching. Second, as very little internal processing takes place after each stroke, the program is better able to keep up with the user's pace. (Systems that interpret the sketch after each stroke, such as [3], often force the user to pause for a short duration between strokes.) Third, by delaying recognition in a user-controlled manner, it allows the system to acquire more context that helps improve the recognition accuracy of earlier strokes. Note that ROD does not require the model to be entirely completed before it can be used. In fact, it encourages an iterative construction process in which the user draws a portion of the final model, asks SimuSketch to recognize it, tests it, and continues with the rest of the model. Object Manipulation: SimuSketch offers a number of gestures for different tasks. To select an object or an arrow, the user either taps on it or circles it with the stylus; the selected item is highlighted in a translucent blue color indicating its selection. The circular selection gesture is differentiated from a drawing stroke based on its end points and the region it encircles. If the distance between the stroke's first and last points is less than 10% of the total stroke length (i.e., the stroke forms a nearly closed contour) and the stroke encircles one or more objects or arrows, the stroke is taken as a selection gesture. Once an object is selected, one of four things can happen depending on the subsequent input stroke. First, if the stroke is simply a quick tap in the blue region, a pop-dialog message is dispatched, which brings up a dialog box pertinent to the selected object. Second, if the stroke is not a tap but its initial contact point is still within the blue region, a move message is dispatched and the selected object(s) is moved to the lift point of the stroke. Third, if both the contact and lift points of the stroke are outside the blue region, but the midpoint is in the blue region, then a delete message is dispatched and the object is removed from the visual scene. A typical manifestation of this gesture is a stroke through the selected object. Finally, if the entirety of the stroke is outside the blue region, all selected objects are de-selected and the stroke is added to the raw sketch. An alternative to de-selection is a tap in the white space.
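The selection-gesture test just described is compact enough to state in code. The sketch below is illustrative only; the stroke representation and the point-in-polygon helper are assumptions, not the thesis implementation.

```python
import math

def stroke_length(points):
    # Total ink length: the distance travelled along the stroke's trajectory.
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def is_selection_gesture(points, scene_items, contains):
    # Nearly closed contour: the gap between the first and last points is
    # under 10% of the total stroke length ...
    closed = math.dist(points[0], points[-1]) < 0.10 * stroke_length(points)
    # ... and the contour must encircle at least one object or arrow.
    # 'contains(point, polygon)' is an assumed point-in-polygon test.
    return closed and any(contains(item.center, points) for item in scene_items)
```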
When running out of drawing space, the user can request more space by drawing a long line along the right border of the drawing surface. As shown in Figure 4.3, this gesture brings up two sketchy scroll bars, one for vertical and one for horizontal scrolling, that can be used instantly. These scroll bars behave similarly to traditional scroll bars: as the user drags the scroll thumb, the contents of the drawing surface move in the opposite direction. This feature is a favorite among those who have used our system.

Figure 4.3: Users can bring up a scroll bar and instantly use it by drawing a line along the right border of the display. They can also annotate their sketches.

By checking the "Annotate" box in the toolbar, users can also add notes to their sketches (Figure 4.3). In the annotation mode, users' strokes are simply displayed in blue and are not considered by the recognition engine. Note that, unlike the gestural manipulation strokes described above, it would be quite difficult to distinguish a drawing stroke from an annotation. Hence, users are required to explicitly set
SimuSketch in the annotation mode in order to add notes. Object Dialogs: For objects with variable parameters, selecting and tapping on the object brings up a dialog box for editing its parameters. Figure 4.4 shows an example. Here, the user has clicked on the Sine Wave block, which brought up a dialog box specialized to sine waves. Interaction in these dialog boxes is also sketch-based: users can change existing values by crossing out the old ones with a delete gesture (a stroke through the number) and simply writing in the new values. The program can recognize negative and/or decimal numbers using a digit recognizer we developed (the details of this digit recognizer are described in Section 4.7). After updating the constants, the user taps "Process" to have the changes take effect. The new values are then recognized and displayed in computer fonts. The user closes the dialog box by tapping the "Done" button. Similar dialog boxes exist for the Transfer Function block, where the user can edit the numerator and denominator polynomials of the transfer function, and for the Gain block, where the user can change the gain constant. (The set of Simulink objects used by SimuSketch can be found in Figure 4.20.) After completing the sketch, the user can run a simulation of it by pressing the "Simulate" button. This command simply hands the model over to Simulink (which runs in the background) for processing. The results of the simulation can be viewed directly in the sketch interface by double-tapping on the Scope blocks. As shown in the right part of Figure 4.4, this brings up a window showing the simulation results. At any time, the user can add new objects to the model by simply sketching them. Training Object Shapes: SimuSketch is equipped with a trainable symbol recognizer that can learn a new object definition from a single prototype example. To train an object, the user presses the "Train New" button, which brings up the dialog shown in Figure 4.5. The user simply draws the object, in this case an alternate form of a Sine Wave, and sets it as the new Sine Wave definition. With this utility, users can seamlessly train new symbols, and remove or overwrite existing ones on the fly, without having to leave the main application. Views: Once the user's sketch has been interpreted, the user has the option of viewing the model in its sketchy or cleaned-up form. In the cleaned-up view, the
sketchy symbols are replaced by their iconic forms and the arrows are straightened out into line segments (Figure 4.6). Users can toggle between these two views by tapping the "Toggle view" button. Subjects in our user studies have indicated that the informality of the sketchy view gave a sense of freedom and creativity, while the cleaned-up view gave a sense of completeness and definiteness. Despite these perceived differences, the cleaned-up view is just as functional as the sketchy view in that it supports the same interaction mechanisms, including sketching, object selection, object manipulation and editing.

Figure 4.4: The user can interact with the system through sketch-based dialog boxes. In the instance shown, the user is editing a Sine Wave block. Simulation results are presented to the user through conventional Simulink displays, which pop up when the user clicks on Scope blocks.

Figure 4.5: A dialog window for training object shapes.

Figure 4.6: The user can view a beautified version of his model in which the original sketch is replaced by cleaned-up objects.

4.2 Preliminary Recognition

One key to successful sketch understanding lies in the ability to establish ground truths about the sketch early on, before costly mistakes take place. Our approach is based on the use of "marker symbols," symbols that are easy to recognize and whose interpretations can then be used to guide the interpretation of the remainder of the sketch. This approach is similar in spirit to the construction of "islands of certainty," a concept first used in the Hearsay-II speech understanding system [19]. In that system, the recognizer
first identifies the pieces of speech that can be interpreted with high confidence (i.e., the islands of certainty), and then uses those results to facilitate the analysis of the remaining pieces. There are several characteristics that distinguish a good marker symbol. Such symbols should:

• occur frequently in the sketch,

• have unique geometric and dynamic (e.g., speed) features that distinguish them from other objects,

• be reliably recognizable in isolation, and

• provide valuable information that can be used in the recognition of other objects.

In the domain of data flow diagrams such as Simulink, we have found arrows to be this kind of pattern. Our program thus begins by recognizing the arrows in the sketch. Our observational tests on a small set of users during the design stages of our system indicated that, despite some exceptions, arrows were usually drawn as either a single pen stroke or two consecutive strokes, one for the shaft and one for the head. Hence, two types of arrow recognizers were developed to account for these two main styles. To simplify the analysis, both types of recognizers require that arrows be drawn from a source object toward a target object, i.e., from the arrow's tail toward its head. In addition to the distinction between single-stroke and two-stroke arrow recognition, we have two separate approaches for detecting arrows. In the first, recognition is based on "speed" information, i.e., the speed with which the strokes are drawn. In the second, we use "curvature" information. The main difference between these two approaches is that while the speed-based approach uses hard-coded geometric tests for classification, the curvature-based approach employs a neural network. In the following paragraphs, we begin with a discussion of the speed-based approach and explain how it is applied to recognizing both single-stroke and two-stroke arrows. Next, we describe the curvature-based recognizer.

Speed-based, Single-Stroke Arrow Recognition

In this approach, recognition is based on the identification of five key points, labelled A, B, C, D and R in Figure 4.7a. Points A, B and C correspond to the sharp corners on the arrowhead. The distinguishing characteristic of these points is that they all correspond to pen speed minima, as can be seen in the pen speed profile in Figure 4.7b. These points are thus identified by locating the three global minima in the speed profile, excluding the end point. Point D is the end point of the stroke. Finally, R is a "reference" point on the arrow shaft, obtained by moving a small distance backwards from point A.

Figure 4.7: Arrow recognition. (a) A one-stroke arrow with the key points labelled. (b) Speed profile. Key points are speed minima.

When determining the key points from the speed profile, two heuristic filters are used to ensure that the identified points are the intended ones. First, the search for the speed minima excludes the first 40% of the stroke. This filter is intended to discard any global speed minima that occur early in the stroke, since the true key points are expected to occur toward the end of the stroke. This way, the speed minima that occur due to bends or kinks in the arrow shaft can be avoided (Figure 4.8a). Second, minima points that occur too close to one another are condensed into a single point (Figure 4.8b). This filter is intended to eliminate occasional spurious points that
might occur when the pen speed is reduced. As a result, the final key points are typically well-separated.

Figure 4.8: Filtering used in detecting the key points. (a) Speed minima that occur in the first 40% of the stroke (the dismissed region in the speed profile) are not considered, as they typically correspond to bends in the arrow shaft. (b) Speed minima that occur too close to one another (encircled in the speed profile) are condensed into a single point.

Once the key points are determined, a series of geometric tests is performed to determine whether or not the stroke really is an arrow. The following tests must all be satisfied:

• ∠ABC < 90°,

• ∠BCD < 90°,

• ∠RAB < 90°,

• ∠RAD < 90°,

• |BC| ≤ 0.20 × TotalInkLength,

• |DC| ≤ 0.20 × TotalInkLength,

where TotalInkLength is the distance travelled by the pen tip from a pen-down event to the next pen-up event. These geometric tests were designed empirically, by collecting a corpus of positive and negative examples of arrows from several users and experimenting with different levels of specificity and thresholds until the best classification performance was obtained. With the resulting recognizer, a variety of arrow shapes with different arrowhead styles can be recognized, as shown in Figure 4.9.

Figure 4.9: Examples of (a) arrows and (b) arrow heads that are correctly recognized.

Figure 4.10: The five key points on a two-stroke arrow.
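The geometric tests above translate directly into code. The following is a minimal sketch, assuming the five key points are given as (x, y) tuples; the function names are illustrative rather than the thesis code.

```python
import math

def angle_deg(vertex, p, q):
    # Angle p-vertex-q in degrees, computed from the two edge vectors.
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / max(norm, 1e-12)))))

def passes_arrowhead_tests(A, B, C, D, R, total_ink_length):
    # The six tests of the speed-based single-stroke arrow recognizer.
    return (angle_deg(B, A, C) < 90 and        # angle ABC
            angle_deg(C, B, D) < 90 and        # angle BCD
            angle_deg(A, R, B) < 90 and        # angle RAB
            angle_deg(A, R, D) < 90 and        # angle RAD
            math.dist(B, C) <= 0.20 * total_ink_length and
            math.dist(D, C) <= 0.20 * total_ink_length)
```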

Speed-based, Two-Stroke Arrow Recognition

The two-stroke arrow recognizer is similar to the single-stroke recognizer in that recognition is still based on the spatial configuration of the key points. The difference, however, is that the key points are determined from pairs of consecutively drawn strokes. The end of the first stroke is set as point A; the beginning and end of the second stroke as points B and D, respectively; and the global speed minimum in the second stroke as point C (Figure 4.10). R is once again obtained by moving a small distance backwards from point A. If the five key points satisfy the geometric tests described above, the two strokes are grouped and interpreted as an arrow. The TotalInkLength in this case is the sum of the individual lengths of the two strokes.


To identify the two-stroke arrows in the sketch, the two-stroke recognizer examines pairs of consecutively drawn strokes. Because only consecutively drawn pairs are considered, the computational complexity of this analysis remains linear in the number of strokes. During the design stages of our arrow recognizers, we did not encounter a case in which an arrow was drawn in three strokes (i.e., one for the shaft and two for the wings). Interviews with the participants revealed that such a drawing style greatly slows the drawing process and feels unnatural. Hence, although the extension is trivial, three-stroke arrow recognition is not considered in this study.

Discussion and Shortcomings of Speed-based Arrow Recognition

We contrast this definition with a similar arrow model described in [31], where the arrow shaft is required to be a line segment (hence curly or bent arrows cannot be recognized), the wings of the arrow head must be of equal length, and the tip of the arrow shaft must be coincident with the tip of the arrow head (i.e., points A and C must be coincident). Additionally, their definition requires the arrow to be drawn in two strokes with the pen lifted between points A and B. From this perspective, the arrow recognizers presented here allow for much greater flexibility and casualness when drawing arrows. Another key feature of our speed-based arrow recognizers is that they almost never produce false positives, i.e., they do not classify a non-arrow stroke as an arrow. However, the single-stroke arrow recognizer sometimes produces false negatives, i.e., it may fail to recognize an arrow when there is one. Three sources have been identified for this error: (1) The arrow head is too big or too small compared to the shaft. When the arrow head is too big (Figure 4.11a), the geometric tests described above fail to identify the arrow. On the other hand, when the arrow head is too small (Figure 4.11b), the key points occur too close to one another in the speed profile. As described previously, the program groups such proximate points into a single point. In the case of small arrow heads, this erroneously merges the true key points. (2) The user draws too slowly. This usually happens when users are too careful
using the digitizing tablet. In such cases, users tend to slow down too frequently, which leads to spurious speed minima. Figure 4.12 shows an example. In such occurrences, the recognizer gets confused because it does not know which of the minima actually correspond to the true corners. When there are too many spurious minima, the heuristic filter that condenses proximate minima usually fails to preserve the true corners. (3) The user draws too fast. In such cases, the intended key points cannot be discerned from the speed minima. Figure 4.13 shows an example.

Figure 4.11: The single-stroke arrow recognizer fails when the arrow head is either too big or too small.

Figure 4.12: The speed profile of a single-stroke arrow when the user draws too slowly.

Curvature-based, Neural Network Arrow Recognizer

To overcome the shortcomings of the previous recognizer, we designed a second type of arrow recognizer, which employs a neural network to learn and classify arrows. The statistical nature of the recognizer alleviates the sensitivity of the
previous approach to the precise detection of corner points and to the hard-coded geometric tests. The new recognizer was developed only for single-stroke arrows, since two-stroke arrow recognition with the previous approach was sufficiently robust. As in the previous approach, it is assumed that the arrow is drawn from tail to head. The key features of the new recognizer are as follows:

• The distinguishing characteristic of an arrow is once again chosen to be the shape of the arrow head. Therefore, both training and recognition focus on the latter portion of the stroke, allowing arrows with different styles of shafts to be recognized.

• To enhance the classification performance of the neural network, the raw input strokes are preprocessed to obtain data points equally spaced in space as opposed to time.

• The recognizer uses information related to the curvature of the strokes, as opposed to drawing speed information, to characterize arrows.

The following paragraphs describe the recognizer in more detail.

Figure 4.13: The speed profile of a single-stroke arrow when the user draws too fast. Only three of the four key points are determined toward the end of the stroke.

Preprocessing: As the user is drawing, the digitizing tablet provides data at a constant rate. As a result, the original data points are uniformly spaced in time. However, one undesirable
effect of this is that variations in drawing speed cause the data points to be spaced non-uniformly along the stroke's trajectory (low pen speeds produce dense point clouds, while high pen speeds leave large gaps between points). As a result, the locations of the data points of two similarly shaped strokes will differ if the strokes were drawn at different speeds. We have found that training a neural network using the raw data points leads to poor performance. To overcome this difficulty, each input stroke is resampled such that the constituent points are regularly spaced in space as opposed to time. Figure 4.14 illustrates the idea. A linear interpolation function is used for resampling. The original stroke is resampled such that the resulting stroke consists of 36 data points, regardless of the number of points in the original stroke. In the end, we have found that training the neural network with regularly spaced points yields much better performance, as inconsistencies due to different drawing speeds are eliminated. These findings are consistent with those reported in [53] and [30].

Figure 4.14: Resampling a raw stroke to obtain data points equally spaced along the stroke's trajectory.

Feature Extraction: After resampling, a feature vector is computed, forming the input to the neural network. Unlike the speed information used in the previous recognizer, this recognizer uses information related to the inverse-curvature of the resampled stroke. As shown in Figure 4.15a, the inverse-curvature is represented as the cosine of the angles between line segments connecting consecutive points. Although the cosine is not precisely the inverse-curvature, it is closely related to it, and is thus suitable for our purposes.
As shown in Figure 4.15b, cos(·) reaches a minimum at sharp corners, such as at the corners of the arrow head. It is therefore conceptually closely related to the local radius of curvature along the resampled stroke. The advantage of using the cosine, however, is that it conveniently ranges over [-1, 1], thereby facilitating the training and classification processes of the neural network. Other metrics, such as the radii of curvature along the stroke's trajectory, can be arbitrarily small or large, making them unsuitable as inputs to a neural network.

Figure 4.15: (a) The inverse-curvature at a point is calculated as the cosine of the angle between the segments before and after the point. (b) The inverse-curvature is minimized at the sharp corners.

Note that the inverse-curvature profile obtained using the angle information resembles the speed profile in that the minima once again correspond to the key points. Hence, the same techniques used for identifying the key points in the speed-based recognizer could be used here, and based on those key points, the same geometric tests described in Section 4.2 could decide whether the stroke is an arrow. We have tested this idea and found the performance of the hard-coded geometric tests to be the same regardless of whether the key points are determined from the speed profile or the curvature profile. Instead, the neural network-based recognizer described below (1) is not sensitive to the accurate detection of the key points, and (2) decides whether a stroke is an arrow based on the output of the network, rather than using hard-coded geometric tests.
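The preprocessing and feature extraction steps can be summarized in code as follows. This is a minimal sketch under the stated assumptions (points given as (x, y) tuples, 36 resampled points); it is not the author's implementation.

```python
import math

def resample(points, n=36):
    # Redistribute the stroke's points so they are equally spaced along its
    # trajectory (rather than in time), using linear interpolation.
    dists = [0.0]
    for p, q in zip(points, points[1:]):
        dists.append(dists[-1] + math.dist(p, q))
    step, out, j = dists[-1] / (n - 1), [], 0
    for i in range(n):
        target = i * step
        while j < len(dists) - 2 and dists[j + 1] < target:
            j += 1
        t = (target - dists[j]) / max(dists[j + 1] - dists[j], 1e-12)
        out.append((points[j][0] + t * (points[j + 1][0] - points[j][0]),
                    points[j][1] + t * (points[j + 1][1] - points[j][1])))
    return out

def inverse_curvature(points):
    # Cosine of the turning angle at each interior point: near 1 on straight
    # runs, near -1 at sharp corners such as the arrowhead.
    feats = []
    for a, b, c in zip(points, points[1:], points[2:]):
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        norm = math.hypot(*v1) * math.hypot(*v2)
        feats.append(dot / max(norm, 1e-12))
    return feats  # the network consumes the last 18 entries (the stroke's end)
```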
Figure 4.16: The structure of the neural network. (Inputs cos(θ19) through cos(θ36) feed a 16-neuron hidden layer, a 6-neuron hidden layer, and a single output neuron.)

Network Structure: The network is designed as a fully-connected, feed-forward backpropagation neural network with 18 inputs, 16 neurons in the first hidden layer, 6 neurons in the second hidden layer, and 1 output neuron (Figure 4.16). The network takes as input the last 18 elements of the inverse-curvature vector, which correspond to the second half of the stroke. This allows the network to characterize a stroke based on the stroke's end region, where the arrow head is to be found. The neural network therefore does not consider the arrow shaft, allowing arrows with different shaft styles to be recognized. The number of input elements was determined empirically by examining a variety of different arrows drawn at different speeds, and finding the number of elements that best captures the arrow head region. In each of the neurons of the network, a tangent sigmoid ('tansig') function is used as the activation function. The output of the network is a single neuron that indicates whether an input stroke is an arrow or not. In the training examples, a
target value of 1.0 is assigned to arrows, while 0.0 is assigned to non-arrows. Given this scheme, the network classifies an unknown stroke as an arrow if the network's output is greater than 0.5, and as a non-arrow otherwise. The network was implemented using Matlab's neural network toolbox. This allowed us to experiment with a number of different network structures and activation functions in a feasible way. The network presented here is the final design that led to the best performance. The parameters considered during the design were (1) the speed of training, (2) the number of training examples required to reliably train the network (the fewer, the better), and (3) the classification performance on a variety of different test sets. Because training is performed off-line, the user of this recognizer would not be aware of the first two criteria. The critical criterion from the user's perspective is classification performance, and thus the most effort was spent on maximizing the last criterion. Performance: The network was trained using 200 arrows and 200 non-arrows over a total of 100 epochs. (Each epoch can be considered a training session in which the entire set of training examples is presented to the network; in each epoch, the weights in the network are modified by small amounts.) The training data was collected from 3 different users, including the author. The total training time was 32 seconds on a Pentium 4 machine at 1.7 GHz with 256 MB of RAM. The classification performance on the set of training data was 100%, i.e., all training samples were classified correctly when they were introduced as test examples. The classification performance on an independent set of 216 test arrows was 92.1%. The average recognition time was less than 1 millisecond per arrow on the same machine used for training. To test the rate of false positives, we also collected 216 non-arrow strokes; the recognizer misclassified 6.4% of those strokes as arrows. All data used in this test was collected from 4 users, none of whom were involved in training. With the same test data, the classification performance of the speed-based arrow recognizer was 70.0%, with an average recognition time of again less than 1 millisecond per arrow. The false-positive rate in this case was 2.6%. These results indicate that
the neural network arrow recognizer is superior to the speed-based recognizer for identifying arrows, and that the false-positive rates of both recognizers are relatively low. Currently, SimuSketch has both recognizers built in, and the user can select which recognizer to use at run time. However, at the time the user study testing SimuSketch was conducted, only the speed-based recognizer was functional. Hence, the results of that study, presented in Section 4.9, do not reflect the better performance that would have been achieved had the neural network-based recognizer been available.
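For concreteness, the network described above can be approximated in a few lines. The original was implemented with Matlab's neural network toolbox; the Keras sketch below is only a rough analogue in which the layer sizes and tanh ('tansig') activations follow the text, while the sigmoid output unit, optimizer, and loss are assumptions.

```python
import tensorflow as tf

# Rough analogue of the thesis network: 18 inputs, 16 and 6 hidden neurons,
# one output thresholded at 0.5 to decide arrow vs. non-arrow.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="tanh", input_shape=(18,)),
    tf.keras.layers.Dense(6, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# x_train: (400, 18) array of inverse-curvature features;
# y_train: 1.0 for arrows, 0.0 for non-arrows.
# model.fit(x_train, y_train, epochs=100)
# A stroke is classified as an arrow when model.predict(x) > 0.5.
```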

4.3 Stroke Clustering

This section describes an algorithm for locating the distinct symbols in the Simulink domain. Note that this step is concerned only with stroke clustering and not recognition; recognition is deferred until additional sources of information, such as context, have been considered. The arrow analysis identifies the arrows in the sketch but leaves the rest of the strokes uninterpreted. The goal in this step is to group the uninterpreted strokes into clusters such that each cluster forms a distinct Simulink object that can subsequently be recognized. The key idea behind stroke clustering is that strokes are deemed to belong to the same symbol only when they are spatially proximate. The challenge is reliably determining when two pen strokes should be considered close together. Here, we rely on the arrows to help make this determination. In network diagrams, each arrow typically connects a source object at its tail to a target object at its head. Hence, different clusters can be identified by grouping together all the strokes that are near the end of a given arrow. In effect, two strokes are considered spatially proximate if the nearest arrow is the same for each. Based on this observation, the following procedure for identifying symbol clusters has been developed (a code sketch of Steps 1-3 follows the branching-arrow discussion below):

Step-1 Assign each non-arrow stroke to the nearest arrow: Stroke clustering begins by assigning each non-arrow stroke to the nearest arrow (Figure 4.17a). The distance between a stroke and an arrow is defined to be the Euclidean distance between the median point of the stroke and either the head or tail of the arrow,
whichever is closer. The head is taken to be the apex, shown as point C in Figure 4.7.

Figure 4.17: Illustration of the cluster analysis. (a) Each stroke is assigned to the nearest arrowhead or tail. (b) Strokes assigned to the same arrow are grouped into clusters. (c) Clusters with overlapping bounding boxes are merged. (d) Arrows that did not receive any strokes are attached to the nearest cluster.

Step-2 Combine strokes into clusters: Strokes assigned to the same arrow end in Step-1 are grouped to form a stroke cluster. These clusters form the basis of the
symbols. Figure 4.17b shows the results of this step.

Step-3 Merge overlapping clusters: Next, clusters with partially or fully overlapping bounding boxes are merged. The bounding box of a cluster is the minimum-sized rectangle, aligned with the coordinate axes, that fully encloses the constituent strokes. As shown in Figure 4.17c, this process combines strokes that are part of the same symbol but were initially assigned to different arrows in Step-1. If the bounding boxes of different symbols overlap, this process could erroneously merge the symbols; however, in our experience, users rarely draw in such a way that this happens. Thus, at the completion of this step, each cluster is assumed to be a distinct symbol.

Step-4 Connect empty arrow heads/tails to the nearest cluster: Step-1 guarantees that each non-arrow stroke is attached to the nearest arrow end. However, some of the arrow ends might remain devoid of any strokes. In this step, empty arrow ends are linked to the nearest stroke cluster (Figure 4.17d). This helps to ensure the intended connectivity of the diagram by ensuring that each arrow has a cluster at both its tail and head.

Special Case: Branching Arrows

In network diagrams, it is typical to observe arrows that start in the middle of other arrows, and this can present a challenge to the stroke clustering algorithm described above. We call this phenomenon "branching"; an example is shown in Figure 4.18. In such cases, the program must infer that the source object of the primary and branching arrows is the same (the Sine Wave). However, if the branching point is close to a target object (the upper Scope), the clustering algorithm may erroneously assign the target object of the primary arrow as the source object of the branching arrow. To resolve this peculiarity, SimuSketch identifies the branching arrows before applying the clustering algorithm. This involves determining, for each arrow, whether its tail is closer to another arrow than it is to the non-arrow strokes. If so, the program assigns the source object of the branching arrow to be the same as the source object of the primary arrow. To efficiently find the distance between the tail of a
branching arrow and a primary arrow, five equidistant points along the primary arrow are selected. The distance from the branching arrow to each of the representative points is calculated, and the minimum is taken as the distance between the two arrows. Figure 4.19 illustrates the idea. This approach avoids a more costly computation in which the distance to each of the points comprising the primary arrow is considered. While it does not necessarily compute the true minimum distance, it has proven sufficient to distinguish arrows that are closer to other arrows than to non-arrow strokes.

Figure 4.18: Example of a branching arrow. The program must infer that the source object of the branching arrow is the Sine Wave and not the upper Scope.

Figure 4.19: The distance between a branching arrow's tail point and a primary arrow is the minimum of the five distances computed from the branching arrow to five representative points on the primary arrow.
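As promised above, the following is a minimal sketch of Steps 1-3 of the clustering procedure (branching-arrow handling and Step-4 omitted). The stroke and arrow attributes (median_point, head, tail) are assumed names, not the thesis code.

```python
import math

def bbox(strokes):
    # Axis-aligned bounding box of all points in a group of strokes.
    xs = [x for s in strokes for (x, y) in s.points]
    ys = [y for s in strokes for (x, y) in s.points]
    return min(xs), min(ys), max(xs), max(ys)

def overlaps(b1, b2):
    return b1[0] <= b2[2] and b2[0] <= b1[2] and b1[1] <= b2[3] and b2[1] <= b1[3]

def cluster_strokes(strokes, arrows):
    # Step-1: assign each non-arrow stroke to the nearest arrow end.
    groups = {}
    for stroke in strokes:
        ends = [(i, e) for i, a in enumerate(arrows) for e in (a.head, a.tail)]
        key = min(ends, key=lambda ie: math.dist(stroke.median_point, ie[1]))
        groups.setdefault(key, []).append(stroke)
    # Step-2: strokes sharing an arrow end form a cluster.
    clusters = list(groups.values())
    # Step-3: repeatedly merge clusters whose bounding boxes overlap.
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if overlaps(bbox(clusters[i]), bbox(clusters[j])):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```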

4.4 Generating Symbol Candidates

After identifying the stroke clusters, the next step is to recognize the symbols they represent. A naive approach would be based solely on shape similarity: a stroke cluster is compared to each of the definition symbols, and the symbol that yields the highest similarity is chosen as the interpretation of the cluster. However, this approach is inefficient and prone to errors, as at each step the recognizer must consider and distinguish between a large set of symbols. A recognizer should therefore exploit other sources of information to improve its accuracy and performance. Our approach combines contextual knowledge with shape recognition to achieve accuracy and efficiency. In particular, we examine the number of input and output arrows associated with each stroke cluster to help constrain its possible interpretations. For example, function generators such as the Sine Wave, the Chirp Signal, or the Random Signal Generator have only output terminals, and therefore must have only outgoing arrows. Likewise, certain symbols can have only input terminals, such as the Scope block, or may have an arbitrary number of input and output terminals, such as the Sum block. Figure 4.20 lists the number of admissible input and output channels for the 16 Simulink objects we currently consider in our system. (Currently SimuSketch has operational knowledge of these 16 Simulink objects; the extension to other Simulink objects is straightforward, requiring only code for linking the objects in SimuSketch to those in Simulink.) By examining the number of input and output arrows for a given cluster, the program is able to narrow down the set of possible interpretations for that symbol. For example, if a stroke cluster has 3 input and 1 output arrows, Figure 4.20 suggests that the possible interpretations of the cluster are the Mux, Sum, and Switch blocks. This determination reduces the amount of work the recognizer must do. It also helps increase accuracy by reducing the possibilities for confusion. For example, while the Sum block and the Clock look quite similar, context dictates that a Sum block must have at least two incoming arrows while the Clock must have none. With this additional knowledge, the recognizer would never consider the Sum block and the Clock as two competing candidates during shape recognition. Another form of contextual information comes from the number of strokes that
Figure 4.20: The number of legal input/output channels for the Simulink objects.

    Object                     Num. of Inputs   Num. of Outputs
    Sine Wave                  = 0              >= 1
    Chirp Signal               = 0              >= 1
    Random Number Generator    = 0              >= 1
    Step                       = 0              >= 1
    Constant                   = 0              >= 1
    Clock                      = 0              = 1
    Gain                       = 1              >= 1
    Integrator                 = 1              >= 1
    Differentiator             = 1              >= 1
    Signum                     = 1              >= 1
    Coulomb Friction           = 1              >= 1
    Transfer Function          = 1              >= 1
    Mux                        >= 2             = 1
    Sum                        >= 2             >= 1
    Switch                     = 3              >= 1
    Scope                      >= 1             = 0

are typically used to draw a symbol. Consider once again Figure 4.1. Unless very unusual drawing styles are used, the Sine Wave would be drawn in a minimum of three strokes (two for the axes and one for the curve), while the Integrator would be drawn in a single stroke. Based on this idea, we have identified the minimum and maximum number of strokes for each domain symbol. The lower bound is straightforward to determine; it is the minimum number of times the pen must be lifted. We assign an upper bound based on typical drawing styles; however, to account for the possibility of users elaborating their symbols, such as by overtracing or by adding small extra strokes to close open contours, we extend the upper bound by several strokes. During recognition, the number of strokes contained in a stroke cluster is used as another means of narrowing down the list of alternatives. For example, if a cluster contains two strokes, the program will prune any interpretation that requires at least three strokes, such as the Sine Wave.
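Both forms of contextual pruning amount to a simple filter over the symbol library. The sketch below is illustrative; the attribute names are assumptions, not the thesis code.

```python
def candidate_symbols(cluster, definitions):
    # Keep only the definitions whose legal arrow counts (Figure 4.20) and
    # empirically chosen stroke-count bounds admit the observed cluster.
    return [d for d in definitions
            if d.admits_inputs(cluster.num_in_arrows)
            and d.admits_outputs(cluster.num_out_arrows)
            and d.min_strokes <= cluster.num_strokes <= d.max_strokes]

# E.g., a cluster with 3 incoming and 1 outgoing arrow would retain only the
# Mux, Sum and Switch definitions before shape recognition is invoked.
```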

4.5 Symbol Recognition

Once the clusters have been identified and the possible interpretations have been determined, the next step is to actually recognize each cluster. We have developed a trainable, image-based symbol recognizer for this purpose. While Chapter 7 provides an extensive discussion of this recognizer, we present a brief overview here to preserve continuity. The recognizer can recognize shapes independent of their position, size and orientation. However, it is sensitive to non-uniform scaling, and thus it can distinguish between, say, a square and a rectangle. A distinguishing feature of this recognizer is that it is used both for recognizing the Simulink objects and for the numerical digits in the objects' dialog boxes. Input symbols are internally described as 48x48 quantized bitmap images, which we call "templates." The template representation has a number of desirable characteristics. First, segmentation – the process of decomposing the symbol into constituent primitives such as lines and curves – is eliminated entirely. Second, the representation is well suited to recognizing "sketchy" symbols, such as those with heavy overtracing, missing or extra segments, and different line styles (solid, dashed, etc.). Lastly, this recognizer puts no restrictions on the number of strokes, or the order in which the strokes are drawn. Unlike many traditional methods, our shape recognizer requires only a single prototype example to learn a new symbol definition. Using the "Train New" button in the interface, the user can create a new symbol definition by simply drawing a shape and assigning a name to it. With this approach, users can seamlessly train new symbols or overwrite existing ones on the fly, without having to leave the main application. This feature makes it easy for users to extend and customize their symbol libraries. The recognizer uses an ensemble of four different classifiers to evaluate the match between an unknown symbol and a candidate definition symbol. The classifiers we use are extensions of the following methods: (1) Hausdorff distance [64], (2) modified Hausdorff distance [17], (3) Tanimoto coefficient [20] and (4) Yule coefficient [72]. The Hausdorff methods reveal the dissimilarity between two templates by measuring the
distance between the maximally distant pixels in the two point sets. The Tanimoto coefficient, on the other hand, reveals the similarity between two templates by measuring the number of overlapping black pixels. The Yule coefficient is also a similarity measure, except that it considers the matching white pixels in addition to the matching black pixels. The motivation for using a multiple-classifier scheme lies in the pragmatic evidence that, although individual classifiers may not perform perfectly, they usually rank the true definition highly, and tend to misclassify differently [2]. Hence, by advocating definitions ranked highly by all four classifiers, while suppressing those that are not, we can determine the true class more reliably. During recognition, each classifier outputs a list of symbol definitions ranked according to their similarity to the unknown. The results of the individual classifiers are then synthesized by first transforming the similarity measures into dissimilarity measures, then normalizing the classifiers' outputs onto a unified scale (to make them compatible), and finally combining the modified outputs. The definition symbol with the best combined score is chosen as the symbol's interpretation.
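The combination step can be sketched as follows. Note that the min-max normalization used here is only an assumption for illustration; Chapter 7 describes the actual normalization and combination scheme.

```python
def best_interpretation(unknown, definitions, classifiers):
    # Each classifier scores every candidate; similarities are assumed to have
    # already been turned into dissimilarities, which are then normalized to
    # [0, 1] so the four outputs can be summed on a common scale.
    totals = {d: 0.0 for d in definitions}
    for clf in classifiers:
        scores = {d: clf.dissimilarity(unknown, d) for d in definitions}
        lo, hi = min(scores.values()), max(scores.values())
        for d, s in scores.items():
            totals[d] += (s - lo) / (hi - lo) if hi > lo else 0.0
    return min(totals, key=totals.get)  # lowest combined dissimilarity wins
```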

4.6 Error Correction

SimuSketch provides two ways to correct parsing and recognition errors when they occur. The first involves automatic detection and correction of the error based on the results of the parsing algorithm. In the second case, the user is responsible for detecting and correcting errors the system cannot repair.

Automatic Error Correction: Our experience with SimuSketch has shown that when errors occur, they are primarily due to failures to recognize arrows in the preliminary recognition step. A distinct feature of our two arrow recognizers (the speed-based and the neural network-based) is that they almost never identify a non-arrow as an arrow. However, they may occasionally fail to identify an arrow as such. Hence, our automatic error correction module is focused only on finding missing arrows. Nevertheless, if a non-arrow is classified as an arrow, the user can still correct the mistake interactively, as will be explained shortly. The failure to recognize an existing arrow typically manifests itself as a stroke cluster with an abnormal bounding box.


Figure 4.21: (a) A malformed arrowhead causes the bottom-left arrow to be missed. (b) After detecting the abnormal cluster, the program finds the arrow and interprets the sketch correctly.

As shown in Figure 4.21, undetected arrows may cause the bounding boxes of the stroke clusters to be too large or too thin compared to the other bounding boxes. The automatic error detection module thus tries to identify such ill-shaped stroke clusters. Currently, SimuSketch marks a cluster 'abnormal' if its bounding box area is (1) greater than 30% of the entire bounding box of the sketch, or (2) greater than twice the average cluster size, or (3) if the aspect ratio of the bounding box (i.e., height/width) is greater than 4 or less than 1/4. While these thresholds are hard-coded, they can be modified based on individual drawing styles and the typical shapes of the domain objects. When abnormal stroke clusters are found, SimuSketch tries to identify the undetected arrows in them. In the case of the speed-based arrow recognizer, this is done by gradually relaxing both the criteria used for identifying the key points in the arrows, and the geometric constraints used to verify that the points represent an arrowhead. Each stroke in the abnormal cluster is then reevaluated with these new criteria, and the strokes that survive are taken to be arrows.
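Expressed as code, the abnormality test might look as follows (an illustrative Python sketch; the bounding-box representation is our own assumption, but the thresholds are the ones quoted above):

    def is_abnormal(cluster_wh, sketch_wh, avg_cluster_area):
        # cluster_wh, sketch_wh: (width, height) of axis-aligned bounding boxes.
        w, h = cluster_wh
        area = w * h
        sketch_area = sketch_wh[0] * sketch_wh[1]
        aspect = h / w if w > 0 else float('inf')
        return (area > 0.30 * sketch_area         # (1) too large overall
                or area > 2.0 * avg_cluster_area  # (2) over twice the average
                or aspect > 4.0 or aspect < 0.25) # (3) too thin either way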


Note that the program is currently designed to correct only single-stroke arrows. This is mainly because we have found the two-stroke arrow recognition to be quite robust, and thus we could save effort by not implementing an automatic error correction module for it. Once the originally undetected arrows are identified, SimuSketch reevaluates the remainder of the sketch to determine the new stroke clusters and the corresponding symbols. While the above process is performed without the user's assistance, the user is nevertheless responsible for initiating it by pressing the "Fix Error" button located in the menu bar. Making error correction a user-controlled process in this way prevents the system from attempting to correct successfully parsed and recognized sketches. Currently, SimuSketch does not perform automatic error correction when the neural network-based arrow recognizer is used. However, the extension is quite trivial (and arguably simpler than the speed-based case) as the output of the neural network can be directly used as the likelihood of a stroke being an arrow. By lowering the threshold for classifying a stroke as an arrow,⁵ arrows that originally went undetected can be identified during error correction.

Occasionally, stroke clustering errors may occur when the clustering process produces incorrectly grouped strokes. Although this type of error is rare, when it does occur it is usually because strokes that belong to the same symbol are placed in different clusters. (It is possible that a cluster might incorrectly contain multiple symbols, but experience with our system suggests that this happens very rarely.) The characteristic symptom of this error is two small bounding boxes that reside very close to each other. Furthermore, the recognizer may report that there is no reliable definition for the symbols in those clusters. When this occurs, SimuSketch can attempt to merge the clusters to obtain a cluster that the recognizer can identify with higher certainty. We are currently in the process of implementing this functionality.

Interactive Error Correction: For errors that cannot be corrected automatically, SimuSketch enables the user to correct them interactively. The developed techniques have strong parallels with the mediation techniques presented in [55]. When an object is misrecognized, the user can repair it by selecting, deleting, and redrawing it.

⁵Remember that the output of the neural network ranges between 0 and 1, and strokes with a score ≥ 0.5 were classified as arrows.


Figure 4.22: (a) Undetected arrows can be corrected with the 'o' gesture. (b) After the correction, the program rectifies the rest of the sketch.

A more direct way is to choose the correct interpretation from a choice list, which is revealed by bringing the stylus near the misrecognized object and pressing one of the buttons on the side of the stylus. This list contains only the candidate symbols previously determined using contextual information, and is ranked according to the results of the shape recognizer. Hence, the list is typically short, with the correct interpretation usually occurring near the top. Indeed, we saw in our user tests that, when errors occurred, the correct interpretation was at worst the second alternative, which reinforces our confidence in our symbol recognizer. Finally, if an arrow goes undetected and the automatic error correction module cannot detect it, the user can dictate the correct interpretation by drawing a small circle on or near the stroke (Figure 4.22). This gesture, which we call the 'o' gesture, explicitly forces the stroke in question to be an arrow. The 'o' gesture is distinguished from a regular drawing stroke based on its absolute size and its two endpoints. If the gesture fits in a 30 × 30 pixel square on a 1024 × 768 screen, and the stroke forms a closed contour (similar to a selection gesture) without encircling any object, the stroke is interpreted as an 'o' gesture. Once a misrecognized arrow is corrected, SimuSketch automatically rectifies the portion of the sketch that was affected by the missed arrow.
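The gesture test can be summarized in a short sketch. The closure tolerance and the containment check below are our own illustrative assumptions; the text above specifies only the 30 × 30 size bound and the closed-contour requirement.

    import math

    def is_o_gesture(stroke, object_centers, closure_tol=8):
        # stroke: list of (x, y) points in screen coordinates (1024 x 768).
        xs = [p[0] for p in stroke]; ys = [p[1] for p in stroke]
        if max(xs) - min(xs) > 30 or max(ys) - min(ys) > 30:
            return False                      # must fit in a 30 x 30 square
        closed = math.hypot(stroke[0][0] - stroke[-1][0],
                            stroke[0][1] - stroke[-1][1]) <= closure_tol
        # Must not encircle any recognized object; for a gesture this small,
        # checking whether any object center falls inside the gesture's
        # bounding box is an adequate stand-in for a point-in-polygon test.
        encircles = any(min(xs) <= cx <= max(xs) and min(ys) <= cy <= max(ys)
                        for cx, cy in object_centers)
        return closed and not encircles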


Figure 4.23: Digit recognition. (a) The bounding box of the first stroke and the combined bounding box of all strokes. (b) The strokes' projections ('shadows') onto the x-axis.

4.7 Digit Recognition in the Dialog Boxes

As noted previously, SimuSketch allows users to change the numeric properties of the recognized objects through sketchy dialog boxes. These dialog boxes employ a simple digit recognizer we developed. The recognizer takes as input the set of strokes written in a particular slot, and returns a real number. The following paragraphs detail the intermediate steps.

Identifying the minus sign: The recognizer first constructs the "combined bounding box" of the set of strokes (Figure 4.23a). This bounding box is aligned with the coordinate axes and is the smallest rectangle that contains the bounding boxes of all of the individual strokes. It gives a sense of the size of the number. Next, the program considers the first stroke in the number. If the height of the bounding box of this stroke is less than 30% of the height of the combined bounding box, the first stroke is taken to be a minus sign.

Identifying the decimal point: The recognizer next tries to locate the decimal point, if any. In our system, decimal points are input as commas because the stylus does not leave a mark when it is quickly tapped. To locate the decimal point, the recognizer looks for a stroke whose bounding box height is less than 30% of the combined bounding box height, and whose bounding box center is in the lower 30% of the combined bounding box. If found, the program decides that stroke is a decimal point.


Parsing into digits: Next, the recognizer tries to separate the digits in the number. This would be trivial if each digit were drawn in a single stroke, but in general this is not the case. To facilitate parsing, the recognizer projects each stroke onto the x-axis, producing its 'shadow.' Shadows that overlap are grouped together (Figure 4.23b), forming distinct digits. The assumption here is that different digits do not overlap. This assumption could be relaxed by grouping only strokes whose shadows significantly overlap, thus helping separate digits that barely touch. While the extension is straightforward, the system currently works from the original assumption.

Digit recognition: Each of the identified stroke clusters is then passed to the same image-based symbol recognizer used for the Simulink objects. The returned results, together with the minus sign and the decimal point, are finally compiled into a decimal number and returned to the main program.
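A minimal sketch of the shadow-based grouping follows, assuming strokes are lists of (x, y) points and retaining the strict no-overlap assumption described above.

    def group_digits(strokes):
        # Sort strokes by the left edge of their x-axis 'shadow', then merge
        # strokes whose shadows overlap into the same digit cluster.
        shadows = sorted(((min(p[0] for p in s), max(p[0] for p in s), s)
                          for s in strokes), key=lambda t: (t[0], t[1]))
        digits, current, right = [], [], float('-inf')
        for lo, hi, s in shadows:
            if current and lo > right:     # gap on the x-axis: new digit
                digits.append(current)
                current, right = [], float('-inf')
            current.append(s)
            right = max(right, hi)
        if current:
            digits.append(current)
        return digits   # each group is sent to the image-based recognizer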

4.8 Complexity Analysis of SimuSketch

In the preliminary recognition step of SimuSketch, the program first identifies the arrows by running down the list of strokes and determining for each whether it is an arrow or not. Hence, the complexity of this step is O(n), where n is the number of strokes. This is also true for the two-stroke arrow recognizer, as only consecutively drawn pairs of strokes are examined. Note that the requirement for consecutiveness applies only to two-stroke arrows; other objects need not be drawn in consecutive strokes.

In SimuSketch's stroke clustering algorithm, each of the non-arrow strokes is examined to find the corresponding nearest arrow. If we denote the number of arrows by a and the number of non-arrow strokes by b, the computational complexity of this step is O(a · b). For simplicity, if we assume that the numbers of arrows and non-arrows are both O(n) without distinguishing the two types, the complexity of this step is O(n^2). Next, the strokes assigned to the same arrow are merged. This step costs O(a), where a again corresponds to the number of arrows (or O(n) if we do not distinguish arrows and non-arrows). Next, clusters that overlap are merged to form bigger clusters.
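The O(a · b) nearest-arrow step discussed above amounts to a nested search, as the following sketch illustrates (stroke_distance is a placeholder for whatever inter-stroke distance measure the system uses, not an interface from our implementation):

    def assign_to_nearest_arrow(non_arrows, arrows, stroke_distance):
        # For each non-arrow stroke, find the index of its nearest arrow:
        # O(a * b) distance evaluations overall.
        groups = {i: [] for i in range(len(arrows))}
        for stroke in non_arrows:
            nearest = min(range(len(arrows)),
                          key=lambda i: stroke_distance(stroke, arrows[i]))
            groups[nearest].append(stroke)
        return groups   # stroke groups keyed by arrow index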


Merging overlapping clusters costs, in the worst case, O(c^2), where c is the number of initial clusters that have been determined. This is because, for each cluster, we visit every other cluster to see whether the two overlap. This worst-case scenario occurs only if none of the initial clusters overlap and thus no clusters are merged. In general, however, this is not the case: as clusters are merged into bigger clusters, the total number of clusters decreases. Hence, during run time, c gets progressively smaller. Overall, because typically c ≪ n, the computational complexity of the stroke clustering algorithm is dominated by the first step, which is O(n^2).

The worst-case complexity of the subsequent symbol recognition step is O(c · d), where c is the number of stroke clusters identified in the previous step, and d is the number of domain symbols. This is because, for each of the c clusters, the symbol recognizer employs a nearest-neighbor algorithm to find the best matching definition symbol among the d definitions. However, this worst case occurs very rarely in the SimuSketch domain, as contextual information is used to reduce the number of possible interpretations prior to recognition. Hence, the average complexity of the symbol recognition step is O(c · r), where r represents the number of candidate interpretations for each cluster. Typically, r is much less than d.

To get a sense of the complexity experienced in practice, we processed three different sketches in the SimuSketch domain. Figure 4.24 shows the running times of the different modules of our approach in relation to the total number of strokes, the number of arrows, and the number of identified symbol clusters.

              Strokes   Arrows   Clusters   Preliminary   Stroke       Symbol
                                            Recognition   Clustering   Recognition
    Case 1       33        9         9      < 14 ms       15 ms        1422 ms
    Case 2       78       18        17      < 14 ms       16 ms        2891 ms
    Case 3      135       38        31      < 14 ms       < 14 ms      4296 ms

Figure 4.24: The processing times of the different modules of SimuSketch for three different cases. All times are in milliseconds.


To provide a basis for comparison, the sketch shown in Figure 4.1a (page 25) contains 45 strokes, 13 arrows and 10 clusters; it is thus slightly more complex than Case 1 depicted in Figure 4.24. All experiments were conducted on a Pentium 4 machine running at 1.7 GHz with 256 MB of RAM. The results indicate that the preliminary recognition and stroke clustering steps are quite fast. Because the timing resolution of the machine is 14 ms, we cannot provide exact figures for processes that take less time. Note that the running times of the first two steps remain more or less constant even though the size of the sketches nearly doubles from one case to the next. This indicates that the first two steps scale well with the size of the sketch. Nevertheless, it would be necessary to try significantly more complex sketches before this claim can be validated. Finally, the results indicate that the overall running time is dominated by the symbol recognition step. Detailed results on the computational complexity of the image-based symbol recognizer are presented later in Section 7.6.1.

4.9 Evaluation of SimuSketch

The focus of this study was to assess the performance of SimuSketch. Among the various aspects investigated, we were particularly interested in SimuSketch's ease of use, its parsing and recognition accuracy, users' adaptability to the system, their success at recovering from recognition errors, and their short- and long-term view of SimuSketch as a practical front-end to Simulink. A total of 14 graduate and undergraduate students – 12 engineering and 2 computer science majors – participated in the studies. Nearly half of the users either regularly used Simulink or had previously used it once or twice, while the other half had never used Simulink before. Ten users had no prior experience with the digitizing tablet or the stylus, while four had used the hardware once in a previous study. However, none of the users had previously used SimuSketch, nor had they seen it in use by others. As noted previously in Section 4.2, SimuSketch employed only the speed-based


arrow recognizer at the time of this user study. Hence, the results presented in this section do not reflect the potentially better performance that would have been achieved if the neural network-based arrow recognizer had been used. The evaluation of SimuSketch with the latter recognizer will be conducted in the future. However, it is worth noting once again that the current version of SimuSketch has both types of recognizers operational, and the user can choose which one to use.

Each session lasted approximately 30 to 40 minutes. For those who were not familiar with Simulink, we first described what Simulink is and gave a brief demonstration of its interface. Next, we introduced SimuSketch. Using simple examples, we demonstrated the means for creating a sketch, selecting, deleting and moving objects, editing object properties, correcting recognition errors, running simulations, training new symbols and switching between views. During this period, we elaborated on SimuSketch's arrow recognizer, as our experience with the first few users had indicated the arrow recognition to be fragile at times. In particular, we told the users that only single- or two-stroke arrows were permitted, and that both types had to be drawn from a source object toward a target object. No other explanation was given regarding the underlying parsing and recognition algorithms. At the end of this introduction, a brief warm-up period of approximately 5 minutes was given to let the users become familiar with the hardware and SimuSketch's interface.

The main test involved the two Simulink models shown in Figure 4.25. Users were asked to use SimuSketch to construct these models, run a simulation of each, and view the results with minimal help from us. The first model involved changing the parameters of several objects through their dialog boxes, while in the second model the default values were accepted. Users were given the option to train their own set of symbols before starting, but none of them chose to do so. Hence, we provided a sketched version of each of the two models as a quick reference. Both the original models (Figure 4.25) and the sample sketches were presented on paper. Similarly, all users decided to use the pre-trained digit recognizer rather than training their own set of digits. However, in this case we did not provide sample figures of the trained digits. Although no time constraints were set, we encouraged users to complete their tasks in a total of 20 minutes.


Figure 4.25: Test problems employed in the user studies.

Observations, Evaluations and Discussions

One consistent pattern among the users was that their encounter with SimuSketch began with great excitement, as observed from their reactions during the demo session. This was followed by a period of frustration at the beginning of the warm-up period; finally, a favorable equilibrium was reached toward the end of the warm-up period and during the actual testing. In the end, all users completed the first task successfully, while all but four users completed the second task. For those four users, either the program crashed unexpectedly and they did not have time to redo the task, or completing it was taking too long.

The users' main remark about SimuSketch was that it was intuitive and fast to use, and easy to learn. They particularly liked the idea of simply drawing the objects without having to navigate through an object library to find them. Most users found


the interaction mechanisms to be "natural" and "familiar." Many highlighted the ability to quickly train a custom set of symbols as an outstanding attribute, although they did not make use of it.

The user studies enabled us to evaluate the individual accuracies of our arrow recognizer, parsing algorithm, and symbol recognizer. In its current implementation, our program saves only the user's final sketch, and any objects that are deleted during a drawing session are lost. Our initial accuracy calculations thus do not reflect errors that users repaired by deleting and redrawing objects. This does not introduce a significant error in our accuracy calculations, however, because users in the study rarely repaired interpretation errors in this way. In the results presented below, we include estimates of the accuracy that would have been obtained if all interpretation errors had been considered.

The study has shown the main strength of SimuSketch to be its parsing algorithm. In cases where the arrows were all correctly recognized, or the misrecognized ones were corrected by the user, the parsing algorithm had an accuracy above 95%. In the few cases where it failed, two distinct symbols were drawn too close to each other and thus their strokes were grouped into a single cluster. In cases where all stroke clusters were correctly identified, the accuracy of the image-based symbol recognizer was between 85% and 90%. Note that while this result was obtained in a user-independent setting (i.e., the training and test symbols belong to different individuals), it is similar to the result of the user-dependent study described in depth in Chapter 7. We believe that SimuSketch's ability to maintain the same level of accuracy in the more difficult user-independent setting can be attributed to its use of contextual knowledge for narrowing down the set of interpretations of a symbol prior to recognition. Nevertheless, when errors occurred, they were mostly due to: (1) confusion between similarly shaped objects, or (2) the recognizer's sensitivity to non-uniform scaling. Figure 4.26 shows examples of these issues.

Contrary to our expectation, however, users did not seem to mind such occasional errors, mainly because they found the means for recovery – either by deleting and redrawing, or by selecting the right interpretation from the list of alternatives – to be intuitive and undemanding. In the latter case, the correct interpretation was always in the list of alternatives suggested by SimuSketch.


Figure 4.26: (a) Pairs of most frequently confused objects (Random vs. Chirp, and Coulomb & Viscous Friction vs. Sign). (b) A misrecognition due to non-uniform scaling: (left) the definition of the Transfer Function symbol; (right) one user's misrecognized Transfer Function.

The main complaint about SimuSketch centered on the arrow recognizer being too restrictive. As noted earlier, only the speed-based arrow recognizer was available at the time of this user study. Although several users quickly became adept at drawing arrows during the warm-up period, most users continued having difficulty during the main test session. As we expected, the majority of errors were thus due to misrecognized arrows. For the most successful users, the arrow recognition accuracy was above 90%. However, when considering all users, the average accuracy of the speed-based arrow recognition was between 65% and 70%. These results indicate that our speed-based arrow recognizer must be further improved to accommodate a wider variety of styles. Indeed, the development of our neural network-based arrow recognizer (Section 4.2) was driven primarily by the results of this user study. Since then, our informal tests have shown that, by replacing the speed-based arrow recognizer with the neural network arrow recognizer, the overall performance of SimuSketch can be significantly improved. As noted in Section 4.2, we achieved a recognition accuracy of 92.1% with our neural network recognizer. However, formal evaluations of the performance of this recognizer within the SimuSketch system are the subject of future studies.

Besides the issue with arrows, some users had difficulty tapping the stylus to


select an object or to bring up a dialog box. Usually, faulty taps were either too gentle, in which case the program did not receive a tap message, or persisted too long on the tablet, in which case the tap was interpreted as a drawing stroke. Another observed difficulty was with the digit recognition in the dialog boxes. While our pre-trained digit recognizer had acceptable performance for certain users, it could not accommodate vastly dissimilar digit styles that it had not been trained for. In cases where the numbers were misrecognized, we asked the users to re-enter them until they got it right. If each user had trained his or her own set of digits, we expect that the accuracy would have been similar to the results of the user study we conducted for digit recognition (Section 7.6.2).

To obtain the users' evaluation of SimuSketch's performance, we asked each user to complete a questionnaire at the end of the session. The results, shown in Table 4.1, indicate that, while there are a number of usability issues that must be addressed, most users viewed SimuSketch as a promising alternative to Simulink.

Because SimuSketch is still at an early stage, a head-to-head comparison between SimuSketch and Simulink was deliberately avoided in the user studies. Nevertheless, as a subjective test of how an individual who is proficient in both environments would perform, the author used the two programs to construct and simulate a variation of the second model shown in Figure 4.25. The test involved creating the model, changing the default properties of several objects, and viewing the simulation results. While the task took 241 seconds to complete in Simulink, it took only 183 seconds in SimuSketch. Although simplistic, this experiment helps reveal the latent value of SimuSketch as a practical tool.

During the user studies, we also identified several features that the users would like to see in the next version of the program. As a way to expedite the model creation process, many users strongly wanted the ability to copy-paste, group, and resize existing objects. Several users also felt the need for an 'undo' option. These features will be implemented in future versions of SimuSketch.


Table 4.1: Average scores obtained from the user questionnaire. Scale: 1–10, 10 being excellent.

    Statement                                                          Score
    As I was using SimuSketch, I was able to adapt to it easily          8.2
    The software was fast enough to keep up with my pace                 7.8
    Most of the time, SimuSketch interpreted my sketch the way
      I intended                                                         7.4
    Most of the time, SimuSketch behaved expectedly, and when it
      did not, I felt I was in control to fix it                         8.2
    The visual feedback on the interpretation results was adequate
      and unobtrusive                                                    9.1
    The editing operations (select, move, delete, deselect) were
      intuitive and easy to use                                          8.3
    I was comfortable using objects' dialog boxes to enter numeric
      values                                                             7.7
    Currently, the overall performance of SimuSketch is                  7.6
    Assuming that SimuSketch was significantly more robust, I would
      use it in my projects                                              9.4
    Overall, my rating of SimuSketch is                                  8.7

Chapter 5

VibroSketch: A Sketch-Based Interface for Vibratory Systems

VibroSketch is the second application we have built to demonstrate the mark-group-recognize architecture. The system can be used to create and simulate multi-degree-of-freedom vibratory mechanical systems that consist of masses, dampers, springs, external forces and grounds. Although at the heart of VibroSketch is the same mark-group-recognize architecture used in SimuSketch, there are nonetheless several differences between the two systems. These differences are as follows:

• VibroSketch uses two types of marker symbols, namely mass and ground symbols.

• In SimuSketch the arrows actively guided the clustering of the remaining strokes, but in VibroSketch the marker symbols are used simply to magnify the separation between the other symbols. Hence, VibroSketch uses a clustering algorithm that is different from the one used in SimuSketch.

• In SimuSketch, contextual information was used to reduce the number of possible interpretations of a symbol. VibroSketch, on the other hand, does not make use of this type of knowledge, as the total number of domain symbols is already markedly smaller than in SimuSketch. Nevertheless, it is quite trivial


to encode such information in VibroSketch, as described at the end of Section 5.3.

• VibroSketch employs a feature-based symbol recognizer that is different from the image-based symbol recognizer used in SimuSketch. However, the modular nature of our mark-group-recognize approach allows the two symbol recognizers to be used interchangeably without affecting the rest of the system.

The following sections first describe the user interaction in VibroSketch, followed by a presentation of the technical details underlying the system.

5.1 User Interaction

User interaction in VibroSketch is very similar to that in SimuSketch. Figure 5.1 shows a snapshot of the user interface. Users draw continuously as they would on paper, without needing to indicate the separation between symbols. Users can sketch vibratory systems comprised of any number of masses, springs, dampers, forces and grounds. These objects can be drawn in any order, and each can consist of multiple strokes. New components can be added to the sketch at any time. After the drawing is completed, the user instructs the program to interpret the scene by tapping the "Recognize" button located on the drawing surface. At this point, the program processes the collection of strokes and identifies the mechanical components present. The program demonstrates its understanding by displaying unique text labels next to the identified components. These labels signify the type of the components and the order in which they were drawn. The labels are similar to those an engineer might use. For example, "k1" indicates that the component is a spring, and furthermore, that it was the first spring drawn. Similarly, "m3" indicates that the component was the third mass drawn. A default value of one is assigned to the relevant properties of each spring, damper and external force. For example, each spring is assigned a stiffness of 1 N/m, and each damper is assigned a damping constant of 1 N·s/m. External forces are of the form F₀ · cos(ω · t), with F₀ = 1 N and ω = 1 rad/s. Masses are assigned mass values proportional to the size at which they were drawn.


Figure 5.1: A typical vibratory system created in VibroSketch. The program interprets the sketch, performs a simulation of it, and displays the results in the form of live animations and graphical plots.

The geometrically largest mass block is assigned a mass of 1 kg, while the remaining ones receive proportionally smaller values. For example, a mass block half the size of the largest one is assigned a mass of 0.5 kg. The user can later change these default values by specifying the desired values through sketchy dialog boxes, similar to those used in the SimuSketch system.

After identifying the components, VibroSketch performs a spatial analysis to determine how the components are connected to one another. It then constructs the set of differential equations that describe the dynamic behavior of the system. To simplify the generation of these equations, it is assumed that each mass has 1-D motion along the horizontal direction. This assumption is not a limitation of our sketch understanding techniques; rather, it avoids issues related to computing simulations, which are not the focus of this work. The equations are passed to, and solved by, the Matlab engine running in the background. VibroSketch is responsible for initiating Matlab and linking it to the sketch interface. VibroSketch retrieves the solution from Matlab, thus allowing the user to study the system behavior directly from the sketch interface.


For example, the user can run a live animation by tapping the "Simulate" button in the interface. When the user does this, the sketch itself is animated: the masses move, the springs compress and stretch, and so on. The simulation results are also displayed in the form of graphical plots. As shown in Figure 5.1, the graphical output consists of position vs. time plots and the frequency response of the system.

Figure 5.2: A stroke chain involves the original pen strokes (1, 2, 3) and the hypothetical linkages between them (AA', BB', CC'). The stroke-linkage sequence 1–AA'–2–CC'–3–BB' shows the resulting stroke chain. The numbers and arrows indicate the order and directions in which the strokes were drawn. The stroke chain does not assume a particular drawing order or direction.

5.2 Preliminary Recognition

In the domain of mechanical systems, mass and ground symbols have proven to be good marker symbols as they possess a number of unique geometric characteristics that facilitate their recognition. For example, masses invariably consist of closed loops. Similarly, ground symbols are characterized by a sequence of short, parallel line segments corresponding to the hatches. In the first step of analysis, VibroSketch exploits these features to identify the masses and grounds in the sketch.

Recognizing Masses: Identifying a mass object involves finding a set of consecutively drawn strokes that connect end to end to form a closed loop. To determine if a set of strokes forms a closed loop, our program constructs a fully connected stroke chain that consists of the original strokes and a set of hypothetical linkages between them. The linkages are formed by joining the strokes to one another based on the minimum endpoint distance.


For example, in Figure 5.2, the beginning point of stroke 1 is connected to the beginning point of stroke 2 (with the hypothetical linkage AA') because these two ends are closer to each other than any other pair involving them. In a perfect closed loop, all strokes would be connected precisely at their endpoints and therefore the total linkage length¹ would be zero. However, to account for sketchiness, we use a thresholded criterion that accepts a closure if the total linkage length is less than or equal to 10% of the total length of the pen strokes. For strokes that do not form a closed loop, this ratio is typically much higher. For example, it is 100% for a straight line, and can be even higher for arbitrary stroke sets. Using this algorithm, VibroSketch identifies closed loops composed of up to five consecutively drawn strokes,² including single-stroke loops. Note that this method allows the constituent strokes to be drawn in any arbitrary order and direction. Also, the patterns identified in this way need not form a particular geometric shape such as a square or rectangle, but can be of any arbitrary shape. After identifying the closed loops, VibroSketch instantiates the mass objects and marks the associated strokes as processed to prevent them from later being considered as parts of other components.

Figure 5.3: (a) Examples of correctly recognized ground symbols. (b) For recognition, our program considers various features such as the length of the skeleton, the separation between the hatches (the largest of which is s_max), and the orientation of the hatches.

¹The total linkage length is the sum of the lengths of the hypothetical linkages joining the strokes. In Figure 5.2, for example, this would be |AA'| + |BB'| + |CC'|.

²We have found that in the domain of vibratory systems, people usually draw masses in two or three strokes. Hence, the upper limit of five strokes has been sufficient for our purposes.
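The closure criterion can be sketched as follows. The greedy end-to-end chaining used here is a simplification of the stroke-chain construction described above, but the 10% acceptance threshold is the one we use.

    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def ink_length(stroke):
        # Arc length of a stroke given as a list of (x, y) sample points.
        return sum(dist(p, q) for p, q in zip(stroke, stroke[1:]))

    def is_closed_loop(strokes, ratio=0.10):
        # Chain consecutively drawn strokes end to end, attaching each stroke
        # (possibly reversed) to the free end of the chain, then close the
        # chain back to its starting point.
        total_ink = sum(ink_length(s) for s in strokes)
        start, tail = strokes[0][0], strokes[0][-1]
        linkage = 0.0
        for s in strokes[1:]:
            d_fwd, d_rev = dist(tail, s[0]), dist(tail, s[-1])
            if d_fwd <= d_rev:
                linkage, tail = linkage + d_fwd, s[-1]
            else:
                linkage, tail = linkage + d_rev, s[0]
        linkage += dist(tail, start)      # linkage that closes the loop
        # Accept if the hypothetical linkages total at most 10% of the ink.
        return total_ink > 0 and linkage <= ratio * total_ink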


Recognizing Grounds: After identifying the masses in the sketch, VibroSketch focuses its attention on the ground symbols. The distinguishing characteristic of a ground symbol is a set of short, parallel line segments (i.e., the hatches) that are aligned approximately along a straight line (Figure 5.3). Moreover, these segments are almost always drawn consecutively in a uniform manner, such as from top to bottom or vice versa. Our program thus searches for such patterns in the raw strokes to locate the ground symbols. To prevent arbitrary parallel strokes from being recognized as grounds, we require a minimum of four strokes in the hatch area before a ground symbol can be conjectured. To test whether a group of strokes constitutes a ground symbol, the program determines if (1) they are roughly uniformly separated, (2) they are more or less parallel, and (3) the imaginary polyline formed by connecting their starting points (which we call the skeleton) is close to a straight line. The first requirement is satisfied if the separation between the pair of most distant consecutive strokes (s_max) is less than twice the average separation distance. The second requirement is satisfied if the vectors defined by connecting the first points of the strokes to their last points all point into the same quadrant, for example south-west in Figure 5.3b. The last requirement is satisfied if the skeleton length is within 5% of the length of an imaginary line extending from the first to the last stroke. Once a core sequence of four strokes satisfying the above requirements is found, our program determines the extent of the pattern by appending subsequent strokes one at a time until the pattern is disrupted. Finally, the long stroke that appears next to the hatches is found and added to the pattern. The same procedure is applied to find other ground symbols.
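The three hatch tests might be coded as follows. Measuring the separations between consecutive start points is a simplification of the s_max test; the twice-average, same-quadrant and 5% skeleton criteria are the ones given above.

    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def looks_like_hatch_pattern(strokes):
        # strokes: consecutively drawn hatch strokes, each a list of points.
        if len(strokes) < 4:
            return False
        starts = [s[0] for s in strokes]
        seps = [dist(a, b) for a, b in zip(starts, starts[1:])]
        # (1) Roughly uniform separation: s_max under twice the average.
        if max(seps) >= 2.0 * (sum(seps) / len(seps)):
            return False
        # (2) Roughly parallel: all start-to-end vectors point into the
        # same quadrant (e.g., all toward the south-west).
        quads = {(s[-1][0] - s[0][0] > 0, s[-1][1] - s[0][1] > 0)
                 for s in strokes}
        if len(quads) != 1:
            return False
        # (3) The skeleton through the start points is nearly straight:
        # its length is within 5% of the first-to-last distance.
        return sum(seps) <= 1.05 * dist(starts[0], starts[-1])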

5.3 Stroke Clustering

The previous step identifies the masses and grounds but leaves the rest of the sketch uninterpreted. When the masses and grounds are removed from the sketch, one is left with the springs, dampers and forces. For example, Figure 5.4a shows what is left after the masses and grounds are removed from Figure 5.1. As before, the task of identifying these components is split into two subproblems. The first is stroke clustering, in which the strokes are grouped into clusters corresponding to distinct objects. Once the stroke clusters are identified, the next step is to recognize


Figure 5.4: (a) The remaining objects that need to be identified once the masses and the grounds in Figure 5.1 have been recognized. (b) The hierarchical clustering algorithm separates the scene into distinct clusters. In the configuration shown, the algorithm has been run until a single cluster was obtained. The marked clusters are later determined by analyzing the distance between the merged clusters at every iteration.


each stroke group using the symbol recognizer described in Section 5.4. This section concerns the first of these tasks.

The clustering problem can be formally defined as finding the best grouping of the strokes such that each group embodies all the strokes belonging to a single object while excluding those coming from other objects. For example, in Figure 5.4 this means identifying the six clusters corresponding to the two springs, three dampers and the force. Four key issues complicate the problem. First, the clusters can occur arbitrarily close to or far from one another; hence we cannot set a fixed threshold distance below which two strokes would be considered part of the same cluster. Second, the clusters can have arbitrary sizes and shapes. Third, each cluster may contain an arbitrary number of strokes. Fourth, and most importantly, one does not know a priori the number of clusters to be determined.

The clustering approach used in VibroSketch relies on the observation that the clusters in this domain typically occur in spatially distinct regions without overlapping. In fact, the purpose of excluding masses and grounds through a preliminary recognition process is to accentuate the separation between clusters. Also, although the distance between two clusters is arbitrary, it is usually greater than the distance between the strokes within a cluster. Hence, different clusters can be identified by grouping together the strokes that reside close to each other and separating those that do not. To implement this idea, we have adopted the agglomerative hierarchical clustering algorithm described in [18].

The clustering procedure is facilitated if the scene is viewed as a collection of data points rather than pen strokes. In this representation, each data point initially forms a distinct seed cluster. The algorithm takes as input these seed clusters and recursively merges them until a single, all-encompassing cluster is obtained. At each step, the two nearest clusters are merged, resulting in a bigger cluster that contains the combined set of data points. The number of clusters thus decreases by one at each iteration. To find the two nearest clusters at a given step, we must define a distance metric. In our approach, the distance between two clusters A and B is given by:


d(A, B) = min_{a∈A, b∈B} ‖a − b‖

where ‖a − b‖ represents the Euclidean distance between points a and b. In this formulation, d(A, B) corresponds to the distance between the two closest points in A and B, and is known as the nearest-neighbor distance. At each step, the program computes this distance for all cluster pairs and merges the two having the minimum of these distances. Although other metrics, such as the farthest-neighbor distance, could be used to determine cluster distances, we have found the nearest-neighbor measure to be the most suitable as it favors thin and elongated clusters due to a phenomenon called 'chaining' [18]. Due to their typical appearances, the springs, dampers and forces in our domain often benefit from this effect.

While using the sampled data points as the initial seed clusters facilitates clustering, it also results in superfluous computations in the early stages of the algorithm. Unless very unusual drawing styles are used, each pen stroke typically belongs to only one symbol. Because of this, we initially group all of the data points coming from a single stroke into a single cluster. We have found this to greatly reduce the amount of computation needed to perform clustering.

As mentioned, not knowing the number of symbols a priori presents a challenge to our analysis. If this number were known, the clustering algorithm could simply be terminated when the desired number of clusters was reached. In our case, however, this number must be determined automatically. The hierarchical clustering algorithm provides a means to accomplish this. At each iteration of the algorithm, the distance between the clusters merged at that iteration is stored as a dissimilarity score δ. Because the algorithm merges the nearest clusters at each iteration, δ monotonically increases with the number of iterations. The key, however, is that a large increase in δ usually signals a 'forced merge' [18] – a merge that combines two distant clusters – and thus can be used as a stopping criterion. We exploit this observation to find the number of clusters. Consider Figure 5.5, which shows the dissimilarity score versus the iteration number obtained from Figure 5.4. The large jump from iteration 17 to 18 corresponds to the merging of the force symbol with the damper at its lower right. The subsequent iterations further combine the remaining clusters until a single cluster is obtained. Clearly, the intended clusters are those obtained at the end of iteration 17.


Figure 5.5: The dissimilarity score δ increases monotonically with the iteration number i. Sharp leaps, such as the one at iteration 17, usually correspond to forced merges and thus can be used to determine the number of natural clusters.

By finding the sharp leaps in δ, we can thus determine the best stopping iteration. The challenge, however, is to reliably detect such leaps, which in general may not be as distinct. To do so, we choose the stopping iteration i* that maximizes the leap from the preceding iteration to the next, weighted by the absolute magnitude of the leap. That is,

i* = argmax_i [ ((δ_{i+1} − δ_i) / (δ_i − δ_{i−1})) · (δ_{i+1} − δ_i) ]

The first term in the above expression (the ratio) measures the relative change in the increase of δ between consecutive iterations. This is useful for detecting sharp leaps in δ such as the one that occurs at iteration 17 in Figure 5.5. However, because the ratio measures only the relative increase, if the increase in δ in the previous iteration was minute, even a small increment in δ on the current iteration may undesirably extremize the ratio. This often occurs during the initial iterations. To prevent such occurrences from dictating the stopping iteration, we favor globally large leaps over smaller ones by using the absolute amount of leap (δ_{i+1} − δ_i) as a scaling factor.

The clustering method described above works best when the symbols form compact clusters at spatially distant locations.
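Putting the pieces together, the agglomeration loop and the selection of i* can be sketched compactly in Python. The epsilon guard against a zero denominator and the naive O(n^2) pair search are our own choices; an efficient implementation would differ.

    import math

    def nn_distance(A, B):
        # Nearest-neighbor distance between clusters A and B (point lists).
        return min(math.hypot(ax - bx, ay - by)
                   for ax, ay in A for bx, by in B)

    def cluster(seeds):
        # seeds: one point list per pen stroke (at least four are assumed).
        # Merge the two nearest clusters until one remains, recording the
        # merge distance delta at each iteration.
        clusters = [list(s) for s in seeds]
        snapshots, deltas = [], []
        while len(clusters) > 1:
            i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda p: nn_distance(clusters[p[0]],
                                                 clusters[p[1]]))
            deltas.append(nn_distance(clusters[i], clusters[j]))
            merged = clusters[i] + clusters[j]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
            snapshots.append([list(c) for c in clusters])
        # Pick i* maximizing ((d[i+1]-d[i]) / (d[i]-d[i-1])) * (d[i+1]-d[i]).
        eps = 1e-9
        i_star = max(range(1, len(deltas) - 1),
                     key=lambda i: ((deltas[i + 1] - deltas[i])
                                    / max(deltas[i] - deltas[i - 1], eps))
                                   * (deltas[i + 1] - deltas[i]))
        return snapshots[i_star]   # clusters in effect after iteration i*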


Figure 5.6: The clustering algorithm falls short when symbols overlap or when intra-symbol distances are comparable to inter-symbol distances.

It naturally allows symbols to be drawn in an arbitrary number of strokes, and is not sensitive to the absolute angular orientation of a symbol or to the angular orientation of one symbol relative to another. However, it is not well suited to cases where different symbols overlap (Figure 5.6a), or where an internal gap in a symbol is comparable in size to the distance to a neighboring symbol (Figure 5.6b). In the first case, the algorithm will simply produce erroneous clusters. In the second case, the right number of clusters will not be determined reliably, as the leap from intra-cluster merges to inter-cluster merges will not be as distinct as in Figure 5.5. Although the first of these issues is highly uncommon in our domain (springs, dampers and forces usually do not overlap), the second issue does occasionally cause errors. We have found that most of these errors can be alleviated by requiring the user to keep the gaps in the dampers to a minimum.

Currently, VibroSketch does not employ contextual information to reduce the number of interpretations of a stroke cluster prior to recognition. However, it would be quite trivial to encode this kind of knowledge into the system. For example, it is known that in the types of mechanical systems we consider, ground symbols can have only springs and dampers attached to them, while forces can be attached only to masses. Hence, by considering the spatial configurations of the previously identified mass and ground symbols in relation to the stroke clusters, it is possible to rule out certain interpretations for the clusters. For example, if a ground symbol is found next to a stroke cluster, only a spring or a damper can be the interpretation of that cluster. Another form of contextual information would be the number of strokes contained in a cluster. Typically, a spring would be drawn in one or two strokes, while a damper symbol would require a minimum of four. Hence, as in SimuSketch, it would be


possible to narrow down the interpretations of a cluster by considering such pieces of information. However, we have avoided this in VibroSketch, primarily because the total number of symbol definitions was already small and the symbol recognizer (described next) was sufficiently successful.

5.4 Symbol Recognition

Once the symbol clusters have been identified, the next step is to actually recognize each cluster. VibroSketch uses a feature-based, trainable symbol recognizer for this purpose. This recognizer is different from the one used in the SimuSketch system and is used here to demonstrate its utility in a real application.³ While a detailed description of this recognizer is given in Chapter 8, here we highlight its main characteristics. The recognizer takes as input the raw strokes in a cluster and outputs the domain object that best matches the given strokes. Because masses and grounds are already identified in the preliminary recognition step, the symbol recognizer presented here is used only for distinguishing between springs, dampers and forces. The relatively small number of patterns to consider in our working example, however, should not obscure the utility of our symbol recognizer. To date, we have successfully used this recognizer in several other domains with significantly larger symbol libraries.

Segmentation: The recognizer first decomposes the raw strokes into line and arc segments that match the original ink. This process, called segmentation, provides compact descriptions of the pen strokes that facilitate recognition. Segmentation involves searching along each stroke for "segment points," points that divide the stroke into different geometric primitives. Once the segment points have been identified, a least squares analysis is used to fit lines and arcs to the ink between the segment points. Examples of segmented symbols are shown in Figure 5.7.

Training: Our recognizer uses a feature-based, statistical learning technique to learn new symbol definitions. To train the recognizer, the user draws several examples of a symbol. Each example can be sketched using any number of strokes drawn in any order.

³An earlier version of VibroSketch used the same recognizer used in SimuSketch.


Figure 5.7: The original (left) and segmented (right) versions of a spring, a damper and a force symbol.

The examples need not be drawn at the same size or orientation, since the recognizer is insensitive to size and rotation, and is robust to moderate non-uniform scaling. A set of geometric features is extracted from the segmented version of each training example. These features include: the number of pen strokes, the number of line segments, the number of arc segments, the number of endpoint ("L") intersections, the number of midpoint ("X") intersections, the number of endpoint-to-midpoint ("T") intersections, the number of pairs of parallel lines, the number of pairs of perpendicular lines, and the average distance between the endpoints of the segments. Once these features have been computed for each of the training examples of a symbol, a statistical definition model is constructed for the symbol. We assume that the training features are distributed normally, i.e., that they can be modeled as Gaussian distributions. A Gaussian model naturally accounts for variations in the training examples.

Recognition: The first step in recognizing an unknown symbol, S, is to extract the same features used to describe the training examples. The values of these features are then compared to those of each learned definition, D_i. In the end, S is classified by the definition D* that maximizes the probability of match. That is:

D* = argmax_i P(D_i | S)

We assume that all definitions are equally likely to occur, and hence we set the prior


probabilities of the definitions to be equal. We also assume that the features x_j are independent of one another. With these assumptions, Bayes' rule tells us that the definition which best classifies the symbol is the one that maximizes the likelihood of observing the symbol's individual features:

D* = argmax_i ∏_j P(x_j | D_i)

As stated in the training section, we assume each statistical definition model P(x_j | D_i) to be a Gaussian distribution with mean µ_{i,j} and standard deviation σ_{i,j}:

P(x_j | D_i) = (1 / (σ_{i,j} √(2π))) · exp[−(x_j − µ_{i,j})² / (2σ_{i,j}²)]

Since the features are assumed to be independent, this is referred to as a naive Bayesian classifier. This type of classifier is commonly thought to produce optimal results only when all features are truly independent. This is not a proper assumption for our system, since some of the features we use are interrelated. For example, the number of intersections in a symbol frequently increases with the number of lines and arcs. However, Domingos and Pazzani [16] show that the naive Bayesian classifier does not require independence of the features to be optimal. While the actual values of the probabilities of match may not be accurate, the rankings of the definitions will most likely be correct. Nevertheless, in Chapter 8, we describe a very similar recognizer that does not assume feature independence.
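For concreteness, the training and classification steps might be sketched as follows. The small floor on the standard deviation is our own addition, included to keep the Gaussians well defined when a feature does not vary across the training examples; the feature extraction itself is assumed to be done elsewhere.

    import math

    def train(examples_by_symbol):
        # examples_by_symbol: {name: [feature vectors]}. Fit an independent
        # Gaussian (mu, sigma) to each feature of each definition.
        models = {}
        for name, examples in examples_by_symbol.items():
            stats = []
            for column in zip(*examples):
                mu = sum(column) / len(column)
                var = sum((x - mu) ** 2 for x in column) / len(column)
                stats.append((mu, max(math.sqrt(var), 1e-3)))
            models[name] = stats
        return models

    def classify(models, x):
        # Naive Bayes with equal priors: maximize the product of Gaussian
        # likelihoods, computed as a sum of logs for numerical stability.
        def log_lik(stats):
            return sum(-math.log(s * math.sqrt(2 * math.pi))
                       - (xj - m) ** 2 / (2 * s * s)
                       for xj, (m, s) in zip(x, stats))
        return max(models, key=lambda name: log_lik(models[name]))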

5.5 Connectivity Analysis

The final step in our analysis involves finding how the recognized components are connected to one another so that we can construct the equations of motion. This is accomplished in a straightforward way by connecting the components that are spatially nearest to each other. For example, in Figure 5.1, the right end of spring k1 is connected to mass m1 because, among all grounds and masses, m1 is the nearest component to the right end of k1. Our measure of proximity between a mass and a spring is the Euclidean distance between the bounding box center of the mass and


the end of the spring.⁴ Similar measures are used for determining the connectivity between springs and grounds, dampers and grounds, dampers and masses, and forces and masses. Note that in the models we consider, we require each end of a spring or damper to be connected to precisely one mass or one ground symbol, whichever is closer. Each mass or ground, however, may have an arbitrary number of springs or dampers attached to it. Currently, our analysis excludes the case in which springs and dampers are connected end to end.

The structural analysis described above circumvents the pitfalls that can occur due to a literal interpretation of the sketch. Our goal is to infer the intended rather than the apparent structure. For instance, in Figure 5.1, although c1 and m1 are not actually attached, our program decides, just as anybody seeing the sketch would, that the two are connected. A literal interpretation, on the other hand, would consider the two components disconnected.

After determining the connectivity between components, the equations of motion describing the system behavior are constructed. For the discrete, linear and time-invariant systems we consider, these equations are conveniently described in terms of the mass, damping and stiffness matrices; the displacement and forcing vectors; and the initial position and initial velocity vectors. All of these can be written straightforwardly once the connectivity of the components has been determined. Finally, these system matrices and vectors are passed to, and solved by, the Matlab engine running in the background. The solution is a displacement vector whose elements are the displacements of each of the masses as a function of time. These results are displayed to the user in the form of conventional Matlab plots and as an animation of the user's sketch in which masses translate, and dampers and springs stretch and compress (Figure 5.1).

⁴As before, the bounding box of a symbol is the smallest rectangle, aligned with the coordinate axes, that fully encloses the symbol.
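The proximity-based connection step can be sketched as follows; the accessors endpoint() and center() are placeholders for the actual geometry queries rather than an interface from our implementation.

    import math

    def connect(elements, bodies, endpoint, center):
        # Attach each end of every spring/damper (element) to the nearest
        # mass or ground (body), measured from the element's end point to
        # the body's bounding-box center.
        connections = []
        for elem in elements:
            for end in ('first', 'second'):
                px, py = endpoint(elem, end)
                nearest = min(bodies, key=lambda b: math.hypot(
                    px - center(b)[0], py - center(b)[1]))
                connections.append((elem, end, nearest))
        return connections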

5.6 Complexity Analysis of VibroSketch

The preliminary recognition step of VibroSketch identifies the mass and ground symbols in the sketch. One requirement in the detection of these patterns is that they are assumed to be formed of consecutively drawn strokes. With this requirement, the computational complexity of detecting these patterns is O(k · n), where k is the maximum number of strokes that can occur in these patterns, and n is the total number of strokes in the sketch. Typically, k can be set to a small number. For instance, VibroSketch's mass recognizer searches for closed contours containing up to 5 strokes, and hence k is 5. The ground recognizer, however, does not put an upper limit on the number of strokes in a ground symbol. Therefore the complexity of the ground recognizer is theoretically O(n^2). However, practical drawing styles naturally result in a much lower complexity, as the number of cross-hatches in ground symbols rarely exceeds 6 or 7. If the requirement of temporal consecutiveness were relaxed in the detection of these patterns, all stroke groups containing up to k strokes would need to be considered. In that case, the cost of identifying the patterns would be

∑_{i=1}^{k} C(n, i) = O(n^k),

resulting in a quite inefficient procedure. Hence, although the requirement of consecutiveness imposes some constraint on drawing, we believe it is not overly burdensome. Additionally, this requirement is balanced by the fact that the computational complexity is significantly reduced. Note that this requirement applies only to marker symbols; the remaining symbols need not be drawn with consecutive strokes.

The complexity of VibroSketch's stroke clustering algorithm is O(n^3). This is because at each iteration of the algorithm, we identify the two nearest clusters in a naive way by considering all cluster pairs, which results in O(n^2) complexity,⁵ and this process is repeated until a single, all-encompassing cluster is obtained, which takes O(n) iterations. However, since the number of clusters decreases at each iteration due to merges, the actual cost of the algorithm is sub-cubic. Finally, the symbol recognition complexity of VibroSketch is O(c · d), where c is the number of stroke clusters and d is the number of domain symbols.

⁵Although not implemented, it is possible to compute the two nearest clusters with better-designed algorithms that run in O(n · log n).


                          Case 1     Case 2     Case 3
Number of Strokes             18         52        148
Number of Markers              2          6         20
Number of Clusters             3          6         18
Preliminary Recognition   152 ms     420 ms    1392 ms
Stroke Clustering        < 14 ms      55 ms    1170 ms
Symbol Recognition        326 ms     581 ms     627 ms

Figure 5.8: The processing times of the different modules of VibroSketch for three different cases. All times are in milliseconds. 'Number of Markers' includes both the ground and mass symbols.

Figure 5.8 shows the actual processing times of three sketches created in VibroSketch. All results were obtained on the same machine used in the analysis of SimuSketch. Note that unlike in the SimuSketch system, the symbol recognition step of VibroSketch is quite fast. This is primarily because the feature-based recognizer employed in VibroSketch is several orders of magnitude faster than the image-based recognizer of SimuSketch. Additionally, there are fewer definition symbols in VibroSketch, which further reduces the overall recognition time relative to SimuSketch.

5.7 Evaluation of VibroSketch

The evaluation of VibroSketch focused solely on its parsing and recognition accuracy. Usability studies examining such things as the editing and viewing capabilities of the user interface were not conducted, as the evaluation of SimuSketch provided appropriate feedback concerning these issues. Additionally, the evaluation of VibroSketch was more qualitative than that of SimuSketch. We asked 13 subjects, most of whom were graduate and undergraduate mechanical engineering students, to sketch the two types of vibratory systems shown in Figure 5.9. Each subject provided four sketches, two of each type.


Figure 5.9: Two successfully recognized sketches employed in our user study.

Subjects had very little or no experience with the LCD tablet. Moreover, the test was conducted in a walk-up-and-draw fashion, in which subjects were asked to start drawing almost immediately. Only a brief warm-up period of about 30 seconds was given to allow the subject to become familiar with the stylus and LCD tablet. No explanation was given about how the program performs its task. For example, subjects were not told that the system begins by looking for closed loops to identify masses, and hatches to identify ground symbols. Each session involved only data collection; the data was processed at a later time. This approach was chosen to prevent the participants from adjusting their style from one sketch to the next based on our program's output.

The results indicated that our parsing and recognition approach is sound. However, to accommodate a wider variety of users, it may be necessary to adjust some of the assumptions about drawing styles. In general, the agglomerative parsing algorithm worked quite successfully. However, when it did fail, it was due to the phenomenon illustrated in Figure 5.6b (page 74), in which symbols are too close to one another. For the sketches in which parsing was successful, the feature-based symbol recognizer was highly accurate, even though none of the participants were involved in the training of the recognizer.6


The rare misrecognitions were due to deficiencies in the segmentation process caused by subjects drawing too quickly or too small.

The mass recognizer worked correctly for 11 of the 13 subjects. One subject sometimes drew a mass and spring together, in a single pen stroke. Another drew small triangles for the arrowheads on the forces, which were then misrecognized as masses. We believe this situation can be fixed relatively easily by filtering out masses that are geometrically small compared to the rest of the masses.

The ground recognizer worked correctly for 9 of the 13 subjects. One subject drew only three strokes for the hatch, while our program requires a minimum of four. A second subject drew ground symbols in which the hatching consisted of three sets of hatches, each containing three strokes, drawn far apart from each other. A third subject varied the directions of the strokes in the hatch, for example, with one pointing to the south-west, another pointing to the north-east, and so on. These three situations might be handled by a more general definition of a ground symbol. A fourth subject sometimes used a single stroke to draw both a spring and a ground, and rarely lifted the pen while drawing the hatches.

Occasionally, the test subjects would try to improve the appearance of their sketch after it was nearly completed. For example, they would add a small bit of ink to try to close the boundary of a mass, or they would try to extend a ground symbol by adding a few extra hatches. Our special-purpose mass and ground recognizers require that strokes be drawn consecutively. Thus, when new ink is added in this way, it will be identified as a separate symbol. Handling this issue requires relaxing the requirement of temporal consecutiveness when recognizing mass and ground symbols, which is currently not implemented. Note, however, that such added ink typically does not pose problems in the recognition of springs, dampers and forces, as the agglomerative parsing approach is not sensitive to the temporal order, and the feature-based recognizer is robust to a few extra or missing strokes.

6. The recognizer was previously trained by one of the author's colleagues, using 10 training samples for each symbol.


VibroSketch did work as expected for the majority of the test subjects. This is quite encouraging given that they had no experience with our system, and no information about how it worked, prior to the test. As described above, the test revealed several problems that some test subjects encountered. However, we expect that providing users with even minimal information about how the system works, or letting them use the system for extended periods, would significantly decrease errors, while still providing a natural drawing environment.

Chapter 6
Symbol Recognition: An Overview

This chapter presents an overview of the three symbol recognizers we have developed, together with a comparative evaluation of the advantages and disadvantages of each. A detailed presentation of each of the recognizers is contained in the following chapters. A short description of the first two recognizers was presented previously in Sections 4.5 and 5.4, respectively.

Image-Based Recognizer: The development of the first recognizer was inspired by techniques from image processing. Symbols are internally represented as quantized bitmap images we call “templates.” Each symbol is centered within its template and is uniformly scaled to fill it, thus making the approach insensitive to uniform scaling. One distinct advantage of this recognizer over traditional ones is that it can learn new definitions from single prototype examples. An unknown template is matched to a definition template using an ensemble of four different classifiers. These classifiers are extensions of the following methods: (1) Hausdorff distance [64], (2) Modified Hausdorff distance [17], (3) Tanimoto coefficient [20] and (4) Yule coefficient [72]. These classifiers were originally intended for matching precise bitmaps. Here, they had to be modified to handle the variations typical of a “sketchy,” hand-drawn shape. Each classifier outputs a ranked list of candidate definitions for the unknown symbol. These rankings are combined, and the definition with the best combined score is selected. In practice, the combined performance is better than that of any of the individual classifiers. To achieve rotation invariance, the recognizer uses a novel polar coordinate analysis that avoids expensive rotations in the drawing coordinates. The recognizer is versatile in that we use it both for graphical symbol recognition and digit recognition.


Feature-Based Recognizer: The second recognizer works from a set of features extracted from the input symbols. These features encode the geometric properties of the line and arc segments fitted to the raw strokes comprising the shape. This method is based on a naive Bayesian approach. From a set of training examples, the recognizer learns the statistical distributions of these features in the form of multivariate Gaussian probability functions. An unknown symbol is classified by comparing its features to the distribution of features contained in a given definition. One advantage of this recognizer is that, once it is trained, recognition is fast. However, the approach can occasionally produce false positives, because different symbols sometimes have similar features.

Graph-Based Recognizer: The last recognizer works from a structural representation of the line and arc segments making up a symbol. To train the system, the user provides a few examples of a given symbol. Each example is represented internally as an attributed relational graph. The properties of the graphs for the various examples are extracted, and both the “average” properties and their statistical distributions are assembled into a definition graph. The nodes in the graphs are the geometric primitives (line and arc segments) fitted to the raw strokes comprising the shape. Each primitive is characterized by intrinsic properties, such as its type and length. The links in the graphs are the geometric relationships between the primitives, such as parallelism, intersection, and perpendicularity. During recognition, the graph of the unknown is matched to the library of definition graphs. When considering a particular definition, errors are accrued if the unknown is missing attributes or relations contained in that definition. The definition with the smallest error is then reported as the interpretation of the unknown symbol. Recognition (and training) requires solving a graph matching problem. In practical terms, the issue is to determine which primitives in the unknown symbol match which primitives in the definition. To allow for variations in drawing order and segmentation errors, we have developed an error-driven stochastic algorithm that can quickly converge to a good enough matching configuration, while avoiding an exhaustive search for the best matching configuration.


6.1 Comparison of the Recognizers

While all three recognizers can be used to recognize a wide variety of different symbols, differences in their underlying principles give each recognizer unique strengths. An advantage of the image-based recognizer is that it does not segment the input strokes into line and arc primitives, making it immune to errors that might take place during this process.1 Our experience has shown that segmentation errors occur mostly when the raw strokes exhibit too many subtle transitions between linear and curved regions. The integrator sign in the Simulink domain is an example of such a symbol (page 25). In such cases, using the image-based recognizer usually provides better results, as potential errors due to segmentation are averted. Also, this recognizer is well suited for recognizing “sketchy” symbols, such as those with heavy overtracing, missing or extra segments, and different line styles (solid, dashed, etc.). The other two recognizers are not designed to handle this kind of sketchiness. Another unique attribute of this recognizer is that it can learn new symbol definitions from single prototype examples. Hence, training it is very easy. Nevertheless, the sensitivity of this recognizer to non-uniform scaling makes it less appealing in cases where topology is more important than geometry. For instance, if the goal is to recognize cantilever beams irrespective of their thickness and length, this recognizer would not be a good candidate for the task. Additionally, among the three, this recognizer is the most computationally expensive.

The feature-based and graph-based recognizers are best suited to recognizing symbols with small details. Because it quantizes input symbols into templates, the image-based recognizer can miss small details. The other two recognizers, however, preserve such details, allowing them to distinguish between symbols with subtle differences. For instance, consider the two rectangular beams shown in Figure 6.1.

1. A description of the segmentation process is given in Section 8.1.


Figure 6.1: Two beams, one with two supports and the other with three.

The first beam is supported at its two ends by a pair of small rectangular supports. The second beam has an additional third support at its midpoint, giving it more rigidity. The image-based recognizer might not be able to distinguish between these two types of beams, as the quantization resolution may not be sufficient to capture the details in the supports. The feature-based and graph-based recognizers, on the other hand, would have no problem distinguishing between the two. This is primarily because these two recognizers take into account the number of line and arc primitives fitted to the raw strokes. Hence, they can better distinguish between symbols in which the number of such segments plays an important role.

Between the feature-based and the graph-based recognizers, however, the graph-based recognizer provides a more complete description of the symbols by taking into account both topology and geometry. For instance, in Figure 6.1, if the middle support of the beam on the right were moved slightly to the left or to the right, the graph-based recognizer would detect the change. However, the feature-based recognizer would not be able to do so, since the properties of the extracted features would remain the same (these features are described in detail in Chapter 8). From a computational cost perspective, however, the feature-based recognizer is much faster than the graph-based recognizer. Additionally, the feature-based recognizer is easier to train than the graph-based recognizer, as the latter requires all training examples to be drawn with the same sequence of pen strokes.

Figure 6.2 tabulates several key features of the three recognizers. A qualitative comparison in terms of two performance metrics is also provided. Discussions of the individual accuracies of the recognizers are provided in the following chapters.


[Figure 6.2 is a table comparing the image-based, feature-based and graph-based recognizers on six attributes (trainable, rotation invariant, size invariant, non-uniform scale invariant, requires segmentation, handles overtracing) and two performance metrics (ease of trainability, recognition speed).]

Figure 6.2: A comparative illustration of the attributes and performance metrics of the three symbol recognizers. The performance metrics are based on a maximum of five stars, five being the best.

Chapter 7
Image-Based Symbol Recognition

This chapter describes a trainable, hand-drawn symbol recognizer based on a multilayer recognition scheme. Symbols are internally represented as binary templates. An ensemble of four different classifiers compares and ranks definition symbols according to their similarity to the unknown symbol. The scores of the individual classifiers are aggregated to produce a combined score for each definition. The definition with the best combined score is assigned to the unknown symbol. All four classifiers use template-matching techniques to compute the similarity (and dissimilarity) between symbols. Ordinarily, template matching is sensitive to rotation, and existing solutions for rotation invariance are too expensive for interactive performance. We have developed a fast technique that uses a polar coordinate representation to achieve rotational invariance. This technique is applied prior to the multi-classifier recognition step to determine the best alignment of the unknown with each definition. One advantage of this technique is that it filters out the bulk of unlikely definitions, thereby reducing the number of definitions the multi-classifier recognition step must consider. In the following paragraphs, we give a brief overview of the main characteristics of this recognizer, followed by a description of its underlying architecture.

Template representation: Our recognizer uses an image-based recognition approach. Input symbols are internally described as down-sampled bitmap images which we call “templates.” This representation has a number of desirable characteristics.


Figure 7.1: Examples of symbols correctly recognized by our system (at the time of the test, the database contained 104 definition symbols). The top row shows symbols used in training, and the bottom row shows correctly recognized test symbols.

With our approach, over-tracing, missing or extra pen strokes, different line styles, or variations in angular orientation do not pose difficulty. First, segmentation – the process of decomposing the sketch into constituent primitives such as lines and curves – is eliminated entirely. Many of the traditional recognition approaches, such as graph-based1 and feature-based2 methods, rely heavily on the segmentation process, making them vulnerable to minor segmentation errors. Second, our system is well suited for recognizing “sketchy” symbols such as those shown in Figure 7.1. Lastly, symbols drawn with multiple strokes or varying drawing orders do not pose difficulty. Many of the existing recognition approaches have either relied on single-stroke methods, in which an entire symbol must be drawn in a single stroke [63, 10, 43], or constant drawing order methods, in which two similarly shaped patterns are considered different unless the pen strokes leading to those shapes follow the same sequence [60, 75, 59].

Learning from a Single Example: There has been a large body of work concerning character and numeral recognition.

1. In graph-based methods, the basic geometric primitives obtained after segmentation are assembled into a graph structure that encodes the intrinsic attributes of the primitives and their spatial relationships.
2. In feature-based methods, various aspects of the patterns are quantified and encoded in a feature space that helps distinguish between different patterns. In sketch recognition, the features are often geometric, and commonly involve the line and arc primitives obtained from segmentation.


Most systems have traditionally been built on statistical learning methods [51, 56, 30] that require large amounts of training data. For instance, LeCun's [51] neural network recognizer for handwritten digits, one of the best in its class, uses a total of 60,000 patterns for training purposes. However, due to the need for large training sets, these systems are not easily extensible to new applications with novel symbols and shapes. One of the principal goals of this recognizer was to enable users to create, extend and update their own library of symbols without the need for extensive training. To this end, we designed our system to work from single prototype examples. For training, the user creates a new symbol definition by drawing a single example. With this approach, users can seamlessly train new symbols, and remove or overwrite existing ones on the fly, without having to leave the main application. An additional advantage is that, unlike many statistical and neural network approaches, the existing symbols do not need to be retrained or adjusted upon the introduction of a new symbol. While the primary advantage of our recognizer is its ability to work from single training examples, it is easily extended to work from multiple training examples. For this, the user simply adds multiple, different definition templates to the database for each type of symbol. During our user studies, we observed this approach to noticeably improve the recognition accuracy at the expense of a minor increase in recognition times.

Multiple Classifiers: During our studies we experimented with a variety of classification methods and found that no single method was adequate for hand-drawn shapes. However, recognition accuracy increased dramatically when classifiers were used in combination. Inspired by this observation, we designed a recognition scheme comprised of four classifiers. Our tests indicated that the combined scheme usually outperforms the individual classifiers, and is always better than the worst-performing classifier. In fact, we have frequently encountered cases in which the combined scheme produced the right result even though none of the classifiers ranked the true class at the top. These findings are consistent with a large body of evidence that supports the idea of multiple classifiers for recognition [2, 35, 38, 42].


Achieving Rotation Invariance Efficiently: Template matching is ordinarily sensitive to rotations. Therefore, patterns must be brought to the same orientation before template matching is applied. In many cases, this is achieved by incrementally rotating one pattern relative to the other until the best correspondence is obtained. However, this approach is too expensive for real-time applications due to the costly rotation operation. We developed a technique, based on polar coordinates, to greatly expedite this process. The technique is based on the fact that rotations in screen coordinates become translations in polar coordinates. Hence, finding the optimal rotational alignment in screen coordinates reduces to determining the shift between patterns in polar coordinates. As we shall describe later, this technique is conceptually similar to the cross-correlation operation in signal processing.

Two-Step Recognition: We use the results of the polar analysis not only to determine the best alignment angles but also as a tool to filter out unlikely matches before recognition. We have found that the similarity metric employed in polar coordinates gives a reasonable estimate of the match in screen coordinates. Specifically, we found that although the analysis in polar coordinates may sometimes judge two dissimilar patterns to be similar, it almost never misses a true match when there is one. Taking advantage of this property, we designed a two-phase recognition scheme that first eliminates the bulk of the unlikely matches in polar coordinates, followed by a detailed evaluation of the reduced set of candidates in screen coordinates.

System Architecture: The recognition architecture consists of four sequential layers, as shown in Figure 7.2. The first step is preprocessing, where the input symbols are cropped, size-normalized and quantized into templates. If the system is in training mode, the template becomes a definition and is added to the database of existing definitions. If the system is in recognition mode, the template is passed to the next stage where it is matched against the definitions. In the first step of recognition, the unknown symbol is transformed into a polar coordinate representation, which allows the program to efficiently determine which orientation of the unknown best matches a given definition. During this process, definitions that are found to be markedly dissimilar to the unknown are pruned out and the remaining ones are kept for further analysis. In the second step, recognition switches to screen coordinates, where the surviving definitions are analyzed in more detail using an ensemble of four different classifiers.

[Figure 7.2 depicts the recognition architecture as a pipeline: raw ink is preprocessed; in training mode the resulting template is stored as a definition, while in recognition mode it undergoes polar analysis (angular alignment and pre-elimination) followed by template matching in screen coordinates with the Hausdorff distance, modified Hausdorff distance, Tanimoto coefficient and Yule coefficient, whose outputs are combined to produce the result.]

Figure 7.2: Recognition Architecture.

Each classifier outputs a list of definitions ranked according to their similarity to the unknown. In the final step of recognition, the results of the individual classifiers are pooled together to produce the recognizer's final decision.

As shown in Figure 7.2, the analysis in polar coordinates precedes the analysis in screen coordinates. However, for the sake of presentation, we have found it useful to begin the discussion with our template representation and the four template matching techniques, since some of those concepts are necessary to set the context for the analysis in polar coordinates. Hence, in the next few sections we shall assume that the symbols have already been brought into the correct orientation using the polar analysis, and we will defer the details of that analysis until later.

7.1 Preprocessing and Representation

Input patterns are internally represented as binary bitmap images that consist of the (x, y) coordinates collected from the digitizing tablet. However, in this form, the input image usually contains too many data points, which can hinder recognition performance. To facilitate recognition, the initial image is framed and down-sampled into a 48X48 square grid, producing a rasterized image we call a “template.” This quantization significantly reduces the amount of data to consider while preserving the patterns' distinguishing characteristics.

To frame the image, a bounding box aligned with the screen axes is first constructed. The shortest dimension of the bounding box is then expanded, without changing the location of the box's center, to produce a square. The result is that the symbol appears centered in a square frame, but does not necessarily fill the entire frame. This representation preserves the original aspect ratio so that one can distinguish between, say, a circle and an ellipse. Figure 7.3 shows examples of quantized symbols.
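As a rough illustration of the framing and quantization steps (not the actual implementation; for simplicity it rasterizes only the sampled points rather than the ink between them):

```python
import numpy as np

def make_template(points, grid=48):
    """Quantize a symbol (a sequence of (x, y) points) into a grid x grid
    binary template. The bounding box is expanded to a square about its
    center, preserving the symbol's aspect ratio."""
    pts = np.asarray(points, dtype=float)
    xmin, ymin = pts.min(axis=0)
    xmax, ymax = pts.max(axis=0)
    side = max(xmax - xmin, ymax - ymin) or 1.0
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    # map points into the square frame, then into grid cells
    cols = ((pts[:, 0] - (cx - side / 2)) / side * (grid - 1)).round().astype(int)
    rows = ((pts[:, 1] - (cy - side / 2)) / side * (grid - 1)).round().astype(int)
    template = np.zeros((grid, grid), dtype=bool)
    template[rows.clip(0, grid - 1), cols.clip(0, grid - 1)] = True
    return template
```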


Figure 7.3: Examples of symbol templates. Left: a mechanical pivot, Middle: ‘a’, Right: ‘8’. The templates are shown on 24X24 grids to better illustrate the quantization.

7.2 Template Matching Using Multiple Classifiers

Template matching, in its simplest form, can be described as the process of superimposing two digital images and applying a measure of similarity. It has found use in many different fields, including face and character recognition, radar signal processing, medical imaging, and object detection and tracking in visual scenes; overviews of existing techniques are presented in [11, 13, 71, 72]. Although individual techniques differ in the way they define similarity, one property common to all template matching methods is that they are inherently “featureless”: the template itself constitutes the input to the recognizer. Hence, there is no need to determine the set of line and arc primitives that best approximates the pattern, or the geometric relationships between these primitives such as parallelism, intersection or containment. In their work on face recognition, Brunelli and Poggio [7] demonstrate the advantage of this simplicity over feature-based methods. Besides eliminating the need for feature extraction, template methods also provide great simplicity and flexibility in training new patterns. In our case, for instance, a new symbol can be easily added to the database by simply drawing one example of it.

While most template-based recognition systems are traditionally designed around a single similarity measure, in this work four different methods are used to enhance recognition accuracy. The first two are based on the Hausdorff distance, which measures the dissimilarity between two point sets. Hausdorff-based methods have been successfully applied to object detection in complex scenes [64, 62, 68]. However, most of the applications have involved the detection or recognition of “rigid” objects, such as those in photographic images or machine-generated text, and only a few researchers have recently considered the use of the Hausdorff distance for hand-drawn pattern recognition: Cheung et al. [9] have used it for character recognition and Miller et al. [57] for digit recognition.


In this work, we apply the Hausdorff distance to the more general problem of rotation-invariant graphical symbol recognition. In particular, as described in the following sections, the Hausdorff distance is applied both in screen coordinates and in polar coordinates. Also, in Section 7.4.2, we introduce a weighted Hausdorff distance method that enables different parts of an image to be emphasized differently during matching, according to a measure of confidence based on prior information about the image.

The other two recognition methods we use are based on the Tanimoto and Yule coefficients. Unlike the Hausdorff methods, these methods measure the similarity between patterns in the form of correlation coefficients. The Tanimoto coefficient is extensively used in chemical informatics applications such as drug testing, where the goal is to identify an unknown molecular structure by matching it against known structures in a database [20, 21]. The Yule coefficient has been proposed as a robust measure for binary template matching [72]. To the best of our knowledge, the Tanimoto and Yule measures have not previously been applied to handwritten pattern recognition. These four classification methods are explained in the following sections, together with the modifications we used to better suit them to hand-drawn symbol recognition.

7.2.1 Hausdorff Distance

The Hausdorff distance between two point sets A and B is defined as:

$$H(A, B) = \max(h(A, B),\, h(B, A))$$

where

$$h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert \qquad \text{and} \qquad h(B, A) = \max_{b \in B} \min_{a \in A} \lVert a - b \rVert$$


Figure 7.4: Illustration of directed Hausdorff distances. The Hausdorff distance is the maximum of the two directed distances, in this case h(A, B).

$\lVert a - b \rVert$ represents a measure of distance (we use the Euclidean distance) between two points a and b. $h(A, B)$ is referred to as the directed Hausdorff distance from A to B and corresponds to the maximum of all the distances one can measure from each point in A to the closest point in B. The intuitive idea is that if $h(A, B) = d$, then every point in set A is at most distance d away from some point in B. $h(B, A)$ is the directed distance from B to A and is computed in a similar way. Note that in general $h(A, B) \neq h(B, A)$, an example of which is shown in Figure 7.4. The Hausdorff distance is defined as the maximum of the two directed distances.

In its original form, the Hausdorff distance is too sensitive to outliers. As shown in Figure 7.4, if the points in the two sets are mostly proximate except for one outlier, the Hausdorff method will fail to detect the noticeable correspondence between the point sets because ultimately the distance will be dictated by the outlier. In pattern recognition, such difficulties often arise in the presence of noise and occlusion. A modified version proposed by Rucklidge [64], called the partial Hausdorff distance, eliminates this problem by ranking the points in A according to their distances to points in B in descending order, and assigning the distance of the $k^{th}$-ranked point as $h(A, B)$. The partial Hausdorff distance from A to B is thus given by:

$$h_k(A, B) = \underset{a \in A}{\mathrm{k^{th}}} \min_{b \in B} \lVert a - b \rVert$$


For example, for k = 3, $h_3(A, B)$ would ignore the two points in A that are the farthest from points in B, and would instead correspond to the point in A that is the third farthest from a point in B. $h_k(B, A)$ is calculated similarly. The partial Hausdorff distance, in effect, softens the distance measure by discarding points that are maximally far away from the counterpart point set. The results reported in the following sections are based on a rank of 6%, i.e., in the calculation of the directed distances, the most distant 6% of the points are ignored. We determined this cutoff value empirically based on experience with our system.

Although we use the partial Hausdorff distance instead of the original Hausdorff distance in our implementation, in the remainder of this chapter we will refer to the partial Hausdorff distance as simply the Hausdorff distance for notational convenience. In fact, in the literature these two terms are often used interchangeably, as the Hausdorff distance is rarely used in its original form. However, we shall distinguish between the Hausdorff distance and another version called the modified Hausdorff distance, which is the subject of the next section.

Whether it is based on the maximum or the $k^{th}$-ranked directed distance, the calculation of $h(A, B)$ involves computing, for each point in A, the distance to the nearest point in B. This process can be greatly expedited by using what is called the distance transform. The main idea is to pre-compute all necessary distances only once during the training phase, allowing any distance of interest to be obtained via simple indexing during recognition. In our system, we have found the distance transform to accelerate the computation of the Hausdorff distance by a few orders of magnitude. A brief explanation of the distance transform and its utility in template matching can be found in Section 7.2.5, following the discussion of the similarity measures.
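A minimal sketch of the partial Hausdorff computation follows. It uses brute-force pairwise distances where the actual system queries precomputed distance maps, and it exposes the 6% rank cutoff as a parameter; the function names are ours:

```python
import numpy as np

def directed_partial(A, B, rank=0.06):
    """Directed partial Hausdorff distance h_k(A, B). A and B are (n, 2)
    arrays of black-pixel coordinates; the most distant `rank` fraction of
    points in A is ignored (the 6% cutoff used above)."""
    # brute-force nearest distances; the actual system reads these from
    # B's precomputed distance map instead
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min(axis=1)
    d.sort()
    return d[int((1.0 - rank) * (len(d) - 1))]

def partial_hausdorff(A, B, rank=0.06):
    return max(directed_partial(A, B, rank), directed_partial(B, A, rank))

# Replacing the ranked value with d.mean() yields the directed modified
# Hausdorff distance of the next section.
```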

7.2.2 Modified Hausdorff Distance

Dubuisson and Jain [17] proposed the modified Hausdorff distance (MHD), which replaces the max operator in the directed distance calculation with the average of the distances:

$$h_{mod}(A, B) = \frac{1}{N_a} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert$$

where $N_a$ is the number of points in A.

The modified Hausdorff distance is then defined as the maximum of the two directed average distances:

$$MHD(A, B) = \max(h_{mod}(A, B),\, h_{mod}(B, A))$$

Although $h_{mod}(A, B)$ may appear similar to $h_k(A, B)$ with k = 50%, the difference is that the former corresponds to the mean directed distance while the latter corresponds to the median. Dubuisson and Jain [17] argue that for object matching purposes, the average directed distance is more reliable than the partial directed distance because, as the noise level increases, the former degrades gracefully whereas the latter exhibits a pass/no-pass behavior. We again use the distance transform to facilitate the distance calculations. The modified Hausdorff distance is slightly more computationally efficient than the regular Hausdorff distance, as the need for sorting minimum distances is eliminated.

Figure 7.5: Schematic illustration of the overlap between two patterns A and B. The numbers of image pixels are denoted by $n_a$ and $n_b$, respectively. The intersection denotes the number of overlapping black pixels, $n_{ab}$; $n_{00}$ denotes the number of overlapping white pixels (i.e., background pixels) in the two patterns.

7.2.3 Tanimoto Coefficient

The Tanimoto coefficient between two binary images A and B is defined as:

$$T(A, B) = \frac{n_{ab}}{n_a + n_b - n_{ab}}$$

where $n_a$ is the total number of black pixels in A, $n_b$ is the total number of black pixels in B, and $n_{ab}$ is the number of overlapping black pixels.


Intuitively, $T(A, B)$ specifies the number of matching points in A and B, normalized by the union of the two point sets (Figure 7.5). By definition, $T(A, B)$ yields values between 1.0 (maximum similarity) and 0.0 (minimum similarity).

In the form given above, the similarity between two images is based solely on the matching black points. However, for images that contain mostly black pixels, the discrimination power of $T(A, B)$ may vanish. In such situations, the coincidence of white pixels can be used as a measure of similarity:

$$T^C(A, B) = \frac{n_{00}}{n_a + n_b - 2\,n_{ab} + n_{00}}$$

where $n_{00}$ is the number of matching white pixels. The denominator is the number of pixels that are white in at least one of the images. $T^C(A, B)$ is called the Tanimoto coefficient complement. It represents the number of matching white pixels normalized by the union of the white pixels from the two images. The two expressions can be combined to form the Tanimoto similarity coefficient [20]:

$$T_{sc}(A, B) = \alpha \cdot T(A, B) + (1 - \alpha) \cdot T^C(A, B)$$

where $\alpha$ is a weighting factor between 0.0 and 1.0. Ideally, if the number of black pixels in an image is small compared to the number of white pixels, the similarity decision should be based on matching black pixels. In this case, $T(A, B)$ should be emphasized by means of a large $\alpha$. In the converse case, similarity should be based on matching white pixels, which means $T^C(A, B)$ should be emphasized by means of a small $\alpha$. This effect can be achieved by linking $\alpha$ to the relative number of black pixels as follows:

$$\alpha = 0.75 - 0.25 \cdot \left( \frac{n_a + n_b}{2 \cdot n} \right)$$

where n is the image size in pixels. The term in parentheses is the total number of black pixels divided by the total number of pixels in the two images. The form of this relationship is adapted from [20] such that $\alpha$ is small when the number of black pixels is high, and vice versa. We selected the two constants in the equation so that $\alpha$ is generally high, in the range [0.5, 0.75] to be precise. This bias favors $T(A, B)$ over $T^C(A, B)$.


Figure 7.6: When determining the coincident pixels in the Tanimoto and Yule coefficients, a threshold of 4.5 pixels is used to take into account variations in the patterns. The figure shows the boundaries of the admissible region around an image pixel at the center.

The choice is justified by the fact that hand-drawn symbols usually consist of thin lines (unless excessive over-tracing is done), producing rasterized images that contain fewer black pixels than white. Hence, for our applications, the Tanimoto coefficient should be controlled more by $T(A, B)$ than by $T^C(A, B)$.

Similarity measures that are based exclusively on the number of overlapping pixels, such as the Tanimoto coefficient, often suffer from slight misalignments of the rasterized images. We have found this problem to be particularly severe for hand-drawn patterns, where rasterized images of ostensibly similar shapes are almost always disparate, either due to differences in shape or, more subtly, due to differences in drawing dynamics. The latter commonly occurs as a result of irregular drawing speed, often manifesting itself as unevenly sampled digital ink. Hence, for two shapes drawn at different speeds, the resulting rasterized images will likely exhibit differences. In order to absorb such variations during matching, we use a thresholded matching criterion that considers two pixels to be overlapping if they are separated by a distance less than 1/15th of the image's diagonal length. For a 48X48 image grid, this translates into 4.5 pixels, i.e., two points are considered to be overlapping if the distance between them is less than 4.5 pixels. As shown in Figure 7.6, this threshold defines a small neighborhood around each image pixel such that any pixel that falls in this region is considered to be coincident. To apply this criterion efficiently, once again distance transforms are used such that simple queries immediately indicate whether there exists a corresponding point in a specified region.


7.2.4 Yule Coefficient

The Yule coefficient, also known as the coefficient of colligation, is defined as:

$$Y(A, B) = \frac{n_{ab} \cdot n_{00} - (n_a - n_{ab}) \cdot (n_b - n_{ab})}{n_{ab} \cdot n_{00} + (n_a - n_{ab}) \cdot (n_b - n_{ab})}$$

where the term $(n_a - n_{ab})$ corresponds to the number of black pixels in A that do not have a match in B. Similarly, $(n_b - n_{ab})$ is the number of black pixels in B that do not find a match in A. $Y(A, B)$ produces values between 1.0 (maximum similarity) and -1.0 (minimum similarity). Unlike the original form of the Tanimoto coefficient, the Yule coefficient simultaneously accounts for the matching black and white pixels via the terms $n_{ab}$ and $n_{00}$. However, like the Tanimoto coefficient, it is sensitive to slight misalignments between patterns for the reasons explained above. A thresholded matching criterion is thus employed, similar to the one we use with the Tanimoto method. Tubbs [72] originally employed this measure for generic, noise-free binary template matching problems. By using a threshold, we have made the technique useful when there is considerable noise, as is the case with hand-drawn shapes.
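For concreteness, the following sketch computes the Tanimoto similarity coefficient, the $\alpha$ weighting, and the Yule coefficient from the overlap counts defined above. It uses exact pixel overlap; in the actual system, the thresholded 4.5-pixel criterion would replace the logical AND:

```python
def similarity_coefficients(A, B):
    """Tanimoto similarity coefficient and Yule coefficient for two
    equal-size boolean (numpy) templates A and B."""
    n = A.size
    na, nb = int(A.sum()), int(B.sum())
    nab = int((A & B).sum())            # coincident black pixels
    n00 = int((~A & ~B).sum())          # coincident white pixels
    T = nab / (na + nb - nab)
    Tc = n00 / (na + nb - 2 * nab + n00)
    alpha = 0.75 - 0.25 * (na + nb) / (2 * n)   # favors T for thin-line symbols
    Tsc = alpha * T + (1 - alpha) * Tc
    Y = (nab * n00 - (na - nab) * (nb - nab)) / \
        (nab * n00 + (na - nab) * (nb - nab))
    return Tsc, Y
```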

7.2.5 Distance Transform

A distance transform can be simply described as a nearest-neighbor function. It is a morphological operation that converts a binary bitmap image into an image in which each pixel encodes its distance to the nearest black pixel in the same image. The resulting image is called a distance map. Depending on the task, different metrics can be used in the distance calculations, resulting in different maps. The most frequently used metrics are the Euclidean distance, the Manhattan distance and the chessboard distance. In this work, the Euclidean distance is used. Figure 7.7 illustrates the idea on a 10X10 image. As shown, the distance map values start at 0.0 for pixels that are themselves black and increase as one moves farther into the white space. For example, the distance transform value of the upper-left white pixel is 5 because the nearest black pixel is 5 units away.


Figure 7.7: (a) A checkmark symbol quantized into a 10X10 template. (b) The corresponding distance transform matrix.

By analogy to the Voronoi diagram, distance maps are often viewed as quantized Voronoi surfaces [64]. During matching, distance maps serve as look-up tables for the closest distances. In the Hausdorff distance, for instance, the directed distance $h(A, B)$ is found by superimposing template A on the distance transform map of B. Minimum distances are simply read from B's distance map using the points in A as query indices. $h(A, B)$ then becomes the maximum (or the $k^{th}$-ranked maximum) of the queried distances. $h(B, A)$ is computed in a similar way, except that the points in B are used as indices to query A's distance map. In this system, distance maps are constructed along with the symbol templates during preprocessing.

Numerous algorithms have been proposed for the fast computation of distance transforms (see [14] for an extended overview). These algorithms can be broadly classified as exact methods and approximate methods. Naturally, the approximate methods are computationally less demanding than the exact methods. We experimented with both types and found the exact methods to be more suitable despite some loss in efficiency. Nevertheless, because the distance transform of each definition symbol is computed only once, immediately after the user presents the symbol, the difference in performance is unnoticeable. Furthermore, the only distance transform that needs to be calculated during recognition is the unknown's, since the transforms of the definitions are already available. The computational burden of distance transform calculations is therefore not significant in our system.

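As an illustration of the distance map as a look-up table, the sketch below uses SciPy's exact Euclidean distance transform (the function and array names are ours; the actual implementation may differ):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_map(template):
    """Distance map of a boolean template: each cell holds the Euclidean
    distance to the nearest black (True) pixel. SciPy measures distances
    to the nearest zero, so the template is inverted first."""
    return distance_transform_edt(~template)

def queried_distances(A, dmap_B):
    """For every black pixel of A, look up the distance to the nearest
    black pixel of B in B's precomputed distance map."""
    rows, cols = np.nonzero(A)
    return dmap_B[rows, cols]

# h(A, B) for the partial Hausdorff distance is then the k-th ranked value
# of queried_distances(A, distance_map(B)); the MHD uses the mean instead.
```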

7.3 Combining Classifiers

The recognizer compares the unknown symbol to each of the definitions using the four classifiers explained above. The outcome of this process is four separate lists in which the definition symbols are ranked according to their similarity (or dissimilarity) to the unknown symbol. The next step in recognition is to identify the true class of the unknown by synthesizing the results of the component classifiers. One key to this lies in the pragmatic evidence that, although individual classifiers may not perform perfectly, they usually rank the true class highly, and tend to misclassify differently [2]. Hence, by advocating definitions ranked highly by all classifiers while suppressing those that are not, it is often possible to sift the true class from a crowd of false positives. This idea forms the basis of all recognition systems that employ multiple classifiers. The main effort goes into the design of the so-called “combination function” that can accomplish this using the information produced by the constituent classifiers.

Combination functions can be grouped under two categories based on the classification information they utilize. Simple voting schemes, such as majority vote, Borda count, and highest rank [35], consider solely the ordinal rankings of the classes and ignore the numerical scores leading to those rankings. Ho [35] argues that this abstraction is advantageous in many-class problems and points out that numerical scores such as distances to prototypes, values of arbitrary discriminant functions, or confidence measures may not be directly usable due to the incomparability of their scales. A downside of these methods is that they are effective only when both the number of classes and the number of classifiers are particularly high. Otherwise, these methods frequently result in ties and therefore lead to inconclusive decisions [34]. On the other hand, methods such as voting, mixture of experts, stacked generalization, boosting, bootstrapping, cascading [2], and the sum, product, max, min and median rules [45] are designed to take into account the numerical scores assigned to the classes, and are less concerned with the actual class rankings. In these methods, class scores are usually expressed as a posteriori probability values and are thus immune to the problem of incomparable scales.


In this work, we use an approach similar to the sum rule introduced by Kittler et al. [45]. With their approach, each class receives a combined probability score, which is the sum of the a posteriori probabilities from the individual classifiers and the prior probabilities of the classes. The unknown pattern is then assigned to the class with the maximum combined score. One convenient feature of this method is that a definition is promoted only if it is ranked highly by most of the classifiers. False positives are therefore attenuated if they do not find support from the majority of the classifiers. Kittler et al.'s experimental and theoretical analyses show the sum rule to be superior to the product, max, min and median rules. Their findings are particularly important because their rules are generalizations of many existing methods.

In this work, we use the sum rule with some modifications. In the original sum rule, because all class scores are expressed as probability values, the outputs of different classifiers are directly comparable. The four classifiers explained above, however, do not produce directly comparable results. First, as noted previously, the Tanimoto and Yule coefficients are measures of similarity whereas the Hausdorff methods are measures of dissimilarity. The numerical scores assigned by these two groups thus have opposite interpretations and are consequently incomparable. Second, the four methods have dissimilar ranges. For instance, while the Tanimoto coefficient produces values between 0 and 1, the Yule coefficient ranges between -1 and 1, and the Hausdorff distances range between 0 and, theoretically, infinity.

Before the results of the classifiers are combined, it is thus necessary to establish a congruent ranking scheme for the four classifiers. For this, we first transform the Tanimoto and Yule similarity coefficients into distance measures and then normalize the values of all four classifiers to the range 0 to 1. These two processes are referred to as parallelization and normalization.


7.3.1 Parallelization

To facilitate the discussion, let M denote the number of definitions, R denote the number of classifiers, and $d_r^m$ denote the score classifier r assigns to definition m. Here, $r \in \{$Hausdorff, Modified Hausdorff, Tanimoto, Yule$\}$ and m is any definition symbol in the database. The Tanimoto and Yule coefficients are transformed into dissimilarity measures by reversing their values as follows: for $m = 1, \dots, M$,

$$d_{Tanimoto}^m \leftarrow 1.0 - d_{Tanimoto}^m \qquad d_{Yule}^m \leftarrow 1.0 - d_{Yule}^m$$

This process brings the Tanimoto3 and Yule coefficients in parallel with the Hausdorff measures in the sense that the algebraic scores of all classifiers now increase with increasing dissimilarity. Although the choice is arbitrary, we chose the constant 1.0 such that the new scores obtained from both classifiers are zero at maximum similarity, making the transformed coefficients analogous to distance values.

7.3.2 Normalization

After parallelization, all classifiers become measures of distance but still remain incompatible due to differences in their ranges. To establish a unified scale among the classifiers, we use a linear transformation function that converts the original distances into normalized distances. For this, we first find the smallest and largest values observed for each of the four classifiers:

$$\text{minscore}_r = \min_{k=1}^{M} d_r^k \qquad \text{maxscore}_r = \max_{k=1}^{M} d_r^k$$

The normalized distance $\bar{d}_r^m$ for definition m under classifier r is then defined as:

$$\bar{d}_r^m = \frac{d_r^m - \text{minscore}_r}{\text{maxscore}_r - \text{minscore}_r}$$

3. This modified version of the Tanimoto coefficient is also known as the Soergel distance [74].

This transformation maps the distance scores of each classifier to the range [0,1] while preserving the relative order established by that classifier.

7.3.3 Combination Rule

Having standardized the outputs of the four classifiers by parallelization and normalization, the results can now be combined. For each definition m, we define a combined normalized distance $D^m$ by summing the normalized distances computed by the constituent classifiers:

$$D^m = \sum_{r=1}^{R} \bar{d}_r^m$$

Finally, the unknown pattern is assigned to the class having the minimum combined normalized distance. The decision rule is thus: assign the unknown symbol to definition $m^*$ if

$$m^* = \underset{m}{\operatorname{argmin}}\; D^m$$
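The parallelization, normalization and sum-rule steps amount to only a few lines. In the sketch below (names ours), the similarity coefficients are assumed to have already been reversed into distances:

```python
import numpy as np

def classify(distances):
    """Sum-rule combination. `distances` maps each classifier name to an
    array of scores over the M definitions; the Tanimoto and Yule
    similarities are assumed to have already been parallelized into
    distances (1.0 - coefficient). Returns the winning definition index."""
    combined = None
    for d in distances.values():
        lo, hi = d.min(), d.max()
        norm = (d - lo) / (hi - lo) if hi > lo else np.zeros_like(d)
        combined = norm if combined is None else combined + norm
    return int(np.argmin(combined))
```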

7.4 Handling Rotations

Template matching techniques are sensitive to orientation. Therefore, for rotation-invariant recognition, it is necessary to first rotate the patterns into the same orientation. Often this is accomplished by incrementally rotating one pattern relative to the other until the best alignment is achieved. However, this is prohibitively expensive for real-time applications due to the costly rotation operation. We have developed an efficient technique, based on the polar coordinate transformation, to greatly facilitate this process. The main idea is that rotations in Cartesian coordinates become translations in polar coordinates. Hence, by identifying the linear offset between two patterns in polar coordinates, one can determine the angle by which the patterns differ in the x-y plane. This process is conceptually analogous to the cross-correlation operation used in signal processing. Cross-correlation determines if a signal resembles a time-shifted version of another one. It does so by incrementally sliding one signal along the other while taking their dot product at every step. The end result is a correlation value indicating the similarity of the two signals and the temporal delay between them. In our case, finding the optimal angle is equivalent to determining the delay. We begin the discussion by introducing the polar transform.


Figure 7.8: (a) Left: Letter ‘P’ in screen coordinates. Right: in polar coordinates. (b) When the letter is rotated in the x-y plane, the corresponding polar transform shifts parallel to the θ axis.

7.4.1 Polar Transform

The polar coordinates of a point in the x-y plane are given by the point's radial distance, r, from the origin and the angle, θ, between that radius and the x axis. The well-known relations are:

$$r = \sqrt{(x - x_o)^2 + (y - y_o)^2} \qquad \text{and} \qquad \theta = \tan^{-1}\left(\frac{y - y_o}{x - x_o}\right)$$

where $(x_o, y_o)$ is the origin.


A symbol originally drawn in the screen coordinates (x-y plane) is transformed into polar coordinates by applying these formulae to each of the points. Figure 7.8a illustrates a typical transformation. As shown in Figure 7.8b, when a pattern is rotated in the x-y plane, the corresponding polar image slides parallel to the θ axis by the same angular displacement.

In the form given above, the polar transformation is sensitive to size. When a pattern is scaled in the x-y plane, the corresponding polar image stretches along the r axis. To eliminate such variance, we first normalize the r axis using the “ink length” of the symbol. Ink length is defined as the total distance the pen tip travels on the writing surface. The main reason for using the ink length as opposed to, say, the diagonal or perimeter of the bounding box, is that ink length is invariant to the orientation of the pattern while the bounding box properties are not. With this normalization, the values along the r axis correspond to non-dimensional scale factors. No adjustment to the θ axis is necessary, as uniform scaling does not affect angular positions.

For polar transforms, the choice of the origin has a significant impact on the resulting image. Although a fixed origin, such as always the top-left corner of the drawing tablet, would theoretically seem appropriate, there are two important practical factors to consider. First, it is desirable to set the origin inside the image, preferably close to the image center, so that the θ range can be utilized to its full extent. If the origin is far away from the image, for instance at one of the screen corners, the polar image will subtend only a narrow angular window, compressing many important details. Second, identical shapes should have identical origins so that their polar transforms are identical. A seemingly suitable (in fact a common) choice for the origin is thus the centroid of the image pixels, given by:

$$x_c = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \text{and} \qquad y_c = \frac{1}{N} \sum_{i=1}^{N} y_i$$

where N is the number of data points in the original image. In this formulation, because the centroid is the average of the sampled points, it has a tendency to drift towards regions containing dense pixel clusters. In our domain, however, variation in the pen speed often causes large variations in the pixel density.


For example, the significant reduction of the pen speed at the beginnings and ends of strokes causes pixels to be denser in these regions compared to the rest of the stroke. This phenomenon abnormally and unpredictably alters the centroid location, causing the polar transforms of similar shapes to be different. To prevent this, we compute the weighted centroid of the line segments that join pairs of consecutive points. Each segment is assigned a weight proportional to its length, and the new centroid becomes the mean of the weighted segments:

$$x_c = \frac{\sum_{i=1}^{S} x_i \cdot l_i}{\sum_{i=1}^{S} l_i} \qquad \text{and} \qquad y_c = \frac{\sum_{i=1}^{S} y_i \cdot l_i}{\sum_{i=1}^{S} l_i}$$

where S is the total number of segments and $l_i$ is the length of the $i^{th}$ segment. Each segment is treated as a point-mass concentrated at the segment's center, denoted $(x_i, y_i)$. This approach attenuates the effect of short segments and therefore prevents dense pixel clusters from shifting the centroid. As a result, the centroid location becomes more stable over different examples of a pattern.

Without loss of generality, we take advantage of the 2π-periodicity of the θ axis and position all polar images in a window extending from -π to +π. This limits the amount of search necessary to find the best alignment of two polar images. Figure 7.9 illustrates the idea. The top figure shows the initial polar coordinate representation of the rotated ‘P’ from Figure 7.8. Mapping the angles to the range -π to +π produces the result shown in the bottom of Figure 7.9. Effectively, the points to the right of the +π boundary were moved to the right of the -π boundary.
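A minimal sketch of the transform as described, with the length-weighted segment centroid as the origin and r normalized by ink length (names ours):

```python
import numpy as np

def weighted_centroid(points):
    """Centroid of the stroke's line segments, each weighted by its length,
    so dense point clusters from slow pen motion do not bias the origin."""
    p = np.asarray(points, dtype=float)
    seg_lengths = np.linalg.norm(p[1:] - p[:-1], axis=1)
    seg_centers = (p[1:] + p[:-1]) / 2.0
    return (seg_centers * seg_lengths[:, None]).sum(axis=0) / seg_lengths.sum()

def polar_transform(points):
    """Map (x, y) points to (theta, r): r is normalized by the total ink
    length, and theta lies in (-pi, +pi]."""
    p = np.asarray(points, dtype=float)
    origin = weighted_centroid(p)
    ink_length = np.linalg.norm(p[1:] - p[:-1], axis=1).sum()
    d = p - origin
    r = np.hypot(d[:, 0], d[:, 1]) / ink_length
    theta = np.arctan2(d[:, 1], d[:, 0])
    return theta, r
```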

7.4.2  Finding the Optimal Alignment Using Polar Transform

To find the angular offset between two polar images, we use a slide-and-compare algorithm in which one image is incrementally displaced along the θ axis. At each displacement, the two images are compared to determine how well they match.

The displacement that results in the best match indicates how much rotation is needed to best align the original images.

Figure 7.9: Top: Initial polar image of the rotated 'P' from Figure 7.8. Bottom: Same image mapped to the range -π to +π. In effect, the portions overstepping the +π boundary are moved near the -π boundary. The two images are equivalent.

Because the polar images are in fact 2D binary patterns, the template matching techniques from Section 7.2 can be used to match them. In particular, we use the modified Hausdorff distance (MHD), as it is slightly more efficient than the regular Hausdorff distance (the directed distances need not be sorted) and it performs slightly better than the Tanimoto and Yule coefficients in polar coordinates. To use the MHD with polar images, the images are quantized into 48x48 templates, just as is done for screen images (Section 7.1). Here again, we use distance transforms to accelerate the computation of the MHD. Although the idea remains the same, the periodicity of the θ axis introduces a new subtlety into the distance calculations. Unlike in the x-y plane, pixels at the far right and left sides of a polar image (i.e., those close to the -π and +π boundaries) are in fact proximate, as noted in Figure 7.9. For example, a pixel just to the left of +π has a distance of 1 to a pixel that lies on the -π boundary at the same r value.

This periodicity is taken into account when the distance transform is computed. Because this subtlety is considered when computing distances, the distance transforms in polar coordinates need be computed only once, as distances do not change with shifts in the polar plane.

Figure 7.10: The θ coordinate of the polar transform is sensitive to the origin for points near the image center. (a) Letter 'T' and its polar transform. (b) Nearly the same letter except for the curl of the tail. The difference causes a noticeable difference in the polar transform at small values of r.

One difficulty with the polar transform is that data near the centroid of the original image is sensitive to the precise location of the centroid. Consider Figure 7.10, which shows two similar shapes and their polar transforms. In the top image the tail curves slightly to the left, while in the bottom image it curves slightly to the right. This difference causes the image centers to fall on opposite sides of the tail, which in turn leads to significant dissimilarity in the polar transforms at small r values. Naturally, the modified Hausdorff distance is adversely affected by these variations. To alleviate this problem, we introduce a weighting function w(·) that attenuates the influence of pixels near the centroid of the screen image. Using this function, the directed MHD, previously introduced in Section 7.2.2, becomes:

$$h_{mod\text{-}weighted}(A, B) = \frac{1}{N_a} \sum_{a \in A} w(a_r) \cdot \min_{b \in B} \|a - b\|$$

where $a_r$ represents the radial coordinate of point a in the quantized polar image A. The directed distance from B to A, $h_{mod\text{-}weighted}(B, A)$, is calculated analogously, and the maximum of the two directed distances is the MHD between A and B. The weighting function has the form:

$$w(r) = r^{0.10}$$

Figure 7.11: Weighting function used to suppress the sensitivity to the origin.

The weighting function is shown graphically in Figure 7.11; the exponent was determined experimentally for best performance. As shown, the function asymptotes near 1 for large values of r and falls off rapidly for small values of r. By assigning smaller weights to the pixels near the image center, this function allows the Hausdorff distance between the polar images to be governed by the pixels that reside farther from the origin, reducing the sensitivity to the precise location of the centroid of the screen image.

7.5  Polar Transform as a Pre-recognizer

The polar analysis allows us to find the angular difference between two patterns in an efficient way. Once the angle is determined, the patterns can be aligned properly in the x-y plane by a single rotation and compared using the template matching techniques of Section 7.2.


(The rotation is performed before quantizing the screen image to produce a template.) However, before applying these techniques, the matching information from the polar coordinate representation can be used to filter out many of the unlikely definitions. We have found that the degree of match between two polar images provides a reasonable estimate of the match between the original screen images. In fact, were it not for the imprecision of the polar transform at small r values, the entire recognition process could be performed exclusively in the polar plane. The match in polar coordinates discounts data near the centroid of the screen image, which can result in false positive matches (i.e., declaring a close match between two patterns that are in fact dissimilar), but rarely results in false negatives. Thus, the polar analysis can be used as a pre-recognition step to eliminate unlikely definitions. In practice, we have found that the correct definition for an unknown is among the definitions ranked in the top 10% by the polar coordinate matching. Thus, 90% of the definitions are discarded before considering the match in screen coordinates.

This approach is conceptually similar to the cascading presented in [2], where a simple classifier is used to reduce the number of classes before a more complex classifier with a more expensive classification rule is applied. In our case, however, the polar transform serves not only as a pre-elimination step but also as a means of efficiently achieving rotation invariance. This dual functionality of the polar transform has been invaluable for achieving real-time performance on an otherwise computationally demanding task.
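A sketch of the slide-and-compare alignment and the top-10% pruning described above, reusing the helpers from the previous sketch; the data layout (a dict mapping definition names to polar pixel lists) is an assumption for illustration.

```python
def best_angular_offset(unknown, definition):
    """Slide-and-compare: shift the unknown's polar template one theta bin at
    a time and keep the displacement with the lowest weighted MHD."""
    best_score, best_shift = float('inf'), 0
    for shift in range(THETA_BINS):
        shifted = [(r, (t + shift) % THETA_BINS) for (r, t) in unknown]
        score = weighted_mhd(shifted, definition)
        if score < best_score:
            best_score, best_shift = score, shift
    return best_score, 2 * math.pi * best_shift / THETA_BINS

def prune_definitions(unknown, definitions, keep=0.10):
    """Pre-recognition: rank the definitions by their polar match and keep
    only the top fraction for the screen-coordinate comparison."""
    ranked = sorted(definitions,
                    key=lambda d: best_angular_offset(unknown, definitions[d])[0])
    return ranked[:max(1, int(len(ranked) * keep))]
```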

7.6  User Studies

A user study consisting of two separate experiments was conducted to assess the performance of the recognizer. In the first experiment we used a set of 20 graphic symbols. In the second, we used digit recognition as our test bed. Because the participants in our studies had little or no experience using the digitizing tablet and stylus, they were allowed to acquaint themselves with the hardware until they felt comfortable, which typically took about 2 to 3 minutes. Each experimental session involved only data collection; the data was processed at a later time. This approach was chosen to prevent participants from adjusting their drawing style based on our program's output. During data collection, if users were not pleased with what they drew, which occasionally occurred due to an unintentional slip of the stylus, they were allowed to redraw the symbol. However, participants rarely used this option.

Figure 7.12: Symbols used in the graphic symbol recognition experiment.

7.6.1  Graphic Symbol Recognition

Five users participated in the graphic symbol recognition study. Each user was asked to provide three sets of the 20 symbols shown in Figure 7.12, yielding a total of 60 symbols per user. Four different types of tests were conducted using the collected data. The tests differ based on (1) the number of definition symbols used for training, and (2) whether the test was conducted in a user-dependent (i.e., all training data from the given user) or user-independent manner. Below we detail each of these tests and the results.

Test 1: Single definition set, user-dependent: In this test, the recognizer was evaluated separately for each user. Each test consisted of three iterations, akin to the K-fold cross validation technique with K=3. In each iteration, one of the user's three sets of symbols was used for training, and the other two were used for testing. Different iterations employed different test sets. The performance for each user was computed as the average of the three iterations. The first row of Table 7.1 shows the results obtained from this study, averaged over the five users. In this table, the first column shows the recognition accuracy, or the rate at which the class ranked highest by the recognizer is indeed the correct class. We call this the "top-one" accuracy. The second column shows the "top-two" accuracy, or the rate at which the correct class is either the highest or second highest ranked class. The last column shows the average recognition time in milliseconds.

            Top 1 (%)    Top 2 (%)    Recog. Time (ms)
  Test 1      90.7         96.3             332
  Test 2      95.7         98.3             354
  Test 3      94.7         97.3             623
  Test 4      98.0         99.0             674

Table 7.1: Results from the graphic symbol recognition study. The first two columns show the top-one and top-two accuracy, respectively. All tests were conducted on a Pentium 4 machine at 2.0 GHz with 256 MB of RAM.

Test 2: Two definition sets, user-dependent: This test is similar to the first except that, in each of the three runs, two sets of symbols were used for training while the remaining set was used for testing. Hence, during recognition, each unknown was compared to 40 definition symbols (2 definitions per symbol). As shown in the second row of Table 7.1, the additional training set increased the recognition accuracy at the expense of only a minor increase in recognition times.

Test 3: Twelve definition sets, user-independent: The aim of this test was to evaluate the recognizer when the training and test sets belonged to different users. When testing a particular user's data, the training database consisted of all users' symbol sets excluding the data from the user under consideration. In each run, the database thus consisted of a total of twelve sets: three sets from each of the four users not involved in that particular test. In effect, this test mimics a walk-up-and-draw scenario in which the user works directly from a pretrained recognizer without providing his or her own training symbols. The third row of Table 7.1 shows the performance obtained in this setting.

Test 4: Fourteen definition sets, partial user-dependence: The difference between this test and the previous one is that the training database additionally contained two symbol sets from the user being tested, in addition to the twelve sets from the other users. In terms of training sets employed, this experiment is thus a hybrid of Test 2 and Test 3. As shown in the last row of Table 7.1, the top-one accuracy in this case reaches 98%.

            Top 1 (%)    Top 2 (%)    Recog. Time (ms)
  Test 1      95.4         98.3             211
  Test 2      97.7         98.5             225
  Test 3      91.8         95.5             516
  Test 4      97.7         99.2             586

Table 7.2: Results from the digit recognition study. All tests were conducted on a Pentium 4 machine at 2.0 GHz with 256 MB of RAM.

7.6.2  Digit Recognition

Nine users participated in the digit study. Users were asked to provide three sets of digits from "0" to "9", yielding a total of 30 digits per user. Our examination of the collected data revealed that frequent misclassifications occurred due to confusion between "6" and "9", and occasionally between "2" and "7", both of which are reasonable errors given the rotation-invariant nature of our recognizer. As a remedy, we adjusted the recognizer in this study so that the search for the correct orientation of a digit was restricted to ±90° of the digit's original orientation. We believe this restriction on rotation is reasonable for the digit recognition study, as traditional digit recognizers assume a fixed orientation. We conducted the same four tests described in the symbol recognition study. However, with nine rather than five participants, Test 3 now has 24 training sets rather than 12, and Test 4 has 26 rather than 14. Table 7.2 shows the results obtained from this study.

State-of-the-art hand-drawn digit recognition systems achieve recognition rates above 96-97% in user-independent settings [49, 51]. We achieve about 91.8% accuracy in a user-independent setting (with rotation limited to ±90°). Nevertheless, we consider our approach quite attractive given that it works from only a handful of training examples. As one would expect, if the problem is to recognize digits only, it is better to use a dedicated digit recognizer. However, if the problem involves user-defined symbols, our approach has distinct advantages.

7.7  Discussion

In the graphic symbol recognition study, we consider the top-two classification performance to be of considerable importance, as a common method for correcting recognition errors in many interactive applications is to display a short list of potential candidates from which the user can pick the intended one. In such cases, if the correct class was not selected by the recognizer, it should at least be near the top of the list. Similarly, in systems that consider context, such as [25] and [3], the shape recognizer may produce a set of likely candidates which can be further analyzed using contextual information. In such cases, the system may use context to decide that, for example, the second or third choice from the recognizer is the correct interpretation. For such approaches to work, the correct interpretation must be near the top of the list of choices.

We believe that the results of our user study are quite promising when compared to results reported in the literature. For example, Landay and Myers [50] report a recognition rate of 89% on a set of 5 single-stroke editing gestures. In our case, we obtain a recognition rate over 90% on a set of 20 symbol definitions, each of which can be drawn with any number of strokes. In a study involving 7 multi-stroke and 5 single-stroke shapes, Fonseca and Jorge [23] report recognition rates around 92%. In that study, half of the subjects were experts in using the hardware. Also, the recognizer required the shape features to be manually encoded for each individual shape, which makes training new shapes difficult. On a database of 13 symbols, Hse and Newton [37] report a recognition rate of 97.3% in a user-dependent setting, and 96.2% in a user-independent setting. Each symbol was trained using 30 samples. On a database of 20 symbols, we achieve an accuracy of 95.7% in a user-dependent setting, where each symbol was trained with 2 samples (Test 2). Likewise, with the same 20 symbols, we achieve an accuracy of 94.7% in a user-independent setting, where each symbol was trained with 12 samples (Test 3).

To evaluate the efficiency of our polar coordinate analysis, we conducted a separate experiment in which the angular alignment of the images was computed in screen coordinates via incremental rotations. This not only bypassed the polar coordinate approach for computing the optimal alignment, but also bypassed the accompanying pre-recognition step in which unlikely definitions are pruned. With these modifications, the average recognition times for Test 1 of the graphical symbol and digit recognition experiments increased to 3590 ms and 1350 ms respectively, while the recognition accuracy remained the same in both cases. As these results suggest, the polar analysis provides significant savings in overall processing time without any decrease in accuracy.

Chapter 8

Feature-Based Statistical Symbol Recognition

This method uses a statistical approach to learn symbol definitions from a set of training examples. Training consists of extracting a number of geometric features from a set of training examples and computing the statistical information contained in these features. This approach naturally accounts for the variations typical of hand-drawn shapes. For recognition, the unknown symbol is matched against each of the definition symbols, and the definition that yields the highest probability of match is declared as the unknown's classification. As explained below, the problem is modeled as a standard Bayesian classification problem with the underlying probability density functions assumed to be Gaussian.

Like the previous image-based recognizer, this method puts no restriction on the number of strokes or the order in which they are drawn. For example, when training a square symbol, some of the training examples can be drawn with four strokes, one for each side of the square, while others can be drawn with a single stroke, two strokes, and so on. Also, for those examples drawn with multiple strokes, the sequence in which they are placed on the drawing surface is immaterial. This flexibility proves quite useful as the user is not required to draw the same way every time. Moreover, samples from different users can be collected, resulting in symbol definitions that are not tailored to any single user.


The recognition task involves determining the symbol definition that best matches the unknown symbol. Formally, we seek the definition (call it 'Object' for clarity) that results in the highest 'posterior' probability, given the unknown input strokes (Strokes) that have been observed. Mathematically, this can be stated as:

$$w^* = \arg\max_i P(Object_i \mid Strokes)$$

The $Object_i$ that maximizes the above expression is chosen as the interpretation $w^*$ of the input strokes. The term $P(Object_i \mid Strokes)$ is known as the posterior of $Object_i$, and it lies at the heart of our decision process. From Bayes' rule, the maximization can be written as:

$$w^* = \arg\max_i \frac{p(Strokes \mid Object_i) \cdot P(Object_i)}{\sum_{k=1}^{N} p(Strokes \mid Object_k) \cdot P(Object_k)}$$

where N is the number of prototype symbols in the database. Here, the denominator can be ignored since it is simply a scaling factor that remains the same for all $Object_i$'s. The term $P(Object_i)$ is called the prior probability of observing $Object_i$. In the absence of any other evidence, the class with the highest prior probability would be chosen as the unknown's classification. In this approach, we assume that all symbols in the library have equal prior probabilities, and thus the $P(Object_i)$ terms do not influence the posterior probabilities. The term $P(Strokes \mid Object_i)$ is called the likelihood (LH). It specifies the probability of observing the set of strokes given that the class under consideration is $Object_i$. With equal priors, Bayes' formula states that the posterior probabilities are dictated solely by the LH terms. Thus, the main focus of this approach is to find a means of reliably determining the likelihood terms. For each symbol class, this term is learned from the set of training examples. The classification of an unknown symbol is thus given by:

$$w^* = \arg\max_i P(Strokes \mid Object_i)$$

In words, the unknown is assigned to the symbol class that results in the maximum LH term.
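In code, this decision rule is a one-liner. The hedged sketch below assumes each definition exposes a likelihood function; the names are hypothetical.

```python
def classify(strokes, likelihoods):
    """With equal priors, maximizing the posterior reduces to maximizing the
    likelihood; `likelihoods` maps each definition name to a function that
    returns P(Strokes | Object_i)."""
    return max(likelihoods, key=lambda name: likelihoods[name](strokes))
```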


Figure 8.1: Left: The unprocessed shape as drawn by the user. Right: The resulting shape after segmentation.

8.1  Preprocessing and Segmentation

(The segmenter described below was developed primarily by T. Stahovich and T. Kurtoglu, and is presented here for completeness. For details, see [8].) Data points are collected as time-sequenced (x, y) points sampled along the stylus' trajectory. The program gathers these points and attempts to fit one of two types of geometric primitives: a straight line segment or an arc segment. We refer to this process as 'segmentation'. Figure 8.1 shows an example. The figure on the left corresponds to the unprocessed ink as obtained directly from the digitizing tablet. The figure on the right shows the resulting symbol after segmentation.

Segmentation can be viewed as an optimization procedure in which the input drawing is described with a geometrically manageable set of simple primitives. The lack of two pieces of information makes this a challenging problem: (1) the optimal number of primitives to use is not known a priori (a casually drawn curve can be closely represented by a large number of small line segments or by a smaller number of longer lines), and (2) the most appropriate type of primitive is not known (i.e., whether to fit a line or an arc to a set of data points). The key to the solution is to determine what we call the 'segment points', the points that divide the symbol into different primitives. Once these points are determined, a least squares test is used to determine which type of primitive best fits a given set of data points. Trivially, all pen-up and pen-down events in a multi-stroke symbol mark the ends of segments; hence, all such points are considered segment points.

The difficulty, however, is to find the segment points within the strokes. Consider, for example, the square shown in Figure 8.2, which is drawn in a single stroke. The challenge is to identify the corners of the square from the continuous stream of (x, y) points.

Figure 8.2: Left: A square drawn in a single stroke in the clockwise direction. Right: The corresponding speed profile of the pen tip. As shown, the corners of the square can be identified by locating the minima in the speed profile.

A common approach to detecting such points is based on curvature: points with large curvature are likely to be corner points. Although this insight is useful, it does not always yield reliable segment points. Instead, a more subtle characteristic, namely the drawing speed, has proven to be a more effective indicator of segment points. The key observation is that while drawing, humans naturally tend to slow down at intended corners and draw at a relatively higher pace elsewhere. Hence, by monitoring the minima in the speed profile, one can determine which bumps and bends are intended and which are accidental. The speed profile on the right of Figure 8.2 clearly shows the corner points corresponding to the square. Once all the segment points are determined, a line or arc segment is fitted to each of the intervals. This segmentation process is used in the feature-based symbol recognizer described in this chapter, and in the graph-based symbol recognizer described in the next chapter.
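The following is a minimal sketch of speed-based segment point detection, assuming pen samples arrive as (x, y, t) tuples; the thresholds are illustrative and not the tuned values of the actual segmenter [8].

```python
import math

def segment_points(stroke, speed_floor=0.25, window=3):
    """Candidate segment points as local minima of pen speed. `stroke` is a
    list of (x, y, t) samples; a sample is kept when its speed is the minimum
    of its neighborhood and well below the mean speed."""
    speeds = []
    for (x0, y0, t0), (x1, y1, t1) in zip(stroke, stroke[1:]):
        dt = max(t1 - t0, 1e-6)                 # guard against repeated timestamps
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    mean_speed = sum(speeds) / len(speeds)
    corners = []
    for i in range(window, len(speeds) - window):
        hood = speeds[i - window:i + window + 1]
        if speeds[i] == min(hood) and speeds[i] < speed_floor * mean_speed:
            corners.append(i + 1)               # index of the slow sample
    return corners
```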


8.2  Feature Set

The class models are constructed using a number of geometric features extracted from the training examples. Each training example is processed and segmented using the symbol segmenter described above, resulting in a set of line and arc segments closely approximating the raw digital ink. The geometric features are computed following the segmentation process. We consider ten types of features to characterize a symbol:

(1) Number of lines
(2) Number of arcs
(3) Number of parallel lines
(4) Number of perpendicular lines
(5) Number of 'L' intersections
(6) Number of 'X' intersections
(7) Number of 'T' intersections
(8) The average distance between endpoints
(9) The distance between the image center and the point closest to the center, normalized by the total ink length (the total distance the pen tip travels on the writing surface). This is denoted the 'normalized r_min'.
(10) The distance between the image center and the point farthest from the center, normalized by the total ink length. This is denoted the 'normalized r_max'.

These features are invariant to translation, scale, rotation, and arbitrary drawing orders. Figure 8.3 illustrates these features on a pivot symbol.

Figure 8.3: Illustration of the feature set.

The nature of hand-drawings often causes some of these features to be missing when they are actually intended, or to occur superfluously when they are not intended. To account for such distortions, we employ a number of heuristics. A tolerance is allowed when determining if two segments intersect: if one segment intersects another when its length is extended by 15%, it is assumed that the intersection was intended in the original sketch. If an intersection occurs within 15% of a segment's length from an endpoint, the intersection is said to occur at an endpoint; otherwise, it occurs at a midpoint. An 'L' intersection is formed when the endpoints of two segments meet. When an endpoint of one segment meets a midpoint of another segment, a 'T' intersection is formed. An 'X' intersection is formed when two segments cross at midpoints. Tolerances are also used when determining if two lines are parallel or perpendicular: for two lines to be considered parallel, their slopes may differ by as much as 5 degrees, and perpendicular lines are permitted the same 5-degree tolerance.

The 8th feature, the average distance between endpoints, is found by determining the distance from each endpoint of each segment to each endpoint of every other segment. This value is averaged and normalized by the maximum distance between any two endpoints to account for scaling. The average distance between endpoints gives information similar to the aspect ratio of the symbol's bounding box, but is insensitive to rotation. This feature allows us to differentiate between objects that contain scaled versions of the same segments. For example, the average distance between endpoints of a square is larger than that of a rectangle, while the first seven features could have the same values for these two symbols.
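As an illustration of these heuristics, the sketch below classifies an intersection between two segments from the parametric positions of the intersection point along each segment. The 15% tolerance matches the text, while the helper's interface is an assumption for illustration.

```python
def intersection_type(t_p, t_q, tol=0.15):
    """Classify an intersection from the parametric positions t in [0, 1] of
    the intersection point along each of the two segments (t may fall slightly
    outside [0, 1] when the segments only meet once extended by `tol`).
    Returns 'L', 'T', 'X', or None."""
    def role(t):
        if t < -tol or t > 1.0 + tol:
            return None                     # no intersection, even extended
        return 'end' if (t < tol or t > 1.0 - tol) else 'mid'
    a, b = role(t_p), role(t_q)
    if a is None or b is None:
        return None
    if a == 'end' and b == 'end':
        return 'L'                          # endpoint meets endpoint
    if a == 'mid' and b == 'mid':
        return 'X'                          # segments cross at midpoints
    return 'T'                              # endpoint meets midpoint
```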


Features 9 and 10 measure the hollowness and the density of the symbol. If r_min is large, the sampled data points comprising the shape reside far from the image center, as is the case with circles, squares, and other hollow shapes. If both r_min and r_max are relatively small, the shape is usually compactly drawn around its center; moreover, the shape is dense in the sense that its spread on the drawing surface is small compared to the total ink length.

8.3  Training

When a new symbol definition is to be created, the program is set to training mode and the user begins to draw examples of the new symbol. As strokes are drawn, the program automatically segments the raw ink into line or arc primitives. Next, it extracts the features described above. This process is repeated separately for each of the training symbols (currently, a total of 50 training examples, collected from different users, are used to learn a symbol definition), and at the end a probabilistic model for the trained symbol is constructed from the extracted feature vectors.

We assume that the training feature vectors are distributed normally, i.e., they can be modeled as Gaussian distributions. The parameters of a multivariate Gaussian distribution are governed by the second-order statistics: the mean m and the covariance Σ. These parameters are computed from the training symbols as follows:

$$m = \frac{1}{R} \sum_{k=1}^{R} x_k \qquad\qquad \Sigma = \frac{1}{R-1} \sum_{k=1}^{R} (x_k - m)(x_k - m)^T$$

where R denotes the number of training examples for a particular symbol class and $x_k$ denotes the column vector encoding the ten features for the k-th training example. The following are the resulting m and Σ matrices when a damper symbol is trained using the kinds of examples shown in Figure 8.4. The actual training involved 50 such examples. (For simplicity, we present the idea using the first four features only.)

Figure 8.4: Examples of a damper symbol used for training. Segmented ink is shown.

$$m = \begin{bmatrix} 5.780 \\ 0.400 \\ 3.860 \\ 4.160 \end{bmatrix}$$

and the covariance matrix is:

$$\Sigma = \begin{bmatrix} 0.9098 & -0.4816 & 1.1727 & 1.7910 \\ -0.4816 & 0.4082 & -0.4735 & -0.5755 \\ 1.1727 & -0.4735 & 3.7963 & 3.5943 \\ 1.7910 & -0.5755 & 3.5943 & 8.0555 \end{bmatrix}$$

If all ten features were used, m would be a 10x1 column vector and Σ a 10x10 square matrix. The non-zero values along the diagonal of the covariance matrix indicate that there are variations in the features extracted from different training examples of a symbol. These variations occur because each training example is drawn differently from the others in one way or another.


In fact, this variation is exactly what we want to encode. We have seen that if the training examples do not exhibit any variation in one or more of the features, the resulting symbol definition becomes too rigid, and this adversely affects recognition performance. We shall return to this issue in the next section.

8.4  Recognition

In recognition mode, the user draws a symbol and asks the program to identify it. Similar to the way the training examples are processed, the program segments the symbol and computes the feature vector x. Next, this vector is evaluated against each of the symbol definitions using the Gaussian probability function:

$$P(x) = \frac{1}{\sqrt{(2\pi)^f \cdot \det(\Sigma)}} \exp\left[-\frac{1}{2} (x - m)^T \Sigma^{-1} (x - m)\right]$$

where x is the 10x1 feature vector of the unknown symbol and f is the number of features (here, f = 10). P(x) is the probability of x coming from the symbol model described by m and Σ. At the end of the analysis, the symbol definition that yields the highest probability for the vector x is chosen as the classification of the unknown symbol.

The above expression is ill-defined when the covariance matrix is singular. Such cases usually arise when the training examples do not exhibit variation in one or more of the features. For example, consider training a square symbol with the examples shown in Figure 8.5. For illustration purposes, let's consider only three types of features in this example: (1) number of lines, (2) number of arcs, and (3) number of intersections, without distinguishing between L-type, X-type, or T-type intersections. As seen, all samples of the square contain exactly four lines and no arcs. There are variations in the number of intersections because occasionally the square is not fully closed. The corresponding mean vector is:

$$m = \begin{bmatrix} 4.00 \\ 0.00 \\ 3.93 \end{bmatrix}$$

and the covariance matrix is:

$$\Sigma = \begin{bmatrix} 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.067 \end{bmatrix}$$

Figure 8.5: Examples of a square symbol that lead to a deficient covariance matrix.

The covariance matrix is singular and thus cannot be inverted. This often happens when there are not enough training examples, or when the features take on discrete values (such as the number of lines or arcs). We have found that in our case the latter is the more common cause of deficient covariance matrices. To overcome this difficulty, we introduce an artificial variation into the data by replacing all zero diagonal elements with a small positive number ε. This way, the rather stiff definitions are slightly softened to better suit recognition.
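A sketch of the regularized Gaussian evaluation; the value of ε is illustrative, as the text does not specify it.

```python
import numpy as np

EPSILON = 1e-3  # illustrative value; the thesis does not specify epsilon

def gaussian_likelihood(x, m, sigma):
    """Evaluate the Gaussian P(x) for model (m, sigma), replacing zero
    diagonal entries with epsilon so a deficient covariance matrix stays
    invertible."""
    sigma = np.array(sigma, dtype=float)
    d_idx = np.diag_indices_from(sigma)
    sigma[d_idx] = np.where(sigma[d_idx] == 0.0, EPSILON, sigma[d_idx])
    x, m = np.asarray(x, dtype=float), np.asarray(m, dtype=float)
    diff = x - m
    f = x.size
    norm = 1.0 / np.sqrt((2 * np.pi) ** f * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)
```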

8.5  User Study

To evaluate the recognizer, we asked six subjects to train and use the system to recognize common engineering symbols. The subjects had little or no experience using the digitizing tablet and stylus. The 24 graphical symbols used in this study are shown in Figure 8.6. Users were asked to provide 11 examples of each symbol, yielding a total of 66 examples per symbol. Five examples were added to the training set by the author, yielding a total of 71 examples per symbol. For each symbol, 50 examples were reserved for training and the remaining 21 were used for testing. Subjects were told that the program would be used to test symbol recognition, but were given no information regarding how the recognizer works.

Figure 8.6: The set of symbols used for testing the feature-based recognizer: arrow, beam, brackets, cantilever, capacitor, current, dashpot, ground, integral, link, moment, pi, piston, pivot, pound, pulley, spring, square, square root, sum, throttle, triangle, turbine, and voltage.

Figure 8.7 shows the results of this study, tabulated in the form of a confusion matrix. The confusion matrix of a classifier depicts the misclassification rate for each type of symbol considered in the test. The rows in the matrix correspond to the test symbol under consideration. The columns correspond to the symbols that the classifier outputs when testing a particular test symbol.


Figure 8.7: The confusion matrix of the feature-based recognizer. A number in the matrix indexed by (row, column) indicates how many times a row symbol is misclassified as a column symbol. Each row symbol was tested using 21 test cases.

A number in the matrix indexed by (row, column) indicates the number of times the recognizer incorrectly classifies the test symbol 'row' as the symbol 'column'. For example, Figure 8.7 states that the Brackets symbol is misclassified once as an Arrow symbol and twice as a Squareroot symbol. The total number of test examples for each symbol is 21, as mentioned previously.

The first row of Figure 8.8 shows the overall recognition rate of the recognizer. The overall recognition rate is defined as 1 - ErrorRate, where the error rate is the total number of misclassifications (in this case 112) divided by the total number of tests (in this case 24x21 = 504). Our examination of the confusion matrix showed that most of the misclassifications were oddly related to the Squareroot and the Pivot symbols.

For example, out of the 21 test cases, the Triangle symbol was misclassified as the Squareroot symbol 20 times. Many of the other symbols were also frequently misclassified as the Squareroot and the Pivot symbols. We believe that this rather distinct peculiarity is the result of a glitch that occurred during the training of these two symbols, most likely during the computation of their covariance matrices. For a more objective assessment, we conducted two more experiments in which first the Squareroot, and then both the Squareroot and the Pivot symbols, were left out. The second and third rows of Figure 8.8 show the recognition rates for these two additional experiments. As shown, the recognition rates increase significantly.

                                   Recognition Rate
  Original Data                        77.78%
  Squareroot left out                  90.48%
  Squareroot and Pivot left out        92.64%

Figure 8.8: Top: The recognition rate of the feature-based recognizer. Middle: The recognition rate when the Squareroot symbol is left out of the analysis. Bottom: The recognition rate when both the Squareroot and the Pivot symbols are left out.

The above recognition rates are obtained when the program correctly identifies the true class at the top of the ranking list; that is, they correspond to top-1 accuracy. The top-2 and top-3 recognition accuracies are shown in Figure 8.9.

                                   Recognition Rate (Top-2)    Recognition Rate (Top-3)
  Original Data                            92.86%                      97.02%
  Squareroot left out                      95.03%                      97.31%
  Squareroot and Pivot left out            96.32%                      97.40%

Figure 8.9: The recognition rates when the correct symbol is identified in the top-2 and top-3 spots of the recognition lists.

We consider these results very encouraging given the results reported in other studies.


A discussion of those studies is given at the end of Section 7.7, where the results of our image-based recognizer are discussed. A quick note, however, is that the accuracy of our recognizer is either similar to or better than that of recognizers limited to single-stroke symbols [50], or of those tested by people who were already familiar with the recognition system [23].

The main limitation of this approach is that it is sensitive to the results of the segmentation process. The segmenter often misses small details in the sketch, especially if those details are part of a relatively long pen stroke. Our recognizer thus does not perform well on symbols that consist of a large number of small details. Also, the recognizer is not capable of detecting repetitive patterns. This means, for example, that when drawing the spring symbol shown in Figure 8.6, the user must pay attention to the number of zig-zags in the symbol. There are similar requirements, for example, for the ground symbol.

Chapter 9

Graph-Based Symbol Recognition

This chapter presents a symbol recognizer that works from a graph-based representation of the symbols: a structural representation of the line and arc primitives making up a symbol. Various geometric relationships among these primitives are encoded in an attributed graph, providing good resolution for describing even the smallest features in a symbol. This approach advances a previous approach by encoding not only the "frequently occurring" properties of the training examples but also their "statistical distributions." This helps distinguish features that are unique characteristics of a symbol from those that are less representative. Also, to facilitate graph matching (which is typically expensive), we have developed an error-driven stochastic algorithm that can quickly converge to a good enough matching configuration, while avoiding an exhaustive search for the best matching configuration.

9.1  Background

Calhoun et al. [8] have developed a symbol recognizer for multi-stroke symbols. Symbols are first segmented into line and arc primitives using the ink segmenter described previously. To train a new symbol, the user provides several examples of the symbol. From these training symbols, a symbol definition is learned in the form of an 'attributed graph'. The information encoded in this graph can be grouped under two categories: (1) the intrinsic properties of the primitives, such as their types (line or arc) and their relative lengths, and (2) the relationships between these primitives, such as the existence of an intersection between two primitives, the relative location of intersections, the existence of parallel or perpendicular lines, and so forth. The nodes in the graph correspond to the intrinsic properties of the individual primitives, whereas the links correspond to the relationships between pairs of primitives.

Figure 9.1: The semantic network definition of a square. The links represent parallel and perpendicular relationships and intersections.

Figure 9.1 shows an example of the attributed graph of an ideal square. The nodes in this graph correspond to the four line primitives that make up the sides of the square. The links, on the other hand, represent various geometric relationships between the line primitives. Whether an intrinsic or relationship property gets encoded in a definition graph depends on the frequency with which that attribute is observed in the training examples. For instance, if two line primitives are drawn parallel (within some tolerance) in at least half of the training examples, then this parallelism relationship is added to the graph; otherwise, the parallelism is assumed to be coincidental and is not represented in the graph. Likewise, if two primitives intersect in 70% of the training examples, then this intersection is considered significant and is encoded in the graph.


To facilitate graph construction during training, it is assumed that all of the examples have the same number and types of primitives. Furthermore, it is assumed that the primitives are drawn in the same order and in the same direction. For example, when training a rectangle symbol, each training example must consist of four line primitives drawn in a certain sequence (e.g., first the left side, then the right side, then the top, and finally the bottom). Moreover, each primitive must be drawn in a certain direction; for example, the vertical sides must be drawn starting at the top and ending at the bottom. While such choices can be set arbitrarily in the first training example, the training method requires that the subsequent examples all be consistent with the initial set of choices. These assumptions make it trivial to determine which primitives in one example match those of another, thereby greatly facilitating the training process.

The most critical stage in this method is recognition, where the graph of the unknown has to be matched against the definition graphs. The main difficulty is the algorithmic complexity of the graph matching problem. The nature of graph matching makes the recognition algorithm sensitive to the order in which primitives are drawn. To achieve drawing order invariance, graphs need to be compared multiple times until the best matching configuration is found. In the case of Calhoun et al. [8], pruning methods are used to prevent the algorithm from exploring unfruitful branches of the search tree, thereby speeding up the search process. Nevertheless, the expensive matching process remains the main limitation of this method. An additional limitation is that for two symbols to be considered similar, they must have the same number of primitives; otherwise the approach concludes that the two symbols are dissimilar. This requirement often causes likely matches to go undetected when one symbol has a few missing or extra primitives.

9.2  Error-driven Stochastic Graph Matching

In the previous approach, whether an intrinsic or relationship property is added to, or excluded from, the definition graph depends solely on the frequency with which that feature is observed in the training symbols.


Naturally, the approach is too sensitive to the hard thresholds that determine whether a feature should be added or excluded. In the end, the training produces a graph of a "representative" symbol that is believed to encapsulate the most important attributes of the symbol. However, during this process, some useful pieces of information, such as the variation of the observed features among the training samples, are lost.

In this work, the above graph-based method is improved in several ways. First, the graph representation is modified such that statistical variations among continuously valued intrinsic and relationship properties are represented in the form of probabilistic distributions. By doing so, not only the features characterizing a symbol but also their relative significance in describing the symbol are expressed in the graph representation. This averts many of the difficulties associated with hard thresholding. Second, several of the previously quantized features, such as the existence of parallel or perpendicular lines, are replaced by features that exhibit a natural continuum, such as the angles between pairs of lines. Third, a new error-driven, stochastic search algorithm is developed to facilitate graph matching. The key innovation is that the algorithm achieves drawing order invariance relatively quickly by iteratively interchanging the placements of certain nodes in the graph, where the choice of nodes is based on the error of the match in the previous iteration. Finally, unlike the old method, this method does not require two symbols to have the same number of primitives during matching. Hence, the approach can detect a similarity even if an unknown symbol has a few missing or extra primitives compared to a definition symbol.

The following paragraphs first describe the features used for characterizing the symbols. Next, the training method is described, followed by the recognition algorithm. Finally, the error-driven search algorithm for achieving drawing order invariance is described.

9.2.1  Representation

The following intrinsic and relationship properties are used to describe a symbol:

Intrinsic Properties:

• Primitive Type: A binary-valued attribute that specifies whether a segment is a line or an arc.

• Relative Length: The length of each primitive relative to the total length of the primitives comprising the symbol. For example, each side of a perfect square has a relative length of 25%.

Relationship Properties:

• Number of Intersections: A ternary-valued attribute that specifies the number of intersections (0, 1, or 2) between two primitives. Note that if at least one of the primitives is an arc, there are potentially 2 intersections.

• Intersection Angle: The acute angle between two line primitives. It is defined for both intersecting and non-intersecting lines. If one of the primitives is an arc, this attribute is not used.

• Intersection Location: A pair of values specifying the location of an intersection point between two primitives. The intersection locations are measured relative to the lengths of the primitives. For example, if the beginning of one line segment intersects the middle of another, the intersection is described as the pair (0%, 50%).
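These properties map naturally onto simple records; a minimal sketch (the names are illustrative, not from the thesis):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Primitive:
    """A node in the graph: the intrinsic properties of one primitive."""
    kind: str                # 'line' or 'arc'
    relative_length: float   # fraction of the symbol's total primitive length

@dataclass
class Relationship:
    """A link in the graph: the relationship between two primitives."""
    num_intersections: int                   # 0, 1, or 2
    angle: Optional[float]                   # acute angle; None if an arc is involved
    location: Optional[Tuple[float, float]]  # relative positions along each primitive
```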

9.2.2  Training

The user constructs a new symbol definition by providing several training examples. As before, all training examples are required to have the same number and type of primitives, drawn in the same order. (This is not required for recognition though.) As noted earlier, this requirement makes it easier to construct the graph representation as the additional need for finding the matching primitives among different training examples is eliminated. Consider the hypothetical symbol shown in Figure 9.2a. The definition graph, shown in Figure 9.2b, is constructed from several examples of this symbol. For each of the attributes (except for the Primitive Type property), both the mean value m and the standard deviation σ of the attribute are computed and encoded in the graph.

Figure 9.2: (a) A hypothetical symbol. (b) Its statistical graph representation. (c) The same graph represented in matrix form. Note that all examples used to construct the graph had four segments drawn in the order (1) Line, (2) Arc, (3) Line, (4) Line. 'RL' stands for relative length.


A small standard deviation in a feature typically indicates that it is an important and intended feature. A large standard deviation, on the other hand, indicates that the feature is less critical in the characterization of the symbol. As described later, this information is utilized during recognition to determine the severity of a mismatch. For example, a mismatch in a feature that has a small standard deviation in the training examples is penalized heavily, while one with a large standard deviation is penalized lightly. While any number of training examples can be used, our experiments showed that reliable definitions are learned from a minimum of 10 to 15 examples. With fewer examples, the statistical variations usually become less informative and, in some cases, misleading.

For implementation purposes, we choose to represent the definition graphs as symmetric matrices, such as the one shown in Figure 9.2c. In this representation, the diagonal elements correspond to the intrinsic properties of the primitives, whereas the off-diagonal elements correspond to the relationships between different primitives. Due to symmetry, only the diagonal and upper half of the matrix need be maintained. Note that this matrix representation treats the symbol as a fully connected graph in which there is an entry for each pair of primitives.
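A sketch of how the matrix form of a definition might be assembled during training, reusing the Primitive record above; only the diagonal (intrinsic) entries are shown, with relationship entries filled analogously.

```python
import statistics

def learn_definition(examples):
    """Assemble the matrix form of a definition graph from aligned training
    examples (each a list of Primitive records in drawing order)."""
    n = len(examples[0])
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        lengths = [ex[i].relative_length for ex in examples]
        matrix[i][i] = (statistics.mean(lengths), statistics.stdev(lengths))
    return matrix
```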

9.2.3  Recognition

The goal during recognition is to match an unknown symbol's graph representation to the statistical graph representations of the definition symbols, and to choose the best match as the recognition result. To find the degree of match between an unknown and a definition symbol, the algorithm begins with an assignment of the nodes in the unknown graph to the nodes in the definition graph. Initially, the order of assignment is determined by the drawing order of the unknown; i.e., the first primitive in the unknown is matched with the first primitive in the definition, the second with the second, and so on. If the definition graph contains more nodes than the unknown's graph, the unknown's graph is padded with idle nodes and relationships to make the two graphs the same size.


The idle nodes and relationships do not introduce any unwarranted attributes, and hence do not alter the properties of the original graph. Similarly, if the unknown has more nodes than the definition, the definition graph is padded with idle nodes and relationships. Once the two graphs are made the same size, each node and link in the unknown's graph is matched against the corresponding nodes and links of the definition's graph. Note that graph matching in this way corresponds to comparing the diagonal and off-diagonal elements of the matrix representations of the two symbols. Comparison of the diagonal elements corresponds to a comparison of the nodes (i.e., the primitives), while comparison of the off-diagonal elements corresponds to a comparison of the links (i.e., the relationships) between the primitives of the two symbols.

The match between an unknown and a definition symbol is quantified in terms of a dissimilarity score, which is computed using an ensemble of error metrics. These error metrics are based on both the intrinsic and relationship properties. Table 9.1 lists the error metrics and the corresponding weights employed during recognition. The weights reflect our experience with which characteristics of a symbol are most important for accurately identifying it. The subsequent paragraphs detail these metrics. For practical purposes, all error metrics are designed to lie in the range [0, 1], with a value close to 0 indicating a better match.

Table 9.1: Error metrics and corresponding weights used in the graph-based symbol recognizer.

  Error Metric                       Weight
  Primitive count error               20%
  Primitive type error                20%
  Relative length error               20%
  Number of intersections error       15%
  Intersection angle error            15%
  Intersection location error         10%

142

CHAPTER 9. GRAPH-BASED SYMBOL RECOGNITION

error is defined as: |NU − ND | ) min(NU , ND )

P rimitive Count Error = min(1.0,

The numerator in the right hand side of this expression measures the absolute difference in the number of primitives in the two symbols. This value is normalized by the minimum number of primitives in either the definition or the unknown, as a way to quantify the significance of the difference. For example, an absolute difference of 3 primitives is important when the unknown or the definition symbol has only, say, 5 or 6 primitives. On the other hand, the same absolute difference would be less significant if the two symbols have, say, 20 or 30 primitives. The ratio thus produces a large value when the difference is significant and a small value otherwise. Note that the ratio varies between 0 and theoretically, infinity. To obtain values in the range [0,1], we use a saturating error function whose output is limited to 1.0. Example: If the unknown has 10 segments and the definition has 12, the primitive count error is: min(1.0,

|10 − 12| ) = 0.20 min(10, 12)

2. Primitive type error: Given a certain assignment of primitives in the unknown to primitives in the definition, this error metric measures the difference between the types (i.e., line vs. arc) of the primitives, excluding the matches involving idle primitives. Each primitive in the unknown (denoted Ui ) is compared to the corresponding primitive in the definition (denoted Di ). The error is defined as the number of times the types of the matched primitives differ, normalized by the number of primitives in the unknown or the definition, whichever is smaller: min(NU ,ND )

X

P rimitive T ype Error =

δ(T ype(Ui ), T ype(Di ))

i=1

min(NU , ND )

Where δ(x, y) is 1 when x = y and zero otherwise. Example: Consider the sequence of primitives of an unknown and a definition symbol. Let L denote a Line primitive and A denote an Arc primitive.

9.2. ERROR-DRIVEN STOCHASTIC GRAPH MATCHING

143

Figure 9.3: The probabilistic definition function P (x). Left: m = 30% and σ = 4%. Right: m = 80% and σ = 10% Unknown

L-L-A-L-L-A-A-A-L-L-L-I -I -I -I

Definition L-A-L-L-L-L-A-A-L-L-A-L-L-A-A Note that the unknown is padded with idle primitives (I ) to make the number of primitives in the two symbols equal for matching. Given this configuration, the number of actual primitives whose types do not match is 4 (2nd, 3rd, 6th and 11th primitives). Here, the normalizing factor is determined by the unknown because it has the fewer number of primitives, namely 11. Hence, the primitive type error becomes: P rimitive T ype Error = 4/11 = 0.36. 3. Relative length error: This error compares the relative lengths of the primitives in the unknown and the definition. The comparison is made by comparing the relative length of each unknown primitive to the relative length of the corresponding definition primitive. The relative length of a definition primitive is described in terms of a “probabilistic definition function” whose parameters (mean m and standard deviation σ) are learned from the training samples. The function has the form: P (x) = exp[−

(x − m)4 1 · ] 50.0 σ4

Figure 9.3 shows the function for two different values of m and σ. The function

144

CHAPTER 9. GRAPH-BASED SYMBOL RECOGNITION

was designed empirically such that its top is flatter than the Gaussian probability function with the same m and σ. This feature makes it easier to detect matches that are in the “vicinity.” We have found that the Gaussian distribution dies off too quickly towards its tails, which decreases its usefulness for recognition. The match between an unknown and a definition primitive is the output of the above function, whose input is the unknown primitive’s relative length. For example, if a primitive in the definition symbol has its relative length described as the curve on the left in Figure 9.3, a primitive from the unknown with a relative length of 20% will produce a matching score of 0.7. Note that the above expression produces values between 0 and 1, where 1 represents a perfect match. The error of match is conveniently defined as the complement 1 − P (x). The overall relative length error is defined as the average error accumulated over the number of primitives considered: min(NU ,ND )

X

1 − P (uiRL )

i=1

Relative Length Error =

min(NU , ND )

where uiRL represents the relative length of the ith primitive of the unknown. Example: Consider the relative length information (denoted in percentage) of an unknown with 5 primitives and a definition with 6 primitives. As before, the unknown is padded with an idle primitive (I ). Prim. 1

Prim. 2

Prim. 3

Prim. 4

Prim. 5

Prim. 6

10%

24%

30%

21%

15%

I

Definition (m, σ) (9%,1%)

(18%,3.2%)

(37%,3%)

(17%,2%)

(16%,3%)

(3%,2%)

P (uiRL )

0.980

0.781

0.553

0.726

1.000

N/A

P (uiRL )

0.020

0.219

0.447

0.274

0.000

N/A

Unknown

1−

From this data, the relative length error becomes:

$$\text{Relative Length Error} = \frac{\sum_{i=1}^{5} \left[1 - P(u_i^{RL})\right]}{5} = 0.192$$

4. Number of intersections error: This error identifies the difference between the number of intersections occurring between the primitives of the unknown and those of the definition. Unlike the previous error metrics, which consider the intrinsic properties of the primitives, this error focuses on the relationships between the primitives. For each relationship in the graph, the error is defined as the difference between the number of intersections occurring in the unknown and the definition. From the individual differences, a cumulative difference is defined, which is simply the average of the differences:

$$\text{Average Intersection Difference} = \frac{\sum_{i=1}^{\min(N_U, N_D)} \sum_{j=1}^{i} \big|\text{NumInt}(U_i, U_j) - \text{NumInt}(D_i, D_j)\big|}{\min(NR_U, NR_D)}$$

where $NR_U$ and $NR_D$ represent the number of original relationships (i.e., those excluding the idle relationships) in the unknown and the definition respectively, and $\text{NumInt}(\cdot, \cdot)$ denotes the number of intersections observed between the argument primitives. The indices i and j index the primitives in the unknown and definition symbols; the double sum ensures that all pairs of primitives are considered. Note that $\text{NumInt}(\cdot, \cdot)$ can be 0, 1, or 2 due to the presence of both line and arc primitives. When averaged, the result of the above expression thus lies in the range [0,2]. To map the error into the range [0,1], we use a squashing function S(·), shown in Figure 9.4. This function was designed such that large differences are amplified while small differences are attenuated; during our experiments, we have found this choice to provide better discrimination than a linear squashing function. With this squashing function, the “Number of Intersections Error” becomes:

$$\text{Number of Intersections Error} = S(\text{Average Intersection Difference})$$

[Figure 9.4: The squashing function S(x).]

5. Intersection angle error: This error determines how much the acute angle formed by a pair of lines in the unknown differs from the angle formed by the corresponding pair of lines in the definition. As in the relative length error, the angle information in the definition symbol is encoded statistically (m and σ) for each of its line pairs; Figure 9.5 illustrates the idea. During matching, the error between an unknown line pair and a definition line pair is determined using the probabilistic definition function shown earlier in Figure 9.3. The final intersection angle error is then defined as the average of the errors accumulated from the line pairs considered. This error is defined only between line primitives; hence, for a given pair of primitives, it is not used if at least one of the primitives is an arc.

[Figure 9.5: Illustration of the intersection angle information for the unknown and the definition. Unknown: Int. Angle = 32; Definition: (m, σ) = (40, 7).]

6. Intersection location error: This error focuses on the location of intersections along the primitives’ lengths. The error measures the average difference between the location of an intersection in the unknown and the corresponding intersection location in the definition. An intersection between two primitives A and B is described in terms of a pair of values, each denoting the location of the intersection relative to the length

of the individual primitives. Figure 9.6 illustrates the idea.

[Figure 9.6: Illustration of the intersection location information for the unknown and the definition. Unknown: Int. Location(A, B) = (50%, 10%); Definition: Int. Location (A(m, σ), B(m, σ)) = ((47%, 3%), (12%, 4%)).]

Intersection locations in the definition are once again described using the probabilistic definition function. The intersection location error between two primitive pairs is defined as the average of the errors for the individual primitives:

$$\text{Primitive Location Error} = \frac{[1 - P(A)] + [1 - P(B)]}{2}$$

where P(A) and P(B) are the values of the probabilistic definition function (Figure 9.3) evaluated using the intersection location information of primitives A and B. When two primitives intersect at two points, there are four such P(·) calculations (for the two intersection points on A and the two on B), and the error is thus averaged over four values. The final intersection location error is defined as the average of the individual errors accrued from all pairs of primitives. Because intersection locations are described in relation to the drawing directions of the primitives, the above metric is sensitive to drawing direction. To achieve invariance to drawing direction, both the original drawing direction and its reverse are considered when computing the error, and for each primitive pair, the final error is taken as the smaller of the two values. This error is not computed for primitives that do not intersect.
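To make the mechanics of these metrics concrete, the following Python sketch implements the first three (primitive count, primitive type, and relative length). It is an illustration only, not the thesis implementation: the function names and the data layout (type strings with 'I' marking idle primitives, and per-primitive (m, σ) statistics for the definition) are our assumptions.

```python
import math

def probabilistic_definition(x, m, sigma):
    """The probabilistic definition function P(x): a flat-topped,
    Gaussian-like bump centered at the learned mean m."""
    return math.exp(-(x - m) ** 4 / (50.0 * sigma ** 4))

def primitive_count_error(n_u, n_d):
    """Metric 1: normalized difference in primitive counts, saturated at 1.0.
    n_u and n_d exclude idle nodes."""
    return min(1.0, abs(n_u - n_d) / min(n_u, n_d))

def primitive_type_error(u_types, d_types):
    """Metric 2: fraction of matched primitives whose types (line/arc)
    differ; matches involving idle primitives ('I') are excluded."""
    n = min(sum(t != 'I' for t in u_types), sum(t != 'I' for t in d_types))
    mismatches = sum(1 for u, d in zip(u_types, d_types)
                     if u != 'I' and d != 'I' and u != d)
    return mismatches / n

def relative_length_error(u_rel_lengths, d_stats):
    """Metric 3: average of 1 - P(x) over the matched primitives, where
    d_stats holds the learned (m, sigma) pair for each definition primitive."""
    n = min(len(u_rel_lengths), len(d_stats))
    return sum(1.0 - probabilistic_definition(x, m, s)
               for x, (m, s) in zip(u_rel_lengths[:n], d_stats[:n])) / n

# Reproduce the worked examples from the text:
print(primitive_count_error(10, 12))                    # 0.2
u = list("LLALLAAALLL") + ["I"] * 4
d = list("LALLLLAALLALLAA")
print(round(primitive_type_error(u, d), 2))             # 0.36
lengths = [10, 24, 30, 21, 15]
stats = [(9, 1), (18, 3.2), (37, 3), (17, 2), (16, 3)]
print(round(relative_length_error(lengths, stats), 3))  # 0.192
```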

9.2.4 Combining Error Metrics

Using the above six error metrics and their respective weights (Table 9.1), a cumulative dissimilarity score for each definition symbol is computed as follows:

$$\text{Dissimilarity Score} = \sum_{i=1}^{6} w_i \cdot E_i$$

where each $(E_i, w_i)$ pair represents an error metric and its corresponding weight. The definition symbol with the smallest dissimilarity score is declared the recognition result.
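As a minimal sketch of this final step (our own illustration; the weight values stand in for those of Table 9.1, which are not reproduced here):

```python
def dissimilarity(errors, weights):
    """Weighted sum of the six per-metric errors E_1..E_6, each in [0, 1]."""
    return sum(w * e for w, e in zip(weights, errors))

def recognize(error_vectors, weights):
    """Given {definition_name: [E_1..E_6]}, return the definition whose
    weighted error sum is smallest."""
    return min(error_vectors,
               key=lambda name: dissimilarity(error_vectors[name], weights))

weights = [1.0] * 6  # placeholder weights standing in for Table 9.1
errors = {"integrator": [0.0, 0.1, 0.2, 0.0, 0.1, 0.1],
          "gain":       [0.2, 0.4, 0.5, 0.5, 0.3, 0.4]}
print(recognize(errors, weights))  # -> integrator
```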

9.2.5 Handling Different Drawing Orders Using Stochastic Primitive Swapping

In the recognition approach described so far, graph matching is carried out by initially assigning each primitive of the unknown, in the order it was drawn, to the corresponding primitive of the definition. As a result, a good match between an unknown and a definition symbol could only be obtained if the primitives in the two symbols occurred in similar orders. However, using the recognizer with this restriction would require the user to remember the order in which the primitives must be drawn.

To allow for different drawing orders, different primitive assignments must be explored. In our system this is done by hypothetically changing the drawing order in the definition symbol and determining if the new order results in a better match with the unknown. However, instead of considering all possible orderings in an exhaustive manner, an iterative approach is taken in which a small set of different configurations is considered by semi-randomly swapping two primitives. After each swap, the graphs of the two symbols are compared, yielding a dissimilarity score. This process is repeated a fixed number of times (currently 100), and the dissimilarity score obtained from the best matching configuration is considered the score of that definition symbol. (While not implemented, the search could also be terminated when a certain number of iterations ceases to produce any improvement.)

The selection of the primitives to be swapped in the definition graph is based on the error of match in the intrinsic properties from the previous iteration. At every iteration of the algorithm, each primitive is given a chance to be swapped. The probability that a particular primitive is swapped is proportional to its error of match in the previous step. Hence, while any two primitives can be chosen to be swapped in a particular step, on average, primitives with the highest errors will be chosen for swapping. The name of the method (error-driven stochastic search) stems from this property. The swapping of primitives is implemented as a change of indices in each graph’s matrix representation (Figure 9.2c) and hence is computationally inexpensive. Overall, this approach is similar in spirit to Genetic Algorithms, in which a problem solution is viewed as a competition among a population of evolving candidate solutions [46, 1]. We have found the use of matching error to greatly accelerate the search process. Additionally, with the introduction of idle nodes, two graphs that originally contain different numbers of nodes can be compared, as the swapping algorithm allows the idle nodes to be interspersed between the original nodes. This elasticity proves crucial when the symbols under consideration are for the most part similar but differ by one or two stray primitives.
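The following Python sketch illustrates this swap loop under our assumptions: score_fn is a stand-in for the graph comparison of Section 9.2.3 and is assumed to return both the dissimilarity score and the per-primitive match errors that bias the next swap. The thesis does not specify whether unsuccessful swaps are kept or undone; this sketch keeps them and simply remembers the best configuration seen.

```python
import random

def stochastic_match(initial_order, score_fn, iterations=100):
    """Error-driven stochastic search over definition-primitive orderings.
    score_fn(order) -> (dissimilarity, per_primitive_errors). Primitives
    with high match error in the previous step are more likely to be
    chosen for swapping."""
    order = list(initial_order)
    score, errors = score_fn(order)
    best_order, best_score = list(order), score
    for _ in range(iterations):
        # Sample two positions with probability proportional to their match
        # error; the small epsilon keeps every primitive (including idle and
        # well-matched ones) eligible. i may equal j, giving a no-op swap.
        i, j = random.choices(range(len(order)),
                              weights=[e + 1e-6 for e in errors], k=2)
        order[i], order[j] = order[j], order[i]  # swap is just an index change
        score, errors = score_fn(order)
        if score < best_score:
            best_order, best_score = list(order), score
    return best_order, best_score
```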

9.3 Evaluation

Currently, we are in the process of conducting user studies to test the accuracy of this recognizer. So far, we have conducted informal experiments. While no quantitative data is available at this point, both the recognition accuracy and speed have proved to be quite promising, demonstrating the viability of the approach. Results of a formal user study shall be the focus of future publications.

Chapter 10

Limitations and Future Work

Obviously, this work is only a small step toward achieving truly natural and practical pen-based computer interaction, and there are many issues that remain unsolved. Some of these issues are not directly addressed by this study. For instance, our techniques are developed primarily for 2D sketches; hence, it is not clear whether they can be extended to sketches of 3D scenes. Likewise, our techniques are most suitable for interpreting “schematic” sketches, and are not particularly useful for sketches that are “artistic” in nature. Also, although it would be an interesting study, this work does not investigate how well our techniques can be employed in small hand-held devices, such as personal digital assistants. In such devices, the limited availability of computational resources, such as processor speed and RAM, will place additional constraints on the design of interaction techniques. Another area of interest may be the recognition of “static” input such as scanned sketches. In such cases, the lack of temporal data leading to the final sketch adds another level of complexity not addressed by this study. (Note that here we distinguish between static images of sketches and images of more structured, textual documents such as pages of a phone book. The latter form of input is already being studied under the names document image processing (DIP) and document image decoding (DID). However, to the best of our knowledge, there are no existing systems of this kind that can recognize, in real time, the types of informal sketches considered in this study.)

There are several immediate extensions to our system that could significantly improve its performance. For example, the current mark-group-recognize architecture does not exploit the interpretations of nearby objects when processing a cluster; such a consideration could significantly improve the overall recognition accuracy.

Consider, for example, the Simulink domain. In the absence of other symbols, it does not make much sense to connect the output of an integrator to the input of a differentiator, as this would not alter the original signal in any useful way. The system can use such forms of knowledge to rule out certain interpretations of one object after interpretations of other objects have been established. The most obvious way to encode this type of knowledge is through the use of a rule-based system. Indeed, our SimuSketch system makes use of a similar representation to encode contextual information regarding the number of arrows entering and leaving the clusters. However, more elegant and mathematically rigorous solutions can be formulated based on Markov Random Fields [44]. This framework provides a useful foundation for encoding spatial context in the processing of spatial problems, and has found many applications in such fields as image classification, image restoration and texture analysis. In our case, it can be used as a means to “maximize the joint probability of observing the interpreted symbols,” in light of domain knowledge. We plan to explore this possibility in the near future.

Another avenue we would like to explore is the effect of stroke beautification on the interpretation results. Stroke beautification typically involves smart cosmetic alterations to the raw strokes as the user draws. For example, nearly vertical strokes can be interpreted as perfect vertical line segments and displayed as such, or nearly parallel or perpendicular lines can be altered so that they conform to these constraints. Currently, our systems work either directly from users’ raw strokes, or replace the strokes with approximate line and arc segments prior to recognition. However, the latter operation is based solely on the immediate raw ink and is not informed by higher-level perceptual principles. Stroke beautification would not only clean up unwarranted artifacts in the sketch arising from the imprecision of the human hand, but could also identify perceptually dominant attributes of the strokes, such as perpendicularity or enclosure. We suspect that this could significantly aid symbol recognizers. Igarashi et al. [40] present a set of techniques that can be useful in this endeavor. Similarly, more psychologically based approaches, such as Gestalt principles, or those presented by Saund [65], may be considered.
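As a small, concrete illustration of the orientation snapping mentioned above (a hypothetical sketch of our own, not part of SimuSketch or VibroSketch; the 5-degree tolerance is an assumed parameter):

```python
import math

def beautify_segment(x1, y1, x2, y2, tol_deg=5.0):
    """Snap a nearly horizontal or nearly vertical line segment to an exact
    one; other orientations are left unchanged. Real beautification (cf.
    Igarashi et al. [40]) would also enforce constraints such as parallelism
    and perpendicularity between strokes."""
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    if angle < tol_deg or angle > 180.0 - tol_deg:   # nearly horizontal
        y_mid = (y1 + y2) / 2.0
        return x1, y_mid, x2, y_mid
    if abs(angle - 90.0) < tol_deg:                  # nearly vertical
        x_mid = (x1 + x2) / 2.0
        return x_mid, y1, x_mid, y2
    return x1, y1, x2, y2

print(beautify_segment(0, 0, 100, 3))  # -> (0, 1.5, 100, 1.5)
```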

Chapter 11

Conclusions

This thesis presents a new computational model for automatically interpreting hand-drawn sketches of schematic diagrams. Our model employs a multi-level parsing and recognition architecture. Our approach allows users to continuously sketch without having to indicate when one symbol ends and a new one begins. Additionally, it does not restrict the number of strokes in a symbol, or the order in which they are drawn. Hence, it eliminates many of the unnatural constraints imposed by existing sketch understanding systems, such as limitations to single-stroke objects, or the need for user involvement in separating different symbols.

Our approach differs from earlier techniques in that it acts selectively in the early stages to identify a small set of easily recognizable “marker symbols.” These markers anchor a spatial analysis which parses the uninterpreted strokes into distinct clusters, each representing a single symbol. Finally, a symbol recognizer, informed by clustering and domain specific knowledge, is used to find the best interpretations of the strokes. We have argued that this approach has the advantage of quickly guiding the recognizer in the right direction while preventing unfruitful explorations.

This work emphasizes that techniques aimed at uncovering the underlying structure of a sketch, such as preliminary recognition and stroke clustering, can have a significant beneficial impact on computational efficiency and recognition accuracy. The computational cost of the resulting system is low enough to be suitable for interactive sketch understanding.


To demonstrate our techniques, we have built SimuSketch, a sketch-based interface for Matlab’s Simulink package, and VibroSketch, a sketch-based interface for analyzing vibratory mechanical systems. In both systems, users can construct functional engineering models by simply sketching them on a computer screen. Users can then interactively manipulate their sketches to change model parameters and run simulations. Our user studies have indicated that even novice users can effectively utilize these systems to solve real engineering problems, without having to know much about the underlying recognition techniques. User feedback has shown that, overall, most users had highly favorable opinions of our prototype systems, and found them easy and straightforward to use. To enhance the user’s experience with these systems, however, it may be necessary to adjust some of our assumptions about drawing styles, and improve the recognizers used in the preliminary recognition step.

Another key contribution of this thesis is in the field of symbol recognition. Three trainable, domain-independent symbol recognizers have been developed. The first uses quantized bitmap images as the primary representation. One advantage of this recognizer over traditional ones is that it can learn new definitions from single prototype examples. It is also well suited for recognizing “sketchy” symbols such as those with heavy over-tracing, missing or extra segments and different line styles. To achieve rotation invariance, it uses a novel polar coordinate analysis that avoids expensive rotations. The recognizer is versatile in that we use it both for graphical symbol recognition and digit recognition.

The second recognizer works from a set of features extracted from input symbols. These features encode the geometric properties of the line and arc segments fitted to the raw strokes comprising the shape. From a set of training examples, the recognizer learns the statistical distributions of these features in the form of multivariate Gaussian probability functions. One advantage of this recognizer is that, once it is trained, recognition is very fast.

The last recognizer works from a structural representation of the line and arc segments making up a symbol. Various geometric relationships among these primitives are encoded in an attributed graph, providing good resolution for describing even the smallest features in a symbol. This approach advances a previous approach by encoding not only the “frequently occurring” properties of the training examples but also their “statistical distributions.” This helps distinguish features that are unique characteristics of a symbol from those that are less representative. Also, to facilitate graph matching (which is typically expensive), we have developed an error-driven stochastic algorithm that can quickly converge to a good enough matching configuration, while avoiding an exhaustive search to find the best matching configuration.

Although the techniques presented in this thesis are demonstrated in two domains, we speculate that they are applicable to other domains as well, such as electrical circuit diagrams, linkage design tools, and user interface design software. We believe many useful marker symbols can be identified in such domains, setting the stage for the application of our mark-group-recognize architecture. Additionally, we believe our symbol recognizers form a useful and practical suite of techniques for the recognition of multi-stroke symbols, and hence we hope other researchers in the community can make use of them.

While useful for the practicing engineer, our techniques are likely to have distinct advantages in engineering education. By their nature, our prototype systems are ideally suited for electronic whiteboard applications and thus can be readily integrated into the classroom environment. In the near future, we plan to explore this possibility with pilot studies.

Bibliography

[1] http://www.aaai.org/aitopics/html/genalg.html.

[2] Fevzi Alimoglu and Ethem Alpaydin. Combining multiple representations for pen-based handwritten digit recognition. ELEKTRIK: Turkish Journal of Electrical Engineering and Computer Sciences, 9(1):1–12, 2001.

[3] Christine Alvarado. A Natural Sketching Environment: Bringing the Computer into Early Stages of Mechanical Design. Master's thesis, MIT, 2000.

[4] Christine Alvarado. Dynamically constructed Bayesian networks for sketch understanding. Technical report, MIT Project Oxygen Student Workshop Abstracts, 2003.

[5] Ajay Apte, Van Vo, and Takayuki Dan Kimura. Recognizing multistroke geometric shapes: An experimental evaluation. In UIST '93, pages 121–128, 1993.

[6] Oliver Bimber, L. M. Encarnação, and Andre Stork. A multi-layered architecture for sketch-based interaction within virtual environments. Computer Graphics, 24(6), 2000.

[7] R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052, 1993.

[8] Chris Calhoun, Thomas F. Stahovich, Tolga Kurtoglu, and Levent Burak Kara. Recognizing multi-stroke symbols. In AAAI Spring Symposium on Sketch Understanding, pages 15–23, 2002.


[9] Kwok-Wai Cheung, Dit-Yan Yeung, and Roland T. Chin. Bidirectional deformable matching with application to handwritten character extraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1133–1139, 2002.

[10] F. Cohen, Z. Huang, and Z. Yang. Invariant matching and identification of curves using B-spline curve representation. IEEE Transactions on Image Processing, 4(1):1–10, 1995.

[11] Scott D. Connell and Anil K. Jain. Template-based online character recognition. Pattern Recognition, 34(1):1–14, 2001.

[12] Gennaro Costagliola and Vincenzo Deufemia. Visual language editors based on LR parsing techniques. In 8th International Workshop on Parsing Technologies (IWPT'03), Nancy, France, 2003.

[13] Greg S. Cox. Template matching and measures of match in image processing. Review, University of Cape Town, South Africa, 1995.

[14] Olivier Cuisenaire. Distance Transformations: Fast Algorithms and Applications to Medical Image Processing. PhD thesis, Université Catholique de Louvain, Belgium, 1999.

[15] Sven Dickinson, Marcello Pelillo, and Ramin Zabih. Introduction to the special section on graph algorithms in computer science. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10):1049–1052, 2001.

[16] Pedro Domingos and Michael J. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Machine Learning, 29:103–130, 1997.

[17] Marie-Pierre Dubuisson and Anil K. Jain. A modified Hausdorff distance for object matching. In 12th International Conference on Pattern Recognition, pages 566–568, Jerusalem, Israel, 1994.


[18] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2nd edition, 2001.

[19] Lee D. Erman, Frederick Hayes-Roth, Victor R. Lesser, and D. Raj Reddy. The Hearsay-II speech understanding system: Integrating knowledge to resolve uncertainty. Computing Surveys, 12(2):213–253, 1980.

[20] Michael Fligner, Joseph Verducci, Jeff Bjoraker, and Paul Blower. A new association coefficient for molecular dissimilarity. In The Second Joint Sheffield Conference on Chemoinformatics, Sheffield, England, 2001.

[21] Darren R. Flower. On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Science, 38:379–386, 1998.

[22] Manuel J. Fonseca, Cesar Pimentel, and Joaquim A. Jorge. CALI: An online scribble recognizer for calligraphic interfaces. In AAAI Spring Symposium on Sketch Understanding, pages 51–58, 2002.

[23] Manuel J. Fonseca and Joaquim A. Jorge. Using fuzzy logic to recognize geometric shapes interactively. In Proceedings of the 9th Int. Conference on Fuzzy Systems (FUZZ-IEEE 2000), San Antonio, USA, 2000.

[24] Kenneth D. Forbus, Jeffrey Usher, and Vernell Chapman. Qualitative spatial reasoning about sketch maps. In IAAI-2003, Acapulco, Mexico, 2003.

[25] Leslie Gennari, Levent Burak Kara, and Thomas F. Stahovich. Combining geometry and domain knowledge to interpret hand-drawn diagrams. In AAAI Fall Symposium Series 2004: Making Pen-Based Interaction Intelligent and Natural, 2004.

[26] W. Eric L. Grimson. The combinatorics of heuristic search termination for object recognition in cluttered environments. IEEE PAMI, 13(9):920–935, 1991.

[27] M. D. Gross, C. Zimring, and E. Do. Using diagrams to access a case base of architectural designs. Artificial Intelligence in Design, pages 129–144, 1994.


[28] Mark Gross and Ellen Y. Do. The design amanuensis: An instrument for multimodal design capture and playback. CAAD Futures, pages 1–13, 2001.

[29] Mark D. Gross. Recognizing and interpreting diagrams in design. In ACM Conference on Advanced Visual Interfaces, pages 88–94, 1994.

[30] I. Guyon, P. Albrecht, Y. Le Cun, J. Denker, and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2):105–119, 1991.

[31] Tracy Hammond. Tahuti: A geometrical sketch recognition system for UML class diagrams. In AAAI Spring Symposium on Sketch Understanding, pages 59–66, 2002.

[32] Tracy Hammond and Randall Davis. Automatically transforming symbolic shape descriptions for use in sketch recognition. In 19th National Conference on Artificial Intelligence (AAAI-2004), 2004.

[33] Tracy Hammond and Randall Davis. LADDER: A language to describe drawing, display, and editing in sketch recognition. In 2003 International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003.

[34] Tin Kam Ho, Jonathan J. Hull, and Sargur N. Srihari. On multiple classifier systems for pattern recognition. In 11th International Conference on Pattern Recognition, pages 84–87, The Netherlands, 1992.

[35] Tin Kam Ho, Jonathan J. Hull, and Sargur N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66–75, 1994.

[36] Jason I. Hong and James A. Landay. SATIN: A toolkit for informal ink-based applications. In ACM UIST 2000 User Interfaces and Software Technology, pages 63–72, San Diego, CA, 2000.

[37] Heloise Hse and A. Richard Newton. Sketched symbol recognition using Zernike moments. Technical report, EECS, University of California, 2003.


[38] Y. S. Huang and C. Y. Suen. A method for combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):90–94, 1995.

[39] Z. Huang and F. Cohen. Affine-invariant B-spline moments for curve matching. IEEE Transactions on Image Processing, 5(10):1473–1480, 1996.

[40] T. Igarashi, S. Matsuoka, S. Kawachiya, and H. Tanaka. Interactive beautification: A technique for rapid geometric design. In UIST '97, pages 105–114, 1997.

[41] David W. Jacobs. The use of grouping in visual object recognition. Technical Report 1023, MIT AI Lab, 1988.

[42] F. Kimura and M. Shridhar. Handwritten numerical recognition based on multiple algorithms. Pattern Recognition, 24(10):969–983, 1991.

[43] T. D. Kimura, A. Apte, and S. Sengupta. A graphic diagram editor for pen computers. Software Concepts and Tools, pages 82–95, 1994.

[44] R. Kindermann and J. Snell. Markov Random Fields and their Applications. American Mathematical Society, Providence, 1980.

[45] Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.

[46] John R. Koza. Genetic programming. In James G. Williams and Allen Kent, editors, Encyclopedia of Computer Science and Technology, pages 29–43. Marcel Dekker, 1998.

[47] K. Kuczun and M. D. Gross. Local area network tools and tasks. In ACM Conference on Designing Interactive Systems (DIS 97), pages 215–221, 1997.

[48] Tolga Kurtoglu and Thomas F. Stahovich. Interpreting schematic sketches using physical reasoning. In AAAI Spring Symposium on Sketch Understanding, pages 78–85, 2002.


[49] Ernst Kussul and Tatyana Baidyk. Improved method of handwritten digit recognition tested on MNIST database. In 15th International Conference on Vision Interface, Calgary, Canada, 2002.

[50] James A. Landay and Brad A. Myers. Sketching interfaces: Toward more human interface design. IEEE Computer, 34(3):56–64, 2001.

[51] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, pages 53–60, Paris, 1995.

[52] Seong-Whan Lee. Recognizing hand-drawn electrical circuit symbols with attributed graph matching. In H. S. Baird, H. Bunke, and K. Yamamoto, editors, Structured Document Image Analysis, pages 340–358. Springer-Verlag, 1992.

[53] Howard Wing Ho Leung. Representations, Feature Extraction, Matching and Relevance Feedback for Sketch Retrieval. PhD thesis, Carnegie Mellon University, 2003.

[54] James Lin, Mark W. Newman, Jason I. Hong, and James A. Landay. DENIM: Finding a tighter fit between tools and practice for web site design. In CHI Letters: Human Factors in Computing Systems, pages 510–517. ACM Press, 2000.

[55] Jennifer Mankoff, Gregory D. Abowd, and Scott E. Hudson. OOPS: A toolkit supporting mediation techniques for resolving ambiguity in recognition-based interfaces. Computers and Graphics, 24(6):819–834, 2000.

[56] Nicholas E. Matsakis. Recognition of Handwritten Mathematical Expressions. Master's thesis, MIT, 1999.

[57] Erik G. Miller, Nicholas E. Matsakis, and Paul A. Viola. Learning from one example through shared densities of transforms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 464–471, 2000.


[58] Shankar Narayanaswamy. Pen and Speech Recognition in the User Interface for Mobile Multimedia Terminals. PhD thesis, University of California at Berkeley, 1996.

[59] Krishna S. Nathan, H. S. M. Beigi, J. Subrahmonia, G. J. Clary, and H. Maruyama. Real-time on-line unconstrained handwriting recognition using statistical methods. In International Conference on Acoustics, Speech and Signal Processing, pages 2619–2622, 1995.

[60] Omer Faruk Ozer, Oguz Ozun, C. Oncel Tuzel, Volkan Atalay, and A. Enis Cetin. Vision-based single-stroke character recognition for wearable computing. IEEE Intelligent Systems and Applications, 16(3):33–37, 2001.

[61] Chris Raymaekers, Gert Vansichem, and Frank Van Reeth. Improving sketching by utilizing haptic feedback. In AAAI Spring Symposium on Sketch Understanding, pages 113–117, Stanford University, 2002. AAAI Press.

[62] Eric Reither, Fady Said, Ying Li, and Ching Suen. Map symbol recognition using directed Hausdorff distance and a neural network classifier. In XVIIIth Congress of ISPRS, 1996.

[63] Dean Rubine. Specifying gestures by example. Computer Graphics, 25:329–337, 1991.

[64] W. J. Rucklidge. Efficient Visual Recognition Using the Hausdorff Distance. Number 1173 in Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1996.

[65] Eric Saund, James Mahoney, David Fleet, Dan Larner, and Edward Lank. Perceptual organisation as a foundation for intelligent sketch editing. In AAAI Spring Symposium on Sketch Understanding, pages 118–125, 2002.

[66] Tevfik Metin Sezgin. Generic and HMM-based approaches to freehand sketch recognition. Technical report, MIT Project Oxygen Student Workshop Abstracts, 2003.


[67] Michael Shilman, Hanna Pasula, Stuart Russell, and Richard Newton. Statistical visual language models for ink parsing. In AAAI Spring Symposium on Sketch Understanding, pages 126–132, 2002.

[68] Dong-Gyu Sim, Oh-Kyu Kwon, and Rae-Hong Park. Object matching algorithms using robust Hausdorff distance measures. IEEE Transactions on Image Processing, 8(3):425–429, 1999.

[69] Thomas F. Stahovich. SketchIT: A sketch interpretation tool for conceptual mechanism design. Technical report, MIT AI Laboratory, 1996.

[70] Ivan E. Sutherland. Sketchpad: A man-machine graphical communication system. PhD thesis, MIT, 1963.

[71] O. D. Trier and Anil K. Jain. Feature extraction methods for character recognition: A survey. Pattern Recognition, 29(4):641–662, 1996.

[72] Jack D. Tubbs. A note on binary template matching. Pattern Recognition, 22(4):359–365, 1989.

[73] J. R. Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM, 23(1):31–42, 1976.

[74] Peter Willett. Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38:983–996, 1998.

[75] H. Yasuda, K. Takahashi, and T. Matsumoto. A discrete HMM for online handwriting recognition. International Journal of Pattern Recognition and Artificial Intelligence, 14(5):675–688, 2000.