A System for Assembling Specialized Textbooks from a Pool of Documents

Helena Ahonen, Barbara Heikkinen, Oskari Heinonen and Pekka Kilpeläinen
University of Helsinki, Department of Computer Science

We consider assembling specialized, customized textbooks from a large collection of SGML documents. Our prototype assembly framework allows the user to select parts of the documents in the collection and to form a new structured document. A user's order is processed as follows. (1) The user fills in and submits an HTML form. (2) The form is processed and the parts to be included in the assembled book are selected automatically. The choice depends both on the user's input and on a few heuristic rules. (3) The texts and images are retrieved from the pool and the layout is created. (4) The books are printed and delivered to the client. In addition, we describe our experience of converting MS Word documents into tagged SGML format by presenting both the conversion architecture and the lessons learned.

Categories and Subject Descriptors: I.7.2 [Text Processing]: Document Preparation; J.7 [Computers in Other Systems]: Publishing; K.3.1 [Computers and Education]: Computer Uses in Education

General Terms: Documentation, Management

Additional Key Words and Phrases: Document assembly, structured documents, educational material, SGML

1. INTRODUCTION

In an on-going research and development project called Structured and Intelligent Documents (SID) [Ahonen et al. 1996] we study different aspects of document assembly. One application area we have considered is assembling educational material. The overall goal of our partner, the National Board of Education, is to create a material pool that contains, for instance, textbooks, articles, exercise collections, and simulations. The pool can be used in multiple ways. The publisher can print customized textbooks for various school branches and levels. Additionally, teachers can form tailored compositions of WWW or CD-ROM delivered material (e.g., summaries, slides and tests) for their own class. Students could also use the pool directly for problem-oriented learning, as a data bank, or as an active textbook with computer-assisted learning features.

Recently, several publishers and universities have launched their own services for delivering customized textbooks [NCPP 1997; Primis 1997; CCW 1997]. All the approaches seem to have a fairly static set of document fragments that can be combined to form a book, and their user interfaces simply allow the user to choose a selection of these parts. We claim that dynamic compositions are needed, and to achieve this we have decided to use SGML [ISO 1986] for the management of the material. Although one of the strengths of SGML technology is that it allows several representations and formats to be generated from the same documents, multiuse of documents by dynamic assembly of structure elements is still an unsolved problem, and hardly any techniques or tools exist. The necessity of assembly is, however, clearly seen and considered to be of great importance [Kimber 1996; McFadden 1996].

This work was supported by the Finnish Technology Development Centre (TEKES) as part of the Electronic Printing and Publishing research programme. Address: Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland.

We have converted three textbooks on control engineering from MS Word format into tagged SGML documents. With the help of the authors, we also added some supplementary information to the documents. At the same time an assembly system was created. In the first prototype a teacher can order customized textbooks via WWW. After the HTML form is received, its contents are processed automatically, and the elements to be included in the textbook are chosen using a few heuristic rules. The table of contents and some other information about the assembled book are returned to the user immediately, so that he/she can check whether the contents meet his/her expectations. The specification of the parts to be included is then submitted to the assembly process, which retrieves the texts and images from the collections and creates the layout. Finally, a PostScript file is sent to the printing house, which produces the books and delivers them to the teacher.

The rest of this paper is organized as follows. The conversion of textbooks from MS Word into SGML is described in Section 2, including the conversion architecture and some lessons learned. Section 3 presents the structure of the books used, and Section 4 presents the implemented assembly prototype and ideas for a planned assembly system. Finally, Section 5 gives some conclusions.

2. CONVERSION

[Figure 1: diagram of the conversion flow, steps numbered 1 to 6. Labels on the text path: MS Word files, RTF, DynaTag, Rainbow DTD, OmniMark, Book DTD SGML. Labels on the equation path: MS Word & MathType, TeX, Tex2mif, MIF. Labels on the figure path: Windows Clipboard, WMF, original format figures, graphics files. Target of all paths: FrameMaker+SGML.]

Fig. 1. Conversion architecture.

The first components of the material pool are three textbooks on control engineering. The books, originally prepared using MS Word, contain many equations and technical pictures. Creating a structured SGML version required laborious conversion steps using tools such as MathType, DynaTag, OmniMark, and FrameMaker+SGML (Figure 1). The basic structure for the book is the Book DTD of ISO 12083 [ISO 1994], extended with element types for exercises, answers, examples, formulas, definitions and clauses.

MS Word is a trademark of Microsoft Corporation. MathType is a trademark of Design Science, Inc. DynaTag is a trademark of Electronic Book Technologies, Inc. OmniMark is a trademark of OmniMark Technologies Corporation. FrameMaker+SGML is a trademark of Adobe Systems Incorporated.

In the worst case, the conversion process is as hard as writing the material from scratch using a structured editor. The amount of manual editing in the conversion process depends largely on how the author uses the word processing system during the writing process. In other words, the more formal and consistent the original data, the more straightforward it is to convert. On the other hand, commercial conversion products do not offer feasible solutions for converting graphics and mathematical equations automatically. Moreover, there is no real SGML standard for representing equations. Finally, the richer and more complicated the required SGML structure is, the more complex and time-consuming the conversion from word processing data will be.

In the following we present some problematic cases we encountered in our material. First of all, text and graphic objects of the pictures were mixed in some Word documents. Thus, we had to re-edit hundreds of pictures in Word, which was very laborious (Figure 1; (1)). Additionally, the conversion of equations was difficult. We used the MathType program with Word to convert Word equations to TeX format. Next, we used a filter called Tex2mif (Figure 1; (4)), which we had to implement ourselves in order to get the equations into FrameMaker+SGML in editable format. There were hundreds of pictures and thousands of equations; roughly half of the material was extremely hard to convert.

The word processor styles (such as heading, normal paragraph, caption and bold) were used to recognize document structures (Figure 1; (2)). Unfortunately, the use of styles was inconsistent within the original documents, if they were used at all. Therefore, content recognition was used in addition to style-based recognition. It revealed that different authors had different ways to express, e.g., exercises, examples and equation numbers. All those different approaches had to be considered. Furthermore, it is fairly easy to detect the beginning of an example automatically, but it is impossible to decide automatically where an example ends and a new paragraph begins. Hence, closing tags had to be added manually. Some authors did not use the cross-reference mechanism of the word processing system, or even its table and list tools; for instance, tables were represented as picture objects within documents. Also, in some cases words were hyphenated using hyphen signs (`-'), which had to be removed manually.

Word document styles and contents were recognized using the DynaTag program (Figure 1; (2)).
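The combination of style-based and content-based recognition described above can be illustrated with a minimal Python sketch. It is not DynaTag's rule language, and the style names, element names and text patterns are hypothetical placeholders chosen for illustration only: a paragraph is classified by its style when the style is specific, and by its text otherwise.

```python
import re

# Hypothetical mapping from word-processor style names to element types.
STYLE_TO_ELEMENT = {
    "Heading 1": "chapter-title",
    "Heading 2": "section-title",
    "Caption": "caption",
    "Normal": "p",
}

# Hypothetical content patterns, used when the style is missing or uninformative.
CONTENT_PATTERNS = [
    (re.compile(r"^(Example|Esimerkki)\s+\d+", re.IGNORECASE), "example"),
    (re.compile(r"^(Exercise|Tehtävä)\s+\d+", re.IGNORECASE), "exercise"),
    (re.compile(r"\(\d+(\.\d+)*\)\s*$"), "formula"),  # trailing equation number
]


def recognize(style, text):
    """Classify one paragraph: specific style first, then content, then plain paragraph."""
    element = STYLE_TO_ELEMENT.get(style)
    if element is not None and element != "p":
        return element
    for pattern, name in CONTENT_PATTERNS:
        if pattern.search(text):
            return name
    return "p"
```

Such rules only mark where a structure begins. As noted above, the end of an example cannot be decided this way, so closing tags still have to be added by hand.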
DynaTag generated intermediate SGML documents conforming to EBT's Rainbow DTD. Next, the OmniMark transformation language was used to transform and refine the document structure according to our Book DTD (Figure 1; (3)). The converted SGML text, MIF equations, and pictures originally produced by graphics programs (such as Designer) were imported into FrameMaker+SGML (Figure 1; (5)), and finally those pictures that had to be edited in Word were imported into FrameMaker+SGML via the Windows Clipboard (Figure 1; (6)). The final editing and page layout were done in FrameMaker+SGML, which was then used to generate the SGML text, figure and equation files.

3. STRUCTURE

The major hierarchical structure elements of a book are chapters, sections, subsections and paragraphs. We have used as a basis for our document type definition the Book DTD of ISO 12083 [ISO 1994], extended with element types for exercises, answers, examples, formulas, definitions and clauses.

The book is enriched with internal supplementary information. Manual seeding, based on information from the authors, was necessary to classify the fragments of the material into appropriate categories. Each text paragraph is classified according to level (vocational, college, university), branch of specialization (metal industry, process industry, data communications, automation, electric power, civil engineering) and didactic form of the contents (introductory, base material, advanced, applied, summary, auxiliary, examination). Keyword elements are added automatically to the paragraphs containing occurrences of the terms that have a definition (normally recognizable as boldface words or phrases) in the book.

The categorization is represented in the text paragraphs and their superstructures as SGML attributes. Attribute values are propagated from the upper levels of the structure, if not overridden in the element. Figure 2 is an example of the supplementary information attached to the markup of the material.


SÄÄTÖLOHKOT TAAJUUSTASOSSA ...
Impedanssi-kompleksinen vastus ... Impedanssin sisältämä virtapiiri.

kokonaisimpedanssi Laske sarjakytkennän kokonaisimpedanssi U = 5 V

kokonaisimpedanssi Kokonaisimpedanssi on sarjaankytkettyjen impedanssien summa, samoin kuin tasavirtavastusten summa on osavastusten summa.

...

Fig. 2. An example of internal supplementary information.
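The propagation rule stated above (attribute values are inherited from upper levels unless overridden in the element) can be sketched in Python as follows. This is an illustration under assumptions, not the system's implementation: the `Element` class and the `select` helper are hypothetical, and only the attribute names and values (`author=aJS`, `level=b`, `level=c`, `content=BB`, `content=CC`) are taken from the example in Figure 2.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical in-memory stand-in for an SGML element."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Element"] = field(default_factory=list)


def effective_attrs(
    element: Element, inherited: Optional[Dict[str, str]] = None
) -> Iterator[Tuple[Element, Dict[str, str]]]:
    """Yield each element with its attributes after inheriting from its ancestors."""
    merged = dict(inherited or {})
    merged.update(element.attrs)  # values set on the element override inherited ones
    yield element, merged
    for child in element.children:
        yield from effective_attrs(child, merged)


def select(root: Element, **wanted: str) -> List[Element]:
    """Return the elements whose effective attributes match all requested values."""
    return [
        el for el, attrs in effective_attrs(root)
        if all(attrs.get(key) == value for key, value in wanted.items())
    ]


# A chapter shaped like the one in Figure 2 (element names are placeholders).
chapter = Element("chapter", {"author": "aJS", "level": "b"}, [
    Element("section", {"content": "BB"}, [
        Element("p"),
        Element("example", {"level": "c", "content": "CC"}),
    ]),
])

base_college = select(chapter, level="b", content="BB")
university = select(chapter, level="c")
```

Here the plain paragraph inherits `level=b` from the chapter and `content=BB` from the section, while the example overrides both values, so a query for college-level base material returns the section and its paragraph but not the example.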

The chapter shown was authored by Jari Savolainen (author=aJS). It is recommended for students at the college level (level=b). The content of Section 4.1 is base material (content=BB). The example in the section is directed at students at the university level (level=c) and it contains advanced material (content=CC). Keywords and their frequencies are listed explicitly. Formulas are in external entity files (attribute entity of element eqbody) in MIF format (they can also be represented in picture formats, e.g., TIFF).

Using the added markup, we can easily extract, e.g., base material for college level students, or exercises for university level students. Search features of existing commercial SGML browsers such as DynaText or Panorama Pro support this kind of querying.

We chose to use a modification of a standard DTD to ensure that most elements we would meet in the original textbooks could be found in it. This DTD, however, is unnecessarily large and complicated. In particular, when further documents in the pool are to be prepared using SGML tools from the very beginning, it is essential to design a new DTD for the authors. This DTD should be simple and, if possible, should guide the authors to write text that is easy to assemble.

4. ASSEMBLY

4.1 Current system

We have built a prototype system that lets teachers order customized control engineering textbooks. An overview of the system can be seen in Figure 3. The order is processed as follows.

(1) Order. The client fills in an HTML form and submits it.
(2) Processing of order. The order form is processed. The composition of the assembled document is created based on the input from the client and some heuristic rules. The table of contents is returned to the client. If the client confirms the order, the specification of the composition is submitted to the assembly process.
(3) Assembly. According to the input for the assembly, which is represented by a list of file names, the assembly process retrieves the desired texts and images from the SGML collection and composes the new textbook. Thereafter, the textbook is formatted and the layout is manually checked. Finally, a PostScript file is submitted to the printing house.
(4) Printing. The printing house produces the textbooks and delivers them to the client.

DynaText is a trademark of Electronic Book Technologies, Inc. Panorama Pro is a trademark of SoftQuad, Inc.

[Figure 3: diagram of the ordering workflow. Actors: Client, WWW service provider, Publisher, Technical editor, Printing house. Components: WWW browser, WWW server, order form, form processing, assembly, layout, invoicing, printing. Data flows: form data, table of contents, assembly rules and content information, assembly order (email), invoicing data (email), invoice, SGML texts and images, assembled SGML, assembled book in PS (email), tailored textbook.]

Fig. 3. Overview of the system.

As a user interface for the teachers we use an HTML form (Figure 4). The form contains a table of contents for each of the three books. The parts that can be selected are chapters, sections, subsections, and, in some cases, author-defined sequences consisting of several paragraphs. The columns in the table of contents are (from left to right):

(1) Choice. By checking choice buttons the user selects the corresponding structures of the books. The user can select whole chapters, or just some sections, subsections, or predefined paragraph sequences.
(2) Title or beginning of text. The original chapter, section or subsection title, or the beginning of the text of the paragraph sequence.
(3) Page count. Length of the part.
(4) Content. For example, `johdanto' (introductory), `perusaines' (base material), `soveltava' (applied).
(5) Level. For example, `ammattioppilaitos' (vocational).

The rightmost column indicates the lowest school level for which the part is recommended. At the moment this knowledge is only given to the teacher as a hint; it is not used in any automated way. As the pool does not yet contain any branch-specific material, we have not utilized the values of the branch attribute.


Fig. 4. Order form.

The six choice buttons above the table are shortcuts. They enable the user to select multiple items with one click. The first button selects all the material, and the rest select all parts of the named content (e.g., `johdanto'). This is implemented with JavaScript. This selection approach could be extended to other attributes as well. As the selections have several dimensions, it is not at all obvious how all the choices should be presented to the client; the order form easily becomes too complicated. One solution would be to offer predefined assemblies as a starting point: some parts would be pre-selected but the user could change the selections.

In order to achieve sensible results, the assembly has to utilize heuristics which guarantee the inclusion of all necessary fragments. For example, if only some sections of a chapter are selected, then the title of the chapter and its introductory paragraphs are included as well.

If the ordered books are delivered on paper, the problem of predictability arises (since attributes and heuristics are used). Hence, the teacher should be able to check the contents of the result. At the moment, after the teacher has submitted the form, the ordering system immediately returns the table of contents for the assembled textbook, as well as the number of pages and the price of the order. We also assume that the teacher has a paper copy of all the original books when he/she selects the parts to be included.
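The inclusion heuristic and the summary returned to the teacher can be sketched as follows. This is a minimal Python illustration under assumptions, not the prototype's code: the `Part` record, its field names, the identifiers and the sample titles and page counts are all hypothetical, and the price calculation is left out because its rules are not described here.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Part:
    part_id: str   # hypothetical identifier of a selectable part
    chapter: str   # identifier of the enclosing chapter
    kind: str      # "title", "intro" or "section"
    title: str
    pages: int


def complete_selection(parts: List[Part], selected: Set[str]) -> List[Part]:
    """Add the title and introductory paragraphs of every chapter that has a selected part."""
    touched = {p.chapter for p in parts if p.part_id in selected}
    return [
        p for p in parts  # document order is preserved
        if p.part_id in selected
        or (p.chapter in touched and p.kind in ("title", "intro"))
    ]


def summarize(chosen: List[Part]) -> Tuple[List[str], int]:
    """Return the table of contents and the total page count of the assembly."""
    return [p.title for p in chosen], sum(p.pages for p in chosen)


# Hypothetical sample data for two chapters.
BOOK = [
    Part("4.t", "4", "title", "4 Control blocks", 1),
    Part("4.i", "4", "intro", "Introduction to chapter 4", 2),
    Part("4.1", "4", "section", "4.1 Impedance", 6),
    Part("4.2", "4", "section", "4.2 Transfer functions", 8),
    Part("5.t", "5", "title", "5 Stability", 1),
    Part("5.1", "5", "section", "5.1 Criteria", 5),
]

chosen = complete_selection(BOOK, {"4.2"})
contents, page_count = summarize(chosen)
```

Selecting only one section of chapter 4 pulls in that chapter's title and introduction, while chapter 5 is left out entirely. Because the result can differ from what was ticked on the form, the table of contents is worth showing to the teacher before the order is confirmed.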


Instead of the table of contents, the whole assembled textbook could be sent to the client, e.g., by email. However, the client should not get the SGML source text, because that would make it possible to copy it to the students for free, or even to modify the text. Hence, SGML browsers like Panorama Pro cannot be used: there is no way to restrict the user's access to the source. One solution would be to create a PDF file from FrameMaker+SGML. Naturally, it is still possible to print and copy the text, but this can be made harder by damaging the pages slightly, e.g., with a grey overprint `Draft'. Unfortunately, some schools might not yet have computers powerful enough to handle large files.

After the confirmation of the order, a list of file names is submitted to the assembly process. The assembly retrieves the required files and concatenates them to form one SGML document. As the DTD of all the assembled textbooks is the same as that of the originals, it is possible to create the formatting layout nearly automatically using FrameMaker+SGML. Due to some problematic cases, e.g., improper page breaks, a manual inspection is still needed.

Finally, the textbooks are produced by a printing house. In our system a preprinted cover page is used, on which the personalized headings are printed. The printing house delivers the books to the client.

4.2 Further development

As noted above, it is not at all trivial to design a user interface and functionality for an assembly system. Clearly, it would be better if a client could experiment iteratively with the collection and see the formatted layout as well. In fact, printed books need not be produced at all: the system can be seen as a dynamic electronic textbook. However, in order to be more useful and appealing than a printed book, an electronic textbook should have some additional features. Clearly, various query facilities are a great benefit, but integrating simulations, videos, and computer-aided learning, for instance, should also be considered. Still, we maintain that text content is important; a good compilation from a large collection is always invaluable.

As we have seen, in the current system the user selects chapters and sections by marking them in the table of contents. This is not feasible if the document collection is large. Hence, our new system under development allows four ways to select elements to be included in the assembly. The first one is a rather standard search that will include full text search and conditions on the structure and attribute values of documents. The second way is based on Scatter/Gather clustering [Cutting et al. 1992]. The user-controlled iterative clustering process forms a classification of the material, and gives the user an extracted view of the documents. This is especially valuable when the user is dealing with an unfamiliar collection of documents; traditional query models do not help the user to achieve a simple overview. The user can further select the appropriate clusters for reclustering, and in this way narrow down the retrieval space. The third possibility is to browse the document collection by navigating in the tree structure and selecting the desired elements manually. In practice, this kind of selection is reasonable only after the two above-mentioned selection methods have been used to reduce the document collection. The fourth possibility is to start from the elements selected so far, and let the system search for similar or related elements in the collection. Similarity of two elements can be estimated, for example, by the number of words they have in common.

In the current system the user can only prune the existing books: mixing the sections of different books is not possible, and even within one book it is not possible to change the order of sections. Modifications of this kind are, however, often desired. Hence, in the new system the user is allowed to modify the set of selected documents in order to obtain a final textbook. The modifications include rearranging the elements, moving or copying selected fragments to be part of an existing element, and replacing an element either by its parent or by its child elements.

5. CONCLUSIONS

We have introduced a novel approach for assembling specialized, customized textbooks using a large pool of SGML documents as a source. In our current prototype system the order of a client is processed in the following way.

(1) The client fills in and submits an HTML form.
(2) The form is processed and the parts to be included in the assembled book are selected automatically. The choice depends both on the client's input and on a few heuristic rules.
(3) The texts and images are retrieved from the pool and the layout is created.
(4) The books are printed and delivered to the client.

We have developed our assembly system together with a group of teachers, including the authors of the textbooks used. At the moment the teachers are about to select sample assemblies of material. So far the results are encouraging: the teachers find the assemblies useful, even if the collection is still rather restricted.

Some conclusions can already be drawn. First, conversion should be avoided, unless the source data has a rigid and well-defined structure. In any case, the process is highly application-oriented and may need specific programming.

As the textbooks used were originally written as ordinary paper textbooks, they may contain harmful internal dependencies. A book often contains a story which is lost if the parts are arbitrarily assembled. There may also be cross-references that are difficult to maintain if some parts are left out.

The authoring teachers are satisfied with the new writing environment, particularly because they do not need to care about the layout. We have, however, noticed some psychological barriers against textbook assembly. The authors are usually teachers who are not very well paid for writing material. One significant motivation for writing a book may be the resulting physical book itself, with the author's name on the cover. When books are assembled, the work of one author may be hidden in the jungle of material. How can the writers now be motivated?
The approach demonstrated with the current prototype will be developed further as part of a more general assembly environment. Since the document collections concerned are assumed to be large, more advanced selection possibilities will be included in the system. Moreover, the resulting assembled documents can be modified in order to achieve the desired composition. The environment also supports uses other than printing books: the interface of the system can be seen as a dynamic electronic book.

REFERENCES

Ahonen, H., Heikkinen, B., Heinonen, O., Jaakkola, J., Kilpeläinen, P., Lindén, G., and Mannila, H. 1996. Intelligent Assembly of Structured Documents. Technical Report C-1996-40 (June), Department of Computer Science, University of Helsinki.

CCW. 1997. Custom CourseWare. McMaster University Bookstore. URL: http://bookstore.services.mcmaster.ca/home/ccw/ccw.html.

Cutting, D. R., Pedersen, J. O., Karger, D., and Tukey, J. W. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th ACM SIGIR Conference (June 1992), pp. 318-329.

ISO. 1986. Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), ISO 8879. ISO.

ISO. 1994. Information and documentation - Electronic manuscript preparation and markup, ISO 12083. ISO.

Kimber, W. E. 1996. Re-usable SGML: Why I demand SUBDOC. In SGML '96 (Boston, 1996). GCA.

McFadden, J. 1996. Hybrid distributed database (HDDB) and the future of SGML. In SGML Europe '96 (Munich, Germany, 1996). GCA.

NCPP. 1997. Nelson Canada Power Pak. Nelson Canada, a Division of Thomson International. URL: http://www.thomson.com/nelson/custom/custom.html.

Primis. 1997. Primis. Primis Custom Publishing, a Division of McGraw-Hill. URL: http://www.mhcollege.com/primis/.
