Copyright 2006 Society of Photo-Optical Instrumentation Engineers. This paper was (will be) published in Document Recognition and Retrieval XIV, IS&T/SPIE Symposium on Electronic Imaging, January 2007 and is made available as an electronic reprint (preprint) with permission of SPIE. One print or electronic copy may be made for personal use only. Systematic or multiple reproduction, distribution to multiple locations via electronic or other means, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited.

Pixel and semantic capabilities from an image-object based document representation Michael Gormish, Kathrin Berkner, Martin Boliek, Guotong Feng, Edward L. Schwartz Ricoh Innovations, Inc., 2882 Sand Hill Road, #115, Menlo Park, CA 94025 {gormish,berkner,boliek,feng,schwartz}@rii.ricoh.com

ABSTRACT
This paper reports on novel and traditional pixel and semantic operations using a recently standardized document representation called JPM. The JPM representation uses compressed pixel arrays for all visible elements on a page. Separate data containers called boxes provide the layout and additional semantic information. JPM and related image-based document representation standards were designed to obtain the most rate-efficient document compression. The authors, however, use this representation directly for operations other than compression that are typically performed on either pixel arrays or semantic forms. This paper describes the image representation used in the JPM standard and presents techniques to (1) perform traditional raster-based document analysis on the compressed data, (2) transmit semantically meaningful portions of compressed data between devices, (3) create multiple views from one compressed data stream, and (4) edit high resolution document images with only low resolution proxy images.

Keywords: JPEG 2000, Mixed Raster Content, JPM, document representation

1.0 INTRODUCTION
Electronic documents are typically represented in one of two forms. One form is a semantic representation resulting from a word processing, content generation, or authoring tool and stored in a format like “.rtf” or “.ppt”. The other form is a pixel array resulting from a camera, scanner, or rendering operation and stored in a format like “.jpg” or “.tif”. To perform all of the functions desired with electronic documents it is necessary to have both forms. For example, it is necessary to convert from the semantic form to a pixel array for print or display. Likewise, it is necessary to extract semantic information from the pixel array form, e.g. via optical character recognition (OCR), to provide search of contents.

The rest of the introduction describes the historical progression of document compression and representation. Compression of document images has made increasing use of image analysis to improve compression, and with each improvement has moved closer to model-based coding. With the advent of JPM, a sufficiently rich model of a document can be created to allow more direct semantic manipulation. After a review of compression technologies, the JPM image model and format are explained in more detail. Next, methods for creating semantically rich JPM files are explained, followed by methods for making use of the semantic and pixel content. In particular, the use of JPM for analysis, presentation, transmission, and editing is presented with experimental results.

1.1 Raster array compression
Standardized compression of document images began with methods for facsimile.2 These methods operated on binary documents, processing one row at a time. There was no explicit dependency on semantic document content in the compression system.
The static Huffman tables used in these systems were designed for the statistics of typical business letters or other office documents and did not perform well on raster arrays with different content, e.g. white text on a black background. Some of the poorest compression was obtained on continuous-tone image content that had been converted to binary via dithering or halftoning. Newer binary compression standards, e.g. JBIG,3 addressed this by using extensive two-dimensional templates to try to predict pixels in halftoned regions. Further, halftoned regions could also be compressed by recording the average gray value for a group of pixels and recreating that at the decoder. This set of standards essentially made use of statistics likely to be observed in the pixel data, with later versions allowing different types of statistics for different kinds of commonly compressed imagery.

The most advanced binary compression technology is embodied in the JBIG2 standard.4 JBIG2 supports the statistical compression of prior standards and adds a dictionary of symbols. The symbols are not predefined; instead the compressor can select any set of pixels, declare them to be a dictionary entry, and instruct the decoder to recreate that entry anywhere on the page. In addition, JBIG2 allows a "correction" layer to be compressed and stored in the file. This allows an encoder to use the same dictionary entry for two symbols that are similar but not identical. This works very well for text in any language, including mathematical formulas, because the dictionary can be created for just those characters appearing on the page, and slight rendering differences do not prevent matches. Thus although the true meaning of a particular glyph is not understood by the compressor, the compressed file does capture some meaning because it stores all the locations where the same glyph appears on the page. Rucklidge, Huttenlocher, and Jaquith5 make use of this analysis to improve the glyphs using information from all appearances of the same glyph.

Among the many techniques for compression of continuous tone grayscale and color pixel arrays, JPEG is by far the most successful.6 In addition to taking advantage of statistical properties of imagery, JPEG carefully allowed errors in the compression process in a way that was least likely to be observed by the human visual system. Further, JPEG allowed Huffman tables to be adapted to image content, so if the image being compressed was unlike the design corpus, compression would not suffer greatly. However, JPEG was based on cosine basis functions and was therefore poorly suited for document images, which contain smooth regions separated by step edges.
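The symbol-dictionary idea behind JBIG2 can be illustrated with a toy sketch (hypothetical data structures, not the actual JBIG2 bitstream): each distinct glyph bitmap is stored once in a dictionary, and every occurrence on the page is recorded as a (dictionary index, position) placement.

```python
# Toy illustration of JBIG2-style symbol-dictionary coding (not the real
# JBIG2 format): each distinct glyph bitmap is stored once; occurrences
# are recorded as (dictionary index, x, y) placements.

def build_symbol_dictionary(glyphs):
    """glyphs: list of (bitmap, x, y), bitmap as a tuple of row-tuples."""
    dictionary = []   # unique bitmaps
    index_of = {}     # bitmap -> dictionary index
    placements = []   # (index, x, y) for every occurrence on the page
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements

# The glyph "l" appears three times; it is stored once and placed thrice.
l_glyph = ((1,), (1,), (1,))
o_glyph = ((1, 1), (1, 0), (1, 1))
page = [(l_glyph, 0, 0), (o_glyph, 4, 0), (l_glyph, 8, 0), (l_glyph, 12, 0)]
dictionary, placements = build_symbol_dictionary(page)
```

A real encoder would additionally match near-identical bitmaps and store a correction layer for the residual differences; here only exact matches are deduplicated.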
The problems of step edges and continuous tone compression were addressed with the mixed raster content (MRC) family of compression systems.7,8 These systems encode three full-size layers for one array of pixels, typically called a foreground, background, and mask. The mask is most commonly a binary image and is compressed with a compressor good at step-edge information, like JBIG or JBIG2. The foreground and background are typically compressed with continuous tone compressors like JPEG or JPEG 2000.9 Although the names foreground and background might imply some semantic understanding, normally they just refer to the order in which the images are composed. In many cases the content of the foreground and background can be switched as long as the mask is also inverted.

The new JPM standard extends the three-layer model used in MRC by allowing multiple objects on a page. While a JPM file may contain three layers (foreground, background, and mask), it may alternatively use several contone images and masks for different or even overlapping regions of the page. Those regions may correspond to semantic objects on a page, e.g. paragraphs or figures. To the extent that the regions separate imagery types and avoid compressing regions without content, object-based compression is often better than the three-layer model.

1.2 Semantic form compression
Compression for semantic forms, i.e. word processing document formats, was historically less important, because the files took up relatively little space. There was often some implicit compression because it would be assumed that the font information needed to render a document was stored somewhere else, e.g. on the operating system or at the printer. As word processing applications incorporated more drawings or images, the file size grew. The image or drawing data was stored in compressed form if the file copied into the document was compressed, but not otherwise. Some formats, e.g. PostScript, allowed compression of the drawing commands with a text compressor; more typically, if compression was important, it was assumed that the entire file would be compressed with a text compressor such as zip.

1.3 Dual document representations
This paper makes use of an image representation that is a mix of images and semantic information. A page is made up of objects containing codestreams which can be decompressed to pixel arrays, and these objects can contain metadata. Even “metadata” like the compression type, codestream size, and location on a page is useful semantic information. Bagley and Brailsford11 realized the value of independent objects and invented a way to obtain coherent semantic document objects, called COGs, from drawing and rendering commands contained in a PDF file. One of the main benefits of COGs is the creation of independent state for each object. Normal PDF and PostScript carry graphic state

forward, and this makes it impossible to manipulate objects independently.18 Because each image object in a JPM file is independent, the only interaction between objects occurs if one is rendered on top of another. Thomas and Brailsford10 add an XML data structure (standoff markup) which references a PDF document, thus allowing multiple overlapping semantic meanings to be assigned to the content. Their standoff markup is more natural in JPM than in PDF because the fundamental structure of a JPM document is already a set of objects, with a “standoff” description specifying presentation. Unlike PDF, JPM does not support arbitrary drawing commands and fonts.

Figure 1 — Document made up of Image and Mask Objects

2.0 JPM DOCUMENT REPRESENTATION
2.1 Image-Object Model
Figure 1 shows a collection of image objects (b), (c), and (e) on the left, a set of mask objects (a), (d), and (f) in the center, and the final rendered page on the right. A given page could be divided into image objects in a number of ways, with the simplest representation having one unit for the entire page and a more complex one separating the document image information into the objects (a)-(f). In the example in Figure 1, black text is represented by a mask only (a). Black or other colors of text can be represented using an image (c) and mask (d). Similarly, an image object can be represented as an image without a mask (b) or as an image (e) and mask (f). All color information must be captured in the image object, which can be a codestream or just a constant color block. The masks can be stored as binary codestreams, in which case they select between the object and the background page. Masks may also be stored as grayscale codestreams, in which case the pixel value is interpreted as the fraction of the object to render with the page in an alpha blend. It is possible for masks or images to be reused for multiple objects on the final page. For example, the four flowers at the bottom of the page could have been split into four different objects with a common mask combined with a different color image for each one. It is also possible to crop and scale a codestream before using it for an image or mask. Thus, the flowers could be stored as one image with four colors and one mask with a single flower; four objects would each crop a different section of the image, and thus a different color. It is also possible for semantic objects to overlap. This is unlikely to be detected if the representation was created from a scan of a page, i.e. from a single bitmap. However, if a document contains edits or is generated from a source with overlapping drawing commands, it may have objects that overlap.
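The mask semantics just described (binary selection or grayscale alpha blending) can be sketched with a minimal single-channel compositor. This is an illustrative sketch only; real JPM rendering decodes codestreams and handles cropping and scaling.

```python
# Minimal sketch of JPM-style object compositing on a single-channel page.
# Each object is (image, mask, x, y); mask values in [0, 1] act as alpha:
# page = (1 - mask) * page + mask * image, applied in object order.

def render_page(width, height, objects, background=1.0):
    page = [[background] * width for _ in range(height)]
    for image, mask, ox, oy in objects:          # later objects composite on top
        for r, (img_row, msk_row) in enumerate(zip(image, mask)):
            for c, (v, a) in enumerate(zip(img_row, msk_row)):
                page[oy + r][ox + c] = (1 - a) * page[oy + r][ox + c] + a * v
    return page

# A 2x2 black object with a binary mask selecting only its left column:
# only the masked column darkens the white page.
img = [[0.0, 0.0], [0.0, 0.0]]
msk = [[1.0, 0.0], [1.0, 0.0]]
page = render_page(4, 2, [(img, msk, 1, 0)])
```

A binary mask reduces the blend to pure selection, which is the common case for text; a grayscale mask gives anti-aliased edges.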
All objects are rendered in order, so the final appearance of a pixel depends on the last object which contained an unmasked pixel at that position. The images and masks are created solely by decompressing codestreams. There are no font rendering or drawing commands, and thus there is no graphic “state” of a document. A page buffer is needed to store the image as objects are being composited. Because objects are independent and their position and extent on a page are known, it is possible to render a part of the page by decoding only those objects that overlap it. In PostScript or PDF an object might be drawn at the bottom of a page, but change the graphics state, and thus affect an object drawn later at the top of the page. Partial rendering is important both for on-screen viewing and for printers, e.g. inkjets with limited memory. It is possible to implement hardware acceleration for the decompression operations. Also, parallel processing is easy for different objects, and the objects affecting any point on a page are trivially determined by examining their extent and position. For image objects compressed with JPEG 2000, decoding at a power-of-two scale is faster than decompressing the whole image; for other compression techniques, or for non-power-of-two scaling, bitmaps are resampled after decoding.

2.2 JPM File Format Standard
JPEG 2000 Part 6 was developed to allow JPEG 2000 compression to be used with an MRC-style document representation, and thus was named “JPM.” It also allows for multi-page documents, multiple objects on a page, multiple alternative groups of pages called page collections, and metadata. The actual bytes of a JPM file are stored in “boxes,” just as in all the JPEG 2000 family file formats. QuickTime and many MPEG standards also use this structure, although they call it an atom. Thus if a particular type of metadata is being used in another application, e.g. audio data, it is easy to incorporate it in a JPM file. A box is simply a four-byte length code followed by a four-byte type code, although the four-byte length code can be extended to 8 bytes if the content of the box is more than a few gigabytes. Boxes contain some defined set of fields or a collection of other boxes. Thus, the boxes in a JPM file form a tree, with XML or binary content equally well supported. The length field allows a parser to skip a box and all of its sub-boxes without reading the data, something that can be very important for media with potentially large content.
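The box structure is simple enough to walk with a few lines of code. The sketch below iterates over top-level boxes in an in-memory buffer, handling the extended 8-byte length (signaled by a length field of 1); the synthetic box contents at the end are made up for illustration.

```python
import struct

# Sketch of a JPEG 2000 family box parser: 4-byte big-endian length,
# 4-byte type; a length of 1 means an 8-byte extended length follows;
# a length of 0 means the box extends to the end of the buffer.

def iter_boxes(buf):
    pos = 0
    while pos + 8 <= len(buf):
        length, btype = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if length == 1:                       # extended 8-byte length
            (length,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif length == 0:                     # box runs to end of buffer
            length = len(buf) - pos
        yield btype.decode("ascii"), buf[pos + header : pos + length]
        pos += length

# Two synthetic boxes: a 12-byte 'ftyp' box and an 11-byte 'xml ' box.
data = struct.pack(">I4s4s", 12, b"ftyp", b"jpm ") + \
       struct.pack(">I4s3s", 11, b"xml ", b"<a>")
boxes = list(iter_boxes(data))
```

Because the length is in the header, a parser can skip any unrecognized box (and its entire subtree) in constant time, which is what makes selective access to large files cheap.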
Some portions of the JPM file are accessed by offsets into the file, and some specific items in the JPM file may be stored externally to the file. Codestreams for image objects can thus be shared between pages, and even between JPM files. The location of the codestream within the file is flexible, allowing applications to optimize placement to ease random access, network access, etc.

A JPM file has a main page collection which can point to other page collections or pages; thus a file has a default set of pages. However, the standard allows for alternate page collections which can address a different set of pages; for example, pages containing a figure could be collected for the list-of-figures page collection. Any page has a box describing how the page should be rendered. The page box includes header information like the width, height, and orientation of the page, and a list of layout objects. Each layout object may contain a mask object, an image object, or both. Information is also included to determine the position of an object on the page and any scaling or cropping required. A page is rendered by locating the mask and image objects, decoding them to pixel arrays, and combining the image objects in the specified order onto the base page using the mask objects for alpha blending.

Masks and images can be compressed with a variety of compression standards including CCITT G4 Fax, JBIG, JBIG2, T.45, JPEG, and JPEG 2000. Some of these standards are good for binary images. JBIG2 is particularly good for document images because of its ability to create a dictionary of symbols, which can capture repeated characters in any font. JPEG 2000 is especially good for continuous tone document imagery because images can be decoded at different resolutions. Thus if a JPM file is being decoded for a printer, the full resolution JPEG 2000 image object can be decoded, while if a page is going to be viewed on the screen, only a low resolution version of the JPEG 2000 image needs to be decoded and displayed.

Metadata in a JPM file can be stored in an XML box if the data is XML; binary metadata can be stored in other box types depending on whether the type is standardized or specific to a particular vendor. Importantly, the metadata can be stored at various positions in the file depending on its use. Metadata can be stored in a Page Collection box, a Page box, an Object box, or a top-level File box, if the data applies to a group of pages, a single page, a single object, or the entire file, respectively. Thus OCR results could be stored with an object, and if that object is moved or updated the OCR metadata can be moved or updated with it. Page collections, pages, and layout objects can also have Label boxes stored in them; these boxes are text strings with some restrictions to make them printable. The labels are thus useful for a GUI which provides controls or tool tips to manipulate objects, pages, etc.

Figure 2 — Formation of Image Object based files

3.0 DETERMINATION OF SEMANTIC MEANING
3.1 Creation of Image Objects
In order to use the JPM format for analysis, transport, and rendering, it is necessary to convert document data into this format. Since JPM is a very new standard, currently no publicly available transcoding tools exist. The creation of JPM files from both pixel arrays and current semantic forms is explained in this section. A schematic overview of these two paths is shown in Figure 2.

3.1.1 Raster based creation
Raster images can be converted into image objects most simply by making one continuous tone object for the entire image. If the continuous tone image is compressed with JPEG 2000 then it is possible to access only portions of the document for decoding. However, only minimal semantic operations are possible. Some systems divide a scan into foreground, background, and mask in order to optimize compression; this provides the best known compression for document images.8,12,13 Dividing the image into foreground, background, and mask normally involves some image analysis to decide which pixels belong in the foreground vs. the background. This analysis is partially captured in the compressed form. For example, examining the mask image very often reveals all of the text, and perhaps major edges in the image. If OCR is run on a scanned page, the preprocessing for the OCR very often includes some layout analysis: images and text can be distinguished, and sometimes titles and paragraphs of text are grouped into separate semantic units. Such an analysis allows the page to be divided into semantically meaningful image objects; the text can usually be compressed with a binary mask, and images compressed with a contone compressor.

3.1.2 Semantic Based Creation
An image stored in a proprietary word processing format, or a standardized format with drawing commands, e.g. Scalable Vector Graphics (SVG), can always be converted to image objects by printing or displaying the document, capturing the pixel array, and running the analysis from the above section on the full page. Such a conversion works fairly well, since “noise” introduced through printing and scanning is avoided and it is thus easy to distinguish contone imagery from text or line art. It is often thought a semantic form will be much smaller than a pixel array, but this is usually not true. The authors collected the first page from a set of PDF files (version 1.3 or earlier) including a variety of fonts and drawing commands, converted all of the content to image objects, and compressed the pixel data, creating a JPM file. In addition, OCR was performed, the results of which can optionally be stored with the JPM file. The images in Figure 3 were chosen because they represent the breadth of the image set: large and small PDF files, mainly-image or text-only content, and good and poor performance of JPM relative to PDF. Table 1 shows the file size of the vector based PDF files and the corresponding percentage size of a JPM file created using a 400-dpi capture rendering of the PDF file. In general, the quality of the image object representation increases

with the rendering resolution, as does the file size. However, our experiments show that often the 200-dpi capture is indistinguishable from the original, and the 400-dpi capture is always indistinguishable from the original when printed on a high-end office color copier. As seen in the table, the 400-dpi JPM files are usually smaller, sometimes substantially smaller, than the PDF file. Of course, there are programs to compress PDF files while maintaining compatibility with any particular version of PDF, but the point of this experiment was to determine how size and quality would compare with files found “in the wild.” Note that for the Nx file the JPM file is larger than the PDF file. This is caused primarily by the high quality settings used in the JPM compression, which were chosen to guarantee high quality across images, and not optimized or adapted for these pages. In the case of image Nx, the image regions were compressed in the PDF, but were decompressed and recompressed with a different system which attempted to preserve quality, including artifacts from the original compression. In addition, the black background in this image, although appearing solid both on screen and in print, is actually a pattern. This pattern is maintained by the compression, at considerable cost in bitrate. Indeed, repeating patterns are probably the biggest challenge for efficient representation in a compressed bitmap system like JPM. Notice that the OCR size percentage of the Ox file is significantly larger than the others. This is because the file compresses quite well, and the OCR did not group all characters into words efficiently, so information on the location of individual characters is stored in the OCR results; of course this information could be compressed.

Figure 3 — PDF pages converted to JPM
3.2 Document Image Analysis
An important characteristic of a document represented in semantic form is that processing tasks like reformatting to fit a different paper or display size are possible. Typical steps may include reflowing or scaling of text as well as resizing or complete exclusion of images. In the following, techniques operating on document image representations that enable reformatting procedures similar to the ones possible for semantic forms will be shown.

3.2.1 Efficient image-based analysis for automatic object creation
The syntax developed for JPEG 2000 codestreams was designed to support compression and network transfer applications. However, the J2K syntax also provides a powerful analysis tool that enables applications requiring

Table 1 — Compression efficiency of JPM for rendered data

File     PDF Size (bytes)   JPM Size (% of PDF)   OCR Size (% of PDF)
Cx       942,597            8.6                   0.52
Dx       345,161            11.08                 5.39
Jx       222,113            48.47                 3.66
Nx       87,501             118.10                3.07
Ox       64,447             41.49                 31.21
Px       523,820            12.85                 2.08
Average  364,273            40.10*                7.65*

* Average computed over percent of file size, not total JPM size over total PDF size.

Figure 4 — Original document (left) and resolution-dependent segmentation map (right) derived from a multiresolution bit distribution (top).

repurposing of image objects. In this section an information-rich multiresolution bit distribution is extracted from J2K header data, and that distribution is used to automatically crop and scale selected image regions. The J2K image compression algorithm performs a wavelet transform on rectangular tiles and partitions the wavelet domain into groups of coefficients called code-blocks. These code-blocks are encoded with an arithmetic coder and organized into quality layers. Finally, all data are assembled into smaller units called packets. Packet headers contain information such as the number of bits allocated to each code-block. At low bit-rate, this information reflects the visual importance of a code-block. Due to the specific encoding scheme in the wavelet domain, a code-block is localized in both the spatial and frequency domains. Using the code-block information available in the packet headers, a multiresolution bit distribution (MBD) B = B(r, (i, j)) can be defined. For each code-block location (i, j) at wavelet resolution r, the value B(r, (i, j)) denotes the number of bits spent by the encoder to encode the specific code-block area at resolution r.14 Figure 4 shows an example of an MBD for a document compressed using 5 levels of a wavelet transform. It is important to note that an MBD can be extracted from a J2K-compressed file by reading the packet header data only; no decoding of image data is required. Implicitly embedded in the MBD are results from an analysis that the encoder has performed to support bit allocation decisions. Those analysis results may be information about where in the spatial domain, and at what resolution, the compressor spent significant bits. This MBD has been used to perform segmentation of an image into regions with similar resolution properties.14 A resolution-sensitive segmentation map has been derived using a Bayesian-type technique, where each code-block location is assigned a preferred resolution computed as:

$r^*(i,j) = \arg\min_r \sum_{l=r}^{L} w_{r,l} \, B(l,(i,j))$

where L denotes the maximal level of wavelet decomposition and w_{r,l} are weights chosen empirically to match resolution characteristics of document images, such as capture resolution. A connected component analysis performed on the segmentation map results in a collection of objects, each with a preferred resolution. The visual importance of an object is measured by extracting the number of bits the coder decided to spend on the object at its preferred resolution. This purely image-based analysis technique leads to a segmentation in terms of importance measure and preferred resolution. Strict semantic content such as reading order, sentence units, or logical labels for paragraphs cannot be obtained from this image-based technique. Properties such as appropriate scaling factors for contone images in a document and visual importance measures, however, can be determined. Section 4.1 shows how image objects and their attributes are used in a reformatting application.

Figure 5 — Analyzing JPM structure data

3.2.2 Text-specific analysis for object creation
In order to be able to define regions in document images based on semantic characteristics, a text-specific analysis step is used. The goal is to group pixels with common semantic meaning into objects and assign — as in the image case — a preferred resolution and an importance measure to each such object. Initially, a document layout analysis is performed. Document layout analysis is typically part of most OCR systems.16 Bounding boxes for text zones and words are obtained from the layout analysis. This information provides a segmentation map for the text portions of a document, including a hierarchy between segments and sub-segments. A text segment is described by a rectangular bounding box. Next, a resolution and an importance attribute are assigned to each text segment. A resolution attribute is derived from the minimal scaling factor that still keeps the text at a readable character size.
An importance attribute is derived as a product of font-size and position of the text segment on the page.15 When a file is stored in JPM form rather than in a pixel array form, the JPM layout information can be used, thus eliminating the need for layout analysis or an OCR system. The left side of Figure 5 shows some of the elements of the JPM file for the image shown in Figure 1. The type of compressor, the position of the objects, and the sizes of the compressed codestreams can be used to generate a spatial analysis of the file as shown on the right of Figure 5.
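The preferred-resolution rule from Section 3.2.1 can be sketched directly for one code-block location: take the resolution whose weighted tail sum of the bit distribution is smallest. The weights and bit counts below are illustrative placeholders, not the empirically derived values used in the paper.

```python
# Sketch of the preferred-resolution rule: for one code-block location,
# r* = argmin_r sum_{l=r..L} w[r][l] * B[l], where B[l] is the number of
# bits the encoder spent at wavelet resolution l. The weights here are
# illustrative placeholders, not the empirically derived ones.

def preferred_resolution(B, w):
    L = len(B) - 1
    def cost(r):
        return sum(w[r][l] * B[l] for l in range(r, L + 1))
    return min(range(L + 1), key=cost)

# Illustrative weights that grow with r; the choice of weights controls
# how strongly truncating to a given resolution is penalized.
w = [[r + 1] * 3 for r in range(3)]
bits_at_high_res = preferred_resolution([10, 20, 400], w)  # most bits at l=2
bits_at_low_res = preferred_resolution([400, 5, 5], w)     # most bits at l=0
```

The two calls show that the result depends on where in the distribution the encoder concentrated its bits, which is exactly the information the MBD exposes without decoding any image data.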

4.0 EXPLOITING SEMANTIC MEANING
4.1 Multiple View Applications
The image- and text-specific analysis presented in Section 3.2 is now used to create a thumbnail-type representation of a document, with a focus on keeping text readable and images recognizable. In general, adapting image content to a smaller size requires scaling of pixel data. In order to maintain readability of text and recognizability of images, the scaling of those objects needs to be controlled. Text is never scaled below the factor given by its resolution attribute.

Figure 6 — SmartNail examples created for the document in Figure 4 for various display sizes: 130x160 (left), 400x200 (middle), and 80x100 (right) pixels.

Similarly for image objects, the preferred resolution attribute described in Section 3.2.1 enables control over appropriate scaling factors for images. Given a certain size for a thumbnail, not all objects may be displayed at their preferred resolution. That means a selection process has to be performed, extracting important objects and laying them out in the available thumbnail window. The importance attribute assigned to each object assists in this task. The importance measure can include more semantic information such as a logical interpretation of image objects, e.g. “title” or “figure caption”. The layout method used to format selected segments into the final window penalizes deviations from preferred resolutions and deviations from the original layout. The resulting thumbnail is an image composed of automatically selected, cropped, scaled, and reformatted document image objects. To contrast with the typical thumbnail, this representation is called a SmartNail. Objects included in a SmartNail are displayed keeping original formatting characteristics such as font size, font type, or color. Potential OCR errors are not visible to the viewer. By imposing strict constraints on the final thumbnail size, maintaining the relative spatial positioning of objects may not always be possible. Example SmartNails are shown in Figure 6. More details on the layout algorithm are available.19 The SmartNail framework utilizes semantic document information and pure image-processing steps such as scaling, cropping, and pasting of image data to enable a reformatting of a document. This new presentation is represented as a collection of document image objects, and can be optimized for given display constraints. The JPM file format is capable of storing the image and semantic data needed to generate a SmartNail representation, and can in addition store a full-resolution representation suitable for high-quality printing.
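The selection step can be sketched as a greedy pass over the importance attribute. This is a deliberate simplification of the published layout algorithm, with made-up object data: objects are considered in decreasing importance and kept only if, at their preferred scale, they still fit in the remaining thumbnail area.

```python
# Greedy sketch of SmartNail object selection (a simplification of the
# published algorithm; the real layout also penalizes deviations from
# preferred resolution and from the original spatial arrangement).

def select_objects(objects, thumb_w, thumb_h):
    """objects: list of (name, width, height, preferred_scale, importance)."""
    budget = thumb_w * thumb_h            # available thumbnail area in pixels
    selected = []
    for name, w, h, scale, imp in sorted(objects, key=lambda o: -o[4]):
        area = (w * scale) * (h * scale)  # object area at its preferred scale
        if area <= budget:
            selected.append(name)
            budget -= area
    return selected

# Hypothetical objects from a page analysis (dimensions in pixels).
objects = [
    ("title",  600, 60,  0.5,  10),  # readability limits scale to >= 0.5
    ("figure", 400, 300, 0.25, 7),
    ("body",   600, 700, 0.5,  3),   # too large to fit the remaining area
]
chosen = select_objects(objects, 160, 120)
```

Here the title and figure fit within the 160x120 budget at their minimum readable scales, while the body text is dropped, mirroring how importance and resolution attributes drive the real selection.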
4.2 Interactive Network Delivery High end digital copiers process uncompressed pixels at roughly the same rate as HDTV. Thus although network bandwidth has increased somewhat over the years, accessing documents over a corporate intranet is still a pain point for many organizations. While good compression makes the transfer of documents possible, even more benefit is available by only delivering the portions of a document actually required. The JPIP standard17 defines a protocol to provide access to large JPEG 2000 images over a network. Recently an amendment to this standard was passed (not yet published), which extends the JPIP protocol for JPM images. Thus it is possible to deliver only the image objects needed for a particular task, and only at a resolution needed for the task. The authors report on several user tasks and the required bitrates,20 a sample is included here to show the benefit of the image object representation. Networking tasks and required bitrates are summarized in Table 2. First, the “Jx” image from Figure 3 is rendered from an electronic source. If it is rendered and saved as a high quality JPEG image, storage, email, or network access to the full (single page) image requires almost two megabytes. If the image is divided into layout objects and compressed with JPM using JPEG 2000 and JBIG-2 less than one tenth as much data is required to store the image at the equivalent Signal to Noise ratio (SNR). Note that in this case the two images with equivalent SNR have roughly equivalent nearly perfect, visual quality. However, if high compression is used, the JPEG image ends up looking much worse than the JPM image even when SNR is matched, because of the poor performance of JPEG on step edges. In many cases compression gains for JPM can be even more significant than 10:1. Often the first thing that happens with a document is display on a monitor. Typically a full 8.5x11 inch page can only be displayed at about 75-dpi even on high end monitors. 
Using JPIP, as extended for JPM, it is possible to access a 75-dpi version of the page image. For the test image used in Table 2, the 75-dpi version requires less than half of the bytes in the full file. The JPEG image must either be transmitted in full to be displayed (assuming sequential JPEG), or a page-sized "preview" image must be created. Since a page document is often not readable at 75 dpi, it is very likely that a user will zoom in on some content, for example the upper left hand corner, and examine it at full resolution. The JPIP protocol allows the server to send only the additional data needed to view this part of the image; in this case less than 25% of the full file is needed for the JPM file. For the JPEG file, it is assumed the whole file was sent in the previous step, so no additional data need be sent. Of course, if a low resolution preview image had been sent in the JPEG case, it would now be necessary to send a new image; the previous image would not help render the high resolution data. These byte sizes are shown in the third row of the table. The final row of the table reports results for editing, described in Section 4.3.

Figure 7 — One JPM file referencing the image objects in another

Table 2 — Sample Gain for Pan and Zoom Operations

Task                                                  JPEG              JPM
a) Full size, 300 dpi                                 1,948,544         161,402
b) 75 dpi, full screen                                1,948,544         78,373
c) 300 dpi, zoom and pan (extra bytes / total bytes)  +0 / 1,948,544    +35,954 / 114,327
d) Add object, bytes to server                        2,160,084         4,661
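The fractions quoted in the text follow directly from the byte counts in Table 2; a quick check:

```python
# Byte counts taken from Table 2 (JPEG row a and the JPM column).
jpeg_full = 1_948_544
jpm_full, jpm_75dpi, jpm_zoom_extra = 161_402, 78_373, 35_954

print(round(jpeg_full / jpm_full, 1))       # JPM file ~12x smaller at equal SNR
print(round(jpm_75dpi / jpm_full, 2))       # 75 dpi view: under half the file
print(round(jpm_zoom_extra / jpm_full, 2))  # zoom-in: under 25% extra data
```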

4.3 Remote Editing, Printing, and Versioning

To add an annotation to the file (row d of Table 2), a small binary image is created. For JPM, this is a new object. A new JPM file can be created which contains only the new object, some layout information, and perhaps some updated metadata, and which references the high resolution codestreams still stored at the server, as shown in Figure 7. This new image can be transferred to the server in a short period of time. Further, if a user is interested in the previous version of the file, it is still available. In contrast, to edit a JPEG image the entire file must be received, the changes made, and the modified file compressed and returned to the server.

For high quality printing, full resolution image objects are required. Normally, someone viewing a document on a laptop would wait to receive the entire image; then, if they decided to print, they would transmit the entire image to a printer. In the JPIP-JPM case the document can be viewed at low resolution quickly (even over a wireless connection), and the server can be requested to print the full resolution document. As the server and printer are likely to have higher bandwidth connections than a wirelessly connected laptop, this will often be much faster. The authors implemented a JPM printer driver on a Macintosh running OS X, using the Common Unix Printing System (CUPS). This driver allowed our custom browsing application to make use of low resolution "proxy" images for browsing and editing, but ensured that full quality images were used when printing. Because printing is not an interactive process, the delay to get full resolution data is not nearly as noticeable as when editing.
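The row-d edit can be pictured as assembling a small new file whose entries point at the unchanged codestreams on the server. This is a conceptual sketch only: the dictionary below is not the JPM box grammar, and the byte figures are illustrative, not the measured values.

```python
# The new version carries only the annotation codestream plus layout and
# metadata information; the large page codestreams are referenced, not
# copied.  (Illustrative structure and sizes, not actual JPM boxes.)
ANNOTATION_BYTES = 4_200       # small JBIG-2 bitmap for the annotation
LAYOUT_METADATA_BYTES = 500    # placement and version information

new_version = {
    "objects": [
        {"id": "text",  "codestream": {"ref": "server:page.jpm", "index": 0}},
        {"id": "image", "codestream": {"ref": "server:page.jpm", "index": 1}},
        {"id": "note",  "codestream": {"local_bytes": ANNOTATION_BYTES}},
    ],
    "metadata": {"version": 2},
}

bytes_to_server = ANNOTATION_BYTES + LAYOUT_METADATA_BYTES
print(bytes_to_server)  # a few KB, the order of the 4,661 bytes in Table 2
```

Because the old layout boxes and codestreams are untouched on the server, the previous version of the document remains retrievable after the edit.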

5.0 CONCLUSIONS

Documents are represented in a variety of formats. A high-level distinction can be made by grouping representations utilizing pixel data only into one category, the pixel-array form, and grouping representations originating from word processing applications into a second category, the semantic form. Both forms co-exist in today's document workflow. The semantic form is used in authoring and content repurposing, whereas the pixel array form is necessary for printing and capturing via a scanner or digital camera. This paper describes the recently finalized JPM standard as a hybrid between the two forms, combining pixel array data with semantic structure in a single file. A selection of applications that utilize the combined access to pixel data and semantic descriptions through a JPM representation are described: access to semantic units in a compressed codestream, repurposing of document content considering image and text characteristics for alternative thumbnail generation, network transfer of semantically structured or printer-specific structured document data, and remote editing of content at a thin client.

REFERENCES

1. JPM, JPEG 2000 image coding system -- Part 6: Compound image file format, ISO/IEC 15444-6:2003, www.iso.org.
2. A. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989, pp. 540-553.
3. JBIG, Progressive bi-level image compression, ISO/IEC 11544:1993, www.iso.org.
4. JBIG-2, ITU-T Recommendation T.88 | ISO/IEC 14492:2001, Information technology - Lossy/lossless coding of bi-level images.
5. W. J. Rucklidge, D. P. Huttenlocher, E. W. Jaquith, "Method and Apparatus for Comparing Symbols Extracted from Binary Images of Text Using Topology Preserved Dilated Representations of the Symbols," US Patent #5,835,638, 10 November 1998.
6. W. B. Pennebaker, J. L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.
7. ITU-T Recommendation T.44 | ISO/IEC 16485:2000, Information technology - Mixed Raster Content (MRC).
8. R. L. de Queiroz, R. Buckley, and M. Xu, "Mixed Raster Content (MRC) model for compound image compression," in Proc. of SPIE Conf. on Visual Communications and Image Processing, vol. 3653, San Jose, CA, January 25-27, 1999, pp. 1106-1117.
9. JPEG 2000, JPEG 2000 image coding system -- Part 1: Core coding system, ISO/IEC 15444-1:2004, www.iso.org.
10. P. Thomas and D. Brailsford, "Enhancing Composite Digital Documents Using XML-based Standoff Markup," in Proceedings of the 2005 ACM Symposium on Document Engineering, pp. 177-186, 2-4 November 2005.
11. S. Bagley, D. Brailsford, and M. Hardy, "Creating reusable well-structured PDF as a sequence of Component Object Graphic (COG) elements," in Proceedings of the ACM Symposium on Document Engineering (DocEng'03), Grenoble, France, 20-22 November 2003, ACM Press.
12. L. Bottou, P. Haffner, P. G. Howard, et al., "High Quality Document Image Compression with DjVu," http://www.djvuzone.org/djvu/techpapers/jei/jei.ps.gz, 1998.
13. M. Thierschmann, K.-U. Barthel, and U.-E. Martin, "A Scalable DSP Architecture For High-Speed Color Document Compression," in Document Recognition and Retrieval VIII, P. B. Kantor, D. P. Lopresti, J. Zhou, Eds., Proceedings of SPIE vol. 4307, San Jose, CA, January 2001.
14. R. Neelamani and K. Berkner, "Adaptive Representation of JPEG 2000 Images using Header-based Processing," in Proc. IEEE Int. Conf. Image Processing (ICIP '02), vol. 1, pp. 381-384, 2002.
15. K. Berkner, E. L. Schwartz, C. Marle, "SmartNails - Image and Display Dependent Thumbnails," Proceedings of SPIE, vol. 5296, pp. 53-65, San Jose, 2004.
16. G. Nagy, "Twenty years of document image analysis in PAMI," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 38-62, January 2000.
17. JPIP, JPEG 2000 image coding system: Interactivity tools, APIs and protocols, ISO/IEC 15444-9:2005, www.iso.org.
18. Adobe Systems, PDF Reference, Fifth Edition, Version 1.6, http://partners.adobe.com/public/developer/pdf/index_reference.html#5, 2004.
19. K. Berkner, E. L. Schwartz, C. Marle, "SmartNails - Display and Image Dependent Thumbnails," IS&T/SPIE Electronic Imaging Conf., San Jose, January 2004.
20. M. Boliek and M. Gormish, "Network Access to Parts of Archived Document Image Files," Proceedings of Archiving Conference 2006.
