Automatic Paragraph Detection for Accessible PDF Documents

A. Darvishy, M. Nevill, H.-P. Hutter

ZHAW Zurich University of Applied Sciences
InIT Institute of Applied Information Technology
Technikumstrasse 9, CH-8401 Winterthur, Switzerland
Keywords: Accessible PDF, Tagged PDF, Visual Impairment, Algorithm, Screen Readers, Document Accessibility

Abstract. This paper describes a new algorithm for the automatic detection and tagging of paragraphs in PDF documents. This is an important feature of the PDF Accessibility Validation Engine (PAVE) [1], an open-source web application for the analysis and semi-automatic correction of accessibility issues in PDF documents. The tool is currently used by a large number of users, and their feedback is collected and evaluated. The evaluation so far has revealed some major usability issues, mainly due to the missing paragraph detection functionality. After an introduction to PDF accessibility, this paper discusses the current usability issues with PAVE and describes the newly proposed algorithm to alleviate them. A first evaluation and conclusion of the results will be provided in the final paper.
1 Introduction

Modern assistive technologies improve the ability of people with disabilities to work effectively with current software. People with impaired vision face the specific challenge that information is commonly presented visually on a screen. Screen readers [2] are used in such situations to render the content of the screen using speech synthesis or braille output devices.

To allow users to navigate software interfaces and documents, screen readers must expose structural information to the user. For documents, a screen reader provides key combinations that enumerate headings, figures or other elements. Listening to these items gives the user an overview of the document, and selecting an item makes the screen reader navigate to it and read the text of the document aloud starting from that position.

Some document formats embed all content in the kind of structural information needed by screen readers. The PDF format, however, is primarily designed to allow precise visual positioning of elements within a document, and these elements carry no information about their structural relationship to other elements in the document. To introduce this kind of structural information, the PDF format allows a tree of structural information that is kept separate from the content. This structural information, however, may be absent or incomplete depending on the software that generated the PDF file.

There are a number of tools available to create accessible PDF documents [3], but PAVE is the only available open-source, web-based application for validating and fixing accessibility issues directly in PDF files. It was awarded first prize at the 2014 International Conference on Computers Helping People with Special Needs (ICCHP) [4]. It performs automatic fixes for some accessibility issues, while providing an interactive interface for fixing others manually. The most common kind of manual fix is annotating (tagging) the elements of the PDF file with the structural information described above.

In a pilot period, the tool was used by a large number of users, and their feedback was collected and evaluated. The evaluation revealed some major usability issues, which are described in the next section.
2 Usability Challenges with PAVE

In a PDF file, a sequence of characters with uniform formatting that is positioned on the same line usually forms a single textual element, although in some cases each individual character is such an element. A document therefore has at least one separate element per line, and often many more. When no structural information is present, this leads to hundreds of elements on a single page that need to be annotated, and thousands in a whole document.

Users of PAVE have reported that the editing phase is overwhelming because of the often long list of individual items that need fixing. As shown in Fig. 1, all text elements that are not annotated with structural information are displayed individually as “issues.”

Fig. 1. In the page editing phase, the user is presented with a long list of technical issues, which users have often reported as overwhelming.
3 Usability Improvement of PAVE

The main objective of the usability improvements to PAVE is to reduce the number of actions a user must take to fix a document, and thereby the time spent on each document. The most common action that has to be performed is the grouping of regular text elements into paragraphs. We have therefore implemented an algorithm that performs this grouping automatically wherever blocks of text follow standard layout patterns.

3.1 Automatic Tagging of Paragraph Elements

Fig. 2.
Fig. 3.
Fig. 4.
First, we must determine which text elements should be considered. Our goal is to tag paragraphs of body text, while leaving it to the user to tag elements that require additional information, such as headings and figures. We identify body text by examining the distribution of element heights, on the assumption that body text constitutes the majority of the elements in the document. Any element whose height exceeds the median element height by more than a small error margin is ignored.

We then group text elements into blocks based on interstitial whitespace (Figure 2). To do this, we first use a vertical line sweep to find potentially overlapping lines. For each potential overlap, we check whether there is an actual overlap (allowing for a small margin). If an overlap is present, we mark the elements as part of the same text block.

Paragraphs must be created in the correct order, so that the screen reader reads them in order. Before creating the paragraphs, we therefore establish an order between text blocks. Ordering by a single coordinate may produce wrong results, e.g. in the case of multiple columns. Based on [5], we instead use the XY-Cut algorithm from [6] to determine the order of the text blocks (Figure 3).

Within each text block, we use left and right justification as cues to indicate the starts and ends of paragraphs (Figure 4) [7]. We determine whether a shared left or right edge exists by calculating the median of the start and end positions of the lines in the block, and testing whether the majority of the lines are close to the calculated value. If a suitable left edge is found, any line that does not start on the left edge is treated as the beginning of a paragraph.
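The steps above can be illustrated with a minimal Python sketch. It is not PAVE's implementation: the Line record, the function names, and the margin, min_gap and tol values are all assumptions made for illustration, and the y coordinate is assumed to grow downward. For brevity, the leaves of a simplified recursive XY-cut stand in for the whitespace-based text blocks instead of a separate line sweep, and only the left-edge cue is used to find paragraph starts.

```python
from dataclasses import dataclass
from operator import attrgetter
from statistics import median


@dataclass(frozen=True)
class Line:
    """Hypothetical record: a line of text and its bounding box (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def body_text(lines, margin=1.0):
    """Keep lines no taller than the median line height plus a small margin."""
    if not lines:
        return []
    limit = median(ln.y1 - ln.y0 for ln in lines) + margin
    return [ln for ln in lines if ln.y1 - ln.y0 <= limit]


def _split(lines, lo, hi, min_gap):
    """Group lines along one axis, cutting wherever a gap >= min_gap appears."""
    ordered = sorted(lines, key=lo)
    groups, reach = [[ordered[0]]], hi(ordered[0])
    for ln in ordered[1:]:
        if lo(ln) - reach >= min_gap:
            groups.append([])
        groups[-1].append(ln)
        reach = max(reach, hi(ln))
    return groups


# Try a cut between columns (x axis) first, then a cut between rows (y axis).
AXES = ((attrgetter("x0"), attrgetter("x1")), (attrgetter("y0"), attrgetter("y1")))


def xy_cut(lines, min_gap=10.0):
    """Return text blocks in reading order by recursive whitespace cuts."""
    if not lines:
        return []
    for lo, hi in AXES:
        groups = _split(lines, lo, hi, min_gap)
        if len(groups) > 1:
            return [block for g in groups for block in xy_cut(g, min_gap)]
    # No cut is possible: this is a leaf block, read top to bottom.
    return [sorted(lines, key=attrgetter("y0", "x0"))]


def split_paragraphs(block, tol=2.0):
    """Start a new paragraph at every line that is off the shared left edge."""
    edge = median(ln.x0 for ln in block)
    on_edge = [abs(ln.x0 - edge) <= tol for ln in block]
    if 2 * sum(on_edge) <= len(block):
        return [list(block)]  # no majority, so no usable left edge
    paragraphs = []
    for ln, aligned in zip(block, on_edge):
        if not aligned or not paragraphs:
            paragraphs.append([])
        paragraphs[-1].append(ln)
    return paragraphs


def detect_paragraphs(lines):
    """Filter body text, order the blocks, then split each block."""
    return [p for block in xy_cut(body_text(lines)) for p in split_paragraphs(block)]


# Made-up two-column page: a tall heading, a left column (a*), a right column (b*).
page = [
    Line("Heading", 50, 20, 200, 38),
    Line("b1", 300, 50, 500, 60),
    Line("a1", 60, 50, 250, 60),   # indented first line
    Line("a2", 50, 62, 250, 72),
    Line("a3", 50, 74, 180, 84),
    Line("a4", 60, 86, 250, 96),   # indented first line
    Line("a5", 50, 98, 250, 108),
    Line("b2", 300, 62, 500, 72),
]
for paragraph in detect_paragraphs(page):
    print([ln.text for ln in paragraph])
# prints ['a1', 'a2', 'a3'], then ['a4', 'a5'], then ['b1', 'b2']
```

On this sample page the heading is skipped as non-body text, the left column is read before the right one, and each of the two indented lines opens a new paragraph.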
4 Evaluation

Fig. 5.
Fig. 6.

Figures 5 and 6 show a page of a technical paper. The paper did not have any tags when uploaded, and was automatically tagged using the algorithm described. In Figure 5, we see that all regular paragraphs of an entire page were tagged in order (marked green), while the remaining elements (marked red: figure, caption and headings) are left to the user, as they require additional semantic annotation (heading level, figure caption and alternative text). In Figure 6, the paragraphs and ordering have been manually highlighted. We see that the ordering correctly follows the two columns, and that the one paragraph spanning the columns is combined into a single paragraph.

With this new paragraph detection algorithm, one kind of tagging issue that may well occur a hundred times in one document can now be solved automatically. In the document used in Figures 5 and 6, the automatic tagging algorithm reduced the number of elements requiring user action from 679 to 214, a reduction of about 68%. The algorithm therefore has a large potential impact on the efficiency of tagging PDF documents with PAVE.

To evaluate the efficiency gain of the paragraph detection algorithm and the effect of the other usability improvements implemented in the new version of PAVE, extended usability tests with real users will be performed in the next few weeks. We will monitor user interactions with the service in the background to measure the number of accessibility issues and the time spent fixing all issues in a document. With structured feedback questionnaires, we will elicit the impact on user confidence and satisfaction. To gather this feedback, a large number of users will test the application intensively for four weeks and report their feedback using structured feedback forms. The results will then be discussed in a focus group with users.
5 Conclusion

There are millions of PDF documents available on the internet. Most of them are not accessible to people with a visual impairment. PAVE facilitates analysing and fixing accessibility issues directly in PDF documents in an easy and intuitive way. A first pilot phase revealed some major usability challenges, mainly due to a missing paragraph detection algorithm. A new algorithm for detecting paragraphs has been implemented within PAVE and is currently being tested with a set of different PDF documents and with real users.
6 References

1. Alireza Darvishy, Hans-Peter Hutter, Oliver Mannhart. 2014. Web Application for Analysis, Manipulation and Generation of Accessible PDF Documents. ICCHP 2014, Springer Verlag, Berlin Heidelberg.
2. Screen readers such as: http://www.freedomscientific.com/Products/Blindness/JAWS
3. Alireza Darvishy, Hans-Peter Hutter. 2013. Comparison of the Effectiveness of Different Accessibility Plugins Based on Important Accessibility Criteria. Universal Access in Human-Computer Interaction. Applications and Services for Quality of Life. Springer Verlag, Berlin Heidelberg. Volume 8011 of the series Lecture Notes in Computer Science, pp. 305-310.
4. SS12 EU 2014 Finals and Winners. URL: http://ss12.info/Europe/
5. Hervé Déjean and Jean-Luc Meunier. 2006. A system for converting PDF documents into structured XML format. In Proceedings of the 7th international conference on Document Analysis Systems (DAS'06), Horst Bunke and A. Lawrence Spitz (Eds.). Springer-Verlag, Berlin, Heidelberg, 129-140.
6. Jean-Luc Meunier. 2005. Optimized XY-Cut for Determining a Page Reading Order. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR '05). IEEE Computer Society, Washington, DC, USA, 347-351.
7. Yimin Chu, Jun Adachi, and Atsuhiro Takasu. 2012. Detection of Paragraph Boundaries in Complex Page Layouts for Electronic Documents. Information Processing Society of Japan (IPSJ).