USING WEB CONTENT TO ENHANCE AUGMENTATIVE COMMUNICATION

Gregory W. Lesher, Ph.D.
DynaVox Technologies
2 High Rock Lane
South Hamilton, MA 01982
Phone: 412-222-7944
Email: [email protected]

D. Jeffery Higginbotham, Ph.D.
Department of Communicative Disorders and Sciences
University at Buffalo
126 Cary Hall
Buffalo, NY 14214-3005
Phone: 716-829-2797 ext. 601
Email: [email protected]

INTRODUCTION

The Google search engine indexes more than 4 billion distinct web pages, containing millions of pictures and several trillion words of text. Given this resource of almost unimaginable scope and depth, our research group is pursuing methods of leveraging this information to enhance augmentative communication (AAC). To this end, we have developed a web crawler that is capable of autonomously browsing through web pages to collect specific information. Although the current implementation is limited to collecting text, the goal is to extend this system to include images and sounds.

Our web crawler is similar to those used by search engines to catalog site content. The crawler visits an operator-provided list of pages, recursively visiting each link within those pages. When combined with a novel system for automatically classifying the collected text, the crawler provides us with many opportunities for exploiting web content. We have thus far concentrated on using the crawler to develop more effective databases for word prediction.

IMPROVING WORD PREDICTION

Traditional word prediction systems have used word frequency lists to complete words already started by the user. Statistical relations between word sequences can be exploited to improve predictive accuracy. Inter-word statistics are generally derived through the analysis of corpora containing representative samples of text. Local word context can then be cross-referenced with these statistics to generate prediction lists.

Our research has demonstrated the importance of using large text corpora to derive inter-word statistics (Lesher, Moulton, & Higginbotham, 1999). In that study, the keystroke savings associated with word prediction improved from 50.2% to 54.4% as the training corpus size was increased from 500 thousand to 3 million words. Additionally, the results indicated that keystroke savings would increase further with larger training corpora.
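As a rough illustration of how such inter-word statistics drive prediction, the minimal sketch below completes a partially typed word by ranking candidates according to a simple bigram table. The tiny training text, function names, and five-item prediction list are illustrative assumptions only; a production engine would be built from the much larger databases described above.

    from collections import defaultdict

    def build_bigram_counts(corpus_words):
        # Count how often each word follows each preceding word.
        counts = defaultdict(lambda: defaultdict(int))
        for prev, curr in zip(corpus_words, corpus_words[1:]):
            counts[prev.lower()][curr.lower()] += 1
        return counts

    def predict(prev_word, prefix, counts, n=5):
        # Rank completions of 'prefix' by how often they follow 'prev_word'.
        candidates = counts.get(prev_word.lower(), {})
        matches = [(w, c) for w, c in candidates.items() if w.startswith(prefix.lower())]
        matches.sort(key=lambda pair: pair[1], reverse=True)
        return [w for w, _ in matches[:n]]

    # Hypothetical usage: train on a tiny text, then predict completions of the
    # partial word "ba" when the previous word was "the".
    text = "the ball went over the fence and the baseball game ended"
    counts = build_bigram_counts(text.split())
    print(predict("the", "ba", counts))   # e.g., ['ball', 'baseball']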


Unfortunately, it proved difficult to obtain larger general-purpose corpora that didn't have significant use restrictions.

Historically, constructing large text databases has been difficult. Although the proliferation of electronic documents has made the collection of certain types of text easier than ever, there remain substantial hurdles to the development of large corpora. It remains difficult to collect a disciplined set of texts, carefully balanced to provide representative content and style. The Brown Corpus, for example, consists of text samples from each of 15 written genres ranging from journalistic editorials to science fiction. Tracking down sufficiently large samples from these diverse genres is extremely time-consuming.

The Web offers a vast repository of electronic documents, but its pages are wildly diverse in style and sophistication. Many are wholly inappropriate for corpus generation, yet numerous examples of all the genres included in the Brown Corpus remain. Of course, it would take an inordinate effort to manually collect and sort text from the Web to generate a comparable corpus. Our web crawler was thus initially designed for the disciplined, autonomous collection of large text corpora from the Web. This program traverses web sites collecting blocks of text, filtering out text that is inconsistent with a pre-defined corpus specification. We have employed the crawler to collect a 100 million word corpus, which was used to generate a prediction database that provided keystroke savings of 56%.

In the first stage of corpus generation, the crawler traverses the Web, extracting meaningful blocks of text for consideration as part of the corpus. In the second stage, each text block is processed by a classification module. By examining the word usage and syntax patterns of the text, the blocks are categorized into specific domains (e.g., genres, styles, or education levels). This classification information is used to determine whether the text blocks are compatible with the corpus specification and thus should be added to the growing corpus. The classification stage has been the focus of our research efforts.

CLASSIFYING TEXT BY GENRE OR STYLE

Biber (1988) proposed using statistical means to classify texts. He defined a variety of objective measures that could be easily extracted from texts, and then showed that the relative magnitudes of these measures (collectively called a "feature vector") were indicative of particular text genres. As a rudimentary example, a high percentage of pronouns is indicative of a conversational text, while a low percentage is indicative of a scientific work. The measures actually employed by Biber were far more sophisticated, incorporating a combination of various syntactic features.

We have extended Biber's classification approach, using a neural network to automatically classify text blocks. In our paradigm, a number of pre-collected text blocks are manually tagged with their respective styles and genres. The feature vectors of these tagged blocks are then used to "train" a neural network, such that the network learns to associate the individual feature vectors with their tags. The system might learn, for example, that a specific relationship between adjective frequency and intransitive verb frequency is indicative of journalistic text. The ARTMAP neural network (Carpenter & Grossberg, 1992) was trained to discriminate between acceptable text blocks (e.g., narratives and essays) and unacceptable blocks (e.g., everything else). It could then classify each block provided by the crawler.
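To make the feature-vector idea concrete, the sketch below computes a handful of simplified surface measures for a text block. These particular measures (pronoun rate, average sentence length, type-token ratio) are illustrative assumptions rather than Biber's actual features or the network's real inputs; in practice such vectors, tagged by hand, would be used to train the classifier.

    import re

    # Illustrative stand-ins for Biber-style measures; real feature vectors use
    # dozens of syntactic counts (verb forms, nominalizations, etc.).
    PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
                "me", "him", "her", "us", "them"}

    def feature_vector(text):
        words = re.findall(r"[a-zA-Z']+", text.lower())
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        if not words or not sentences:
            return [0.0, 0.0, 0.0]
        pronoun_rate = sum(w in PRONOUNS for w in words) / len(words)
        avg_sentence_length = len(words) / len(sentences)
        type_token_ratio = len(set(words)) / len(words)
        return [pronoun_rate, avg_sentence_length, type_token_ratio]

    # A conversational block and an academic one yield visibly different vectors.
    print(feature_vector("I told you we would win. He said it was fine."))
    print(feature_vector("The corpus comprises fifteen written genres sampled systematically."))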


We tested our neural network by forcing it to classify text blocks from the Brown Corpus. By comparing ARTMAP's performance against the known genres of 1000 different text blocks, we determined that our system correctly identified the genre (from the 15 genre categories in the Brown Corpus) over 93% of the time.

CLASSIFYING TEXT BY TOPIC

When constructing a broad corpus for use with a general-purpose word prediction engine, it makes sense to include text from a variety of genres and styles, irrespective of the topics of that text. But what if we want to make predictions that are specific to a particular topic? This would be useful, for example, if we know that we're going to be talking about baseball, or politics, or the latest movie. Can the web crawler be used to construct topic-specific prediction databases for this purpose?

Our statistical approach was well-suited for identifying genres and styles of text, but since this method is based solely on syntax (rather than semantic content), it cannot be used to identify topic domains without some modifications. However, we can extend the approach by adding "keywords" to the feature vector that ARTMAP uses to identify texts. These keywords might be the names of the topics themselves (e.g., "baseball"), or they might be words associated with the topic (e.g., "bat", "hit", "stadium"). As long as the keywords are in the right ballpark (so to speak), the texts identified by the web crawler can be used to derive an effective topic-specific prediction database. We have successfully combined keyword-based and syntax-based classification methods to generate dozens of topic-specific databases.

A downside to this approach, however, is that the crawler must search through extraneous web pages to find those pages that are on-topic. While the operator could manually direct the crawler to look only at certain sites likely to have relevant text (e.g., the ESPN web site), we'd clearly prefer an automatic method of determining such sites. Web search engines do exactly what we need: given keywords, they provide a list of relevant sites. Google provides an experimental interface that allows third-party programs to access its search engine. We have incorporated this interface into our web crawler, and preliminary results are extremely promising. Note that utilizing the Google interface does not eliminate the need for the text classification system; this module is still needed to identify valid text samples (as opposed to tables, lists, ads, etc.), to eliminate spurious search results, and to process the sites indicated by links embedded within the initial search pages.
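The keyword extension can be pictured with a similarly minimal sketch: the fraction of topic keywords in a block is computed and, in place of the trained ARTMAP network, compared against a fixed cutoff. The keyword list, function names, and threshold below are hypothetical.

    import re

    # Hypothetical keyword list for a "baseball" topic database; a real system
    # would take these from the operator or derive them from the topic name.
    BASEBALL_KEYWORDS = {"baseball", "bat", "hit", "stadium", "inning", "pitcher"}

    def keyword_rate(text, keywords):
        # Fraction of the block's words that are topic keywords.
        words = re.findall(r"[a-zA-Z']+", text.lower())
        if not words:
            return 0.0
        return sum(w in keywords for w in words) / len(words)

    def looks_on_topic(text, keywords, threshold=0.02):
        # In the actual system this value would join the syntactic feature vector
        # and be judged by the trained network; the fixed threshold is a stand-in
        # so that the sketch runs on its own.
        return keyword_rate(text, keywords) >= threshold

    block = "The pitcher threw a curveball and the batter hit it out of the stadium."
    print(keyword_rate(block, BASEBALL_KEYWORDS), looks_on_topic(block, BASEBALL_KEYWORDS))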


SUMMARY AND FUTURE DIRECTIONS

We have successfully implemented a web crawler that can autonomously collect an arbitrary amount of text of a specific genre, style, or topic. Thus far, the crawler has operated as a standalone program with no direct ties to AAC software. Once testing of the Google component of the web crawler has been completed, we will integrate the crawler into our experimental AAC software. This will allow us to further explore methods for dynamically generating topic-specific prediction databases for immediate use during communication. Additionally, we plan to investigate ways to extract meaningful words and phrases from the topic-specific text collected by the crawler. These messages could then be presented directly on interface buttons and combined with images and sounds also grabbed from the Web, opening the possibility of automatic generation of topic pages in real time.

REFERENCES

Biber, D. (1988). Variation Across Speech and Writing. Cambridge, UK: Cambridge University Press.

Carpenter, G.A. & Grossberg, S. (1992). Fuzzy ARTMAP: Supervised learning, recognition, and prediction by a self-organizing neural network. IEEE Communications Magazine, September, 38-49.

Lesher, G.W., Moulton, B.J., & Higginbotham, D.J. (1999). Effects of ngram order and training text size on word prediction. Proceedings of the RESNA 1999 Annual Conference (pp. 52-54). Arlington, VA: RESNA Press.

ACKNOWLEDGEMENTS

The authors wish to acknowledge support from the U.S. Department of Education under grants #H133E980026, #H133E030018, and #RW97076002. The opinions expressed are those of the authors and do not necessarily reflect those of the supporting agency.

