Text Gathering and Processing Agent for Language Modeling Corpus

Daniel Hladek
Department of Electronics and Multimedia Telecommunications
Technical University of Kosice, Kosice, Slovak Republic
Email: [email protected]

Jan Stas
Department of Electronics and Multimedia Telecommunications
Technical University of Kosice, Kosice, Slovak Republic
Email: [email protected]

Abstract—An approach to the acquisition, preprocessing and storage of large quantities of text for the creation of a Slovak language model is presented. Text downloaded from the web is preprocessed by an automatically generated parser. Heuristic parser rules identify entities in the text, such as abbreviations or sentence ends. Raw and processed text is stored in a relational database. Effective filters for erroneous and duplicate sentences are proposed. The evaluation covers the effect of the proposed filters on the corpus and the growth tendency of the database.

I. INTRODUCTION

A language model is a key component of automatic speech recognition, grammar correction and spoken language recognition. In order to create a language model, it is necessary to gather a large amount of text called a training corpus. There have been only a few attempts to create a database of corpora for statistical processing of the Slovak language. One of them, the Slovak National Corpus [1], does not meet our needs due to its different approach, insufficient size and licensing. This paper continues the work presented in [2], [3], [4].

For building a good language model, it is very helpful to have a sufficiently large database of text that unifies various sources in one place. Such a database can then be used to easily construct domain-specific corpora from the already collected and prepared data. Before the stored text can be used, it has to be preprocessed. Recent approaches use statistical methods, context-free grammars, or their combination, as described in [5].

II. TEXT DATABASE

Several types of electronic sources can be used for the creation of the text database:
1) Printed media - Classic paper books, newspapers and magazines. Printed text must first be scanned and processed by OCR software. In this phase we do not focus on this source.
2) Static electronic sources - Electronic databases of text, such as laws or theses, available on removable media such as DVD, or downloadable from a certain Internet site.

3) On-line electronic sources - Usually a web page that contains news, magazines or commercial presentations. This source seems to be the most promising, but many problems with cleaning and extraction have to be solved.

All these sources need to be gathered in one place. The best way to store, process and sort a large quantity of structured data seems to be a relational database. Text data are collected in smaller parts called documents. One document contains one piece of text written about one theme from one source; an example of a document is a magazine article about motorcycles in the PDF format. One document contains a certain amount of text that can be used for the creation of a text corpus. The process of acquiring the text and sorting out irrelevant text is called extraction. All documents in the database can be stored in just one relation with the following attributes:
1) Document file - Location of the document file on the disk. It has been found that storing binary data directly in the database is not efficient for this use case, so files are stored in a dedicated directory on the disk instead.
2) Extracted text - Text that was extracted from the document file and is ready to be incorporated into the corpus.
3) Document source - Description of the original document location. The most convenient form seems to be a URI string. This format is general enough to assign each document a unique name.
4) Document processing status - Information about the preliminary outcome of the processing:
a) Processing - Document is not ready yet
b) Final - Extracted text can be inserted into the corpus
c) Error - Document contains an error and could not be processed
d) Copy - Document is a copy of some other document in the database
e) Bad text - Extracted text of the document has low quality
5) Meta information - String with additional information about the document, such as keywords that can help determine the document domain.

6) Segment - Every document in the database can be assigned to a certain partition of the database that will be used for the construction of a certain corpus.

The most important issue when building a big text database is the control of document redundancy, to prevent multiple copies of the same document from appearing in the resulting corpus. This issue has to be kept in mind during each insert or update of the database. It is possible that the same document is stored under various URIs. On the other hand, one URL can serve a document that is often updated and each time contains different content. That is why multiple criteria for redundancy control have to be used (a schema sketch follows the list):
1) URI based control - Documents in the database should have a unique URI to ensure that from one place there will be just one document. The URI of a document is usually not a very long string, so it is sufficient to implement this control as a simple unique constraint in SQL.
2) Document file control - Ensures that the same file will occur just once in the database. It is very time consuming to compare the contents of two files. To speed up the search for a file copy, a hash code of the file is stored in the database. A hash code is a checksum computed over all characters of a string, such that two different strings have different hash codes with high probability. This makes it easy to control duplicates of text strings, such as the extracted text, by assigning a "unique" constraint to the document file hash column.
3) Extracted text control - Ensures that there will be no two documents with the same content. The same article can appear in both a PDF and an HTML form, and this rule should prevent this case. Again, a hash code of the extracted text is stored in the database together with a unique constraint.
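The paper does not give the actual schema, so the following is a minimal sketch of how the document relation and the three redundancy constraints could look; the table and column names are our own illustration, not taken from the paper.

import sqlite3

# Sketch of the document relation described above. Files themselves live on
# disk; the table stores their location plus hash codes for redundancy control.
conn = sqlite3.connect("textdb.sqlite")
conn.executescript("""
CREATE TABLE IF NOT EXISTS document (
    id          INTEGER PRIMARY KEY,
    file_path   TEXT,                    -- 1) location of the file on disk
    text        TEXT,                    -- 2) extracted text
    uri         TEXT UNIQUE,             -- 3) document source; URI based control
    status      TEXT CHECK (status IN
        ('processing','final','error','copy','bad_text')),  -- 4) status
    meta        TEXT,                    -- 5) keywords, domain hints
    segment     TEXT,                    -- 6) corpus partition
    file_hash   TEXT UNIQUE,             -- document file control
    text_hash   TEXT UNIQUE              -- extracted text control
);
""")

An INSERT that violates any of the three UNIQUE constraints fails, which is exactly the behaviour the redundancy criteria call for; the application can then mark the new document as a copy.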

III. TEXT ACQUISITION AND PROCESSING

Once the database is ready, it is possible to acquire text from the various sources mentioned above. One of the most promising sources is the Internet, because it stores large quantities of text on various themes. For that purpose a text gathering agent has been created. The text gathering agent must fulfill tasks such as HTTP header parsing for handling redirections and MIME type resolution, and HTML parsing for encoding detection, HTML entity replacement and link extraction for further web exploration.

There are several types of entities in the gathered text. All of them have to be detected and assigned the correct type of action. For example, ordinal numerals or abbreviations have to be rewritten in word form, while regular words just have to be checked for correctness, converted to lowercase and added to the corpus. The following steps have to be performed:
1) Abbreviation expansion - An abbreviation in the Slovak language is a string of several lowercase characters finished by a dot. In some special cases it can begin with a capital letter or contain more parts, such as Z.z. (zbierka zakonov - code of laws).

2) Number expansion - A basic numeral usually occurs as a string of digits, while an ordinal numeral ends with a dot. A very hard problem is determining the correct grammatical case and gender of the numeral. This has to be found out by inspecting the context of the numeral, usually the following word. The results of automatic number expansion therefore strongly depend on the part-of-speech tagger (one of the possible approaches is [6]). The number expansion system uses a pre-processed vocabulary of words with part-of-speech tags from [1].
3) Date expansion - A date in the Slovak language consists of one number with a dot expressing the day, followed by a number expressing the month (e.g. 10. 12.). The year number does not have to be incorporated into the date entity because it is rewritten just like a basic number. A date entity is easily distinguished from a floating point number because Slovak text writes floating point numbers with a comma (e.g. 13,4).
4) Special symbol expansion - Some special characters, such as % or §, have to be rewritten as words.
5) Sentence segmentation - Sentences in the Slovak language are ended by a dot, an exclamation mark or a question mark. The occurrence of a dot is not a sufficient feature of a sentence end; it can also mean an ordinal numeral or an abbreviation. These cases must be detected first.

In the first phase, called lexical analysis, the text is segmented into tokens. A token is the smallest atomic part of the text that has a meaning for further processing; tokens are separated by whitespace, and examples of tokens are a number or a word. Tokens identified in the lexical analysis form more complex structures, which are detected in the morphological analysis step. For example, the date entity consists of the tokens number, dot and number, as explained above. Each entity has an assigned action, triggered by the occurrence of a particular string of tokens. In the case of the date entity, the action rewrites the first number in word form and the second number as a month name. This step is performed in the semantic analysis phase. The whole process is depicted in Fig. 1.

Fig. 1. Text processing: lexical, morphological and semantic analysis rewrite "Narodil sa 30.5. 2002." (identified entities: word, word, date, dot) as "Narodil sa tridsiateho mája dvetisíc dva."

The specialized parser-generator language ANTLR [7] has been used. According to these rules a parser is generated that detects the entities described above and executes their actions. A rough sketch of the idea follows.
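The actual ANTLR grammar is not reproduced in the paper. As a rough Python illustration of the lexical-to-semantic flow for the date entity, with an invented token pattern and invented day and month tables (a real system would also inflect for grammatical case):

import re

# Hypothetical lookup tables; illustrative entries only.
MONTHS = {5: "mája", 12: "decembra"}
DAYS = {30: "tridsiateho", 10: "desiateho"}

TOKEN_RE = re.compile(r"\d+\.?|[^\s]+")  # lexical analysis: split into tokens

def expand_dates(sentence: str) -> str:
    tokens = TOKEN_RE.findall(sentence)
    out = []
    i = 0
    while i < len(tokens):
        # morphological analysis: a date is the token string "number. number"
        if (i + 1 < len(tokens)
                and re.fullmatch(r"\d+\.", tokens[i])
                and re.fullmatch(r"\d+\.?", tokens[i + 1])):
            day = int(tokens[i].rstrip("."))
            month = int(tokens[i + 1].rstrip("."))
            if day in DAYS and month in MONTHS:
                # semantic analysis: trigger the rewrite action
                out.extend([DAYS[day], MONTHS[month]])
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)

print(expand_dates("Narodil sa 30.5."))  # -> Narodil sa tridsiateho mája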

The end of a sentence is usually handled by statistical methods, e.g. [8], [9]. Due to the lack of a sufficient amount of training data, end-of-sentence detection is instead incorporated as a set of rules for entity detection in the automatically generated parser:
• Abbreviation - Consists of a word and a dot. This entity is ambiguous; a word ending with a dot does not have to be an abbreviation. To resolve whether a word ending with a dot is an abbreviation, an abbreviation dictionary is used (see the sketch after this list). If the word is an abbreviation, it is expanded and added to the sentence. Otherwise it is handled like a normal word and the end of the sentence is marked.
• Word - A regular word starting with a small letter is just added to the sentence.
• Capital - A word starting with a capital letter. It can mean the start of a sentence or can be a name. The occurrence of too many capitalized words in a sentence can mean that the sentence is bad (described below).
• Date - An entity consisting of the tokens number, dot, number. When it is detected, it is rewritten and added to the sentence.
• Ordinal numeral - A number with a dot, not matched as a date.
• Basic numeral - A number that does not belong to the previous entities.
• Special symbol - One of % + / $.
• End - Consists of a dot that was not matched by the previous entities (abbreviation, date or ordinal numeral). It marks the end of the sentence.
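As an illustration only (the dictionary contents and expansion strings below are invented, not taken from the paper), the abbreviation rule could be applied like this:

# Minimal sketch of the abbreviation rule: a word ending with a dot is either
# an abbreviation (expand it) or a sentence end (strip the dot, mark the end).
ABBREVIATIONS = {"atď.": "a tak ďalej", "č.": "číslo"}  # illustrative entries

def apply_dot_rule(token: str, sentence: list, sentences: list) -> None:
    if token in ABBREVIATIONS:
        sentence.extend(ABBREVIATIONS[token].split())   # expand and continue
    else:
        sentence.append(token.rstrip("."))              # normal word
        sentences.append(" ".join(sentence))            # mark end of sentence
        sentence.clear()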

A. Detection of Invalid Sentences

Sentences that repeat too often, such as headlines or advertisements, are counter-productive for the corpus. These sentences have to be removed by the sentence count filter. The counting algorithm uses the hash code of a sentence to quickly determine how many times the sentence appeared in the corpus. Sentences that occur more than a given number of times Smax are discarded (a sketch of this filter follows the examples below).

Some sentences, even those with low occurrence counts, can degrade the value of a training text: they can contain too many out-of-vocabulary words or numerals, or can be written in a different language. This is especially true for sentences gathered from the Internet. These sentences have to be removed by the sentence error filter. Bad sentence examples:

Sep 2009 11: 47: 54 Názov obrázku: Skalka 2 Názov albumu: Kvety. Download Kód Vybrať všetko Len registrovaný užívatelia.

Good sentence examples:

Veci ktoré rozčúlia teriéra si veľká doga ani nemusí všimnúť. Tak mechanik povedal ze sa zasekla pri nehode aj prevodovka.
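Returning to the sentence count filter: a minimal sketch could look as follows. The threshold default is illustrative, and we assume that all copies of an over-frequent sentence are discarded; the paper does not say whether the first Smax copies are kept.

from collections import Counter

def count_filter(sentences, s_max=50):
    # Hash-based counting as described above; Counter hashes the sentence
    # strings internally. Two passes: count, then filter.
    counts = Counter(sentences)
    return [s for s in sentences if counts[s] <= s_max]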

Fig. 2. Database size

Fig. 3. Item count

It can be seen that the bad sentence contains a larger number of numerals and capitalized words. On the other hand, a regular sentence just starts with a capitalized word and rarely contains other capitalized words or numbers. A heuristic procedure to evaluate the quality of a sentence comes from the following presumptions:
• A sentence is irrelevant when it contains too many numbers.
• A sentence is irrelevant when it contains too many capitalized words.
• A sentence is irrelevant when it contains too many out-of-vocabulary words.

For each sentence, the numbers, capitalized words and out-of-vocabulary words are counted, and according to these rules the sentence error coefficient Es is calculated as

$$E_s = \frac{w_{bad} N_{bad} + w_c N_c + w_n N_n}{N_{total}}$$

where Nbad is the count of out-of-vocabulary words, Nc is the count of capitalized words, Nn is the count of numbers and Ntotal is the total count of words in the sentence. A weight w is assigned to each of these counts to tell how much the count adds to the total sentence error coefficient.

Setting the maximum allowed value Ev for the Es coefficient influences the size and quality of the training corpus. If it is too high, the corpus is bigger, but it contains more text that can be irrelevant. If it is set too low, the corpus is too small and contains just sentences without capitalized words or numbers.
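A straightforward implementation of the error filter might look as follows; the vocabulary set is an input, and the weight and threshold defaults are placeholders (Table I explores the actual settings):

def sentence_error(words, vocabulary, w_bad=1.0, w_c=0.17, w_n=0.17):
    """Compute the sentence error coefficient Es defined above."""
    n_total = len(words)
    if n_total == 0:
        return 0.0
    n_n = sum(w.isdigit() for w in words)                    # numbers
    n_c = sum(w[:1].isupper() for w in words)                # capitalized words
    n_bad = sum(w.lower() not in vocabulary and not w.isdigit()
                for w in words)                              # OOV words
    return (w_bad * n_bad + w_c * n_c + w_n * n_n) / n_total

def error_filter(sentences, vocabulary, e_v=0.12):
    # Keep only sentences whose error coefficient stays below the limit Ev.
    return [s for s in sentences
            if sentence_error(s.split(), vocabulary) <= e_v]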

Fig. 4. Influence of the sentence filters on the corpus size: a.) sentence count filter - corpus relative size vs. maximal sentence count Smax; b.) sentence error filter - corpus relative size vs. sentence error limit Ev, plotted for wn = 0.17 and wn = 0.5.

TABLE I
LANGUAGE MODEL EVALUATION

wc, wn   Ev     LM Size [kB]   WER [%]
0.6      0.10   426140         17.71
0.6      0.12   465342         17.52
0.6      0.17   544388         17.66
0.6      0.3    658163         17.75
0.17     0.00   111796         21.55
0.17     0.05   419928         17.94
0.17     0.10   545977         17.67
0.17     0.17   635154         17.60
0.17     0.20   656370         17.69

IV. RESULTS OF DATABASE CREATION AND FILTERING

The history of the database size is depicted in Fig. 2. It displays the total space used by all downloaded documents in html, doc, rtf or pdf formats. The growth of the number of items in the database is displayed in Fig. 3. It can be seen that the agent continuously fills the database. The speed of downloading has been increasing as errors in the code were eliminated and the downloading algorithm was optimized. The current database growth has approximately linear characteristics, which shows that the database has not yet been saturated and there is space for future growth.

The influence of the sentence count filter and the sentence error filter on the corpus size has been inspected on a collection of 28 231 pages downloaded mostly from one of the major blog websites on the Slovak Internet. First, text was extracted from the HTML code, parsed and segmented into sentences using the technique described above. Then the filters with various parameters were applied to the resulting text. Fig. 4.a displays the effect of the sentence count filter; the x-axis is the maximal allowed count of the same sentence, and the y-axis shows the resulting corpus size. The sentence error filter is evaluated in a similar way in Fig. 4.b, where the x-axis is the maximal allowed error per sentence. Tests for various values of Ev were run with wn = wc = 0.17 and wn = wc = 0.5; wbad was set to 1 in all cases.

The third experiment evaluates the gathered corpus with a speech recognition system. Each gathered and filtered corpus is used to construct an unpruned bigram language model. The Witten-Bell smoothing technique is applied to the language model to estimate the probabilities of unseen bigrams (more on language model evaluation can be found in [10]). This language model, together with an acoustic model trained on the parliament training speech database, is used to recognize a test set of 825 sentences from the parliament speech database. The word error rate is calculated for each language model. The results of speech recognition and the resulting language model sizes, depending on the parameters of the sentence error filter, are shown in Table I. This experiment showed that the optimal settings for the sentence error filter are approximately wc = 0.17 and Ev = 0.12. A sketch of such a model build follows.
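The paper does not name its modeling toolkit. Assuming one wanted to reproduce a comparable model in Python, NLTK's language-model module offers Witten-Bell interpolation; the corpus file name below is hypothetical.

from nltk.lm import WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Read the filtered corpus: one sentence per line, whitespace-tokenized.
with open("corpus_filtered.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

# Unpruned bigram model with Witten-Bell smoothing for unseen bigrams.
order = 2
train, vocab = padded_everygram_pipeline(order, sentences)
model = WittenBellInterpolated(order)
model.fit(train, vocab)

print(model.score("sa", ["Narodil"]))  # P(sa | Narodil) under the model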

V. CONCLUSION

The text gathering agent can obtain and process a sufficient amount of documents from the Internet. Experimental results show that the proposed filters are able to process raw text and produce a corpus usable for language model training. Possible future research steps include semantic analysis of the document content. It would allow automatic proposal of document meta information such as keywords, domain or document author. This would enable easier construction of domain-specific corpora for training language models.

ACKNOWLEDGEMENT

The research presented in this paper was supported by the Slovak Research and Development Agency and the Ministry of Education under research projects APVV-0369-07, VMSP-P0004-09 and VEGA-1/0065/10.

REFERENCES

[1] Slovenský národný korpus r-mak-3.0, Jazykovedný ústav Ľ. Štúra SAV, 2009. [Online]. Available: http://korpus.juls.savba.sk
[2] M. Mirilovič, J. Juhár, and A. Čižmár, "Large vocabulary continuous speech recognition in Slovak," in AEI'08 - Applied Electrical Engineering and Informatics International Conference, Athens, Greece, ISBN 978-80-553-0066-5, September 8-11, 2008, pp. 73-77.
[3] ——, "Comparison of grapheme and phoneme based acoustic modeling in LVCSR task in Slovak," Lecture Notes in Artificial Intelligence, pp. 242-247, 2009.
[4] M. Mirilovič and J. Juhár, "Morphological segmentation of word units for large vocabulary automatic speech recognition in Slovak," in HLT 2007 - Proceedings of the Third Baltic Conference on Human Language Technologies, ISBN 978-9955-704-53-9, October 4-5, 2007, pp. 189-196.
[5] M. Collins, "Head-driven statistical models for natural language parsing," Computational Linguistics, vol. 29, no. 4, pp. 589-637, 2003.
[6] J. Kanis, J. Zelinka, and L. Müller, "Automatic numbers normalization in inflectional languages," in Proc. SPECOM, Moscow, 2005, pp. 663-666.
[7] T. Parr, The Definitive ANTLR Reference: Building Domain Specific Languages, ISBN 978-0-9787392-4-9. Pragmatic Bookshelf, 2007.
[8] K. Tomanek, J. Wermter, and U. Hahn, "Sentence and token splitting based on conditional random fields," in Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, 2007, pp. 49-57.
[9] A. Mikheev, "Tagging sentence boundaries," in Proceedings, SIGIR 2000, 2000.
[10] J. Staš, D. Hládek, and J. Juhár, "Language model size reduction by quantization and pruning," Journal of Electrical and Electronics Engineering, vol. 3, pp. 205-208, 2010.
