A Method for Automating the Extraction of Specialized Information from the Web

Ling Lin (1), Antonio Liotta (1), and Andrew Hippisley (2)

(1) Department of Electronic Systems Engineering, University of Essex, Colchester, CO4 3SQ, UK
{llini, aliotta}@essex.ac.uk
(2) Department of Computing, University of Surrey, UK
[email protected]

Abstract. The World Wide Web can be viewed as a gigantic distributed database comprising millions of interconnected hosts, some of which publish information via web servers or peer-to-peer systems. We present a novel method for extracting semantically rich information from the web in a fully automated fashion. We illustrate our approach via a proof-of-concept application which scrutinizes millions of web pages looking for clues as to the trend of the Chinese stock market. We present the outcomes of a 210-day study which indicates a strong correlation between the information retrieved by our prototype and the actual market behavior.

1 Introduction

The Web has become the major source of information: a gigantic database in constant growth which disseminates news and documents at incredible speed. People can gain access to the information they are interested in thanks to sophisticated search engines providing keyword matching and thematic filters. Despite this success, the web increasingly resembles a 'black hole' from which information is difficult to retrieve. Users are routinely overwhelmed by documents that a human reader cannot digest in 'real time', because the tools currently available are not suited to the extraction of cognitive knowledge. With such a powerful, dynamic and large-scale database at everybody's fingertips, tools allowing automatic, intelligent scans of the web are needed. One would expect to extract knowledge, that is aggregate information, from the web rather than raw documents requiring human interpretation.

With the objective of building an intelligent information retrieval system, we have developed a novel methodology for extracting specialized information from the Web. We present this methodology by means of a case study, where we extract 'sentiments' of the Chinese stock market from a daily scan of thousands of web pages, which are then interpreted by our intelligent agent system. We combine self-learning techniques with simple statistical processing to establish whether the web page under scrutiny gives positive or negative information about the stock market (we actually use a range of 10 marks). We then aggregate the information extracted from all web pages scrutinized in one day and come up with a market sentiment, which is an indication of the status of the market. This high-level information is offered to the user, as opposed to the thousands of web pages that the user could not possibly digest in a single day. Finally, in order to verify that the information we provide is accurate, we compare our 'sentiments' with the actual index of the stock exchange.

We propose a new information retrieval [1] technique which extracts relevant information from documents. We make use of Chinese language processing algorithms to address the word segmentation problem, which is not present in Western languages, adopting the dictionary-based maximum forward match approach proposed in [2]. We develop a case study focused on the extraction of financial knowledge from Chinese web sites that report news about the Chinese market. Cross-validation is, hence, performed against the Shanghai Stock Exchange. The method is, however, more general and could be used for the extraction of any other specialized information in any other language. To apply our method to other contexts one need only select its three dictionaries according to the language, the specialized terminology and the specific semantics to be used to produce the aggregate information.

2 Methodology

2.1 Overview

In order to evaluate our methodology, we developed an intelligent agent system as a case study, the Chinese Language Market Sentiment System, which extracts from the web semantically rich information about the Chinese stock market. The idea lies in quantifying the market sentiments expressed in Chinese financial news. Market sentiments are the perception of traders as to how good or bad the stock market is; they can be used to indicate the demand, or lack of demand, for financial instruments. If the feeling about the stock market can be expressed numerically, the market sentiments expressed in financial news can be treated as a daily series and compared with the daily series of actual stock market values [3]. The strategy is therefore as follows: first, extract market sentiments from web pages on the Internet; then, aggregate these market sentiments into a daily series and compare it with the daily series of the Shanghai Stock Exchange Composite to examine whether there is a correlation between them. In our case the correlation between sentiments and the stock market trend is used to validate the methodology.

Fig. 1 shows the high-level process, which involves three types of intelligent agents: web spider agents, HTML parser agents and language processing agents. Web spider agents are responsible for retrieving the large volumes of financial news HTML documents available on the Internet. Starting from a 'seed' link, the Yahoo stock news web site in this case, web spider agents perform breadth-first searches through the links in web pages across the Internet; a sketch of this crawling loop is given at the end of this overview. Web spider agents have two main tasks: one is to feed HTML source documents to the HTML parser agents; the other is to detect URLs and fetch them back in order to crawl through the Internet. Meanwhile, already-visited URLs are filtered out before new URLs are fed back to the URL buffer queue.

The HTML parser agents convert information implicitly stored as an HTML structure into information explicitly stored as an XML structure for further processing. For each HTML source document, a title, the content, a written date and an author are extracted to construct a content XML, which is stored according to the written day in order to generate the daily series of market sentiment. Both the title and the content are analyzed by the language processing agents.
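The paper does not show the agents in code; the following Python sketch illustrates the breadth-first crawl described above under simplifying assumptions (the function names, page limit and error handling are ours, and the seed would be the Yahoo stock news URL).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=1000):
    """Breadth-first crawl from a seed link, yielding (url, html) pairs
    for the HTML parser agents; visited URLs are filtered out before
    new links re-enter the URL buffer queue."""
    queue, visited = deque([seed]), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue                       # unreachable or malformed URL
        yield url, html                    # feed the HTML parser agents
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                queue.append(absolute)     # filter visited URLs
```

Checking each URL against the visited set before it re-enters the buffer queue is what keeps the crawl from looping over the same pages.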

Fig. 1. High Level Process (the Internet → Web Spider Agents → HTML Source Documents → HTML Parser Agents → Content XML → Language Processing Agents → Market Sentiment Daily XML → Graphic User Interface → End Users)

Fig. 2. Language Processing Agent (Content XML → word segmentation with the General Dictionary → noise-word filtering → Segmented Word XML → financial-term counting against the Financial Dictionary → finance-relatedness decision → sentiment-weight counting with the Sentiment Dictionary → Market Sentiment Daily XML → Graphic User Interface)

The language processing agents are dictionary-based and auto-training; they are in charge of turning the market sentiments found in the content XML into a market sentiment daily series. As shown in Fig. 2, this is done in three steps. First, segment the content XML into segmented word XML and filter out 'noise' words, such as "is" and "am", which can be found in virtually every sentence and are therefore useless for language processing purposes. Second, look up the financial dictionary to determine whether the web page article is finance related, and discard every content XML that is not. Last, collect the market sentiment words in the segmented word XML, such as "increase", "rise", "decrease", etc., by searching the sentiment dictionary, and sum the document sentiment weights according to the written day; if 'not' or 'no' occurs in a sentence, we skip all sentiment words in that sentence. Finally, the daily series of market sentiments is compared with that of the Shanghai Stock Exchange Composite. All this is achieved through three dictionaries: a general dictionary, a financial dictionary and a sentiment dictionary. A sketch of the sentiment scoring step follows.
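The following is a minimal Python sketch of the scoring rule just described, assuming a toy sentiment dictionary with hypothetical weights and input that has already been segmented into sentences.

```python
# Toy sentiment dictionary; the weights are hypothetical stand-ins for
# the 21 weight groups described in Sect. 2.4.
SENTIMENT = {"rise": 8, "increase": 6, "decrease": -6, "fall": -8}
NEGATIONS = {"not", "no"}

def document_sentiment(sentences):
    """Sum sentiment weights over a document, skipping every sentiment
    word in any sentence that contains a negation."""
    total = 0
    for words in sentences:            # each sentence as segmented words
        if NEGATIONS & set(words):     # 'not'/'no' present: skip the sentence
            continue
        total += sum(SENTIMENT.get(w, 0) for w in words)
    return total

# The second sentence is skipped because it contains "not".
doc = [["shares", "rise", "sharply"], ["prices", "did", "not", "fall"]]
print(document_sentiment(doc))  # -> 8
```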

2.2 Natural Dictionary for Word Segmentation

Fig. 3. Auto-training Specialized Dictionary (Content XML from the HTML Parser Agents → word segmentation and noise filtering with the General Dictionary → Segmented Word XML → financial-term counting → finance-relatedness decision → sentiment-weight counting, feeding the Financial Dictionary)

Word segmentation is required in most East Asian natural language processing systems, since these languages have no built-in delimiters to mark the boundaries of multi-character terms or phrases. The first phase of language processing is dictionary-based word segmentation, whose aim is to segment the titles and contents in the content XML by looking them up in a general Chinese dictionary. The content XML is scanned sequentially and the maximum matching word from the general Chinese dictionary is taken at each successive location. A maximum matching algorithm is a greedy search routine that walks through a sentence trying to find the longest string, starting from a given point in the sentence, that matches a word entry in the general Chinese dictionary. The longest matched string is taken as an indexing token and shorter tokens within it are discarded.
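A minimal sketch of maximum forward matching follows; the toy dictionary uses English strings purely for readability, whereas the real system matches against a full general Chinese dictionary.

```python
def max_forward_match(text, dictionary):
    """Greedy maximum forward matching: at each position take the
    longest dictionary entry starting there; fall back to a single
    character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        match = text[i]                      # single-character fallback
        for length in range(len(text) - i, 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate            # longest match wins
                break
        tokens.append(match)
        i += len(match)
    return tokens

# The longer entry "stockmarket" is preferred over "stock".
print(max_forward_match("stockmarketrise", {"stock", "stockmarket", "rise"}))
# -> ['stockmarket', 'rise']
```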

2.3 Auto-training to Generate a Specialized Dictionary

This phase judges whether a content XML is finance related by summing financial term weights (Fig. 3). First, we look up the financial term dictionary; we then sum the financial weight of every word occurring in the segmented word XML and divide by the total number of segmented words:

FinancialRelated = \frac{\sum FinancialTermWeight}{TheNumberOfTermsOccurringInTheArticle}
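A direct transcription of this test might look as follows; the term weights and the threshold value are hypothetical, since the paper does not report the trained weights.

```python
# Hypothetical auto-trained weights: higher means more finance-specific.
FINANCIAL_WEIGHT = {"stock": 3.0, "dividend": 4.5, "weather": 0.0}

def is_financial(words, threshold=1.0):
    """FinancialRelated: summed financial term weights divided by the
    number of terms occurring in the article, compared to a threshold."""
    score = sum(FINANCIAL_WEIGHT.get(w, 0.0) for w in words) / max(len(words), 1)
    return score > threshold
```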

If the FinancialRelated value is higher than a certain threshold, the article is concluded to be finance related; otherwise, the article is discarded.

Our system depends to a large extent upon the financial dictionary. There are two main approaches to building dictionaries: the knowledge engineering approach and the auto-training approach. In the former, the dictionary entries are constructed by hand; the development process can be very laborious and carries high maintenance costs. In the auto-training approach, someone with sufficient knowledge of the domain annotates a set of training documents; once a training corpus has been annotated, a training algorithm is run to train the system for analyzing texts. This approach requires a sufficient volume of training data.

In practice it is almost impossible to find a free financial dictionary, and the number of Chinese financial terms runs into the thousands, so building one by hand would be costly and time-consuming. Nor did we have annotated financial news documents with which to train our financial term dictionary. Fortunately, the Yahoo stock news available on the Internet is a good training corpus. The idea lies in creating automatic, unsupervised dictionary construction agents which read the Yahoo stock news pages and count the frequency with which each word occurs in financial news. We assume that words occurring more frequently in stock news are more related to finance, and we include them in the financial term dictionary. A further advantage of this approach is that the dictionary is updated constantly, reflecting current language use on a continually changing Internet. A minimal sketch of this training step follows.
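This sketch assumes the corpus is already segmented; keeping only the top-N words and using relative frequency as the term weight are our assumptions, as the paper states only that frequently occurring words are included in the dictionary.

```python
from collections import Counter

def train_financial_dictionary(documents, top_n=5000):
    """Unsupervised dictionary construction: count how often each word
    occurs across the stock news corpus; frequent words are assumed to
    be finance related and kept, with relative frequency as weight."""
    counts = Counter()
    for words in documents:       # each document as a list of segmented words
        counts.update(words)
    total = sum(counts.values()) or 1
    return {w: n / total for w, n in counts.most_common(top_n)}
```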

2.4 Semantic Dictionary (Financial Term Trend Dictionary)

The expression of optimism or pessimism with respect to the behavior of the stock market relies on a choice of words which is generally understood. The words expressing market sentiment have not been standardized in the way that scientific and technical terminology has; instead, there is a general consensus on how to express optimism or pessimism about an instrument. Sentiment terms are limited in number and each term carries its own inherent market sentiment, so it is possible to construct the sentiment dictionary by hand. We have identified 138 Chinese sentiment terms, each of which conveys a 'good' or 'bad' market sentiment, and allocated them to 21 groups, each with an associated sentiment weight ranging from 10 down to -10:

FinancialTrend = \frac{\sum MarketSentiment}{TheNumberOfTermsOccurringInTheArticle}
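Putting the pieces together, here is a sketch of how per-article trends could be rolled up into the daily series compared in Sect. 3; summing per written day follows the description in Sect. 2.1, though the exact aggregation used in the prototype is not specified.

```python
from collections import defaultdict

def financial_trend(words, sentiment):
    """FinancialTrend: summed sentiment weights divided by the
    number of terms occurring in the article."""
    return sum(sentiment.get(w, 0) for w in words) / max(len(words), 1)

def daily_series(articles, sentiment):
    """Aggregate per-article trends by written date to obtain the
    daily market sentiment series."""
    by_day = defaultdict(float)
    for written_date, words in articles:   # (date, segmented words) pairs
        by_day[written_date] += financial_trend(words, sentiment)
    return dict(sorted(by_day.items()))
```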

3 Method Validation

Following the assumption that words occurring more frequently in stock news are more related to finance, the financial dictionary was trained starting from the Yahoo stock news web site. Training the system for seven days (from 12 AM, 16 July 2004 to 12 AM, 23 July 2004) was sufficient to achieve satisfactory results. Having put together the three dictionaries, we then ran our information retrieval system for another week (between 12 AM, 26 July 2004 and 12 AM, 2 August 2004), obtaining the market sentiment diagram of Fig. 4, which also includes the actual values of the Shanghai Stock Exchange Composite index.

Fig. 4. Comparing Shanghai Stock Exchange Composite and the Market Sentiment

The most apparent result is that the aggregate information built by our system is in strong agreement with the actual market trends. It is worth noting that the most evident horizontal stretches usually correspond to weekends. On the other hand, in small portions of the diagram there is a discrepancy between the two lines. Clearly the system is not perfect and should not be used to forecast the future, but that was not the purpose of our experiments. The strong overall correlation shows, nevertheless, that the system provides a good level of extraction of semantically rich information.

4 Concluding Remarks

In this paper we have presented a novel approach to extracting semantically rich information from the Web, illustrating its efficacy by way of a practical case study. To develop our experimental prototype we used a combination of technologies, ranging from intelligent agents to language segmentation and self-training knowledge systems. We have not yet optimized the system and there is still room for improvement, but the results achieved are extremely encouraging. Our work so far represents a significant first step towards automating semantic data extraction from the web.

References

1. Gaizauskas, R., et al.: Information Extraction: Beyond Document Retrieval. Computational Linguistics and Chinese Language Processing 3(2) (1998) 17-60
2. Cheng, K.S., et al.: A Study on Word-Based and Integral-Bit Chinese Text Compression Algorithms. Journal of the American Society for Information Science 50(3) (1999) 218-228
3. Gillam, L., et al.: Economic News and Stock Market Correlation: A Study of the UK Market (2002)
