SOCIAL COMPUTING
Multilingual Information Access on the Web
Soto Montalvo, Universidad Rey Juan Carlos
Raquel Martínez and Víctor Fresno, Universidad Nacional de Educación a Distancia
Rafael Capilla, Universidad Rey Juan Carlos
Named entities (NEs) can facilitate access to multilingual knowledge sources—which have exploded in recent years—but the identification, classification, and retrieval of NEs remain challenging tasks.
Computer, published by the IEEE Computer Society
Information access plays a central role in the acquisition and dissemination of knowledge. The exponential growth of multilingual data on the Web presents a challenge for companies and organizations seeking to mine and analyze unstructured knowledge in email, news, webpages, tweets, and other media resources and deliver this knowledge to users. Such multilingual information access requires novel natural language processing (NLP) and information retrieval (IR) techniques. A key difficulty in developing multilingual NLP and IR techniques is matching concepts and terms across different languages, which requires efficient and accurate translation. Today, the most popular translation approaches include [1]
›› machine translation techniques that produce a readable and reliable target-language version of a source text,
›› knowledge-based techniques that translate textual units using ontologies or dictionaries, and
›› corpus-based techniques that statistically analyze large textual collections and automatically extract and translate needed information.
Each of these approaches has disadvantages. For instance, a major drawback of machine translation is the computational effort required to translate a large amount of text. In the case of knowledge-based techniques, multilingual ontologies are expensive to build, costly to maintain, and difficult to update, and dictionaries don’t always provide good coverage. Generally speaking, corpus-based translation provides better results than ontologies or dictionaries, but it’s difficult to find the right corpus for certain languages, and some corpora aren’t large enough to be useful.
0018-9162/15/$31.00 © 2015 IEEE
EDITOR: CHRISTIAN TIMMERER, Alpen-Adria-Universität Klagenfurt; [email protected]
This installment of the Social Computing column, which describes the benefits and challenges of named entities as a tool for enabling multilingual information access on the Web, is the first to be accepted through open submission rather than directly solicited. Interested readers who wish to contribute articles to this column should contact me at [email protected].
Christian Timmerer, column editor
NAMED ENTITIES
One way to alleviate these problems is to leverage named entities (NEs) in multilingual Web queries to better recognize and efficiently classify the relevant terms to be translated into different languages. NEs are proper nouns that can refer to, for instance, people (Barack Obama), locations (Rome), organizations (Microsoft), ideologies (capitalism), religions (Buddhism), or some domain-specific category (such as proteins, drugs, or genes in the biomedical domain). Currently, the three universally accepted categories in NE taxonomies are person, location, and organization.
Today, more than half of all Web queries are related to NEs [2]. This makes identifying and classifying NEs crucial to successful information access and processing in a wide range of applications. For instance, extracting relevant NEs from tweets, of which there are more than 100 million daily, could help a company to assess a new product’s success, health authorities to track a disease’s evolution, or security agencies to identify and locate a potential terrorist.
NEs are identified and categorized in raw text using a named entity recognition and classification (NERC) system. Figure 1 outlines how a NERC system works. It first splits the text into sentences and then subdivides each sentence into lexical units called tokens. Next, the system tags parts of speech (POS) in the tokenized sentences: nouns, verbs, adjectives, and so on. Finally, it identifies and classifies NEs in POS-tagged sentences and annotates the text with these NEs. NERC systems can classify NEs using a particular approach (for example, rule-based, supervised learning, or hybrid) or according to an external source of NE occurrences such as Wikipedia or a gazetteer.
Figure 1. Named entity recognition and classification (NERC) system. Raw text passes through sentence segmentation, tokenization, and part-of-speech tagging; the NERC stage then outputs annotated text with person, location, and organization entities.
Figure 2. Named entities identify key terms in textual passages (in this case, from a New York Times news story) that answer basic questions such as who, what, where, and when.
July 2015
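The four NERC stages described above (sentence segmentation, tokenization, POS tagging, and entity classification) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `GAZETTEER` dictionary is a hypothetical three-entry stand-in for an external NE source, the "tagger" simply treats capitalized tokens as proper nouns, and the function names are invented here. A production NERC system would use trained models for each stage.

```python
import re

# Hypothetical gazetteer: a tiny stand-in for an external source of NE occurrences.
GAZETTEER = {
    "Barack Obama": "PERSON",
    "Rome": "LOCATION",
    "Microsoft": "ORGANIZATION",
}

def split_sentences(text):
    """Stage 1: naive segmentation on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Stage 2: split a sentence into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag(tokens):
    """Stage 3: toy tagger that marks capitalized tokens as proper nouns (NNP)."""
    return [(tok, "NNP" if tok[:1].isupper() else "OTHER") for tok in tokens]

def recognize_entities(tagged):
    """Stage 4: group consecutive NNP tokens and classify them via the gazetteer."""
    entities, run = [], []
    for token, tag in tagged + [("", "END")]:  # sentinel flushes the final run
        if tag == "NNP":
            run.append(token)
            continue
        if run:
            name = " ".join(run)
            if name in GAZETTEER:  # candidates missing from the gazetteer are dropped
                entities.append((name, GAZETTEER[name]))
            run = []
    return entities

def nerc(text):
    """Run all four stages and return (entity, category) pairs in text order."""
    results = []
    for sentence in split_sentences(text):
        results.extend(recognize_entities(pos_tag(tokenize(sentence))))
    return results

print(nerc("Barack Obama visited Rome. Microsoft opened an office there."))
```

Because classification relies only on gazetteer lookup, this sketch silently misses any entity that is absent from the dictionary, which is the coverage limitation of external NE sources in miniature.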
BENEFITS AND CHALLENGES OF NAMED ENTITIES
NEs broaden the scope of IR by identifying key terms in textual passages that answer many basic questions, namely, who (does something); what (has been done); and where, when, how, and why (it happened). Figure 2 shows an example of a news story with highlighted NEs that identify the story’s key actors and events. NEs likewise can help IR systems process content in tweets, forums, blogs, reviews, and other social networking sources.
However, NEs can be problematic in IR because of variations in terms referencing the same entity: for example, “President Barack Obama,” “Mr. Obama,” and “Barack H. Obama” all refer to the same person. Also, many NEs are identical in form to common nouns; thus, “White House” could be the US presidential mansion or simply a white house. Moreover, some proper names, such as those of new movies and emerging celebrities, might not yet exist in NE sources.
These problems are magnified in a multilingual context. For example, “Messi” could refer to the Argentine footballer, but “messi” is also an Italian word that translates into English as “put” or “placed.” Also, some proper names can’t be accurately translated into other languages and thus preserve their original orthography. Consequently, translating proper names requires special techniques and resources.
Multilingual NE repositories are available, such as the Heidelberg Named Entity Resource (HeiNER; http://heiner.cl.uni-heidelberg.de) and the European Commission Joint Research Centre’s JRC-Names (https://ec.europa.eu/jrc/en/language-technologies/jrc-names). In addition, translation can use anchor text from webpages in the target language [3] or comparable corpora that make it possible to distinguish the semantics of distinct entity-pair relations [4]. Furthermore, equivalent NEs in the same and different languages can be identified by computing the orthographic similarity of proper names [5]. Finally, translation can be improved by comparing entity occurrences over time: two entities are considered similar if they have similar occurrence distributions in different languages over time [6].
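As a rough illustration of orthographic similarity between proper names, the following Python sketch scores two names by normalized Levenshtein edit distance after casefolding and stripping diacritics. The choice of Levenshtein distance and the function names are assumptions made here for illustration; the cited work may use a different similarity measure.

```python
import unicodedata

def normalize(name):
    """Casefold and strip diacritics so that 'Víctor' and 'Victor' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def orthographic_similarity(name_a, name_b):
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(name_a), normalize(name_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

for pair in [("London", "Londres"), ("Víctor Fresno", "Victor Fresno"), ("Rome", "Microsoft")]:
    print(pair, round(orthographic_similarity(*pair), 2))
```

In practice, a system would treat two names as candidate equivalents when the score exceeds some threshold tuned on held-out data. A purely orthographic measure cannot link names whose written forms differ substantially across languages or scripts, which is one reason the other techniques mentioned above remain useful.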
The explosive growth of multilingual knowledge sources has created significant obstacles for information access on the Web. NEs have proven to be effective mechanisms in modern NLP and IR solutions, but the identification, classification, and retrieval of NEs remain challenging tasks, particularly for information harvested from today’s vast array of social networks.
REFERENCES
1. C. Peters, M. Braschler, and P. Clough, Multilingual Information Retrieval: From Research to Practice, Springer, 2012.
2. J. Pound, P. Mika, and H. Zaragoza, “Ad-hoc Object Retrieval in the Web of Data,” Proc. 19th Int’l Conf. World Wide Web (WWW 10), 2010, pp. 771–780.
3. W. Ling et al., “Named Entity Translation Using Anchor Texts,” Proc. Int’l Workshop Spoken Language Translation (IWSLT 11), 2011, pp. 206–213.
4. T. Lee and S. Hwang, “Bootstrapping Entity Translation on Weakly Comparable Corpora,” Proc. 51st Ann. Meeting of the Assoc. for Computational Linguistics (ACL 13), 2013, pp. 631–640.
5. S. Montalvo et al., “Exploiting Named Entities for Bilingual News Clustering,” J. Assoc. for Information Science and Technology, vol. 66, no. 2, 2015, pp. 363–376.
6. J. Kim et al., “Entity Translation Mining from Comparable Corpora: Combining Graph Mapping with Corpus Latent Features,” IEEE Trans. Knowledge and Data Eng., vol. 25, no. 8, 2012, pp. 1787–1800.
Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.
SOTO MONTALVO is an associate professor in the Technical School of Informatics at Universidad Rey Juan Carlos, Spain. Contact her at [email protected].
RAQUEL MARTÍNEZ is an associate professor in the Natural Language Processing and Information Retrieval Group at Universidad Nacional de Educación a Distancia (UNED). Contact her at [email protected].
VÍCTOR FRESNO is an associate professor in the Natural Language Processing and Information Retrieval Group at UNED. Contact him at [email protected].
RAFAEL CAPILLA is an associate professor in the Technical School of Informatics at Universidad Rey Juan Carlos. Contact him at [email protected].
WWW.COMPUTER.ORG/COMPUTER