International Journal of Information Management 35 (2015) 387–395
International Journal of Information Management journal homepage: www.elsevier.com/locate/ijinfomgt
Towards semantically linked multilingual corpus
Junsheng Zhang a, Yunchuan Sun b,∗, Antonio J. Jara c
a IT Support Center, Institute of Scientific and Technical Information of China, Beijing 100038, China
b Business School, Beijing Normal University, Beijing 100875, China
c Institute of Information Systems, University of Applied Sciences Western Switzerland (HES-SO), Sierre 3960, Switzerland
Article info
Article history: Available online 11 March 2015
Keywords: Information management; Multilingual corpus; Semantic association; Semantic link network; In-network
Abstract
Multilingual information processing has gained increasing attention in recent years with the development of information globalization. Building multilingual corpora is a key challenge for multilingual information extraction, analysis, management and service in a wide range of systems. This work addresses the study and analysis of semantic associations among elements in a multilingual corpus. A solution is proposed to optimize the semantic organization of a multilingual corpus by linking the corpus elements into a semantic link network. This enhances the text-based applications of multilingual corpora such as corpus linguistics study, dictionary search, machine translation and cross-lingual information retrieval.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
The rapid development of information and communication technologies enables information globalization. Multilingual information has entered our daily life and business. Users are often puzzled when they face multilingual information resources on the Web, because most of them are familiar with only one or two natural languages. It is necessary to find effective ways to bridge the gap caused by different languages, and multilingual information processing has been gaining more and more attention in recent years.
Multilingual information processing includes organization, search, translation, management and analysis, where organization is the primary basis for the other services. An intuitive idea for multilingual information organization is to weave the resources into a network with semantic associations. Semantic association is a concept from cognitive science: when a concept is mentioned, the other concepts that come to a person's mind are considered as “having semantic association with” the mentioned concept. Semantic associations among resources have great influence on information search, and finding relevant multilingual information resources is the basis of utilizing multilingual information. Multilingual information resources and the semantic associations among them form a complex network, which is the research object of intelligence analysis. So it is important to establish semantic
∗ Corresponding author. http://dx.doi.org/10.1016/j.ijinfomgt.2015.01.004 0268-4012/© 2015 Elsevier Ltd. All rights reserved.
associations among multilingual information resources for management, search and analysis.
Semantic associations exist ubiquitously in information systems, in which links are used to reflect the semantic associations among the multilingual information resources. For example, hyperlinks among web pages imply semantic associations among the information resources represented by URIs in the Web; semantic associations may also be relationships in specific domains, such as refer and cite in the scientific literature. Some associations are hard to represent by specific relationships, such as similar and relevant. Similar is a specific instance of relevant: if two things are similar, they must also be relevant; however, if two things are relevant, they may not be similar.
Multilingual information services are based on the results of multilingual information processing. Machine translation (MT) and cross-language information retrieval (CLIR) are two typical multilingual information services. They have the same ultimate aim: semantically linking multilingual information so that users can access it easily without language barriers. MT aims to translate text into other natural languages, while CLIR aims to find the relevance between information resources in different natural languages. So users can use CLIR to find relevant information, and then use MT to translate the text for further reading and analysis.
Multilingual corpora are necessary for multilingual information processing systems such as MT and CLIR. Besides, multilingual corpora are important for contrastive linguistics (McEnery & Hardie, 2011). In linguistics, a corpus is a large and structured set of texts. Currently, corpora are electronically stored and processed, and they have been widely used in statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific
J. Zhang et al. / International Journal of Information Management 35 (2015) 387–395
language territory. Corpora fall into different categories according to different perspectives:
• Monolingual corpus and multilingual corpus. A corpus containing texts in a single language is called a monolingual corpus, while a corpus containing text data in multiple languages is called a multilingual corpus. Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora, on which statistical MT systems depend heavily.
• Raw corpus and labeled corpus. A raw corpus contains the plain text without any manual annotation, while a labeled corpus needs manual annotations on the collected texts. The manual annotation tasks mainly include word segmentation, part-of-speech tagging, syntactic annotation, semantic annotation and pragmatic/discourse annotation.
The Web has been regarded as a large-scale multilingual corpus (Liu & Curran, 2006). Massive multilingual information is emerging continuously on the Web, in social networks (He, Zha, & Li, 2013) and in cloud computing (Yee, Chia, Tsai, Tiong, & Kanagasabai, 2011), and it is expected to increase with the extension of the Web to the Web of Everything (Jara et al., 2014). However, there are no explicit semantic links, or even explicit hyperlinks, among the multilingual information resources. Information resources are massive, but the connections between them are sparse, so the information resources form many information islands. This is an obstacle to efficient access to and use of multilingual information. To solve this problem of information islands, information resources urgently need to be connected by automatically establishing semantic links among them. Therefore, linking multilingual documents becomes one of the key challenges for the effective sharing and use of multilingual information.
To establish semantic links among information resources, there are two feasible ways.
One way is to build semantic links according to the attributes of information resources, which are often contained in their metadata; the other way is to define semantic links according to the meaning implied in the textual descriptions. Metadata of information resources could be used to link multilingual documents based on the values of attributes (Zhang, Wang, & Sun, 2010; Zhuge & Zhang, 2011). An extensible semantic model is proposed for information management by organizing semantic data with an event-linked network, where events and links can be extracted from raw data in the Internet of Things (Sun et al., 2014; Sun and Jara, 2014). However, many multilingual documents on the Web have no complete metadata. Consequently, text analysis based on natural language processing becomes the alternative way to establish semantic links based on the full texts of documents.
The processing objects of natural language processing include word, phrase, sentence and text (paragraph, section or document). Establishing semantic links between multilingual information resources depends on the analysis of words, phrases and sentences in the descriptive text. All information resources, such as documents, images, audios and videos, can be described by text in natural languages. The establishment of attribute-based semantic associations has been discussed in the previous work (Zhang et al., 2013), so we focus on the content-based semantic associations among documents in this paper. To process the content of documents, natural language processing is a necessary technology, especially in establishing semantic links based on text analysis.
In this paper, a semantically linked multilingual corpus in science and technology has been designed for the following aims: (1) providing a multilingual corpus for natural language processing research such as machine translation in science and technology; (2) providing support for applications based on the semantically
linked corpus such as multilingual information analysis and cross-lingual information retrieval. The main contributions include: (1) a framework for constructing a semantically linked multilingual corpus based on the metadata and content of texts; (2) a solution to construct the semantically linked multilingual corpus; and (3) the feasible multilingual applications based on the semantically linked multilingual corpus.
2. Related work
2.1. Multilingual corpus construction
2.1.1. Corpora for multilingual words and phrases
During the construction of a multilingual corpus, alignment is necessary for words, sentences and texts in different languages. Word alignment is the basis of phrase alignment, sentence alignment and text alignment. The alignment results of words or phrases form multilingual dictionaries. Manually constructed bilingual dictionaries have been used in dictionary search and machine translation. Although statistic-based methods can automatically align bilingual words based on parallel sentences, manually constructed dictionaries still cannot be replaced. WordNet is an English dictionary that describes the relations among English words, mainly focusing on nouns and verbs. Hownet is an English–Chinese bilingual dictionary, which aims to become a knowledge organization system (Dong & Dong, 2006).
Thesaurus, as a kind of relational structured dictionary, has been widely used in libraries (Aitchison, Gilchrist, & Bawden, 2000). It has played an important role in organizing and retrieving the scientific and technical literature. There are five popular relation types representing the relations among words, that is, broader term (BT), narrower term (NT), related term (RT), use (USE) and used for (UF). Crouch and Yang (1992) introduced an approach for the automatic construction of thesaurus. However, the construction of thesaurus still mainly depends on manual construction by domain experts with the aid of auxiliary software tools.
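The five thesaurus relation types can be pictured as labeled links between terms. The following is only a minimal sketch: the class name `Thesaurus`, the `INVERSE` table and the sample terms are illustrative assumptions rather than part of any thesaurus discussed here. It assumes, as is common thesaurus practice, that BT/NT and USE/UF are mutual inverses and that RT is symmetric.

```python
from collections import defaultdict

# Thesaurus relation types named in the text.
RELATIONS = {"BT", "NT", "RT", "USE", "UF"}

# Assumption: BT/NT and USE/UF are mutual inverses and RT is symmetric.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT", "USE": "UF", "UF": "USE"}


class Thesaurus:
    """Hypothetical in-memory thesaurus: term -> relation -> related terms."""

    def __init__(self):
        self._links = defaultdict(lambda: defaultdict(set))

    def add(self, source, relation, target):
        """Record `source --relation--> target` together with its inverse link."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self._links[source][relation].add(target)
        self._links[target][INVERSE[relation]].add(source)

    def related(self, term, relation):
        """Return the terms linked to `term` by `relation`, sorted."""
        return sorted(self._links[term][relation])


# Illustrative entries only.
thesaurus = Thesaurus()
thesaurus.add("machine translation", "BT", "natural language processing")
thesaurus.add("machine translation", "RT", "cross-lingual information retrieval")
thesaurus.add("automatic translation", "USE", "machine translation")
```

Storing the inverse link at insertion time keeps lookups symmetric, so asking for the narrower terms of a broad concept needs no scan over the whole vocabulary.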
However, manual construction of multilingual dictionaries can hardly keep up with the changes of large-scale corpora and the increase of new words and phrases. So it is necessary to find automatic construction methods for multilingual or bilingual dictionaries. Giza++ (Och & Ney, 2000) is widely used to align bilingual words based on parallel sentence pairs in statistical machine translation.
2.1.2. Corpora for multilingual documents
Many multilingual corpora have been constructed for research on machine translation and multilingual or cross-lingual information retrieval. MultiUN was a multilingual corpus from United Nations documents, which was used to support French–Chinese machine translation research (Eisele & Chen, 2010). Multilingual corpora of European languages, such as EuroParl (Koehn, 2005) and JRC-Acquis (Steinberger et al., 2006), were the essential materials to produce statistical machine translation systems (Koehn, Birch, & Steinberger, 2009). MULTEXT-East is a multilingual corpus, which comprises marked-up texts in six languages totaling approximately 2 million words and a small speech corpus (Erjavec & Ide, 1998). Umc 0.1 is a Czech–Russian–English multilingual corpus (Klyueva & Bojar, 2008).
2.2. Applications based on multilingual corpus
2.2.1. Machine translation
Machine translation (MT) is also called automatic translation. Generally, an MT system aims to translate a natural language
(source language) into one or more other natural languages (target languages) automatically by computers. The popular machine translation methods include rule-based MT, example-based MT and statistic-based MT.
MT is important for multilingual information access. Although the results of MT are still far from satisfactory compared with human translations, MT has played an important role in promoting information propagation on the Web through the machine translation services provided by Google, Baidu and so on. Currently, MT has been widely used in search engines and information retrieval systems, which are not strict about the quality of translation but concentrate on the relevance between the query and the candidate documents. For example, Google Machine Translation (http://translate.google.com/) can help users to access multilingual information from all over the world, and it has aided communication among people who speak different natural languages. With the development of speech recognition technology, some successful voice translation systems have appeared in travel guide services.
Current MT systems are not intended to replace human translation, although that is the ultimate aim of multilingual information processing. There is still a long way to go before the quality of MT is as good as that of human translation. Currently, MT can help users to understand the category and rough meaning of a text in a foreign language.
MT is closely related to the alignment of bilingual dictionaries. During translation, words or phrases are translated into the corresponding words and phrases in the target language, and then reordered to satisfy the constraints of syntactic rules. As MT engines have to translate the phrases in the source sentence into phrases in the target sentence, the alignment of phrases is an important step in machine translation engines.
In particular, statistic-based MT studies the probabilities of correspondence between bilingual words, and uses the alignment result to translate automatically. Furthermore, novel phrases could be detected based on the alignment of bilingual words, which has been shown to be better than direct word alignment in statistical machine translation (Chiang, 2005).
2.2.2. Cross-lingual information retrieval
Search engines and information retrieval systems are becoming the dominant means of information acquisition in the big data age, compared with navigation on traditional web portals. When searching in a multilingual web information environment, users have to rewrite their queries in different languages to find multilingual information resources. A possible alternative is to use a monolingual query to find information resources in other languages. CLIR can help users to search multilingual information with a monolingual query.
NTCIR (http://research.nii.ac.jp/ntcir/) has organized multilingual information processing and retrieval contests for many years. It has a task called cross-lingual link discovery, which aims to automatically discover the links among documents in two different languages. This task includes two subtasks: key phrase selection and alignment of the selected key phrases. Key phrase selection can be done in the monolingual text, while the alignment of the selected bilingual key phrases is based on the alignment of bilingual phrases. Another cross-language workshop, CLEF (http://clef.isti.cnr.it/), promotes R&D in multilingual information access by: (1) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (2) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes.
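The idea of searching multilingual resources with a monolingual query can be illustrated with a toy dictionary-based pipeline. This is a sketch under stated assumptions, not the method of any system named above: the dictionary `EN_ZH`, the collection `ZH_DOCS`, and the functions `translate_query` and `search` are hypothetical, and the ranking is a plain count of matched terms.

```python
# Toy English -> Chinese dictionary (hypothetical entries for illustration).
EN_ZH = {
    "machine": ["机器"],
    "translation": ["翻译"],
    "corpus": ["语料库"],
}

# Toy Chinese document collection, assumed to be already segmented into words.
ZH_DOCS = {
    "doc1": ["机器", "翻译", "系统"],
    "doc2": ["多语言", "语料库", "构建"],
    "doc3": ["信息", "检索"],
}


def translate_query(query, dictionary):
    """Map each query word to its target-language candidates; skip unknown words."""
    translated = []
    for word in query.lower().split():
        translated.extend(dictionary.get(word, []))
    return translated


def search(query, dictionary, docs):
    """Rank documents by the number of translated query terms they contain."""
    terms = set(translate_query(query, dictionary))
    scores = {doc_id: len(terms & set(words)) for doc_id, words in docs.items()}
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    # Highest score first; ties broken by document id for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


results = search("machine translation corpus", EN_ZH, ZH_DOCS)
```

With these toy inputs, `doc1` matches two translated terms and `doc2` matches one, so `doc1` is ranked first. The sketch also shows why the quality of the bilingual dictionary matters: a query word missing from the dictionary contributes nothing to retrieval.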
2.3. Semantic data models
Data modeling and management are closely related to the modeling of multilingual corpus. With the rapid development of the Internet since the 1990s, relational database systems have been widely used. Meanwhile, two major challenges exist for the emerging contents on the Internet: one is how to organize large volumes of semi-structured content, and the other is how to make digital contents understandable for machines. A series of semantic technologies such as XML (http://www.w3.org/XML/), SPARQL (http://www.w3.org/TR/sparql11-query/), and OWL (http://www.w3.org/Submission/2006/10/) came into being to meet these challenges by representing data in a semantic way. The main characteristic of these technologies is to annotate the meaning of each element with markup labels (Berners-Lee et al., 2006). Semantic link network is proposed as a semantic data model for managing Web resources, in which nodes can be any type of resource and edges can be any semantic relation. The schema theory provides the basis for normalized management of semantic link network (Zhuge & Sun, 2010). A computing theory for semantic relations between objects, including semantic relation basis, orthogonalized theory, and mathematical operations, is proposed to represent semantic relations accurately and to acquire implicit relations by automatic reasoning and autonomous computing (Sun et al., 2015). These techniques are well suited for querying, browsing, and reasoning over semantic data. However, they have not shaken the dominance of the relational model in industry, and they do not eliminate the important role of information extraction on the Internet. Integrating, organizing, and interpreting data are still big challenges in achieving the vision of the future Internet of Things (Barnaghi, Wang, Henson, & Taylor, 2012; Jara et al., 2014).
The emergence of Web 2.0 is a milestone in the development of information and communication technology. Volumes of diverse data, especially unstructured contents, are flooding in at an unimaginable rate with the development of various information technologies. The primary challenge is to find reasonable organizing models for big data. Obviously, the traditional relational model is far from meeting the requirements of big data bases (Jeffery, 2009). Several NoSQL models, such as the key–value model, the document store model, column family stores, and graph databases, have been proposed to model big data. The key–value model organizes data with a simple structure (key, value) like traditional dictionaries. Though its query efficiency is higher than that of traditional models (Han, Haihong, Le, & Du, 2011), it is hard to use in practice because of its uninterrupted arrays and isolated values, which lack relationships between datasets. The document store model encapsulates key–value pairs in self-contained documents. It is more suitable for handling complex data such as nested contents, and it performs well in query, integration, and schema migration (Hecht & Jablonski, 2011). Column family stores are inspired by Google’s Bigtable, which organizes an arbitrary number of key–value pairs within rows (Chang et al., 2008). They are more suitable for applications dealing with huge amounts of data stored on very large clusters (Hecht & Jablonski, 2011). Graph databases organize object data and relation data with nodes and edges, and make it easy to describe complicated constraints for the schema-defined range of keys and values. They are widely used in location-based services, knowledge representation, path finding in navigation systems, and recommendation systems.
3. Modeling semantic associations in multilingual corpus
3.1. Semantic association analysis
Semantic associations among multilingual information resources are distributed in two layers: organization layer and application layer.
The organization layer includes the management of attribute-based semantic associations and content-based semantic associations.
• Attribute-based semantic associations. Each information resource has its own attributes such as author, creation time, language, file size and so on. According to the values of attributes, semantic associations such as sameAuthor, before, sameLanguage and longer can be established (Zhang et al., 2013). Attribute-based semantic associations are external semantic associations. Finding them does not depend on the analysis of the content of documents.
• Content-based semantic associations. The content of an information resource means its meaning. For a document, keywords, thesaurus, domain classification and topic are used to represent the meaning. Content-based semantic associations are internal semantic associations and they are implied in the contents of documents.
The application layer contains search, translation, analysis and so on. Information acquisition includes searching monolingual information, finding multilingual information with CLIR, and translating foreign-language information into a familiar language for further analysis. After multilingual information resources are semantically linked, information search or retrieval is the basic application for users to acquire information resources. When users look for multilingual information resources, they need to submit new queries in other foreign languages. After the information in foreign languages is downloaded, it is necessary to translate the documents into the languages that users need. Users can then analyze the information for decision making.
To support multilingual applications based on a multilingual corpus, it is necessary to study the elements of the corpus and how to organize them. Textual documents have different granularities in representing the content.
The textual components can be divided into different granularities including word, phrase, sentence, paragraph (optional), section (optional), chapter (optional), document and collection (optional). The texts in different granularities are the elements in the multilingual corpus. Fig. 1 shows the hierarchical relations between the elements in the multilingual corpus. A multilingual corpus may contain one or more collections of documents, which in turn contain words, phrases, sentences, paragraphs, sections and chapters.
• Word is the minimal meaning component of text.
• Phrase contains one or more words that represent a complete meaning.
• Sentence contains one or more words/phrases. The combination of word(s) or phrase(s) satisfies the legal syntactic structure.
• Paragraph contains one or more sentences that are semantically close.
• Section contains one or more paragraphs.
• Chapter contains one or more sections.
• Document contains one or more chapters. Usually, an information resource is shown as a document.
• Collection contains one or more documents.
Semantic associations among elements in the multilingual corpus could be represented by an Entity–Attribute–Relation graph (Gorman & Choobineh, 1990). Fig. 2 shows entities, attributes and relations among entities in a multilingual corpus, where elements are simplified into four types: word/phrase, sentence, text and collection. Semantic associations exist between elements of the same type and between elements of different types.
For each type of elements, there are three general attributes: time, domain and language. Attribute time records the creation, modification, deletion
Fig. 1. Components of text in different granularities: word, phrase, sentence, paragraph (optional), section (optional), chapter (optional), document and collection (optional).
and other time-related information, domain implies the research domain information of elements, and language reflects what languages are used for representing the elements. More attributes could be added on demand according to the requirements of applications.
Semantic links exist in the different granularities of text. Semantic links before, after and cooccur may exist between two components of the same type (e.g., word, phrase, sentence, paragraph, section, chapter), while semantic links similar, equal and different exist between two documents or two collections. Semantic associations such as equal, similar and relevant are general, and they may exist in different granularities. For words and phrases, there are more semantic links such as synonym, antonym, BT, NT, USE, UF and RT. Semantic link in exists between components of different granularity. Table 1 shows the semantic links in different granularities.
3.2. Semantic link network model
Semantic link network (SLN) is a semantic data model aiming to manage the information resources of the web age. It extends the current web to a semantically linked web by adding relationships between web resources (Sun, Bie, Yu, & Wang, 2013). A typical SLN is a triple $S = (N, L, R)$ with three parts: $N$ is the node set, $L$ is the semantic link set, and $R$ is the rule set for reasoning about and establishing semantic relations. A node $n \in N$ represents any resource. A link $l \in L$ represents a semantic relation between two nodes $n_1$ and $n_2$. A relation reasoning rule $r \in R$ takes the form $(n_1 \xrightarrow{l_1} n_2) \times (n_2 \xrightarrow{l_2} n_3) \Rightarrow n_1 \xrightarrow{l_3} n_3$, denoted by $l_1 \times l_2 \Rightarrow l_3$ for short. Probabilistic rules could be used to help build probabilistic semantic relations: the statistic inference rules are to predict the semantic relations between resources (Zhuge & Zhang, 2011), while the analogical
Fig. 2. The schema of Entity–Attribute–Relation graph for representing elements and semantic associations in multilingual corpus: a rectangle means an entity; an ellipse means an attribute; a solid line means an attribute of an entity; and a dotted line means a semantic association.
rules are used to predict the semantic relations according to the similar structures of semantic link networks (Zhang & Sun, 2010). Semantic links are direction sensitive, that is, the source node and target node of a semantic link cannot be exchanged.
The SLN model is simple and easy to use. The success of the web owes a lot to the easy-to-use mechanism of hyperlinks, that is, each URI can link to any other URI. The SLN inherits this easy-to-use characteristic, and each semantic node can link to any other semantic node. Furthermore, a node in the SLN can be another SLN, and this characteristic distinguishes the SLN graph from the RDF graph. The nodes in an RDF graph form a two-dimensional graph, while the characteristic “a node can be another SLN” gives the SLN graph a hierarchical structure.
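The triple S = (N, L, R) and the rule form l1 × l2 ⇒ l3 can be made concrete with a small sketch. This is one possible reading of the model and not the authors' implementation: the class `SLN`, its method names and the node identifiers are hypothetical, links are stored as (source, label, target) tuples to keep them direction sensitive, and the rule in × in ⇒ in is assumed here purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SLN:
    """A semantic link network S = (N, L, R) (illustrative sketch)."""
    nodes: set = field(default_factory=set)    # N: node identifiers
    links: set = field(default_factory=set)    # L: (source, label, target)
    rules: dict = field(default_factory=dict)  # R: (l1, l2) -> l3

    def add_link(self, source, label, target):
        """Add a directed semantic link and register both end nodes."""
        self.nodes.update((source, target))
        self.links.add((source, label, target))

    def add_rule(self, l1, l2, l3):
        """Register the reasoning rule l1 x l2 => l3."""
        self.rules[(l1, l2)] = l3

    def infer(self):
        """Apply the rules until no new link appears; return the derived links."""
        derived = set()
        changed = True
        while changed:
            changed = False
            current = list(self.links)  # snapshot, since links grows below
            for (a, l1, b) in current:
                for (b2, l2, c) in current:
                    l3 = self.rules.get((l1, l2))
                    if b == b2 and l3 is not None:
                        new = (a, l3, c)
                        if new not in self.links:
                            self.links.add(new)
                            derived.add(new)
                            changed = True
        return derived


sln = SLN()
sln.add_link("word:corpus", "in", "sentence:s1")
sln.add_link("sentence:s1", "in", "document:d1")
sln.add_link("document:d1", "in", "collection:c1")
sln.add_rule("in", "in", "in")  # assumed rule, for illustration only
new_links = sln.infer()
```

The fixed-point loop is quadratic in the number of links per pass, which is acceptable for a sketch; a corpus-scale network would need an index from each node to its outgoing links.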
Table 1 Semantic associations among textual components in different granularities.
Source component | Target component | Semantic associations
Links between same granularities
word | word | before, after, cooccur, antonym, synonym, BT, NT, RT, USE, UF
phrase | phrase | before, after, cooccur, antonym, synonym, BT, NT, RT, USE, UF
sentence | sentence | before, after, cooccur, cause–effect
paragraph | paragraph | before, after, cooccur
section | section | before, after, cooccur
chapter | chapter | before, after, cooccur
document | document | similar, different, cooccur
collection | collection | similar, different, in
Links between different granularities
word | phrase | in
phrase | sentence | in
sentence | paragraph | in
paragraph | section | in
section | chapter | in
chapter | document | in
document | collection | in
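As a rough illustration of how the links in Table 1 could be generated from raw text, the sketch below emits in links from words to sentences and from sentences to a document, plus before and after links between consecutive sentences. The function `build_links`, the identifier scheme and the naive segmentation are assumptions made for this example; it also assumes that the optional levels (phrase, paragraph, section, chapter) are absent, so that in connects the remaining levels directly.

```python
def build_links(doc_id, text):
    """Return (source, label, target) links for one document (toy sketch)."""
    links = []
    # Naive segmentation (assumption): sentences end with '.',
    # and words are separated by whitespace.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    sentence_ids = []
    for index, sentence in enumerate(sentences):
        sid = f"{doc_id}/s{index}"
        sentence_ids.append(sid)
        links.append((sid, "in", doc_id))      # sentence in document
        for word in sentence.lower().split():
            links.append((word, "in", sid))    # word in sentence
    # Consecutive sentences are linked in both directions by before/after.
    for first, second in zip(sentence_ids, sentence_ids[1:]):
        links.append((first, "before", second))
        links.append((second, "after", first))
    return links


links = build_links("d1", "Corpora are aligned. Links are built.")
```

Real corpora, especially in languages without whitespace word boundaries, would need proper word segmentation and sentence splitting in place of the two `split` calls.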
RDF focuses on resource description, while SLN mainly focuses on the graphical model for relationship representation and reasoning among resources. An SLN graph can be transformed into an RDF graph, and vice versa. The transformation between an SLN graph and an RDF graph is via a blank node: the hyper-node in the hyper-graph of SLN becomes a blank node linking to many nodes in the hyper-node; and the node linking to a hyper-node forms a hyper-node in the hyper-graph of SLN.
4. Building semantically linked multilingual corpus
4.1. Establishing semantic links
Alignment of multilingual words/phrases is the first step to link multilingual documents according to their content. The meaning of a text is represented by its sentences. However, it is hard to find the same sentence in two different texts, so the alignment of texts is directly based on the alignments of words/phrases. After the alignment of multilingual words/phrases, semantic links among multilingual documents can be built, and some specific relations could be further established based on the semantic links among the multilingual information resources. Based on the semantic links among words and phrases, the semantic associations between sentences and those between texts could be further established.
4.1.1. Semantic links among multilingual phrases
Before the alignment of multilingual phrases, the phrases have to be selected. Key phrases are selected from the text for alignment, especially from the parallel sentences. Key phrase selection is the basis for building the semantic links among the multilingual phrases. Because key phrases can represent the meaning of the information resources, they could be further used to build the semantic associations among sentences, documents and even collections.
The manual construction of multilingual phrases is needed to build the multilingual or bilingual dictionaries. The alignment of bilingual words can be used to build the alignment of bilingual phrases.
Dictionary experts could select bilingual phrases from the
word alignment with high frequency. However, the manual construction of multilingual phrases is time-consuming and costly. A possible way is to build the associations among multilingual phrases automatically by statistical machine learning methods. As a phrase is formed by one or more words, if two phrases have the same meaning, the words in the two phrases are likely to be closely associated. So the alignments of words can be used to align the bilingual phrases.
From bilingual phrases to multilingual phrases. If each pair of natural languages has a bilingual dictionary, the bilingual dictionaries can be merged into a hybrid multilingual dictionary. However, a bilingual dictionary for each pair of natural languages is currently not easy to acquire, and the manual construction of parallel sentences is costly and time-consuming. So it is necessary to find an approach to building a bridge between different natural languages. Commonly used natural languages such as English can be used as the bridge language to connect the bilingual dictionaries into multilingual dictionaries. For example, an English–Chinese bilingual dictionary and an English–Japanese bilingual dictionary can help to build a Chinese–Japanese bilingual dictionary. Because of the differences between natural languages, the alignment of bilingual phrases via the bridge language may not always be correct, so manual checking is necessary.
Building bilingual phrases based on the parallel sentences corpus. Giza++ can align bilingual word pairs based on parallel sentence pairs. However, phrases are often composed of two or more words. The phrases can be hierarchically extracted based on the word alignment. This approach has been shown to be efficient in improving the quality of machine translation (Chiang, 2007). Suppose $L_1$ and $L_2$ are two natural languages, $e_1, e_2, \ldots, e_m$ ($m \ge 1$) are the words of a sentence in language $L_1$, and $c_1, c_2, \ldots, c_n$ ($n \ge 1$) are the words of the parallel sentence in language $L_2$. Their alignment can be represented by $(e_1, c_1, p_1), (e_2, c_3, p_2), \ldots, (e_m, c_x, p_m)$, where $m \ge 1$, $x \ge 1$, and $p_i$ ($1 \le i \le m$) are the probabilities of the alignments between words in different languages.
Building bilingual phrases based on the comparable corpus. Besides the parallel sentence pairs, bilingual phrase pairs can be extracted from a comparable corpus. The accuracy of building bilingual phrases from a comparable corpus will be lower than that of building them from a parallel corpus. However, the cost of collecting a parallel corpus is higher than the cost of collecting a comparable corpus. The hypothesis for extracting bilingual phrases from a comparable corpus is that phrases in different languages with the same meaning have similar distributions.
4.1.2. Semantic links between sentences
According to the alignment of words in sentences, the similarity between sentences can be calculated. Two sentences whose similarity exceeds a certain threshold are connected by a semantic link similar. According to the sequence of sentences in the text, semantic links such as before and after could be established. Words and phrases in a sentence have the semantic link in with the sentence.
4.1.3. Semantic links between documents
According to the words, phrases and sentences, the similarity between documents could be calculated. Two documents whose similarity exceeds a certain threshold are connected by a new semantic link similar. Words, phrases and sentences in a document have semantic links in with the document. Semantic links based on the attributes of documents could be established according to the values of attributes.
4.2. Reasoning semantic links
After some semantic links in the multilingual corpus are established, more semantic links could be derived according to the
semantic link reasoning rules. Possible semantic link reasoning rules include:

(1) SameLang × SameLang ⇒ SameLang
(2) SameDomain × SameDomain ⇒ SameDomain
(3) SubClass × SubClass ⇒ SubClass
(4) SuperClass × SuperClass ⇒ SuperClass
(5) Before × Before ⇒ Before
(6) After × After ⇒ After
(7) In × In ⇒ In
(8) Contain × Contain ⇒ Contain
(9) BT × BT ⇒ BT
(10) NT × NT ⇒ NT

Table 2
Semantic relations among multilingual phrases.

Relation       Abbr.   Example
Broad Term     BT      “electric” is broad term of “electric equipment”
Narrow Term    NT      “electric equipment” is narrow term of “electric”
Use            USE     “Information science–Abstracting” use “Abstracting”
UsedFor        UF      “Abstracting” is used for “Information science–Abstracting”
Related Term   RT      “Pavement construction” is related term of “stirl”
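Each of the reasoning rules (1) to (10) states that a relation composed with itself yields the same relation, that is, the relation is transitive. The following is a minimal sketch of how such rules could be applied to derive new links; the triple representation, the function name `derive_links` and the element names in the example are assumptions made for illustration and do not come from the paper.

```python
from collections import defaultdict

# Relations that rules (1)-(10) treat as composing with themselves.
TRANSITIVE = {"SameLang", "SameDomain", "SubClass", "SuperClass",
              "Before", "After", "In", "Contain", "BT", "NT"}


def derive_links(links):
    """Close a set of (source, relation, target) triples under R x R => R."""
    derived = set(links)
    changed = True
    while changed:
        changed = False
        # Index the current links by (relation, source) to join on the middle node.
        targets = defaultdict(set)
        for source, relation, target in derived:
            targets[(relation, source)].add(target)
        for source, relation, middle in list(derived):
            if relation not in TRANSITIVE:
                continue
            for target in targets.get((relation, middle), ()):
                if target != source and (source, relation, target) not in derived:
                    derived.add((source, relation, target))
                    changed = True
    return derived


# Hypothetical element names: a word in a sentence, a sentence in a document.
links = {("word1", "In", "sentence1"), ("sentence1", "In", "document1")}
print(derive_links(links) - links)  # {('word1', 'In', 'document1')}
```

Relations outside the rule list, such as similar, are left untouched by this sketch, since the rules above do not state that they compose.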
More semantic links could be established according to the domain knowledge. Semantic link reasoning rules can help to establish more semantic links among the elements in the corpus. Associations among semantic links help to predict further probable semantic links (Zhang, Wang, & Sun, 2009). Relational reasoning derives definite semantic links, while statistical rules predict possible semantic links according to the co-occurrence of existing semantic links.

5. Case study

5.1. Semantic associations among words and phrases

As mentioned above, a word is defined as the minimum element that represents a meaning in a natural language, and a phrase contains one or more words that represent a concept. A concept is reflected by a set of words and phrases. A phrase goes by different names, such as term, keyword, and controlled vocabulary (for example, a thesaurus in library and information science). Phrases are often used to index documents for information retrieval. Keywords and thesauri have been used to label information resources for information retrieval. Keywords are often assigned by the authors of the literature, while thesaurus terms are often assigned by librarians who organize the information resources to improve the recall of information retrieval.

A semantic link network of multilingual phrases is used to show the semantic relations among the multilingual phrases. Table 2 shows the semantic relations among multilingual phrases and the corresponding examples. A word may have two or more homonyms and synonyms, and a phrase may likewise have other phrases with similar meanings. The synonyms of a phrase are represented by a phrase set, called a synSet. All the phrases in a synSet have the same or similar meanings, although the morphology of the phrases or words may differ. The synSet acts as the intermediate language that connects multilingual phrases. All the phrases with the same meaning are mapped into the same synSet.
Table 3 shows synSet examples in Chinese, English and Japanese. Furthermore, other languages can be added on demand. The synSet is the union of all the sub-synSets in different natural languages.

Table 3
The synSet for multilingual phrases in Chinese, English, Japanese and other natural languages.

There are different types of semantic relations among the synSets. Because a synSet is essentially a set in the mathematical sense, the set relations subset and superset map onto NT and BT, respectively. The intersection relation does not exist among synSets: if two synSets have a non-empty intersection, they are merged into the same synSet. Two different synSets therefore have no overlapping words or phrases. In particular, if two synSets share a word or phrase, that word or phrase has a different meaning in each of them. A phrase may have more than one meaning, so different synSets may contain phrases with the same spelling, but the meanings of those phrases differ. Phrases in different synSets must therefore have different meanings, even if they have the same spelling. Table 4 shows examples of semantic links among the synSets in Table 3.

Table 4
The exemplary semantic links among the synSets.

Source synSet ID   Target synSet ID   Relation   Associative degree
1                  2                  BT         0.9 (set on demand)
2                  1                  NT         0.9 (set on demand)

The semantic associations among the phrases could be managed with a relational database. For the exchange of semantic relations among the multilingual phrases, the data could be described in XML format. Fig. 3 shows the synSets and their semantic links represented in XML.

The characteristics of the synSet of words and phrases are as follows:

• It is unified. All the words or phrases in different natural languages with the same meaning are put into the same synSet. As mentioned above, phrases and words in different synSets must have different meanings.
• It is extensible. For each synSet, the phrases of one natural language form a sub-synSet. When a new natural language is added, its sub-synSet can be added without influencing the existing sub-synSets.
• The alignment between words and phrases can be processed in batch. The synSet can speed up the alignment of phrases. When a new phrase is inserted into a sub-synSet, it is automatically aligned to the phrases in the sub-synSets of the other natural languages.

Words and phrases have the semantic link in with sentences. Words and phrases have the semantic link co-occur if they are in the same sentence, paragraph, section, chapter, document or collection; whether co-occur holds depends on the size of the text window.
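The synSet characteristics described above (one synSet per meaning, one sub-synSet per language, automatic alignment on insertion) can be illustrated with a small in-memory structure. This is only a sketch under stated assumptions: the class `SynSetIndex`, its method names and the example phrases are hypothetical and are not taken from Table 3 or from the authors' implementation, which the paper suggests could be a relational database.

```python
from collections import defaultdict


class SynSetIndex:
    """Hypothetical store: synSet ID -> language code -> set of phrases."""

    def __init__(self):
        self.synsets = defaultdict(lambda: defaultdict(set))

    def add(self, synset_id, lang, phrase):
        # Only the sub-synSet of `lang` changes; other languages are untouched,
        # which mirrors the "extensible" property.
        self.synsets[synset_id][lang].add(phrase)

    def translations(self, phrase, source_lang, target_lang):
        """Phrases in target_lang that share a synSet with `phrase`."""
        result = set()
        for subsets in self.synsets.values():
            if phrase in subsets.get(source_lang, ()):
                result |= subsets.get(target_lang, set())
        return result


# Illustrative phrases only.
index = SynSetIndex()
index.add(1, "en", "computer")
index.add(1, "zh", "计算机")
index.add(1, "zh", "电脑")
print(index.translations("computer", "en", "zh") == {"计算机", "电脑"})  # True

# Adding a Japanese sub-synSet later aligns it with the existing phrases.
index.add(1, "ja", "コンピュータ")
print(index.translations("电脑", "zh", "ja"))  # {'コンピュータ'}
```

Because a phrase with several meanings may belong to several synSets, `translations` returns the union over all synSets that contain the phrase; choosing among them would require context that this sketch does not model.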
Fig. 3. An example to show the semantic links among English, Chinese and Japanese sub-synSets in XML.
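One possible way to produce an XML description of synSets and their semantic links is sketched below with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`synSets`, `synSet`, `subSynSet`, `phrase`, `link`) are assumptions chosen for illustration and are not taken from Fig. 3; the two links follow Table 4, while the phrases are placeholders borrowed from the Table 2 examples.

```python
import xml.etree.ElementTree as ET


def synsets_to_xml(synsets, links):
    """Serialize synSets and links to an XML string.

    synsets: {synset_id: {lang: [phrase, ...]}}
    links:   [(source_id, target_id, relation, degree), ...]
    """
    root = ET.Element("synSets")
    for synset_id, subsets in synsets.items():
        node = ET.SubElement(root, "synSet", id=str(synset_id))
        for lang, phrases in subsets.items():
            sub = ET.SubElement(node, "subSynSet", lang=lang)
            for phrase in phrases:
                ET.SubElement(sub, "phrase").text = phrase
    for source, target, relation, degree in links:
        ET.SubElement(root, "link", source=str(source), target=str(target),
                      relation=relation, degree=str(degree))
    return ET.tostring(root, encoding="unicode")


# Placeholder phrases; the link rows mirror Table 4.
synsets = {1: {"en": ["electric"]}, 2: {"en": ["electric equipment"]}}
links = [(1, 2, "BT", 0.9), (2, 1, "NT", 0.9)]
print(synsets_to_xml(synsets, links))
```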
5.2. Semantic associations among sentences

According to the order of sentences in the document, the semantic links before and after exist between sentences. If two sentences are in the same paragraph, section, document or collection, they have the semantic link co-occur. Two sentences may also have the semantic relation Cause–Effect according to their meanings, which could be established from clue words such as “because”, “so” and “therefore”.

5.3. Semantic associations among documents

Semantic associations among multilingual documents fall into two classes. One class is based on the attributes of documents, such as author, publication venue, publishing time, citation, topic and pages; the other class is based on the content of the documents.
• The attribute-based semantic links between documents include sameAuthor based on author; before and after based on publication time; sameJournal or sameConference based on the publication venue; citing, co-citing, cited and co-cited based on the citations between documents; sameTopic based on topic; and longThan, shortThan and sameLength based on the attribute pages.
• As for semantic links based on content, similar and irrelevant exist between two documents. Furthermore, the semantic links
between documents can decide the semantic links between collections, including similar and irrelevant.

After the semantic link network of a multilingual corpus is constructed, there are many possible applications, such as corpus linguistics study, multilingual dictionary query, machine translation, and cross-lingual or multilingual information retrieval. Dictionary query is based on the aligned multilingual words or phrases. Machine translation systems could train their engines with bilingual dictionaries and parallel sentences. Cross-lingual or multilingual information retrieval is based on the semantically linked words, phrases, sentences and texts.

Besides, multilingual synSets of phrases could be used to organize multilingual information resources. After the semantic link network of synSets is constructed, the synSets can be used in multilingual information indexing, recommendation and search. When an information resource is labeled with multilingual synSets, it can be found through cross-lingual information retrieval (CLIR).

Alignments of smaller elements are the basis of alignments of larger elements. The alignments of multilingual words or phrases can be used to align sentences in different languages. During the construction of parallel sentence pairs, sentences in different languages must be aligned, and the aligned phrases can help to find the alignment of sentences. The more phrases and sentences are aligned, the more probable it is that two multilingual texts can be aligned. The parallel sentence pairs could be used to train statistical machine translation systems. Besides, the aligned sentence pairs could be used as example sentences for machine-aided translation. During translation, example sentence pairs could be recommended according to their computed similarity to the sentences in the source language (Zhang, Sun, Wang, & He, 2011). The alignments of multilingual words or phrases could also be used to cluster sentences in different languages.
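The idea that aligned phrases help to find the alignment of sentences can be sketched as follows. The sketch assumes that each sentence has already been segmented into phrases and that a mapping from phrases to synSet IDs is available; the Jaccard overlap used as the score, the threshold value and all function names are illustrative assumptions, not the similarity measure of Zhang, Sun, Wang, & He (2011).

```python
def alignment_score(src_phrases, tgt_phrases, phrase_to_synset):
    """Jaccard overlap of the synSet IDs found in two sentences."""
    src_ids = {phrase_to_synset[p] for p in src_phrases if p in phrase_to_synset}
    tgt_ids = {phrase_to_synset[p] for p in tgt_phrases if p in phrase_to_synset}
    if not src_ids or not tgt_ids:
        return 0.0
    return len(src_ids & tgt_ids) / len(src_ids | tgt_ids)


def align_sentences(src_sents, tgt_sents, phrase_to_synset, threshold=0.5):
    """Pair each source sentence with its best-scoring target sentence.

    Returns (source_index, target_index, score) triples whose score
    reaches the threshold.
    """
    pairs = []
    for i, src in enumerate(src_sents):
        scores = [alignment_score(src, tgt, phrase_to_synset) for tgt in tgt_sents]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((i, best, scores[best]))
    return pairs


# Illustrative phrase-to-synSet mapping and pre-segmented sentences.
phrase_to_synset = {"machine translation": 1, "机器翻译": 1,
                    "corpus": 2, "语料库": 2,
                    "dictionary": 3, "词典": 3}
english = [["machine translation", "corpus"], ["dictionary"]]
chinese = [["词典"], ["机器翻译", "语料库"]]
print(align_sentences(english, chinese, phrase_to_synset))
# [(0, 1, 1.0), (1, 0, 1.0)]
```

The same score could serve as the sentence similarity that is compared against a threshold to establish the semantic link similar, although the paper does not fix a particular formula.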
During information analysis, multilingual information should be collected first. Words, phrases and sentences are then aligned and used to compute the similarity between documents in different languages. Documents connected by the semantic link similar, that is, documents whose similarity exceeds the threshold, can form the collection for information analysis.

6. Conclusions and future work

Multilingual information processing has gained more and more attention in recent years. A multilingual corpus is important for multilingual information analysis, research and service. How a multilingual corpus is organized has great influence on its usage and on corpus-based applications. In this paper, we study the elements of a multilingual corpus and the semantic associations among these elements. We propose an approach that semantically links the elements into a semantic link network for further multilingual information analysis and processing. The semantic link network realizes the semantic organization of the multilingual corpus, and it can promote the use of the corpus in many text-based applications such as dictionary search, machine translation and cross-lingual information retrieval.

Future work will focus on how to distribute multilingual processing and the development of semantic relationships in distributed computing and mobile computing environments. In particular, future work will focus on optimizing the techniques for carrying out semantic matching via an in-network processing system. This will allow scalable processing and semantic link network development for data coming from social networks, emerging applications and Internet of Things-enabled solutions.
Acknowledgments

This research work is partially supported by the International Science & Technology Cooperation Program of China under Grant No. 2014DFA11350; the National Natural Science Foundation of China (Grant Nos. 61371185 and 61171014); and ISTIC Research Foundation Projects (Grant Nos. XK2014-6 and ZD2014-3-4). The authors would like to thank the HES-SO and the Institute of Information Systems for their funding and support. Finally, we would like to thank the European Project “In-Network Programmability for next-generation personal cloud service support” from the H2020 Framework Programme with the grant agreement number 644672.

References

Aitchison, J., Gilchrist, A., & Bawden, D. (2000). Thesaurus construction and use: A practical manual. Psychology Press.
Barnaghi, P., Wang, W., Henson, C., & Taylor, K. (2012). Semantics for the Internet of Things: Early progress and back to the future. International Journal on Semantic Web and Information Systems (IJSWIS), 8(1), 1–21.
Berners-Lee, T., Hall, W., Hendler, J. A., O’Hara, K., Shadbolt, N., & Weitzner, D. J. (2006). A framework for web science. Foundations and Trends in Web Science, 1(1), 1–130.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2), 4.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd annual meeting on association for computational linguistics (pp. 263–270). Association for Computational Linguistics.
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics, 33(2), 201–228.
Crouch, C. J., & Yang, B. (1992). Experiments in automatic statistical thesaurus construction. In Proceedings of the 15th annual international ACM SIGIR conference on research and development in information retrieval (pp. 77–88). ACM.
Dong, Z., & Dong, Q. (2006). HowNet and the computation of meaning. World Scientific.
Eisele, A., & Chen, Y. (2010). Multiun: A multilingual corpus from United Nation documents. In LREC.
Erjavec, T., & Ide, N. (1998). The multext-east corpus. In Proc. of LREC, Vol. 98.
Gorman, K., & Choobineh, J. (1990). The object-oriented entity-relationship model (ooerm). Journal of Management Information Systems, 41–65.
Han, J., Haihong, E., Le, G., & Du, J. (2011). Survey on nosql database. In 2011 6th International conference on pervasive computing and applications (ICPCA) (pp. 363–366). IEEE.
He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33(3), 464–472.
Hecht, R., & Jablonski, S. (2011). Nosql evaluation. In International conference on cloud and service computing (p. 7).
Jara, A. J., Olivieri, A., Bocchi, Y., Jung, M., Kastner, W., & Skarmeta, A. (2014). Semantic Web of things: An analysis of the application semantics for the IoT moving towards the IoT convergence. International Journal of Web and Grid Services, 10(2–3), 244–272.
Jeffery, K. G., et al. (2009). The Internet of Things: The death of a traditional database? IETE Technical Review, 26(5), 313.
Klyueva, N., & Bojar, O. (2008). Umc 0.1: Czech–Russian–English multilingual corpus. In Proc. of international conference corpus linguistics (pp. 188–195).
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005.
Koehn, P., Birch, A., & Steinberger, R. (2009). 462 machine translation systems for Europe. In Proceedings of MT Summit XII (pp. 65–72).
Liu, V., & Curran, J. R. (2006). Web text corpus for natural language processing. In EACL.
McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge University Press.
Och, F. J., & Ney, H. (2000). Giza++: Training of statistical translation models.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., et al. (2006). The jrc-acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.
Sun, Y., Bie, R., Yu, X., & Wang, S. (2013). Semantic link networks: Theory, applications, and future trends. Journal of Internet Technology, 14(3), 365–377.
Sun, Y., & Jara, A. J. (2014). An extensible and active semantic model of information organizing for the Internet of Things. Personal and Ubiquitous Computing, 18(8), 1821–1833.
Sun, Y., Lu, C., Bie, R., & Zhang, J. (2015). Semantic relation computing theory and its application. Journal of Network and Computer Applications. http://dx.doi.org/10.1016/j.jnca.2014.09.017i
Sun, Y., Yan, H., Lu, C., Bie, R., & Zhou, Z. (2014). Constructing the web of events from raw data in the Web of Things. Mobile Information Systems, 10(1), 105–125.
Yee, K. Y., Chia, Y., Tsai, F. S., Tiong, A. W., & Kanagasabai, R. (2011). Cloud-based semantic service-oriented content provisioning architecture for mobile learning. Journal of Internet Services and Information Security, 1(1), 59–69.
Zhang, J., Gao, Y., He, Y., Xu, H., Shi, C., & Qu, P. (2013). Semantically linking information resources for web-based sharing. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(2), 65–79.
Zhang, J., & Sun, Y. (2010). An analogy reasoning model for semantic link network. JDCTA: International Journal of Digital Content Technology and Its Applications, 4(7), 128–139.
Zhang, J., Sun, Y., Wang, H., & He, Y. (2011). Calculating statistical similarity between sentences. Journal of Convergence Information Technology, 6(2).
Zhang, J., Wang, H., & Sun, Y. (2009). Discovering associations among semantic links. In International conference on web information systems and mining, 2009. WISM 2009 (pp. 204–208). IEEE.
Zhang, J., Wang, H., & Sun, Y. (2010). Are links on the web enough? In 2010 Sixth international conference on semantics knowledge and grid (SKG) (pp. 58–65). IEEE.
Zhuge, H., & Sun, Y. (2010). The schema theory for semantic link network. Future Generation Computer Systems, 26(3), 408–420.
Zhuge, H., & Zhang, J. (2011). Automatically constructing semantic link network on documents. Concurrency and Computation: Practice and Experience, 23(9), 956–971.

Junsheng Zhang is an associate professor and the director of the Language and Knowledge Technology Lab, Institute of Scientific and Technical Information of China. He received his PhD in Computer Science in 2009 from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include information and knowledge management, natural language processing, mobile computing and cloud computing.
Yunchuan Sun is an associate professor and the director of the Lab for Economics and Business at Beijing Normal University, Beijing, China. He has acted as Secretary of the IEEE Communication Society Emerging Technical Subcommittee of Internet of Things since January 2013. He also acts as an associate editor of the Springer journal Personal and Ubiquitous Computing. He received his PhD in Computer Science in 2009 from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include Internet of Things, Data Science, Event-Linked Network, Semantic Technology, and Business Model.

Antonio J. Jara is an Assistant Prof. and PostDoc at the University of Applied Sciences Western Switzerland (HES-SO), Switzerland, vice-chair of the IEEE Communications Society Internet of Things Technical Committee, and founder of the Wearable Computing and Personal Area Networks company HOP Ubiquitous S.L. He did his PhD (Cum Laude) at the Intelligent Systems and Telematics Research Group of the University of Murcia (UMU), Spain. He received two M.S. (Hons. – valedictorian) degrees. Since 2007, he has been working on several projects related to IPv6, WSNs, and RFID applications in building automation and healthcare. He is especially focused on the design and development of new protocols for security and mobility for the Future Internet of Things, which were the topic of his Ph.D. He continues to work on IPv6 technologies for the Internet of Things in projects such as IoT6, and also on Big Data and Knowledge Engineering for Smart Cities and eHealth. He has also completed a Master in Business Administration (MBA). He has published over 100 international papers and holds one patent. Finally, he participates in several projects on IPv6, the Internet of Things, Smart Cities, and mobile healthcare.