Article Applied Physics
October 2010 Vol.55 No.30: 3458–3465 doi: 10.1007/s11434-010-4114-3
SPECIAL TOPICS:
Language clusters based on linguistic complex networks
LIU HaiTao1,2* & LI WenWen2
1 School of International Studies, Zhejiang University, Hangzhou 310058, China;
2 Institute of Applied Linguistics, Communication University of China, Beijing 100024, China
Received March 6, 2010; accepted April 22, 2010
To investigate the feasibility of using complex networks in the study of linguistic typology, this paper builds and explores 15 linguistic complex networks based on the dependency syntactic treebanks of 15 languages. The results show that it is possible to classify human languages by means of the following main parameters of complex networks: (a) average degree of the node, (b) cluster coefficients, (c) average path length, (d) network centralization, (e) diameter, (f) power exponent of degree distribution, and (g) the determination coefficient of power law distributions. The precision of this method is similar to the results achieved by means of modern word order typology. This paper tries to solve two problems of current linguistic typology. First, the language sample of a typological study is not real text; second, typological studies pay too much attention to local language structures in the course of choosing typological parameters. This study performs better in global typological features of language and not only enhances typological methods, but it is also valuable for developing the applications of complex networks in the humanities, social, and life sciences.
complex networks, linguistic typology, language network, syntactic dependency network, cluster analysis, language classification
Citation: Liu H T, Li W W. Language clusters based on linguistic complex networks. Chinese Sci Bull, 2010, 55: 3458–3465, doi: 10.1007/s11434-010-4114-3
*Corresponding author (email: [email protected])
© Science China Press and Springer-Verlag Berlin Heidelberg 2010
A language system is a network with a complex structure [1]. For this reason, it is difficult to study the integrated characteristics of language by traditional linguistic research methods. Consequently, it is necessary to try to study language from the perspective of complex networks. This perspective is helpful for discovering the relationships between language systems and human cognition, human society, and other natural systems through the comprehensive investigation of human language within a complex network. Scholars have conducted many studies on language and complex networks [2–6]. There are several kinds of human language and various principles for constructing language networks. Research shows that most of the language networks within different languages and diverse construction principles have characteristics that are small-world and scale-free. This research has been valuable for understanding the universality of language networks and the commonness of language systems, human society, and other natural
systems. However, the tendency of research to gravitate towards the commonalities of different networks does not enable a better understanding of the laws of language structure and evolution. If only commonalities are emphasized, the individuality of networks will disappear and “the networks will be the same all over the world” [7]. This goes against the application of complex network research in the real world. Language networks are not the goal but the means of language research for linguists [8]. Discovering the possibility of complex networks in language research is far more important than simply focusing on the general characteristics of language networks on different levels and then constructing various theoretical models of the complex networks.
“Linguistic typology” is a subject about language classification. Altmann and Lehfeldt [9] regarded language classification as one of the two main tasks of general linguistic typology. They argued that language classification must construct a classification system for natural languages based on their global resemblance. Modern linguistic typology is
not solely about language classification, but also studies the universals of human language through cross-linguistic comparison [10,11]. Compared with traditional linguistic typology, a modern trend emphasizes the research of universals in human language. However, modern linguistic typology has a tendency to emphasize some typological parameters over others. Research that focuses too much on details cannot accurately regard language as a whole. This type of research will influence the precision of language classification created by linguistic typological research. Another issue that should be noted is the resource problem of linguistic typology. Though the current language sample for typological study has thousands of languages, most of the typological data of these languages is not selected from natural speech or text in daily life. The conclusion obtained from such data makes it difficult to comprehensively reflect the typological characteristics of a language. To solve these two problems, we can adopt morphologically and syntactically annotated corpora as a resource to get a more objective and reliable conclusion. As to the selection and validation of parameters, we can choose some parameters which are convenient for automatic extraction from the corpora and will reflect the integrated features of language. Next, we can use clustering or some other modern statistical technique and quantitative method to confirm the validity and reliability of these parameters in language classification. Liu [12] used the dependency syntactic treebank of 20 languages as a resource and explored whether it is feasible to regard dependency direction as a means of word-order typology research. The results show that the syntactically annotated corpora can be used as a resource of linguistic typology research. The findings resemble the results obtained by adopting the typological language sample. 
Ever since the notion of complex network has been disseminated, more and more linguists have carried out research on language networks [8,13–20]. This research has included sub-topics of phonology, syntax, semantics, style, language development, and other issues. In the aspect of linguistic typology, Čech and Mačutek [16] proposed that the difference between word form network and lemma network may reflect the typological characteristics of language. Choudhury and Mukherjee [17] found that the average degree of Hindi spelling network differs substantially from that of English. This finding may reflect a difference of language typology. Apart from these studies, we have not seen any empirical research on linguistic typology by means of a complex network in China or abroad. This paper builds 15 linguistic networks based upon the dependency syntactic treebanks of the aforementioned languages. The main complex network parameters of the 15 languages are calculated by the tools for complex network research. The paper also studies the commonness of these language networks, while investigating the feasibility and reliability of using complex networks as parameters for the research of linguistic typology through clustering experiments.
1 The construction and measurement of a language complex network
From the aspect of structure, no matter how big and complex a network is, its basic elements are not complicated. All networks consist of nodes and edges. However, the nodes and edges of different networks represent different things in the realities they are intended to model. In the syntactic networks of this paper, the nodes are words and the edges are grammatical relationships between words. To construct the syntactic network of a language, it is necessary to select a feasible method of syntactic analysis. Phrase structure analysis emphasizes the part-whole relationships of the components that compose sentences, while dependency analysis aims at identifying the kinds of grammatical relationships that exist between words in sentences [1]. Dependency analysis is, by its fundamental nature, a means of describing the binary grammatical relationship between words. For this reason, it is very easy to transfer the dependency analysis of sentences to a network form. Liu [8] has already described in detail how to convert a dependency treebank into a dependency syntactic network. Figure 1 shows the syntactic networks built from three example sentences in Chinese and in English. The sentences are: 约翰在桌子上放了本书 (John put the book on the table). 那学生读过一本有趣的书 (The student read an interesting book). 那本书的封面旧了 (The cover of the book is old). The syntactic networks of the two languages in Figure 1 are different. This difference provides intuitive evidence that language networks can be applied to the study of linguistic typology. Once we have a syntactic network, we can study its main characteristics according to the indices and parameters of a complex network.
Average path length (L), cluster coefficients (C), average degree (⟨k⟩), diameter (D), and degree distribution (P(k)) are the parameters most frequently used to characterize the complexity of a network [21,22]. In view of the characteristics of a syntactic network, we also consider the network centralization (NC) [23] as a parameter. Network centralization can help us find the central node of a syntactic network, and it indirectly reflects the degree of morphological change of a language. Based upon these parameters, we can usually evaluate the properties of a network (e.g. whether it is a small-world or scale-free network). The degree distribution of Figure 1(a) consists of four nodes whose degree is one, eight nodes whose degree is two, one node whose degree is three, two nodes whose degree is four, and one node whose degree is five. Figure 1(b) has four nodes whose degree is one, six nodes whose degree is two, three nodes whose degree is three, one node whose degree is four, and one node whose degree is six.
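The parameters named above can be illustrated on the three English example sentences. The following is a minimal sketch, not the procedure of Network Analyzer: the governor-dependent pairs are a hypothetical dependency analysis supplied for illustration (the source does not list them), repeated word pairs are collapsed into one edge, and the centralization formula is the one we understand to be given by Dong and Horvath [23]. Under these assumptions the sketch yields the values reported for English in Table 1 (N = 14, E = 16, ⟨k⟩ = 2.286, C = 0, L = 2.604, NC = 0.333, D = 6).

```python
from collections import deque

# Hypothetical dependency analysis (governor, dependent) of the three English
# example sentences; these pairs are an assumption, not taken from the source.
DEPENDENCIES = [
    ("put", "john"), ("put", "book"), ("book", "the"),
    ("put", "on"), ("on", "table"), ("table", "the"),
    ("read", "student"), ("student", "the"), ("read", "book"),
    ("book", "an"), ("book", "interesting"),
    ("is", "cover"), ("cover", "the"), ("cover", "of"),
    ("of", "book"), ("book", "the"), ("is", "old"),
]


def build_graph(pairs):
    """Undirected simple graph: repeated word pairs collapse into one edge."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbs = list(adj[node])
    k = len(nbs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbs[j] in adj[nbs[i]])
    return 2.0 * links / (k * (k - 1))


def network_parameters(adj):
    """Compute N, E, <k>, C, L, D and NC for a connected undirected graph."""
    n = len(adj)
    e = sum(len(nbs) for nbs in adj.values()) // 2
    degrees = [len(nbs) for nbs in adj.values()]
    # Distances over all ordered pairs of distinct nodes.
    all_dist = [d for s in adj for d in bfs_distances(adj, s).values() if d > 0]
    density = 2.0 * e / (n * (n - 1))
    return {
        "N": n,
        "E": e,
        "k": 2.0 * e / n,
        "C": sum(local_clustering(adj, v) for v in adj) / n,
        "L": sum(all_dist) / len(all_dist),
        "D": max(all_dist),
        # Centralization as we understand it from Dong and Horvath [23].
        "NC": n / (n - 2) * (max(degrees) / (n - 1) - density),
    }


params = network_parameters(build_graph(DEPENDENCIES))
print({name: round(value, 3) for name, value in params.items()})
```

Merging "The" and "the" into one node and counting the repeated pair book-the only once are design choices of this sketch; other choices would change N and E.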
The degree distributions listed above show that Figure 1(a) and (b) differ in their complex network parameters (Table 1). Because the example networks contain only three sentences, we ask whether the parameters of different language networks will still differ when the word count of the networks is increased. If the differences remain, can they serve as parameters or indices for the study of linguistic typology? To answer these two questions, we built 15 syntactic networks based upon available dependency treebanks (ISO 639-2 language codes in parentheses): Arabic (ara), Catalan (cat), modern Greek (ell), ancient Greek (grc), English (eng), Basque (eus), Hungarian (hun), Italian (ita), Japanese (jpn), Portuguese (por), Romanian (rum), Spanish (spa), Turkish (tur), Latin (lat), and Chinese (chi). We used Network Analyzer, the network analysis plug-in of Cytoscape. Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks, and Network Analyzer calculates the parameters of complex networks [24].
2 The complexity of 15 linguistic networks
Limited by the available treebanks, most of the languages in our sample belong to the Indo-European family. From the viewpoint of linguistic typology, it is not appropriate to sample in this way. It is acceptable here, however, because the purpose of this paper is to propose a new method and to investigate empirically whether the method is feasible for research on language classification. Most of the treebanks [25–35] we used come from the training sets of the CoNLL-X shared tasks on multilingual dependency parsing [36,37]. To make the results more comparable, we randomly extracted corpora with equivalent numbers of words and converted them into the syntactic networks of the corresponding languages so that they could be processed by complex network analysis software. We analyzed the syntactic networks of the 15 languages using Network Analyzer (see Table 2).
Of the 15 languages, the Arabic, Chinese, English, Hungarian, Portuguese, and Romanian corpora are news texts. The Japanese corpus is conversational, while the Latin and ancient Greek corpora come from classical literature. The remaining corpora are of mixed genres. As for the syntactic annotation of the original treebanks, Arabic, Chinese, modern Greek, ancient Greek, Basque, Romanian, Turkish, and Latin adopt a dependency annotation scheme. Italian, Japanese, Portuguese, Catalan, and Spanish are annotated with both phrase structure and dependency syntax, while English and Hungarian use a phrase structure annotation scheme. For the treebanks that are not annotated with dependency syntax, we adopted the dependency formats automatically converted by CoNLL-X. Arabic and modern Greek adopt the annotation scheme of the Prague Dependency Treebank [38].

Figure 1 Syntactic networks of three Chinese and English sentence examples.

Table 1 Main parameters of Chinese and English networks from the example a)
Language  E    N    ⟨k⟩    C  L      NC     D
Chinese   20   17   2.235  0  3.074  0.125  6
English   16   14   2.286  0  2.604  0.333  6
a) E is the number of edges in the network, N is the number of nodes, ⟨k⟩ is the average degree, C is the cluster coefficients, L is the average path length, NC is the network centralization, and D is diameter.

Table 2 Main parameters of syntactic network of 15 languages a)
Language  E      N      ⟨k⟩    C      L      NC     D   γ      R2
ara       30164  10190  5.783  0.165  3.622  0.196  10  1.211  0.723
cat       30944  8906   6.816  0.129  3.234  0.235  9   1.165  0.703
chi       13348  4015   6.478  0.128  3.371  0.231  10  1.33   0.801
ell       27942  9229   5.968  0.114  3.445  0.227  11  1.226  0.722
grc       23798  8870   5.291  0.089  3.638  0.146  11  1.343  0.746
eng       28229  7770   7.127  0.122  3.308  0.189  9   1.223  0.803
eus       27895  10561  5.207  0.115  3.571  0.213  13  1.334  0.75
hun       33146  13075  5.055  0.029  3.938  0.155  11  1.353  0.734
ita       32329  9051   7.059  0.126  3.243  0.194  8   1.185  0.701
jpn       8356   1638   9.716  0.279  2.755  0.319  6   1.123  0.789
por       29396  8855   6.444  0.207  3.123  0.312  8   1.125  0.685
rum       28032  8862   6.189  0.108  3.316  0.245  9   1.204  0.72
spa       25254  7939   6.209  0.181  3.146  0.271  9   1.108  0.688
tur       26421  11969  4.25   0.205  2.958  0.514  10  1.161  0.616
lat       28945  11571  4.91   0.107  3.598  0.196  11  1.266  0.721
a) γ is the power exponent of degree distribution and R2 is the determination coefficient of fitting the degree distribution to power law.
3 Linguistic complex network and linguistic typology
We analyzed the integrated characteristics of the language networks to determine whether they are small-world and scale-free. Figure 2 shows that the average path length of the 15 languages fluctuates only within a narrow range (from 2.755 to 3.938). In other words, the average distance between any two nodes in the 15 linguistic networks is roughly three to four steps. Liu [8] hypothesized that the short path lengths of a linguistic syntactic network are closely related to the tendency of dependency distance toward a minimum value. This hypothesis connects the small-world characteristics of a linguistic network with linguistics and cognitive science. Dependency distance is the linear distance between governor and dependent. For example, in the sentence “这是一个例子” (This is an example), the dependency distance of “是-这” is 2–1=1, “个-一” is 4–3=1, “例子-个” is 5–4=1, and “是-例子” is 5–2=3. The average dependency distance of this sentence is (1+1+1+3)/4=1.5. Liu’s research on the average dependency distance of 20 languages [39] shows that the average dependency distance of these languages ranges from 1.798 to 3.662. That is to say, the linear distance between two words in a grammatical relationship is, on average, within about three words. The tendency toward minimal dependency distance is constrained by human working memory capacity and by grammar. In a syntactic network, a node is a word. We have reason to believe that the average path length in a syntactic network is closely related to the average
Figure 2 Cluster coefficients and average path length (C multiplied by 20).
dependency distance of the sentence. However, we need further research to better explain the relationship between the two. Our research is beneficial for deepening an understanding of the relationships among complex networks, human cognition mechanisms, and language faculties. Cluster coefficients reflect the possibility of a syntactic relationship between two words which are syntactically related to another word. Japanese and Hungarian are two languages at two ends of the curve of cluster coefficients as shown in Figure 2. The C of Japanese is ten times greater than that of Hungarian (0.279 and 0.029, respectively). Besides these two languages, cluster coefficients of the other 13 languages range from 0.088 to 0.207. However, the cluster coefficients of a syntactic network are much larger than that of a random network when they are compared with each other with the same nodes and average degree. According to C and L in Figure 2, this paper finds that the linguistic networks of the 15 languages are all small-world networks. We also noticed that Japanese and Hungarian are at the ends of the curve of cluster coefficients and the curve of average path length. Among the 15 languages, Japanese ranks the first in cluster coefficient and the last in average path length. Hungarian is the opposite. Why? Is it due to corpora or is it a reflection of the typological characteristic of language? Japanese and Hungarian are agglutinative languages from the perspective of morphological structure. Japanese corpus is a conversation of restricted fields and Hungarian corpus is general news. Although the two corpora are of similar scale, the lexical and syntactic restrictions of Japanese corpora are very different from Hungarian. This finding indicates that the sensitivity of a complex network to style takes effect not only within a language, but is also cross-linguistically valid [4,8]. 
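The comparison with a random network mentioned above can be made concrete for a few rows of Table 2. The sketch below is illustrative only: the source does not state which random baseline was used, so we assume the textbook approximations for a random graph with the same number of nodes and average degree, C_rand ≈ ⟨k⟩/N and L_rand ≈ ln N / ln ⟨k⟩ (see e.g. [22]). The function name and the choice of five languages are ours.

```python
import math

# (N, <k>, C, L) copied from Table 2 for five of the 15 languages.
TABLE2_SAMPLE = {
    "chi": (4015, 6.478, 0.128, 3.371),
    "eng": (7770, 7.127, 0.122, 3.308),
    "hun": (13075, 5.055, 0.029, 3.938),
    "jpn": (1638, 9.716, 0.279, 2.755),
    "tur": (11969, 4.25, 0.205, 2.958),
}


def random_graph_baseline(n, k):
    """Approximate C and L of a random graph with n nodes and mean degree k."""
    c_rand = k / n                      # expected cluster coefficient
    l_rand = math.log(n) / math.log(k)  # expected average path length
    return c_rand, l_rand


for lang, (n, k, c, l) in TABLE2_SAMPLE.items():
    c_rand, l_rand = random_graph_baseline(n, k)
    print(f"{lang}: C={c:.3f} vs C_rand={c_rand:.5f}; "
          f"L={l:.3f} vs L_rand={l_rand:.3f}")
```

A network is usually called small-world when C is much larger than C_rand while L stays of the same order as L_rand; even Hungarian, with the smallest C in the sample, exceeds its random baseline by well over an order of magnitude under these approximations.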
A network whose degree distribution obeys a power law, P(k) ~ k^(−γ), is called a scale-free network. Using the function provided by Network Analyzer, we fitted a power law to the degree distribution of each of the 15 languages and obtained the power exponent and the determination coefficient of each language, as shown in Figure 3.
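One simple way to obtain a power exponent and a determination coefficient is an ordinary least-squares line through the points (ln k, ln P(k)). The sketch below shows this generic technique; it is an assumption that it resembles the fit performed by Network Analyzer, whose internal procedure the source does not describe, and the function name fit_power_law is ours.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Least-squares fit of ln P(k) against ln k; returns (gamma, r_squared)."""
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / total) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    # P(k) ~ k^(-gamma), so gamma is the negated slope on log-log axes.
    return -slope, r_squared


# Synthetic degree sequence in which the number of nodes of degree k is 3600/k^2.
synthetic = [k for k in range(1, 7) for _ in range(3600 // (k * k))]
gamma, r2 = fit_power_law(synthetic)
print(round(gamma, 3), round(r2, 3))
```

On real degree distributions the sparse, noisy tail pulls such a fit away from a straight line, which is one reason a determination coefficient well below 1 does not by itself rule out scale-free behaviour.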
Figure 3 Power exponent and determination coefficients.
The power exponent in Figure 3 ranges from 1.077 to 1.353. There are only five languages whose determination coefficient is above 0.75. Our research demonstrates that it is difficult to get viable fitting results with power law because the degree distribution of a real network has the characteristic of a long tail. Segmented fitting and accumulation of degree distribution are always used to avoid the disturbance of a long tail. Researchers have proposed some new and more effective methods for this purpose [40]. This paper aims at discovering the relationship between a complex network and linguistic typology, and is an area of study belonging to the applied research of a complex network. We will not discuss the theoretical study of power laws further, but only adopt the most convenient method and tool in existence [21,22]. Figure 3 shows that this parameter is sufficient to distinguish the languages being studied, and is likely to be the parameter for language classification. According to the research results of a syntactic network [3,8], the degree distribution of the network in this paper is close to the power law distribution if it adopts the accumulation of degree distribution or segmented interception. These networks are scale-free. After investigating the integrated characteristics of these networks, we will analyze some parameters which may relate to the linguistic typology. The degree of node in a syntactic network indicates the combining ability of words in real texts. From the viewpoint of linguistics, the degree of these networks is a reflection of a word’s syntactic valency, as well as instantiation of the “probabilistic valency pattern” [41] of the language (word type). Figure 4 shows that the average degree of a language is not necessarily related to the degree of network centralization. 
Centralization indices do not reflect the average ability of nodes connecting to each other, but they do indicate the difference between the degree of nodes in the network and the authority of nodes. Syntactically speaking, a linguistic network with a large NC has nodes of outstanding degree. These nodes are grammatically functional words, in
Figure 4 Average degree, network centralization, and ratio of edge and node (NC multiplied by 20).
general. Consequently, NC reflects the degree of morphological change of a language and can be used as a parameter for typology. Theoretically, the average degree of a network is determined by its numbers of edges and nodes. Therefore, we calculated the ratio of edges to nodes for every network, shown as the E/N curve in Figure 4. The average degree is positively correlated with E/N (Pearson correlation coefficient = 0.999, P < 0.001). The average degree of the network is not a stable parameter for typology because it is easily influenced by the network scale. In research on language typology or classification based on real corpora, it is very difficult to find parameters that are independent of text size and annotation scheme unless the sample contains several languages of the same family. Without such languages, it is hard to determine whether the results are caused by properties of the languages or by non-linguistic factors. Romanian, Italian, Portuguese, Catalan, and Spanish form the Romance subgroup in our sample, while Latin is the ancestor of all Romance languages. These languages are the reference languages against which we select parameters. According to traditional linguistic typology, verbs in the Romance languages are inflected, while their nouns show an isolating tendency. Modern Greek and Arabic are typical inflected languages. English, which belongs to the Germanic subgroup, is becoming more isolating but still retains some inflectional changes. Basque, Hungarian, Japanese, and Turkish are agglutinative languages. Chinese is an isolating language. Latin and ancient Greek have more inflectional changes than the modern Romance languages and modern Greek. The degree of morphological change of a language may influence its network characteristics. Therefore, the classification of traditional linguistic typology is instructive for the later analysis and discussion.
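The relationship between average degree and E/N reported above can be recomputed from the E, N and ⟨k⟩ columns of Table 2. The following is a minimal sketch; the helper pearson is ours, and the coefficient it prints is meant to be compared with the reported one rather than taken as the authors' calculation. For a simple undirected graph ⟨k⟩ equals 2E/N exactly; the tabulated ⟨k⟩ values lie slightly below 2E/N, for reasons the source does not spell out.

```python
import math

# (E, N, <k>) for the 15 languages, copied from Table 2.
TABLE2 = {
    "ara": (30164, 10190, 5.783), "cat": (30944, 8906, 6.816),
    "chi": (13348, 4015, 6.478), "ell": (27942, 9229, 5.968),
    "grc": (23798, 8870, 5.291), "eng": (28229, 7770, 7.127),
    "eus": (27895, 10561, 5.207), "hun": (33146, 13075, 5.055),
    "ita": (32329, 9051, 7.059), "jpn": (8356, 1638, 9.716),
    "por": (29396, 8855, 6.444), "rum": (28032, 8862, 6.189),
    "spa": (25254, 7939, 6.209), "tur": (26421, 11969, 4.25),
    "lat": (28945, 11571, 4.91),
}


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ratios = [e / n for e, n, _ in TABLE2.values()]
degrees = [k for _, _, k in TABLE2.values()]
r = pearson(ratios, degrees)
print(f"Pearson r between E/N and <k>: {r:.3f}")
```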
However, to classify languages into several kinds according to morphological changes may not be correct because inflection, agglutination, and isolation may exist more or less in any language. The difference is that of degree, but not of essence. Consequently, when suggesting that a language is agglutinative, it is only to say that the agglutinative elements of the language are more than that of isolated languages and inflected languages. Greenberg [42], Altmann and Lehfeldt [9] have made detailed discussions on the quantitative research of this issue. Altmann and Lehfeldt [9] conducted research on language typology by adopting clustering. Cysouw [43] proposed a language typology clustering method based upon the network (Neighbor Net) [44]. Deng and Wang classified Chinese-Tibetan languages and dialects by adopting etymological statistics and a molecular anthropological method. Their tree diagram can be viewed as an example of language clustering [45]. On the basis of the above discussion, we take C, L, NC, γ, and R2 as variables. Using Euclidean minimum distance as our method, we obtained the language cluster shown in Figure 5.
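A clustering of this kind can be sketched as follows. Several points are assumptions rather than facts from the source: we read "Euclidean minimum distance" as single-linkage agglomerative clustering on Euclidean distances, we standardize each parameter to zero mean and unit variance before measuring distances, and the function names are ours. The sketch therefore cannot be expected to reproduce the resemblance values or the exact tree of Figure 5.

```python
import math

# C, L, NC, gamma, R2 per language, copied from Table 2. The R2 value for
# chi (0.801) is read from a displaced cell of the extracted table.
FEATURES = {
    "ara": (0.165, 3.622, 0.196, 1.211, 0.723),
    "cat": (0.129, 3.234, 0.235, 1.165, 0.703),
    "chi": (0.128, 3.371, 0.231, 1.33, 0.801),
    "ell": (0.114, 3.445, 0.227, 1.226, 0.722),
    "grc": (0.089, 3.638, 0.146, 1.343, 0.746),
    "eng": (0.122, 3.308, 0.189, 1.223, 0.803),
    "eus": (0.115, 3.571, 0.213, 1.334, 0.75),
    "hun": (0.029, 3.938, 0.155, 1.353, 0.734),
    "ita": (0.126, 3.243, 0.194, 1.185, 0.701),
    "jpn": (0.279, 2.755, 0.319, 1.123, 0.789),
    "por": (0.207, 3.123, 0.312, 1.125, 0.685),
    "rum": (0.108, 3.316, 0.245, 1.204, 0.72),
    "spa": (0.181, 3.146, 0.271, 1.108, 0.688),
    "tur": (0.205, 2.958, 0.514, 1.161, 0.616),
    "lat": (0.107, 3.598, 0.196, 1.266, 0.721),
}


def standardize(rows):
    """Rescale every column to zero mean and unit variance (our assumption)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(labels, points):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance = distance between the closest members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


labels = list(FEATURES)
for left, right, dist in single_linkage(labels, standardize(list(FEATURES.values()))):
    print(f"{dist:.3f}  {'+'.join(left)}  <->  {'+'.join(right)}")
```

Each printed line is one merge step of the dendrogram, from the most similar pair upward; swapping the linkage rule or the scaling is a small change to single_linkage or standardize and can alter the grouping noticeably.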
Figure 5 Language cluster using C, L, NC, γ and R2.
Within the Romance subgroup, Figure 5 shows a resemblance of 90.17 between Portuguese and Spanish, 92.39 between Catalan and Italian, 89.13 among Romanian, Catalan, and Italian, and 88.45 among Romanian, Catalan, Italian, and Latin. However, this clustering still has some problems. For example, Portuguese and Spanish are not grouped with the other Romance languages. Another cluster with a high resemblance that deserves notice is ancient Greek and Basque (86.77). Hungarian, Japanese and Turkish are separated from the other languages by a great distance, which basically reflects the morphological and typological characteristics of these languages. We admit that although Figure 5 reflects the differences and similarities of the languages in our corpora to some extent, the clustering result should be further improved and discussed. Therefore, we carried out the clustering experiment again using different parameters and methods. Figure 6 shows the best result, which was obtained with ⟨k⟩, C, L, NC, γ, D, and R2 (i.e. all parameters in Table 2 except E and N). Unlike in Figure 5, the five languages of the Romance subgroup form one group in Figure 6 (83.71). The resemblance of Chinese and English reaches 80.73. Turkish, Japanese, Basque, Hungarian, Arabic, and other languages occupy prominent places in Figure 6, which reflects that they are different from other languages in the
Figure 6 Language cluster using 7 main complex network parameters.
sample. The high resemblance of Latin, modern Greek and ancient Greek (84.54) indicates that there are close relationships among these languages. The resemblance of Latin and the five Romance languages (83.85) is slightly lower than that of ancient Greek and modern Greek (84.54). This finding suggests that the same language occupies different places in the clustering at different times, thus reflecting the diachronic evolution of language. We can also see this effect in the clustering analysis of 20 languages by Altmann and Lehfeldt [9], although the clustering parameters they adopted came from the system of morphological indices proposed by Greenberg [42]. The linguistic clustering result based on complex networks obtained in this paper is approximately the same as the typology-based classification result in [12]. If we compare the clustering result of Figure 6 with that of [12], both are capable of distinguishing languages with evident morphological features, and both methods recover the resemblance of the languages in the Romance subgroup of our sample. The two results agree in other respects as well: Chinese and English are placed in one cluster in [12], as they are in this paper. The clustering parameters in [12] are SV, OV, and AdjN for 20 languages, which are much closer to the parameters of modern word order typology [46]. We can conclude from the above discussion that it is feasible to study language classification by combining the main parameters of linguistic (syntactic dependency) networks. There are, however, a few remaining problems that require further research.
4 Conclusions
A linguistic network is a complex network based upon linguistic principles. Linguists can determine some integrated characteristics of language using the complex network technique. We are still unclear about the influence from local structure upon a global network because of the emergence phenomenon [47]. However, there must be relationships between local and global features of a network, or all real networks would display the same characteristics. If that were the case, we could not study the colorful world through complex networks. Modern linguistic typology research is a linguistic branch subject and studies human language universals through cross-linguistic comparison. Although the parameters of general typological research are often microscopic features, such as phonological, morphological and syntactical, the goal of microscopic comparison is to classify languages on a more macroscopic level. Therefore, the research of linguistic typology by complex networks that focus on macroscopic features may better reflect the integrated property of linguistic networks. Our research investigates the feasibility of using complex networks in the study of linguistic typology. This paper
builds 15 linguistic networks based on the dependency syntactic treebanks of 15 languages and explores these language complex networks. The results show that it is possible to cluster human languages through the following main parameters of complex networks: (a) average degree of the node, (b) cluster coefficients, (c) average path length, (d) network centralization, (e) diameter, (f) power exponent of degree distribution, and (g) the determination coefficients of fitting degree distribution with power law. The precision of this method is similar to the results in modern word order typology. From the viewpoint of modern linguistic typology, this paper solves two problems. First, the language sample of a typological study is not real text. Second, the study pays too much attention to local structure details in the course of choosing typological parameters. This study performs better in global typological features of language. This is a linguistic typology research method, which performs better in robust and real corpora. Compared with linguistic (implicational) typology, the method proposed in this paper is helpful for solving unclear boundary problems when dividing languages according to morphological changes. This method is also beneficial for constructing a continuum of linguistic typology. The paper enlarges the application fields of complex networks in the humanities and social sciences, and goes further in the direction of adopting complex networks as the method to study language. Our study also expands the usage of complex networks from that of discovering network commonness to excavating network individuality. However, as a new linguistic typology research method, this paper also has a few problems that can be divided into two categories. The first problem concerns the method of complex network research. 
The existing parameters of complex networks mostly focus on the global characteristics of language and inevitably ignore detailed differences of language structure. Further work on this aspect should include adopting social network analysis techniques, discovering new network parameters, and constructing weighted language networks. The second problem concerns the corpora. It is better to guarantee the consistency of the corpora behind the language networks by annotating different styles of the same language, or different languages of the same style, with the same dependency annotation scheme. We also suggest comparing and studying the commonness and individuality of these networks.
We thank the anonymous reviewers for the insightful comments and the providers of the treebanks for high-quality linguistic resources. This work was partly supported by Communication University of China as one of “211” Key Projects and was supported by the National Social Science Foundation of China (09BYY024).
1 2
Hudson R. Language Networks: The New Word Grammar. Oxford: Oxford University Press, 2007 Ferrer i Cancho R. The structure of syntactic dependency networks:
October (2010) Vol.55 No.30
3 4
5 6
7
Insights from recent advances in network theory. In: Altmann G, Levickij V, Perebyinis V, eds. The Problems of Quantitative Linguistics. Chernivtsi: Ruta, 2005. 60–75
3 Ferrer i Cancho R, Solé R V, Köhler R. Patterns in syntactic dependency networks. Phys Rev E, 2004, 69: 051915
4 Liang W, Shi Y, Tse C K, et al. Comparison of co-occurrence networks of the Chinese and English languages. Physica A, 2010, 388: 4901–4909
5 Li J, Zhou J. Chinese character structure analysis based on complex networks. Physica A, 2007, 380: 629–638
6 Li Y, Wei L, Niu Y, et al. Structural organization and scale-free properties in Chinese phrase networks. Chinese Sci Bull, 2005, 50: 1304–1308
7 Liu H K, Zhang X L, Cao L, et al. Analysis on the connecting mechanism of Chinese city airline network (in Chinese). Sci China Ser G (Chinese Ver), 2009, 39: 935–942
8 Liu H. The complexity of Chinese dependency syntactic networks. Physica A, 2008, 387: 3048–3058
9 Altmann G, Lehfeldt W. Allgemeine Sprachtypologie: Prinzipien und Messverfahren. Munich: Fink, 1973
10 Croft W. Typology and Universals. 2nd ed. Cambridge: Cambridge University Press, 2002
11 Song J. Linguistic Typology: Morphology and Syntax. Harlow and London: Pearson Education, 2001
12 Liu H. Dependency direction as a means of word-order typology: A method based on dependency treebanks. Lingua, 2010, 120: 1567–1578
13 Liu H. Statistical properties of Chinese semantic networks. Chinese Sci Bull, 2009, 54: 2781–2785
14 Liu H, Hu F. What role does syntax play in a language network? Europhys Lett, 2008, 83: 18002
15 Mehler A. Large text networks as an object of corpus linguistic studies. In: Lüdeling A, Merja K, eds. Corpus Linguistics. An International Handbook. Berlin, New York: de Gruyter, 2008. 328–382
16 Čech R, Mačutek J. Word form and lemma syntactic dependency networks in Czech: A comparative study. Glottometrics, 2009, 19: 85–98
17 Choudhury M, Mukherjee A. The structure and dynamics of linguistic networks. In: Dynamics on and of Complex Networks, Modeling and Simulation in Science, Engineering and Technology. Boston: Birkhaeuser, 2009. 145–166
18 Ke J, Yao Y. Analyzing language development from a network approach. J Quant Linguistics, 2008, 15: 70–99
19 Mukherjee A, Choudhury M, Basu A, et al. Self-organization of the sound inventories: Analysis and synthesis of the occurrence and co-occurrence networks of consonants. J Quant Linguistics, 2009, 16: 157–184
20 Peng G, Minett J W, Wang W S Y. The networks of syllables and characters in Chinese. J Quant Linguistics, 2008, 15: 243–255
21 He D, Liu Z, Wang B. Complex Systems and Complex Networks (in Chinese). Beijing: Higher Education Press, 2009
22 Albert R, Barabási A L. Statistical mechanics of complex networks. Rev Mod Phys, 2002, 74: 47–97
23 Dong J, Horvath S. Understanding network concepts in modules. BMC Syst Biol, 2007, 1: 24
24 Assenov Y, Ramírez F, Schelhorn S E, et al. Computing topological parameters of biological networks. Bioinformatics, 2008, 24: 282–284
25 Aduriz I. Construction of a Basque dependency treebank. In: Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories, Växjö, Sweden. 2003
26 Afonso S. Floresta sinta(c)tica: A treebank for Portuguese. In: Proceedings of LREC-2002, 2002. 1698–1703
27 Atalay N B, Oflazer K, Say B. The annotation process in the Turkish treebank. In: Proceedings of LINC-2003, 2003
28 Bamman D, Crane G. The design and use of a Latin dependency treebank. In: Proceedings of the Fifth International Workshop on Treebanks and Linguistic Theories (TLT 2006), 2006. 67–78
29 Bamman D, Mambrini F, Crane G. An ownership model of annotation: The ancient Greek dependency treebank. In: Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories
(TLT8), 2009. 5–15
30 Csendes D. The Szeged treebank. In: Proceedings of the 8th International Conference on Text, Speech and Dialogue, TSD 2005, LNAI 3658, 2005. 123–131
31 Torruella M C, Antonín M. Design principles for a Spanish treebank. In: Proceedings of TLT-2002, 2002
32 Kawata Y, Bartels J. Stylebook for the Japanese treebank in VERBMOBIL. Verbmobil-Report 240, Seminar für Sprachwissenschaft, Universität Tübingen, 2000
33 Liu H. Building and using a Chinese dependency treebank. GrKG/Humankybernetik, 2007, 48: 3–14
34 Montemagni S, Barsotti F, Battista M, et al. Building the Italian Syntactic-Semantic Treebank. Treebanks, 2003. 189–210
35 Prokopidis P, Desipri E, Koutsombogera M, et al. Theoretical and practical issues in the construction of a Greek dependency treebank. In: Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), 2005. 149–160
36 Buchholz S, Marsi E. CoNLL-X shared task on multilingual dependency parsing. In: Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 2006. 149–164
37 Nivre J, Hall J, Kübler S, et al. The CoNLL 2007 shared task on dependency parsing. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007. 915–932
38 Hajic J, Smrz O, Zemanek P, et al. Prague Arabic dependency treebank:
Development in data and tools. In: Proceedings of NEMLAR-2004, 2004. 110–117
39 Liu H. Dependency distance as a metric of language comprehension difficulty. J Cognit Sci, 2008, 9: 159–191
40 Clauset A, Shalizi C R, Newman M E J. Power-law distributions in empirical data. SIAM Rev, 2009, 51: 661–703
41 Liu H T, Feng Z W. Probabilistic valency pattern theory for natural language processing (in Chinese). Linguistic Sci, 2007, 3: 32–41
42 Greenberg J H. A quantitative approach to the morphological typology of language. In: Method and Perspective in Anthropology. Minneapolis: University of Minnesota Press, 1954. 192–220
43 Cysouw M. New approaches to cluster analysis of typological indices. In: Köhler R, Grzybek P, eds. Exact Methods in the Study of Language and Text. Berlin: Mouton de Gruyter, 2007. 61–76
44 Bryant D, Moulton V. Neighbor-Net: An agglomerative method for the construction of phylogenetic networks. Mol Biol Evolut, 2004, 21: 255–265
45 Deng X H, Wang S Y. Classification of Languages and Dialects in China (in Chinese). Beijing: ZhongHua Book Company, 2009
46 Haspelmath M, Dryer M, Gil D, et al. The World Atlas of Language Structures. Oxford: Oxford University Press, 2005
47 Liu H T, Zhao Y Y, Huang W. How do local syntactic structures influence global properties in language networks? Glottometrics, 2010, 20: 39–59