Law of Cumulative Advantages in the Evolution of Scientific Fields¹
Tomáš Cahlík, Marcel Jiřina²
Abstract: The evolution of scientific fields, analyzed by co-word analysis and presented in strategic diagrams, is simulated on the basis of the law of cumulative advantages: the probability of a new tie between two keywords depends positively on the frequencies of the ties in which both keywords have already taken part. The results of the simulations are compared with the real evolution of a scientific field. We consider the high correspondence between the two to be proof that the law of cumulative advantages operates in the development of scientific fields, and we believe that our research opens new possibilities for predicting that development.
Keywords: law of cumulative advantages, maps of science, co-word analysis, simulation, model, predictions
Introduction
In maps of science, we see either clusters of keywords or clusters of citations. These clusters are created on the basis of the frequencies of co-occurrences of keywords or citations in documents. Derek de Solla Price showed that behind many empirical bibliometric laws lies the law of cumulative advantages: success is rewarded, non-success is ignored. All bibliometric laws are about frequencies, so the hypothesis that the frequencies of co-occurrences follow the law of cumulative advantages is a natural one. The aim of this article is to describe the experiments we have made to verify this hypothesis. In the first part we summarize the tools we have used for the verification: co-word analysis, illustrated on the real evolution of the scientific field “economics” in 1997-1999³, and the algorithm that generates co-occurrences of keywords based on the law of cumulative advantages. In the second part we describe the results of our experiments. In the conclusions we argue that the results of our experiments prove the above-mentioned hypothesis, and we formulate a question for further research.
Tools
1 Results of grant 402/00/0999 “Research and Development in Economic Growth Models” of the Grant Agency of the Czech Republic are used in this article.
2 Tomáš Cahlík, Charles University Prague, Faculty of Social Sciences, Institute of Economic Studies, Opletalova 26, 110 00 Prague 1, Czech Republic. E-mail: [email protected]. Marcel Jiřina, Czech Technical University in Prague, Faculty of Electrical Engineering, Center of Applied Cybernetics, Technická 2, 166 27 Prague 6 - Dejvice, Czech Republic. E-mail: [email protected].
3 In [3], this field was analyzed in the same periods in the context of a search for fundamental articles in economics. The advantage of using the same periods is that some results can be compared with the results in [3].
Co-word analysis
Co-word analysis [3] has been elaborated mostly by French scientometricians (for example Courtial, Callon, Turner). The keywords used to describe the content of an article are the basic building blocks of the structure of a research field. A cluster of keywords can be understood as a short description of a research theme.⁴ A research field is then described as a structure of mutually connected research themes.
Each research theme obtained in this process has two parameters. The first one, called "density", measures the strength of the internal ties among all the keywords describing the research theme. We can understand this parameter as a measure of the development of the theme. The second one, called "centrality", measures the strength of the external ties to other themes. We can understand this parameter as a measure of the importance of a theme in the development of the entire analyzed field. Either the median or the mean values of density and centrality can be used to classify themes into four groups.
A research field can therefore be understood as a set of research themes mapped in a "strategic diagram": a graph made by plotting themes along two axes, with the x-axis signifying centrality and the y-axis denoting density, according to their centrality and density rank values (if we use the median for classifying clusters) or values (if we use the mean). Strategic diagrams with rank values are used more commonly than the ones with values because they are more legible. The themes in the first quadrant are both well developed and important for the structuring of a research field. The themes in the fourth quadrant have well developed internal ties but unimportant external ties and so are only of marginal importance for the field. The themes in the third quadrant are both weakly developed and marginal. The themes in the second quadrant are important for a research field but are not developed.⁵
The articles from the thirteen most cited economic journals⁶ were excerpted from the SSCI database for the three successive years 1997-1999: 1235 articles in 1997, another 1202 articles in 1998 and another 1253 articles in 1999. The software system Lexidyn was then filled with these articles. A strategic diagram for the first period is in Fig. 1. The number of a theme in the strategic diagram is only a technical mark equal to the number of its most important keyword.⁷
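The classification of themes into the four quadrants described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Lexidyn implementation: the function names `quadrant` and `classify`, the toy centrality and density values, and the rule that a theme lying exactly on the median counts as "low" are all assumptions. The quadrant numbers follow the clockwise convention of footnote 5.

```python
from statistics import median


def quadrant(centrality, density, c_split, d_split):
    """Quadrant number in the clockwise convention (assumed tie rule: > split)."""
    central = centrality > c_split
    dense = density > d_split
    if central and dense:
        return 1  # well developed and important
    if central:
        return 2  # important but not developed
    if dense:
        return 4  # well developed but marginal
    return 3      # weakly developed and marginal


def classify(themes):
    """themes: {name: (centrality, density)} -> {name: quadrant number}."""
    c_split = median(c for c, _ in themes.values())
    d_split = median(d for _, d in themes.values())
    return {name: quadrant(c, d, c_split, d_split)
            for name, (c, d) in themes.items()}


# Hypothetical themes with made-up (centrality, density) values.
themes = {"A": (0.9, 0.8), "B": (0.7, 0.1), "C": (0.1, 0.2), "D": (0.2, 0.9)}
result = classify(themes)
```

Replacing `median` by `statistics.mean` would give the mean-based variant mentioned in the text.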
4 A research theme can be identified by using the information about common occurrences of keywords in articles. Let us calculate an "association index" as $e_{ij} = f_{ij}^2 / (f_i f_j)$, where $f_{ij}$ is the number of articles in which keywords i and j occur together and $f_i$ is the total number of occurrences of keyword i in all the articles ($f_j$ likewise for keyword j). We can understand the association index as a measure of the strength of the tie between two keywords in a research field. This measure is then used for clustering the keywords into research themes.
5 We use the clockwise convention of quadrant numbering and we hope it will not cause any misunderstandings later in the text.
6 The same journals as in [1] are used: Journal of Economic Literature, Econometrica, Journal of Political Economy, Quarterly Journal of Economics, American Economic Review, Review of Economic Studies, Rand Journal of Economics, Journal of Economic Theory, Journal of International Economics, Economic Journal, Journal of Public Economics, International Economic Review, Economica.
7 The numbering is created by the Lexidyn system itself; its only purpose is better legibility of the diagrams.
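The association index of footnote 4 can be illustrated with a short Python sketch. It is not the Lexidyn implementation: the function name `association_index` and the toy articles are hypothetical, and the sketch assumes that each keyword is counted at most once per article.

```python
from collections import Counter
from itertools import combinations


def association_index(articles):
    """Return {(i, j): e_ij} with e_ij = f_ij**2 / (f_i * f_j)."""
    word_freq = Counter()  # f_i: number of articles containing keyword i
    pair_freq = Counter()  # f_ij: number of articles containing both i and j
    for keywords in articles:
        unique = sorted(set(keywords))  # assumption: one count per article
        word_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return {
        (i, j): f_ij ** 2 / (word_freq[i] * word_freq[j])
        for (i, j), f_ij in pair_freq.items()
    }


# Toy data (hypothetical, not taken from the SSCI sample).
articles = [
    ["model", "inflation", "money"],
    ["model", "inflation"],
    ["model", "unemployment"],
]
index = association_index(articles)
```

In this toy example "inflation" and "model" occur together in two articles, while "inflation" occurs in two articles and "model" in three, so their index is 2²/(2·3) = 2/3.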
Fig. 1: Strategic diagram for the first period.
Both themes in Fig. 1 are clusters of eight (a subjectively chosen number) mutually connected keywords. Theme 1-Model in the first quadrant is formed by the following keywords: 1 model, 20 unemployment, 21 innovation, 26 inflation, 31 money, 45 monetary policy, 56 dynamics, 88 life cycle. Theme 2-Equilibrium in the third quadrant is the cluster: 12 markets, 15 games, 152 tiebout, 16 oligopoly, 2 equilibrium, 24 existence, 4 information, 7 market.⁸ With the 2437 articles from 1997 and 1998 together, some themes are added to the strategic diagram (Fig. 2), and for the 3690 articles from all three years the strategic diagram looks as shown in Fig. 3.⁹
8 Theme 2 demonstrates the main problem of this method, the quality of keywords. It is clear that market and markets are in principle the same keyword and that only one of them should be in the database. The solution to this problem lies in replacing synonyms by a single keyword.
9 In [3], strategic diagrams for the single years 1998 and 1999 are used; they are of course different from strategic diagrams built from articles cumulated over several years.
Fig. 2: Strategic diagram from articles cumulated for 1997 and 1998.
Fig. 3: Strategic diagram from articles cumulated from 1997 to 1999
The key question is how to identify a specific research theme in different strategic diagrams when the theme can change its position and even the keywords that form it can change. The solution to this problem lies in defining as identical those themes on various maps whose number of common keywords reaches a (subjectively) given threshold.¹⁰ In Fig. 4 all themes that live in more than one diagram can be found.¹¹
7 -> 7
17 -> 17
2 -> 2 -> 2
5 -> 5
13 -> 13
The chains above show the cumulative work done on some themes. For example, Theme 2 from the first period was further elaborated and is not dispersed in the field even if we cumulate articles for two or three years. The threshold is three, so the number of common keywords in Theme 2 is always three or more from map to map.
Algorithm for generation of co-occurrences of keywords
The algorithm is cyclical. In each cycle, all words try to form pairs with other words in the simplified dictionary used for the scientific field. The simplification means that at the start we discard all words that have not yet formed a pair. The probability that a word starts trying to form a pair is given by
probability = AvFr / MaxAvFr,
where AvFr is the average frequency of the existing pairs in which the word takes part and MaxAvFr is the maximal average frequency of existing pairs among all words. This probability evidently depends on the previous success of the word in forming pairs; this is exactly where the law of cumulative advantages applies. If a word passes this filter, the second word for the pair is sought. All other words are randomly ordered and we filter them in exactly the same way until one of them passes or until no words are left. The probability for the second word to pass is calculated in exactly the same way as for the first one. If no pair is created, we continue with another word. In each cycle, we have to recalculate the probabilities because the frequencies of pairs have changed. We repeat the cycle until we have generated as many pairs of keywords as we want.
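The generation procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the program used for the experiments: the names `average_frequencies` and `generate_pairs`, the representation of a pair as a `frozenset` of two words, the order in which words take their turn within a cycle, and the toy starting frequencies are all hypothetical. The sketch also assumes that the starting dictionary contains at least one pair with a positive frequency.

```python
import random
from collections import defaultdict


def average_frequencies(pair_freq):
    """AvFr per word: mean frequency of the existing pairs containing it."""
    total = defaultdict(int)
    count = defaultdict(int)
    for pair, freq in pair_freq.items():
        for word in pair:
            total[word] += freq
            count[word] += 1
    return {word: total[word] / count[word] for word in total}


def generate_pairs(pair_freq, n_new, rng=None):
    """Return a copy of pair_freq with n_new generated co-occurrences added."""
    rng = rng or random.Random()
    pair_freq = dict(pair_freq)  # do not modify the caller's data
    added = 0
    while added < n_new:
        # Probabilities are recalculated once per cycle.
        av_fr = average_frequencies(pair_freq)
        max_av_fr = max(av_fr.values())
        prob = {w: f / max_av_fr for w, f in av_fr.items()}
        words = list(prob)  # only words that already take part in a pair
        for first in words:
            if added >= n_new:
                break
            if rng.random() >= prob[first]:
                continue  # the first word does not pass the filter
            others = [w for w in words if w != first]
            rng.shuffle(others)
            for second in others:
                if rng.random() < prob[second]:
                    key = frozenset((first, second))
                    pair_freq[key] = pair_freq.get(key, 0) + 1
                    added += 1
                    break
    return pair_freq


# Hypothetical starting state with two existing pairs.
start = {
    frozenset(("model", "inflation")): 3,
    frozenset(("model", "money")): 1,
}
result = generate_pairs(start, 50, random.Random(0))
```

Because the word with the highest average frequency always passes the filter, words that are already successful attract further pairs, which is the cumulative advantage the text refers to.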
10 Let us imagine an analogous problem: how could we identify villages that a giant has moved to different places on maps from various eras of the Empire of Lilliput? We would have to look at the houses that form the villages, and if we could find more than, for example, 50 % of the same houses, it would be the same village.
11 The number of a theme is only a technical mark created by the Lexidyn program itself according to the most important keyword. That is why the numbers of themes in a chain can differ according to changes in the importance of keywords (we have no such case here, but we will find one later in the text). A theme can be interpreted only by an expert from the specific scientific field.
Results
In the real evolution of the scientific field “economics”, the number of co-occurrences increases by 4486 between the states described by Fig. 1 and Fig. 2, and by another 6407 between the states described by Fig. 2 and Fig. 3. We take the real co-occurrences that are behind the strategic diagram in Fig. 1 and add to them 4486 new pairs generated according to the algorithm described above. The resulting strategic diagram is in Fig. 4; it describes the generated state of the field for 1997 and 1998. Then we add another 6407 generated pairs. The resulting strategic diagram is in Fig. 5; it describes the generated state of the field for 1997, 1998 and 1999.
Fig. 4: Strategic diagram describing the generated state of the field for 1997 and 1998
Fig. 5: Strategic diagram describing the generated state of the field for 1997, 1998 and 1999, if we start to generate from the real state for 1997
Let us now compare the real state of the scientific field for 1997 and 1998 in the strategic diagram in Fig. 2 with the generated state of the scientific field in Fig. 4.¹² Common themes are:
2 - 2
17 - 3
13 - 13
Three of the real themes have three or more keywords in common with three generated themes, and the similarity of their positions in the strategic diagrams is almost perfect.
Let us now compare the real state for 1997, 1998 and 1999 in the strategic diagram in Fig. 3 with the generated state in Fig. 5. Common themes are:
3 - 11
Just one of the real themes has more than three keywords in common with a generated theme, and no similarity of positions in the strategic diagrams is visible.
Because of the stochastic character of the whole process, the correspondence of the real and generated states can be expected to diminish with the distance from the starting point. We therefore expect that if we start to generate new pairs from the real state for 1997 and 1998 (Fig. 2), the correspondence of the real and generated states will be better than in the previous case. The strategic diagram describing the generated state of the field for 1997, 1998 and 1999, if we start to generate from the real state for 1997 and 1998, is in Fig. 6.
12 Our method of comparison is the same as the one we used for comparing the maps in Figures 1, 2 and 3.
Fig. 6: Strategic diagram describing the generated state of the field for 1997, 1998 and 1999, if we start to generate from the real state for 1997 and 1998
Let us now compare the strategic diagrams in Fig. 3 and Fig. 6. Common themes are:
2 - 1
13 - 13
17 - 17
5 - 5
Four of the real themes have three or more keywords in common with four generated themes, and the similarity of their positions in the strategic diagrams is almost perfect.
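The rule used in these comparisons, that two themes on different maps are treated as the same theme when they share at least three keywords, can be sketched as follows. The function name `match_themes` is hypothetical. The keywords of the first map are those of Theme 1 and Theme 2 from Fig. 1; the second map is invented purely for illustration and does not reproduce any of the diagrams in this article.

```python
def match_themes(map_a, map_b, threshold=3):
    """Pair themes from two maps that share at least `threshold` keywords."""
    matches = []
    for name_a, words_a in map_a.items():
        for name_b, words_b in map_b.items():
            common = set(words_a) & set(words_b)
            if len(common) >= threshold:
                matches.append((name_a, name_b, sorted(common)))
    return matches


# Themes of the first period (keywords as listed for Fig. 1).
first_map = {
    "1": {"model", "unemployment", "innovation", "inflation",
          "money", "monetary policy", "dynamics", "life cycle"},
    "2": {"markets", "games", "tiebout", "oligopoly",
          "equilibrium", "existence", "information", "market"},
}
# Hypothetical later map, invented for this example only.
later_map = {
    "2": {"equilibrium", "games", "oligopoly", "information",
          "auctions", "contracts", "bargaining", "entry"},
    "5": {"model", "inflation", "growth", "taxation",
          "trade", "welfare", "prices", "wages"},
}
matches = match_themes(first_map, later_map)
```

Here only theme 2 of the first map is matched: it shares four keywords with theme 2 of the invented map, whereas theme 1 shares only two keywords with theme 5, which is below the threshold.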
Conclusions
Our results show that the correspondence of the co-occurrences generated by our algorithm with those present in reality is satisfactory in most cases; the correspondence of Figures 2 and 4 is almost perfect. In our opinion, this proves the logical hypothesis that the law of cumulative advantages is present in the development of scientific fields.
Another question is whether this knowledge could be used for predicting the development of scientific fields. Such predictions are important for decisions about the allocation of resources among different scientific themes. In [3], some empirical knowledge¹³ about the dynamics of scientific fields and scientific themes is given. This knowledge is based on the comparison of maps of one scientific field obtained by co-word analysis in single periods, and it is an open question how relevant this knowledge is if we work with maps that cumulate articles.

13 Concerning the evolution of a scientific field, we can find some arguments that the evolution of the intensity of research activity (number of publications) during the life-span of a field is correlated with some patterns of concentration of research themes in a strategic diagram. In the first and last stages, the themes are commonly concentrated in the second and fourth quadrants. In the second and fourth stages, the themes are dispersed in all four quadrants. In the third stage, the concentration is in the first and third quadrants. Concerning the evolution of themes, the following statements can be made:
1. Themes that live for more periods often survive to further periods.
2. Themes that have had an interesting evolution survive more often than themes with simple dynamics.
3. Themes from the first and second quadrants survive more often than themes from the third or fourth quadrants.
4. One can see the tendency of the themes from the second quadrant to go to the first quadrant. This development is not at all surprising: the themes that are central are interesting for the field and thus have a tendency to be elaborated.
5. Themes from the fourth quadrant mostly come into the field already elaborated in another research field. If this spring-in succeeds, that is, if the connections of such a theme to its new field are so interesting that they are elaborated, then such a theme becomes central in further periods. But most of the themes from the fourth quadrant leave the field in the next period. Those are the springs-in that researchers do not consider interesting.
6. Themes from the first quadrant that will not survive can make another research field richer, or their development can continue (be hidden) in applications.

References
[1] Beckmann, M. - Persson, O.: The Thirteen Most Cited Journals in Economics. Scientometrics, Vol. 42, No. 2, 1998.
[2] Cahlik, T.: Comparison of the maps of science. Scientometrics, Vol. 49, No. 3, 2000.
[3] Cahlik, T.: Search for fundamental articles in economics. Scientometrics, Vol. 49, No. 3, 2000.
[4] Cahlik, T. - Jirina, M.: Scientometric Analysis of Artificial Neural Networks Scientific Field. Neural Network World, 1997, No. 2.
[5] Cahlik, T. - Jirina, M.: Knowledge Restructuring during Scientific Field Development. In: Proceedings of the Workshop on Artificial Intelligence Techniques, Brno 1996.
[6] Callon, M. - Law, J. - Rip, A.: Mapping the Dynamics of Science and Technology. MacMillan, London, 1986.
[7] Courtial, J.P. - Cahlik, T. - Callon, M.: A Model for Social Interaction Between Cognition and Action through a Key-word Simulation of Knowledge Growth. Scientometrics, Vol. 31, No. 2, 1994.
[8] Garfield, E.: Bradford’s Law and Related Statistical Patterns. Current Contents 19: 5-12, May 12, 1980.