Maps and Mapping in Scientometrics

Veslava Osinska (1), Piotr Malak (1)

(1) Institute of Information Science and Book Studies, Nicolaus Copernicus University, 87-100 Torun, Poland.

Abstract

Current scientometric research deals with the evaluation of scientists' productivity, the prediction of their career trajectories and the impact of funding decisions on the evolving structure of the academic community. Maps of science, or knowledge domain maps, are important scientometric tools. They are created on the basis of scholarly bibliographic data and support the detection and visualisation of emerging trends and transient patterns in scientific literature. Alongside quantitative methods based on mathematical and statistical analysis, graphical representations aim to reveal the multilevel structure of science. This structure consists of social, topical, geographical, economic and political relationships between mapping units at both the micro scale (scientists) and the macro scale (research centres, cities, countries). The article also introduces the specifics of data extraction and processing, including conversion to the format required for final visualisation and analysis. The variety of charts and diagrams presented can help in a quick evaluation of the dynamics of digital libraries and of scientific collaboration and productivity in Poland.

Keywords: science mapping, scientometrics, visualisation of science.

This paper is sponsored by the National Science Center (NCN) under grant 2013/11/B/HS2/03048 "Information Visualization methods in digital knowledge structure and dynamics study".

I. Introduction to science mapping

Human knowledge, as a very complex system, evolves nonlinearly and unpredictably. In scientometrics, science mapping is widely used to discover how different domains develop, disappear, coexist, integrate and overlap. The maps are constructed by measuring the common attributes of scientific documents and presenting their relationships in two- or three-dimensional space. Co-citations and co-authorship are predominantly used to find correlations between articles or authors. Key science indicators, such as domain category or class, may specify the structure of a knowledge domain. E. Garfield called such visualisation layouts scientographs, and the underlying methodology, which has its origins in computer science, scientometrics and complex networks, scientography [1]. Another term, proposed by Ch. Chen and used interchangeably with scientography, is Knowledge Domain Visualization [2]. If scientographs are produced periodically, essential structural changes can be captured and the evolution of a scientific domain revealed.

[1] E. Garfield. Scientography: Mapping the tracks of science. Social and Behavioural Sciences 1994; 7(45): 5-10.
[2] Ch. Chen. Information Visualization. Beyond the Horizon. 2nd ed. London: Springer, 2006, pp. 143-170.

Researchers who analyse these maps can predict emerging trends in disciplines, find prominent scientists and their works, or track collaborative networks. Analysing a series of maps in chronological order was described in 1994 by E. Garfield as longitudinal mapping [3]. During the last two decades a variety of mapping tools and techniques has emerged. This, together with the widespread availability of open data, gives researchers the chance to visualise scientific domains on a large scale. One such tool, developed by Ch. Chen, is the Java-based CiteSpace software, which supports the analysis of word co-occurrence as well as co-citation analysis [4]. The evolution of knowledge maps can be presented as time slices for selected periods [5][6][7].

Polish bibliometrics faces a barrier to mapping science at the national level: scholarly databases are dispersed across separate domains and specialised fields. Any academic initiative aiming to centralise digital resources is therefore tied to the expectation that it will yield a unified landscape of Polish science. The Digital Libraries Federation (DLF) [8] offers the most complete resources of Polish digital libraries, archives and museums through a usable interface. DLF, maintained by the Poznan Supercomputing and Networking Center [9], was started in 2007. The purpose of this web service is to gather, process and disseminate the digital collections of Polish cultural and scientific institutions. DLF shares meta-data with Europeana and similar digital repositories, thus contributing to the promotion of Polish national heritage in the world.

Within the project "Information Visualization methods in digital knowledge structure and dynamics study", the authors attempt to analyse the dynamics and structure of social science and humanities articles using correlations in DLF metadata. In the current paper we present the mapping methodology, the assumptions of the experiment, the data specification and processing details, as well as initial results.

II. Methodology of Science Mapping

The following quotation clearly reflects the nature of science maps: "Mapping of science is a kind of graph which display the development and structure of science, it is the scientometrics' production which from maths expression to figure, the result that knowledge from geography distributing map to visualizing knowledge structure and evolution disciplinarian…" [10].

Meta-data from a scientific dataset can serve as the units of analysis for visualising science or a particular scientific domain. In most cases citations are considered the basic research material for mapping purposes. Ordination algorithms [11] place similar objects close to each other in the graph and dissimilar objects far apart. The similarity, and therefore the distance, between documents is calculated on the basis of the features they have in common: authors, citations, keywords, categories or classes [12], and terms from the text or abstracts. Quantitative estimation of common features, or simply co-occurrences, constitutes the principle of science visualisation.

The process of mapping consists of five main phases [13], as illustrated in Figure 1. The first, data gathering, is time-consuming and takes 60-80% of the whole process duration. Phases 2 and 3 concern data processing and preparation in the specified format. These steps also include the important extraction of data relevant to the subject of the study, together with sorting and counting of feature pairs. Depending on data quality, they take approximately 20-30% of the process time. Next, the data must be mapped onto a two- or three-dimensional observation space; specific ordination algorithms are used at this step to reduce the dimensionality of large-scale data. The last phase focuses on improving the usability of the visual layout by matching glyphs and colours, constructing a legend and adding data manipulation functions such as sorting, filtering, zooming and other facilities.

[3] E. Garfield, op. cit.
[4] Ch. Chen. (2004) CiteSpace: Visualizing Patterns and Trends in Scientific Literature, http://cluster.cis.drexel.edu/~cchen/citespace/, accessed 2 March 2015.
[5] Ch. Chen. (2006) CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377.
[6] Y. Liang et al. Knowledge Mapping of Citation Analysis Domains. H. Kretschmer and F. Havemann (Eds.): Proceedings of WIS 2008, Berlin Fourth International Conference on Webometrics, Informetrics and Scientometrics & Ninth COLLNET Meeting Humboldt-Universität zu Berlin, Institute for Library and Information Science (IBI).
[7] V. Osinska and P. Bala. (2010). New Methods for Visualization and Improvement of Classification Schemes – the case of Computer Science. Knowledge Organization, 37(3).
[8] O FBC [on-line]. http://fbc.pionier.net.pl/pro/informacje-ogolne/.
[9] http://www.man.poznan.pl/online/en/

Figure 1. The five phases of the mapping process.
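The co-occurrence principle described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the document identifiers, the feature sets and the choice of the Jaccard coefficient as the similarity measure are assumptions made purely for illustration, and the sketch is written for Python 3.

```python
from itertools import combinations

# Hypothetical documents, each described by a set of features
# (authors, keywords, cited works); all values are invented.
documents = {
    "doc_a": {"kw:mapping", "kw:citation", "au:Kowalski"},
    "doc_b": {"kw:mapping", "kw:citation", "au:Nowak"},
    "doc_c": {"kw:archives", "au:Nowak"},
}


def jaccard(features_1, features_2):
    """Share of common features among all features of two documents."""
    union = features_1 | features_2
    if not union:
        return 0.0
    return len(features_1 & features_2) / len(union)


def similarity_table(docs):
    """Return {(doc_1, doc_2): similarity} for every unordered pair."""
    return {
        (first, second): jaccard(docs[first], docs[second])
        for first, second in combinations(sorted(docs), 2)
    }


similarities = similarity_table(documents)
# An ordination (layout) step could use 1 - similarity as the
# target distance between two documents on the map.
distances = {pair: 1.0 - value for pair, value in similarities.items()}
```

Documents that share more features obtain a smaller target distance, which is the property an ordination algorithm then tries to preserve in two or three dimensions.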

III. Dataset

a) Data characteristics

[10] Y. Chen and Z. Liu (2005). The rise of mapping knowledge domain. Studies in Science of Science 23(2), 149-154.
[11] K. Boerner. JASIST 2003.
[12] V. Osinska. Knowledge Organisation 2010.
[13] V. Osinska. Zagadnienia Informacji Naukowej 2010.

DLF provides meta-data as XML files. The number of obtained records equals N = 171,000,000. Data derived from distinct digital libraries are identified by the full domain name, for example FBC_bc.wbp.lublin.pl or FBC_dlibra.biblioteka.tarnow.pl, which allows easy identification of the data provider. An additional means of identification is a field storing the OAI [14] locator of the particular resource. Each library or repository exposed in DLF is represented by a separate XML file. At the time of preparing this article, data from 101 libraries were available on the DLF site, and thus we received 101 files. All the meta-data records provided by DLF are built for Dublin Core compatibility, with internally embedded CDATA fields, as presented in the following code:

There are also additional Europeana Semantic Elements Specification [15] tags, mostly reflecting copyright data. These are outside the scope of the present research, so they were not used in further processing. In line with our research goal we extracted only part of the data from each DLF metadata record, namely the fields containing the main elements of resource description (formal and subject). To facilitate further analysis, all significant data from every data source were joined into one research corpus that stores each record in 13 fields, including an internal ID number. The resulting file was a text file in CSV format. From the DLF data we prepared a list of co-author pairs, which was then compared with an affiliation list. This second, junction table, in Excel format, contains the names and affiliations of all researchers and scholars in Poland. The authors' list was filtered against the affiliation records, and matching entities were used to map the locations of co-authors and thus to create the cooperation and research map of Polish scholars (one-to-one relation).

[14] OAI - Open Archives Initiative Protocol for Metadata Harvesting, http://www.openarchives.org/pmh/
[15] Europeana Semantic Elements Specification, http://pro.europeana.eu/share-your-data/data-guidelines/esedocumentation
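The flattening of one Dublin Core record into a 13-field CSV row, as described above, might look roughly like the following Python 3 sketch. The sample record, its layout (a `header` with a `datestamp` and a `metadata` block) and the helper name `record_to_row` are hypothetical; the actual DLF files may be structured differently, and the paper's own script was written for Python 2.7.9. Only the Dublin Core namespace URI and the standard-library calls are taken as given.

```python
import csv
import io
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace

# Invented record; the real DLF file layout may differ.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <header><datestamp>2014-05-01</datestamp></header>
  <metadata>
    <dc:identifier>oai:example.org:123</dc:identifier>
    <dc:title><![CDATA[Sample title]]></dc:title>
    <dc:creator><![CDATA[Kowalski, Jan]]></dc:creator>
    <dc:creator><![CDATA[Nowak, Anna]]></dc:creator>
    <dc:language>pol</dc:language>
  </metadata>
</record>"""

# Ten descriptive fields; with the internal ID, the OAI identifier and
# the datestamp they give 13 values per record.
FIELDS = ["title", "creator", "contributor", "subject", "language",
          "date", "description", "type", "coverage", "source"]


def record_to_row(xml_text, internal_id):
    """Flatten one record into a list of 13 strings."""
    root = ET.fromstring(xml_text)

    def values(name):
        # CDATA sections are returned by the parser as ordinary text.
        return [(el.text or "").strip() for el in root.iter(DC + name)]

    oai_ids = [v for v in values("identifier") if v.startswith("oai:")]
    row = [str(internal_id), " | ".join(oai_ids),
           root.findtext("header/datestamp", default="")]
    row.extend(" | ".join(values(name)) for name in FIELDS)
    return row


row = record_to_row(SAMPLE, 1)
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue()
```

Because author names contain commas, the `csv` module quotes the creator field automatically, which keeps the 13-column structure intact when the corpus file is read back.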

b) Data processing and tools

For data processing the Python 2.7.9 scripting language was used [16]. From our earlier experiments we know Python to be a very fast and efficient language for pre-processing text documents. It is worth noting that Google uses this scripting language. There is also the dedicated Natural Language Toolkit (NLTK), a platform for building human language processing systems [17]. The pre-processing phase consisted of a few steps devoted to cleaning up the text layer of the retrieved documents and creating the corpora. As the source data originated from different digital libraries, there were some differences in the formats used for storing the data. Most digital libraries in Poland use dLibra as their working platform, while some, mainly those of technical universities, rely on their own in-house solutions and systems. The differences included, for example, the use of the '”' or '|' characters to delimit the content of different meta-data fields. A few source files also used UNIX-like end-of-line (EOL) indicators (LF, line feed, instead of the Windows CRLF, carriage return plus line feed), which had to be handled in the source code of the pre-processing script. All the meta-data files used UTF-8 encoding, while the affiliation list was delivered in Windows-1250 encoding ('cp1250' in Python).

For corpora creation we used only a few of the available meta-data fields, those relevant to our further analysis: 'dc:identifier>oai:', 'dc:title', 'dc:creator', 'dc:contributor', 'dc:subject', 'dc:language', 'dc:date', 'datestamp', 'dc:description', 'dc:type', 'dc:coverage', 'dc:source'. Python offers an extremely efficient tool for extracting data from structured sets, the list comprehension [18], which we used to extract all fields of interest from the source files, as in this example code:

    title = ' | '.join([field[1] for field in element if field[0].startswith('dc:title')]).strip()

The list comprehension is the part enclosed in square brackets. Relying on list comprehensions facilitated the whole process of corpora creation by extracting the indicated data very quickly and formatting it in a deliverable form in the corpora file. Unlike the majority of modern databases, document meta-data allows data fields to be repeated (a feature used e.g. in CDS ISIS). This feature of the meta-data language may cause problems in automatic data processing, since only the last repetition of a field may be passed on to further analysis. In order to have access to all information stored in repeated data fields, we joined the content of such fields using '|' as the separator (hence the earlier removal of this character from the data), expressed as the ' | '.join() call in our code.

[16] Python Software Foundation: https://www.python.org/
[17] NLTK 3.0 documentation: http://www.nltk.org/
[18] List comprehension: https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
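The pre-processing steps described in this subsection (decoding UTF-8 and Windows-1250 input, normalising line endings, and joining repeated fields with ' | ') can be sketched as follows. This is an illustration in Python 3 rather than the Python 2.7.9 script used in the study; the helper names `decode_source` and `join_repeated`, the tab-separated sample lines and the sample names are all hypothetical.

```python
def decode_source(raw_bytes, encoding):
    """Decode bytes and normalise CRLF / CR line endings to LF."""
    text = raw_bytes.decode(encoding)
    return text.replace("\r\n", "\n").replace("\r", "\n")


def join_repeated(pairs, prefix, separator=" | "):
    """Join the values of every field whose name starts with prefix.

    A literal '|' inside a value is replaced by a space first, so the
    separator stays unambiguous in the corpus file.
    """
    values = [value.replace("|", " ").strip()
              for name, value in pairs if name.startswith(prefix)]
    return separator.join(values)


# Invented sample input: meta-data in UTF-8 with Windows line endings,
# an affiliation line in Windows-1250 ('cp1250').
metadata_bytes = ("dc:creator\tŻak, Ewa\r\n"
                  "dc:creator\tNowak, Jan\r\n").encode("utf-8")
affiliation_bytes = "Żak, Ewa;UMK\n".encode("cp1250")

metadata_text = decode_source(metadata_bytes, "utf-8")
affiliation_text = decode_source(affiliation_bytes, "cp1250")

# (field name, value) pairs, the shape the list comprehension iterates over.
element = [tuple(line.split("\t")) for line in metadata_text.splitlines()]
creators = join_repeated(element, "dc:creator")
```

Decoding both sources to Unicode text before any comparison is what makes a later match between a UTF-8 author name and a cp1250 affiliation entry possible.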

Once the corpora were ready, we could attempt to extract author information from them. For the 1,700,000 records available through DLF there were 102,000 distinct author names in the meta-data, forming 39,116 different author sets (of two or more authors). Several processing issues appeared at this stage. One was that for some publications an institutional author was given, such as “Urząd Miasta Biała Podlaska. Wydział Strategii i Rozwoju”; such entries should be omitted on the basis of the missing comma (with further manual confirmation of the resulting file content). Another problem, from the point of view of our analysis, was that some libraries consistently gave only the initial of the first name instead of the full name. Although this is an accepted general practice, an initial is not a sufficiently distinguishing value for identifying authors. A further step was to select only Polish authors from the list, as our research concerns only resources of Polish origin.
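The two name-related checks mentioned above (no comma suggests an institutional author; a given name consisting only of initials is not distinctive enough) can be expressed as a small classifier. The sketch below is a Python 3 illustration under the assumption that personal names are recorded as "Surname, Given names"; the function name, the labels and the sample names other than the institutional one quoted in the text are hypothetical, and the result would still need the manual confirmation the text describes.

```python
import re

# One or more single letters, each followed by a full stop: "J." or "J. K."
INITIALS_ONLY = re.compile(r"^(?:[^\W\d_]\.\s*)+$")


def classify_author(name):
    """Return 'institution', 'initials' or 'person' for a creator string."""
    if "," not in name:
        return "institution"
    surname, given = (part.strip() for part in name.split(",", 1))
    if not given or INITIALS_ONLY.match(given):
        return "initials"
    return "person"


names = [
    "Urząd Miasta Biała Podlaska. Wydział Strategii i Rozwoju",
    "Kowalski, J.",
    "Kowalski, Jan",
]
labels = [classify_author(name) for name in names]
```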

IV. Statistical Analysis

a) Quantitative analysis

The dynamics of Polish digital resources, the basic characteristic of a digital platform, is presented in Figure 2. In the initial period of Polish digitisation only a few documents were added (3 in 1995, one in the following year); the process has sped up in the last few years.

Figure 2. FBC documents gain per year (vertical axis: documents quantity, 0 to 700,000).

We combined a few lists and matched them against the retrieved authors' list, obtaining 59,517 individual authors (of whom 37,300 occurred only once, meaning that only one title by the given author was available) and 57,987 distinct (unique) pairs of authors (Figure 3). The pairs were created on the basis of actual co-authorship of the available publications. We were able to provide affiliation data for both authors in only 15,680 pairs (27%). Of the remaining 73% of author sets, for 22,083 (38% of the whole set) we could provide an affiliation for one of the authors, and 20,224 pairs had no affiliation available at all. This reflects the fact that not every author present in a digital library creates academic content. For our further mapping purposes we used only the list of fully affiliated pairs.
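The derivation of unique co-author pairs and their split into fully, partly and non-affiliated pairs can be sketched in a few lines of Python 3. The publication list and the affiliation table below are invented, and the variable names are hypothetical; the sketch only illustrates the counting logic, not the authors' actual script.

```python
from collections import Counter
from itertools import combinations

# Invented records: each publication is a list of author names.
publications = [
    ["Kowalski, Jan", "Nowak, Anna"],
    ["Kowalski, Jan", "Nowak, Anna", "Zieliński, Piotr"],
    ["Wójcik, Ewa"],
]
# Invented affiliation table (author -> institution).
affiliations = {"Kowalski, Jan": "UMK", "Nowak, Anna": "AGH"}

# Count joint publications for every unordered pair of co-authors.
pair_counts = Counter()
for authors in publications:
    for pair in combinations(sorted(set(authors)), 2):
        pair_counts[pair] += 1


def coverage(pair):
    """Say whether both, one or none of the two authors has an affiliation."""
    known = sum(author in affiliations for author in pair)
    return {2: "both", 1: "one", 0: "none"}[known]


coverage_counts = Counter(coverage(pair) for pair in pair_counts)
# Only fully affiliated pairs would be kept for mapping.
mappable_pairs = [pair for pair in pair_counts if coverage(pair) == "both"]
```

Sorting the names inside each pair makes (A, B) and (B, A) the same key, so every pair is counted once regardless of author order in the record.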

Figure 3. Publications quantity among co-authors number (vertical axis: co-authors number, 2 to 7; horizontal axis: publications quantity, 0 to 2500).

b) Distribution patterns of publication quantity

The number of individual authors from Poland in the main table equals N = 59,517. Sorting and grouping provided information about the number of publications per author. The authors tested statistical laws such as the Pareto and power laws [19]. The Pareto diagram is shown in Figure 4. We can see that the 80% square fits the first two columns of the histogram. These cover about 45,000 authors, slightly fewer than 80% of all authors (≈48,000); the fitting error is 4.6%.

Figure 4. Pareto diagram for publications quantity distribution.

[19] L. A. Adamic (2000). Zipf, Power-laws, and Pareto - a ranking tutorial. Information Dynamics Lab, HP Labs Palo Alto. http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
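A Pareto-style check of the kind described above can be approximated by computing the cumulative share of authors over increasing publication counts and reading off where it reaches 80%. The following Python 3 sketch uses ten invented authors; it is a generic illustration and makes no claim to reproduce how the diagram in Figure 4 was built.

```python
from collections import Counter


def cumulative_author_share(publication_counts):
    """Return [(n_publications, cumulative share of authors)], sorted by n."""
    histogram = Counter(publication_counts)
    total = len(publication_counts)
    running = 0
    shares = []
    for n_publications in sorted(histogram):
        running += histogram[n_publications]
        shares.append((n_publications, running / total))
    return shares


# Invented data: publications per author for ten authors.
counts = [1] * 6 + [2] * 2 + [5, 40]
shares = cumulative_author_share(counts)

# Smallest publication count at which at least 80% of authors are covered.
threshold = next(n for n, share in shares if share >= 0.8)
```

In this toy sample the first two histogram columns (one and two publications) already cover 80% of the authors, which mirrors the shape of the statement made for the DLF data.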

The power law is a well-known generalisation of the Pareto rule in statistics. We also obtained a good fit of the data to a power law, presented in two log-log plots in Figure 5. The number of papers versus the number of authors is distributed according to a power law with high accuracy: chi-square = 0.0002. The fitting equation is $y = 0.7843\,x^{-2.161}$. This regularity implies that small publication counts (for example 1 to 10) are extremely common among the population of authors, whereas large ones (more than 500 publications) occur rarely. Only five authors have an extremely large and unrealistic number of articles, in the range 1000 – 2000. These distant points can disturb the statistical pattern. To keep the distribution of good quality, statisticians treat such data as measurement errors (outliers) and remove them from the dataset [20].
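One simple way to estimate the parameters of a power law $y = a\,x^{b}$ is ordinary least squares on the log-transformed values, since $\log y = \log a + b \log x$ is a straight line on a log-log plot. The Python 3 sketch below applies this to noise-free synthetic points generated from the reported equation; it is an assumption that a fit of this general kind was used, as the text only reports the resulting equation and the chi-square value, and for noisy empirical data maximum-likelihood estimators are often preferred.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y)
              for lx, ly in zip(log_x, log_y))
    b = sxy / sxx                        # slope on the log-log plot
    a = math.exp(mean_y - b * mean_x)    # intercept back-transformed
    return a, b


# Synthetic, noise-free points generated from the reported equation.
xs = [1, 2, 5, 10, 20, 50, 100]
ys = [0.7843 * x ** -2.161 for x in xs]
a, b = fit_power_law(xs, ys)
```

On these synthetic points the fit recovers the coefficient and the exponent of the reported equation up to floating-point error; real count data would first need outliers such as the five extreme authors to be handled.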

V. Visual Analysis

Mapping methodology is often used for the analysis of scientific collaboration. The procedure relies on quantitative estimation of co-authorship or of authors' co-citation [21]. In the latter case the clusters can indicate new and traditional schools, significant research trends and, moreover, emerging paradigms in a scientific domain [22].

a) Map of scientific collaboration

Co-authorship analysis showed that the initial dataset includes 5354 groups of authors: $\sum_{i=2}^{7} n_i = 5354$, where the index $i$, the number of persons in a group, varies from 2 to 7 and $n_i$ denotes the number of groups of size $i$. Figure 4 illustrates that among all numeric configurations … Further processing relies on summarising the mutual relations between creators. From the dataset, 57986 available pairs of co-authors have been extracted. Only seven pairs of authors produced more than 100 publications (in the range 100 – 650). To simplify the large-scale analysis, the dataset was limited to pairs with more than 6 common publications. This is often done in scientometric mapping when single co-occurrences of the measured units are very frequent [23]. The final collaboration network, formed by 235 pairs of authors, is presented in Figure 6. This network visualisation clearly shows that Polish scientists prefer to work in small groups: pairs or trios. Significantly larger teams of co-workers are on the periphery of the layout (Fig. 6, right) and are not representative of collaboration practice in Poland. A detailed study of affiliations within the larger scientific groups gives a view of the links between Polish universities and institutions (Fig. 6, left).

Figure 5. Log-log scale plot of the distribution of publications among authors (left; axis labels: number of papers, number of authors). The same plot for the proportional number of authors (right; axis labels: proportion of papers quantity, number of authors).

Figure 6. Map of co-authorship with the number of common publications above 6 (right). One of the clusters with schools identification (left).

b) Geographical map of cooperation between schools

The next step concerns the analysis of cooperation between scientific and educational institutions by translating co-authorship into affiliation data. The junction table with the covered affiliations was used to match records by the first name and surname of the creators. Only one third of DLF authors (20224 from 57987) work in scientific or educational institutions, which is reflected in the matching procedure. Their records are in the Ministerial table of Polish scientists and teachers, and their output is regularly evaluated. The rest of the co-authors are employed in institutions such as libraries, museums, archives and similar, i.e. GLAM [24]. The further analysis was therefore made for the ties between researchers according to the affiliations found. Collaborating pairs of schools show a regularity: most teams were formed within the same academic institution. The top-ranked schools by number of publications are presented in Table 1. Researchers of Nicolaus Copernicus University (UMK) are very active in collaborations inside their Alma Mater. Figure 7, in turn, presents scientific collaboration between cities in Poland. The Warszawa node is the largest because the majority of articles in the dataset were authored by researchers from this city. The strength of collaboration ties is indicated by the thickness of the edges. The strongest ties, in order, are: Warszawa – Kraków, Warszawa – Poznań, Warszawa – Wrocław, Warszawa – Toruń.

[20] M.E.J. Newman. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46, 323. Available at: http://arxiv.org/pdf/cond-mat/0412004.pdf.
[21] Ch. Chen. Information Visualisation. Beyond the horizon. 2nd ed. London: Springer, 2006, pp. 143-170.
[22] V. Osińska. Wizualizacja paradygmatów w nauce. Zagadnienia naukoznawstwa 2012.
[23] O. Person. Celebrating Scholarly Communication Studies. E-newsletter of ISSI 2009. Available on the Web: http://issi-society.org/ollepersson60/ollepersson60.pdf.

Figure 7. Geographical map of collaboration between schools and institutions in Poland according to Digital Libraries Federation data.
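The two steps behind the maps in this section, keeping only pairs with more than 6 common publications and then translating author pairs into ties between cities, can be sketched as follows in Python 3. All names, counts and the author-to-city lookup are invented, the function name `city_edges` is hypothetical, and weighting an edge by the sum of joint publications is an assumption; the text only states that edge thickness reflects the degree of collaboration.

```python
from collections import Counter

# Invented input: co-author pair -> number of joint publications.
pair_counts = {
    ("Kowalski, Jan", "Nowak, Anna"): 12,
    ("Kowalski, Jan", "Wójcik, Ewa"): 3,
    ("Nowak, Anna", "Zieliński, Piotr"): 9,
    ("Wójcik, Ewa", "Zieliński, Piotr"): 7,
}
# Invented affiliation lookup: author -> city of the institution.
author_city = {
    "Kowalski, Jan": "Toruń",
    "Nowak, Anna": "Warszawa",
    "Wójcik, Ewa": "Warszawa",
    "Zieliński, Piotr": "Kraków",
}

MIN_JOINT_PUBLICATIONS = 6  # a pair is kept only if it exceeds this value


def city_edges(pairs, cities, threshold):
    """Aggregate thresholded co-author pairs into weighted city-city edges."""
    edges = Counter()
    for (first, second), joint in pairs.items():
        if joint <= threshold:
            continue  # too few common publications
        if first not in cities or second not in cities:
            continue  # pair is not fully affiliated
        edge = tuple(sorted((cities[first], cities[second])))
        edges[edge] += joint
    return edges


edges = city_edges(pair_counts, author_city, MIN_JOINT_PUBLICATIONS)
```

Pairs of authors from the same city would produce an edge from a city to itself; such loops correspond to the intra-institutional collaboration listed in Table 1 and can be drawn as node size rather than as an edge.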

VI. Summary and conclusion

The current paper aims to present the processing and visual analysis of Polish digital libraries data in scientometric terms. The authors introduced the context, assumptions and principles of a new methodology for mapping science or scientific domains. This type of visual representation manifests conceptualisations and models of scholarly activity and helps improve our understanding of the structure and dynamics of science [25].

Table 1. Intrinsic collaboration of researchers within Polish schools sorted by publications quantity (over 70).

School 1           School 2           Publ. quantity
UMK                UMK                826
IMP im. Nofera     IMP im. Nofera     582
AGH                AGH                461
PK                 PK                 318
IERiGŻ             IERiGŻ             231
UTP, Bydgoszcz     UTP, Bydgoszcz     184
                                      169
UAM                UAM                168
PW, Wrocław        PW, Wrocław        153
AWF, Poznań        AWF, Poznań        149
UKW, Bydgoszcz     UKW, Bydgoszcz     149
PW, Warszawa       PW, Warszawa       118
PL, Lublin         PL, Lublin         111
UM, Wrocław        UM, Wrocław        109
SGGW               SGGW               85
PG, Gdańsk         PG, Gdańsk         78

[24] GLAM is an acronym for "galleries, libraries, archives, and museums". GLAM are publicly funded, publicly accountable institutions collecting cultural heritage materials (Wikipedia).
[25] K. Börner and A. Scharnhorst. 2009. Visual Conceptualizations and Models of Science. Journal of Informetrics, 3(3).

Such visual landscapes of bibliographic and bibliometric data can help identify currently popular research topics, becoming so-called topic maps. Co-citations and/or bibliographic coupling are often used in the visual analysis of scientific domains through mapping. We can also study co-authorship and thus draw conclusions about collaboration ties in the academic community and about formal links between distinct centres and institutions. Altmetrics, as a new tool for measuring scientific communication, extends mapping applications in traditional scientometrics and science of science studies [26].

Using DLF data, the authors analyse the social structure of the scientific community and the collaboration relations between institutions. The quantitative diagram (Fig. 3) shows that the trio of researchers is the most frequent grouping in the dataset. Perhaps this is consistent with the common configuration of grants in Poland, as researchers use local or national digital libraries in order to promote their own research effectively. The mapping of co-authorship relations onto affiliating institutions is not complete, because the ministerial dataset does not cover all authors publishing in DLF; some of them are employed outside typical scientific institutions. Polish scientists prefer to work in small groups, and in groups formed within their home schools (Table 1). Larger partnership networks (Fig. 6, right) consist either of one or two persons from a scientific institution acting as leaders, or of a majority of scientists belonging to the same Alma Mater, as in Figure 6 (left). Mapping co-authorship onto the geographical distances between institutions results in a spatial arrangement of the main cities participating in collaboration (Fig. 7). Warszawa and its links with Kraków, Poznań, Wrocław and Toruń (the authors' city) constitute the basic characteristic of Polish scientific collaboration according to the DLF dataset.

The authors also present the detailed processing of data and text into a tabular form that is both readable and functional for counting, sorting, filtering and manual tracking of anomalies and irregularities. Statistical analysis shows that the total publications quantity is distributed according to a power law with good accuracy: chi-square = 0.0002 (Fig. 4, 5). The authors intend to repeat all tests periodically and thus analyse a dataset that grows every year. Changes in visualisation layouts and diagrams will express the dynamics of digital knowledge in Poland as well as trends in social structures.

[26] Bar-Ilan, J., Shema, H., & Thelwall, M. (2014). Bibliographic References in Web 2.0. In B. Cronin and C. Sugimoto (eds.), Bibliometrics and Beyond: Metrics-Based Evaluation of Scholarly Research, Cambridge: MIT Press.

Acknowledgement

We would like to thank the people concerned from the Poznan Supercomputing and Networking Centre for access to the DLF metadata collection, and the Polish Science Ministry for the database of Polish researchers and teachers.

References

Adamic L.A. Zipf, Power-laws, and Pareto - a ranking tutorial, Information Dynamics Lab, HP Labs Palo Alto 2000. Available on the Web: http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html [accessed 2 March 2015].
Bar-Ilan J., Shema H. and Thelwall M. Bibliographic References in Web 2.0, [In:] B. Cronin and C. Sugimoto (eds.), Bibliometrics and Beyond: Metrics-Based Evaluation of Scholarly Research, Cambridge: MIT Press 2014.
Boerner K. Visualizing Knowledge Domains, “Annual Review of Information Science & Technology” 37, 2003.
Börner K. and Scharnhorst A. Visual Conceptualizations and Models of Science, “Journal of Informetrics” 3(3), 2009.
Börner K. Atlas of Science, NY: MIT Press, 2010.
Chen Ch. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature, “Journal of the American Society for Information Science and Technology” 57(3), 2006.
Chen Ch. CiteSpace: Visualizing Patterns and Trends in Scientific Literature, 2004, http://cluster.cis.drexel.edu/~cchen/citespace/ [accessed 2 March 2015].
Chen Ch. Information Visualization. Beyond the Horizon, 2nd ed. London: Springer, 2006.
Chen Y. and Liu Z. The rise of mapping knowledge domain, “Studies in Science of Science” 23(2), 2005, pp. 149-154.
Garfield E. Scientography: Mapping the tracks of science, “Social and Behavioural Sciences” 7(45), 1994.
Liang Y. et al. Knowledge Mapping of Citation Analysis Domains, [In:] H. Kretschmer and F. Havemann (Eds.): Proceedings of WIS 2008, Berlin Fourth International Conference on Webometrics, Informetrics and Scientometrics & Ninth COLLNET Meeting Humboldt-Universität zu Berlin, Berlin: Institute for Library and Information Science (IBI), 2008.
Malak P. Indeksowanie treści, Warszawa: SBP, 2012.
Newman M.E.J. Power laws, Pareto distributions and Zipf’s law, “Contemporary Physics” 46, 2005, pp. 323-345. Available at: http://arxiv.org/pdf/cond-mat/0412004.pdf [accessed 2 March 2015].
Osinska V. and Bala P. New Methods for Visualization and Improvement of Classification Schemes – the case of Computer Science, “Knowledge Organization” 37(3), 2010.
Osinska V. Rozwoj metod mapowania domen naukowych i potencjal analityczny w nim zawarty, “Zagadnienia Informacji Naukowej” 2(96), 2010.
Osinska V. Visual Analysis of Classification Scheme, “Knowledge Organisation” 37(4), 2010.
Osińska V. Wizualizacja paradygmatów w nauce, “Zagadnienia naukoznawstwa” 48 (193), 2012.
Person O. Celebrating Scholarly Communication Studies, “E-newsletter of ISSI” 2009. Available on the Web: http://issi-society.org/ollepersson60/ollepersson60.pdf [accessed 2 March 2015].
Vargas-Quesada B. et al. Showing the Essential Science Structure of a Scientific Domain and its Evolution, “Information Visualization” 9(4), 2014, pp. 288-300.
