Exploring data by PCA and k-means for IEEE Xplore digital library

John Petearson Anzola

Luz Andrea Rodríguez

Giovanny Mauricio Tarazona

Faculty of Engineering University District Francisco José de Caldas Carrera 8 # 40 - 62, Bogotá, Colombia E-mail: [email protected]

Faculty of Administrative Sciences Foundation University Los Libertadores Carrera 16 # 63 A – 68 Bogotá, Colombia [email protected]

Faculty of Engineering University District Francisco José de Caldas Carrera 8 # 40 - 62, Bogotá, Colombia E-mail: [email protected]

ABSTRACT
An important aspect of data analysis is the exploration and representation of data. This article describes Principal Components Analysis (PCA) and cluster analysis with k-means, used to represent a dataset in a two-dimensional space and to group similar data, in order to find relationships between the two techniques. The data are extracted from the IEEE Xplore Digital Library, which lacks tools for processing and displaying information and therefore does not support the analysis and identification of trends and patterns in a query.

Finally, the article discusses how an unsupervised data analysis technique allows data to be grouped and organized by proximity based on variance, finding similar keywords between groups and principal components. This gives a temporal and evolutionary view of a set of keywords, which can later be interpreted as topics and areas of exploration and research.

CCS Concepts • Information systems➝Clustering • Information systems➝Content analysis and feature selection • Human-centered computing➝Scientific visualization.

Keywords Principal components analysis; k-means; scientometry; data mining.

1. INTRODUCTION
This article contrasts Principal Components Analysis (PCA) with cluster analysis, in order to represent in a two-dimensional space data that were defined in a single dimension, cluster them into similar groups, and thus identify and extract hidden relationships in a dataset. The data used correspond to a query in the IEEE Xplore Digital Library for the topic “Wireless Sensor Network”, in which the maximum number of records that can be downloaded in CSV format, with their metadata, is 2000. IEEE Xplore Digital Library lacks tools for processing and displaying information: it does not support the analysis of trends and patterns in a query or the identification of relevant topics, among other tasks, as Web of Science and, to a lesser extent, Scopus do. IEEE Xplore Digital Library was selected because it is one of the most cited databases, as shown in Figure 1.

Figure 1. Top 20 most referenced publishers [13].

Given these shortcomings in data analysis, this article contrasts PCA and cluster analysis applied to the keywords of the results of a query on IEEE Xplore Digital Library.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. KMO '16, July 25-28, 2016, Hagen, Germany © 2016 ACM. ISBN 978-1-4503-4064-9/16/07$15.00

DOI: http://dx.doi.org/10.1145/2925995.2926007

2. BACKGROUND
The literature on scientometrics and bibliometrics initially focuses on researchers' desire to obtain an overview of the state of research on a topic of interest, in order to assess the impact of particular research topics. Here, Web of Science integrates indexing and visualization of the world's leading research literature through a paid service that allows users to query, analyze, discover and investigate research topics through the connections between publications, researchers, citations and controlled indexing in databases that span multiple disciplines. Web of Science uses citations and reference searches to track and monitor research topics, including before publication, becoming a tool that monitors developments and trends in research subjects among thousands of files and records consulted.

Behind this tool are data mining algorithms that help an investigator identify hidden patterns and extract knowledge, making it possible to find emerging research topics that could not be extracted from the results of basic and advanced queries. The following subsections highlight contributions in visual text analytics, scientometry and text mining, which have gained extensive and relevant coverage and a level of confidence in research topics.

2.1 Visual Text Analytics
Searching by keyword, author, title, etc., among millions of articles published in academic databases and search engines such as IEEE Xplore, ACM Digital Library, InterScience, SpringerLink, Science Direct, Web of Science, Scopus and Google Scholar, among others, poses the challenge of helping experts, analysts and researchers in specific subject areas find the tacit knowledge hidden in volumes of text, not only to filter information but also to understand complex relationships and give meaning to the results at a higher level. Research proposals have aimed at integrating text mining, visualization and interpretation with human-computer interaction. The challenges include:
• Extraction and visualization of concepts, names and relations from a large, noisy text corpus [1,14].
• Visualization of the relationships between concepts in the text using graph structures [1,19].
• Support for real-time visualization with offline and online processing [4,20,21].
• New text interaction techniques that allow a domain analyst to navigate the knowledge content of the text corpus and tune the text mining without being a text mining expert [6,15].

2.2 Scientometry
Scientometrics integrates the study, analysis and measurement of areas of research and innovation, science and technology. The main research topics range over the measurement of the impact of authors, articles, journals and institutes and of their relations, in order to understand and map scientific areas.

Research methods include qualitative [22,24], quantitative [9,23] and computational [12,26] approaches, whose main focuses of study have been comparisons of productivity [7,18], classification of research topics [2], trends, relations, algorithmic search methods, machine learning and data mining [16], leaving open issues regarding information retrieval and extraction problems [25].

2.3 Text mining
Text mining seeks to discover high-quality information from large volumes of text, normally by applying algorithms that extract patterns and trends that cannot be obtained through statistics alone. Text mining research typically includes problems of text categorization [10], text clustering [8], production of granular taxonomies [3,5], sentiment analysis [17], information retrieval, lexical analysis, pattern recognition, association analysis, visualization and predictive analytics, among others.

3. TECHNIQUES USED
3.1 Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a statistical technique for synthesizing information or reducing dimensionality; in other words, it reduces the number of variables. It does so while losing as little information as possible, generating new factors or components that are linear combinations of the original variables and are independent of one another. Let $X = \{x_1, x_2, \ldots, x_n\}$, where each $x_i$ is a $d$-dimensional vector, and let $c = (c_1, c_2, \ldots, c_d) \in \mathbb{R}^d$ be the center of gravity of $X$. For $1 \le k \le d$, $x_{ik}$ denotes the $k$-th coordinate of the vector $x_i$. Given two vectors $u$ and $v$, their inner product is denoted $\langle u, v \rangle$. For any unit vector $v \in \mathbb{R}^d$, the variance of $X$ in the direction of $v$ is:

$$\mathrm{var}(X, v) = \frac{1}{n} \sum_{i=1}^{n} \langle x_i - c, v \rangle^2 \qquad (1)$$

The most significant direction is the unit vector $v_1$ for which $\mathrm{var}(X, v_1)$ is maximal. In general, once the $j$ most significant directions $B_j = (v_1, v_2, \ldots, v_j)$ have been identified, the $(j+1)$-th most significant direction is the unit vector $v_{j+1}$ for which $\mathrm{var}(X, v_{j+1})$ is maximal among all unit vectors perpendicular to $v_1, v_2, \ldots, v_j$. It can be verified that, for any unit vector $v \in \mathbb{R}^d$,

$$\mathrm{var}(X, v) = \langle Cv, v \rangle \qquad (2)$$

where $C$ is the covariance matrix of $X$, a symmetric $d \times d$ matrix whose $(i, j)$-th entry $c_{ij}$, $1 \le i, j \le d$, is defined as:

$$c_{ij} = \frac{1}{n} \sum_{k=1}^{n} (x_{ki} - c_i)(x_{kj} - c_j) \qquad (3)$$

A fundamental aspect of PCA is the interpretation of the factors, since this is not given a priori but is deduced by observing the relationship of the factors to the initial variables; it therefore depends on the experience and knowledge that an expert has of the research topic [11].

3.2 Cluster Analysis
The k-means algorithm was used for cluster analysis. It begins by constructing initial cluster centers, which are assigned randomly, and then refines them through the following process:
• Assign each item to a cluster based on its distance to the cluster centers, where the distance is given by the repetition frequency of each word.
• Update the positions of the cluster centers based on the mean values of the items in each cluster.
These steps are repeated until the assignments no longer change, so that the clusters are internally as homogeneous and externally as distinct as possible.
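The two k-means steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation (the paper reports using R); the squared Euclidean distance, the function names and the small `counts` matrix of yearly keyword frequencies are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centers, labels)."""
    rng = random.Random(seed)
    # Random initial centers, drawn from the data points themselves.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical yearly counts for six keywords over three years.
counts = [[0, 1, 0], [1, 0, 1], [0, 0, 1], [40, 55, 70], [42, 50, 75], [38, 60, 72]]
centers, labels = kmeans(counts, k=2)
print(labels)
```

On this toy input the three rarely used keywords and the three frequently used ones end up in separate clusters; which cluster receives label 0 depends on the random initialization.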

4. METHODOLOGY
The methodology used (see Figure 2) is divided into four levels of abstraction, organized hierarchically into tasks that range from the most general level down to obtaining the most similar keywords; it organizes the data mining process into a series of phases specific to each level.

The sequence of phases is rigid. In each phase the general tasks are structured, from which specific tasks are derived at the second level, and finally the actions to be carried out for each query are described.

4.1 Phase Query
The first phase is the query phase, whose aim is to obtain an overall understanding of the research topic. It is at this point that the greatest weakness of IEEE Xplore digital library becomes apparent: although it is one of the most consulted databases for reference and publication, it has no information analysis module. This weakness is one of the problems most users face, since the accumulation of information makes it difficult to understand a research topic and its evolution and to identify trends and relationships.

To get the most out of a query, it is necessary to understand the most complete way in which a topic can be analyzed. IEEE Xplore digital library allows a corpus of up to 2000 records to be downloaded, which, depending on the needs of the researcher, can be ordered by relevance, newest first, oldest first, most cited or search history. The purpose of this phase is to export a corpus of information to a CSV file.

Figure 2. Methodology used.

4.2 Preprocessing phase
The second phase is preprocessing. Its first task is to delete from the downloaded CSV file the fields that are not relevant, such as: Document Title, Authors, Author Affiliations, Publication Title, Date Added to Xplore, Volume, Issue, Start Page, End Page, Abstract, ISSN, ISBN, EISBN, DOI, PDF Link and IEEE authors. The second task is to remove records with empty fields, mainly in the fields Year and Author Keywords. The third task is to convert keywords to lower case and remove excess and trailing spaces between words, in order to have a better understanding of the data and avoid errors in post-processing. The last task of this phase is to count the number of words per year and to group words and abbreviations that have the same meaning, which improves the quality of the data and brings out the most obvious relationships that define these groups.

4.3 Data mining process
This phase applies PCA and cluster analysis, whose parameters depend on the characteristics of the data and on the precision desired from the model. PCA transforms a considerable number of variables into other, uncorrelated variables with zero mean, which are linear combinations of the original variables and are called factors or principal components. These can be sorted by the magnitude of their variance, so that each principal component describes a percentage of the total variability of the original variables. This percentage of variability can reveal relations of correlation or similarity among them; although all the original variables contribute to each principal component, some are more important than others and determine the nature of that component. Cluster analysis is an exploratory technique that aims to highlight groups that help the researcher explain the behavior of the analyzed keywords, identifying homogeneous groups of words that serve as a starting point for exploring a research topic. PCA together with cluster analysis thus constitutes a tool for evaluating the temporal and evolutionary state of a research topic in the results evaluation phase.

4.4 Evaluation phase of results
The evaluation phase considers the interpretation of the results, combining information visualization techniques with data analysis techniques to broaden the cognitive interpretation of the research topic consulted, and highlights the following characteristics:
• Reduction of information search.
• Representation of large data sets in a small space.
• Pattern recognition.
• Identification of temporal relationships.
• Monitoring of changes in variables over time (evolution).

The results are directly related to the keywords; they can be evaluated with other models, for objectives different from the original one, to reveal additional information.

5. RESULTS
The query in IEEE Xplore digital library was for the topic “Wireless Sensor Network”, sorted newest first, in order to analyze the topics of publications on wireless sensor networks in recent years. This phase yielded a corpus of 2000 records in CSV format.
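The preprocessing tasks (dropping records with empty Year or Author Keywords fields, lower-casing, removing excess spaces, merging equivalent terms and counting keywords per year) can be sketched as follows. This is an illustrative sketch in plain Python rather than the authors' pipeline: the `Year` and `Author Keywords` column names come from the field list in the text, while the `;` separator, the three-record sample and the `SYNONYMS` table are assumptions introduced for the example.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical three-record sample standing in for the exported CSV file.
SAMPLE = """Document Title,Year,Author Keywords
Paper A,2014,Wireless Sensor Network;  ZigBee
Paper B,2015,WSN; Internet of  Things
Paper C,2015,
"""

# Hypothetical synonym table used to merge abbreviations with full terms.
SYNONYMS = {"wsn": "wireless sensor network", "iot": "internet of things"}


def keyword_frequencies(csv_text, sep=";"):
    """Return {keyword: Counter({year: count})} built from two CSV fields."""
    table = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = (row.get("Year") or "").strip()
        keywords = (row.get("Author Keywords") or "").strip()
        if not year or not keywords:  # drop records with empty fields
            continue
        for raw in keywords.split(sep):
            # Lower-case and collapse leading, trailing and repeated spaces.
            word = " ".join(raw.lower().split())
            if not word:
                continue
            word = SYNONYMS.get(word, word)  # merge equivalent spellings
            table[word][int(year)] += 1
    return table


freq = keyword_frequencies(SAMPLE)
for word in sorted(freq):
    print(word, dict(freq[word]))
```

The resulting mapping has the same shape as a keyword-by-year frequency table: one row per normalized keyword and one count per year.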

The preprocessing phase (see Figure 2) produces a CSV file with the frequency of keywords per year (see Table 1), listed in alphabetical order, and a chart of the 50 most frequent keywords (see Figure 3).

Table 1. Frequency table (keywords \ year)
Keyword               2004   2005   ⋯   2014   2015
6lowpan                  0      0   ⋯      4      1
access control           0      1   ⋯      3      2
⋮                        ⋮      ⋮   ⋮      ⋮      ⋮
zero-power, zigbee       0      0   ⋯      8      6

Figure 3. The 50 most frequent keywords.

At the beginning of the data mining phase, PCA is implemented in the statistical software R. Each keyword is described by 12 variables, each corresponding to the number of times the keyword was indexed in articles published in the IEEE Xplore digital library in one year from 2004 to 2015, and is then represented in a two-dimensional space. These variables capture not only the total or average counts over 2004 to 2015, but also the variation in the time series and the relations among keywords in a given year. PCA reduces these to 10 variables, of which the first two components capture most of the information (see Figure 4).

Figure 4 orders the principal components by decreasing variance, so the first two components are used to represent the keywords in a scatter plot. The distribution of the keywords over principal components 1 and 2 is shown next.

Figure 6. Headers of the first four principal components.
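To make the projection step concrete, the sketch below computes the first two principal directions of a small keyword-by-year matrix and checks numerically that the directional variance of Equation (1) equals the quadratic form of Equation (2). It is written in plain Python instead of the R used by the authors, uses power iteration with deflation as one simple way to obtain the leading directions (the paper does not say which routine was used), and the `counts` matrix is hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the center of gravity c (column means) from every row."""
    n = len(rows)
    c = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, c)] for row in rows]


def covariance(centered):
    """d x d covariance matrix C with 1/n normalisation."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def directional_variance(centered, v):
    """Mean squared projection of the centered data onto the unit vector v."""
    return sum(dot(r, v) ** 2 for r in centered) / len(centered)


def leading_direction(cov, steps=500):
    """Power iteration: approximates the unit vector v maximising <Cv, v>."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(steps):
        w = [dot(row, v) for row in cov]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def deflate(cov, v):
    """Remove the variance along v so the next direction can be found."""
    lam = dot([dot(row, v) for row in cov], v)
    d = len(cov)
    return [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]


# Hypothetical counts: five keywords (rows) over three years (columns).
counts = [[0, 1, 0], [1, 0, 2], [3, 4, 8], [10, 14, 20], [12, 15, 25]]
X = center(counts)
C = covariance(X)
v1 = leading_direction(C)
v2 = leading_direction(deflate(C, v1))
total = sum(C[i][i] for i in range(len(C)))  # total variance = trace of C
explained = [directional_variance(X, v) / total for v in (v1, v2)]
scores = [(dot(r, v1), dot(r, v2)) for r in X]  # 2-D coordinates per keyword
print([round(e, 3) for e in explained])
```

The `scores` list holds the two-dimensional coordinates that would be drawn in a scatter plot, and `explained` gives the share of total variance carried by each of the two directions.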

A color is then associated with the average value for each year, using a palette that ranges from yellow for the lowest values to blue for the highest; keywords shown in blue are therefore the research topics most developed in recent years.

Figure 7. Distribution of keywords by color.

Figure 4. PCA existing.

Figure 7 shows that most of the variation occurs along the y axis, which has been assigned to PC1 and explains 37.7% of the variance, while the second principal component explains 9.3%, for a total of 47%, almost half of the total information. At the top of Figure 7, the keywords “internet of things” and “wireless sensor network” stand out as the most important keywords and the most developed research topics, while keywords lower on the axis belong to less developed topics. Moreover, the color/size relationship can encode the difference in the number of keywords over time (2015 minus 2004): the color gradient changes along the direction of the second principal component, with more positive values indicating an increase in the number of works on a topic, as in the case of the keyword “wireless sensor network”, which visibly appears three times, in yellow, brown and blue. The first principal component captures most of the variation in the dataset, and this variation is based on the total number of keywords found between 2004 and 2015; the second principal component captures the variation of these keywords over time. Because PCA creates factors, called principal components, it in effect performs a classification of the data. The structure found by PCA is then explored using the k-means clustering technique.

Figure 5. Dispersion of PCA1 ~ PCA2.

The results of cluster analysis with k-means, which groups keywords based on the similarity found by PCA, make it possible to choose a number of groups by comparing, for each iteration, a quantity called the within-group sum of squared distances. Figure 8 shows a grouping with 𝑘 = 3.
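The within-group sum of squared distances mentioned here can be computed for any labelling as shown below. This is a small illustrative sketch: the function name, the four sample points and the two hand-written labellings are assumptions chosen so that the arithmetic is easy to follow.

```python
def within_group_sum_of_squares(points, labels):
    """Sum, over all clusters, of squared distances from members to their mean."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for members in groups.values():
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, center))
    return total


# Hypothetical data: two tight pairs of points far apart from each other.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
wss_k1 = within_group_sum_of_squares(points, [0, 0, 0, 0])  # one group
wss_k2 = within_group_sum_of_squares(points, [0, 0, 1, 1])  # two groups
print(wss_k1, wss_k2)  # 104.0 4.0
```

A sharp drop in this quantity when a group is added, followed by small improvements afterwards, is the usual signal that the added group captured real structure.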

Figure 8. Keyword groups with k = 3.

Using an intuitive approach based on an understanding of the topic and on the PCA results, the two keywords that are grouped together, “internet of things” and “wireless sensor network”, are explored; they are highlighted in red.

Figure 9. Keyword groups with k = 4.

Figure 10. Keyword groups with k = 5.

Figure 11. Keyword groups with k = 6.

As the value of 𝑘 increases, a horizontal division marked in red appears at the bottom; it contains information of little variability, consisting of the keywords published fewer than 3 times per year, in other words research topics of little impact or little explored. Table 2 shows the number of keywords contained in each group for 𝑘 = 3, 4, 5, 6.

Table 2. Keywords per cluster
k   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5   Cluster 6
3        4181           2          27          --          --          --
4        1882         518           2        1808          --          --
5        1842         518          40        1808           2          --
6        1057         517          40        1808           2         786

Figure 12. Number of keywords registered per year.
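As a quick arithmetic check on Table 2, every value of k should partition the same set of keywords, so the cluster sizes in each row should add up to the same total. The snippet below transcribes the table and verifies this; the variable names are illustrative.

```python
# Cluster sizes transcribed from Table 2 (keywords per cluster for each k).
table2 = {
    3: [4181, 2, 27],
    4: [1882, 518, 2, 1808],
    5: [1842, 518, 40, 1808, 2],
    6: [1057, 517, 40, 1808, 2, 786],
}

totals = {k: sum(sizes) for k, sizes in table2.items()}
smallest = {k: min(sizes) for k, sizes in table2.items()}

for k in sorted(table2):
    print(k, totals[k], smallest[k])
```

Each row sums to 4210 keywords, and for every k the smallest cluster has exactly 2 members, which matches the pair of keywords that stays together as k grows.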

Figure 11 was obtained for 𝑘 = 6. For this value of 𝑘, cluster 5 shows a growth trend from 2004 to 2015 for the keywords “internet of things” and “wireless sensor network”, while the remaining keywords grow by an average of 1.5 per year, which means that they are widely dispersed research topics. The results show that the area complementary to wireless sensor networks is the internet of things. For researchers knowledgeable on the subject this may be an obvious result, but the similarity found by PCA and the different values of 𝑘 tested for k-means reaffirm it, a characteristic that can be transferred to other subjects or areas of exploration. For 𝑘 = 6, cluster 3 contains the 40 keywords with the greatest similarity between 2004 and 2006 (see Figure 12). These keywords (see Figure 13) represent the most relevant research topics in that time interval. Publications with these keywords continue to appear in the following years, but with growth below 0.2%, indicating that most of these research topics do not have a major impact.
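The paper does not state how the growth per year of a keyword was computed. One plausible reading, shown here purely as an assumption, is the least-squares slope of the yearly counts against the year; the series below is hypothetical.

```python
def yearly_slope(years, counts):
    """Least-squares slope of counts against years (keywords per year)."""
    n = len(years)
    mean_year = sum(years) / n
    mean_count = sum(counts) / n
    num = sum((y - mean_year) * (c - mean_count) for y, c in zip(years, counts))
    den = sum((y - mean_year) ** 2 for y in years)
    return num / den


years = list(range(2004, 2016))          # the 12 years covered by the study
counts = [2 + 3 * i for i in range(12)]  # hypothetical keyword gaining 3 per year
slope = yearly_slope(years, counts)
print(round(slope, 2))
```

Averaging such slopes over the members of a cluster would give a per-cluster growth figure comparable to the ones quoted in the text.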

Figure 13. Keywords for k = 6, cluster 3.

Clusters 1, 2, 4 and 6 are too large and heterogeneous. Analyzing these groups requires further refinement of the values of 𝑘; however, they represent a good grouping relative to the values of 𝑘 between 3 and 5, since the keywords of these groups show growth below 0.6% per year. In any case, these groups contain the keywords with the fewest publications in the sampled data.

6. CONTRIBUTION
Research in scientometrics covers the quantitative and qualitative mathematical study of science and technology, where bibliometric analysis and the apparent increase in publication rates in journals and proceedings make the exploration, analysis and visualization of large volumes of information harder every day. This problem is exacerbated in IEEE Xplore Digital Library, whose queries currently do not make it possible to identify the areas where research is developing, the relationships between disciplines, the evolution and temporal trend of a research topic, geographic analysis, or measures of impact between publications and patents, among many other subjects that are research topics at this time.

The contribution of this research to scientometrics lies in identifying research topics that are highly related over prolonged periods of time. PCA allows dimensions to be ranked in order of importance and the lower dimensions to be discarded; depending on the context of the processed data, these can be interpreted as sporadic data or noise, which improves classification. PCA is characterized by the proportion of variance found in the first two principal components, which explain most of the total variance of the data. k-means uses this feature to partition large datasets and better characterize data previously classified by PCA, finding that the smallest groups establish a strong relationship of continuity over time.

The combination of PCA and k-means has made it possible, in other study areas, to identify characteristics and patterns that reveal relationships in different data sets [5,9,14,21,22,32–36]. In this article, it made it possible to describe how PCA and k-means can select, from thousands of keywords, the words considered to be highly related areas or research topics, which form the smaller groups.

7. CONCLUSIONS
The PCA technique made it possible to trace the distribution of all keywords in a two-dimensional space according to the evolution of the number of keywords over an interval of 12 years. The first principal component coincided with the total number of keywords over the years, that is, the direction of largest variation, while the change over time was reflected in the second component. k-means, in turn, allowed similar keywords to be grouped by the number of publications per year, since an article cannot contain a duplicate keyword. The similarity provided by PCA was found to yield a temporal trend in the behavior of keyword groups with respect to the number of publications per year, which helps to estimate which topics or grouped keywords are being published more frequently. As the value of k increases, one group of keywords remains constant; these keywords are research topics that are highly related to each other. In the case discussed in this article, only two highly related keywords were found, "internet of things" and "wireless sensor network". To an expert their relationship is obvious, but for someone exploring a topic, a group that remains constant as k increases points to highly correlated topics and areas of research.

This article discusses the link between PCA and k-means, finding that PCA made it possible to reduce the dimensionality of the information in a set of keywords, obtaining principal components of which the first minimize the loss of information. A first grouping was thus made that is linear with respect to the variance, discarding the lower-ranked principal components, which corresponded to the lower-impact keywords. Subsequently, k-means showed, for the k values tested, that the smaller groups consist of highly related keywords, indicating a high cohesion of work over the past 12 years.

8. RECOMMENDATIONS AND FUTURE WORKS
As a recommendation, the link between PCA and k-means could be extrapolated to a business sector, where it might identify strengths in an organization, a market, etc., depending on the domain of the data. As future work, we want to analyze a nonlinear variance technique, since the variance analysis delivered by PCA is linear, and to contrast it with data analysis techniques based on information entropy.

9. REFERENCES

[1] Berna Altınel, Murat Can Ganiz, and Banu Diri. 2015. A corpus-based semantic kernel for text classification by using meaning values of terms. Engineering Applications of Artificial Intelligence 43: 54–66. http://dx.doi.org/10.1016/j.engappai.2015.03.015
[2] John P Anzola Anzola, Luz Andrea Rodriguez Rojas, and Giovanny M Tarazona Bermudez. 2015. Knowledge Management in Organizations: 10th International Conference, KMO 2015, Maribor, Slovenia, August 24-28, 2015, Proceedings. In Lorna Uden, Marjan Hericko and I-Hsien Ting (eds.). Springer International Publishing, Cham, 463–476. http://doi.org/10.1007/978-3-319-21009-4_36
[3] Nabil Arman. 2010. e-Learning Materials Development: Implementing Software Reuse Principles and Granularity Levels in the Small Using Taxonomy Search. Proceedings of the 1st International Conference on Intelligent Semantic Web-Services and Applications, ACM, 19:1–19:6. http://doi.org/10.1145/1874590.1874609
[4] Tao Cheng and Jochen Teizer. 2013. Real-time resource location data collection and visualization technology for construction safety and activity monitoring applications. Automation in Construction 34: 3–15. http://dx.doi.org/10.1016/j.autcon.2012.10.017
[5] Y Q Cheng, H C Li, T Celik, and F Zhang. 2013. FRFT-based improved algorithm of unsupervised change detection in SAR images via PCA and K-means clustering. Geoscience and Remote Sensing Symposium (IGARSS), 2013 IEEE International, 1952–1955. http://dx.doi.org/10.1109/IGARSS.2013.6723189
[6] J K Chiang and R.-H. Yang. 2013. Multidimensional data mining for discover association rules in various granularities. International Conference on Computer Applications Technology, ICCAT 2013. http://doi.org/10.1109/ICCAT.2013.6522021
[7] Wei Ming Chiew, Feng Lin, Kemao Qian, and Hock Soon Seah. 2014. A heterogeneous computing system for coupling 3D endomicroscopy with volume rendering in real-time image visualization. Computers in Industry 65, 2: 367–381. http://dx.doi.org/10.1016/j.compind.2013.10.002
[8] Guang-Feng Deng and Woo-Tsong Lin. 2012. Citation analysis and bibliometric approach for ant colony optimization from 1996 to 2010. Expert Systems with Applications 39, 6: 6229–6237. http://dx.doi.org/10.1016/j.eswa.2011.12.001
[9] Chris Ding and Tao Li. 2007. Adaptive Dimension Reduction Using Discriminant Analysis and K-means Clustering. Proceedings of the 24th International Conference on Machine Learning, ACM, 521–528. http://doi.org/10.1145/1273496.1273562
[10] Z Fan, S Chen, L Zha, and J Yang. 2016. A Text Clustering Approach of Chinese News Based on Neural Network Language Model. International Journal of Parallel Programming 44, 1: 198–206. http://doi.org/10.1007/s10766-014-0329-2
[11] P Gautam. 2015. Deciphering the Department-Discipline Relationships within a University through Bibliometric Analysis of Publications Aided with Multi-variate Techniques. Advanced Applied Informatics (IIAI-AAI), 2015 IIAI 4th International Congress on, 468–471. http://doi.org/10.1109/IIAI-AAI.2015.212
[12] A S Ghareb, A A Bakar, and A R Hamdan. 2016. Hybrid feature selection based on enhanced genetic algorithm for text categorization. Expert Systems with Applications 49: 31–47. http://doi.org/10.1016/j.eswa.2015.12.004
[13] K.-Y. Ho and W Wang. 2016. Predicting stock price movements with news sentiment: An artificial neural network approach. Studies in Computational Intelligence 628: 395–403. http://doi.org/10.1007/978-3-319-28495-8_18
[14] K Honda, R Nonoguchi, A Notsu, and H Ichihashi. 2011. PCA-guided k-Means clustering with incomplete data. Fuzzy Systems (FUZZ), 2011 IEEE International Conference on, 1710–1714. http://doi.org/10.1109/FUZZY.2011.6007312
[15] O C L Hou, Heigen Hsu, and J M Yang. 2010. An empirical investigation of research productivity on Text Mining - in bibliometrics view. New Trends in Information Science and Service Science (NISS), 2010 4th International Conference on, 646–650.
[16] IEEE. 2016. IEEE leads patent citations.
[17] Jahiruddin, Muhammad Abulaish, and Lipika Dey. 2010. A concept-driven biomedical knowledge extraction and visualization framework for conceptualization of text corpora. Journal of Biomedical Informatics 43, 6: 1020–1035. http://dx.doi.org/10.1016/j.jbi.2010.09.008
[18] Mikael Johansson, Mattias Roupé, and Petra Bosch-Sijtsema. 2015. Real-time visualization of building information models (BIM). Automation in Construction 54: 69–82. http://dx.doi.org/10.1016/j.autcon.2015.03.018
[19] C Katherine Andrea Cuartas, A John Petearson Anzola, and B Giovanny Mauricio Tarazona. 2015. Classification methodology of research topics based in decision trees: J48 and randomtree. International Journal of Applied Engineering Research 10, 8: 19413–19424. Retrieved from http://www.scopus.com/inward/record.url?eid=2-s2.0-84929933512&partnerID=40&md5=03c7360c0a771362b2b135b252f12021
[20] M Kaya and S Conley. 2016. Comparison of sentiment lexicon development techniques for event prediction. Social Network Analysis and Mining 6, 1: 1–13. http://doi.org/10.1007/s13278-015-0315-8
[21] Ehsan Lotfi and Azita Keshavarz. 2014. Gene expression microarray classification using PCA–BEL. Computers in Biology and Medicine 54: 180–187. http://dx.doi.org/10.1016/j.compbiomed.2014.09.008
[22] T Matsui, K Honda, C H Oh, A Notsu, and H Ichihashi. 2009. Cluster validation in k-Means clustering based on PCA-guided k-Means and procrustean transformation of PC scores. Fuzzy Systems, 2009. FUZZ-IEEE 2009. IEEE International Conference on, 1546–1550. http://doi.org/10.1109/FUZZY.2009.5277333
[23] José M Merigó, Anna M Gil-Lafuente, and Ronald R Yager. 2015. An overview of fuzzy research with bibliometric indicators. Applied Soft Computing 27: 420–433. http://dx.doi.org/10.1016/j.asoc.2014.10.035
[24] Jussi Nikander, Ari Korhonen, Eiri Valanto, and Kirsi Virrantaus. 2007. Visualization of Spatial Data Structures on Different Levels of Abstraction. Electronic Notes in Theoretical Computer Science 178: 89–99. http://dx.doi.org/10.1016/j.entcs.2007.01.029
[25] Du Ping-ping, Li Wen-ping, Sang Shu-xun, Wang Lin-xiu, and Zhou Xiao-zhi. 2009. Application of 3D visualization concept layer model for coal-bed methane index system. Procedia Earth and Planetary Science 1, 1: 977–981. http://dx.doi.org/10.1016/j.proeps.2009.09.151
[26] Daniel J Power and Ramesh Sharda. 2007. Model-driven decision support systems: Concepts and research directions. Decision Support Systems 43, 3: 1044–1061. http://dx.doi.org/10.1016/j.dss.2005.05.030
[27] M A Schuh, J M Banda, T Wylie, P McInerney, K Ganesan Pillai, and R A Angryk. 2015. On visualization techniques for solar data mining. Astronomy and Computing 10: 32–42. http://dx.doi.org/10.1016/j.ascom.2014.12.003
[28] F N Silva, F A Rodrigues, O N Oliveira Jr, and L da F. Costa. 2013. Quantifying the interdisciplinarity of scientific journals and fields. Journal of Informetrics 7, 2: 469–477. http://dx.doi.org/10.1016/j.joi.2013.01.007
[29] F N Silva, F A Rodrigues, O N Oliveira Jr, et al. 2015. A corpus-based semantic kernel for text classification by using meaning values of terms. Automation in Construction 43, 6: 69–82. http://doi.org/10.1109/BigData.2014.7004345
[30] Thiago H P Silva, Mirella M Moro, Ana Paula C Silva, Wagner Meira Jr., and Alberto H F Laender. 2014. Community-based Endogamy As an Influence Indicator. Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, IEEE Press, 67–76. Retrieved from http://dl.acm.org/citation.cfm?id=2740782
[31] Xiaoli Sun, Dan Wu, and Chao Zhang. 2015. Based on bibliometrics and content analysis of the literature on science and technology media. Management of Engineering and Technology (PICMET), 2015 Portland International Conference on, 1339–1344. http://doi.org/10.1109/PICMET.2015.7273157
[32] Arthur Szlam. 2009. Asymptotic regularity of subdivisions of Euclidean domains by iterated PCA and iterated 2-means. Applied and Computational Harmonic Analysis 27, 3: 342–350. http://dx.doi.org/10.1016/j.acha.2009.02.006
[33] K Vijay and K Selvakumar. 2015. Brain FMRI clustering using interaction K-means algorithm with PCA. Communications and Signal Processing (ICCSP), 2015 International Conference on, 909–913. http://doi.org/10.1109/ICCSP.2015.7322628
[34] Z Wu and H Ju. 2008. Research of Printed Matter Flaws Inspection Based on Improved K-Means and PCA. Computational Intelligence and Industrial Application, 2008. PACIIA ’08. Pacific-Asia Workshop on, 247–251. http://doi.org/10.1109/PACIIA.2008.164
[35] Qin Xu, Chris Ding, Jinpei Liu, and Bin Luo. 2015. PCA-guided search for K-means. Pattern Recognition Letters 54: 50–55. http://dx.doi.org/10.1016/j.patrec.2014.11.017
[36] Shijie Zhang, Wei Jin, Ying Huang, Wei Su, Jiong Yang, and Zhaoyang Feng. 2011. Profiling a Caenorhabditis elegans behavioral parametric dataset with a supervised K-means clustering algorithm identifies genetic networks regulating locomotion. Journal of Neuroscience Methods 197, 2: 315–323. http://dx.doi.org/10.1016/j.jneumeth.2011.02.014
