sustainability Article
Constructing Patent Maps Using Text Mining to Sustainably Detect Potential Technological Opportunities Hei Chia Wang * , Yung Chang Chi and Ping Lun Hsin Department of Industrial and Information Management and Institute of Information Management, National Cheng Kung University, Tainan 701, Taiwan;
[email protected] (Y.C.C.);
[email protected] (P.L.H.) * Correspondence:
[email protected]; Tel.: +886-6-275-7575 (ext. 53724); Fax: +886-6-2362162 Received: 22 July 2018; Accepted: 11 October 2018; Published: 17 October 2018
Abstract: With the advent of the knowledge economy, firms often compete for intellectual property rights. Being the first to acquire high-potential patents can assist firms in achieving future competitive advantages. To identify patents capable of being developed, firms often search for a focus by using existing patent documents. Because of the rapid development of technology, the number of patent documents is immense. A prominent topic among current firms is how to use this large number of patent documents to discover new business opportunities while avoiding conflicts with existing patents. In the search for technological opportunities, a crucial task is to present results in the form of an easily understood visualization. Currently, natural language processing can help in achieving this goal. In natural language processing, word sense disambiguation (WSD) is the problem of determining which “sense” (meaning) of a word is activated in a given context. Given a word and its possible senses, as defined by a dictionary, we classify the occurrence of a word in context into one or more of its sense classes. The features of the context (such as neighboring words) provide evidence for these classifications. The current method for patent document analysis warrants improvement in areas, such as the analysis of many dimensions and the development of recommendation methods. This study proposes a visualization method that supports semantics, reduces the number of dimensions formed by terms, and can easily be understood by users. Since polysemous words occur frequently in patent documents, we also propose a WSD method to decrease the calculated degrees of distortion between terms. An analysis of outlier distributions is used to construct a patent map capable of distinguishing similar patents. During the development of new strategies, the constructed patent map can assist firms in understanding patent distributions in commercial areas, thereby preventing patent infringement caused by the development of similar technologies. Subsequently, technological opportunities can be recommended according to the patent map, aiding firms in assessing relevant patents in commercial areas early and sustainably achieving future competitive advantages. Keywords: text mining; patent mapping; technological opportunities; competitive advantage
1. Introduction In many cases, firms must expend substantial amounts of time and money if they are suspected of involvement in patent infringement [1]. To avoid this, firms often employ patent engineering teams to routinely retrieve and organize current patents relevant to the firm’s technologies [2]. This allows firms to understand which technologies are under patent protection, avoid conflicts between technologies from their research and development (R&D) and those in existing patents, and reduce the likelihood of patent infringement [2]. Analyzing the patent distribution of an industry is a vital task in preventing
Sustainability 2018, 10, 3729; doi:10.3390/su10103729
www.mdpi.com/journal/sustainability
Sustainability 2018, 10, 3729
2 of 18
infringement concerns. In addition, patent documents contain abundant credible technical information and key research results, which makes them a highly valuable and useful source of knowledge [3,4]. According to the World Intellectual Property Organization (WIPO), by searching for and reviewing patent literature, 90–95% of the world’s inventions can be understood, technology R&D time can be decreased by 60%, and research expenditures can be decreased by 40% [5]. Therefore, when firms intend to develop a new technology or product, they often first collect patents relevant to that technology. Through this collection process they accumulate new technical knowledge, which inspires innovation and assists firm decision-makers in developing a strategic direction and decreasing costs during the R&D process [6]. According to the 2016 WIPO report, from the initial implementation of the patent system to 2015, more than 75 million patent applications have been filed [7]. In practice, the precision and recall of patent retrieval systems have become increasingly incapable of meeting user expectations, resulting in information overload [3,8,9]. Although utilizing the International Patent Classification codes (IPC-codes) developed by the WIPO can limit the scope when searching for information, in practice these codes can only be used as a reference rather than an ultimate standard. In addition, patent documents can be found in numerous technical domains, but few people have professional knowledge in multiple domains [10]. To enable rapid user comprehension, it is convenient to represent the distribution of patents as a patent map. Patent mapping is a common method that involves presenting patent information obtained from a retrieval system using various qualitative and quantitative analysis methods. Patent maps have several functions; for instance, the use of patent maps enables more efficient detection of patent infringement [1]. Furthermore, when competitors possess prospective patents and latecomers have no choice but to mimic necessary technologies within the patents, the latecomers can use patent maps to understand competitors’ patent distributions and attempt to design around such patents to avoid infringement. Because a standalone patent does not possess as much legal force as a group of related patents in a patent portfolio, patent maps can be used to develop a firm’s own patent distribution. In addition to increasing the number of competitors’ infringement cases and increasing settlement amounts, this can also render a competitors’ design-around strategies more difficult to execute [11]. Patent mapping can also be used to assess firms that wish to collaborate or merge [12] and to compare different technologies (structured data and unstructured text) to analyze aspects such as technical trends and interactions with competitors [12,13]. Patent mapping visualizations are one of the best ways to compare different technologies [14]. Therefore, at the national, industrial, and technical domain levels, as well as the product and firm levels, patent mapping can provide decision-makers and R&D professionals with comprehensive summaries of patent-related information. Using graphical representations of industry trends and technology distributions further provides them with comprehensive support during the development of business strategies and plans. This study proposes a visualization method that supports semantics, reduces the number of dimensions formed by terms, and can easily be understood by users. In current patent analysis, numerous patent documents use different words to describe the same events, resulting in semantic inconsistency [15] and polysemy due to the multiple meanings that may exist for one word. To resolve this, document analysis often necessitates the merging of synonyms into the same semantic dimension; a thesaurus can facilitate this process. These word sense disambiguation (WSD) methods decrease polysemy and allow the term similarities of patent documents to be calculated more precisely. This study uses WordNet [15], which is commonly used for synonym analysis, to calculate similarities between terms and merge similar terms. Additionally, multiple meanings may exist for one word by the average value formula (see Equation (2)) of semantic similarity. On the other hand, different words may be used to describe the same events and multiple meanings may exist for one word; WordNet can also calculate term similarity faster and more precisely in this case. Finally, to reduce the dimensions of documents for reading convenience, multidimensional scaling is used to simplify multidimensional research subjects into a low-dimensional space. The outlierness
Sustainability 2018, 10, 3729
3 of 18
of each document is also calculated: If the local density of a document is smaller than neighboring local densities, that document possesses higher outlierness, which indicates a lower number of similar patents and a gap in related technologies, which may indicate technological opportunity [16]. Term analysis can be implemented in different programming languages that are suitable for different fields. Multiple programming languages can be used for implementing different packages. In recent years, the R language has become increasingly popular, and developers are continuously developing new R language kits; consequently, the R language has become increasingly powerful for statistical processing, graphics, data mining, and big data. Therefore, R is the major programming language used in this study. In this paper, the R text mining package (tm) is used for term analysis and the R statistical analysis package is used for multidimensional scaling. The three primary objectives of this study are (1) to enhance the effectiveness of patent retrieval using a semantic net, (2) to construct a patent map that distinguishes patent similarities to help firms avoid patent infringement caused by developing similar technologies, and (3) to use the outlier method to sustainably discover technological opportunities. The research contribution is a patent map that can assist firms in understanding patent distributions in commercial areas, thereby preventing patent infringement caused by the development of similar technologies. In addition, technological opportunities can be sustainably recommended using the patent map. In Section 2, we discuss the limitations of existing methods through a review of the literature on word-sense disambiguation and patent mapping. We also present our specific research objectives and explain WordNet, a key technology for achieving those objectives. In Section 3, we present the basic concept and the detailed process of the proposed approach. Section 4 shows the term analysis of the proposed methodology, using real patents to derive and verify the results. In Section 5, limitations and areas for future research are discussed. 2. Related Work 2.1. Text Mining Text mining extracts meaningful information from unstructured data and is used in many patent research fields because it can work with large amounts of text. Text mining can be divided into keyword-based analysis and word-based analysis [17]. As following Table 1, Song et al. [6] have applied text mining and T-term analysis to discover new technology opportunities. Walter et al. [18] have discussed text-mining analysis in relation to the beauty of the brimstone butterfly, the novelty of patents identified by near environment analysis. Alves et al. [19] have developed text-mining tools for information retrieval from patents. Kayser et al. [20] have extended the knowledge base of foresight and thereby the contribution of text mining. Shen et al. [21] have used text mining and cosine similarity to discover potential product opportunities. Kim et al. [22] have used text-mining patent keyword extraction for sustainable technology management. Roh et al. [23] have developed a methodology for structuring and layering technological information to apply text mining to patent documents. Table 1. Recent text mining-related research. Authors
Year
Title
Methodology
Summary
Song et al.
2017
Discovering new technology opportunities based on patents: Text mining and F-term analysis.
Text mining and F-term analysis
The F-term can provide effective guidelines for generating new technological ideas.
Walter et al.
2017
The beauty of the brimstone butterfly: novelty of patents identified by near environment analysis based on text mining.
Text mining and near environment analysis
The approach taken can single out the content-wise novelty of patents in near environments.
Sustainability 2018, 10, 3729
4 of 18
Table 1. Cont. Authors
Alves et al.
Kayster et al.
Year
2017
2017
Title Development of Text Mining Tools for Information Retrieval from Patents.
Extending the knowledge base of foresight: The contribution of text mining.
Methodology
Summary
Text mining
A patent pipeline was developed and integrated into @Note2, an open-source computational framework for BioTM(Biomedical text mining).
Text mining
Text mining facilitates the detection and examination of emerging topics and technologies by extending the knowledge base of foresight. Hence, new foresight applications can be designed.
Text mining and cosine similarity
Remote health monitoring technology is used as a case study. The results show four product opportunities, namely wireless sensor devices, telecommunication systems and technology, wearable devices and systems, and medical services and systems.
Shen et al.
2017
A cross-database comparison to discover potential product opportunities using text mining and cosine similarity.
Kim et al.
2018
Patent Keyword Extraction for Sustainable Technology Management.
Text mining and keyword extraction
The proposed method improves the precision by approximately 17.4% over the existing method.
2017
Developing a Methodology for Structuring and Layering Technological Information in Patent Documents through Natural Language Processing.
Text mining and natural language processing
The structured and layered keyword set does not omit useful keywords and the analyzer can easily understand the meaning of each keyword.
Roh et al.
2.2. Word-Sense Disambiguation In computational linguistics, word-sense disambiguation (WSD) [24] is an open problem in natural language processing and ontology. WSD means identifying which sense of a word (i.e., meaning) is being used in a given sentence when the word has multiple meanings. The solution to this problem impacts other computer-related issues, such as discourse, improving the relevance of search engines, anaphora resolution, coherence, and inference. Fixed grammatical structures in a language can be used to determine which semantic meaning should be attributed to a term; for example, in English, prepositions must be followed by nouns, pronouns, or noun phrases. Neighboring words can also be referenced to determine semantic meaning. Additionally, a term usually only represents a single meaning in a document; this concept can be used to limit the meanings of terms. Through these WSD methods, term similarities can be more precisely calculated. So far, expert dictionaries have only a few fields for polysemous words that frequently occur in WordNet. This study proposes a term similarity calculation and term grouping method that relies on WSD. These methods make it easier to decrease the calculated degree of distortion between two terms so that they can be compared more precisely. 2.3. WordNet WordNet constructs subsemantic networks for each of the following four parts of speech: nouns, verbs, adjectives, and adverbs [25]. Semantically similar terms in the semantic web (such as “kid” and “child”) are loosely grouped into synonym sets (synsets), whereas words with multiple meanings appear in several different synsets. The relationships among terms do not exist in isolation but also generate subsequent links among the four semantic networks for the synsets. In the noun semantic network, the links from synonymous word sets indicate hyponymy (super-subordinate relations) as well as meronymy (part-whole relations); all of these are used to form a hierarchical architecture.
Sustainability 2018, 10, 3729
5 of 18
In this study, based on the WordNet similarities between terms, the similarities between patents are quantified by calculating the degree of closeness based on the depth of the concepts in the hierarchy. 2.4. Patent Mapping Qualitative analysis is used to analyze the contents of individual patent documents, such as technical content files, and the results usually contain the relevant individual patent document number. Although patent analysis by expert analysts can provide valuable information, it involves manual, detailed reading of individual patent documents, which is time-consuming. Typically, the results of such analysis are presented through patent mapping, which generates illustrations in the form of tree structures, tables, or matrices [26]. A matrix is the basic form of patent presentation. A patent map in the form of a tree structure can indicate the development of technology, the spread of technology, and the state of joint application [26]. Quantitative analysis involves the formation of a group of patents as a parent group of a given category of patents. It uses bibliographic information contained in the patent literature, including: distinctions among documents; document numbers; the patent classification; the applicant’s nationality, name, and address; the name of the inventor; and, the number of inventions. Additional information provided by the patent office and related to retrieval, prosecution, and cited documents is also used in quantitative analysis [18]. Detailed analysis additionally involves separate supplemental indices. As in qualitative analysis, quantitative analysis results are presented in various forms, including illustrations, graphs, tree structures, and matrices. Among these, graphics are the basic form of presentation and visualization. Patent analysis requires both qualitative and quantitative analysis [26]. In this study, patent mapping is the core of patent visualization and refers to a series of methods used to build patent maps based on differences and similarities among patents. Patent mapping visualizations are one of the best ways to compare different technologies [14]. 2.5. Multidimensional Scaling Multidimensional scaling (MDS) is a nonparametric, distance-based multivariate analysis technique that produces statistical maps from the main characteristics of the data. It has the advantage of making the results accessible to non-specialists in an intuitive manner [27]. In this study, after similarity calculations, multidimensional matrices that are difficult to read are generated from the similarities of patent documents. To facilitate understanding of the relationships among patents, multidimensional matrices should be reduced dimensionally. Patent maps are mostly two-dimensional or, at most, three-dimensional. Therefore, multidimensional matrices cannot be represented by patent maps; MDS must first be used to convert multidimensional matrices into twoor three-dimensional matrices. 2.6. Local Outlier Factor Local outlier factor (LOF) is an anomaly detection algorithm that compares the local density of a point’s neighborhood with those of its neighbors. It indicates the degree of an object’s outlierness from a cluster [28]. LOF enables outlier data that could be more valuable than normal data to be extracted. Researchers have proposed numerous outlier mining algorithms that can effectively detect outlierness in a data set [29]. In this study, the final LOF output is intended to discover technological vacancies or technical opportunities. 3. Proposed Methods The research structure of this study is divided into five modules, as shown in Figure 1. These are the “Document collection and preprocessing module”, the “Term similarity calculation and term grouping module”, the “Document similarity calculation module”, the “Multidimensional scaling-based dimensionality reduction module” and the “Document outlierness calculation module”. The function of each module is described below.
Sustainability 2018, 10, 3729
6 of 18 6 of 18
Sustainability 2018, 10, x FOR PEER REVIEW
Figure Figure 1. 1. Research framework. framework.
3.1. Document Collection and Preprocessing Module 3.1. Document Collection and Preprocessing Module This study used patent documents approved by the United States Patent and Trademark This study used patent documents approved by the United States Patent and Trademark Office. Office. Regular documents are unstructured data consisting of several terms. Patent documents Regular documents are unstructured data consisting of several terms. Patent documents are semiare semi-structured data comprising the following fields: patent number (Pat. No.), patent title, structured data comprising the following fields: patent number (Pat. No.), patent title, patent abstract, patent abstract, patent assignee, references cited, IPC-code, patent claims, and patent description. patent assignee, references cited, IPC-code, patent claims, and patent description. The module adopted in the present study references methods from relevant studies [4,12,30–32] The module adopted in the present study references methods from relevant studies [4,12,30–32] to avoid incorporating an overly large number of words; only words in the patent title, abstract, to avoid incorporating an overly large number of words; only words in the patent title, abstract, and and claims fields were retrieved for processing. All numbers, punctuation marks, and special symbols claims fields were retrieved for processing. All numbers, punctuation marks, and special symbols were removed, and after processing by the Stanford parser, numerous terms containing relevant parts were removed, and after processing by the Stanford parser, numerous terms containing relevant of speech were acquired. This was followed by a removal of stop words from all terms. parts of speech were acquired. This was followed by a removal of stop words from all terms. 3.2. Term Similarity Calculation and Term Grouping Module 3.2. Term Similarity Calculation and Term Grouping Module The number of terms generated by the previous module is too large for easy interpretation by The number of terms generated by the previous module is too large for easy interpretation by users, and the problem of synonyms persists. Studies have indicated that semantic analysis can be users, and the problem of synonyms persists. Studies have indicated that semantic analysis can be used to solve this problem [33]. Regarding the semantic associations among terms, this study followed used to solve this problem [33]. Regarding the semantic associations among terms, this study the example of Miller [25] to calculate the similarities between pairs of terms. Several methods exist followed the example of Miller [25] to calculate the similarities between pairs of terms. Several for using WordNet to perform this calculation, such as PATH (A simple node-counting scheme) [34], methods exist for using WordNet to perform this calculation, such as PATH (A simple node-counting WUP (Wu & Palmer measure) [35], and LESK (Lesk algorithm) [36]. Among these, WUP is based on scheme) [34], WUP (Wu & Palmer measure) [35], and LESK (Lesk algorithm) [36]. Among these, WUP the measurement of the depth of each concept in use; that is, it measures the length of the path from the is based on the measurement of the depth of each concept in use; that is, it measures the length of the root node or the nearest common ancestor to two concepts, or the depth of the lowest common subsume path from the root node or the nearest common ancestor to two concepts, or the depth of the lowest (LCS) of the two concepts. Because it can reflect the specific level of the concept, WUP was selected. common subsume (LCS) of the two concepts. Because it can reflect the specific level of the concept, The WUP method calculates the similarity between two terms according to the following formula: WUP was selected. The WUP method calculates the similarity between two terms according to the following formula: 2 ∗ Depth( LCS( Term1 , Term2 )) Similarity( Term1 , Term2 ) = (1) Depth Term2)) ( Term1 ) + Depth, (𝑇𝑒𝑟𝑚 ) 2 ∗ 𝐷𝑒𝑝𝑡ℎ(𝐿𝐶𝑆(𝑇𝑒𝑟𝑚 1 2 (1) Similarity(𝑇𝑒𝑟𝑚1 , 𝑇𝑒𝑟𝑚2 ) = 𝐷𝑒𝑝𝑡ℎ(𝑇𝑒𝑟𝑚1 ) + 𝐷𝑒𝑝𝑡ℎ(𝑇𝑒𝑟𝑚2 )
Sustainability 2018, 10, 3729
7 of 18
Sustainability 2018, 10, x FOR PEER REVIEW
7 of 18
The Depth() function returns the depth of the collection of synonyms of the obtained terms in the The Depth() function returns the depth of the function collectionreturns of synonyms obtained terms in the WordNet hierarchical framework, and the LCS() the LCSofofthe the respective synonym WordNet hierarchical framework, and the LCS() function returns the LCS of the respective synonym collections of the two obtained terms. For instance, to calculate the similarity between “CPU” and collections of thethe twodepth obtained terms. instance, to of calculate similarity and “RAM,” because of “CPU” is For 8 and the depth “RAM”the is 10, the LCS between of “CPU”“CPU” and “RAM” “RAM,” becausewhose the depth “CPU” 8 and the in depth of “RAM” is 10, the LCS of “CPU” and “RAM” is “hardware,” levelofvalue is 7,isas shown Figure 2. Therefore: is “hardware,” whose level value is 7, as shown in Figure 2. Therefore: 2×7 Similarity(“CPU”, “RAM”) =2 × 7 = 0.778 Similarity(“CPU”, “RAM”) = 8 + 10 = 0.778 8 + 10
Figure2.2.Example Exampleofofthe theWordNet WordNethierarchical hierarchicalframework. framework. Figure
Additionally, multiple meanings may exist for one word. For instance, in WordNet, “processor” Additionally, multiple meanings may exist for one word. For instance, in WordNet, “processor” has three meanings, which are “a business engaged in processing agricultural products,” has three different differentnoun noun meanings, which are “a business engaged in processing agricultural “someone who processes things,” and “central processing unit”; similarly, five different noun meanings products,” “someone who processes things,” and “central processing unit”; similarly, five different exist meanings for “memory” 3). For (Figure this type situation, this tookthis thestudy meantook valuethe of mean maxa noun exist (Figure for “memory” 3).ofFor this type of study situation, (3, because “processor” has 3 noun meanings) multiplied by that of maxb (5, because “memory” has value of maxa (3, because “processor” has 3 noun meanings) multiplied by that of maxb (5, because 5 noun meanings) to represent the 15 total semantic similarities, as presented in Table 2. The average “memory” has 5 noun meanings) to represent the 15 total semantic similarities, as presented in Table value the semantic between “processor” and “memory”and is as“memory” follows: is as follows: 2. The of average value ofsimilarity the semantic similarity between “processor” Similarity “memory” (“processor”, AverageAverage Similarity "memory" ("processor", ) ) max
=
maxb
(2) (2)
a ∑ ∑ a=1b ∑ b =1 a= Similarity (processor #n #a, memory #n #b) max ×max =1 b =1
max a max
Similarity (processor #n #a, memory #n #b a
b
maxa maxb Table 2. Semantic Similarity.
processor#n#1 processor#n#2 processor#n#3
Memory#n#1
Memory#n#2
Memory#n#3
Memory#n#4
Memory#n#5
0.133 0.143 0.133
0.133 0.143 0.133
0.133 0.143 0.133
0.125 0.133 0.875
0.105 0.111 0.105
Sustainability 2018, 10, 3729 Sustainability 2018, 10, x FOR PEER REVIEW
8 of 18 8 of 18
Figure 3. Example of polysemes in WordNet. Table 2. Semantic Similarity.
processor#n#1 processor#n#2 processor#n#3
Memory#n#1 Memory#n#2 Memory#n#3 Memory#n#4 0.133 0.133 0.133 0.125 0.143 0.143 0.143 0.133 Figure 3. 3. Example Exampleof ofpolysemes polysemesin in WordNet. WordNet. Figure 0.133 0.133 0.133 0.875
Memory#n#5 0.105 0.111 0.105
The mean value of the semantic similarity of “processor” and “memory,” which is 0.179, Table 2. Semantic Similarity. The mean value the of the of “processor” is obtained by dividing sumsemantic of the 15similarity values in Table 2 by 15. and “memory,” which is 0.179, is obtained by dividing sum of the 15matrix, valuesEquation in Table 2(3)bycan 15.be used Memory#n#1 Memory#n#2 Memory#n#3 Memory#n#4 Memory#n#5 After obtaining thethe term similarity to obtain the distance matrix After obtaining the term similarity matrix, Equation (3) can be used to obtain the distance matrix processor#n#1 0.133can subsequently 0.133 utilize the0.133 0.105 between terms. This module distance matrix0.125 to group terms, as shown in between terms. utilize 0.143 the distance matrix processor#n#2 0.143 can subsequently 0.143 0.133to group terms, 0.111as shown Figure 4. We have:This module inprocessor#n#3 Figure 4. We have: 0.133 0.133 0.133 0.875 0.105 Distance( Term1 , Term2 ) = 1 − Similarity( Term1 , Term2 ) (3) (3) Distance(𝑇𝑒𝑟𝑚1 , 𝑇𝑒𝑟𝑚2 ) = 1 − Similarity(𝑇𝑒𝑟𝑚1 , 𝑇𝑒𝑟𝑚2 ) The mean value of the semantic similarity of “processor” and “memory,” which is 0.179, is obtained by dividing the sum of the 15 values in Table 2 by 15. After obtaining the term similarity matrix, Equation (3) can be used to obtain the distance matrix between terms. This module can subsequently utilize the distance matrix to group terms, as shown in Figure 4. We have: Distance(𝑇𝑒𝑟𝑚1 , 𝑇𝑒𝑟𝑚2 ) = 1 − Similarity(𝑇𝑒𝑟𝑚1 , 𝑇𝑒𝑟𝑚2 )
(3)
Figure 4. Term groups. Figure 4. Term groups.
In the original vector space model, distinct terms such as “CPU” and “processor” formed In the original vector space model, distinct terms such as “CPU” and “processor” formed independent dimensions. However, “CPU” and “processor” should possess a certain degree of independent dimensions. However, “CPU” and “processor” should possess a certain degree of semantic similarity. For instance, if a user intends to search for patents related to “CPU”, patents related semantic similarity. For instance, if a user intends to search for patents related to “CPU”, patents to “processor” must not be omitted because overlooking technology patents that use synonyms could related to “processor” must not be omitted because overlooking technology patents that use result in infringement. Therefore, in this study, terms with a certain degree of semantic similarity synonyms could result in infringement. Therefore, in this study, terms with a certain degree of 4. Term groups. (i.e., in the same cluster) were viewed Figure as identical terms. In Figure 4, for instance, because “CPU” and “processor” were grouped into the same cluster, they were considered identical terms and shared In the original vector space model, distinct terms such as “CPU” and “processor” formed the same dimension in the vector space model. This method was used to enhance the precision of independent dimensions. However, “CPU” and “processor” should possess a certain degree of calculating patent document similarities. semantic similarity. For instance, if a user intends to search for patents related to “CPU”, patents related to “processor” must not be omitted because overlooking technology patents that use synonyms could result in infringement. Therefore, in this study, terms with a certain degree of
Sustainability 2018, 10, 3729
9 of 18
3.3. Document Similarity Calculation Module In this module, terms in the same cluster (generated in the previous module) were considered related synonyms, and the term frequency-inverse document frequency (TF-IDF) method of information retrieval was used to calculate their weights. The concept of term frequency (TF) states that the more frequently a term appears in a document, the higher its weight should be. In contrast, in inverse document frequency (IDF), terms occurring in a greater number of documents are relatively less relevant and should be weighted less. In the TF-IDF formula, f reqi,j represents the number of occurrences of the word j in the file i, M represents the number of files, m j represents the number of files containing the word j, and Wij is the weight of word j in file i. The TF-IDF formula is as follows: t f i,j =
f reqi,j MAX f reqi,j
id f j = log
M mj
Wij = t f ij × id f j
(4)
The simple structure and ease of use of this method have enabled various applications of patent text analysis [23,37]. Additionally, Ref. [38] purports that employing a cosine measure to calculate the similarity between two documents in a vector space model generally results in better performance. Therefore, this study utilized a cosine measure [21] with the following formula: similarity( Doc1 , Doc2 ) =
Doc1 × Doc2 | Doc1 | × | Doc2 |
(5)
Here, Doc1 is expressed as {w11 , w12 , w13 . . . w1n } and Doc2 as {w21 , w22 , w23 . . . w2n }; n represents the number of terms; and, wij represents the weight of term j in document i generated using TF–IDF. Thus, the similarity matrix of a document can be obtained. 3.4. Multidimensional Scaling-Based Dimensionality Reduction Module The multidimensional similarity matrix resulting from the similarity calculation for a regular document is difficult to read. To render the relationships among patents more understandable, multidimensional matrices should be converted into low-dimensional patent maps. To reduce the number of dimensions for this purpose, we used MDS [39]. However, reducing the number of dimensions in the source data while retaining the original relationships among the data doubtlessly generates information loss problems. Therefore, the quality of the results obtained from using multidimensional scaling was measured using stress values calculated as follows: v 2 u u n −1 n u ∑i=1 ∑ j=i+1 dij − dij 0 Stress = t ∑in=−11 ∑nj=i+1 dij 2
(6)
v u p 2 u dij = t ∑ xik − x jk
(7)
k =1
Here, n represents the number of data, p the number of dimensions, xik –xjk the gap between the data points xi and xj in dimension k, dij the distance between the two data points prior to dimensionality reduction, and dij 0 the distance following dimensionality reduction. Following the use of multidimensional scaling, stress values range between 0 and 1. According to [40], if a stress value is less than 0.2, the result following dimensionality reduction is acceptable. The closer a stress value is to
Sustainability 2018, 10, x FOR PEER REVIEW Sustainability 2018, 10, 3729
10 of 18 10 of 18
use of multidimensional scaling, stress values range between 0 and 1. According to [40], if a stress value is less than 0.2, the result following dimensionality reduction is acceptable. The closer a stress 0, the more precisely a result has retained the original relationships among data. This study employed value is to 0, the more precisely a result has retained the original relationships among data. This study classic MDS from Quick-R in the R language to reduce dimensionality. employed classic MDS from Quick-R in the R language to reduce dimensionality. In Figure 5, each point represents a patent document plotted in R’s plot function. This figure In Figure 5, each point represents a patent document plotted in R’s plot function. This figure retains the similarity thatis, is,aashorter shorterdistance distance between two points retains the similarityrelationships relationshipsbetween between documents; documents; that between two points indicates a higher similarity between those two documents. Documents 122 and 15 are closer to each indicates a higher similarity between those two documents. Documents 122 and 15 are closer to each other than are are documents 57 and map can assist in firms understanding the distribution other than documents 57 15. andThis 15. patent This patent map canfirms assist in understanding the of distribution technology while they establish development strategies to avoid developing technologies that would of technology while they establish development strategies to avoid developing result in patentthat infringement. technologies would result in patent infringement.
Figure results. Each Eachpoint pointrepresents representsa apatent patent document number. Figure5.5.Patent Patentmap mapshowing showing MDS MDS results. document number.
3.5. Document 3.5. DocumentOutlierness OutliernessCalculation Calculation Module Module InIn the opportunitieswithin withina ahigh-density high-density cluster of patents, thepatent patentmap, map,to todetect detect technological technological opportunities cluster of patents, thethecluster must first be searched for patents with lower similarity, which indicates fewer R&D cluster must first be searched for patents with lower similarity, which indicates fewer R&D individuals by similarly similarlyfew fewpatents. patents.Although Although data have individualsinvolved involvedininthe thetechnologies technologies covered covered by thethe data have different operate favorably favorably[41]. [41].Therefore, Therefore, this module differentareas areasofofdensity, density,the theLOF LOFmethod method can can still operate this module employed thethe LOF method proposed by Breunig et al.et [42] calculate the outlierness of each document. employed LOF method proposed by Breunig al.to[42] to calculate the outlierness of each document. Thethe concept the if LOF that density if the local of a document lesslocal than densities the local of The concept of LOF isofthat theislocal of adensity document is less thanis the densities of k of itsthat neighbors, that document higher outlierness. Thecan LOF be calculated k of its neighbors, document possesses possesses higher outlierness. The LOF becan calculated by the by the following equations. following equations. 0 that k-distance:The Thedistance distancebetween betweenthe thekth kthnearest nearestpoint pointand andthe thepoint pointdoc doc′ that closest k-distance: is is closest to to thethe data datadoc point doc is called the k-distance the point doc denoted and denoted k-distance (doc). point is called the k-distance of theofpoint doc and k-distance (doc). Reachabilitydistance: distance:The Thereachability reachability distance When thethe parameter Reachability distance isisrelated relatedtotothe thek-distance. k-distance. When parameter k is given, the reachable distance from the data point doc to the data point doc′ is called Reachability 0 k is given, the reachable distance from the data point doc to the data point doc is called Reachability Distancek(doc ← doc′). It equals the data point doc′0 of the k-adjacent distance of the data point doc and 0 Distance k (doc ← doc ). It equals the data point doc of the k-adjacent distance of the data point doc and the maximum distance between the data points doc and doc′. We have the maximum distance between the data points doc and doc0 . We have ′
′
ReachabilityDistance𝑘 (𝑑𝑜𝑐 ← 𝑑𝑜𝑐 ) = max {𝑑(𝑑𝑜𝑐, 𝑑𝑜𝑐 ), Distance𝑘 (𝑑𝑜𝑐)} ReachabilityDistancek doc ← doc0 = max d doc, doc0 , Distancek (doc)
(8)
(8)
Local reachability density: The definition of local reachability density is based on the reachable distance. For the data density: point doc′, thedefinition data point distance from doc is is less thanon or the equal to kLocal reachability The ofwhose local reachability density based reachable
distance. For the data point doc0 , the data point whose distance from doc is less than or equal to
Sustainability 2018, 10, 3729
11 of 18
k-distance(doc) is called its k-nearest-neighbor and denoted Nk (doc). The local reachability density of the data point doc is the reciprocal of its average reachability distance from adjacent data points: Local Reachability Density, lrdk (doc) =
k Nk (doc)k 0 ReachabilityDistance ∑doc0 ∈ Nk (doc) k ( doc ← doc )
(9)
Local outlier factor: According to the definition of local reachability density, if a data point is farther away from other points, then its local reachability density is obviously small. However, the LOF algorithm measures the anomaly of a data point: not its absolute local density, but the relative density of its neighboring data points. The advantage of this is that it allows for an uneven distribution of data and different densities. The local anomaly factor is defined by the local relative density. The local relative density of the data point doc (local anomaly factor) is the ratio of the average local reachability density of the neighbors of doc to the local reachability density of doc. We have:
Local Outlier Factor, LOFk (doc) =
∑doc0 ∈ Nk (doc)
lrdk (doc0 ) lrdk (doc)
k Nk (doc)k
(10)
where doc is the current document for which outlierness is being calculated, d(doc,doc0 ) is the Euclidean distance between doc and doc’, Distancek (doc) is the Euclidean distance between doc and another neighbor k, and Nk (doc) is the collection of all documents whose Euclidean distance from doc is less than Distancek (doc). Following the calculation of the outlierness of each patent document, an outlierness ranking can be obtained. A higher outlierness ranking indicates a lower number of similar patents. The outlier patents, in an overall sense, were more novel than non-outlier patents in terms of related technologies and potential business opportunities [24]. 4. Term Analysis and Results 4.1. Data Set The data set for this study was a collection of patent documents from the United States Patent and Trademark Office (USPTO) that had the strings “USB connector” or “Universal Serial Bus connector” in the title fields and had patent issue dates between 2005 and 2014. A total of 152 documents meeting these criteria were retrieved. Twenty-eight documents were design patents without text in the patent abstract or patent application fields, so they were excluded. A final 124 invention patents were used as the data set for this study. 4.2. Assessment Indicators The assessment indicators for the term analysis were computed using the R package ROCR and examined whether the inclusion of a semantic net could enhance the effectiveness of patent retrieval. The assessment indicators were precision, recall, and F-measure. Precision refers to how many documents retrieved by the system were relevant, and recall is defined as how many of the existing relevant documents were retrieved [22]. The F-measure is the harmonic mean of precision and recall. The formulas are presented below and in Table 3. TP means true positive and corresponds to the number of positive examples correctly predicted by the classification model. FP is false positive and corresponds to the number of negative examples wrongly predicted to be positive by the classification model. We have Precision =
TP = the fraction of the relevant documents that have been retrieved. TP + FP
(11)
TP = the fraction of the retrieved documents that are relevant. TP + FN
(12)
Recall =
Sustainability 2018, 10, 3729
12 of 18
F-measure =
2 ∗ Precision ∗ Recall Precision + Recall
(13)
Table 3. Confusion Matrix. Gold Standard
Test Outcome
Positive Negative
Positive
Negative
True Positive (TP) False Negative (FN)
False Positive (FP) True Negative (TN)
4.3. Term Analysis 1: Examining the Effect of Semantic Nets on Patent Retrieval In this term analysis, similarities between the patent Doc x , which was marked as most familiar by patent engineers, and 10 other patents (Doca –Docj ) were indicated from 0 to 1. These indicators are presented as the first column in Table 4. According to the patent engineers, the documents marked with similarities of 0.6 and higher were documents that required retrieval, which included Doc a , Docb , Docc , and Docd . Obtained similarities that were not included in the WordNet calculations are shown as the second column in Table 4. Under the same condition (a similarity threshold of 0.6), the documents retrieved by this system were Doc a , Docb , and Doc f . Therefore, without inclusion in the semantic net, the precision was 0.667, the recall was 0.5, and the F-measure was 0.571. The similarities obtained with inclusion in the WordNet system search are shown as the third column in Table 4. Given the same similarity threshold (0.6), the documents retrieved by this system were Doc a , Docb , Docc , Doc f , and Doc g . Therefore, with inclusion in WordNet, the precision was 0.6, the recall was 0.75, and the F-measure was 0.667. Finally, because the F-measure value with inclusion in WordNet was higher than the F-measure value without inclusion in the semantic net, inclusion in WordNet demonstrably increases the effectiveness of patent searches. Table 4. Semantic Net Similarity Comparison.
Similarity (Docx , Doca ) Similarity (Docx , Docb ) Similarity (Docx , Docc ) Similarity (Docx , Docd ) Similarity (Docx , Doce ) Similarity (Docx , Docf ) Similarity (Docx , Docg ) Similarity (Docx , Doch ) Similarity (Docx , Doci ) Similarity (Docx , Docj )
Similarity Marked by Patent Engineers
Similarity Obtained without Inclusion in Semantic Net
Similarity Obtained with Inclusion in Semantic Net
0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
0.6 0.7 0.5 0.4 0.4 0.6 0.5 0.4 0.1 0.2
0.7 0.8 0.6 0.5 0.5 0.7 0.6 0.5 0.2 0.3
4.4. Term Analysis 2: Examining Patent Documents with Higher Outlierness After the document similarity matrix undergoes dimensionality reduction using MDS, we can construct a patent map that can distinguish patent document similarities. Figure 6 shows the patent map constructed in this study. The numerical portion of the text string in the figure indicates the U.S. patent number of the patent document; patents with higher similarity are also closer in the figure. For future assessments of new patents, the patent must only be added to the data set, followed by applying the same steps. Thus, this patent map can be referenced to perform a preliminary judgement on patents with high similarity to this patent.
Sustainability 2018, 10, 3729
13 of 18
Sustainability 2018, 10, x FOR PEER REVIEW
13 of 18
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1
2
3
4
5
6
7
8
9
10
Similarity Marked by Patent Engineers
Similarity Obtained without Inclusion in Semantic Net Similarity Obtained with Inclusion in Semantic Net Figure 6. Semantic net similarity comparison (Table (Table 4: 4: Similarity Similarity Comparison). Comparison).
In Figure represents a patent document, and the area shows USB-related Figure7,7,each eachpoint point represents a patent document, andsurrounding the surrounding area shows USBtechnologies. By comparing the density each point andpoint its neighboring points, we points, can determine related technologies. By comparing the of density of each and its neighboring we can whether thewhether point isthe abnormal. If the density of density the point lower, more likely to belikely abnormal. determine point is abnormal. If the ofisthe pointitisislower, it is more to be The density is density calculated based onbased the distance betweenbetween the points. If the Ifdistance is higher, the abnormal. The is calculated on the distance the points. the distance is higher, density is lower; if the distance is lower, the the density is higher. If the LOF score of the data point is the density is lower; if the distance is lower, density is higher. If the LOF score of the data point approximately 1, the local densities of the data point with its neighbors are similar; if the data point’s is approximately 1, the local densities of the data point with its neighbors are similar; if the data LOF score is score less than 1, the point in a relatively dense area, unlike abnormal point; and, if point’s LOF is less thandata 1, the dataispoint is in a relatively dense area, an unlike an abnormal point; the point’s LOF score muchislarger 1, than the data point moreisalienated from other and,data if the data point’s LOFisscore muchthan larger 1, the dataispoint more alienated frompoints, other which that it isthat an it abnormal point. Outlier patents, in an overall sense, sense, are more points,indicates which indicates is an abnormal point. Outlier patents, in an overall are novel more than patents in terms of related technologies and potential business opportunities [20]. novelnon-outlier than non-outlier patents in terms of related technologies and potential business opportunities Figure 7 implements its own to calculate the results; its results are different from those plotted [20]. Figure 7 implements itspackage own package to calculate the results; its results are different from those by R’s plotting functionfunction package.package. plotted by R’s plotting 100 Item a b c d e f g h i j
Table 5. Outlierness.
U.S. Patent Number Outlierness Item U.S. Patent Number Outlierness 90 s:1.2531.31 8672692 1.915 k 8062067 t:1.224 c: 1.432 b: 1.552 d: 1.411 1.552 7234963 l 7645154 1.30 80 e: 1.386 6902432 1.432 m 8043101 1.29 7699633 1.411 n 8308507 1.28 70 8900012 1.386 o 7921233 1.28 0.296 8814583 n:1.275 1.384 p 7591657 1.28 60 0.348 7811111 1.379 r 8900016 1.26 j: 1.320 0.339 8172585 1.369 s 8047877 1.26 50 m:1.293 1.352 7172460 t 6808400 1.26 0.641 40 8900017 1.320 u 8235746 1.25 o:1.275 0.494 30 0.684 20 f: 1.384 l:1.304 p:1.264 g: 1.379 i: 1.352 10 h:1.369 q:1.261 r:1.258 a: 1.915 k:1.306 0 0 20 40 60 80 100
Figure 7. Local outlier factor patent map (Table 5: Outlierness).
point’s LOF score is less than 1, the data point is in a relatively dense area, unlike an abnormal point; and, if the data point’s LOF score is much larger than 1, the data point is more alienated from other points, which indicates that it is an abnormal point. Outlier patents, in an overall sense, are more novel than non-outlier patents in terms of related technologies and potential business opportunities [20]. Figure2018, 7 implements its own package to calculate the results; its results are different from14those Sustainability 10, 3729 of 18 plotted by R’s plotting function package. 100 90
s:1.253 t:1.224 b: 1.552
c: 1.432 d: 1.411 e: 1.386
80 70
0.296
60
n:1.275
0.348
j: 1.320
0.339
50 m:1.293 0.641 o:1.275
40
0.494
30 0.684 20 l:1.304 i: 1.352
10 a: 1.915 0 0
20
g: 1.379 r:1.258
k:1.306 40
60
f: 1.384 p:1.264 h:1.369 q:1.261 80
100
Figure 7. Local outlier factor patent map (Table (Table 5: 5: Outlierness). Outlierness).
According to the definition of the local anomaly factor, if the LOF score of the data point doc is approximately 1, the local density of the data point doc is similar to that of its neighbors; if the LOF score of doc is less than 1, doc is in a relatively dense area, unlike an abnormal point. If the LOF score of the data point doc is much larger than 1, doc is farther away from other points and is likely to be an abnormal point. For each data point, we calculate its distance from all other points and sort from the nearest data point to the farthest data point. We then find its k-nearest-neighbor and calculate the LOF score. Table 5 displays the top 20 patent documents in terms of outlierness ranking obtained after using the LOF method to calculate the outlierness of each patent document. Patent documents with high outlierness rankings indicate fewer similar patents, suggesting a gap in related technologies and concomitant business opportunities. Not all outlier patents deliver new approaches to technological development; some provide fresh or unusual signals for further technological development. In the competitive technological environment, an early grasp of potential technological opportunities is important for developing technologies that can increase the competitiveness of a business [13]. In our numerical analysis of the outlierness rankings, only the top two outlier values were higher than 1.5. Furthermore, a detailed reading of these 20 patents indicated that all 18 patents with outlier values lower than 1.5 were design patents related to the body structures of USB connectors. The two patents with outlier values higher than 1.5 were more closely oriented toward applicability and convenience. The USB connectors currently on the market are extremely similar, and most consumers would not notice design variations related to the body structures; however, increased applicability and convenience can become selling points. This study was based on patent documents from 2005 to 2014. A subsequent query using Google Advanced Patent Search retrieved 31 USB patents that were approved between January and December 2015 worldwide. Among these, only 4 were related to body structure; the remaining 27 were related to increasing applicability and convenience. Patent documents, which contain abundant highly credible technical information and crucial research results, are highly useful, valuable sources of knowledge. Therefore, this study employed the LOF method to calculate the outlierness of each patent document (Table 5). Only the top two outlier values were higher than 1.5, so we used the level of outlierness to discover potential technological or business opportunities related to the following two patents.
Sustainability 2018, 10, 3729
15 of 18
U.S. Patent No. 8672692 is primarily a USB connector structure. There is a component that can be inserted into a USB port, which has a terminal on one side, and a connector that links the terminal unit and the internal USB circuit. A reinforcing element is provided on the other side of the plug for protection, and an insulator is placed on its surface. U.S. Patent No. 7234963 is a design that corrects the orientation of the wire on a USB plug connector. It has a top cover, a connector, a wire, a wire slot, a cable rotating seat, a bottom cover, etc. It can prevent the fall of the USB transmission line due to rotation during use. 5. Conclusions The semantic net similarity comparison table indicates that in patent documents, semantic inconsistencies are caused by different modifiers being used to describe the same operation. When the system is searching, it cannot distinguish synonyms, which causes patents that should have been retrieved to be overlooked. At present, no worldwide uniform or similar adjective guidelines exist for patent documents. In addition, patent specifications must describe the contents of the patent in detail; once a patent has been published, all the technical details of the patent are made public. Therefore, to protect patent-holder interests, some patent applications are reserved in their descriptions, play word games, or even contain traps. These all create obstacles during system searches and substantially decrease search precision. However, given that the purpose of patents is to protect the rights and interests of inventors, these obstacles may be another method of protecting patents. To alleviate them, in addition to unifying term modifiers in relevant standards and regulations, technical means can be applied to enhance search precision. This study used WordNet synonyms to calculate the similarities between pairs of terms and merge synonyms. Further investigation is warranted to further increase the precision of system searches. Since polysemous words frequently occur in WordNet, this study proposed a different method than word-sense disambiguation (WSD) to decrease the calculated degree of distortion between two terms. For instance, “watch” can be interpreted as the noun meaning “wristwatch” or the verb meaning “to pay attention to.” The word “star” can be interpreted as “a star in space” or “a celebrity.” WSD determines the correct semantic meaning of a term in a document from numerous possibilities. A fixed grammatical structure in a language can be used to determine which semantic meaning should be attributed to a term; for example, in English, prepositions can be followed only with nouns, pronouns, or noun phrases. Neighboring words can be referenced to determine semantic meaning as well. Consider the term “pine cone.” “Pine” has two meanings in the dictionary: “a type of evergreen tree with needle-shaped leaves” and “to waste away through sorrow or illness.” “Cone” has three meanings: “a solid body which narrows at a point,” “something of this shape whether solid or hollow,” and “the fruit of a certain evergreen tree.” Therefore, the intersecting semantic meanings, namely, “evergreen” and “tree,” should be selected. Additionally, a term usually represents only one meaning in a document, which can also be used to limit the meanings of terms. Finally, if expert dictionaries can be established in the future for different areas of expertise, these dictionaries can be used to limit the meanings of terms or determine technical terms more precisely. Through these WSD methods, term similarity can be more precisely calculated. The rapid development of technology and the accumulation of patents have led to an immense number of patent documents, and sorting through them directly results in information overload. Therefore, this study proposed a more efficient method to distinguish patent document similarities [38]. It involves extracting the titles, abstracts, and application fields of USPTO patent documents, preprocessing the text in these fields, and using the WUP method to calculate similarities between terms to obtain term similarity matrices. These matrices are used to group terms, after which the TF–IDF method is used to calculate term weights and a cosine measure is used to calculate similarities between two patent documents. After the document similarity matrices have been obtained, a patent map can be constructed by MDS.
Sustainability 2018, 10, 3729
16 of 18
Therefore, calculating outlier values to sustainably detect technological opportunities is a viable approach when new patents appear. However, further investigation is warranted to improve the precision of verifying assessment indicators of patent outlierness. Manual verification remains the most precise method. However, if the number of documents is large, a substantial amount of time is required for review; if the number of documents is too small, the results do not possess adequate integrity, which results in an inconsistency between the intended level of persuasiveness and the available sample size. Thus, in the future, the number of patent documents in a patent’s citation field and measures of the increase in patent documents related to the patent’s technology field can serve as references in the development of relevant assessment indicators for verifying document outlierness. The indicators thus developed can be made more persuasive. This study integrated all operations into a single development environment. Because different programming languages are suitable for different fields, this study used multiple programming languages for different modules. In the future, if the functions of data mining and WordNet kits become increasingly comprehensive, the modules employed in this study can be integrated into a single development environment to reduce the complexity of development and enhance the overall performance of the implementation. This research limitation required us to avoid working with too many words; only the patent title field, the patent summary field and the text of the patent application scope field were used for processing. U.S. Patent Nos. 8672692 and 7234963 are USB body structure patents. In Term Analysis 2, we mentioned that there were only four body structure patents in the world in 2015. The LOF patent map outliers and surrounding areas, which represent technologies related to USB body structure, indicate potential technological and business opportunities. The primary goal of this study was to propose a method to reduce term dimensionality, specifically by grouping terms using term similarity matrices and merging semantically similar terms. In the related fields of information retrieval, data mining, and text mining [19], an immense number of eigenvalues or sparse matrices formed from terms tend to substantially reduce the effectiveness of the overall implementation. The method proposed in this study can reduce term dimensionality and facilitate user understanding through visualization. This method can also assist firms in formulating development strategies for avoiding patent infringement while sustainably discovering technological opportunities to achieve future competitive advantages. Author Contributions: H.C.W. designed this research and collected the data set for the term analysis. Y.C.C. and P.L.H. analyzed the data to show the validity of this study. Y.C.C. wrote the paper and performed the research steps. In addition, the authors have all collaborated in revising the paper. Funding: This research was funded by the Taiwan Ministry of Science and Technology grant number 107-2410-H-006-040-MY3. Conflicts of Interest: The authors declare no conflicts of interest.
References 1. 2. 3. 4. 5. 6.
Park, H.; Yoon, J.; Kim, K. Identifying patent infringement using SAO based semantic technological similarities. Scientometrics 2012, 90, 515–529. [CrossRef] Mukherjea, S.; Bamba, B.; Kankar, P. Information retrieval and knowledge discovery utilizing a biomedical patent semantic web. IEEE Trans. Knowl. Data Eng. 2005, 17, 1099–1110. [CrossRef] Chen, Y.L.; Chiu, Y.T. An IPC-based vector space model for patent retrieval. Inf. Process. Manag. 2011, 47, 309–322. [CrossRef] Lee, S.; Yoon, B.; Park, Y. An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation 2009, 29, 481–497. [CrossRef] WIPO. Some Basic Information. 1998. Available online: http://www.wipo.int/portal/en/index.html (accessed on 10 May 2017). Song, K.; Kim, K.S.; Lee, S. Discovering new technology opportunities based on patents: Text-mining and F-term analysis. Technovation 2017, 60–61, 1–14. [CrossRef]
Sustainability 2018, 10, 3729
7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.
20. 21. 22. 23. 24. 25. 26.
27.
28. 29. 30. 31. 32.
17 of 18
World Intellectual Property Indicators. Some Basic Information. 2016. Available online: http://www.wipo. int/publications/en/details.jsp?id=4138&plang=EN (accessed on 12 June 2017). Chen, Y.L.; Chang, Y.C. A three-phase method for patent classification. Inf. Process. Manag. 2012, 48, 1017–1030. [CrossRef] Rosso, P.; Correa, S.; Buscaldi, D. Passage retrieval in legal texts. J. Log. Algebr. Program. 2011, 80, 139–153. [CrossRef] Tseng, Y.H.; Lin, C.J.; Lin, Y.I. Text mining techniques for patent analysis. Inf. Process. Manag. 2007, 43, 1216–1247. [CrossRef] Ernst, H. Patent portfolios for strategic R&D planning. J. Eng. Technol. Manag. 1998, 15, 279–308. [CrossRef] Park, H.; Yoon, J.; Kim, K. Identification and evaluation of corporations for merger and acquisition strategies using patent information and text mining. Scientometrics 2013, 97, 883–909. [CrossRef] Janghyeok, Y.; Hyunseok, P.; Kwangsoo, K. Identifying technological competition trends for R&D planning using dynamic patent maps: SAO-based content analysis. Scientometrics 2013, 94, 313–331. [CrossRef] Jeong, C.; Kim, K. Creating patents on the new technology using analogy-based patent mining. Expert Syst. Appl. 2014, 41, 3605–3614. [CrossRef] Meng, L.; Huang, R.; Gu, J. A review of semantic similarity measures in wordnet. Int. J. Hybrid Inf. Technol. 2013, 6, 1–12. Yoon, J.; Kim, K. Detecting signals of new technological opportunities using semantic patent analysis and outlier detection. Scientometrics 2012, 90, 445–461. [CrossRef] Yoon, J.; Kim, K. Identifying rapidly evolving technological trends for R&D planning using SAO-based semantic patent networks. Scientometrics 2011, 88, 213–228. [CrossRef] Walter, L.; Radauer, A.; Moehrle, M.G. The beauty of brimstone butterfly: Novelty of patents identified by near environment analysis based on text mining. Scientometrics 2017, 111, 103–115. [CrossRef] Alves, T.; Rodrigues, R.; Costa, H.; Rocha, M. Development of Text Mining Tools for Information Retrieval from Patents. In Proceedings of the International Conference on Practical Applications of Computational Biology & Bioinformatics, Porto, Portugal, 21–23 June 2017; pp. 66–73. [CrossRef] Kayser, V.; Blind, K. Extending the knowledge base of foresight: The contribution of text mining. Technol. Forecast. Soc. Chang. 2016, 116, 208–215. [CrossRef] Shen, Y.C.; Lin, G.T.; Lin, J.R.; Wang, C.H. A Cross-Database Comparison to Discover Potential Product Opportunities Using Text Mining and Cosine Similarity. J. Sci. Ind. Res. 2017, 76, 11–16. Kim, J.; Choi, J.; Park, S.; Jang, D. Patent Keyword Extraction for Sustainable Technology Management. Sustainability 2018, 10, 1287. [CrossRef] Roh, T.; Jeong, Y.; Yoon, B. Developing a Methodology of Structuring and Layering Technological Information in Patent Documents through Natural Language Processing. Sustainability 2017, 9, 2017. [CrossRef] Edilson, A.C.; Alneu, A.L.; Diego, R.A. Word sense disambiguation. Inf. Sci. 2018, 442, 103–113. [CrossRef] Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41. [CrossRef] Introduction to Patent Map Analysis, Asia Pacific Industrial Property Center, JIII. Available online: https://www.jpo.go.jp/torikumi_e/kokusai_e/training/textbook/pdf/Introduction_to_Patent_ Map_Analysis2011.pdf (accessed on 15 July 2017). Sagarra, M.; Mar-Molinero, C.; Garcıa-Cestona, M. Spanish savings banks in the credit crunch: Could distress have been predicted before the crisis? A multivariate statistical analysis. Eur. J. Financ. 2015, 21, 195–214. [CrossRef] Weiwei, X.; Liya, S.; Xiang, W. Human Motion Behavior Segmentation based on Local Outlier Factor. Open Autom. Control Syst. J. 2015, 7, 540–551. [CrossRef] Mong, G. Research and Application of Abnormal Data Mining Algorithm. 2013. Available online: http: //wap.cnki.net/lunwen-1013309998.html (accessed on 20 October 2017). Trappey, C.V.; Wu, H.Y.; Taghaboni-Dutta, F.; Trappey, A.J.C. Using patent data for technology forecasting: China RFID patent analysis. Adv. Eng. Inform. 2011, 25, 53–64. [CrossRef] Wang, M.Y.; Fang, S.C.; Chang, Y.H. Exploring technological opportunities by mining the gaps between science and technology: Microalgal biofuels. Technol. Forecast. Soc. Chang. 2015, 92, 182–195. [CrossRef] Hongbin, K.; Junegak, J.; Kwangsoo, K. Semi-automatic extraction of technological causality from patents. Comput. Ind. Eng. 2018, 115, 532–542. [CrossRef]
Sustainability 2018, 10, 3729
33. 34. 35. 36. 37. 38. 39. 40. 41. 42.
18 of 18
Daniel, J.; James, H.M.; Peter, N.; Stuart, R. Speech and Language Processing, 2nd ed.; Pearson Education India: London, UK, 2014; ISBN 0131873210. Rada, R.; Mili, H.; Bicknell, E.; Blettner, M. Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern. 1989, 19, 17–30. [CrossRef] Wu, Z.; Palmer, M. Verbs semantics and lexical selection. In Proceedings of the 32nd annual Meeting on Association for Computational Linguistics, Las Cruces, NM, USA, 27–30 July 1994. [CrossRef] Banerjee, S.; Pedersen, T. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003. Janghyeok, Y.; Wonchul, S.; Byoung, Y.C.; Inseok, S.; Jae-Min, L. Identifying product opportunities using collaborative filtering-based patent analysis. Comput. Ind. Eng. 2017, 107, 376–387. [CrossRef] Wan, X. A novel document similarity measure based on earth mover’s distance. Inf. Sci. 2007, 177, 3718–3730. [CrossRef] Tenreiro Machado, J.A.; Lopes, A.M.; Galhano, A.M. Multidimensional Scaling Visualization Using Parametric Similarity Indices. Entropy 2015, 17, 1775–1794. [CrossRef] Kruskal, J.B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 1964, 29, 1–27. [CrossRef] Pang-Ning, T.; Michael, S.; Vipin, K. Introduction to Data Mining, 1st ed.; Addison-Wesley: Boston, MA, USA, 2008; ISBN 0321321367. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Osaka, Japan, 20–23 May 2008; pp. 93–104. [CrossRef] © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).