A Deep Patent Landscaping Model using Transformer and Graph Convolutional Network

Seokkyu Choi1, Hyeonju Lee1, Eunjeong Park2,∗, Sungchul Choi1,∗
1 TEAMLAB, Gachon University, Republic of Korea
2 Papago, NAVER, Republic of Korea
{korest75, l67378810}@gmail.com, [email protected], [email protected]

Abstract

Patent landscaping is a method employed for searching related patents during a research and development (R&D) project. To avoid the risk of patent infringement and to follow current trends in technology development, patent landscaping is a crucial task that needs to be conducted during the early stages of an R&D project. The process of patent landscaping generally requires several advanced resources and can be tedious, and it has to be repeated throughout the duration of an R&D project. For these reasons, the demand for automated patent landscaping is gradually increasing. However, the shortage of well-defined benchmarking datasets and comparable models makes it difficult to find related research studies. In this paper, an automated patent landscaping model based on deep learning is proposed. The proposed model comprises a modified transformer structure for analyzing textual data present in patent documents and a graph convolutional network for analyzing patent metadata. Six patent landscaping benchmarking datasets, processed by Korean patent attorneys, are provided as resources for comparing related research studies. The results indicate that the proposed model with the proposed datasets attains state-of-the-art performance, with a mean classification accuracy of 98%.

1 Introduction

A patent is a significant deliverable of research and development (R&D) projects. A patent protects an assignee's legal rights and also represents current technology development trends. To study technological trends and identify potential infringement patents, the majority of R&D projects conduct patent landscaping, which involves collecting and analyzing patent documents related to the project.

∗ Co-corresponding authors

Generally, the task of patent landscaping is a human-centric, tedious, and expensive process [Trippe, 2015; Abood and Feltenberger, 2018]. Researchers and patent attorneys create keyword sets for querying related patents in large patent databases, eliminate unrelated patent documents, and extract valid patents related to their project. The retrieved valid patents can then be represented as a contour map by employing various visualization techniques [Yang et al., 2010; Wittenburg and Pekhteryev, 2015]. However, because the participants in this process must also be familiar with the scientific domain, the process is costly. Furthermore, the patent landscaping task has to be repeated regularly during an ongoing project to search for newly published patents.

In this paper, a semi-supervised deep learning model for patent landscaping is proposed. The proposed model eliminates repeated patent landscaping tasks after the initial selection of patent keywords and valid patent documents by employing a classification model based on deep learning. The proposed model incorporates a modified transformer structure [Vaswani et al., 2017] and a graph convolutional network (GCN) [Kipf and Welling, 2016]. Since a patent document contains several textual features as well as bibliometric data, the modified transformer structure is applied to the textual data and the GCN is applied to the bibliometric data fields, which can be represented as networks. To compare the performance of the proposed model with existing models, benchmarking datasets for patent landscaping are introduced. Owing to issues such as high cost and data security, appropriate benchmarking datasets are generally not available, as they must be created by human experts such as patent attorneys. Through the release of six datasets generated through the Korea Intellectual Property Strategy Agency (KISTA), the authors aim to contribute research resources for machine learning-based patent landscaping research. Experimental results indicate that the proposed model with the proposed datasets attains state-of-the-art performance, improving the average classification accuracy on each dataset by approximately 15%.

2 Patent Landscaping

The entire process of patent landscaping is shown in Figure 1. Generally, when searching for patent documents, a keyword query is generated using various technological keywords.

Figure 1: The general process of patent landscaping (Topic Selection, Keyword Candidate Listing, Search Query Formulation, Patent Retrieval, Valid Patent Selection, Patent Landscape Report; the steps from Search Query Formulation to Valid Patent Selection are regularly repeated)

Because many assignees do not want their patents to be discovered easily in a search, to gain an advantage in any infringement issues that may arise, they tend to write patent titles and abstracts very generically or omit technical details [Tseng et al., 2007]. To deal with this problem, researchers work on completing search queries with complicated words or key phrases to reinforce the search [Magdy and Jones, 2011]. For example, a search query such as the one shown in the box below is created.

("underwater glider*" or "underwater vehicle*" or "underwater drone") AND (emergency or urgency) AND (hull or shape or body or drag or "fluid force" or "fluid power" or CFD or "computational fluid dynamic*" or PMM or "Planar Motion Mechanism" or design* or test or measure* or sensing or control* or overcome or obstacle or obstruction or "wireless communicat*" or "satellite communicat*" or "command protocol" or battery or cell or power or launch or recovery or server or database or display*)

Our primary focus here is the regularly repeated loop in Figure 1 that goes back from valid patent selection to search query formulation. Once the search query is created, it is necessary to track newly published patents regularly using a similar query. Because the first round of valid patent selection is similar to creating a training dataset for supervised learning, the repetitive selection work can be addressed with text classification. For this reason, patent classification is an area that has been studied steadily over the past decade [Sureka et al., 2009; Chen and Chiu, 2011; Lupu et al., 2013]. Relatively recent studies have tackled International Patent Code (IPC) classification using long short-term memory (LSTM) networks [Shalaby et al., 2018] and a text convolutional neural network (text-CNN) [Li et al., 2018]. Google has developed automated patent landscaping technology using a deep neural network [Abood and Feltenberger, 2018].

However, the biggest weakness of previous studies is the lack of a suitable benchmarking dataset. Unlike the purpose of actual patent landscaping, an IPC classification study predicts a suitable IPC code for each patent, a code that is already assigned to every patent by assignees and patent examiners; this is not useful for patent landscaping work in the real world. In the case of Google's work, heuristic algorithms are used to generate the training dataset from broad technology keywords, such as the Internet of things. Because the keywords and datasets are too general, there is uncertainty in applying the model to real-world R&D projects that handle specific and detailed technologies for patent landscaping.

Dataset   Full name
ENDV      Ensuring Technology of Nighttime Driver Visibility
NGUG      Next Generation Technology of Underwater Glider
MPUART    Marine Plant Technology Using Augmented Reality Technology
1MWDFS    Technology for 1MW Dual Frequency System
MRRG      Technology for Micro Radar Rain Gauge
GOCS      Technology for Geostationary Orbit Complex Satellite

Table 1: Patent landscaping benchmarking dataset

To the best of the authors' knowledge, a suitable benchmarking dataset generated by human experts for patent landscaping in detailed technology fields is not yet available.

3 Proposed Datasets for Patent Landscaping

3.1 Dataset description

We provide a benchmarking dataset for patent landscaping based on the KISTA patent trend reports. Every year, the Korean Patent Office publishes more than 100 patent landscaping reports through KISTA1. In particular, most reports disclose the valid patent list together with the patent search query2, which makes it possible to validate the results of the trend report. Currently, more than 2,500 reports are disclosed. The technologies covered by the reports are specific, concrete, and sometimes have convergence characteristics. We have constructed datasets for the six technologies listed in Table 1.

3.2 Building process

To ensure the reproducibility of building the patent datasets, we built the benchmarking datasets using Google BigQuery public datasets. Most of the patent data in the KISTA reports are obtained using search queries written for WIPS, a local Korean patent database service. We first constructed a Python module that converts a WIPS query into a Google BigQuery query, extracted the patent data from BigQuery, and marked the valid patents among the extracted patents. In patent search, different sets of patents may be retrieved depending on the publication-date criterion and the database searched. Therefore, we excluded queried patents published after the original publication date stated in each report.
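As an illustration of this step, the following sketch shows how candidate patents might be pulled from the BigQuery public patent dataset once a WIPS-style query has been translated. This is not the authors' conversion module: the keyword conditions of the translated query would be appended to the WHERE clause, and the country code, cutoff date, and row limit below are placeholder assumptions.

from google.cloud import bigquery

client = bigquery.Client()  # assumes Google Cloud credentials are configured

query = """
SELECT publication_number, family_id, filing_date, publication_date
FROM `patents-public-data.patents.publications`
WHERE country_code = 'US'                 -- illustrative restriction
  AND publication_date <= 20190101        -- exclude patents published after the report date
LIMIT 1000                                -- illustrative cap
"""

rows = client.query(query).result()       # run the query and wait for the results
patents = [dict(row) for row in rows]
print(len(patents), "candidate patents retrieved")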

1 http://biz.kista.re.kr/patentmap/
2 Most of the search queries were based on the WIPS (https://www.wipson.com) service, which is a local Korean patent database company.

3.3 Results of the datasets

We built the extracted patent datasets for six technological areas from Google BigQuery; the number of patents and valid patents in each dataset is given in Table 2.

Dataset name   # of patents   # of valid patents
ENDV           4,420,806      618
NGUG           8,011          243
MPUART         1,470,065      468
1MWDFS         1,774,999      1,507
MRRG           2,068,828      286
GOCS           295,194        1,139

Table 2: Datasets

4 Deep Patent Landscaping Model

4.1 Model overview

Figure 2: Architecture of the deep patent landscaping model

The proposed deep patent landscaping model is composed of two parts: a Transformer [Vaswani et al., 2017] and a GCN [Kipf and Welling, 2016]. The model concatenates the embedding vectors from the two parts and stacks neural network layers on top to classify valid patents, as shown in Figure 2. Patents, which are a kind of scientific document, have metadata fields, called bibliometrics, such as the International Patent Code (IPC), the Cooperative Patent Classification (CPC), and citations, along with textual data such as abstracts and titles. These metadata fields can be represented in a graphical format. For example, citation information can be converted into a citation network representing how each patent cites other patents [Yoon and Park, 2004]. IPC and CPC, which are technological classification codes for patents, are used as sources of graph information through co-occurrence matrices representing whether patents share the same code. To exploit these patent features, we apply a transformer encoder to extract embedding vectors for the textual data and a GCN structure to extract embedding vectors for the metadata. Finally, we use binary cross entropy as the loss function for the binary classification model.
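To make this overview concrete, the following sketch shows one possible classification head that concatenates the two kinds of embeddings, stacks neural network layers, and trains with binary cross entropy. The transformer and GCN encoders are assumed to be trained separately and to produce fixed-size vectors (512 and 128 dimensions, as described later); the hidden layer size is an illustrative assumption, not a value from the paper.

import torch
import torch.nn as nn

class PatentLandscapingHead(nn.Module):
    """Concatenates text and metadata embeddings and classifies valid patents (a sketch)."""

    def __init__(self, text_dim=512, meta_dim=128, hidden_dim=256):
        super().__init__()
        self.bn = nn.BatchNorm1d(meta_dim)           # normalize the near-zero GCN embeddings
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + meta_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),                # one logit: valid / not valid
        )
        self.loss_fn = nn.BCEWithLogitsLoss()        # binary cross entropy on the logit

    def forward(self, text_emb, meta_emb, labels=None):
        x = torch.cat([text_emb, self.bn(meta_emb)], dim=-1)   # concatenation layer
        logits = self.classifier(x).squeeze(-1)
        if labels is None:
            return logits
        return logits, self.loss_fn(logits, labels.float())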

Figure 3: Structures of the Transformer and GCN layers ((a) Transformer structure, (b) GCN structure)

4.2 Base features

A patent document contains many kinds of features3. Among them, we use five features: the abstract, IPC, CPC, USPC, and FamilyID. The abstract is the only textual feature used; it is not long, but it contains a general and important description of the invention. Although we also examined other textual features such as the title, claims, and full description, their performance was worse than using the abstract only. We think they are either too short or too long to represent the characteristics of a patent. Among the metadata fields, we use IPC, CPC, and USPC. These are technological classification codes created by patent-administering organizations. Because these codes are determined directly by the inventors and examiners, they are good sources for representing the characteristics of a patent document. Although we also examined citation information, we found that accuracy was worse when citations were used together with the classification codes. We use the FamilyID feature as the feature matrix of the GCN. The FamilyID is a useful identification code representing a collection of technologically related patents issued by the same applicant.
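The sketch below illustrates how the graph inputs described above could be assembled: a binary co-occurrence adjacency matrix over classification codes and a FamilyID occurrence matrix used as node features. The patent dictionary format and the function names are hypothetical, not taken from the authors' code.

import numpy as np

def build_cooccurrence_adjacency(patents, code_field="cpc"):
    """Binary adjacency matrix: entry (i, j) is 1 if codes i and j appear in the same patent."""
    codes = sorted({c for p in patents for c in p[code_field]})
    idx = {c: i for i, c in enumerate(codes)}
    adj = np.zeros((len(codes), len(codes)), dtype=np.float32)
    for p in patents:
        for c1 in p[code_field]:
            for c2 in p[code_field]:
                if c1 != c2:
                    adj[idx[c1], idx[c2]] = 1.0
    return adj, idx

def build_family_feature_matrix(patents, code_idx, code_field="cpc"):
    """One-hot FamilyID occurrences per code (N codes x F families), used as the feature matrix."""
    families = sorted({p["family_id"] for p in patents})
    fam_idx = {f: i for i, f in enumerate(families)}
    features = np.zeros((len(code_idx), len(families)), dtype=np.float32)
    for p in patents:
        for c in p[code_field]:
            features[code_idx[c], fam_idx[p["family_id"]]] = 1.0
    return features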

4.3 Structures of the Transformer and GCN

Transformer
The core of our model is the output embedding produced by the Transformer and GCN layers. First, we use only the encoder part of the Transformer to learn from the abstract of each patent, as shown in Figure 3a. As in the original Transformer, we stack the encoder layer six times. We also use multi-head self-attention and scaled dot-product attention without modifying the transformer encoder, and the number of heads of multi-head self-attention is set to 8. As input, we use pretrained word2vec embedding vectors trained on the whole patent dataset. When we tokenize the abstract text, the tag [SEP] is inserted at the beginning and the tag [CLS] is inserted at the end of the sentence.

3 https://bit.ly/2IvkMHf

We set the sequence length to 128 and the hidden size to 512. To concatenate the output of the transformer with the GCN embedding vectors, we adopt the squeeze technique from BERT [Devlin et al., 2018] and convert the (sequence length, embedding size) matrix into an (embedding size) vector based on the [CLS] token.
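A minimal sketch of such an abstract encoder is shown below, using the settings stated above (six encoder layers, eight heads, hidden size 512, sequence length 128). The vocabulary size, the learned positional embedding, and the use of PyTorch's nn.TransformerEncoder are assumptions for illustration; the authors' exact implementation may differ.

import torch
import torch.nn as nn

class AbstractEncoder(nn.Module):
    """Encodes a tokenized patent abstract into a single 512-dimensional vector (a sketch)."""

    def __init__(self, vocab_size=30000, d_model=512, n_heads=8, n_layers=6, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # would be initialized from word2vec
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positional embedding (assumption)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids):                          # token_ids: (batch, max_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        hidden = self.encoder(x)                           # (batch, max_len, d_model)
        return hidden[:, 0, :]                             # keep only the vector at the [CLS] position (index 0 here for simplicity)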

GCN
One GCN structure stacks two GCN layers to produce an embedding vector for one metadata field, as shown in Figure 3b. The formula of the GCN layer is shown in Equation (1):

H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)    (1)

The GCN structure first constructs a preprocessed adjacency matrix for each of the metadata fields CPC, IPC, and USPC using the co-occurrence information within a patent, builds a one-hot vector of FamilyID occurrences for each metadata element, and stacks the one-hot vectors N times to create a feature matrix. As shown in Figure 4, N is the number of elements of each metadata field and F is the number of FamilyIDs. The layer-wise propagation rule is then applied, as shown in Figure 4, using dot-product calculations.

Figure 4: A layer-wise propagation rule for the proposed GCN

In Equation (1), \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} is the preprocessed adjacency matrix. \tilde{A} is the sum of the adjacency matrix and the identity matrix, called the self-connection matrix. The adjacency matrix is a binary matrix representing the co-occurrence relationship in which elements of the metadata (here, classification codes) appear in the same patent: if two CPC codes appear in the same patent, the corresponding entry of the binary matrix is 1. The degree matrix \tilde{D} is a diagonal matrix obtained by summing each row of the adjacency matrix, and its size is the same as that of the adjacency matrix. We use ReLU as the activation function \sigma, and the final output size of the GCN structure is 128. Using the GCN structure, we were able to extract useful features for a patent by learning the relationships between classification codes that appear together. Because the values of the GCN embedding vectors are close to 0, we apply batch normalization [Ioffe and Szegedy, 2015] to them before concatenating them with the embedding vectors from the transformer structure.

Figure 5: A preprocessed adjacency matrix
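For clarity, Equation (1) can be written directly as code. The sketch below applies two stacked GCN layers to a toy co-occurrence adjacency matrix and a random FamilyID feature matrix; the matrix sizes and weight shapes are illustrative assumptions, not the paper's settings.

import numpy as np

def gcn_layer(adj, features, weights):
    """One GCN layer following Equation (1): ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_tilde = adj + np.eye(adj.shape[0])                   # add self-connections
    degree = a_tilde.sum(axis=1)                           # row sums -> diagonal of the degree matrix
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree))            # D^-1/2
    return np.maximum(0.0, d_inv_sqrt @ a_tilde @ d_inv_sqrt @ features @ weights)

# Toy example: 50 classification codes, 200 FamilyIDs, two stacked layers -> 128-d embeddings.
rng = np.random.default_rng(0)
adj = rng.integers(0, 2, size=(50, 50)).astype(np.float32)
adj = np.maximum(adj, adj.T)                               # make the co-occurrence matrix symmetric
np.fill_diagonal(adj, 0.0)
feats = rng.integers(0, 2, size=(50, 200)).astype(np.float32)
w1 = rng.normal(size=(200, 256)).astype(np.float32)
w2 = rng.normal(size=(256, 128)).astype(np.float32)
embeddings = gcn_layer(adj, gcn_layer(adj, feats, w1), w2)  # shape (50, 128)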

5 Experiments

5.1 Dataset

We measured the performance of the proposed model for the classification of valid patents on the six KISTA datasets. We extracted 8,000 to 40,000 patents for each dataset from Google BigQuery and split each dataset into training and test sets with an 8:2 ratio. The proposed datasets can be downloaded via the URLs in Table 3.

Dataset name   Training set   Test set   URL
ENDV           0              0          bit.ly/2Xk2goS
NGUG           0              0          bit.ly/2U2eyQz
MPUART         0              0          bit.ly/2U2e4Kf
1MWDFS         0              0          bit.ly/2GEQRKW
MRRG           0              0          bit.ly/2TYzQi4
GOCS           0              0          bit.ly/2VdasVL

Table 3: Datasets for experiments

5.2 Hyperparameter setting

We extracted embedding vectors from the Transformer and the GCN, which are trained in separate learning processes. In the Transformer, six encoder layers are stacked and the number of heads of multi-head attention is 8. The total number of training epochs is 20, the batch size is 64, the optimizer is Adam [Kingma and Ba, 2014], the learning rate is 0.0001, and epsilon is 1e-8. We set the sequence length, which is the maximum length of the input sentence, to 128, and pad with zeros if a sentence is shorter than 128 tokens. As a result, 512-dimensional embedding vectors are extracted for each word. In the GCN, two GCN layers are used, and 128-dimensional embedding vectors are extracted for each metadata field; an average of the extracted embedding vectors is then taken. The total number of GCN training epochs is 200, and we use early stopping to decrease the total training time.
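The sketch below summarizes the Transformer training configuration described above in code form. The names transformer_model and train_loader are assumed to exist, and the model is assumed to end in a single-logit classification head during this separate training stage; only the optimizer settings, epoch count, and batch handling reflect the stated hyperparameters.

import torch

optimizer = torch.optim.Adam(transformer_model.parameters(), lr=1e-4, eps=1e-8)
loss_fn = torch.nn.BCEWithLogitsLoss()

for epoch in range(20):                        # 20 epochs for the Transformer stage
    for token_ids, labels in train_loader:     # batches of 64 abstracts, padded/truncated to 128 tokens
        optimizer.zero_grad()
        logits = transformer_model(token_ids)  # assumed single-logit output for this stage
        loss = loss_fn(logits, labels.float())
        loss.backward()
        optimizer.step()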

5.3 Evaluation metrics

We use accuracy and the F1 score, which are commonly used in binary classification problems, as evaluation metrics. Because the datasets are highly imbalanced, we account for precision and recall through the F1 score.
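The small example below shows why both metrics are reported: on an imbalanced toy sample, accuracy alone looks high while the F1 score reveals the missed valid patent. The labels are made up purely for illustration.

from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]              # two valid patents among ten candidates
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]              # the classifier misses one of them
print("accuracy:", accuracy_score(y_true, y_pred))   # 0.9, despite the missed valid patent
print("f1:", f1_score(y_true, y_pred))               # about 0.67, reflecting the lower recall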

5.4 Overall results

We compare our proposed model (TRF+GCN in Table 4) with Google's Automated Patent Landscaping model (APL in Table 4) as a baseline. The results show that the proposed model attains the highest performance, over 98% accuracy, on all datasets, as shown in Table 4. In particular, the proposed model improves performance by more than 20% over using only the GCN structure with metadata, and it attains an average improvement of 10% or more over the APL baseline.

Dataset   TRF+GCN Acc  TRF+GCN F1  TRF Acc  TRF F1  GCN Acc  GCN F1  APL Acc  APL F1
ENDV      0.9919       0.99        0.9597   0.96    0.7258   0.72    0.9435   0.94
NGUG      0.9848       0.98        0.8687   0.87    0.6212   0.62    0.7576   0.77
MPUART    0.9946       0.99        0.9255   0.93    0.7527   0.75    0.8511   0.85
1MWDFS    0.9900       0.98        0.9121   0.91    0.7496   0.75    0.9171   0.92
MRRG      1.00         1.00        0.9130   0.91    0.7696   0.77    0.8435   0.85
GOCS      0.9934       0.99        0.9015   0.90    0.7287   0.73    0.8534   0.86

Table 4: Overall experiment results

We investigate the following questions empirically regarding the layers and features that affect the patent landscaping model:

• Which indicators of textual data and metadata have more influence on the model?
• Which of the features, including citation information, has the most impact on the classification of patent landscaping?
• When a misclassification occurs, what are the characteristics of the misclassified data?

5.5 Textual data vs. metadata represented in a graph structure

In the first experiment, we find that a significant level of performance can be obtained using only textual data (the abstract), compared with the data that can be represented as a graph (the bibliometric data). This is because most of the datasets we use concern convergence technologies, which cannot be distinguished by the existing patent technology classifications alone. In other words, in the patent landscaping process, it is difficult to classify the technology type of patents using only the classification codes defined by existing patent-related organizations such as the USPTO and WIPO. For example, electric vehicle battery technology may include not only devices but also various technology groups such as battery materials and coolants. Therefore, by considering the contents of a patent along with its technology classification codes, patents can be classified according to their exact purpose. As shown in Table 4, there is a certain performance correlation between the transformer-only model and the APL model, whereas the GCN-only model shows a relatively weak correlation. This also shows that it is difficult to classify patents in R&D projects using only the existing patent classification codes.
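This correlation claim can be checked directly from the accuracy values in Table 4; the short computation below is an illustrative sanity check, not part of the authors' experiments.

import numpy as np

# Per-dataset accuracies (ENDV, NGUG, MPUART, 1MWDFS, MRRG, GOCS) taken from Table 4.
trf = np.array([0.9597, 0.8687, 0.9255, 0.9121, 0.9130, 0.9015])
gcn = np.array([0.7258, 0.6212, 0.7527, 0.7496, 0.7696, 0.7287])
apl = np.array([0.9435, 0.7576, 0.8511, 0.9171, 0.8435, 0.8534])

print("corr(TRF, APL):", np.corrcoef(trf, apl)[0, 1])   # noticeably higher...
print("corr(GCN, APL):", np.corrcoef(gcn, apl)[0, 1])   # ...than this value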

Dataset   Metric   3CODE    CITATION   4CODE
ENDV      Acc      0.9919   0.9879     0.9919
ENDV      F1       0.99     0.98       0.99
NGUG      Acc      0.9848   0.9848     0.9797
NGUG      F1       0.98     0.98       0.97
MPUART    Acc      0.9946   0.9893     0.9760
MPUART    F1       0.99     0.93       0.75
1MWDFS    Acc      0.9900   0.9800     0.9834
1MWDFS    F1       0.98     0.98       0.98
MRRG      Acc      1.00     0.9739     0.9826
MRRG      F1       1.00     0.97       0.98
GOCS      Acc      0.9934   0.9890     0.9759
GOCS      F1       0.99     0.98       0.97

Table 5: The performance based on the metadata perspective

5.6 Classification codes vs. citation information

Our second question is how useful the widely used citation information is for patent landscaping. In conclusion, although citation information is nearly as useful as the classification codes, using only the classification codes is more effective. As shown in Table 5, 3CODE denotes the model that uses the transformer with a GCN over IPC, CPC, and USPC; CITATION denotes the model that uses the transformer with a GCN over citation information only; and 4CODE denotes the model that uses the transformer with a GCN over all of the metadata mentioned above. The performances of 3CODE and CITATION are not significantly different, but in most cases 3CODE performs better than CITATION alone. In addition, in some cases 4CODE performs worse than using CITATION or 3CODE alone. We suspect the reason is that citation information can act as noise: in patent citations, inventors cite not only prior patents in the same application area but also patents related to the underlying principle technology, and sometimes the principle-technology patent is rare and not related to all patents in the same application area. In addition, technology classification codes have the advantage that their learning space is relatively small compared to citations, because the range of codes that can appear in a similar patent field is narrow. For these reasons, technology classification codes are more suitable features than citation information for patent landscaping.

5.7 What factors cause misclassification?

Our model achieves a performance improvement of 10% compared to the existing model and over 98% accuracy in patent landscaping on the proposed datasets. Nonetheless, there are misclassifications, and they show the following characteristics. First, when the abstract sequence is too long or too short, the model does not work well. We set the abstract length to 128 words and force padding or truncation if an abstract is too short or too long; when there is too much zero padding or too much truncation, misclassification seems to occur. For the misclassified cases, the total length of the abstract is around 30 words, which is shorter than that of the well-classified data. Also, since numbers and special characters are removed during preprocessing, abstracts that contain many numbers or special characters cannot express their information properly, resulting in a drop in classification performance. In the case of metadata, the average number of classification codes is X, and for the misclassified data it is less than five on average. When only the abstract and classification codes are used and these feature sources are insufficient, misclassification seems to occur.

6 Conclusion and Future Work

In this paper, we proposed a deep patent landscaping model that solves the classification problem in patent landscaping using Transformer and GCN structures. This research also introduced a new benchmarking dataset for the automated patent landscaping task, with the aim of making automated patent landscaping a practical line of study. In particular, we anticipate that it will reduce the repetitive patent analysis tasks of practitioners. The main contributions of our research are as follows.
• We released patent landscaping datasets categorized by experts, which did not previously exist.
• We propose a state-of-the-art model that performs at a level practicable in the real world.
• We examined several features of a patent that can be used for patent landscaping.
We discovered the following points about the use of patent features for automated patent landscaping. Both graph-structured information, such as bibliometric data, and textual data need to be considered to achieve the highest classification performance. The classification codes specified by inventors as static metadata are useful features for patent landscaping. Combining citation information with classification codes is not recommended. We believe we have developed a model for patent landscaping that has the potential to be commercialized, at a laboratory scale. In the future, we will build a system that supports constructing benchmark datasets from the more than 2,000 disclosed KISTA reports to verify the scalability of the model. We will also build models that can classify not only US patents but also European, Japanese, Chinese, and Korean patents. Japan, in particular, uses its own classification codes such as F-Term, and further study is needed. In the long term, our goal is to construct a fully automated classification system for patent landscaping that is customized for companies and government organizations.

References

[Abood and Feltenberger, 2018] Aaron Abood and Dave Feltenberger. Automated patent landscaping. Artificial Intelligence and Law, 26(2):103–125, June 2018.
[Chen and Chiu, 2011] Yen-Liang Chen and Yu-Ting Chiu. An IPC-based vector space model for patent retrieval. Information Processing & Management, 47(3):309–322, 2011.
[Devlin et al., 2018] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv e-prints, October 2018.
[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[Kingma and Ba, 2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv e-prints, December 2014.
[Kipf and Welling, 2016] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[Li et al., 2018] Shaobo Li, Jie Hu, Yuxin Cui, and Jianjun Hu. DeepPatent: Patent classification with convolutional neural networks and word embedding. Scientometrics, 117(2):721–744, 2018.
[Lupu et al., 2013] Mihai Lupu, Allan Hanbury, et al. Patent retrieval. Foundations and Trends in Information Retrieval, 7(1):1–97, 2013.
[Magdy and Jones, 2011] Walid Magdy and Gareth J. F. Jones. A study on query expansion methods for patent retrieval. In Proceedings of the 4th Workshop on Patent Information Retrieval, pages 19–24. ACM, 2011.
[Shalaby et al., 2018] Marawan Shalaby, Jan Stutzki, Matthias Schubert, and Stephan Günnemann. An LSTM approach to patent classification based on fixed hierarchy vectors. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 495–503. SIAM, 2018.
[Sureka et al., 2009] Ashish Sureka, Pranav Prabhakar Mirajkar, Prasanna Nagesh Teli, Girish Agarwal, and Sumit Kumar Bose. Semantic based text classification of patent documents to a user-defined taxonomy. In International Conference on Advanced Data Mining and Applications, pages 644–651. Springer, 2009.
[Trippe, 2015] Anthony Trippe. Guidelines for preparing patent landscape reports. Geneva: WIPO, 2015.
[Tseng et al., 2007] Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin. Text mining techniques for patent analysis. Information Processing & Management, 43(5):1216–1247, 2007.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
[Wittenburg and Pekhteryev, 2015] K. Wittenburg and G. Pekhteryev. Multi-dimensional comparative visualization for patent landscaping. merl.com, 2015.
[Yang et al., 2010] Y. Y. Yang, L. Akers, C. B. Yang, T. Klose, and S. Pavlek. Enhancing patent landscape analysis with visualization output. 2010.
[Yoon and Park, 2004] Byungun Yoon and Yongtae Park. A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 15(1):37–50, 2004.