A Text Mining Tool supporting Business Intelligence __________________________________________________________________________
STING: A Text Mining Tool supporting Business Intelligence
Dr. Antoine Spinakis ; Dr. Georgia Panagopoulou ; Asanoula Chatzimakri QUANTOS SARL, 154 Sygrou Av., 176 71 Athens, Greece email:
[email protected]
Abstract STING software was implemented under the framework of the STING European-funded project and specializes in the analysis of patent data using textual and statistical analysis techniques. This paper presents a detailed analysis of the STING software. On the one hand, the technical features of STING are described; on the other hand, the statistical methodology supporting the functionality of the software is explained.
Keywords: Multidimensional analysis of patents, textual analysis, Cluster Analysis, technology indicators, measurement of technological innovation
1. Introduction STING was a European-funded project implemented under the FP5-IST programme for research, technological development and demonstration of a user-friendly information society. The software developed under STING is a sophisticated yet user-friendly system that enables the textual and statistical processing of patent data, with the aim of measuring scientific and technological evolution. In addition, STING produces technological indicators based on a statistical methodology that combines multidimensional statistical analysis techniques with textual analysis techniques. In what follows, we present the rationale behind STING's development, together with the overall architecture and the functional characteristics of the system's individual components. We then give an overview of the textual and statistical analysis techniques applied within STING, in order to point out the methodological approach that supports the analysis and the derivation of the results (Benzécri et al., 1973; Lebart et al., 1997; Lebart L., 1998; Johnson and Wichern, 1998; Rajman et al., 2002). Finally, the innovative character of the STING software is presented, both in relation to the methodology it applies and in terms of the services that the system provides to its potential users.
2. Motivation In order to observe and analyse economic, technological or scientific activity, it is necessary to take into account the flow of information related to those activities. This information is usually stored in large databases, and access to it is therefore obtained through those databases. Information can be divided into several categories. Information related to research is stored in research publications, scientific journals, etc. Information related to the phase of development and production is mainly stored in patents. Patents are closely related to technological and scientific activities, and hence to Technology Watch. Technology Watch is the activity of surveying the development of new technologies, new products and tendencies of technology, as well as measuring their impact on existing technologies, organizations or people. It is therefore closely related to innovation. Statistical exploitation of this information may lead to useful conclusions about technological development, trends of technology or innovation.
The access to scientific and technological knowledge and the ability to exploit it are crucial for the economic performance of organizations, individuals or even countries. In addition, measuring technological innovation has gained interest in many domains of economic and technological activity. Therefore Research & Development planners, business analysts, patent analysts, national and international patent offices, economic organisations, national statistical offices, venture capitalists and industrial bodies with high scientific and technological activity are interested in discovering and exploiting the information related to technological activities and innovation that is “hidden” in patent databases. Patent data constitute a source of information that not only supports research on technological evolution but also influences technological trends, once the absence or the domination of particular technological innovations has been captured through patent data analysis. Figure 1 illustrates this two-way relationship.
[Figure 1 diagram: RESEARCH <-> PATENTS <-> TECHNOLOGY]
Figure 1: Two-way relationship between research and technology combined through patents.
The analysis of the information “hidden” in patents, which are stored in many international databases, can provide a clear view of current trends in technological and scientific innovation, together with measurable and comparable results. However, in order to exploit the information stored in patent databases more effectively, it is necessary to develop efficient, innovative methodologies and tools for the analysis of existing information related to technological innovation. The need to exploit the information “hidden” in patents, in order to reveal technological or scientific trends and to produce indicators that identify technological or scientific progress, is widely accepted. Technology Watch, a recently developed discipline that makes explicit use of patents in order to provide reliable and comparable results, depends on the exploitation of patent data. However, these data require special treatment. It has therefore become necessary to extend the technologies and tools in use and to develop the abilities and means required to exploit patent data (Comanor and Scherer, 1969; Dou, 1995; Narin, 1995).
STING software constitutes an innovative Text Mining tool within a Business Intelligence process that supports the analysis and the exploitation of the information provided by patent data. The main objectives of the system are the following:
• Exploitation of the information and knowledge “hidden” in patent data through the use of an advanced statistical methodology, combined with textual analysis techniques.
• Support of multidimensional comparisons between countries, sectors and/or companies.
• Identification of competition in a given technological sector.
• Tracing of interactions that might exist between domains of technological activity and poles of innovation inside these domains.
• Production of indicators for the measurement of innovation in different fields of technological activity.
Hence, the system enables the user to extract only the necessary information and to exploit it in an informative way in order to draw useful conclusions.
3. Description of STING Architecture STING is a Windows-based system implemented in Visual C++. It combines different features and technologies with a complex statistical methodology for analysing huge amounts of textual data, and more specifically patents. Figure 2 depicts the functional architecture of the STING system.
[Figure 2 diagram: STING System. Database Manager Module: Import Process (Patent Database, patents.txt); Linguistic Preprocessing (Data Cleansing, Lemmatization, Part-of-Speech Tagging, Part-of-Speech Selection). Statistical Analysis Module: Simple Analysis, Correspondence Analysis, Bootstrap Analysis, Clustering Process, Hierarchical Analysis (coordinates in 2-dimensional level). Results Presentation Module: Graphs, Tables, Reports, delivered to the User.]
Figure 2: STING Functional Architecture
STING has a modular architecture: it consists of independent modules that allow the user to perform various types of analyses, suitably adapted to his/her needs. This modular basis makes the system flexible with respect to subsequent changes in the statistical methodologies used (addition of new methods, etc.). Furthermore, the structure of the system enables the use of different patent databases as input sources. Each module corresponds to a main system task, has a prespecified input and output data format and executes a set of well-defined procedures. The correct operation of the system relies on the integration of these modules. As depicted in the figure, the modules are:
• Data Manager Module.
• Statistical Analysis Module.
• Results Presentation Module.
A more detailed presentation follows.
3.1 Data Manager Module. The Data Manager Module is mainly responsible for data import, cleaning and preparation. Specifically, the module reads the patent data from the selected Patent Database and transforms them into the appropriate format, so that they are ready for further processing. The system database accepts textual and numerical data, in our case patents described by a set of fields. There is no restriction on the database from which the patent data are taken. According to the user requirement analysis, ESPACE ACCESS and ESPACE Bulletin are among the databases most frequently used. The flexibility of the system with respect to different input databases is achieved through the adoption of specific input formats. Although data from different databases may be used, depending on the problem and the information one wants to extract, they are always standardized to a uniform representation in the system database. A parser reads the textual data, and linguistic pre-processing is then applied to them. The linguistic processing requires a dictionary and a grammar. Figure 3 shows the procedure.
[Figure 3 diagram: Patent Database -> Linguistic Preprocessing (Data Cleansing, Lemmatization, Part-of-Speech Tagger, Part-of-Speech Selection) -> Frequency Tables (Words x Patents)]
Figure 3: Linguistic Pre-processing
This procedure includes:
- Data cleansing: cleans the input data by removing irrelevant HTML characters and punctuation characters.
- Lemmatization: restricts the morphological variation of the textual data by reducing each of the different inflections of a given word form to a unique canonical representation (or lemma).
- Part-of-speech tagging: automatically identifies the morpho-syntactic categories (noun, verb, adjective) of words in the documents. Non-significant words can then be filtered on the basis of their morpho-syntactic category. The part-of-speech tagging runs on the textual content, i.e. on the titles and abstracts of the patents.
- Part-of-speech selection: to further reduce the vocabulary size, the user can restrict the analysis to specific word categories (e.g. nouns, verbs, adjectives), as identified by the assigned parts-of-speech. Moreover, the user can select words to be excluded from the subsequent steps of the analysis, and can define synonyms.
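The four steps above can be illustrated with a minimal Python sketch. It is not STING's implementation (which is written in Visual C++ and relies on a full dictionary and grammar): the lemma dictionary `LEMMAS`, the tag dictionary `POS_TAGS` and the function names are hypothetical toy stand-ins introduced only for illustration.

```python
import re

# Hypothetical toy resources; a real system would use a full dictionary and a tagger.
LEMMAS = {"engines": "engine", "cooling": "cool", "cooled": "cool", "systems": "system"}
POS_TAGS = {"engine": "NOUN", "system": "NOUN", "cool": "VERB",
            "new": "ADJ", "for": "ADP", "a": "DET"}

def cleanse(text):
    """Data cleansing: drop HTML tags, HTML entities and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)           # HTML tags
    text = re.sub(r"&[a-zA-Z0-9#]+;", " ", text)   # HTML entities such as &amp;
    text = re.sub(r"[^\w\s]", " ", text)           # punctuation
    return text.lower()

def preprocess(text, keep_pos=("NOUN", "VERB"), stop_words=(), synonyms=None):
    """Cleanse, lemmatize, tag and filter one title or abstract."""
    synonyms = synonyms or {}
    tokens = cleanse(text).split()
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]   # lemmatization
    lemmas = [synonyms.get(lem, lem) for lem in lemmas]  # user-defined synonyms
    # Part-of-speech selection: keep only the requested categories,
    # then drop the words the user excluded.
    return [lem for lem in lemmas
            if POS_TAGS.get(lem) in keep_pos and lem not in stop_words]

title = "<b>A new cooling system for engines.</b>"
words = preprocess(title)  # words kept for the frequency tables
```

Words absent from the toy tag dictionary are simply discarded here; this is a simplification of the sketch, not a statement about how STING treats unknown words.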
3.2 Statistical Analysis Module. The Statistical Analysis Module performs both the Simple Statistical Analysis and the Advanced Statistical Analysis. In particular, it applies textual analysis methods to the preformatted data in order to extract valuable information and create the first groups of patents. The user can select advanced statistical methods, such as Correspondence Analysis, to analyze the patent data. Cluster Analysis is one of the important features of the system, since it is the basis for the derivation of technology indicators and gives a clear view of ongoing technological developments.
3.3 Results Presentation Module. The Results Presentation Module is mainly responsible for data export and for the graphical representation of the results. The main output of the system is the production of technology indicators. The results can be presented as graphical outputs and ready-made reports. The system is also robust: whenever an error occurs, it does not crash but displays an error message and returns to the last known stable state. Interactivity is an important feature of the system, giving the user the opportunity to intervene in the outputs and adapt them to his/her real needs. Changes in colors, line types, graph types (2-dimensional, 3-dimensional, etc.) and fonts are supported, so that the user can refine the presentation of the results. Furthermore, the system supports the positioning of the graphs in space, so that the user can choose the viewpoint he/she considers appropriate for the visualization of the results. Each module corresponds to a main system task, has a pre-specified input and output data format and executes a set of well-defined operations. The different flows are also depicted in the figure, defining the different stages of the analysis and the connections between them. The natural sequence of these flows should be respected for the correct operation of the system and for the robustness of the results.
4. Statistical Methodology for the Analysis of Patent Data This section presents the data analysis, which is based on textual and multidimensional analysis techniques that enable the simultaneous treatment of the complex information coming from the textual data of patents. These techniques include Correspondence Analysis and Cluster Analysis. Correspondence Analysis is used for the graphical representation of contingency tables: it projects the data into a space of reduced dimension while preserving as much of the initial information as possible. Cluster Analysis, on the other hand, synthesizes the initial information by creating clusters of patents. In other words, clustering techniques group the patents into homogeneous clusters (homogeneity is measured on the basis of the words contained in the titles and abstracts that describe these patents). The two methods are complementary, and their combination makes it possible to recover and describe all the information in a comprehensive way. We should also remark that Correspondence Analysis was preferred to the direct use of a similarity criterion because it yields better results. Cluster Analysis is based on the factorial axes obtained from the Correspondence Analysis. In order to apply multidimensional analysis techniques to textual data, we construct particular contingency tables, which contain the relative frequency with which each word appears in each patent. The remainder of this section describes how the textual and statistical methodologies are applied and used in the system.
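As an illustration of the contingency tables mentioned above, the following Python sketch builds a patents-by-words table from already preprocessed word lists. The function names and the two toy patents are hypothetical, and we assume here that "relative frequency" means the count divided by the grand total of the table; the paper does not state which normalisation STING uses.

```python
from collections import Counter

def contingency_table(patents):
    """patents: dict mapping a patent id to its list of preprocessed words.
    Returns (patent_ids, vocabulary, counts), where counts[i][j] is the number
    of occurrences of vocabulary[j] in patent patent_ids[i]."""
    patent_ids = sorted(patents)
    vocabulary = sorted({word for words in patents.values() for word in words})
    counts = []
    for pid in patent_ids:
        occurrences = Counter(patents[pid])
        counts.append([occurrences[word] for word in vocabulary])
    return patent_ids, vocabulary, counts

def relative_frequencies(counts):
    """Divide every cell by the grand total (assumed normalisation)."""
    total = sum(sum(row) for row in counts)
    return [[n / total for n in row] for row in counts]

# Toy input: two patents after linguistic preprocessing (illustrative data).
patents = {"P1": ["engine", "cool", "engine"], "P2": ["cool", "valve"]}
ids, vocab, counts = contingency_table(patents)
freqs = relative_frequencies(counts)
```

Rows correspond to patents and columns to words, which is the layout used as input for the Correspondence Analysis.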
4.1 Linguistic preprocessing As far as the preprocessing of the textual data is concerned, lemmatization and part-of-speech assignment are performed automatically. Lemmatization consists in restricting the morphological variation of the textual data by reducing each of the different inflections of a given word form to a unique canonical representation (or lemma). In order to reduce the vocabulary size further, it is possible to restrict the analysis to the specific word categories (as identified by the assigned parts-of-speech) that bear most of the semantic content: nouns, verbs, and a third category (“symbols”) that comprises adjectives and all other grammatical forms. Any combination of these categories can also be used. In addition, the user can choose to include in the analysis the words of the titles, the words of the abstracts, or both.
4.2 Correspondence Analysis In traditional approaches to patent analysis, knowledge about the content of a patent is usually restricted to (parts of) the IPC¹ codes that have been assigned to that patent during the application process. Although the IPC represents a rich hierarchy of codes, not taking into account the textual content of the patent may be considered a limitation to a fully efficient exploitation of patent data. Therefore, one of the important characteristics of the presented methodology is the integration of textual data analysis techniques for the processing of the textual content of the patents (titles and abstracts). The underlying idea is that the vocabulary characteristic of classes of patents, where the classes are built on the basis of descriptive variables associated with the patents (such as country or date of application), provides additional insights for the analysis of the patents themselves.
¹ The International Patent Classification is a system for the classification of patents according to the technical characteristics of the invention; it allocates inventions to specific sectors of technology.
Following this logic, linguistic pre-processing of the textual data is essential for the performance of Correspondence Analysis. Once linguistic pre-processing has been performed, the input data for Correspondence Analysis form a contingency table in which the rows represent the patents, the columns represent the words, and each cell gives the frequency of the specific word in the corresponding patent. This analysis makes it possible to explore the non-random dependencies between the variables involved and the vocabulary obtained from the titles and abstracts of the patents. More precisely, Correspondence Analysis produces a new vector space in which similarities between the rows and the columns of the input contingency table (as measured by the χ²-distance) can be visualized as geometric proximities. Moreover, bootstrap techniques test the stability of the results obtained from the Correspondence Analysis (Davison and Hinkley, 1997). The user can also select different parameters in order to adapt the results to his/her needs, and several statistical measures are produced to help the user interpret the results.
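The computation behind this step can be sketched as follows. This is the standard textbook formulation of Correspondence Analysis through a singular value decomposition of the standardized residuals, written with NumPy; it is offered as an illustration and is not taken from the STING code. The function name and the toy table are assumptions, and the sketch requires every row and column of the table to have a positive total.

```python
import numpy as np

def correspondence_analysis(counts, n_axes=2):
    """Return principal coordinates of rows and columns and the inertia per axis.
    counts: patents-by-words contingency table with no empty row or column."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    expected = np.outer(r, c)             # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal row coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal column coordinates
    inertia = sv ** 2                     # inertia carried by each factorial axis
    return row_coords[:, :n_axes], col_coords[:, :n_axes], inertia[:n_axes]

# Toy table: 4 patents (rows) by 3 words (columns), illustrative data only.
table = np.array([[4, 1, 0],
                  [3, 2, 0],
                  [0, 1, 5],
                  [1, 0, 4]], dtype=float)
row_coords, col_coords, inertia = correspondence_analysis(table)
```

When all axes are kept, the squared Euclidean distance between two rows in this space equals the χ²-distance between their profiles, which is why geometric proximity on the factorial map can be read as similarity of vocabulary. The retained row coordinates are what a subsequent clustering step would take as input.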
4.3 Cluster Analysis In the case of textual data, clustering techniques are used for representing proximities between the elements of lexical tables. In the general case, Cluster Analysis operates on contingency tables to identify relationships between two different nominal variables. In the case of patent data, the aim of the procedure is to identify groups of technologies that share common vocabulary and groups of patents that share common technologies, in order to derive conclusions about technological trends and innovation (Guellec and van Pottelsberghe, 1998). More specifically, we apply Cluster Analysis to contingency tables cross-tabulating full IPC codes and words, in order to identify homogeneous groupings of IPC codes (resp. patents) and relevant relationships between these groupings. The information captured in such clusters and inter-cluster relationships can then be used directly for the production of technological indicators, the goal being to identify areas of technology that share common characteristics, as well as innovative areas characterized by isolated clusters. Depending on the size of the data set, the clustering procedure is either hierarchical clustering alone or k-means followed by hierarchical clustering. In the second case, the hierarchical clustering is performed on the k clusters obtained from the k-means procedure. In both cases, the Cluster Analysis is performed on the factorial axes derived from the Correspondence Analysis.
The clustering procedure also produces a relationship map that shows the relationships between the clusters or, in other words, between the different areas of technology. The technology indicators are likewise based on the clustering procedure and constitute an important characteristic of the system. They are produced for each cluster separately and make it possible to identify ongoing technological developments in different areas of technology. Furthermore, these indicators are categorized in four different levels, depending on whether they refer to the sector of technology (through IPC codes), the country or the continent, the assignees or the inventors, and finally time (through the priority year or another date).
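The two-stage procedure (k-means followed by hierarchical clustering of the k resulting clusters) can be sketched in Python with NumPy as below. The paper does not specify the initialisation, the aggregation criterion or the number of clusters, so the farthest-point initialisation, the Ward criterion and all function names are assumptions made for this illustration; `coords` stands for the coordinates of the patents on the factorial axes.

```python
import numpy as np

def nearest(points, centroids):
    """Index of the closest centroid for every point."""
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def farthest_point_init(points, k):
    """Deterministic initialisation (an assumption): spread the seeds out."""
    chosen = [0]
    dist = ((points - points[0]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, ((points - points[nxt]) ** 2).sum(axis=1))
    return points[chosen].copy()

def kmeans(points, k, n_iter=100):
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        labels = nearest(points, centroids)
        updated = centroids.copy()
        for j in range(k):
            if np.any(labels == j):            # keep the old centroid if empty
                updated[j] = points[labels == j].mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return nearest(points, centroids), centroids

def ward_merge(centroids, sizes, n_clusters):
    """Agglomerate weighted centroids, merging the cheapest pair (Ward) each time."""
    groups = [[j] for j in range(len(centroids))]
    cents = [np.asarray(c, dtype=float) for c in centroids]
    sizes = [float(s) for s in sizes]
    while len(groups) > n_clusters:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                cost = (sizes[a] * sizes[b] / (sizes[a] + sizes[b])
                        * np.sum((cents[a] - cents[b]) ** 2))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        total = sizes[a] + sizes[b]
        cents[a] = (sizes[a] * cents[a] + sizes[b] * cents[b]) / total
        sizes[a] = total
        groups[a] += groups[b]
        del groups[b], cents[b], sizes[b]
    return groups

def two_stage_clustering(coords, k, n_clusters):
    coords = np.asarray(coords, dtype=float)
    labels, centroids = kmeans(coords, k)
    sizes = np.bincount(labels, minlength=k)
    nonempty = [j for j in range(k) if sizes[j] > 0]
    groups = ward_merge(centroids[nonempty], sizes[nonempty], n_clusters)
    final = np.empty(len(coords), dtype=int)
    for g, members in enumerate(groups):
        for m in members:
            final[labels == nonempty[m]] = g
    return final

# Toy factorial coordinates: three well-separated groups of four patents each.
coords = [(0, 0), (1, 0), (0, 1), (1, 1),
          (20, 0), (21, 0), (20, 1), (21, 1),
          (0, 20), (1, 20), (0, 21), (1, 21)]
clusters = two_stage_clustering(coords, k=6, n_clusters=3)
```

Running k-means first reduces a large set of patents to k weighted centroids, so the quadratic-cost hierarchical stage only has to compare k items; for small data sets the hierarchical stage could be applied directly to the individual patents.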
5. Innovation One aspect of the innovative character of STING concerns the methodological approach that is applied. The main innovative characteristic of the applied methodology is that it takes into consideration not only the full International Patent Classification (IPC) codes, but also additional information, such as the titles and abstracts that form the textual part of the patent data. In order to fully exploit this information and derive more reliable results, textual and statistical analysis techniques such as Correspondence and Cluster Analysis are applied. In addition, the use of supplementary variables, such as the companies that submit the patents, the inventors, the countries in which the patents are submitted, etc., enriches the result of the analysis. The innovative character of STING also lies in the services that the system offers to a variety of actors. The system supports the activity of surveying the emergence of new technologies, new products and tendencies of technology, as well as measuring their impact on existing technologies, organizations or people. In other words, STING supports the process of Technology Watch. In addition, the STING software identifies new technologies within specific geographical areas over time. Furthermore, one can track dynamic evolutions, opportunities or risks within specific technological sectors. These activities, made possible by the text-mining tool, also help companies and organizations to adopt beneficial investment policies and to compare themselves with competitors specialized in the same technological sector.
6. Conclusions STING is the result of an FP5 IST project that aimed at the creation of a software tool for the analysis of patents and the production of technology indicators. The innovative character of this project is closely related to the statistical methodology used, which makes it possible to take into account all the fields that describe a patent. In summary, we consider that the value of the system, from a scientific point of view, lies in the complex statistical procedures used and in the derivation of a large spectrum of scientific and technological indicators. From a social and commercial point of view, it is a tool that helps users to extract the desired information in a flexible, rapid and easy way, to define their policy on technology issues, to follow up competitors, to assess the patenting activity of countries and inventors, and to become competitive in their area.
7. References
Benzécri J.-P. et al. L’Analyse des Données, volume II: L’Analyse des Correspondances. Dunod, (1973).
Comanor W.S. and Scherer F.M. Patent statistics as a measure of technical change. Journal of Political Economy, 77(3): 392-398, (1969).
Davison A.C. and Hinkley D.V. Bootstrap Methods and their Application. Cambridge University Press, (1997).
Dou H. Veille technologique et compétitivité: L’intelligence économique au service du développement industriel. Dunod, Paris, (1995).
Guellec D. and van Pottelsberghe B. New indicators from patent data. In Proc. of Joint NEST/TIP/GSS Workshop, (1998).
Johnson R. and Wichern D. Applied Multivariate Statistical Analysis. Prentice-Hall, Inc., (1998).
Lebart L., Morineau A. and Piron M. Statistique exploratoire multidimensionnelle. Dunod, 2e édition, (1997).
Lebart L., Salem A. and Berry L. Exploring Textual Data, volume 4. Kluwer Academic Publishers, (1998).
Narin F. Patents as indicators for the evaluation of industrial research output. Scientometrics, 34(4): 489-496, (1995).
Rajman M., Peristera V., Chappelier J.-C., Seydoux F. and Spinakis A. Evaluation of Scientific and Technological Innovation using statistical analysis of patents. In 6es Journées internationales d’Analyse statistique des Données Textuelles (JADT), France, (2002).
8. Appendix 8.1 Description of patent data 1. Description of a patent A patent can be decomposed into and described by several fields. Each field contains specific information, while each patent is described by a code (or, in many cases, by more than one code) depicting its technical characteristics. The information included in patents can be categorised in four large sections, as shown in Figure 4. The first section concerns the technological features of each invention; the second section refers to all the characteristics concerning the application history of each invention. In addition, information is provided about the countries in which an invention is protected, as well as about the invention’s origin. Figure 4 describes the information that a patent generally contains. For some patents, not all of this information is available, either because of mistakes or because of other factors.
[Figure 4 diagram, four sections:
Inventions' origin: Inventor (resident country of inventor); Control (geographical situation of applicant).
Geographical protection: Designated states for protection; First filing (priority office of one country).
Technological features: Technical classes (IPC codes); Patent citations; Industrial fields claimed; Citation of scientific articles.
Applications' history: Priority date; Application date; Publication date; Refusal or withdrawal date; Grant date.]
Figure 4: The information contained in a patent.
As already mentioned, each patent is characterised by a specific code that summarises the information related to it. This code is mostly related to the technical characteristics of the invention and therefore allocates inventions to specific sectors of technology. These codes are assigned to patents according to the International Patents Classification system (IPC) or other classification systems. It should also be mentioned that the fields contained in each patent database might differ from one database to another. The fields describing a patent in the ESPACE ACCESS patent database are the following:
PN: Priority Number (Number of the patent)
AN: Application Number
PR: Priority Year
DS: Designated States
MC: Main Classification Codes
IC: All Classification
ET: English Title
FT: French Title
IN: Inventor
PA: Applicant (Name of the company depositor)
AB: English Abstract
AF: French Abstract
Each field associated with a patent provides useful information about the specific invention. In the statistical analysis we can use several of the fields mentioned above. The approach presented in this document is based on the analysis either of codes (based on the IPC Classification System) or of titles/abstracts. These fields are primarily involved in the cluster analysis techniques, in order to create homogeneous clusters in terms of shared technology. However, supplementary information described by the other fields, such as inventors, assignees, etc., is used to complete the results obtained from the analysis of the basic fields, i.e. either of codes or of titles/abstracts.
2. Patent Classification systems Inventions are classified by one or more symbols, so that patents belonging to a given technological field can be stored and retrieved.
2.1 IPC Classification System The International Patents Classification system (IPC) is adopted by 52 countries and 4 international organisations. In this classification system the patents are classified according to the technology related to the invention. IPC is designed so that each technical object to which a patent relates can be classified as a whole. A patent may contain several technical objects and may therefore be allocated several classification symbols. According to the IPC system all techniques are classified in sections. This classification follows a hierarchical structure. The sections themselves are subdivided into subsections, classes, sub-classes, groups and sub-groups. Each sub-group may be further subdivided. Every level in this hierarchy is represented by a code.
2.2 Other Classification Systems Although IPC is an international classification system, some patent offices prefer to use one of their own. This applies in particular to the United States Patent Office, which classifies the patent applications filed with it in accordance with USPOC, the United States Patent Office Classification. USPOC is also a technology-based classification, but its pattern differs from that of IPC. USPOC classifies patents in accordance with the technology as it appears in the patent claims. Therefore USPOC is seen as being geared more to function, i.e. to the intrinsic characteristics of products or processes, in an attempt to give a picture of the state of the art. In addition, the structure of USPOC differs from that of IPC: USPOC consists of three main patent groups (chemical, electrical and mechanical). Further classification systems include:
• The WPI(L) classification produced by Derwent Inc. Each patent abstract is assigned to one or more "Derwent classes" according to subject area, regardless of the patent’s original IPC, USPOC or other national classification.
• The classification under 80 main chemistry headings devised for the Chemical Abstracts database (CAS). In this database of chemistry-related scientific literature and patents, every abstract (publication or patent) is assigned to a "CAS section".
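To illustrate the hierarchical coding of the IPC described in section 2.1, the following Python sketch splits an IPC symbol such as "A61K 31/12" into its levels. It assumes the usual written form of an IPC symbol (section letter, two-digit class, subclass letter, main group, "/", subgroup); the names `IPCSymbol`, `parse_ipc` and `hierarchy` are hypothetical and do not come from STING.

```python
import re
from typing import NamedTuple

class IPCSymbol(NamedTuple):
    section: str      # one letter, A to H
    ipc_class: str    # two digits
    subclass: str     # one letter
    main_group: str   # one to four digits
    subgroup: str     # digits after the slash

# Assumed written form of a full IPC symbol, e.g. "A61K 31/12".
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")

def parse_ipc(symbol):
    """Split a full IPC symbol into its hierarchical components."""
    match = _IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    return IPCSymbol(*match.groups())

def hierarchy(symbol):
    """Return the code of each level, from the section down to the subgroup."""
    s = parse_ipc(symbol)
    subclass = f"{s.section}{s.ipc_class}{s.subclass}"
    return [
        s.section,
        f"{s.section}{s.ipc_class}",
        subclass,
        f"{subclass} {s.main_group}",
        f"{subclass} {s.main_group}/{s.subgroup}",
    ]

levels = hierarchy("A61K 31/12")
```

Truncating codes to a chosen level in this way is one possible means of aggregating patents by sector of technology at a coarser or finer granularity before cross-tabulating codes and words.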