MINING PATENT DATABASE TO BUILD TECHNOLOGY ROADMAP

Veena Bansal, A. K. Mittal
Department of Industrial and Management Engineering, Indian Institute of Technology Kanpur-208016, India
Abstract: Patent literature is a rich source of information. Patents can be used by technologists and researchers to learn about and track the progress of a field. The technology development trajectory (also referred to as technology road map or technology map) reveals the research focus and the gaps that can help researchers identify the thrust research areas. In this paper, we describe our complete system for building a technology map that meets user requirements. We have created technology maps for supply chain management, scheduling and solar systems. We use the text component as well as the images contained in the patents for building the technology map.

Keywords: Patents; Technology Roadmap; Patent Mining; Text Mining
1. INTRODUCTION
Patent literature is a rich source of information and this source has been used extensively by industry and researchers to extract useful information. For instance, the collective profile of inventors who hold key patents in an area can be used to build a research team [1]. Patents may show that a company is a leader and differentiate it from its competitors if it holds a large number of seminal and leading-edge patents [2]. A patent holder enjoys a temporary monopoly to reap the benefits of his innovations. At the same time, others must make sufficient innovative efforts to get another patent in the same field, which in turn favors technological advancement and economic growth [3, 4]. Patent analysis may be done to figure out the adequacy of research and development in a particular field [5]. Citations referred to in patents have also been analyzed to build the technology development trajectory [6]. The technology development trajectory (also referred to as technology road map or technology map) reveals the research focus and the gaps that can help researchers identify the thrust research areas. The technology road map for a technology reveals its maturity level and life cycle, and provides some help in innovation forecasting.

Building a technology road map involves more than a simple search through data sources such as a patent database. Many mining processes and statistical techniques are required to extract the technology development path and technology map from the patent literature. Data and text mining techniques are semi-automatic in nature and require human intervention. Publications and patents provide the primary raw material for building a technology map, as they are easily available and are in well organized form. There are many sources of patent information such as the World Trade Organization (WTO), World Intellectual Property Organization (WIPO), International Patent Documentation (INPADOC), European Patent Office (EPO), Singaporean Patent Office, US Patent and Trademark Office and Indian Patent Office.

In order to build a technology road map, one can make use of the search facilities provided by the patent websites to fetch patents of interest. The official web site of the US Patent and Trademark Office (USPTO) contains approximately 7,000,000 patents starting from 1790 [7]. An AND-based search may miss many relevant and important patents, and an OR-based search may pick too many irrelevant patents. The difficulty in picking the relevant patents led us to build a complete system for creating a technology map. In this paper, we describe our complete system for building a technology map. We have created technology maps for supply chain management, scheduling and solar systems. Before we discuss our system in sections 3 and 4, we review the related work in section 2. In section 5, the experimental results are presented.
2. LITERATURE REVIEW
Patent analysis has been used to evaluate the competitiveness of firms, develop technology plans, prioritize research & development investment, and monitor technological change in firms [8]. The sheer size of the information available about patents has led many researchers to develop efficient and effective retrieval techniques for patent databases. Fattori et al. [9] developed a system, PackMOLE, that applies text mining to the patent database. The search facilities of patent portals do not render appropriate results; therefore, people spend more and more time filtering information to find what they want [10]. Many techniques specifically for analyzing textual data have been developed to extract information from large textual datasets, such as K-nearest neighbors, multiple concepts and distribution of concepts, neural networks (NN), bibliometric indicators, citation methodology, the parallel Monte Carlo technique, the Naive Bayes model, Rocchio, support vector machines, decision trees, linear least squares fit (LLSF), etc. [11]. Patent classification [12] may utilize term frequency (TF), inverse document frequency (IDF) and term distribution (inter-class, intra-class and in-collection distributions) for representing the importance of words and/or terms in classifying a patent. We make use of the binary logit model and the K-nearest neighbors algorithm to filter and classify patents, respectively.

The non-textual content of the patent database needs special techniques for search and retrieval. Hopkins [13] has described a search method for non-word US trademarks using codes from the Design Search Code Manual. Initial image retrieval techniques were text-based and associated textual information, like filename, captions and keywords, with every image in the repository [14]. The term Content-Based Image Retrieval (CBIR) [15] has been used to describe automatic retrieval of images from a database by color and shape features. The original Query by Image Content (QBIC) system [16] allowed the user to select the relative importance of color, texture and shape. The Virage system [17] allows queries to be built by combining color, composition (color layout), texture, and structure (object boundary information). Moments [18], Fourier descriptors [19, 20] and chain codes [21, 22] have also been used as features for images. Jain and Vailaya [14] introduced the edge direction histogram (EDH); they found the edge information in the image using the Canny edge operator [23]. The edge orientation autocorrelogram (EOAC) classifies edges based on their orientations and the correlation between neighboring edges in a window around the kernel edge [24]. We selected EOAC for our work because it is computationally inexpensive and translation and scale invariant.
3. SYSTEM FOR TEXTUAL COMPONENT OF PATENTS
The very first activity that we need to perform is to download patents from websites such as USPTO. The fixed format of US patents has enabled us to automate this activity. We have used the USPTO web site as our primary source of data as it is available in the public domain and the patents are in a standard format. We request the user, who has domain knowledge, to provide a set of keywords to be used for the basic search that selects and downloads the primary set of patents. We have built a portal that is password protected. The user accesses the website and provides us with a set of keywords. We rely on the search facility of the patent portal, ensuring that our search criteria are liberal. Our objective is to pick all (well, almost all) relevant patents, even at the cost of picking many non-relevant patents.

The downloaded patents are segmented into a text part and an image part. The text part is broken down with the help of its HTML tags into title, author, abstract, claims, assignee, I-class (international classification) and U-class (US classification) and inserted into a database. We then randomly pick a small set of patents and present them to the user through our portal. The user, who possesses the domain knowledge, labels these patents as relevant or non-relevant. If he is interested in further classification of the relevant patents, he can provide a class label as well. These patents become our training set. The training set construction process is semi-automatic and the user participates in the process by labeling randomly selected patents. Our next job is to extract features from the patents of the training set (relevant and non-relevant patents) so that we can label the entire set of patents automatically.
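The segmentation of a downloaded patent page into the fields listed above can be sketched as follows. This is a minimal illustration rather than the actual ingestion code: the field labels and their order are hypothetical, and the real USPTO HTML requires page-specific patterns.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


def segment_patent(html):
    """Split the text of a downloaded patent page into fields for the database."""
    text = strip_tags(html)
    # Hypothetical field labels; the actual USPTO pages need their own patterns.
    fields = ["Title", "Abstract", "Claims", "Inventors", "Assignee",
              "International Class", "U.S. Class"]
    record = {}
    for i, label in enumerate(fields):
        nxt = re.escape(fields[i + 1]) if i + 1 < len(fields) else None
        pattern = (rf"{re.escape(label)}:(.*?)(?={nxt}:)" if nxt
                   else rf"{re.escape(label)}:(.*)$")
        match = re.search(pattern, text, re.S)
        record[label] = match.group(1).strip() if match else ""
    return record
```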
3.1. Feature Extraction

We have experimented with two different classification systems. We explain each of them separately.

Text Based Classification System I

Patents are filtered using the binary logit model:

\theta = \frac{e^{\alpha + \sum_{i=1}^{I} \beta_i f_i}}{1 + e^{\alpha + \sum_{i=1}^{I} \beta_i f_i}}
where θ is the probability of the patent being relevant, α is the constant of the equation, βi is the coefficient of the i-th predictor variable, and fi is the frequency of the i-th key term. The key terms are defined by the user. Higher values of θ indicate higher relevance of the patent. The parameters α and the βi are determined by setting θ to 1 for a relevant patent from the training set and to 0 for a non-relevant patent. Once α and the βi have been determined, we label all patents in the database as relevant or non-relevant. The user is then asked to randomly check the patents labeled as relevant. If the proportion of relevant patents in a randomly selected sample is less than 60% (or a threshold set by the user), we assume that the training set is inadequate. The labeling done by the system is removed, we request the user to label some more patents as relevant and non-relevant, and the parameters α and βi are computed again.

To classify a class-unknown patent X, the nearest neighbor classifier algorithm is used: it ranks the patent's neighbors among the training patent vectors and uses the class label of the most similar neighbor to predict the class of the new patent. The classes of these neighbors are weighted using the similarity of each neighbor to X, where similarity is measured by the Euclidean distance or the cosine value between two patent vectors. The cosine similarity is defined as follows:

\mathrm{sim}(X, D_j) = \frac{\sum_{t_i \in X \cap D_j} x_i \, d_{ij}}{\lVert X \rVert_2 \cdot \lVert D_j \rVert_2}
where X is the class-unknown patent, represented as a vector; Dj is the j-th training patent; ti is a word shared by X and Dj; xi is the weight of word ti in X; and dij is the weight of ti in patent Dj. The weight assigned to each term is

d_{ij} = f_{ij} \cdot \log\frac{N}{n_i} \qquad \text{and} \qquad x_i = f_i \cdot \log\frac{N}{n_i}

where fij is the frequency of word i in patent j, N is the number of patents in the training set, ni is the total number of times word i occurs in the training set, xi is the weight assigned to term i in the class-unknown patent X, and fi is the frequency of word i in X. We set a cutoff threshold τ on sim(X, Dj) to assign the new patent to a known class. We then sample 10% of the patents of a class and check the precision (the ratio of relevant patents to the total patents in the class). For our experiments, we have set the acceptable precision at 60%. If the precision is not acceptable, we adjust τ and re-iterate.
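The two steps of System I can be summarized in the following sketch, written directly from the formulas above. It is an illustration rather than the system's implementation; the names alpha, beta and tau mirror the symbols in the text, and the fitted parameters are assumed to come from the training procedure described above.

```python
import math


def relevance_probability(freqs, alpha, beta):
    """theta = exp(alpha + sum(beta_i * f_i)) / (1 + exp(alpha + sum(beta_i * f_i)))."""
    z = alpha + sum(b * f for b, f in zip(beta, freqs))
    return math.exp(z) / (1.0 + math.exp(z))


def term_weight(freq, n_docs, n_occurrences):
    """f * log(N / n_i), as defined for d_ij and x_i above."""
    return freq * math.log(n_docs / n_occurrences)


def cosine(x, d):
    """sim(X, D_j) over the shared terms of two weighted term vectors (dicts)."""
    shared = set(x) & set(d)
    num = sum(x[t] * d[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in x.values())) *
           math.sqrt(sum(v * v for v in d.values())))
    return num / den if den else 0.0


def classify(x_vec, training, tau):
    """Assign the label of the most similar training patent if sim(X, D_j) >= tau."""
    best_label, best_sim = None, 0.0
    for d_vec, label in training:
        s = cosine(x_vec, d_vec)
        if s > best_sim:
            best_label, best_sim = label, s
    return best_label if best_sim >= tau else None
```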
Text Based Classification System II All words except stop words present in patents are treated as key terms. The key terms whose frequency of occurrence is smaller than the threshold is dropped from the list of key terms. We treat title, abstract and claims part of a patent independently. I-class, U-class, author and assignee that are predominantly present in relevant patents provide strong classification. The basic assumption is that an author or assignee work and get patents in a specific area. I-class and U-class are classifications that can be used directly. A patent now is represented as shown in table 1. We assign weight to each keyword and each patent is represented as shown in table 2. Keywords (Title) kwt1 ft1 kwt2 ft2
Keywords (Abstract) kwa1 fa1 kwa2 f a2
Keywords (Claim) kwc1 fc1 kwc2 fc2
Kwtn
kwan
kwcn
ftn
fan
fcn
I-class
U-class
I1 I2
1 0
U1 U2
Author Assignee Name Name 0 au1 0 as1 1 1 au2 0 as2 0
In
0
Un
1 aun
0
asn
0
Table 1- Document representation in terms of its key terms, their frequencies and other features
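One possible in-memory form of the Table 1 representation is sketched below; the record layout and field names are illustrative rather than the actual database schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PatentRecord:
    """Per-section key-term frequencies plus binary indicators (Table 1 style)."""
    title_terms: Dict[str, int] = field(default_factory=dict)     # kwt -> ft
    abstract_terms: Dict[str, int] = field(default_factory=dict)  # kwa -> fa
    claim_terms: Dict[str, int] = field(default_factory=dict)     # kwc -> fc
    i_class: Dict[str, int] = field(default_factory=dict)         # I_k -> 0/1
    u_class: Dict[str, int] = field(default_factory=dict)         # U_k -> 0/1
    authors: Dict[str, int] = field(default_factory=dict)         # au_k -> 0/1
    assignees: Dict[str, int] = field(default_factory=dict)       # as_k -> 0/1
```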
Document   Key1       Key2       ...   Keym       Class (1: Relevant, 0: Non-relevant)
Doci       fi1 * w1   fi2 * w2   ...   fim * wm   1
Docj       fj1 * w1   fj2 * w2   ...   fjm * wm   0

Table 2- Document representation with weights assigned to its key terms, and other features, where m is the number of key terms and flk is the frequency of the k-th key term in the l-th patent. The weight for each key term is computed as follows:

w_j = \frac{\text{number of relevant patents in which key term } j \text{ appears}}{\text{total number of relevant patents}} \;-\; \frac{\text{number of non-relevant patents in which key term } j \text{ appears}}{\text{total number of non-relevant patents}}
This function ensures that if a key term is equally likely to be found in relevant as well as in non-relevant patents, it does not carry much weight. We used SPSS to obtain a classification function from the training set for each part of the patent. Each classification function is of the following form:

Ftitle(Relevant) = αt * x1 + βt * x2 + ...
Ftitle(Non-relevant) = γt * x1 + τt * x2 + ...

We then combine the three discriminant functions to obtain a single discriminant function using SPSS. We also assigned relatively larger weights to author, I-class and U-class. These large weights assign a test patent to the relevant class if its author, I-class or U-class belongs to the relevant class, without considering the discriminant function. The discriminant function for the title turns out to be:

y(title, non-relevant) = 39.56 * x1 + 46.59 * x2 + 1.6 * x3 + 2.9 * x4 + 1.8 * x5 + 0.7 * x6 + 0.64 * x7 - 1.01
y(title, relevant) = 16.88 * x1 + 91.82 * x2 + 2.67 * x3 + 3.44 * x4 + 0.98 * x5 + 1.99 * x6 + 1.56 * x7 - 1.52

where the xi are the key terms associated with the title. We obtained such functions for the abstract and claims as well. These functions were then combined to get a consolidated discriminant function.
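The key-term weighting defined above and the Table 2 style representation can be sketched as follows. This is a minimal illustration; the helper names are ours, and the training patents are assumed to be available as sets of key terms.

```python
def key_term_weight(term, relevant_docs, non_relevant_docs):
    """w_j: fraction of relevant patents containing the term minus the
    fraction of non-relevant patents containing it (near zero for uninformative terms)."""
    rel = sum(1 for doc in relevant_docs if term in doc) / len(relevant_docs)
    non = sum(1 for doc in non_relevant_docs if term in doc) / len(non_relevant_docs)
    return rel - non


def weighted_vector(term_freqs, weights):
    """Table 2 style representation: frequency of each key term times its weight."""
    return {t: f * weights[t] for t, f in term_freqs.items() if t in weights}
```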
4. SYSTEM FOR NON-TEXTUAL COMPONENT OF PATENTS

The non-textual component of the patents consists of drawings that explain the design. A patent contains anywhere from 5 to 30 images. One simple use of these images is to locate similar images in the patent database. Say someone applies for a new patent and the patent contains a few images; we can query the patent database to locate similar images and examine the corresponding patents for similarity. We refer to this component as PATSEEK; it has been described in [26]. A more involved application is to classify patents based on the images, similar to the classification explained in the previous section for text. A challenge is to augment the results obtained using textual information with the results based on non-textual information. The user has classified and labeled patents using only the abstract part of the patents. We take the images associated with the patents labeled as relevant and build image clusters. Since the user has not looked at the images, and a patent may contain very different looking images, we do not use the class labels given to the corresponding patents by the user. Our objective is to render to the user all possibly relevant patents using the text as well as the image information of patents.

There are two phases involved: one for the creation of the image database and feature vectors, and another for the classification of patents based on the feature vectors. The image grabber searches the patent database at the USPTO website (http://www.uspto.gov) and grabs the image pages from the patents labeled as relevant by the user. USPTO images are in bi-level TIFF format. A page image may contain more than one individual image, which we segment into individual images. The separated images are stored in the image database along with the patent number and the page number within the patent where the image was found. The graphic content is then used to compute the image feature vector as explained earlier, and the feature vectors are stored in the database.

We ignore the class labels provided to the relevant patents by the user. The reason is that a patent contains images that do not always relate to each other. By ignoring the class label associated with an image, images from a single patent may go to different classes/clusters. An image that does not make it into an existing cluster is put in a new cluster. We check for overlap between the clusters formed by the text classifier (T1, ..., Tm) and those formed by the image classifier (I1, ...) in order to label the classes formed by the image classifier. An image class gets the label of the text class with the maximum number of overlapping patents. The image content of a patent is much larger in size than the text content; to cut down on the download, we rely on the text classifier to filter out irrelevant patents and download images from the relevant patents only. Both image-side steps are sketched below.
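The sketch below illustrates the two image-side steps just described: an image joins the nearest existing cluster if its feature distance is below a threshold (otherwise it starts a new cluster), and each image cluster is then labeled with the text class it overlaps most. The cluster representation, the distance function and the threshold are assumptions; the actual system uses EOAC feature vectors and the pre-defined threshold mentioned in section 5.3.

```python
from collections import Counter


def assign_to_cluster(feature, clusters, distance, threshold):
    """clusters: list of (centroid, member_list); returns the updated list."""
    if clusters:
        idx, dist = min(((i, distance(feature, c[0])) for i, c in enumerate(clusters)),
                        key=lambda p: p[1])
        if dist <= threshold:
            clusters[idx][1].append(feature)
            return clusters
    clusters.append((feature, [feature]))   # start a new cluster around this image
    return clusters


def label_image_clusters(image_clusters, text_classes):
    """image_clusters: {cluster_id: set(patent_no)}; text_classes: {label: set(patent_no)}.
    Each image cluster gets the label of the text class with maximum overlap."""
    labels = {}
    for cid, patents in image_clusters.items():
        overlap = Counter({lab: len(patents & docs) for lab, docs in text_classes.items()})
        labels[cid] = overlap.most_common(1)[0][0] if overlap else None
    return labels
```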
5. EXPERIMENTS

In this section, we report experimental results. All experiments were performed on an Intel Pentium IV processor at 2.4 GHz with 512 MB of RAM. The system was implemented in Java (Sun JDK 1.4.1). The feature vector database was initially created on Oracle and later moved to MySQL Ver 12.21.
5.1. Classification System I

For supply chain patents, we started with the keywords planning, execution, inventory, quality, manufacturing, sales, transportation, plant, decision, logistics, information, system, strategic and tactical. We downloaded 4,342 patents related to supply chain. We then randomly selected 10% (434) of the patents. We read these patents and labeled 124 as relevant to supply chain management and the remaining 310 as non-relevant. We then extracted key terms from the relevant patents; these turned out to be 317 in number. The logit model was then used to determine the parameters α and βi by setting θ to 1 for relevant patents and to 0 for non-relevant patents. 1,100 patents turned out to be relevant when we set the filtering threshold to 0.42. At this threshold, we obtained a precision of 60%. We then classified the patents related to supply chain management into two classes, which in turn were classified further. We built three levels of classification. Part of the classification tree is shown in Table 3.

Supply Chain Management
  Planning
    Strategic Planning
    Tactical Planning
    Operational Planning
  Execution
    Inventory Management
    Quality Control
    Assembly Plant Management
    Decision Support Systems
    E-commerce
    Information Systems
    Sales

Table 3- Classification tree for Supply Chain Management

The supply chain growth pattern graphs (Figures 1 and 2) show the patents granted every year from 1975 till 2003 in supply chain planning, supply chain execution and their comparison. We can see that a lot of work was done during the period 1995 to 2002 in planning as well as in execution. The number of patents granted is comparable in both areas of supply chain management.
5.2. Classification System II

We were working with proprietary software for converting downloaded patents, i.e., for stemming, tagging different parts of the patent and keyword extraction. We soon realized that an HTML patent can be easily tagged and that we could use public domain database systems such as MySQL and do away with the proprietary software. We then built System II and experimented with a different classification system for scheduling patents. We downloaded 2,000 patents and only 300 were found relevant. We prepared the classification function using 300 relevant and 300 non-relevant patents. Using the classification function, we classified all patents again. We achieved 75% precision and a similar recall rate. Four classes, namely production scheduling, project scheduling, scheduling for computers and miscellaneous, were formed. At the next level, the number of classes increased to 6 and the number of patents in each class decreased accordingly; the recall rate and precision dropped to 60%. At the next level, the number of classes went up to 13 and the recall rate and precision dropped to 50%.
Figure 1- Number of Patents granted for Supply Chain Execution
Figure 2- Number of Patents granted for Supply Chain Planning
We also experimented with a solar battery energy system. We downloaded 2,617 patents and, on manual reading, around 109 out of 400 patents were found relevant. These patents were used to train the system. 2,115 patents out of the 2,617 were then found relevant. Six classes, namely Electronic, Home-lighting, Power Plant, Solar Batteries, Solar Cell and Solar Energy Utilization, were formed.
5.3. Image Classification

Our database contains approximately 12,394 images for the solar battery system, picked up from the patent database of the United States Patent Office (http://www.uspto.gov). All the images are monochrome images in TIFF format. For performance evaluation, one needs to manually mark the relevant images in the database. There are some test databases available for color images; however, we could not locate any monochrome line-drawing test database. We have arbitrarily chosen 50 images from our collection as queries. For each query image, a set of relevant images in the database has been manually identified. The images in this set are similar to the corresponding query image with some differences in scale or viewing position. An ideal image retrieval system is expected to retrieve all the relevant images. The ratio of the number of relevant images retrieved to the total number of images retrieved is referred to as precision:

\text{precision} = \frac{\text{number of relevant images retrieved}}{\text{total number of images retrieved}}

Another parameter on which content based image retrieval systems are evaluated is the recall rate [25]:

\text{recall} = \frac{\text{number of relevant images retrieved}}{\text{total number of relevant images in the collection}}
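A minimal sketch of how these two measures are computed for a single query image, given the manually marked relevant set, is shown below; the function and argument names are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given the manually marked relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```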
An efficient content based image retrieval system should have a reasonable recall rate and precision. We have compared the performance of two similarity measures, the L1 and L2 distances, in our experiments. We achieved a recall rate of nearly 100%, while precision varied from 10% to 60%. We used the images corresponding to the relevant patents marked by the user to create our initial clusters. It turns out, however, that the images belonging to one patent are very different. We did not force all images of a patent to go into one cluster; an image was allowed to go to another cluster if its distance was more than the pre-defined threshold distance. We then added the images from the remaining patents to the clusters using the EOAC feature vector as explained earlier. For solar energy, we downloaded 26,000 images. A total of 27 clusters were formed, where the smallest cluster has 69 images and the largest has 750 images. We were hoping that the text-based and image-based classifications of patents would complement each other. But it turns out that the images included in a patent are very diverse and do not form a coherent set. As a result, the overlap between the clusters formed by the text classifier and those formed by the image classifier is at most 40%. The clusters formed by the image classifier complement the clusters formed by the text classifier, and the user can look at both sets of clusters to cover all relevant patents.

The initial website page is shown in Figure 3. Some of the functions are password protected and some are available to all. We ran this website for a while and many users used it to create technology maps according to their requirements. The main objective of our work is to let the user define relevant patents and classify them, rather than providing him some generic classification.
Figure 3- Website for patent mining

References

[1] Moehrle M. G., Walter L., Geritz A., Müller S.: Patent-based inventor profiles as a basis for human resource decisions in research and development. R&D Management 2005;35(5):513-24.
[2] Wilhelm K., Finnegan J.: Dawn of a new asset class. Intellectual Asset Management 2005;14:59.
[3] Grupp H.: Foundations of the Economics of Innovation: Theory, Measurement and Practice. Cheltenham/Northampton MA: Edward Elgar, 1998.
[4] Grupp H.: The measurement of technical performance of innovations by technometrics and its impact on established technology indicators. Research Policy 1994;23(2):175-93.
[5] Huang Z., Chen H., Chen Z.-K., Roco M.C.: International Nanotechnology Development in 2003: Country, Institution, and Technology Field Analysis Based on USPTO Patent Database. Journal of Nanoparticle Research 2004;6(4):325-54.
[6] Hummon N.P., Doreian P.: Connectivity in a Citation Network: The Development of DNA Theory. Social Networks 1989;11:39-63.
[7] http://www.uspto.gov/patft/help/datesdb.htm
[8] Archibugi D., Sirilli G.: The direct measurement of technological innovation in business. In: Innovation and enterprise creation: Statistics and Indicators, ed. European Commission (Eurostat), European Commission; 2001.
[9] Fattori M., Pedrazzi G., Turra R.: Text mining applied to patent mapping: a practical business case. World Patent Information 2003;25:335-42.
[10] Weng S.S., Lin Y.J.: A study on searching for similar documents based on multiple concepts and distribution of concepts. Expert Systems with Applications 2003;25:355-68.
[11] Ko Y., Park J., Seo J.: Improving text categorization using the importance of sentences. Information Processing & Management 2004;40:65-79.
[12] Lertnattee V., Theeramunkong T.: Effect of term distributions on centroid-based text categorization. Information Sciences 2004;158:89-115.
[13] Hopkins D. K.: Searching for graphic content in USPTO trademark databases. World Patent Information 2003;25:107-16.
[14] Jain A.K., Vailaya A.: Image Retrieval using Color and Shape. Pattern Recognition 1996;29:1233-44.
[15] Kato T.: Database architecture for content-based image retrieval. In: Image Storage and Retrieval Systems (Jambardino A. and Niblack W., eds.), 1992, Proc. SPIE 2185:112-23.
[16] Flickner M. et al.: Query by image and video content: the QBIC system. IEEE Computer 1995;28:23-32.
[17] Bach J.R. et al.: The Virage image search engine: an open framework for image management. Proc. SPIE: Storage and Retrieval for Still Image and Video Databases IV 1996;2670:76-87.
[18] Hu M. K.: Visual pattern recognition by moment invariants. IRE Transactions on Information Theory 1962;8:179-87.
[19] Persoon E., Fu K. S.: Shape discrimination using Fourier descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 1986;8:388-97.
[20] Gonzalez R. C., Wintz P.: Digital Image Processing. Addison-Wesley, Reading, MA, 1992.
[21] Zhang D., Lu G.: Review of shape representation and description techniques. Pattern Recognition 2004;37(1):1-19.
[22] Freeman H.: On the encoding of arbitrary geometric configurations. IRE Transactions on Electronic Computers 1961;10:260-8.
[23] Canny J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 1986;8:679-98.
[24] Mahmoudi F., Shanbehzadeh J., Eftekhari A.M., Soltanian-Zadeh H.: Image retrieval based on shape similarity by edge orientation autocorrelogram. Pattern Recognition 2003;36:1725-36.
[25] Müller H., Müller W., Squire D. McG., Marchand-Maillet S., Pun T.: Performance evaluation in content-based image retrieval: overview and proposals. Pattern Recognition Letters 2001;22:593-601.
[26] Tiwari A., Bansal V.: PATSEEK: Content Based Image Retrieval System For Patent Database. In: Proceedings of the International Conference on Electronic Business 2004, Tsinghua University, Beijing, China, Dec 5-8, 2004.