Research Journal of Applied Sciences, Engineering and Technology 4(20): 3973-3980, 2012 ISSN: 2040-7467 © Maxwell Scientific Organization, 2012 Submitted: December 20, 2011 Accepted: April 20, 2012 Published: October 15, 2012

Named Entity Recognition Based on A Machine Learning Model

Jing Wang, Zhijing Liu and Hui Zhao
School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
Corresponding Author: Jing Wang, School of Computer Science and Technology, Xidian University, Xi’an, 710071 China

Abstract: For the recruitment information in Web pages, a novel unified model for named entity recognition is proposed in this study. The model provides a simple statistical framework to incorporate a wide variety of linguistic knowledge and statistical models in a unified way. In our approach, Multi-Rules are first built for a better representation of the named entity, in order to emphasize its specific semantics and term space. Then an optimized algorithm over the hierarchically structured DSTCRFs is performed, in order to pick out the structure attributes of the named entity from the recruitment knowledge and to improve the efficiency of training. The experimental results show that the accuracy rate is significantly improved and the complexity of sample training is decreased.

Keywords: Entity identification, Hidden Markov Model (HMM), named entity

INTRODUCTION

In general, the task of Named Entity Recognition (NER) is to recognize phrases in a document, such as the Person Name (PN), Organization Name (ON), Location Name (LN) and so on. As a basic subtask of Natural Language Processing (NLP), it has been applied to Information Extraction (IE), Question Answering (QA), Parsing and Metadata Tagging in the Semantic Web. In this study, NER is used to extract a special kind of Web information, recruitment information, in which we mainly focus on Position (PSN), ON, LN and Date.

We pay particular attention to PSN recognition, which can help us get more knowledge about the recruitment information. Compared with the ON and LN, the PSN is a special and complex kind of NE. Firstly, a PSN always contains special words that denote a certain job, such as "engineer" and "teacher". Secondly, the length of a PSN is variable: some contain dozens of words and some just two, which makes it difficult to decide the border of a PSN. Finally, a PSN often contains a location name, such as "Shanghai sub-company manager". These properties seriously affect NER performance. There have been many studies on PN and LN recognition, but very little research on PSN.

In this study, a novel unified model for NER based on improved Dynamic hidden Semi-Markov Tree Conditional Random Fields (DSTCRFs) is proposed for PSN recognition. In this way, we reduce the state transitions and the total number of input symbols in the sequence, which improves the operating efficiency of NER. The model provides a simple statistical framework to incorporate a wide variety of linguistic knowledge and statistical models in a unified way. The experimental results show that the accuracy rate is significantly improved and the complexity of sample training is decreased.

This article is organized as follows. We briefly describe the related work on NER in section II and present the NE features in section III. Section IV gives the definition of the DSTCRFs model for NER. With this framework, section V gives the experimental results. Finally, the summary and future work are given in section VI.

LITERATURE REVIEW

In the Seventh Message Understanding Conference (MUC-7) (Eason et al., 1955), the best English NER system reached 95% recall and 92% precision. However, compared with English NER, Chinese NER is still at an initial stage. In the Multilingual Entity Task (MET-2) (Chinchor, 1988), the best Chinese NER system in MUC-7 achieved precisions of 66, 89 and 89% and recalls of 92, 91 and 88%.

At present, the models for Chinese NER can mainly be divided into two categories, rule-based methods and statistical methods. Rule-based methods commonly use features of the word to trigger the NER (Toutanova et al., 2005). For instance, the Chinese family name is adopted to trigger PN recognition (Kou, 2008) and the keywords at the end of the organization name to trigger ON recognition (Sun et al., 2002). The statistical methods build a model for NER based on the statistical analysis of a large corpus of NEs and the semantic analysis of their context features (Cohen and Sarawagi, 2004; Fu and Luke, 2005). Statistical Machine Learning (ML) has been introduced for Chinese NER, such as N-gram Grammar, Hidden Markov Model (HMM) (Cohen and Sarawagi, 2004; Fu and Luke, 2005), Maximum Entropy Model (MEM) (Fresko et al., 2005; Uchimoto et al., 2000; Lim et al., 2004), Support Vector Machine (SVM) (Li et al., 2005), Conditional Random Fields (CRF) (Cohn and Blunsom, 2005; Roark et al., 2004; Zou et al., 2005) and so on. Later, integrated models were proposed, which combine different ML models into a unified statistical framework to improve the system's performance.

CRFs have been widely adopted for text processing applications, such as Part-of-Speech (POS) tagging, chunking and semantic role labeling (Sarawagi and Cohen, 2004; Chen and Goodman, 1998; Gao et al., 2006). Recently, the application of CRFs has been expanded to word alignment, IE and document summarization (Lafferty et al., 2001; Peng et al., 2004; Peng and McCallum, 2004). Thereby, more and more improved methods for CRF have been proposed to address this challenge (Li and McCallum, 2003).

2D Conditional Random Fields (2D CRFs) (Zhu et al., 2005) were proposed to extract objects (a kind of NE) from two-dimensionally laid-out Web pages and to better incorporate the two-dimensional neighborhood dependencies. However, they did not take into account cross-distance dependencies. For simple objects, the effect of IE is very good, but for complex objects the result is not satisfactory. A possible explanation is that there are many noise elements between the different blocks.

DCRFs (Zhu et al., 2008) combine the best of both conditional random fields and the widely successful Dynamic Bayesian Networks (DBNs). Often, however, we wish to represent more complex interactions between labels, for example, when longer-range dependencies exist between labels, when the state can be naturally represented as a vector of variables, or when performing multiple cascaded labeling tasks on the same input sequence.
These models assume predefined structures; therefore, they are not flexible enough to adapt to many real-world datasets. Subsequently, two integrated models, Semi-CRFs and HSCRF, were proposed. Semi-Markov Conditional Random Fields (Semi-CRFs) (Zhu et al., 2007) are a tractable extension of CRFs that offers much of the power of higher-order models and allows features which measure properties of segments, rather than individual elements. The Hierarchical Semi-Markov Conditional Random Field (HSCRF), a generalization of embedded undirected Markov chains to model complex hierarchical, nested Markov processes, is described in Truyen et al. (2008). These two models put more emphasis on the semantic features for entity recognition and make good use of entity characteristics instead of the single word. Moreover, the models are divided into different levels for entity recognition. Their shortcoming is that they are more suitable for recognizing a single record and its attributes.

In short, the motivation for our improved method comes from the following two aspects. On the one hand, the available NER models require too many training examples. That is why we improve the traditional probabilistic statistical model to enhance the efficiency of training. On the other hand, for NEs with a complex structure, we optimize the model to improve the accuracy of NER. We propose a novel unified framework, DSTCRFs, for PSN identification. Under this framework, we first adopt hidden semi-Markov models to identify entities with their word feature table. Then, based on the labels, the DTCRFs are used to recognize nested PSNs for the target entity type.

METHODOLOGY

The feature definition of NE: As the NE is usually emphasized with some features in Web pages, we can take these characteristics into account for NER. We expand on the analysis of these characteristics in this section.

Structure features: In Web pages, the PSN is displayed in a large, red font, different from that of other text. The display of an NE is used to emphasize important information and to attract user attention. As a result, CSS attributes are introduced to define the structure features. However, different from the traditional approach, we use a simple Boolean model instead of the detailed values of the CSS attributes to express the features.

Content features: The characteristics of an NE depend on its context. For example, a PSN mostly appears in a form such as "recruit market manager". Therefore, we can adopt context features for the identification. If the preceding word is "recruit", the probability that the next word belongs to a PSN is very large.
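The structure and content features described above can be sketched as a small feature extractor. This is only an illustrative sketch: the `Token` type, the `TRIGGER_WORDS` and `JOB_WORDS` lists, the feature names and the 12 px base font size are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical word lists, for illustration only.
TRIGGER_WORDS = {"recruit", "recruited"}
JOB_WORDS = {"engineer", "teacher", "manager"}


@dataclass
class Token:
    text: str
    font_size_px: int  # font size taken from the CSS of the page
    color: str         # CSS color of the text


def token_features(tokens, i, base_font_px=12):
    """Return Boolean structure and content features for tokens[i]."""
    tok = tokens[i]
    prev_word = tokens[i - 1].text.lower() if i > 0 else ""
    return {
        # Structure features: Boolean flags instead of raw CSS values.
        "is_large_font": tok.font_size_px > base_font_px,
        "is_red": tok.color.lower() in {"red", "#ff0000"},
        # Content features: the preceding trigger word and job keywords.
        "prev_is_trigger": prev_word in TRIGGER_WORDS,
        "is_job_word": tok.text.lower() in JOB_WORDS,
    }
```

Each token is mapped to a dictionary of Boolean values, which mirrors the choice of a Boolean model over detailed CSS values; a real system would need a much richer trigger vocabulary.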
The DSTCRFs model for NER: We adopt DSTCRFs to identify various types of NE, based on feature word tagging. The basic idea is to build a feature word tag set for each type of NE according to its composition and characteristics. The HSCRFs model divides the Conditional Random Fields model into two layers, POS and NP, and embeds a semi-Markov model in each state. However, this model cannot effectively deal with included nested NEs, only with parallel nested NEs, which are defined as follows:

Definition 2: In the external characteristics of a PSN, let fa1 and fb1 be the following trigger feature and the previous trigger feature. In the internal characteristics, fa2 is the feature word and the others are fb2. Then fa1 fa2 fa1 fb1 fb2 fb1 stands for an included nested NE and fa1 fa2 fb2 fb2 fb1 for a parallel nested NE.

Therefore, the HSCRFs cannot handle such a complex NE, the included nested NE. To address this issue, we propose the improved model DSTCRFs to resolve the nested NER problem. HSCRFs models are generally divided into three layers to complete the identification task. The DSTCRFs model, however, is not a fixed model but a dynamically generated one. The number of levels of a DSTCRFs depends on the relationship between the components of the NE, which is why we call it "dynamic".

DSTCRFs: Definition 3: The joint model is a pair (y_T, y_SM), where y_T is the tree-structured hierarchical model DTCRFs and y_SM is the label model of the HSMM; together they are known as DSTCRFs. A valid assignment must satisfy the condition that the two assignments match at the leaf variables. Then, the joint probability distribution of our model DSTCRFs has the following factorization form:



$$p(y_T, y_{SM} \mid x) = p(y_T \mid x)\, p(y_{SM} \mid x, y_T) = p(y_T \mid x) \prod_{0 \le i \le F,\; j = L} p\left(y_{SM}^{i} \mid x, y_T^{j}\right) \tag{2}$$

where $y_T^{j}$ denotes the nodes of the j-th level, so $j = L$ refers to the leaf nodes of the DTCRFs. The different parts are calculated as follows:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{e \in \{E_{pc}, E_{cp}, E_{ns}, E_{ss}\},\, j} \lambda_j t_j(e, y_e, x) + \sum_{v \in V,\, k} \mu_k s_k(v, y_v, x) \right)$$

$$Z(x) = \sum_{y} \exp\left( \sum_{e \in \{E_{pc}, E_{cp}, E_{ns}, E_{ss}\},\, j} \lambda_j t_j(e, y_e, x) + \sum_{v \in V,\, k} \mu_k s_k(v, y_v, x) \right)$$

where $y_e$ and $y_v$ are the label sequences, denoting the sets of components of y associated with edge e and vertex v in the linear chain, respectively; $t_j$ and $s_k$ are feature functions; the parameters $\lambda_j$ and $\mu_k$ are the weights of the feature functions $t_j$ and $s_k$, respectively, estimated from the training data; and $Z(x)$ is the normalization factor, also known as the partition function. In addition, $E_{ns}$ represents the state transition between brother nodes and $E_{ss}$ between the brother's child nodes. Likewise, $E_{pc}$ stands for the state transition from parent node to child node and $E_{cp}$ from child nodes to parent nodes.

For $p(s^i \mid x, y_T^{j})$, we introduce the GHSMM to compute its value. Formally, a GHSMM consists of the following five parts:

• N is the number of states in the model. The N states are $S = \{S_1, S_2, \ldots, S_N\}$ and the state at time t is $y_t$, so $y_t \in S$.
• M is the number of observable symbols, $V = \{V_1, V_2, \ldots, V_M\}$, and the observable symbol at time t is $x_t$, where $x_t \in V$.
• π: the initial state probability distribution, $\pi = \{\pi_1, \pi_2, \ldots, \pi_N\}$, where $\pi_i = P(y_1 = S_i)$, $1 \le i \le N$.
• A: the hidden state transition probability distribution, $A = (a_{ij})_{N \times N}$, where

$$a_{ij} = P(y_{t+1} = S_j \mid y_t = S_i), \quad 1 \le i, j \le N \tag{5}$$

• B: the emission probability distribution, $B = \{B^1, B^2, \ldots, B^Z\}$, $B^s = (b_j(k_s))_{N \times M}$, where

$$b_j(k_s) = P(x_{t,s} = V_k \mid y_t = S_j), \quad 1 \le j \le N,\; 1 \le k \le M \tag{6}$$

Let $G = \{g_1, g_2, \ldots\}$ denote a segmentation of x, where segment $g_j = (x, st, ed)$ consists of a start position st and an end position ed. In this model, g is used instead of y.

Given the model λ, our objective is to find the state transition sequence Y which maximizes $P(X, Y \mid \lambda)$:

$$P(X, Y \mid \lambda) = P(Y \mid \lambda)\, P(X \mid Y, \lambda) = \pi_{y_1} b_{y_1}(x_{1,s})\, a_{y_1 y_2}\, b_{y_1 y_2}(x_{2,s}) \prod_{t=3}^{T} \left( a_{y_{t-2} y_{t-1} y_t}\, b_{y_{t-1} y_t}(x_{t,s}) \right) \tag{7}$$

Here $a_{y_{t-2} y_{t-1} y_t}$ is the transition probability given the two preceding states and $b_{y_{t-1} y_t}$ is the emission probability given the current and the preceding state.

In the GHMM, an observation symbol k is extended from its mere term attribute to a set of CSS attributes $k_1, k_2, \ldots, k_Z$. We assume the attributes are independent of each other and consider a linear combination of them:

$$b_{ij}(k) = \sum_{s=1}^{Z} \alpha_s\, b_{ij}^{s}(k_s)$$

where $\alpha_s$ is the weight factor for the s-th attribute and $\sum_{s=1}^{Z} \alpha_s = 1$, $0 \le \alpha_s \le 1$. So (7) now becomes:

$$P(X, Y \mid \lambda) = \pi_{y_1} b_{y_1}(x_{1,s})\, a_{y_1 y_2} \left[ \sum_{s=1}^{Z} \alpha_s\, b_{y_1 y_2}^{s}(x_{2,s}) \right] \prod_{t=3}^{T} \left( a_{y_{t-2} y_{t-1} y_t} \left[ \sum_{s=1}^{Z} \alpha_s\, b_{y_{t-1} y_t}^{s}(x_{t,s}) \right] \right)$$
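As a rough illustration of the joint probability with attribute-mixed emissions, the following sketch evaluates $P(X, Y \mid \lambda)$ for one given state sequence. The function names and array layouts are hypothetical, and the sketch assumes (beyond what the formula states) that the first emission is also mixed over the Z attributes.

```python
import numpy as np


def mixed_emission(tables, alpha, cond, symbols):
    """sum_s alpha_s * b^s[cond](x_{t,s}) for one position.

    tables[s] is the emission array of attribute s, indexed by the
    conditioning states in `cond` followed by the symbol index.
    """
    return sum(w * tbl[cond + (k,)] for w, tbl, k in zip(alpha, tables, symbols))


def joint_probability(pi, a1, a2, b1, b2, alpha, y, x):
    """P(X, Y | lambda) for a second-order chain (hypothetical layout).

    pi[i]           initial probability of state i
    a1[i, j]        first-order transition, used only for y_1 -> y_2
    a2[i, j, k]     P(y_{t+1}=k | y_t=j, y_{t-1}=i)
    b1[s][j, k]     emission of symbol k for attribute s in state j
    b2[s][i, j, k]  emission of symbol k for attribute s given (y_{t-1}=i, y_t=j)
    alpha[s]        attribute weights, summing to 1
    y               state indices y_1..y_T
    x               x[t][s] = symbol index of attribute s at position t
    """
    p = pi[y[0]] * mixed_emission(b1, alpha, (y[0],), x[0])
    if len(y) > 1:
        p *= a1[y[0], y[1]] * mixed_emission(b2, alpha, (y[0], y[1]), x[1])
    for t in range(2, len(y)):
        p *= a2[y[t - 2], y[t - 1], y[t]]
        p *= mixed_emission(b2, alpha, (y[t - 1], y[t]), x[t])
    return p
```

For long sequences the product underflows quickly, so a practical implementation would accumulate log probabilities instead.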


In a first-order model, when the state $S_t$ transits to the state $S_{t+1}$, the state transition and emission probabilities are related to the state at time t, but not to any previous state. However, the emission probability is related not only to the current state but also to the state before it. So, in this study, we suppose that the state transition sequence is based on a second-order Markov chain: when the state $S_t$ transits to the state $S_{t+1}$, the state transition probability is related not only to the state at time t but also to the state at time t-1. In this case, (5) becomes:

$$a_{ijk} = P(y_{t+1} = S_k \mid y_t = S_j, y_{t-1} = S_i), \quad 1 \le i, j, k \le N, \qquad \sum_{k=1}^{N} a_{ijk} = 1,\; a_{ijk} \ge 0$$

And in the same way, (6) can be rewritten as:

$$b_{ij}(k_s) = P(x_{t,s} = V_k \mid y_t = S_j, y_{t-1} = S_i), \quad 1 \le i, j \le N,\; 1 \le k \le M$$

DSTCRFs training: Generally, for the data training of DSTCRFs, there are two inference problems to be solved for an unlabeled sequence x:

• Compute the parameter estimation $p(y_T, y_{SM}^{i} \mid x^{i}, \Theta, \Psi)$ over all cliques.
• Decode $y^{*} = \arg\max p(y_T, y_{SM} \mid x)$ with the Viterbi algorithm, which is used to decide the label sequence with maximum probability for the sequence x.

The parameter estimation problem is to calculate the parameters from the training data $D = \{(y_T^{i}, y_{SM}^{i}, x^{i})\}$. More specifically, we optimize the log-likelihood objective function with the conditional model $p(y_T^{i}, y_{SM}^{i} \mid x^{i}, \Theta, \Psi)$:

$$L(\Theta, \Psi) = \sum_{0 \le i \le N} \log p\left(y_T^{i}, y_{SM}^{i} \mid x^{i}, \Theta, \Psi\right)$$

where $\Theta = \{\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots\}$ is the parameter vector of the TCRFs and $\Psi = \{a_{ij}, \ldots; b_{ij}, \ldots\}$ that of the GHSMM. By the factorization of the joint model:

$$L(\Theta, \Psi) = \sum_{0 \le i \le N} \log p\left(y_T^{i} \mid x^{i}, \Theta, \Psi\right) + \sum_{0 \le i \le N,\; j = L} \log p\left(y_{SM}^{i} \mid x^{i}, y_T^{j}, \Theta, \Psi\right)$$

We assume that the training data consist of a set of data points $D = \{(y_T^{i}, y_{SM}^{i}, x^{i})\}$, each of which has been generated independently and identically from the joint empirical distribution $\tilde{p}(x, y)$, and the parameters are chosen such that the log-likelihood of the training data is maximized:

$$L(\Theta) = \sum_{0 \le i \le N} \log p\left(y^{i} \mid x^{i}, \Theta\right) \propto \sum_{x, y} \tilde{p}(x, y) \left[ \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x) + \sum_{i=1}^{n} \sum_{k} \mu_k f_k(y_i, x) \right] - \sum_{x} \tilde{p}(x) \log Z(x)$$

$$\frac{\partial L(\Theta)}{\partial \lambda_k} = \sum_{x, y} \tilde{p}(x, y) \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x) - \sum_{x, y} \tilde{p}(x)\, p(y \mid x, \Theta) \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x) = E_{\tilde{p}(x, y)}[f_k] - E_{p(y \mid x, \Theta)}[f_k]$$

The Viterbi decoding problem is to calculate $y^{*} = \arg\max p(y_T, y_{SM} \mid x)$, the label sequence with maximum probability for the sequence x. Let $P_{vt}(t, y_s)$ indicate the maximum probability of a sequence of length t ending with state s. The two factors are decoded separately, so we have the following calculations:

$$y^{*} = \arg\max p(y_T, y_{SM} \mid x) = \arg\max p(y_T \mid x) \cdot p(y_{SM} \mid x, y_T) = \arg\max p(y_T \mid x) \cdot \arg\max p(y_{SM} \mid x, y_T) = P_{vt} \cdot P_{vsm}$$

$$P_{vt}(t, y_s) = \max \left\{ P_{vt}(t-1, y_{s-1}) \cdot A_t(y_{s-1}, y_s \mid x) \right\}$$

$$A_t(y_{s-1}, y_s \mid x) = \exp\left( \sum_{e \in \{E_{pc}, E_{cp}, E_{ns}, E_{ss}\},\, l} \lambda_l t_l(y_{s-1}, y_s, x) + \sum_{v \in V,\, k} \mu_k s_k(y_s, x) \right)$$

where $A_t(y_{s-1}, y_s \mid x)$ represents the potential function between the state s-1 and the state s, used to record the optimal previous state of each state. According to Bayes' law, $P(y \mid x) = P(y, x) / P(x) = P(y) P(x \mid y) / P(x)$, so we have:

$$\log P(y \mid x) = \log P(y) + \log P(x \mid y) - \log P(x)$$

Since $P(x)$ does not depend on y, it can be dropped from the maximization:

$$P_{vsm} = \arg\max_{y} \log P(y \mid x) = \arg\max_{y} \left( \log P(y) + \log P(x \mid y) \right) \tag{18}$$

The observations are assumed to be conditionally independent given the states:

$$P(x \mid y) = \prod_{i=1}^{n} P(x_i \mid y_i)$$

Applying it to (18), we have:

$$P_{vsm} = \arg\max_{y} \log P(y \mid x) = \arg\max_{y} \left( \sum_{i=1}^{n} \log P(x_i \mid y_i) + \log P(y) \right)$$
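A decoding objective of this form, a sum of per-position log emission terms plus the log prior of the state sequence, can be maximized by dynamic programming. The sketch below is a simplified illustration that assumes a first-order chain for $P(y)$ rather than the second-order chain used in this study; the function name and array layouts are hypothetical.

```python
import numpy as np


def viterbi(log_init, log_trans, log_emit):
    """Best state path for a first-order chain, in log space.

    log_init[j]      log P(y_1 = j)
    log_trans[i, j]  log P(y_t = j | y_{t-1} = i)
    log_emit[t, j]   log P(x_t | y_t = j)
    Returns (path, score), where score is the log probability of the path.
    """
    T, N = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, N), dtype=int)  # optimal previous state of each state
    for t in range(1, T):
        cand = score[:, None] + log_trans  # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Follow the back-pointers from the best final state.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```

Extending the sketch to a second-order chain amounts to running the same recursion over pairs of adjacent states, at the cost of a larger table.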

Table 1: The sample datasets
                                  Website
Category
computers                   109     84     91
biomedical                   86     52    107
architecture                 95     68     80
environmental protection     87     91     84
mechanization                78     83     74
secretary                    89     67     72
Training number             436    329    371
Test number                 108    116    137
Total number                544    445    508

Now the mutual information between y and x is used instead of the conditional probability. We assume mutual information independence (Chen and Goodman, 1998):

$$MI(y, x) = \sum_{i=1}^{n} MI(y_i, x)$$

$$\log \frac{P(y, x)}{P(y) \cdot P(x)} = \sum_{i=1}^{n} \log \frac{P(y_i, x)}{P(y_i) \cdot P(x)}$$

Or we write this as:

$$\log P(y \mid x) - \log P(y) = \sum_{i=1}^{n} \log P(y_i \mid x) - \sum_{i=1}^{n} \log P(y_i) \tag{22}$$

Rearranging (22), which rests on the above mutual information independence assumption, we obtain (23):

$$\log P(y \mid x) = \log P(y) - \sum_{i=1}^{n} \log P(y_i) + \sum_{i=1}^{n} \log P(y_i \mid x) \tag{23}$$

So the final tagging result is as follows:

$$P_{vsm} = \arg\max_{y} \log P(y \mid x) = \arg\max_{y} \left( \log P(y) - \sum_{i=1}^{n} \log P(y_i) + \sum_{i=1}^{n} \log P(y_i \mid x) \right) \tag{24}$$

For the $P(y)$ in (24), we introduce the n-gram language model, based on statistical probability in NLP, to calculate the probability of a sentence $y = (y_1, y_2, \ldots, y_m)$ according to the chain rule:

$$P(y) = P(y_1) \prod_{i=2}^{m} P(y_i \mid y_1, y_2, \ldots, y_{i-1})$$

In practice, we cannot calculate it directly with this formula because of the data-sparseness problem. One viable solution is to suppose that the probability of each tag depends only on the previous N tags. There are mainly four models: the context-free unigram (N = 0), bigram (N = 1), trigram (N = 2) and fourgram (N = 3). The trigram model is the most commonly used and we choose it in this study.

In order to compute $\sum_{i=1}^{n} \log P(y_i \mid x)$, an improved back-off model is proposed, as follows:

$$P_{bo}(h \mid h', g) = \begin{cases} P_{GT}(h \mid h', g) & \text{if } C(h, h', g) > 0 \\ \alpha(h', g)\, P_{bo}(h \mid h') & \text{otherwise} \end{cases}$$

where $P_{bo}(h \mid h')$ is the probability for the traditional trigram model, $h = t_{i-n+1} \ldots t_{i-1}$ and $h' = t_{i-n+2} \ldots t_{i-1}$. $P_{GT}(h \mid h', g)$ is the probability for the bigram model smoothed by the Good-Turing method (Chen and Goodman, 1998).

Many possible word sequences may not occur in the training corpus. Namely, if $P_{bo}(h \mid h', g) = 0$, the probability of the whole sentence is zero. Therefore smoothing is used to address this problem:

$$P_{GT}(h \mid h', g) = \frac{C_{GT}(h, h', g)}{C(h', g)}$$

$$C_{GT}(h, h', g) = \left( C(h, h', g) + 1 \right) \times \frac{N\left( C(h, h', g) + 1 \right)}{N\left( C(h, h', g) \right)}$$

$N(C)$ is the number of bigram items that occur C times in the training corpus and $\alpha(h', g)$ is the back-off weight:

$$\alpha(h', g) = \frac{\beta(h', g)}{\sum_{h:\, C(h', h, g) = 0} p(h \mid h')} = \frac{\beta(h', g)}{1 - \sum_{h:\, C(h', h, g) > 0} p(h \mid h')}$$

$$\beta(h', g) = 1 - \sum_{h:\, C(h', h, g) > 0} p_{GT}(h \mid h', g)$$
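The Good-Turing adjusted count used in the smoothing step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, n-grams are represented as plain dictionary keys, and falling back to the raw count when no item occurs C + 1 times is an assumption added here to avoid a zero adjusted count.

```python
from collections import Counter


def good_turing_counts(ngram_counts):
    """Adjusted counts C_GT = (C + 1) * N(C + 1) / N(C).

    ngram_counts maps each observed n-gram to its raw count C.
    N(C) is the number of distinct n-grams that occur exactly C times.
    Assumption: when N(C + 1) is zero, the raw count is kept unchanged.
    """
    freq_of_freq = Counter(ngram_counts.values())  # N(C)
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq.get(c + 1, 0)
        if n_c_plus_1 > 0:
            adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c
        else:
            adjusted[ngram] = float(c)
    return adjusted
```

Dividing an adjusted count by the count of its history gives the smoothed probability, and the probability mass removed this way is what the back-off weight redistributes to unseen items.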

EXPERIMENTAL RESULTS

Data set: Based on the above discussion, we verify the NER based on the DSTCRFs model in this study. We adopt test datasets from three Websites. There are six classes of Web pages: Computers, Biomedical, Architecture, Environmental Protection, Mechanization and Secretary. We randomly select 1497 Web pages as datasets, 1136 for training and the remaining 361 for testing. The datasets are presented in Table 1. When we compare the identification results for different NEs, we employ the Recall (R), the Precision (P) and the F measure, which combines recall and precision. They are defined as follows (Sun et al., 2002):

R = (number of correctly identified entities)/(number of all entities)
P = (number of correctly identified entities)/(number of possible entities)
F = (2 × R × P)/(R + P)

Table 2: The results for NER based on DSTCRFs
                               Pos     Org     Loc     Date
Computer                 R    82.9    86.5    92.4    94.4
                         P    87.3    89.3    94.2    96.8
                         F    85.2    88.7    93.0    95.5
Biomedicine              R    81.1    81.2    93.7    96.9
                         P    85.6    87.2    95.7    97.6
                         F    83.4    84.1    94.8    97.4
Architecture             R    84.7    79.2    92.5    95.5
                         P    88.2    89.0    96.9    98.0
                         F    86.0    84.8    94.1    97.2
Environment protection   R    86.5    85.5    95.6    96.1
                         P    88.8    87.1    96.3    98.4
                         F    87.6    86.2    95.9    97.5
Mechanization            R    87.4    83.4    94.9    97.3
                         P    89.3    86.2    96.9    97.8
                         F    88.3    84.8    95.1    97.5
Secretary                R    82.0    83.8    94.8    93.8
                         P    87.2    86.8    98.4    95.1
                         F    85.8    85.4    96.3    94.3

Table 3: The results of three methods for the NER (%)
                  Pos     Org     Loc     Date
CRFs        P    71.6    69.8    95.7    94.8
            R    77.9    77.6    96.7    96.9
            F    74.8    78.4    95.9    96.1
HCRFs       P    76.7    76.4    96.5    96.7
            R    81.4    79.5    97.3    97.8
            F    79.2    79.9    96.9    97.2
HSCRFs      P    79.3    80.5    97.2    96.9
            R    82.0    81.6    98.1    98.4
            F    81.7    80.9    97.8    97.3
DSTCRFs     P    84.2    86.3    98.2    97.2
            R    88.5    89.0    98.7    99.7
            F    86.3    87.1    98.4    98.5

Fig. 1: The average of NER based on DSTCRFs

The result of web IE: We adopt the improved DSTCRFs method to identify the Position (Pos), Organization (Org), Location (Loc) and Date, respectively. The averages of the experimental results for the NER are shown in Table 2. The results in Table 2 reveal that the DSTCRFs model achieves higher accuracies for Loc and Date. Their feature words are numerals or fixed nouns, so their F measures are almost all over 94%. Because of the complexity of Pos and Org, their identification accuracies are lower than those of Loc and Date, but they have improved greatly. Because professional glossaries in Mechanization recruitment pages are more abundant than in other recruitment pages, the F measure of Pos there is high, at 88.3%. The F measure of Org in Computer recruitment pages is 88.7%, which is higher than in the other kinds of pages. In the Computer recruitment pages, most Org names end with “Co.” or “Ltd.”, so the feature words are simple to identify. We have also performed an experiment to compare the proposed model with other models on the different datasets. The P, R and F measures for the NER based on the improved DSTCRFs, CRFs, HCRFs and HSCRFs are shown in Table 3.
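The R, P and F values reported in this section follow the definitions given with the data set. A minimal sketch of the computation is shown below; the function and argument names are hypothetical, and the "number of possible entities" is read here as the number of entities the system outputs, which is an assumption.

```python
def precision_recall_f(num_correct, num_gold, num_predicted):
    """Return (P, R, F) from entity counts.

    num_correct    correctly identified entities
    num_gold       all entities in the annotated data
    num_predicted  entities output by the system (assumed meaning of
                   "possible entities")
    """
    recall = num_correct / num_gold if num_gold else 0.0
    precision = num_correct / num_predicted if num_predicted else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return precision, recall, f_measure
```

For example, `precision_recall_f(80, 100, 90)` corresponds to 80 correct entities out of 100 annotated and 90 predicted ones. The guards against zero denominators are a convenience of the sketch, not part of the definitions.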

From the results obtained in the experiment, it can be seen that DSTCRFs performs very well compared with the two mainstream algorithms, CRFs and HCRFs. For Pos, the F measure of DSTCRFs is higher than those of the three previous models by about 11, 7 and 5 percentage points, respectively (about 9, 7 and 6 for Org); for Loc the gains are about 3, 2 and 1 points and for Date about 2, 1 and 1 points. This means that our model is more applicable to Web NEs, especially nested NEs. Because Loc entities are fixed nouns in the feature table, their extraction accuracy is relatively high. Some Pos and Org entities are abbreviated or nested, so the F measure for their recognition is relatively low. The HSCRFs model achieves a lower F measure than DSTCRFs but a higher one than the other methods. HSCRFs puts more emphasis on entity characteristics than on single words, and the model is divided into different levels for entity recognition. Hence, the F measure of the HSCRFs model is higher than those of the CRFs and HCRFs methods. However, it is not effective on Web pages that contain multiple records. Consequently, DSTCRFs performs better than HSCRFs.

For different dataset sizes, the average accuracy of the four methods for NER is shown in Fig. 1. The curves indicate that the accuracy of DSTCRFs is significantly improved compared with the other three methods. In particular, once the dataset contains more than 500 pages, the identification accuracy based on DSTCRFs is more than 76.4%. The results show that the accuracy of NER is improved markedly by the method proposed in this article. After about 35% of the training sample documents, the F measure remains basically unchanged, so the number of training examples can be reduced in this study.

CONCLUSION

NER remains a difficult and challenging problem. A practical approach that combines DSTCRFs technology with multi-rules is proposed in this study, which takes the nested structure and content features of NEs into consideration for better recognition. The experimental results reveal that the novel approach can improve the NER accuracy significantly with only a small amount of training data. The results of this study could have considerable impact on NER. It is demonstrated that the DSTCRFs method is adaptable to Chinese NER. Current NER technologies are all based on machine learning, so how to optimize the learning model is an urgent problem. In this way, we can reduce the state transitions and the total number of input symbols in the sequence to improve the operating efficiency of NER. These will be our future work.

ACKNOWLEDGMENT

This research project was supported by the National Science and Technology Pillar Program No. 2007BAH08B02.

REFERENCES

Chen, S.F. and J. Goodman, 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts, Technical Report TR-10-98.
Cohen, W.W. and S. Sarawagi, 2004. Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods. KDD’04, Seattle, Washington, USA.
Chinchor, N., 1988. MUC-7 Named Entity Task Definition (Version 3.5). The 7th Message Understanding Conference, Fairfax, Virginia.
Cohn, T. and P. Blunsom, 2005. Semantic role labeling with tree conditional random fields. Proceedings of the Ninth Conference on Computational Natural Language Learning, (CoNLL'2005), USA, pp: 169-172.
Eason, G., B. Noble and I.N. Sneddon, 1955. On certain integrals of Lipschitz-Hankel type involving products of Bessel functions. Phil. Trans. Roy. Soc. London, 247: 529-551.
Fu, G.H. and K.K. Luke, 2005. Chinese named entity recognition using lexicalized HMMs. ACM SIGKDD Explorat. Newslett., 7(1): 19-25.
Fresko, M., B. Rosenfeld and R. Feldman, 2005. A Hybrid Approach to NER by MEMM and Manual Rules. CIKM’05, Bremen, Germany.

Gao, J.F., A. Wu, M. Li and C.N. Huang, 2006. Chinese word segmentation and named entity recognition: A pragmatic approach. Assoc. Computational Linguistics, 31(4).
Kou, Y., 2008. Improving the accuracy of entity identification through refinement. Proceedings of the 2008 EDBT Ph.D. Workshop, Mar. 2008.
Lim, J.H., Y.S. Hwang, S.Y. Park and H.C. Rim, 2004. Semantic Role Labeling using Maximum Entropy Model. (CoNLL'2004), Retrieved from:
Li, W. and A. McCallum, 2003. Rapid development of Hindi named entity recognition using conditional random fields and feature induction. ACM Trans. Asian Lang. Inf. Proc., 2003: 290-294.
Li, Y., K. Bontcheva and H. Cunningham, 2005. Using uneven-margins SVM and perceptron for information extraction. Proceedings of the 9th Conference on Computational Natural Language Learning, (CoNLL'05), USA, pp: 72-79.
Lafferty, J., A. McCallum and F. Pereira, 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML '01 Proceedings of the 18th International Conference on Machine Learning, USA, pp: 282-289.
Peng, F.C. and A. McCallum, 2004. Accurate Information Extraction from Research Papers using Conditional Random Fields. (HLT-NAACL'2004), pp: 329-336. Retrieved from: 04.pdf.
Peng, F., F.F. Feng and A. McCallum, 2004. Chinese segmentation and new word detection using conditional random fields. Proceedings of the 20th International Conference on Computational Linguistics, COLING 2004.
Roark, B., M. Saraclar, M. Collins and M. Johnson, 2004. Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm. In ACL 2004, Retrieved from: P/P04/P04-1007.pdf.
Sarawagi, S. and W.W. Cohen, 2004. Semi-Markov Conditional Random Fields for Information Extraction. (NIPS'2004), Retrieved from: RF.pdf.
Sun, J., J.F. Gao, L. Zhang, M. Zhou and C.N. Huang, 2002. Chinese named entity identification using class-based language model. Presented at the 19th International Conference on Computational Linguistics.
Toutanova, K., A. Haghighi and C.D. Manning, 2005. Joint learning improves semantic role labeling. Proceedings of the 43rd Annual Meeting of the ACL, pp: 589-596.


Truyen, T.T., D.Q. Phung, H.H. Bui and S. Venkatesh, 2008. Hierarchical semi-Markov conditional random fields for recursive sequential data. Proceeding of Twenty-Second Annual Conference on Neural Information Processing Systems, Dec 2008, Vancouver, Canada.
Uchimoto, K., Q. Ma, M. Murata, H. Ozaku and H. Isahara, 2000. Named entity extraction based on a maximum entropy model and transformation rules. Proceedings of 33rd Annual Meeting of the Association of the Computational Linguistics, (ACL'2000).
Zhu, J., Z.Q. Nie, J.R. Wen, B. Zhang and W.Y. Ma, 2005. 2D Conditional Random Fields for Web information extraction. Proceedings of the 22nd International Conference on Machine Learning, pp: 1044-1051.
Zhu, J., Z.Q. Nie, J.R. Wen, B. Zhang and H.W. Hon, 2007. Webpage understanding: An integrated approach. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 12-15, 2007, San Jose, California, USA, pp: 903-912.
Zhu, J., Z.Q. Nie, B. Zhang and J.R. Wen, 2008. Dynamic hierarchical Markov random fields for integrated web data extraction. J. Mach. Learn. Res., 9: 1583-1614.
Zou, J.Q., G.L. Chen and W.Z. Guo, 2005. Chinese web page classification using noise-tolerant support vector machines. Proceeding of NLP-KE’05, pp: 785-790.