All documents in the test collections also have attributed IPC symbols, so there is no blind data. Each document includes a title, a list of inventors, and a list of applicants.
Chapter 13 EXPERIMENT WITH A HIERARCHICAL TEXT CATEGORIZATION METHOD ON WIPO PATENT COLLECTIONS Domonkos Tikk, György Biró, and Jae Dong Yang
1. INTRODUCTION
! ! " # $ % &
' ! (
)
!
! * + ,
-../ 0 1 -... 0 1 -..2 0 # -... "
! 1 3
Domonkos Tikk, György Biró, Jae Dong Yang
4554 "+ -...
6
+ -../ ) # 0 7'' 8
8 !
) 8 ! )8 1 )8 ( 1 )
8 (
)8
9
4554 7 - )8 ! 4 )8
:5
! ! 2 ; &
+ )8 : 8 8
1 )8 ( 1 )8 ( " 1 )8 ( % 4554 4552
7'' '' ' 1 )8 (
2 %* , ? =555 )8 0 1 )8 (
--5555 > , ? ! )8 1 )8 ( 1 )8 (
9
4552
Experiment with HITEC on WIPO patent collections
(
++ *@ 4552 0 ++ 4552
&
3 4 A 9"> 6
B) " 3 2 1 )8 (
3 : 2.
THE CLASSIFIER A9"> A 9 " )
A 9">
+ + A 9"> ++ 4552 9
! B + A 9"> A 9">
! 1
A 9">
!
1 C 9 1
!
)
1 !
D ! ! 3 + 9 - 3 4 4 9 ! + +
Figure 1. The flowchart of the training algorithm of HITEC. Flowchart labels: Training documents; Preprocessing (removal of function words, stemming, dimensionality reduction, term indexing); Classifier; Topic descriptors; classification outcomes (Correct, Misclassified, Category not found); Raise weight of co-occurring terms in topic descriptors; Lower the weight of co-occurring terms in topic descriptors; Performance check (Q-measure) with outcomes Acceptable and To be improved.
3 4 - 3 4 4 ! 2.1. Notations ? ! ! ) C
+ + ++ 4552
? D D ) !d ∈ D 1 C d ∈ D Train d ∈ DTest DTrain ∩DTest = ∅ DTrain ∪DTest = ! D ! ! " ! dj ∈ D 1 Y 9
- topic(dj ) = {c1 , . . . , cq ∈ C} Y dj
cq ! * d
1
dj 4 dj = (w1j , . . . , w|T |j ), T DTrain ≤ w k kj ≤ 1 * 0 d 6 T 1 2 : 3 × , %
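-./2 7
\[ w_{kj} = o_{kj} \cdot \log \frac{N}{n_k} \]
entropy weighting:
\[ w_{kj} = \log(o_{kj} + 1) \cdot \left( 1 + \frac{1}{\log N} \sum_{i=1}^{N} \left[ \frac{f_{ki}}{n_k} \log \frac{f_{ki}}{n_k} \right] \right) \]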
B 0 k dj nk kj o 0 k N = |D | Train 4
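The two term-weighting schemes given above (tf×idf and entropy weighting) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names and the toy documents are hypothetical, and two readings are assumptions: n_k is taken as the number of training documents containing term k for tf×idf, and as the total number of occurrences of term k in the collection for the entropy factor (the usual log-entropy convention). Any scaling of the weights into [0, 1] is left out.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """tf x idf: w_kj = o_kj * log(N / n_k), n_k = documents containing term k."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]            # o_kj for every document
    doc_freq = Counter(t for c in counts for t in c)   # n_k (assumed: document count)
    return [{t: o * math.log(n_docs / doc_freq[t]) for t, o in c.items()}
            for c in counts]


def entropy_weights(docs):
    """Entropy weighting; needs at least two documents so that log N != 0."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]            # f_ki for every document
    totals = Counter()                                 # n_k (assumed: total count)
    for c in counts:
        totals.update(c)
    global_factor = {}
    for t, n_k in totals.items():
        s = sum((c[t] / n_k) * math.log(c[t] / n_k) for c in counts if c[t] > 0)
        global_factor[t] = 1.0 + s / math.log(n_docs)
    return [{t: math.log(o + 1) * global_factor[t] for t, o in c.items()}
            for c in counts]


# Hypothetical toy collection: two tokenized documents.
docs = [["pump", "valve", "pump"], ["pump", "rotor"]]
tfidf = tfidf_weights(docs)
entropy = entropy_weights(docs)
```

In this toy collection the term "pump" occurs in every document, so its tf×idf weight is zero and its entropy weight is small, while "valve", which occurs in one document only, keeps its full local weight.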
\[ \mathrm{descr}(c_i) = \langle v_{1i}, \ldots, v_{|T|i} \rangle, \quad c_i \in C, \quad 0 \le v_{ki} \le 1 \]
2.2. Classification and training
2.2.1. Classification
1 4 d ∈ D d = d !
! + 9 +
+ ! ! ? ! 7 m . , cm ∈ C 1 , . . dj c dj descr(c1 ), . . . , descr(cm ) 1
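\[ \mathrm{conf}(d_j, \mathrm{descr}(c_i)) = f\left( \sum_{k=1}^{|T|} w_{kj} \cdot v_{ki} \right), \]
where
\[ f : \mathbb{R} \to [0, 1], \qquad \lim_{x \to 0} f(x) = 0, \qquad \lim_{x \to \infty} f(x) = 1. \]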
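As a rough illustration of the confidence formula, the following Python sketch scores a document against the topic descriptors of a set of sibling categories c_1, ..., c_m and picks the highest-scoring one. The text only constrains f by its limits, so the concrete choice f(x) = 1 - exp(-x) is an assumption, as are the function names, the sparse dictionary representation, and the example weights.

```python
import math


def squash(x):
    """Assumed f: 1 - exp(-x) for x >= 0, so f(0) = 0 and f(x) -> 1 as x grows."""
    return 1.0 - math.exp(-max(x, 0.0))


def conf(doc, descr):
    """conf(d_j, descr(c_i)) = f(sum_k w_kj * v_ki) over sparse {term: weight} dicts."""
    s = sum(w * descr.get(t, 0.0) for t, w in doc.items())
    return squash(s)


def best_category(doc, descriptors):
    """Return the category whose descriptor gives the highest confidence."""
    return max(descriptors, key=lambda c: conf(doc, descriptors[c]))


# Hypothetical descriptors for two sibling categories and one document vector.
descriptors = {
    "F01": {"engine": 1.0, "valve": 0.5},
    "A61": {"drug": 1.0},
}
doc = {"engine": 0.8, "valve": 0.6}
winner = best_category(doc, descriptors)
```

A sparse dictionary keeps the sum restricted to terms that actually occur in the document, which matters when |T| is large.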
3 A 9"> [ ! ) dj
3 -7 Z
m
3 4 7 3 7 dj cbest
3 2 7 )
cbest
0 3 -
,
,
-../ )
+ ! conf min ∈ [0, 1] 3 - 3 4 "
δ(c)
! /
descr(c)
descr(c) − δ(c)
- 1 c conf req (c) = 0 δ(c) dj c 4 1 c conf (c) = 1 req δ(c) c dj
3 2 \ 3 - 3 4
3 : ) ! 0 3 -D 3 2 1 ! + ! ? (n) δ (v ki ) ! (0) n δ (vki ) = 0
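\[ \delta^{(n+1)}(c_i) = \langle \delta^{(n+1)}(v_{1i}), \ldots, \delta^{(n+1)}(v_{Ti}) \rangle, \]
\[ \delta^{(n+1)}(v_{ki}) = \alpha \cdot \left( \mathrm{conf}_{req} - \mathrm{conf}(d_j, \mathrm{descr}(c_i)) \right) \cdot w_{kj} + \delta^{(n)}(v_{ki}) \cdot \beta, \qquad 1 \le k \le T. \]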
& β ∈ [0, 1] α β
1
0.05..0.2 ! ) 3 5 ) 3 -
δ (0) (vki ) = 0
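The update rule for δ, started from δ^(0)(v_ki) = 0, can be sketched in Python as follows. This is only an illustration of the recurrence: the function name, the dictionary representation, and the default values of α and β are assumptions, with α picked inside the 0.05..0.2 range mentioned in the text and β inside [0, 1]. How δ is then applied to the topic descriptor is not shown.

```python
def update_delta(delta_prev, doc, conf_req, conf_now, alpha=0.1, beta=0.5):
    """One step of delta^(n+1)(v_ki) = alpha * (conf_req - conf) * w_kj + beta * delta^(n)(v_ki).

    delta_prev and doc are sparse {term: value} dicts; missing terms count as 0,
    which also covers the initial condition delta^(0)(v_ki) = 0.
    """
    error = conf_req - conf_now
    terms = set(delta_prev) | set(doc)
    return {t: alpha * error * doc.get(t, 0.0) + beta * delta_prev.get(t, 0.0)
            for t in terms}


# Hypothetical document with a single weighted term.
doc = {"engine": 0.8}

# Required confidence 1, current confidence 0.4: the correction is positive.
step1 = update_delta({}, doc, conf_req=1.0, conf_now=0.4)
step2 = update_delta(step1, doc, conf_req=1.0, conf_now=0.4)

# Required confidence 0, current confidence 0.4: the correction is negative.
lowered = update_delta({}, doc, conf_req=0.0, conf_now=0.4)
```

The β term carries part of the previous correction into the next step, so repeated errors in the same direction accumulate (compare step1 and step2), while the sign of the correction follows conf_req.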
δ(c)
! .
) 3 : 1
)
! ! 1
++ 4552 ! 7 d #( d) Q(d) = #(
Q
1
. d) 7 ·
d)
1 + #(
Q(d)
Q=
P
Q(d) |DTrain |
d∈DTrain
-5
!Q \ 6 -. , ? ! ;
9
4554 9
4552 3.2. Performance measures 1 1 )8 (
9
4554 ?
)8 ! Z !
9 47
- C 7 ! )8 ] ^ 9 4 4
2 ! )8 ) + )
2
1 !
)8 )8 9 4 ) 9
4554 )8
3.3. Dimensionality reduction 1
| |T
; \
Figure 2. Explanation to the three evaluation measures Top, Top 3, Any [Fall et al., 2002]
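The three evaluation measures can be expressed compactly in code. The sketch below follows the definitions of Fall et al. as understood here: Top compares the best prediction with the main IPC code, Top3 accepts a match between the main IPC code and any of the three best predictions, and Any compares the best prediction with all IPC codes attributed to the document. The function name and the example codes are hypothetical.

```python
def evaluate(ranked, main_code, all_codes):
    """Return the three success flags for one document.

    ranked    -- predicted categories, best first
    main_code -- main IPC code of the document
    all_codes -- set of all IPC codes attributed to the document (main and secondary)
    """
    return {
        "Top": ranked[0] == main_code,
        "Top3": main_code in ranked[:3],
        "Any": ranked[0] in all_codes,
    }


# Hypothetical document: main code B02, secondary code A01.
result = evaluate(["A01", "B02", "C03"], "B02", {"B02", "A01"})
```

In this example the best guess misses the main code, so Top fails, but Top3 and Any both succeed.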
0 3 4554 * |T | |T | |T |
! k
3 -..< 0 1 1
4554 ) ++ 4552 -l ; \ & ! 1 |T | minoccur D * Train nk /|DTrain | ≥ maxfreq ! ! minoccur ∈ [1 .. 10] max .. 1.0] ; \ freq ∈ [0.05
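The frequency-based reduction of the term set controlled by minoccur and maxfreq can be sketched as below. The reading used here is an assumption: a term is dropped when it occurs fewer than minoccur times in the training collection, or when the fraction of training documents containing it, n_k/|D_Train|, reaches maxfreq. The function name and the toy data are hypothetical.

```python
from collections import Counter


def select_terms(docs, min_occur=2, max_freq=0.25):
    """Keep terms with at least min_occur occurrences and document frequency below max_freq."""
    n_docs = len(docs)
    occurrences = Counter(t for doc in docs for t in doc)     # total occurrences
    doc_freq = Counter(t for doc in docs for t in set(doc))   # n_k: documents containing t
    return {t for t in occurrences
            if occurrences[t] >= min_occur and doc_freq[t] / n_docs < max_freq}


# Hypothetical toy collection of four tokenized documents.
docs = [["a", "b"], ["a", "c"], ["a", "c"], ["a", "d"]]
kept = select_terms(docs, min_occur=2, max_freq=0.75)
```

Here "a" is removed as too frequent, "b" and "d" as too rare, and only "c" survives.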
( > , ? ! " 3.4. Results 1 B) " 1 )8 (
! 9 2D , ? !
m n 0 : 0 minoccur 9 2 m n = 2
max = 0.25 freq - ! 1 !
Figure 3. Setting [iptac]. (a) Precision by confidence levels (axes: confidence level, correct guess (%)). (b) Comparisons of precisions, extrapolated to 100% recall (axes: % of correct assignments relative to recalled documents, % of documents recalled). Curves in both panels: Top, Top3 and Any at class, subclass and main group level.
m n 0 ! m n 3 9 :
Figure 4. Setting [ipta] with entropy weighting and minoccur = 2 and maxfreq = 0.25. (a) Precision by confidence levels (axes: confidence level, correct guess (%)). (b) Comparisons of precisions, extrapolated to 100% recall (axes: % of correct assignments relative to recalled documents, % of documents recalled). Curves in both panels: Top, Top3 and Any at class, subclass and main group level.
1
9
4554 +
9 =
Figure 5. Setting: only main categories used for training. (a) Precision by confidence levels (axes: confidence level, correct guess (%)). (b) Comparisons of precisions, extrapolated to 100% recall (axes: % of correct assignments relative to recalled documents, % of documents recalled). Curves in both panels: Top, Top3 and Any at class, subclass and main group level.
1 2 ×
25l 3 × 9 Z maxvar
0.5 0.01 + !
+ ! +
Figure 6. Use of tf×idf weighting. (a) Precision by confidence levels (axes: confidence level, correct guess (%)). (b) Comparisons of precisions, extrapolated to 100% recall (axes: % of correct assignments relative to recalled documents, % of documents recalled). Curves in both panels: Top, Top3 and Any at class, subclass and main group level.
! ! 2l 3 9 < - - 9
4552 )8 1 9
4552 5 5 ! &
Figure 7. Using semantic information. (a) Precision by confidence levels (axes: confidence level, correct guess (%)). (b) Comparisons of precisions, extrapolated to 100% recall (axes: % of correct assignments relative to recalled documents, % of documents recalled). Curves in both panels: Top, Top3 and Any at class, subclass and main group level.
( B ) " 9
4552 -5 : -l )8 1 9
4552 ) 2
Summary of results on WIPO-alpha patent collection (abbreviation of method names: NB = Naïve Bayes, SVM = Support Vector Machine, k-NN = k Nearest Neighbors)
WaLSbLNPIQ oGNNPQR }rwKIQOh SGaGS T GLcbMG KSh j hj KShi j h chKShij hj T hRh j hj EI` P`NL hj i he eh eh P`NLK h he ehy eh T LPQ i hy he hy e i hi j j NtuO h eh h eeh i h cGT LQNPK hy he ehe sGcN IO hjj yhjj | }r~ `L`GM do
o
uPGMGQKG yyhy yehe i EI`e P`NL h i he hjjj h P`NLK hy he h h T LPQ ehy he j h h j NtuO hy hyy h ehj cGT LQNPK hy h h h sGcN IO hjj i hjj k x | }r~ `L`GM uPGMGQKG y j hy yh fQg P`NL eh he ii h h P`NLK ehyi h hi he T LPQ yhe h h yhy NtuO i hii j he j hy ji hyj cGT LQNPK h eh he h sGcN IO ehjj hjj | }r~ `L`GM o
uPGMGQKG yeh yhe
Table 1.
1 2
! ( 2
7 < × < Y
)8 2 B ) " !
/. : -l !
*
+ B + Z:l ! 4 %
1 )8 ( " B ) "
B) "
! + Table 2.
Summary of results on WIPO-de patent collection
WaLSbLNPIQ }rwKIQOh SGaGS TGLcbMG KSLcc j hj KSLcc j h cbsKSLcc j hj T LPQ RMIb` j hj EI` hji h he ehe EI`e hy hjj hy he fQg hj h h j h
? + & ( 8 ? ( 3 4 %B %* \ ,
+ 6 :5 < 4× ) ! 4.
CONCLUSION 1 B ) " !
" %
1 )8 ( )8 )8
& 8 !
3
& ! !
)8 1 )8 ( 9
4554 B) "
Acknowledgments +
k 3 " 9 k ( 3"9 % Y \ 5=4552555 --./Z5
REFERENCES
Aas, K. and Eikvil, L. (1999). Text categorisation: A survey. Raport NR 941, Norwegian Computing Center.
Baker, L. D. and McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proc. of the 21st Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), pages 96–103, Melbourne, Australia.
Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. (1998). Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7(3):163–178.
Fall, C. J., Törcsvári, A., Benzineb, K., and Karetka, G. (2003a). Automated categorization in the international patent classification. ACM SIGIR Forum archive, 37(1):10–25.
Fall, C. J., Törcsvári, A., Fiévet, P., and Karetka, G. (2003b). Additional readme information for WIPO-de autocategorization data set. http://www.wipo.int/ibis/datasets/wipo-de-readme.html.
Fall, C. J., Törcsvári, A., and Karetka, G. (2002). Readme information for WIPO-alpha autocategorization training set. http://www.wipo.int/ibis/datasets/wipo-alpha-readme.html.
Koller, D. and Sahami, M. (1997). Hierarchically classifying documents using a very few words. In International Conference on Machine Learning, volume 14, San Mateo, CA. Morgan-Kaufmann.
McCallum, A., Rosenfeld, R., Mitchell, T., and Ng, A. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Proc. of ICML-98. http://www-2.cs.cmu.edu/∼mccallum/papers/hier-icml98.ps.gz.
Salton, G. and McGill, M. J. (1983). An Introduction to Modern Information Retrieval. McGraw-Hill.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
Tikk, D. and Biró, Gy. (2003). Experiments with multilabel text classifier on the Reuters collection. In International Conference on Computational Cybernetics (ICCC'03), pages 33–38, Siófok, Hungary.
Tikk, D., Yang, J. D., and Bang, S. L. (2003). Hierarchical text categorization using fuzzy relational thesaurus. Kybernetika, 39(5):583–600.
van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, London, 2nd edition. http://www.dcs.gla.ac.uk/Keith.
Weiss, S. M., Apté, C., Damerau, F. J., Johnson, D. E., Oles, F. J., Goetz, T., and Hampp, T. (1999). Maximizing text-mining performance. IEEE Intelligent Systems, 14(4).
Wibowo, W. and Williams, H. E. (2002). Simple and accurate feature selection for hierarchical categorisation. In Proc. of the 2002 ACM Symposium on Document Engineering, pages 111–118, McLean, Virginia, USA.
Wiener, E., Pedersen, J. O., and Weigend, A. S. (1993). A neural network approach to topic spotting. In Proc. of the 4th Annual Symposium on Document Analysis and Information Retrieval, pages 22–34.
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1–2):69–90. http://citeseer.nj.nec.com/yang97evaluation.html.