All documents in test collections also have attributed IPC symbols, so there is no blind data. Each document includes a title, a list of inventors, a list of applicant.
Chapter 13

EXPERIMENT WITH A HIERARCHICAL TEXT CATEGORIZATION METHOD ON WIPO PATENT COLLECTIONS

Domonkos Tikk, György Biró, and Jae Dong Yang

1. INTRODUCTION

                                                                                               

  

              !          !   "                 # $  %                 &                                                                     

      '             !       (

                                 )                           

      !    

                      

!  * +  , 

  -../ 0 1    -... 0 1     -..2 0 #  -...  "           

             !         1                   3


  4554                     "+  -...                            

                               6               

                                +     -../  )        # 0  7''                 8      

                 8              !                 

           )    8   !   )8                1 )8 ( 1   ) 

  8   (          

  

              )8                      

    9

   4554 7 - )8                !    4 )8     

        :5  

           !              !       2 ;            &            

  

           +      )8     : 8                8                 

    

        1 )8 (            1 )8 (  "   1 )8 ( %          4554    4552      

               7''   ''  '    1 )8 (     

   2 %*     , ?         =555                )8 0 1 )8 (   

 

      --5555 > , ?     !        )8   1 )8 (                    1 )8 (  

     9

   4552 

Experiment with HITEC on WIPO patent collections


(              

               

   

     ++  *@  4552 0 ++    4552         

       &                     

   3   4     A 9">    6    

    B) "   3   2         1 )8 ( 

            3   :  2.

THE CLASSIFIER A9"> A 9  "                                 ) 

        

                       A 9">   

                   +    +                                     A 9">         ++    4552  9  

      

   !      B    +        A 9">                     A 9">           

    !                                                                    1           

        

                      

                      A 9">                

    !                  

                 1   C        9                                          1  

    !                     

          )                          

       1        !           


        D                                 !         !  3   +    9 -          3   4 4           9         !  +         +    Training documents

[Figure 1, flowchart labels: Preprocessing (removal of function words, stemming, dimensionality reduction, term indexing); Classifier; Topic descriptors; classification outcomes: Correct, Misclassified, Category not found; Raise weight of co-occurring terms in topic descriptors; Lower the weight of co-occurring terms in topic descriptors; Performance check (Q-measure): Acceptable / To be improved]

Figure 1. The flowchart of the training algorithm of UFEX

          

   3   4 -                 3   4 4     !           2.1. Notations ?    !  !             ) C                   

  

     +      + ++    4552 


?                   D D         )         !d ∈  D         1       C d ∈ D Train           d ∈ DTest DTrain ∩DTest = ∅ DTrain ∪DTest =                 !  D               !                       !    "       !          dj ∈ D        1      Y                                                         9  

 - topic(dj ) = {c1 , . . . , cq ∈ C}                         Y dj                  

            cq                         !  *                          d             

 

         1 

               dj             4 dj = (w1j , . . . , w|T |j ),                     T             DTrain ≤ w  k kj ≤ 1  *            0     d                 6                                      T 1         2     :   3   × , %

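\[ w_{kj} = o_{kj} \cdot \log\frac{N}{n_k} \qquad (3) \]

\[ w_{kj} = \log(o_{kj} + 1) \left( 1 + \frac{1}{\log N} \sum_{i=1}^{N} \frac{f_{ki}}{n_k} \log\frac{f_{ki}}{n_k} \right) \qquad (4) \]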

B          0      k dj nk kj  o           0   k N = |D | Train   4          
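The two weighting schemes of equations (3) and (4) can be illustrated with a short Python sketch. It is a minimal reading of the formulas, not the authors' implementation. The function names and the token-list input format are hypothetical. In equation (3), n_k is taken to be the number of training documents containing term k. In equation (4), the sketch assumes that f_ki is the number of occurrences of term k in document i and that n_k is the total number of occurrences of term k in the collection, so that f_ki/n_k sums to one over the documents; the definitions in the text are not legible enough to confirm this.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Equation (3): w_kj = o_kj * log(N / n_k).

    docs is a list of token lists; n_k is read here as the number of
    documents that contain term k (an assumption).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {t: o * math.log(n_docs / doc_freq[t]) for t, o in Counter(doc).items()}
        for doc in docs
    ]


def entropy_weights(docs):
    """Equation (4): w_kj = log(o_kj + 1) * (1 + (1/log N) * sum_i p_ki log p_ki).

    p_ki = f_ki / n_k, with n_k read as the total occurrences of term k
    (an assumption). Requires at least two documents so that log N != 0.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)

    global_factor = {}
    for term, total in totals.items():
        s = 0.0
        for c in counts:
            f = c.get(term, 0)
            if f:
                p = f / total
                s += p * math.log(p)
        global_factor[term] = 1.0 + s / math.log(n_docs)

    return [
        {t: math.log(o + 1) * global_factor[t] for t, o in c.items()}
        for c in counts
    ]


docs = [["a", "b", "a"], ["b", "c"]]
tfidf = tfidf_weights(docs)
entropy = entropy_weights(docs)
```

Under these assumptions a term spread evenly over all documents (here "b") receives weight zero in both schemes, while a term concentrated in one document keeps its full local weight.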


\[ \mathrm{descr}(c_i) = \langle v_{1i}, \ldots, v_{|T|i} \rangle, \quad c_i \in C \qquad (5) \]

        

   1i ≤ 1     0 ≤ v               

    0         2.2.

Classification and training

2.2.1. Classification

1               4 d ∈ D d         =          d            !      

         !        +           9                                +    

             +              !                                               !   ?                       !       7 m    . , cm ∈ C 1 , . .                dj c    dj             descr(c1 ), . . . , descr(cm )         1 

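\[ \mathrm{conf}(d_j, \mathrm{descr}(c_i)) = f\left( \sum_{k=1}^{|T|} w_{kj} \cdot v_{ki} \right), \qquad (6) \]

where $f : \mathbb{R} \to [0, 1]$ with $\lim_{x \to 0} f(x) = 0$ and $\lim_{x \to \infty} f(x) = 1$.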
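Equation (6) amounts to a dot product between the document vector and a topic descriptor, passed through a squashing function f. The sketch below shows one way to compute it on sparse vectors. The concrete f used here (x / (1 + x) for positive x, zero otherwise) is only a hypothetical choice that satisfies the two limits stated for f; the function names and the sample IPC-like codes are illustrative and not taken from the text.

```python
def squash(x):
    """Hypothetical f: maps 0 to 0 and tends to 1 as x grows."""
    return x / (1.0 + x) if x > 0 else 0.0


def conf(doc_weights, descriptor, f=squash):
    """Equation (6): f applied to sum_k w_kj * v_ki.

    Both vectors are sparse dicts mapping a term to its weight; terms
    missing from the descriptor contribute zero.
    """
    dot = sum(w * descriptor.get(term, 0.0) for term, w in doc_weights.items())
    return f(dot)


def best_category(doc_weights, descriptors, f=squash):
    """Return the category whose descriptor gives the highest confidence."""
    return max(descriptors, key=lambda c: conf(doc_weights, descriptors[c], f))


doc = {"engine": 0.5, "valve": 0.5}
descriptors = {
    "F01": {"engine": 1.0, "valve": 1.0},
    "A61": {"drug": 1.0},
}
winner = best_category(doc, descriptors)
```

Keeping f as a parameter makes it easy to swap in whichever squashing function the classifier actually uses.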

   

        3           A 9"> [  !               )           dj

3  -7     Z   

m

         

3  4 7 3                   7  dj cbest


3  2 7 )

cbest

       0   3  -

, 

 , 

    -../                              )  

     +       !                                                  conf min ∈ [0, 1]   3  -           3  4 "      

δ(c)

!  /

descr(c)

    

descr(c) − δ(c)



- 1            c conf req (c) = 0 δ(c)               dj             c 4 1           c conf (c) = 1 req                δ(c)  c             dj

3  2 \   3  -  3  4   

         3  : )          !   0    3  -D 3  2  1                                   !    +                !  ? (n) δ (v  ki )      !         (0) n δ (vki ) = 0 

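For $1 \le k \le T$:

\[ \delta^{(n+1)}(c_i) = \langle \delta^{(n+1)}(v_{1i}), \ldots, \delta^{(n+1)}(v_{Ti}) \rangle \]

\[ \delta^{(n+1)}(v_{ki}) = \alpha \cdot \left( \mathrm{conf}_{req} - \mathrm{conf}(d_j, \mathrm{descr}(c_i)) \right) \cdot w_{kj} + \delta^{(n)}(v_{ki}) \cdot \beta \]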

        &         β ∈ [0, 1] α β       
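The update formula above combines a learning-rate term with a momentum-like term. A minimal Python sketch of one update step follows, reading the document weight in the formula as w_kj (term k of document d_j). The function name, the sparse-dict representation and the default values of alpha and beta are assumptions made for illustration; the text only states that beta lies in [0, 1].

```python
def update_delta(prev_delta, doc_weights, conf_value, conf_req,
                 alpha=0.1, beta=0.5):
    """One step of delta^(n+1)(v_ki) for every term k.

    prev_delta  : dict term -> delta^(n)(v_ki); empty dict for delta^(0) = 0
    doc_weights : dict term -> w_kj for the training document d_j
    conf_value  : conf(d_j, descr(c_i)) computed before the update
    conf_req    : required confidence for category c_i
    """
    terms = set(prev_delta) | set(doc_weights)
    return {
        t: alpha * (conf_req - conf_value) * doc_weights.get(t, 0.0)
        + beta * prev_delta.get(t, 0.0)
        for t in terms
    }


doc = {"pump": 0.5}
step1 = update_delta({}, doc, conf_value=0.2, conf_req=1.0)
step2 = update_delta(step1, doc, conf_value=0.6, conf_req=1.0)
```

With these inputs the correction is positive while the confidence stays below the required value, and the beta term carries part of the previous step forward.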

                1           

               

     0.05..0.2                                             !                )      3  5 ) 3  -     

δ (0) (vki ) = 0

δ(c)

  

 !  .

) 3  :             1


                         )          

                                        !         !           1   

                  ++    4552            !       7 d        #( d)    Q(d) = #(

Q

1

       . d)           7 ·

 

d)

1 + #(

Q(d)

\[ Q = \frac{\sum_{d \in D_{Train}} Q(d)}{|D_{Train}|} \qquad (10) \]

            

         !Q         \ 6    -. , ? !   ;          

         9

   4554  9

   4552  3.2. Performance measures 1                             1 )8 ( 

   9

   4554  ?  

               )8               !   Z    !  

    9 47

-    C 7           !          )8        ] ^  9 4  4

   2           !          )8     )                                                +     )             

            

2 

    1           !   

                 )8        )8           9 4  )                          9

   4554             )8                  

            3.3. Dimensionality reduction 1           

                             |    |T                             

            ; \ 

Figure 2. Explanation to the three evaluation measures Top, Top 3, Any (Fall et al., 2002)
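The three evaluation measures named in the Figure 2 caption can be sketched in Python as follows. The definitions used here follow our understanding of Fall et al. and should be treated as an assumption, since the description in this text is not legible: Top compares the first predicted category with the main IPC category, Top 3 checks whether the main category is among the first three predictions, and Any compares the first prediction with all categories assigned to the document. All function names and sample codes are hypothetical.

```python
def hit_top(ranked, main_category):
    """Top: the first prediction equals the main category."""
    return bool(ranked) and ranked[0] == main_category


def hit_top3(ranked, main_category):
    """Top 3: the main category is among the first three predictions."""
    return main_category in ranked[:3]


def hit_any(ranked, main_category, additional=()):
    """Any: the first prediction equals the main or any additional category."""
    return bool(ranked) and ranked[0] in {main_category, *additional}


def score(samples):
    """samples: list of (ranked predictions, main category, additional categories).

    Returns the percentage of documents counted as correct per measure.
    """
    n = len(samples)
    return {
        "Top": 100.0 * sum(hit_top(r, m) for r, m, _ in samples) / n,
        "Top3": 100.0 * sum(hit_top3(r, m) for r, m, _ in samples) / n,
        "Any": 100.0 * sum(hit_any(r, m, a) for r, m, a in samples) / n,
    }


samples = [
    (["A61K", "C07D", "A61P"], "C07D", ["A61K"]),
    (["H04L"], "H04L", []),
]
result = score(samples)
```

Under this reading, Top is the strictest of the three measures, so its score can never exceed the Top 3 or Any score on the same predictions.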

    0 3  4554  *     |T | |T |  |T |                      

      !                             k

  3  -..< 0 1    1 

   4554 )        ++    4552                     -l               ; \    &     !  1            |T | minoccur                     D  *        Train    nk /|DTrain | ≥ maxfreq   !    !                                         min occur ∈ [1 .. 10]  max .. 1.0]  ; \ freq ∈ [0.05      
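The two thresholds that survive in this passage, minoccur and maxfreq, suggest a simple vocabulary filter. The sketch below is one plausible reading, stated as an assumption: a term is dropped if it occurs fewer than minoccur times in the training collection, or if the fraction of training documents containing it, n_k / |D_Train|, is at least maxfreq. Whether minoccur counts occurrences or documents cannot be confirmed from the text. The function name and input format are hypothetical.

```python
from collections import Counter


def select_terms(docs, min_occur=2, max_freq=0.25):
    """Return the set of terms kept after frequency-based filtering.

    docs is a list of token lists. A term is kept when it occurs at least
    min_occur times overall (assumed meaning of minoccur) and appears in
    less than a max_freq fraction of the documents.
    """
    n_docs = len(docs)
    occurrences = Counter(term for doc in docs for term in doc)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term
        for term in occurrences
        if occurrences[term] >= min_occur and doc_freq[term] / n_docs < max_freq
    }


docs = [
    ["the", "pump"],
    ["the", "valve"],
    ["the", "pump"],
    ["the", "gear"],
]
kept = select_terms(docs, min_occur=2, max_freq=0.75)
```

In this toy collection "the" is removed as too frequent, "valve" and "gear" as too rare, and only "pump" remains.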

   

     

 (       > , ? !                             "                                    3.4. Results 1        B) "   1 )8 (  

          ! 9 2D , ? !      

                      m n  0      :   0 minoccur   9 2         m n  = 2

max = 0.25  freq  -   !       1        !   

[Figure 3, two plots. (a) x-axis: confidence level (0 to 1); y-axis: correct guess (%). (b) x-axis: % of correct assignments relative to recalled documents; y-axis: % of documents recalled. Series in both plots: Top, Top3 and Any at class, subclass and main group level.]

Figure 3. Setting [iptac]. a) Precision by confidence levels. b) Comparisons of precisions, extrapolated to 100% recall

             m n 0       !            m n   3 9 :  100 Top (class)

[Figure 4, two plots. (a) x-axis: confidence level (0 to 1); y-axis: correct guess (%). (b) x-axis: % of correct assignments relative to recalled documents; y-axis: % of documents recalled. Series in both plots: Top, Top3 and Any at class, subclass and main group level.]

Setting [ipta] with entropy weighting and minoccur = 2 and maxfreq = 0.25. a) Precision by confidence levels. b) Comparisons of precisions, extrapolated to 100% recall

   9

   4554              +                         

                      9 =

Figure 4.


[Figure 5, two plots. (a) x-axis: confidence level (0 to 1); y-axis: correct guess (%). (b) x-axis: % of correct assignments relative to recalled documents; y-axis: % of documents recalled. Series in both plots: Top, Top3 and Any at class, subclass and main group level.]

Figure 5. Setting: only main categories used for training. a) Precision by confidence levels. b) Comparisons of precisions, extrapolated to 100% recall

1             2    ×                              

        25l                 3 × 9 Z                maxvar

                  0.5 0.01         +            !                      

   +       !   + 

[Figure 6, two plots. (a) x-axis: confidence level (0 to 1); y-axis: correct guess (%). (b) x-axis: % of correct assignments relative to recalled documents; y-axis: % of documents recalled. Series in both plots: Top, Top3 and Any at class, subclass and main group level.]

Figure 6. Use of tf×idf weighting. a) Precision by confidence levels. b) Comparisons of precisions, extrapolated to 100% recall

                                                               !              !                   2l  3 9 <    -   -           9

   4552   )8                               1        9

   4552   5 5  !      &        

[Figure 7, two plots. (a) x-axis: confidence level (0 to 1); y-axis: correct guess (%). (b) x-axis: % of correct assignments relative to recalled documents; y-axis: % of documents recalled. Series in both plots: Top, Top3 and Any at class, subclass and main group level.]

Figure 7. Using semantic information. a) Precision by confidence levels. b) Comparisons of precisions, extrapolated to 100% recall

   

  (    B ) "                 9

   4552    -5 : -l                   )8     1                9

   4552        )           2    

                   

                     

Table 1. Summary of results on WIPO-alpha patent collection (abbreviation of method names: NB – Naïve Bayes, SVM – Support Vector Machine, k-NN – k Nearest Neighbors)

Evaluation  Setting              IPC/conf. level
measure                          cl./0.0          cl./0.8   s.cl./0.0      m.g./0.0
Top         ipta                 65.75            92.93     53.?5          36.89
            iptac                65.5?            9?.93     53.14          36.78
            main                 62.81            84.37     49.41          32.28
            tfidf                64.?4            83.56     5?.76          33.75
            semantic             66.41            72.86     54.63          38.38
            best of WIPO paper   55.00 (NB, SVM)  –         41.00 (SVM)    –
            difference           11.41            –         13.63          –
Top3        ipta                 85.56            92.93     75.0?          55.44
            iptac                85.61            9?.93     75.??          55.58
            main                 83.61            84.37     70.89          48.98
            tfidf                7?.71            84.11     56.57          37.05
            semantic             89.41            76.45     79.48          59.64
            best of WIPO paper   79.00 (NB)       –         62.00 (k-NN)   –
            difference           10.41            –         17.48          –
Any         ipta                 73.68            95.83     62.45          46.46
            iptac                73.41            95.64     6?.28          46.38
            main                 71.3?            94.97     58.97          41.51
            tfidf                72.22            90.38     60.18          4?.71
            semantic             76.46            93.48     66.36          5?.9?
            best of WIPO paper   63.00 (NB)       –         48.00 (SVM)    –
            difference           13.46            –         18.36          –

                               1                        2            

    !                               (                    2           

 7  < ×  <        Y  

                     )8      2    B ) "      !     

 

         /. : -l              !                   

            

    *    


                   

       +  B                      +     Z:l            !    4           %    

   1 )8 (                                             "                        B ) "    

               B) "      

        !   +  Table 2.

Summary of results on WIPO-de patent collection

Evaluation  IPC/conf. level
measure     class/0.0  class/0.8  subclass/0.0  main group/0.0
Top         65.02      86.95      55.37         37.93
Top3        87.14      89.00      77.61         57.34
Any         75.04      96.95      66.88         50.79

?      +    &         (              8  ? ( 3  4 %B     %* \ ,          

   +              6  :5                  <        4×        )                  !  4.

CONCLUSION 1   B ) "          !    

      "   %    

    1 )8 (   )8     )8    

                &         8   !                       

    3                                 

           &         !               !              

     )8          1 )8 ( 9

   4554  B) "                


Acknowledgments   +  

   k  3  " 9    k ( 3"9  % Y  \ 5=4552555 --./Z5 


REFERENCES

Aas, L. and Eikvil, L. (1999). Text categorisation: A survey. Raport NR 941, Norwegian Computing Center.

Baker, K. D. and McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proc. of the 21st Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), pages 96–103, Melbourne, Australia.

Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. (1998). Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7(3):163–178.

Fall, C. J., Törcsvári, A., Benzineb, K., and Karetka, G. (2003a). Automated categorization in the international patent classification. ACM SIGIR Forum archive, 37(1):10–25.

Fall, C. J., Törcsvári, A., Fievét, P., and Karetka, G. (2003b). Additional readme information for WIPO-de autocategorization data set. http://www.wipo.int/ibis/datasets/wipo-de-readme.html.

Fall, C. J., Törcsvári, A., and Karetka, G. (2002). Readme information for WIPO-alpha autocategorization training set. http://www.wipo.int/ibis/datasets/wipo-alpha-readme.html.

Koller, D. and Sahami, M. (1997). Hierarchically classifying documents using a very few words. In International Conference on Machine Learning, volume 14, San Mateo, CA. Morgan-Kaufmann.

McCallum, A., Rosenfeld, R., Mitchell, T., and Ng, A. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Proc. of ICML-98. http://www-2.cs.cmu.edu/∼mccallum/papers/hier-icml98.ps.gz.

Salton, G. and McGill, M. J. (1983). An Introduction to Modern Information Retrieval. McGraw-Hill.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Tikk, D. and Biró, G. (2003). Experiments with multilabel text classifier on the Reuters collection. In International Conference on Computational Cybernetics (ICCC 03), pages 33–38, Siófok, Hungary.

Tikk, D., Yang, J. D., and Bang, S. L. (2003). Hierarchical text categorization using fuzzy relational thesaurus. Kybernetika, 39(5):583–600.

van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, London, 2nd edition. http://www.dcs.gla.ac.uk/Keith.

Weiss, S. M., Apte, C., Damerau, F. J., Johnson, D. E., Oles, F. J., Goetz, T., and Hampp, T. (1999). Maximizing text-mining performance. IEEE Intelligent Systems, 14(4):2–8.

Wibowo, W. and Williams, H. E. (2002). Simple and accurate feature selection for hierarchical categorisation. In Proc. of the 2002 ACM Symposium on Document Engineering, pages 111–118, McLean, Virginia, USA.

Wiener, E., Pedersen, J. O., and Weigend, A. S. (1993). A neural network approach to topic spotting. In Proc. of the 4th Annual Symposium on Document Analysis and Information Retrieval, pages 22–34.

Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1–2):69–90. http://citeseer.nj.nec.com/yang97evaluation.html.
