
Procedia Information Technology & Computer Science 00 (2013) 000-000

3rd World Conference on Information Technology 2012

 

Specialized Information Retrieval in the Context of the Chemical Textile Domain

Carolina PRIETO (a, 1), Javi FERNÁNDEZ (b), Elena LLORET (b) and Manuel PALOMAR (b)

(a) AITEX, Technological Textile Institute, Plaza Emilio Sala Nº1, Alcoy 03801, Spain
(b) University of Alicante, Alicante, Spain

 

Abstract

Given the large quantity of information available, it is impossible to keep up with it daily without taking advantage of Natural Language Processing tools. This article analyzes the use of two domain-specific Information Retrieval systems applied to the Chemical Textile domain. The aim is to study whether search engines and virtual observatory systems are appropriate for retrieving specific information in this domain. To this end, we develop a specialized search engine, study it together with an existing alert system (a virtual observatory system), and compare both with two widespread general-purpose alert systems, Google Alerts and Yahoo! Alerts. The results show that the specialized search engine is the most appropriate tool for professionals because it retrieves information that is of interest for the studied domain. Moreover, several limitations encountered with the chosen systems are discussed, together with possible solutions for further work.

Keywords: Information Retrieval, Alert Systems, Chemical Textile domain, Search engine, Virtual Observatory

Selection and/or peer review under responsibility of Prof. Dr. Dogan Ibrahim.
©2012 Academic World Education & Research Center. All rights reserved.

 

(1) * ADDRESS FOR CORRESPONDENCE: Carolina Prieto Ferrero, AITEX, Technological Textile Institute, Plaza Emilio Sala 1, 03801 Alcoy, Spain. E-mail address: [email protected] / Tel.: +0034-96-554-2200

1. Introduction and motivation

With the rapid growth of the Web, professionals are often faced with large quantities of information and find it difficult to search the Web for what is relevant and useful.

To process this amount of information, we can take advantage of Natural Language Processing (NLP) tools, which allow us to retrieve, extract, classify and summarize the information that is useful for our domain. Information Retrieval (IR) is one of the areas within NLP. In particular, professionals working in the Chemical Textile domain, who deal with a lot of information every day, could use such tools to increase their performance in the workplace.

To the best of our knowledge, the Chemical Textile domain has few specialized resources. In previous work [8], we carried out a preliminary evaluation of the appropriateness of general-purpose alert systems for finding the specific information that professionals of this field need for their day-to-day activities. We concluded that this type of general system has a number of limitations when applied to specific domains. Therefore, in this paper we focus on specialized resources (a domain-specific alert system and a specialized crawler) in order to analyze whether they are more useful and help to solve some of the limitations encountered.

Besides general-purpose search engines and IR systems, the literature contains several works that address the retrieval of relevant information while focusing on a very specific domain.
Related to this, we find BioPatentMiner [7], a system that facilitates IR from biomedical patents, and MedSearch [5], a specialized medical Web search engine, which uses several specific techniques (e.g., tf.idf) to improve its usability and the quality of its search results. In [10], the information retrieved by a patent IR system in the Chemical domain is evaluated, with the conclusion that domain-specific search engines may be more appropriate for retrieving information of interest when focusing on a specific domain.

Not only IR systems but also different crawling techniques [4] have been developed to deal with the restricted domain problem. Focused web crawlers identify when a URL or a document is relevant to a specific domain and prioritize and analyze it accordingly, using more advanced techniques such as ontologies [1,6].

The aim of this paper is to analyze to what extent different IR tools are appropriate when the domain of application is very restricted and specific, as in the case of the Chemical Textile domain. In particular, for this study we develop two IR systems and analyze their application in the context of the Chemical Textile domain. These systems are a specialized search engine and a specific alert system; we want to see which of them is more appropriate for finding specific information in this domain.

2. Specialized Information Retrieval Systems for the Chemical Textile domain

In this section we explain the two Information Retrieval systems used for the study of the Chemical Textile domain: a specialized search engine and a Virtual Observatory.

The search engine has been developed to help perform queries about the Chemical Textile domain. An expert in the domain selected a restricted set of web sites to be included in the system. The documents obtained from these web sites are downloaded using the crawler [2] developed by the DLSI department (2) at the University of Alicante (3). The search engine is based on a modified version of Lucene (4). In this version, document terms are analyzed using a stemmer (i.e., Snowball (5)), but both the stem and the original term are indexed. In this way, we can retrieve a larger number of documents while always giving more weight to those containing the original words. The system also favors precision, offering a smaller number of results but with higher reliability. Additional features have been included, such as the prioritization of recent documents, duplicate removal and automatic grouping of the results (clustering) using Carrot2 (6), for faster navigation through the results list.

The Virtual Observatory (7) is an alert system developed at the University of Alicante. As with the search engine for the Chemical Textile domain, a group of experts in the domain selects a set of sources to be checked periodically. These sources are mainly RSS feeds but also include generic web pages. When these sources publish new content, the system extracts the new information and subsequently sends it to the subscribed users as alerts in a daily e-mail. The challenge at this point is to decide which alerts are relevant to which users. First, the experts create a set of categories of interest in the specific domain. Second, they select a set of documents and categorize them using those categories. Then, the system uses these documents as examples and learns how to classify new documents automatically. This learning is performed using Machine Learning techniques, specifically the Weka (8) [3] implementation of the Support Vector Machines algorithm, due to its good performance in text categorization tasks [9].

(2) http://www.dlsi.ua.es/
(3) http://www.ua.es/
(4) http://lucene.apache.org/
(5) http://snowball.tartarus.org/

3. Experiments

To perform the study, an expert in the Chemical Textile domain defined 4 groups of terms of different granularity: generic terms, specific terms, compound terms and multiword expressions applied to the Chemical Textile domain. All the terms were in English, and these groups of terms were chosen because they are relevant to this domain. Most of them are found on legislation websites such as REACH (9), CPSC (10) or OEKO-TEX (11). For our experiments, we chose 6 terms for each group (12).

The evaluation was performed using the previously mentioned tools. The results obtained are compared with the results provided by the general-purpose alert systems for the same terms [8].

The assessment consisted of counting the number of interesting documents each of the systems returned. For this, an expert in the Chemical Textile domain individually evaluated each of the retrieved documents and classified it as interesting or uninteresting for that domain.

Table 1 shows the overall percentages of interesting documents retrieved by the general-purpose alert systems, i.e., Google Alerts and Yahoo! Alerts, compared to the search engine system. These percentages are calculated as the number of retrieved documents classified as interesting for each group with respect to the total number of retrieved documents in the same group.

Table 1. Overall percentage of interesting documents for each group of terms retrieved by the different Information Retrieval systems

                        Google Alerts   Yahoo! Alerts   Search Engine
Generic terms           3.9%            9.7%            71.2%
Specific terms          1.8%            12.3%           77.67%
Compound terms          50%             20.5%           69.47%
Multiword expressions   0%              83.3%           82.40%

(6) http://project.carrot2.org/
(7) http://en.ovtt.org/alerts
(8) http://www.cs.waikato.ac.nz/ml/weka/
(9) http://www.reachinnova.com
(10) http://www.cpsc.gov/about/cpsia/cpsia.html
(11) http://www.oekotex.com
(12) http://intime.dlsi.ua.es/papers/wcit2012091101.html
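The percentages in Table 1 are the ratio of interesting documents to retrieved documents within a group. As a minimal sketch of that calculation, the snippet below recomputes the ratio for the only cases in which the paper reports raw counts: the multiword expressions group for Yahoo! Alerts (5 of 6) and for the search engine (117 of 142), and the Virtual Observatory alerts (15 of 40). The function name `interesting_percentage` and the dictionary layout are illustrative assumptions, not part of the evaluated systems.

```python
# Recompute the "percentage of interesting documents" from raw counts.
# Only counts quoted in the paper are used; all names here are illustrative.

def interesting_percentage(interesting: int, retrieved: int) -> float:
    """Share of retrieved documents judged interesting, as a percentage."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    if not 0 <= interesting <= retrieved:
        raise ValueError("interesting must lie between 0 and retrieved")
    return 100.0 * interesting / retrieved


# (interesting, retrieved) pairs reported in the paper
reported_counts = {
    "Yahoo! Alerts, multiword expressions": (5, 6),
    "Search engine, multiword expressions": (117, 142),
    "Virtual Observatory, all alerts": (15, 40),
}

for system, (interesting, retrieved) in reported_counts.items():
    pct = interesting_percentage(interesting, retrieved)
    print(f"{system}: {interesting}/{retrieved} = {pct:.1f}%")
```

At one decimal place, the first two ratios agree with the 83.3% and 82.40% entries in the multiword expressions row of Table 1. The denominators differ widely (6 documents against 142), so percentages that look similar describe very different volumes of retrieved information.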

In these results we can observe that it is difficult for general-purpose alert systems to retrieve information in a specific domain. Term ambiguity is a frequent problem. For instance, terms such as lead or flame have other meanings and, as a consequence, these systems cannot distinguish which meaning we refer to. Despite this, Yahoo! Alerts, as a generic IR system, is more accurate and has more coverage than Google Alerts.

Regarding the specialized search engine, we notice that it performs better than the general-purpose alert systems. This is because the search engine is domain-specific and is therefore capable of retrieving more interesting information for the Chemical Textile domain. Only for multiword expressions are the results for the search engine lower than those of Yahoo! Alerts. This is due to the fact that Yahoo! Alerts only retrieved 6 documents, 5 of which were interesting. In contrast, our search engine retrieved 142 documents, 117 of which were interesting. As shown, the search engine retrieves much more information than Yahoo! Alerts.

Concerning the Virtual Observatory, the results obtained were lower than expected. The number of alerts retrieved within the studied period of time was very low: a total of 40 alerts, of which only 15 were of interest. We believe that, in this case, it may be necessary to extend the period of time spent analyzing this system in order to obtain more conclusive results.

4. Conclusion and Future Work

In this paper, we developed and studied two specific Information Retrieval systems for the Chemical Textile domain: a specific search engine and a virtual observatory. These systems can be of great help to experts who deal daily with large amounts of information pertaining to this domain. For the experiments, we compared their performance with the results obtained in previous work for general-purpose alert systems that are accessible to any user, using the same terms.

The specialized search engine is a good retrieval system that retrieves interesting information for professionals and users. Moreover, most of the information retrieved by this system is interesting, as the results show. With this system, the ambiguity problem that affects some terms when using generic alert systems disappears.

Concerning the virtual observatory, its results were not very satisfactory, since it did not retrieve a large quantity of information. It may be necessary to extend the period of time spent analyzing this system, in order to see whether it can retrieve more sites, and to analyze their usefulness.

Despite the encouraging results, several limitations have been encountered with regard to the chosen systems; for example, the virtual observatory sometimes retrieves information that is not directly related to the topic of interest.

Therefore, as future work we plan to build a specific ontology for the Chemical Textile domain that can be applied to IR systems to improve the search results. We propose to repeat the analysis with the Virtual Observatory alert system over a longer period of time (6 months), using the ontology of the Chemical Textile domain. By integrating the ontology into this system, we could check whether it is possible to retrieve more specific and relevant information for our domain.

Acknowledgements

This research work has been funded by the Spanish Government through the project TEXT-MESS 2.0 (TIN2009-13391-C04) and by the Valencian Government through projects PROMETEO (PROMETEO/2009/199) and ACOMP/2011/001.

References

[1] Naresh Chauhan, Nisha Pahal, and A K Sharma. Context-Ontology Driven Focused Crawling of Web Documents. Pages 121–124, 2007.
[2] Javi Fernández, J.M. Gómez, and Patricio Martínez-Barco. Evaluación de sistemas de recuperación de información web sobre dominios restringidos. Procesamiento de Lenguaje Natural, 45(0):273–276, 2010.
[3] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA Data Mining Software: An Update, volume 11. 2009.
[4] Maryam Hazman. A Survey of Focused Crawler Approaches. Journal of Global Research in Computer Science, 2012.
[5] Gang Luo, Chunqiang Tang, Hao Yang, and Xing Wei. MedSearch: a specialized search engine for medical information retrieval. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM '08, pages 143–152, New York, NY, USA, 2008. ACM.
[6] Hiep Phuc Luong, Susan Gauch, and Qiang Wang. Ontology-Based Focused Crawling. 2009 International Conference on Information, Process, and Knowledge Management, pages 123–128, February 2009.
[7] Sougata Mukherjea and Bhuvan Bamba. BioPatentMiner: an information retrieval system for biomedical patents. In Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, VLDB '04, pages 1066–1077. VLDB Endowment, 2004.
[8] Carolina Prieto, Elena Lloret, and Manuel Palomar. Análisis de la Calidad de la Información Recuperada por Sistemas de Alertas en el dominio Químico Textil. II Spanish Conference on Information Retrieval, 2012.
[9] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, March 2002.
[10] Jianhan Zhu and John Tait. A proposal for chemical information retrieval evaluation. In Proceedings of the 1st ACM workshop on Patent information retrieval, PaIR '08, pages 15–18, New York, NY, USA, 2008. ACM.