Data Linkage Graph: computation, querying and knowledge discovery

0 downloads 0 Views 3MB Size Report
22 Mar 2010 - phrase and conjunctive multi term queries hit excerpt & link to data browser feature vector relevance score synonym & database filter relevance.
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

International Symposium on Integrative Bioinformatics 2010

The LAILAPS Search Engine A Feature Model for Relevance Ranking in Life Science Databases

M Lange, K Spies, C Colmsee, S Flemming, M Klapperstück, U Scholz Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Germany

[email protected] IPK Gatersleben, March 22, 2010

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Outline • Information Retrieval in Live Science

• The LAILAPS Approach • Query Engine & Feature Scoring

• Relevance Prediction • The LAILAPS System

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

2

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Life Science Data Universe

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

3

Life Science Data Universe Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Data Domain

Data Domain

Entry#

Literature (PubMed): Molecules (PubChem) DNA-sequences (GenBank):

19*106 62*106 112*106

Pathway (KEGG): Genes (KEGG):

Data Domain Protein sequences(UniProt): Protein functions (OLS, PDB, INTACT, PFAM):

Matthias Lange @ IPK-Gatersleben, Germany

Entry# 0.1*106 5*106

Entry# 11*106 1*106

03/22/2010

4

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Life Science Data Universe chlorophyll synthase

words: 1111 lines: 164

words: 1516 lines: 238

476 entries of chlorophyll synthase on average 1200 words per entry on average 200 lines 571,200 words 95,200 lines 1322 pages A4 Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

5

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Life Science Data Universe - Search

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

6

The Story of Searching Data Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

1. 2.

37% of life science scientists spending over 80% of working time in front of a computer 47% of all scientist make daily use of search engines Divoli A, Hearst MA, Wooldridge MA., Evidence for showing gene/protein name suggestions in bioscience literature search interfaces., Pac Symp Biocomput. 2008:568-79

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

7

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

The LAILAPS Approach

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

8

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Motivation for LAILAPS • platform for information retrieval over isolated databases • content sensitive relevance ranking • user profiles for relevance estimation • estimation of relevance factors by tracking user behavior

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

9

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

LAILPAS Search Engine

hit excerpt & link to data browser

user login

Javascript data feature scores browsing behaviour tracker

relevance rating

phrase and conjunctive multi term queries relevance score

feature vector

synonym & database filter chart type

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

10

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

LAILAPS Feature Model 1. 2. 3. 4. 5. 6. 7. 8. 9.

attribute in the record database in which the hit was found frequency of hits per attribute and record co-occurrence of query terms keyword surrounding the hit position organism that is reported in the record sequence length text position of the hit in the record structure synonym that produced the hit

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

11

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

LAILAPS Ranking Workflow query Q = t1 …

tn

find: using a text index system

chlorophyll synthase

query result R = (t,d,{(a,p)})

score: using features F1 … F 9 relevance ( 1… 9) : using a feed forward neural network

relevance ranking 1… n

('chlorophyll', 'Q9MBA1', {('DE', 83), ('RT',2)} ('synthase', 'Q9MBA1', {('DE', 97)}

entries feature scores ( 1… 9)

scorefreq= 3 score co-ocurence= 1-(distance/maxdist) * {cor: 1; wrong: 0.3} ...

relevance score

Nfeedforward :(neurons = 20, input =9, output =1, edges = 7 x 4) Q9MBA1 = N(68, 27,71,33,80,0,31,0,0) = 0.935

rank: sorting of relevance scores

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

12

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Query Engine: Text Index • decompose text into textual atoms – – – – –

avoid line breaks in tokens do not decompose number groups keep chars in token _-/*+. ignore chars in token ()[]'',:' token delimiter whitspace and =;\t

• stop words (ignore frequent but useless tokens) – standard in natural language • but, his, nor, than, was, by, how, not, that, we, can, however, of, the, were … – more in life sciences • product, locus_tag, db_xref, region, source, DB, organism, CDS, xref, interpro, submission, submitted, pfam, binding, table, embl, geneid, genomic, transl, cdd, sequence, pubmed, taxon, IEA, NIH, genbank, data, here …

• synonyms (from databases Brenda, UniProt) – Synonyms Relations: 67521 – Protein names: 76869

M. Lange, K. Spies @ IPK-Gatersleben

09/12/2008

13

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Relevance Prediction

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

14

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Relevance Prediction: Confidence Classes confidence class

accession

curated rank

lailaps rank

relevant

problem

1

Acc1 Acc2 Acc3

1 2 3

5 2 3

no yes yes

wrong ranking

Acc4

-

4

-

Acc5 Acc6 Acc7 Acc8 Acc9 Acc10 Acc11

4 5 6 7 8 9 10

5 6 7 8 1

yes no yes yes no yes no

2 3

Matthias Lange @ IPK-Gatersleben, Germany

additional hit (e.g. synonm) database version

text decomposition wrong ranking

03/22/2010

15

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Relevance Prediction: Training Data

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

16

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Relevance Prediction: Training Data

• 22 protein queries using data retrieval systems • 1069 manual ranked database records

• 3 quality classes (high, medium, low)

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

17

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Relevance Prediction: Neural Network Training • split of training/validation data: 80/20 • training epochs: 500 • mean square error: 0.33 • type: feed forward • architecture: 7-4

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

18

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

User Ranking Profile: Relevance Rating • Explicit rating • Implicit rating: tracking browsing behavior using JavaScript – clicked result entries – clicked entries above, below – activity time – scroll amount – mouse movement – lost / got focus – text selection

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

19

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Relevance Prediction: Workflow Query Results Manual Rating Training Data

Relevance Prediction

Predicted Rating

Feedback System

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

20

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

The LAILAPS System

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

21

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

LAILAPS – Technical Background Render Engine 3rd-party Retrieval System (e.g. SRS@EBI, in house)

HTML

Result Browser

Search Form

Administration Configuration

RMI

Color Code - Technology:

Text Query

Tapestry / iSpring

HTTP/SQLNet

ORACLE 11g

JAVA 1.5, JOONE, SPRING

Network Communication

DBMS

Backend/Logic

Feedback Tracking

SQL

Color Code – Software Modules: Frontend

Result Ranking

ORACLE 11g Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

22

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

LAILAPS Customization •

database to be indexed •

synonyms •

configuration

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

23

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

LAILAPS Benchmark

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

24

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010 Precision in %

Benchmark of Neuronal Network Ranking

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

25

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

Conclusion 1.

LAILAPS as search engine for life science dabases

2.

„objective subjectiveness“ – learning of human relevance classes

3.

cost-reduction of scientific information retrieval

# database records

estimated effort in person day

- 50

0,5

51 – 100

1

101 - 250

>1

4.

> 250 user relevance profiles objective rating hardly not possible

5.

deterministic quality of database research

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

26

Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010

LAILAPS Project: http://lailaps.ipk-gatersleben.de

Contact: [email protected]

Acknowledgement Jinbo Chen (IPK Gatersleben) Mandy Weißbach (IPK Gatersleben) Gregor Haberhauer (BASF) Michael Leps (BASF Plant Science) Jens Stein (BASF Plant Science) Röbbe Wünschiers (FH Mitweidae)

Matthias Lange @ IPK-Gatersleben, Germany

03/22/2010

27

Suggest Documents