Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
International Symposium on Integrative Bioinformatics 2010
The LAILAPS Search Engine A Feature Model for Relevance Ranking in Life Science Databases
M Lange, K Spies, C Colmsee, S Flemming, M Klapperstück, U Scholz Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Germany
[email protected] IPK Gatersleben, March 22, 2010
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Outline • Information Retrieval in Live Science
• The LAILAPS Approach • Query Engine & Feature Scoring
• Relevance Prediction • The LAILAPS System
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
2
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Life Science Data Universe
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
3
Life Science Data Universe Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Data Domain
Data Domain
Entry#
Literature (PubMed): Molecules (PubChem) DNA-sequences (GenBank):
19*106 62*106 112*106
Pathway (KEGG): Genes (KEGG):
Data Domain Protein sequences(UniProt): Protein functions (OLS, PDB, INTACT, PFAM):
Matthias Lange @ IPK-Gatersleben, Germany
Entry# 0.1*106 5*106
Entry# 11*106 1*106
03/22/2010
4
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Life Science Data Universe chlorophyll synthase
words: 1111 lines: 164
words: 1516 lines: 238
476 entries of chlorophyll synthase on average 1200 words per entry on average 200 lines 571,200 words 95,200 lines 1322 pages A4 Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
5
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Life Science Data Universe - Search
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
6
The Story of Searching Data Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
1. 2.
37% of life science scientists spending over 80% of working time in front of a computer 47% of all scientist make daily use of search engines Divoli A, Hearst MA, Wooldridge MA., Evidence for showing gene/protein name suggestions in bioscience literature search interfaces., Pac Symp Biocomput. 2008:568-79
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
7
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
The LAILAPS Approach
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
8
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Motivation for LAILAPS • platform for information retrieval over isolated databases • content sensitive relevance ranking • user profiles for relevance estimation • estimation of relevance factors by tracking user behavior
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
9
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
LAILPAS Search Engine
hit excerpt & link to data browser
user login
Javascript data feature scores browsing behaviour tracker
relevance rating
phrase and conjunctive multi term queries relevance score
feature vector
synonym & database filter chart type
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
10
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
LAILAPS Feature Model 1. 2. 3. 4. 5. 6. 7. 8. 9.
attribute in the record database in which the hit was found frequency of hits per attribute and record co-occurrence of query terms keyword surrounding the hit position organism that is reported in the record sequence length text position of the hit in the record structure synonym that produced the hit
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
11
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
LAILAPS Ranking Workflow query Q = t1 …
tn
find: using a text index system
chlorophyll synthase
query result R = (t,d,{(a,p)})
score: using features F1 … F 9 relevance ( 1… 9) : using a feed forward neural network
relevance ranking 1… n
('chlorophyll', 'Q9MBA1', {('DE', 83), ('RT',2)} ('synthase', 'Q9MBA1', {('DE', 97)}
entries feature scores ( 1… 9)
scorefreq= 3 score co-ocurence= 1-(distance/maxdist) * {cor: 1; wrong: 0.3} ...
relevance score
Nfeedforward :(neurons = 20, input =9, output =1, edges = 7 x 4) Q9MBA1 = N(68, 27,71,33,80,0,31,0,0) = 0.935
rank: sorting of relevance scores
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
12
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Query Engine: Text Index • decompose text into textual atoms – – – – –
avoid line breaks in tokens do not decompose number groups keep chars in token _-/*+. ignore chars in token ()[]'',:' token delimiter whitspace and =;\t
• stop words (ignore frequent but useless tokens) – standard in natural language • but, his, nor, than, was, by, how, not, that, we, can, however, of, the, were … – more in life sciences • product, locus_tag, db_xref, region, source, DB, organism, CDS, xref, interpro, submission, submitted, pfam, binding, table, embl, geneid, genomic, transl, cdd, sequence, pubmed, taxon, IEA, NIH, genbank, data, here …
• synonyms (from databases Brenda, UniProt) – Synonyms Relations: 67521 – Protein names: 76869
M. Lange, K. Spies @ IPK-Gatersleben
09/12/2008
13
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Relevance Prediction
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
14
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Relevance Prediction: Confidence Classes confidence class
accession
curated rank
lailaps rank
relevant
problem
1
Acc1 Acc2 Acc3
1 2 3
5 2 3
no yes yes
wrong ranking
Acc4
-
4
-
Acc5 Acc6 Acc7 Acc8 Acc9 Acc10 Acc11
4 5 6 7 8 9 10
5 6 7 8 1
yes no yes yes no yes no
2 3
Matthias Lange @ IPK-Gatersleben, Germany
additional hit (e.g. synonm) database version
text decomposition wrong ranking
03/22/2010
15
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Relevance Prediction: Training Data
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
16
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Relevance Prediction: Training Data
• 22 protein queries using data retrieval systems • 1069 manual ranked database records
• 3 quality classes (high, medium, low)
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
17
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Relevance Prediction: Neural Network Training • split of training/validation data: 80/20 • training epochs: 500 • mean square error: 0.33 • type: feed forward • architecture: 7-4
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
18
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
User Ranking Profile: Relevance Rating • Explicit rating • Implicit rating: tracking browsing behavior using JavaScript – clicked result entries – clicked entries above, below – activity time – scroll amount – mouse movement – lost / got focus – text selection
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
19
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Relevance Prediction: Workflow Query Results Manual Rating Training Data
Relevance Prediction
Predicted Rating
Feedback System
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
20
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
The LAILAPS System
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
21
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
LAILAPS – Technical Background Render Engine 3rd-party Retrieval System (e.g. SRS@EBI, in house)
HTML
Result Browser
Search Form
Administration Configuration
RMI
Color Code - Technology:
Text Query
Tapestry / iSpring
HTTP/SQLNet
ORACLE 11g
JAVA 1.5, JOONE, SPRING
Network Communication
DBMS
Backend/Logic
Feedback Tracking
SQL
Color Code – Software Modules: Frontend
Result Ranking
ORACLE 11g Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
22
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
LAILAPS Customization •
database to be indexed •
synonyms •
configuration
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
23
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
LAILAPS Benchmark
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
24
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010 Precision in %
Benchmark of Neuronal Network Ranking
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
25
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
Conclusion 1.
LAILAPS as search engine for life science dabases
2.
„objective subjectiveness“ – learning of human relevance classes
3.
cost-reduction of scientific information retrieval
# database records
estimated effort in person day
- 50
0,5
51 – 100
1
101 - 250
>1
4.
> 250 user relevance profiles objective rating hardly not possible
5.
deterministic quality of database research
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
26
Nature Precedings : doi:10.1038/npre.2010.4337.1 : Posted 7 Apr 2010
LAILAPS Project: http://lailaps.ipk-gatersleben.de
Contact:
[email protected]
Acknowledgement Jinbo Chen (IPK Gatersleben) Mandy Weißbach (IPK Gatersleben) Gregor Haberhauer (BASF) Michael Leps (BASF Plant Science) Jens Stein (BASF Plant Science) Röbbe Wünschiers (FH Mitweidae)
Matthias Lange @ IPK-Gatersleben, Germany
03/22/2010
27