Computer Curation of Patents & Scientific Literature
Information Analytics: Transforming Information Into Value
Moneyball Medicine
Stephen K. Boyer, Ph.D.
[email protected] | 408-858-5544
The Problem: All content and no discovery?
The Question
Can we use computers to "read" documents, identify critical entities, and perform meaningful associations that can help us with our work?
For example: patents and scientific papers contain molecular data in many different forms:
- As text: chemical names found in the text of documents
- As bitmap images: pictures of chemicals found in the document
Massive Computing Environment
Chemical & biological information derived from text analytics:
- Find and compute the 3D structures
- Identify every protein
- Identify every disease
- Identify every Medline MeSH code
- Identify every occurrence of every biomarker
- Compute properties and find relationships
Equivalent to 240K simultaneous Google searches; results flow into a data warehouse.
Computer Curation Process Overview (architecture diagram; services hosted at IBM Almaden)
- Data sources: U.S. Patents (1976-2009), U.S. Pre-Grants (all), PCT & EPO applications, Medline abstracts (>18M), selected Internet content, in-house content
- Annotation Factory: Annotator 1 and Annotator 2 parse & extract data into a BIW database
- ChemVerse db: extracted data plus computed metadata, feeding computational analytics, eClassifier & other data associations (semantic associations), and the ADU* IP database (e.g., DB2)
- User applications: view selected documents & reports, Cognos/DDQB/other apps, ChemAxon search, KNIME or Pipeline Pilot
* ADU = Automated Data Update
SIMPLE
Current Activities …
Current activity: "in line" entity tagging & classification
1. Identify chemical names in the text.
2. Convert chemical names to chemical structures (SMILES), then convert these into InChIs and InChIKeys.
3. Replace each chemical name with the term "inchikey_<the unique InChIKey>" for that chemical.
4. Re-index the InChIKeys with SOLR.
Example: aspirin = CC(=O)OC1=CC=CC=C1C(=O)O; its InChIKey is BSYNRYMUTXBXSQ-UHFFFAOYSA-N, so "aspirin" in the annotated text becomes "inchikey_BSYNRYMUTXBXSQ-UHFFFAOYSA-N".
Entities tagged in the annotated text: Chemical, Target, Disease, Assay data. Results are stored in a database and a SOLR index.
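A minimal sketch of the name-to-InChIKey substitution step, assuming RDKit is available; the name-to-SMILES lookup is a stub dictionary standing in for the production chemical-name annotator:

# Sketch only: replace recognized chemical names with inchikey_ tokens.
# NAME_TO_SMILES is an illustrative stand-in for the real lexicon.
import re
from rdkit import Chem

NAME_TO_SMILES = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}

def inchikey_for(smiles: str) -> str:
    # SMILES -> molecule -> InChIKey, as in step 2 above
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToInchiKey(mol)

def annotate(text: str) -> str:
    # Step 3: swap each chemical name for its unique inchikey_ token
    for name, smiles in NAME_TO_SMILES.items():
        token = "inchikey_" + inchikey_for(smiles)
        text = re.sub(re.escape(name), token, text, flags=re.IGNORECASE)
    return text

print(annotate("Patients received aspirin daily."))
# -> Patients received inchikey_BSYNRYMUTXBXSQ-UHFFFAOYSA-N daily.

The annotated text, with every synonym collapsed to one canonical token per structure, is what gets re-indexed with SOLR in step 4.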
Current activity: "in line" entity tagging & classification
In-line text tagging (classification) coupled with computational & experimental data. A tagged compound, e.g. Chemical_"inchikey BSYNRYMUTXBXSQ-UHFFFAOYSA-N", is linked to [target_gene name] entries drawn from three kinds of compound-target associations:
- Associations known from the SEA computations
- Associations known from the literature
- Associations known from NIH experimental HTS assay data
Each compound in the text is thereby connected to its targets (Target 1 … Target 5); tagged entity types include Chemical, Target, Disease, and Assay data, all stored in a database.
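A hypothetical sketch of merging the three association sources while keeping provenance per edge; the gene names and the per-source tables are illustrative, not the deck's actual data:

# Merge compound-target associations keyed by InChIKey, recording
# which source(s) contributed each (compound, target) edge.
from collections import defaultdict

sea_hits   = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"PTGS1", "PTGS2"}}
literature = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"PTGS2"}}
nih_hts    = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"AKR1C1"}}  # illustrative

def merge(*sources):
    edges = defaultdict(set)  # (inchikey, target) -> {source names}
    for name, assoc in sources:
        for ik, targets in assoc.items():
            for t in targets:
                edges[(ik, t)].add(name)
    return edges

merged = merge(("SEA", sea_hits), ("literature", literature),
               ("NIH HTS", nih_hts))
for (ik, target), provenance in merged.items():
    print(ik, target, sorted(provenance))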
Chemical Structures vs. Signs and Symptoms
Medline co-occurrence of statin structures vs. MeSH Signs & Symptoms (C23).
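A sketch of the co-occurrence count presumably behind such a matrix, under the assumption that abstracts have already been tokenized with inchikey_ tokens; the statin token and C23 terms below are illustrative stand-ins:

# Count abstracts in which a statin token and a C23 term co-occur.
from collections import Counter
from itertools import product

abstracts = [
    "inchikey_XUKUURHRXDUEBC-KAYWLYCHSA-N lowered LDL; myalgia reported",
    "patients on inchikey_XUKUURHRXDUEBC-KAYWLYCHSA-N had no fatigue",
]
statins = ["inchikey_XUKUURHRXDUEBC-KAYWLYCHSA-N"]  # illustrative token
c23_terms = ["myalgia", "fatigue"]                  # sample C23 headings

cooc = Counter()
for doc in abstracts:
    text = doc.lower()
    for chem, term in product(statins, c23_terms):
        if chem.lower() in text and term in text:
            cooc[(chem, term)] += 1
print(cooc)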
Search
- Chemical search using ChemAxon w/ DB2
- Proximal search
- Nearest-neighbor search
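The deck does nearest-neighbor chemical search with ChemAxon over DB2; as an open-source stand-in, here is a minimal sketch of the same idea using RDKit Morgan fingerprints ranked by Tanimoto similarity (library contents are illustrative):

# Rank a small library by fingerprint similarity to a query structure.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

library = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "salicylic acid": "OC(=O)c1ccccc1O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

def fp(smiles):
    # Morgan (circular) fingerprint, radius 2, 2048 bits
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), 2, nBits=2048)

query = fp("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as the query
hits = sorted(
    ((DataStructs.TanimotoSimilarity(query, fp(s)), name)
     for name, s in library.items()),
    reverse=True)
for score, name in hits:
    print(f"{name}: {score:.2f}")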
Clustering
Discovery
Claims Originality
BioTerm Analysis
Landscape Analysis
Visualization
Networks
Computer Curation Process Overview (recap of the architecture diagram shown earlier)
Backup
Molecular Attributes
Molecules have various attributes derived from different sources:
- IP attributes (Orange Book): legal status, assignee, foreign filings, expiration date
- Spectral attributes (NIST dB): IR spectra, NMR, mass spec, etc.
- Physical attributes (computational): MW, MF, bp, mp, etc.
- Screening attributes (PubChem, WOMBAT): activity, pharm data, target data for SAR, literature references
- Drug attributes (DrugBank): activity, pharm data, protein binding, half-life
- Toxicity attributes (EPA databases): toxicity studies, LD50, literature references
Cross-mapping attributes from different sources: semantic association of attributes
A trusted structure database anchors each molecule and links its identifiers: SMILES, InChI_id, code name, trade name, app_id. Sources mapped through it include the Orange Book, PubChem, DrugBank, the FDA, Medline (Database A), a toxicity database (Database C), and the Internet; associated attributes include activity, target, binding site, tox, geo, pathway, IP status, country, certifications, licensing, and location.
Workflow ("The Tank"): an input list of SMILES (Data Source 1, Schema 1) or an input list of attributes (Data Source 2, Schema 2) goes in; an output file listing the matching attributes, or the matching SMILES, comes out.
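A hypothetical sketch of the cross-mapping join: resolve each input SMILES to an InChIKey via the trusted structure table, then merge attributes from the source tables on that shared key. Table contents and column names below are illustrative assumptions:

# Join attributes from several sources on a shared InChIKey.
trusted = {"CC(=O)OC1=CC=CC=C1C(=O)O": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}
orange_book = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"ip_status": "expired"}}
pubchem = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"activity": "COX inhibitor"}}
drugbank = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"protein_binding": "high"}}

def cross_map(smiles_list):
    out = {}
    for smi in smiles_list:
        key = trusted.get(smi)      # trusted structure database lookup
        if key is None:
            continue                # unmapped structures are skipped
        merged = {}
        for source in (orange_book, pubchem, drugbank):
            merged.update(source.get(key, {}))
        out[smi] = merged
    return out

print(cross_map(["CC(=O)OC1=CC=CC=C1C(=O)O"]))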
Watson
IBM's Massively Parallel Probabilistic Architecture
Watson generates and scores many hypotheses using an extensible collection of natural language processing, machine learning, and reasoning algorithms. These gather and weigh evidence over both unstructured and structured content to determine the answer with the best confidence.
DeepQA pipeline (diagram): Question → Question/Topic Analysis → Query Decomposition → Primary Search over the answer sources → Candidate Answer Generation (Hypothesis Generation, run in parallel for each candidate) → Soft Filtering → Hypothesis & Evidence Scoring (Supporting Evidence Retrieval over the evidence sources, Answer Scoring, Deep Evidence Scoring) → Synthesis → Final Merging & Ranking with trained models → Answer, Confidence.
Source: J. Kreulen
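An illustrative skeleton (not IBM's code) of the generate-and-score flow the diagram describes: produce candidate answers, gather and score evidence for each, then merge into a ranked answer/confidence list. Every function and score below is a hypothetical stub:

# Toy generate-and-score pipeline in the shape of the DeepQA diagram.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    answer: str
    evidence_scores: list = field(default_factory=list)

def generate_candidates(question):
    # Primary search + candidate answer generation (stubbed)
    return [Hypothesis("aspirin"), Hypothesis("ibuprofen")]

def score_evidence(hyp, question):
    # Supporting evidence retrieval + deep evidence scoring (stubbed):
    # each scorer contributes one number per piece of evidence
    hyp.evidence_scores = [0.9, 0.7] if hyp.answer == "aspirin" else [0.4]

def merge_and_rank(hypotheses):
    # Final merging & ranking; a trained model would replace this mean
    ranked = [(h.answer, sum(h.evidence_scores) / len(h.evidence_scores))
              for h in hypotheses]
    return sorted(ranked, key=lambda t: t[1], reverse=True)

question = "Which drug acetylates COX enzymes?"
cands = generate_candidates(question)
for h in cands:
    score_evidence(h, question)
print(merge_and_rank(cands))  # [('aspirin', 0.8), ('ibuprofen', 0.4)]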
Technical issues to consider when applying QA systems like Watson
- Nature of the domain (open vs. closed): A closed domain implies all knowledge is contained within a specific domain characterized by ontologies, with no need to go outside it. Jeopardy! is an open-domain example, drawing on general knowledge.
- Knowledge/data sources (availability): QA systems are natural-language search engines, and Watson goes beyond NL search. If knowledge sources are incomplete, unavailable, insufficient, or inadequate, the system cannot provide an answer. In some cases one would need interactive QA, in which human interaction guides the search. Another very important consideration is the availability of sufficient sample data for training (i.e., a training corpus).
- Need for multi-modality: Is speech-to-text transcription required before a question can be answered? That would require integrating speech-to-text capabilities that are not yet ready for real-time applications.
Watson infrastructure
DeepQA application (Java/C++) on Apache Hadoop + Apache UIMA, running on SUSE Linux Enterprise Server 11.
- 90 Power 750 servers, each with a 3.5 GHz POWER7 8-core processor running 4 threads/core
- Total: 2,880 POWER7 cores with 16 TB RAM
- Processing speed: 500 GB/sec; 80 teraFLOPS
- 94th on the Top 500 supercomputers list
Note: this hardware is for Jeopardy!; any other application of Watson will require appropriate sizing and optimization for purpose.
- Latency: Watson can process 500 GB of information per second, responds to questions within about 3 seconds, and keeps most of its knowledge sources in memory (as opposed to on disk) for speed. What is the latency requirement for the application?
- Multi-lingual or cross-lingual support: Watson supports only English at this time; with language-specific parsers, other languages could be supported. If knowledge sources or QA are required in multiple languages, the application is not a good candidate. Additionally, if cultural context has to be accommodated in the answers, it would not be prudent to deploy a QA system that interacts directly with users.
- Question type: Decomposition and classification of the question are critical to how QA systems work. The bulk of the question types in Jeopardy! were factoid questions. Watson did not include two question categories: audio/video questions, which require watching a video to answer, and questions that require special instructions (e.g., verbal instructions explaining a question).
- Answer types: Watson is not designed to curate a task-oriented system. It can handle temporal and geospatial reasoning in its answers, but as it stands it cannot handle business-process reasoning (to do task B, tasks A and C must be completed, etc.).
I would like to acknowledge the IBM Almaden Research team:
Jeff Kreulen, Ying Chen, Scott Spangler, Alfredo Alba, Tom Griffin, Eric Louie, Su Yan, Issic Cheng, Prasad Ramachandran, Bin He, Ana Lelescu, Brian Langston, Qi He, Linda Kato, Brad Wade, John Colino, Meenakshi Nagarajan, Timothy J. Bethea, German Attanasio, Cassidy Kelly, Jack Labrie, Fredrick Eduardo, Ionia Stanoi, plus a host of folks from IBM China Labs.