
Computer Curation of Patents & Scientific Literature
Information Analytics: Transforming Information Into Value

Moneyball Medicine

Stephen K. Boyer, Ph.D.
[email protected] | 408-858-5544

The Problem
All content and no discovery?

The Question
Can we use computers to "read" documents, identify critical entities, and perform meaningful associations that can help us with our work?

For Example
Patents and scientific papers contain molecular data in many different forms:
- As text: chemical names found in the text of documents
- As bitmap images: pictures of chemicals found in the document

Massive Computing Environment
Chemical & biological information derived from text analytics:
- Find and compute the 3D structures
- Identify every protein
- Identify every disease
- Identify every Medline MeSH code
- Identify the occurrence of every biomarker
Compute properties and find relationships; load the results into a data warehouse.
Equivalent to 240K simultaneous Google searches.
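A minimal sketch of the fan-out this scale implies: running one annotation pass over many documents in parallel. The `annotate` function below is a hypothetical stand-in for the real text-analytics pass; the deck does not specify its implementation.

```python
# Hedged sketch: parallel annotation of a document collection.
from concurrent.futures import ProcessPoolExecutor

def annotate(doc_text: str) -> dict:
    # Placeholder: a real pass would extract proteins, diseases, MeSH codes,
    # biomarkers, and chemical structures from the text.
    return {"length": len(doc_text)}

if __name__ == "__main__":
    docs = ["patent text ...", "medline abstract ..."] * 4
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(annotate, docs))
    print(len(results), "documents annotated")
```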

Computer Curation Process Overview
Services hosted at IBM Almaden.

[Architecture diagram ("SIMPLE"):
- Data sources: U.S. Patents (1976–2009), U.S. Pre-Grants (all), PCT & EPO applications, Medline abstracts (>18 M), selected Internet content, in-house content
- Annotation Factory: parse & extract data; Annotator 1 and Annotator 2 add computed metadata
- ChemVerse db: classifier & other data associations (semantic associations), computational analytics, BIW, feeding an ADU* IP database (e.g., DB2)
- User applications: Knime or Pipeline Pilot, Cognos/DDQB/other apps, ChemAxon search, viewing of selected documents & reports]

*ADU = Automated Data Update
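To make the Annotation Factory flow concrete, a minimal skeleton is sketched below. The `Document` class and the two annotator functions are hypothetical stand-ins; the deck names only "Annotator 1", "Annotator 2", DB2, and SOLR.

```python
# Sketch of the parse -> annotate -> store flow in the diagram above.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    annotations: dict = field(default_factory=dict)

def annotator_1(doc: Document) -> Document:
    doc.annotations["chemicals"] = []  # e.g., chemical-name spans found in text
    return doc

def annotator_2(doc: Document) -> Document:
    doc.annotations["diseases"] = []  # e.g., disease mentions, MeSH codes
    return doc

def curate(doc: Document) -> Document:
    for annotate in (annotator_1, annotator_2):
        doc = annotate(doc)
    return doc  # next steps: persist to the IP database, index for search

print(curate(Document("US1234567", "Aspirin treats headache.")).annotations)
```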

Current Activities

Current activity: "in line" entity tagging & classification

- Identify chemical names in the text
- Convert chemical names to chemical structures (SMILES), then convert these into InChIs and InChIKeys
- Replace each chemical name with the term "inchikey_<the unique InChIKey>" for that chemical
- Re-index the annotated text by InChIKey with SOLR

Example: aspirin = SMILES CC(=O)OC1=CC=CC=C1C(=O)O = InChIKey BSYNRYMUTXBXSQ-UHFFFAOYSA-N

[Diagram: raw text flows through annotation to annotated text, with in-line tags for Chemical, Target, Disease, and Assay data; annotated text is stored in a database and a SOLR index.]
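A minimal sketch of the name-to-InChIKey rewriting step, using RDKit. The name-to-SMILES dictionary is a hypothetical stand-in for the real name-to-structure annotator, which the deck does not specify.

```python
# Rewrite recognized chemical names as inchikey_<InChIKey> tokens.
import re
from rdkit import Chem

NAME_TO_SMILES = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}  # hypothetical lookup

def tag_chemicals(text: str) -> str:
    for name, smiles in NAME_TO_SMILES.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        key = Chem.MolToInchiKey(mol)  # BSYNRYMUTXBXSQ-UHFFFAOYSA-N for aspirin
        text = re.sub(rf"\b{re.escape(name)}\b", f"inchikey_{key}", text,
                      flags=re.IGNORECASE)
    return text

print(tag_chemicals("Aspirin inhibits COX-1."))
# -> "inchikey_BSYNRYMUTXBXSQ-UHFFFAOYSA-N inhibits COX-1."
```

Indexing the rewritten text in SOLR then makes every mention of a chemical retrievable by one canonical key, regardless of which name the source document used.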

Current activity: "in line" entity tagging & classification (continued)

In-line text tagging (classification) coupled with computational & experimental data. Tags embed both the chemical identity and the target, e.g. = Chemical_"inchikey BSYNRYMUTXBXSQ-UHFFFAOYSA-N" and = [target_gene name].

Compound–target associations come from three sources (see the merge sketch below):
- Known from the SEA computations
- Known from the literature
- Known from NIH experimental HTS assay data

[Diagram: a compound mentioned in text linked to Targets 1–5; entity legend: Chemical, Target, Disease, Assay data; results stored in a database.]
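A small sketch of merging the three association sources onto one (compound, target) key. All records below are hypothetical stand-ins; PTGS1 and PTGS2 are the COX genes aspirin is known to act on.

```python
# Merge compound-target links from three evidence sources.
from collections import defaultdict

sea = {("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "PTGS1")}
literature = {("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "PTGS1"),
              ("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "PTGS2")}
nih_hts = {("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "PTGS2")}

evidence = defaultdict(set)
for label, pairs in [("SEA", sea), ("literature", literature), ("NIH_HTS", nih_hts)]:
    for compound, target in pairs:
        evidence[(compound, target)].add(label)

for (compound, target), sources in sorted(evidence.items()):
    print(compound, target, sorted(sources))  # which sources support each link
```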

Chemical Structures vs. Signs and Symptoms
Medline co-occurrence of statin structures vs. MeSH Signs & Symptoms (C23).
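A hedged sketch of the co-occurrence counting behind this slide: count how often a chemical (by InChIKey) and a MeSH code appear in the same Medline record. The records and MeSH tree codes below are hypothetical; aspirin's InChIKey stands in for the statin structures the slide actually compares.

```python
# Count chemical/MeSH co-occurrences across annotated Medline records.
from collections import Counter
from itertools import product

records = [  # each record: (inchikeys found, MeSH codes assigned)
    ({"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}, {"C23.888"}),
    ({"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}, {"C23.888", "C23.550"}),
]

cooc = Counter()
for inchikeys, mesh_codes in records:
    for chem, code in product(inchikeys, mesh_codes):
        cooc[(chem, code)] += 1

for (chem, code), n in cooc.most_common():
    print(chem, code, n)
```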

Search
- Chemical search using ChemAxon with DB2
- Proximal search
- Nearest-neighbor search
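An illustrative nearest-neighbor chemical search using RDKit fingerprints and Tanimoto similarity; the deck's actual search runs on ChemAxon with DB2, so this is a stand-in for the idea, not that product's API.

```python
# Rank library compounds by structural similarity to a query SMILES.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

library = {  # small hypothetical structure library
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "salicylic acid": "OC(=O)c1ccccc1O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

def nearest(query_smiles: str):
    q = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(query_smiles), 2, 2048)
    scored = []
    for name, smi in library.items():
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, 2048)
        scored.append((DataStructs.TanimotoSimilarity(q, fp), name))
    return sorted(scored, reverse=True)

print(nearest("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin ranks itself first
```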

Clustering / Discovery
- Claims originality
- BioTerm analysis

Landscape Analysis / Visualization
- Networks
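The deck does not name its clustering method; TF-IDF vectors with k-means are a common stand-in for grouping claims or abstracts, sketched here on hypothetical claim texts.

```python
# Illustrative document clustering over patent-claim snippets.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

claims = [
    "A method of inhibiting COX-1 with acetylsalicylic acid.",
    "A statin composition for lowering LDL cholesterol.",
    "A pharmaceutical formulation of acetylsalicylic acid tablets.",
    "Use of a statin to treat hypercholesterolemia.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(claims)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for text, label in zip(claims, labels):
    print(label, text)  # aspirin claims and statin claims separate
```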

Computer Curation Process Overview (repeated)
[Same architecture diagram as shown earlier.]

Backup

Molecular Attributes
Molecules have various attributes, derived from different sources:

- IP attributes (Orange Book): legal status, assignee, foreign filings, expiration date
- Spectral attributes (NIST db): IR spectra, NMR, mass spec, etc.
- Physical attributes (computational): MW, MF, Bp, Mp, etc.
- Screening attributes (PubChem, WOMBAT): activity, pharm data, target data for SRA, literature references
- Drug attributes (DrugBank): activity, pharm data, protein binding, half-life
- Toxicity attributes (EPA databases): toxicity studies, LD50, literature references

Cross-Mapping Attributes from Different Sources
Semantic association of attributes across the Orange Book, PubChem, DrugBank, FDA, Medline, the Internet, and other sources.

[Architecture diagram: a trusted structure database ("The Tank") keys each molecule by SMILES, InChI id, code name, trade name, and app_id. Source-specific schemas contribute attributes such as activity, target, binding site, tox, geo, pathway, IP status, country, certifications, licensing, and location. Given an input list of SMILES, it outputs the mapped attributes; given an input list of attributes, it outputs the matching SMILES.]
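A minimal sketch of the cross-mapping idea: join attribute records from sources with different schemas onto one trusted structure key. All records below are hypothetical, and InChIKey is used here as the join key alongside the SMILES/InChI ids the diagram names.

```python
# Join per-source attribute records onto one trusted structure key.
trusted = {"CC(=O)OC1=CC=CC=C1C(=O)O": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}

orange_book = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"ip_status": "expired"}}
drugbank = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"protein_binding": "high"}}

def attributes_for(smiles_list):
    """Input list of SMILES -> output list of merged attribute records."""
    out = []
    for smi in smiles_list:
        key = trusted.get(smi)
        merged = {"smiles": smi, "inchikey": key}
        for source in (orange_book, drugbank):
            merged.update(source.get(key, {}))
        out.append(merged)
    return out

print(attributes_for(["CC(=O)OC1=CC=CC=C1C(=O)O"]))
```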

Watson

IBM's Massively Parallel Probabilistic Architecture
Watson generates and scores many hypotheses using an extensible collection of natural language processing, machine learning, and reasoning algorithms. These gather and weigh evidence over both unstructured and structured content to determine the answer with the best confidence.

[DeepQA pipeline diagram: Question → Question/Topic Analysis → Query Decomposition → Hypothesis Generation (Primary Search over answer sources, Candidate Answer Generation), run in parallel → Soft Filtering → Hypothesis & Evidence Scoring (Evidence Retrieval and Supporting Evidence Retrieval over evidence sources, Answer Scoring, Deep Evidence Scoring) → Synthesis → Final Merging & Ranking with trained models → Answer, Confidence.]

Source: J. Kreulen
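A schematic of the generate-then-score shape of that pipeline, not IBM's code: the candidate generator, the evidence scorers, and the merger below are all hypothetical stand-ins.

```python
# Generate candidate answers, score each against evidence, rank, and return
# the best answer with a confidence value.
def generate_candidates(question):
    return ["candidate A", "candidate B"]  # would come from primary search

def score(candidate, question):
    # Each scorer weighs one kind of evidence; real weights come from training.
    scorers = [lambda c, q: 0.6, lambda c, q: 0.3]  # dummy evidence scorers
    return sum(s(candidate, question) for s in scorers) / len(scorers)

def answer(question):
    ranked = sorted(
        ((score(c, question), c) for c in generate_candidates(question)),
        reverse=True,
    )
    confidence, best = ranked[0]
    return best, confidence

print(answer("What drug is inchikey_BSYNRYMUTXBXSQ-UHFFFAOYSA-N?"))
```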

Technical Issues to Consider When Applying QA Systems Like Watson

Nature of domain: open vs. closed. A closed domain implies all knowledge is contained within a specific domain characterized by ontologies, with no need to go outside it. Jeopardy is an open-domain example built on general knowledge.

Knowledge/data sources: availability. QA systems are natural-language search engines; Watson goes beyond NL search. If knowledge sources are incomplete, unavailable, insufficient, or inadequate, the system cannot provide an answer. In some cases one would need to envisage interactive QA that requires human interaction to guide the search. Another very important consideration is the availability of sufficient sample data for training (i.e., a training corpus).

Need for multi-modality. Is there a need for speech-to-text transcription before a question is answered? This would require integrating speech-to-text capabilities that are not really ready for real-time applications.

Infrastructure. DeepQA application (Java/C++) on Apache Hadoop + Apache UIMA, running on SUSE Linux Enterprise Server 11. The Watson infrastructure comprises 90 Power 750 servers, each with a 3.5 GHz POWER7 8-core processor (4 threads/core): 2,880 POWER7 cores in total with 16 TB RAM, 500 GB/sec processing speed, and 80 teraFLOPS (94th on the Top 500 supercomputers). Note: this hardware is for Jeopardy; any other application of Watson will require appropriate sizing and optimization for purpose.

Latency. Watson processes 500 GB of information per second with a 3-second response to questions, and keeps most of its knowledge source in memory (as opposed to on disk) for speed. What is the latency requirement for the application?

Multi-lingual or cross-lingual support. Watson supports only English at this time; with language-specific parsers, other languages can be supported. If knowledge sources or QA are required in multiple languages, the application would not be a good candidate. Additionally, if cultural context has to be accommodated in the answer, it would not be prudent to deploy QA systems directly interacting with users.

Question type. Decomposition and classification of the question is critical to how QA systems work. The bulk of the question types in Jeopardy were factoid questions. Watson did not include two question categories: audio/video questions that require watching a clip to answer, and questions that require special instructions (e.g., verbal instructions to explain a question).

Answer types. Watson is not designed to curate a task-oriented system. It can handle temporal and geospatial reasoning in its answers. As it stands, it cannot handle business-process reasoning (to do task B, tasks A and C must be completed, etc.).

I would like to acknowledge the IBM Almaden Research team:

Jeff Kreulen, Ying Chen, Scott Spangler, Alfredo Alba, Tom Griffin, Eric Louie, Su Yan, Issic Cheng, Prasad Ramachandran, Bin He, Ana Lelescu, Brian Langston, Qi He, Linda Kato, Brad Wade, John Colino, Meenakshi Nagarajan, Timothy J. Bethea, German Attanasio, Cassidy Kelly, Jack Labrie, Fredrick Eduardo, Ionia Stanoi, plus a host of folks from IBM China Labs.