Datasets2Tools: Enriching DataMed with Canned Analyses
Denis Torre¹, Alexander Lachmann¹, Maxim Kuleshov¹, Zichen Wang¹, Edward He¹, Caroline D. Monteiro¹, Sherry L. Jenkins¹, Lucila Ohno-Machado², Avi Ma’ayan¹
¹ BD2K-LINCS DCIC, Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
² BioCADDIE, Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093, USA
Datasets2Tools Added Functionality to DataMed and GEO
Abstract
Biomedical data repositories, such as the bioCADDIE DataMed, enable the search and discovery of relevant research data digital objects. At the same time, tools that can operate on such data are indexed by repositories such as Aztec.bio, developed by the BD2KCCC. However, direct associations between datasets and tools are currently not available. Beyond such associations, it would be useful to systematically provide canned bioinformatics analyses for processed datasets on dataset landing pages. Here we present part of a pilot project to create a new type of digital object: a canned analysis of a dataset with a specific online bioinformatics tool. To enable the creation, management and use of this new type of digital object, we are developing the Datasets2Tools platform. Datasets2Tools includes three components: a registry, which is a database of dataset-tool associations; a Google Chrome extension that displays icons linking to canned analyses on the dataset landing pages of data repositories such as DataMed, the LINCS Data Portal (LDP) and the Gene Expression Omnibus (GEO); and a website that enables users to register, browse, search, and grade dataset-tool associations. In the future, we hope that data repositories such as DataMed will adopt the Datasets2Tools protocol, making Datasets2Tools native to the DataMed site so that users do not have to install the Chrome extension to view the icons for the canned analyses. By providing a simple and intuitive platform that links datasets to analysis tools, Datasets2Tools will lower the barrier to entry for many users of DataMed, LINCS LDP, GEO and other biomedical research repositories; the canned analyses will help these users extract more knowledge from such data repositories.
Fig. 2. The Datasets2Tools Chrome extension embeds new functionality in DataMed and GEO. Search results on DataMed or GEO are enriched by the browser extension, which loads toolbars that display stored canned analyses. Canned analyses on search result pages are grouped by computational tool (2A and 2B). Clicking on a tool icon opens a dropdown menu with a searchable table containing links to the canned analyses (2C). Users can obtain more information on a computational tool, such as links to the tool’s homepage and video tutorials, by clicking on the information icon next to the tool name (2D). Metadata can be viewed by hovering over the information icon (2E), or downloaded as a separate file in a variety of formats (XLS, TXT, JSON) by clicking on the download icon. Finally, the extension supports sharing canned analyses through buttons that copy the canned analysis URL or provide a code snippet that embeds icons on other websites, linking to canned analysis result pages (2F).
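The grouping and metadata download described in the caption can be illustrated with a minimal Python sketch. It is not the extension's actual code (the extension front-end is JavaScript); the record fields (`tool`, `description`, `metadata`), the sample values, and the function names are all hypothetical and chosen only for illustration. The sketch groups canned analyses by tool and serializes one analysis's metadata as JSON and as tab-delimited text.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical canned-analysis records; field names and values are
# illustrative assumptions, not the real Datasets2Tools data model.
canned_analyses = [
    {"tool": "Enrichr",
     "description": "Hypothetical analysis of upregulated genes",
     "metadata": {"diff_exp_method": "Characteristic Direction",
                  "direction": "up"}},
    {"tool": "Enrichr",
     "description": "Hypothetical analysis of downregulated genes",
     "metadata": {"diff_exp_method": "Characteristic Direction",
                  "direction": "down"}},
    {"tool": "Clustergrammer",
     "description": "Hypothetical heatmap of the expression matrix",
     "metadata": {"normalization": "none"}},
]


def group_by_tool(records):
    """Return a dict mapping each tool name to its list of records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["tool"]].append(record)
    return dict(grouped)


def metadata_as_json(record):
    """Serialize one record's metadata as a JSON string."""
    return json.dumps(record["metadata"], indent=2, sort_keys=True)


def metadata_as_txt(record):
    """Serialize one record's metadata as tab-delimited tag/value rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["tag", "value"])
    for tag, value in sorted(record["metadata"].items()):
        writer.writerow([tag, value])
    return buffer.getvalue()
```

Calling `group_by_tool(canned_analyses)` yields one entry per tool, which mirrors how the toolbar presents one icon per tool before listing that tool's analyses.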
[Figure 2 image: screenshots of the Datasets2Tools extension toolbar (surviving panel labels: A, C, D, E). Legible text includes the Enrichr tool description ("Enrichr is an easy to use intuitive enrichment analyses web-based tool providing various types of visualization summaries of collective functions of gene lists."); the developer (Ma’ayan Laboratory, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA); links (Homepage, Reference); a search for "PDGF" returning canned analysis entries such as "PDGFR KO vs control, top 100 overexpressed…", "PDGFR KO vs control, top 200 combined ge…" and "PDGFBB 12h vs control, top 100 underexpress…"; and metadata (a sample list ending in GSM671878; control samples: GSM671723, GSM671877; differential expression method: Characteristic Direction).]
Conclusions
Figure 1. Schema of the Datasets2Tools MySQL database. The database contains information about repositories, datasets, computational tools, canned analyses, and associated metadata. The schema supports the insertion of custom metadata tags, allowing accurate annotation of canned analysis objects.
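As a rough illustration of how such a schema could support custom metadata tags, the sketch below builds a small relational layout with a tag/value metadata table. It is only a sketch under stated assumptions: it uses Python's built-in SQLite module rather than MySQL so that it runs without a server, and every table name, column name, accession, and URL in it is hypothetical rather than taken from the actual Datasets2Tools schema.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for MySQL, and all table and
# column names are assumptions made for illustration.
SCHEMA = """
CREATE TABLE repository (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    repository_id INTEGER NOT NULL REFERENCES repository(id),
    accession TEXT NOT NULL UNIQUE
);
CREATE TABLE tool (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE canned_analysis (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES dataset(id),
    tool_id INTEGER NOT NULL REFERENCES tool(id),
    url TEXT NOT NULL,
    description TEXT
);
-- One row per (analysis, tag): new tags need no schema change.
CREATE TABLE analysis_metadata (
    canned_analysis_id INTEGER NOT NULL REFERENCES canned_analysis(id),
    tag TEXT NOT NULL,
    value TEXT NOT NULL,
    PRIMARY KEY (canned_analysis_id, tag)
);
"""


def build_demo_db():
    """Create an in-memory database with one made-up canned analysis."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO repository (id, name) VALUES (1, 'GEO')")
    conn.execute(
        "INSERT INTO dataset (id, repository_id, accession) "
        "VALUES (1, 1, 'GSE00000')"  # placeholder accession
    )
    conn.execute("INSERT INTO tool (id, name) VALUES (1, 'Enrichr')")
    conn.execute(
        "INSERT INTO canned_analysis (id, dataset_id, tool_id, url, description) "
        "VALUES (1, 1, 1, 'https://example.org/analysis/1', "
        "'Hypothetical enrichment analysis')"
    )
    conn.executemany(
        "INSERT INTO analysis_metadata VALUES (?, ?, ?)",
        [(1, "diff_exp_method", "Characteristic Direction"),
         (1, "custom_tag", "custom value")],
    )
    conn.commit()
    return conn


def metadata_for(conn, analysis_id):
    """Return the metadata of one canned analysis as a tag -> value dict."""
    cursor = conn.execute(
        "SELECT tag, value FROM analysis_metadata "
        "WHERE canned_analysis_id = ? ORDER BY tag",
        (analysis_id,),
    )
    return dict(cursor.fetchall())
```

Storing metadata as tag/value rows is one common way to allow arbitrary annotation; the real schema may organize custom tags differently.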
[Website screenshot: Canned Analyses view for the dataset "Chronic time course and dose dependent analysis of the clofibrate (Clofib)-regulated transcriptome in rat liver".]
The Datasets2Tools platform has a MySQL database containing dataset-tool associations (Figure 1). The associations are stored as a new type of digital object known as the canned analysis. Users can access the database through a Google Chrome browser extension, a website, or an API. The front-end of the extension and the website are built with JavaScript, HTML and CSS. The backend of the website and API are built with Flask, a framework for serving Python web applications. The Chrome Extension adds functionality to pages of biomedical repositories by embedding JavaScript code (Figure 2).
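To make the API layer concrete, here is a minimal sketch of an endpoint that returns the canned analyses of a dataset as JSON. The actual backend is built with Flask; this sketch instead uses a plain WSGI callable (the interface Flask applications also implement) so that it depends only on the Python standard library. The query parameter `dataset`, the response fields, the placeholder accession, and the in-memory `ANALYSES` store are all hypothetical and do not describe the real Datasets2Tools API.

```python
import json
from urllib.parse import parse_qs

# Hypothetical in-memory stand-in for the database of dataset-tool
# associations; the accession and URLs are placeholders.
ANALYSES = {
    "GSE00000": [
        {"tool": "Enrichr", "url": "https://example.org/analysis/1"},
        {"tool": "Clustergrammer", "url": "https://example.org/analysis/2"},
    ]
}


def app(environ, start_response):
    """WSGI app: ?dataset=<accession> returns its canned analyses as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    accession = query.get("dataset", [""])[0]
    analyses = ANALYSES.get(accession)
    if analyses is None:
        status = "404 Not Found"
        payload = {"error": "unknown dataset"}
    else:
        status = "200 OK"
        payload = {"dataset": accession, "canned_analyses": analyses}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call(wsgi_app, query_string):
    """Invoke a WSGI app directly, without a server, and decode the reply."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "QUERY_STRING": query_string}
    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], json.loads(body)
```

For local experimentation the callable could be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; a browser extension or website front-end would then fetch the JSON and render one icon per tool.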
[Website screenshot: dataset landing page with Overview, Experiments, and Canned Analyses sections. The Canned Analyses section reads: "The table below displays the canned analyses associated to the selected dataset, grouped by tool. To browse the canned analyses, select a tool from the menu below. The search bar can be used to restrict the menu to desired tools. Mouse over the info icon to receive additional information regarding computational tools." The table has the columns Tool, Description, and Canned Analyses, with the following rows (the number is the value shown in the Canned Analyses column):]
Enrichr (9): Enrichr is an easy to use intuitive enrichment analysis web-based tool providing various types of visualization summaries of collective functions of gene lists.
L1000CDS2 (7): LINCS L1000 characteristic direction signature search engine is a tool which enables users to find consensus L1000 small molecule signatures that match user input signatures.
Clustergrammer (5): Clustergrammer is a visualization tool that enables users to easily generate highly interactive and shareable clustergram and heatmap visualizations from a matrix of their own data.
iLINCS (3): iLINCS (Integrative LINCS) is an integrative web platform for analysis of LINCS data and signatures. The portal provides biologist-friendly user interfaces for analyzing transcriptomics and proteomics LINCS datasets.