Knowledge Graph Construction and Search for Biological Databases

Nazar Zaki, Chandana Tennakoon and Hany Al Ashwal
College of Information Technology
United Arab Emirates University
Al Ain, UAE
{nzaki,chandana,halashwal}@uaeu.ac.ae

Abstract—The number of biological databases available, both in the public domain and in private, keeps increasing every day. Scientists and researchers need to analyze and make use of the data stored in different databases. One limitation is that these databases are stored in diverse formats. However, semantic web methods have introduced the Resource Description Framework (RDF) to unify heterogeneous databases. In this paper we illustrate how to construct a knowledge graph out of biological RDF databases by connecting possibly related data. We also show how the resulting knowledge graph can be made compact. We then show how free-text search can be implemented to access nodes in the knowledge graph. Finally, we introduce a way to display and navigate the created knowledge graph.

Keywords—knowledge graph; biological databases; RDF
I. INTRODUCTION
There are many biological databases publicly available. By the latest count, their number exceeds 1800 [1], and even this figure is a vast under-representation, as many other private databases exist. Biological data is stored in a wide variety of formats. Many popular databases, such as those at NCBI, are available in tab- or comma-separated formats, and there is a trend towards making data available in RDF format (e.g., UniProt). Other popular formats include VCF, BED, GFF, and Excel. Moreover, some databases can only be accessed through APIs. Scientists need to access different types of databases in the course of their research. The first hurdle they typically face is how to connect these heterogeneous databases.

The development of semantic web methodologies has introduced the Resource Description Framework (RDF) [2] as a way to connect heterogeneous databases. There are several ways in which diverse databases can be converted to RDF [3]–[6]. These databases would be more useful if they could be connected together into a knowledge graph [7]. A knowledge graph can be defined as a collection of related facts.

Several biological databases connect different types of entities. However, not much work has been done on connecting heterogeneous biological databases. Most methods that aim to link biological data work with a subset of biological data [8]; Bio2RDF [9] and Linked Life Data [10] are the only two mainstream projects that aim to connect diverse sets of databases. Bio2RDF converts data in diverse formats into RDF and provides a set of rules to build a knowledge graph.

Linked Life Data also transforms databases by connecting resources based on several integration patterns that have been identified. Two other methods, KaBOB [11] and BioLOD [12], integrate databases based on available ontological information. There are many browsers available for browsing RDF data that use methods such as faceting [13]–[15] or graph navigation [16]–[19]. Faceted browsing can provide a compact way of searching for data, while graph navigation makes it easy to visualize connections between data. However, when large datasets are involved, graph-based methods tend to produce a convoluted set of connections that is not very useful. In this paper, we present a concept paper describing approaches that can be used to create a knowledge graph out of heterogeneous sources of data. Unlike other approaches, the construction relies on identifying basic types of entities, such as SNPs and genes, found in biological databases. Furthermore, we try to fill a gap in existing methods by discussing how new relationships may be inferred using graph properties. In addition, we show how end users can operate on this knowledge graph using free-text search. Finally, we provide some ideas for interactively exploring the knowledge graph.

978-1-5090-6255-3/17/$31.00 ©2017 IEEE

II. BASIC CONCEPTS

The RDF format models relationships between two entities using a 3-tuple, called a triple, consisting of a subject, a predicate, and an object. A predicate is a relationship between, or a fact about, the subject and the object.

Subjects, predicates, and objects can be uniquely represented by uniform resource identifiers (URIs). All the elements in a triple can be URIs; however, objects can also be strings. URIs assigned to related concepts often share a repeated prefix. In such cases, a namespace can be assigned to avoid redundancy and to easily identify related concepts.

There is a natural correspondence between RDF data and graphs. The subject and object act as nodes, connected by an edge labelled with the predicate. Note that while there is no reason why any two nodes cannot be joined by an edge in a normal graph, in an RDF database an object must be a URI if it is to be connected to a subject by a predicate. With this relationship in mind, we can think of an
RDF database and the associated knowledge graph interchangeably, and understand that the knowledge graph structure can be specified in RDF.

III. CREATING THE KNOWLEDGE GRAPH

In this section, we show how different RDF databases can be interconnected to construct a knowledge graph. In methods like Bio2RDF and Linked Life Data, one way to link data is to assign normalized URIs to objects. We use a similar method that uses the entity type of the object to link data.
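The triple-to-graph correspondence described in Section II can be sketched in a few lines of Python. The triples and predicate names below are illustrative only, not taken from any specific database:

```python
from collections import defaultdict

# Illustrative (subject, predicate, object) triples; names are hypothetical.
triples = [
    ("dbSNP:10177chr1", "rdf:type", "canonical:SNP"),
    ("dbSNP:10177chr1", "dbSNP:inGene", "HGNC:MAPK3"),
    ("HGNC:MAPK3", "rdf:type", "canonical:GENE"),
]

# Build a labelled adjacency list: node -> list of (predicate, neighbour).
# This is exactly the knowledge-graph view of the RDF data.
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

print(graph["dbSNP:10177chr1"])
```

Each subject becomes a node whose outgoing edges carry the predicate as a label, which is the interchangeable view of an RDF database used in the rest of the paper.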
In biological databases, we can identify several basic entities. Elements in biological databases can be thought of as belonging to several layers. The first layer is the genomic layer, where data related to genomic structure are stored; these include sequencing data, structural variation data, etc. The second layer contains data related to transcription processes (genes, RNA, transcripts, etc.). Another layer contains information about proteins (proteins, protein structure, etc.). The final layer contains entities related to biological processes (biological pathways, diseases, drug information, etc.).

For each layer described above, there are specific entities we can identify; SNPs, proteins, and biological pathways are three examples. These entities are studied by the scientific community, and there are databases associated with them that summarize the facts known about them. For the three entities mentioned above, dbSNP [20], UniProt [21], and KEGG [22] are three such key databases. We will call these well-recognized entities "canonical entities" and a database for canonical entities a "canonical database".

We will now describe how to prepare a knowledge graph connecting heterogeneous databases. We begin by annotating all canonical entities in databases with a type. This type information can be annotated using the standard rdf:type predicate, and we reserve a namespace (e.g., canonical) to avoid clashes with other type information, with a unique type to describe each canonical entity. In an RDF database, both a subject and an object might represent a canonical entity. For example, in the statement

dbSNP:10177chr1 rdf:description "rs367896724" .

dbSNP:10177chr1 is a SNP subject that describes the SNP at chr1:10177. We can attach the canonical type of this subject by adding the statement

dbSNP:10177chr1 rdf:type canonical:SNP .

However, in the statement

GWAS:breast_cancer GWAS:SNP "rs10069690" .

rs10069690 is the name of the SNP associated with breast cancer, and it appears in the object. Since we cannot attach a predicate to a literal in RDF, we replace the literal object with a blank node and attach the type information to the blank node; i.e., the statement above is replaced with

GWAS:breast_cancer GWAS:SNP _:b1 .
_:b1 rdf:description "rs10069690" .
_:b1 rdf:type canonical:SNP .

A. Connecting Databases

The next step is to connect the canonical entities with the canonical databases. There are standard IDs associated with canonical entities. For example, SNPs have dbSNP rs identifiers, genes have HGNC and Ensembl identifiers, and proteins have UniProt IDs. If there are multiple IDs (for example, genes have both HGNC and Ensembl IDs), one ID can be transformed into another.
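The literal-to-blank-node rewriting described earlier can be sketched as follows. The table mapping predicates to canonical types is an illustrative assumption; a real system would configure it per database:

```python
import itertools

# Hypothetical predicates whose literal objects name canonical entities,
# mapped to the canonical type to attach (an assumption for illustration).
LITERAL_PREDICATES = {"GWAS:SNP": "canonical:SNP"}

_counter = itertools.count(1)

def annotate_literals(triples):
    """Replace typed literal objects with blank nodes carrying rdf:type."""
    out = []
    for s, p, o in triples:
        ctype = LITERAL_PREDICATES.get(p)
        if ctype is None:
            out.append((s, p, o))  # pass untyped triples through unchanged
            continue
        bnode = f"_:b{next(_counter)}"
        out.append((s, p, bnode))
        out.append((bnode, "rdf:description", o))
        out.append((bnode, "rdf:type", ctype))
    return out

result = annotate_literals([("GWAS:breast_cancer", "GWAS:SNP", '"rs10069690"')])
print(result)
```

The single GWAS statement expands into three triples, matching the rewriting shown above.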
Fig. 1. The figure shows how a database entry is connected to a canonical database. A is a node having the rdf:type of protein. Its description shows a gene ID, MAPK3, whose canonical name is P27361. A and the canonical database are connected using an edge labelled rdfs:sameAs.
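The canonical-linking step illustrated in Fig. 1 might be sketched as follows. The ID-mapping table here is an illustrative assumption; in practice it would be derived from an ID-mapping resource:

```python
# Hypothetical mapping from alternative identifiers (e.g. gene symbols)
# to canonical UniProt accessions; assumed for illustration only.
GENE_TO_UNIPROT = {"MAPK3": "P27361"}

def canonical_links(entries):
    """Emit rdfs:sameAs triples pointing entries at their canonical URIs."""
    links = []
    for subject, name in entries:
        acc = GENE_TO_UNIPROT.get(name)
        if acc is not None:
            links.append((subject, "rdfs:sameAs", f"UNIPROT:{acc}"))
    return links

links = canonical_links([("DB:abc", "MAPK3")])
print(links)
```

Any entry whose description resolves through the mapping gains an rdfs:sameAs edge to its canonical URI.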
The canonical databases have their own unique namespaces, and information for an entity with standard ID ID is stored under the URI NAME:ID, where NAME is the namespace of the canonical database. (We will call NAME:ID the canonical URI for a subject.) If a standard ID is associated with a subject in a database, we find the canonical URI corresponding to that ID and connect it to the canonical database using the predicate rdfs:sameAs. After this process, a user can follow the knowledge graph from a SNP to its canonical database to get extra information.

As an illustration, the protein with UniProt ID P27361 is sometimes described using the gene ID MAPK3 (see Figure 1). We assume that UniProt is the canonical database used for proteins and has the namespace UNIPROT. The following entries from a hypothetical database

DB:abc rdf:description "MAPK3" .
DB:abc rdf:type canonical:PROTEIN .

would be connected to the canonical protein database using the statement

DB:abc rdfs:sameAs UNIPROT:P27361 .

In the next step, we attempt to extend the knowledge graph by connecting databases that are non-canonical. If two subjects S1 and S2 are connected to the same canonical database entry, we add the statements

S1 rdfs:seeAlso S2 .
S2 rdfs:seeAlso S1 .

After this connection is made, a user can follow the rdfs:seeAlso predicate from S1 (S2) to see other information related to S1 (S2). If S1 and S2 are in the same database, the connection is internal; if they are in distinct databases, the user will move to a different database to continue his/her exploration.

B. Inferring Connections

In the next step, we infer possible connections between two subjects that have not been made before. If S1 and S2 are two subjects connected by a set of paths P1, ..., Pn, we connect S1 and S2 by adding the statements

S1 rdfs:inferred S2 .
S2 rdfs:inferred S1 .
The user can now follow rdfs:inferred links from S1 (S2) to find information that may relate to them. However, the inferred connections might not in fact be real. For example, in Figure 2 we see four entities: a SNP, a gene, a disease, and a pathway. The SNP is inside the gene and affects it, and it is known to cause the disease. The pathway also affects the disease. We can therefore find a path from the gene to the pathway. However, this does not necessarily mean that the gene is connected with the pathway; it may be that the SNP affects the disease through a different pathway not shown here. In such cases, we need to provide some measure of confidence in the inferred connection. One measure is the length of the shortest path between two connected subjects: if it is small, we can have more confidence that the connection is real. To judge the reliability of inferred connections further, a score can be calculated using an algorithm such as PEWCC [23], which determines the reliability of an edge using weights assigned to neighbouring edges. The user can then choose whether to follow the inferred paths based on these scores.
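The shortest-path confidence measure can be sketched with a breadth-first search over the toy graph of Fig. 2. The node names are illustrative placeholders:

```python
from collections import deque

# Undirected toy graph mirroring Fig. 2; edge labels omitted for brevity.
edges = [("SNP", "gene"), ("SNP", "disease"), ("pathway", "disease")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def shortest_path_length(src, dst):
    """BFS shortest-path length, used here as a crude confidence proxy."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # not connected

print(shortest_path_length("gene", "pathway"))
```

The gene reaches the pathway only through the SNP and the disease, so the path length is 3; a shorter path would warrant more confidence in an inferred edge.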
C. Live Search

The size of biological databases can be very large; for canonical databases in particular, sizes can run to several gigabytes. A user who wants to set up the knowledge graph on a standard desktop system will not find this practical. A solution is to use live searching of databases, where data is loaded only as necessary from an external source. Most established databases that can be used as canonical databases have APIs for programmatic access to their data. The knowledge graph then stores only the canonical subject. Whenever a request is made to access the canonical subject, an API call to the canonical database is made and the results are converted to the format required by the knowledge graph. For efficiency, after the initial preparation of the knowledge graph connections, large databases can be hosted in different places, and only a minimal structure of the knowledge graph need be kept in a central location.

IV. FREE TEXT SEARCH

Free-text search is becoming an essential feature of database exploration. It is implemented by creating a reverse text index of the content of the database. Indexing RDF databases is not straightforward: the textual content is not evident, and some of the text content consists of URIs. In this section, we describe how the knowledge graph can be made amenable to free-text search (see Figure 3).

The first step is to decide which entries in the database are to be indexed. In general, it is not useful to index values such as the strand of the genome or numerical quantities such as p-values. We filter out these unnecessary data based on the predicates that contain them (i.e., we identify predicates that contain such objects and ignore those entries entirely). We next extract the triples (S,P,O) from the databases, where S is the subject, P the predicate, and O the corresponding object.
For each predicate P we provide a human-readable description H(P). This description can be either manually specified or obtained from a suitable description of the database schema (for example, it may be derived from the column headers of a tabular database). We then convert the 3-tuple (S,P,O) into the 2-tuple (S, H(P)+O), where + denotes string concatenation. Finally, if there are 2-tuples (S,T1), ..., (S,Tn) with the same subject, they are replaced by (S, T1+...+Tn). For example, we might find 3-tuples like

(protein:BRCA1, protein:dis, "breast cancer")
(protein:BRCA1, protein:chr, "17")

Instead of the cryptic P = "protein:dis" we use H(P) = "related to disease ", and for P = "protein:chr" we use H(P) = "is in chromosome ". After the first transformation of the first statement, we have

(protein:BRCA1, "related to disease breast cancer"),

and the final transformation yields

(protein:BRCA1, "related to disease breast cancer is in chromosome 17"),

which conveys the meaning of the triples in a form more suitable for a text search.

Fig. 2. This graph shows relationships between four entities: a gene, a SNP, a disease, and a pathway. We can tentatively make a connection between the gene and the pathway.

A. Indexing Approach

We will describe how to do the reverse indexing using the popular open-source Solr [24] platform. In the 2-tuple (S,T), only the text content T is indexed. The two components S and T are treated as two fields, ID and TEXT respectively. During indexing, the usual tokenization process, where words are split at white space and common words are removed, is applied. In the next step, synonyms are added. This step can be used to support translation between the identifiers used by different databases; for example, we can add synonyms to map an Entrez Gene ID to its corresponding HGNC ID. We can also use synonyms to substitute terms with similar terms (e.g., "carcinoma" and "cancer"). During this process, we can use extensive thesauruses such as MeSH. Finally, a stemming step converts words to a form that allows free-text searching of different verb and noun forms.

Next, we look at how a free-text query is translated into a query on the knowledge graph. The user enters free text in a search box in the search engine interface. The search can be formatted according to the syntax compatible with Solr searches. Solr searches the TEXT fields for a match and returns the corresponding ID fields as the search result. These IDs correspond to the matching nodes in the knowledge graph we have built. Note that, unlike in a classical text search, showing snippet information does not convey much here. Therefore, we do not have to store the full text used for index building in the Solr index; we need only store data sufficient for performing the text match and retrieving the matching IDs.
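The sentence construction and reverse indexing described above can be sketched with a toy in-memory index (a stand-in for Solr; the predicate descriptions are the illustrative ones used in the example):

```python
from collections import defaultdict
import re

# Hypothetical human-readable descriptions H(P) for predicates.
H = {"protein:dis": "related to disease", "protein:chr": "is in chromosome"}

triples = [
    ("protein:BRCA1", "protein:dis", "breast cancer"),
    ("protein:BRCA1", "protein:chr", "17"),
]

# 1. Convert each triple to H(P)+O and merge the texts per subject.
sentences = defaultdict(list)
for s, p, o in triples:
    sentences[s].append(f"{H[p]} {o}")
docs = {s: " ".join(parts) for s, parts in sentences.items()}

# 2. Build a toy inverted index: token -> set of subject IDs.
index = defaultdict(set)
for sid, text in docs.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(sid)

def search(query):
    """Return the IDs whose text contains every query token."""
    tokens = re.findall(r"\w+", query.lower())
    hits = [index.get(t, set()) for t in tokens]
    return set.intersection(*hits) if hits else set()

print(docs["protein:BRCA1"])
print(search("breast cancer"))
```

A query returns subject IDs rather than snippets, matching the observation that only the IDs of knowledge-graph nodes need to be retrieved.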
Free-text searching capabilities are provided in many semantic web frameworks; for example, free-text capability using Lucene is provided in the Jena framework. However, we can expect more flexibility and speed in free-text search by using the Solr engine directly, as it is specifically built for free-text searches.

Fig. 3. The figure shows the workflow of how the text index is created from the database. The relationships in the database are converted to a form close to a human-readable sentence. This set of sentences is indexed, applying several pre-processing steps to improve free-text searching.

V. DISPLAYING THE KNOWLEDGE GRAPH

A free-text search can isolate the nodes in the knowledge graph relevant to the user's query. We now need to provide a method that enables the user to make further explorations. We do not take the approach of showing all possible connections to a node in the knowledge graph, as that might show an excessive amount of information. Instead, we provide the user with the nodes connected to a hit by a path of length at most L, where L is a user-defined value. If a path contains an edge labelled rdfs:sameAs, rdfs:seeAlso, or rdfs:inferred, we do not follow it further; i.e., the user initially does not see the connections between nodes added during the processing of the database. This is done so that users see the set of connections with the highest confidence in the knowledge graph, and to avoid spurious connections that may occur by following artificially created links. Nodes connected with rdfs:sameAs show additional information about the node entity; rdfs:seeAlso shows other nodes related to the node entity; rdfs:inferred shows nodes that might have a relationship with the node. Following these nodes is left to the user's discretion. Nodes connected by rdfs:inferred are sorted according to a weighted score, so that the most likely connections appear at the top. When a new node is selected by the user, the same procedure is applied with the selected node as the focus.

After a few rounds of knowledge graph navigation, it is possible to lose track of the search. A trail of "breadcrumbs" of the nodes travelled can be kept so that the user can orient him/herself in the search at any time and switch back to a previous state of the search.

With the knowledge graph we have created, it may be possible to answer questions that relate entities in different databases. For example, a user might want to know the diseases caused by a gene. There may be a set of diseases directly related to the gene in some database, and there may be other diseases inferred from a protein corresponding to the gene. All of these will be available as links to the node or as inferred nodes. If functionality to filter the objects connected to a node by the canonical database of the objects is available, the query can be answered; i.e., from the previous example, if we retain only the objects that belong to the canonical database for diseases, we have a possible list of diseases that the gene could be responsible for. Furthermore, consider the query of finding SNPs in a gene G that are related to diseases. We can find the set of diseases D related to the gene G. Then we can find the set of SNPs S related to the diseases in D. Finally, our query can be answered by finding the SNPs in S that are related to the gene G.
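The depth-limited neighbourhood exploration around a search hit, with special edges (rdfs:sameAs, rdfs:seeAlso, rdfs:inferred) shown but not expanded, can be sketched as follows; the graph and labels are illustrative placeholders:

```python
from collections import deque

# Toy labelled graph: node -> list of (edge_label, neighbour).
# Nodes and labels are illustrative only.
graph = {
    "hit": [("dbSNP:inGene", "gene1"), ("rdfs:inferred", "pathway1")],
    "gene1": [("gene:codes", "protein1")],
    "pathway1": [("pathway:member", "gene2")],
}

SPECIAL = {"rdfs:sameAs", "rdfs:seeAlso", "rdfs:inferred"}

def neighbourhood(start, max_len):
    """Collect nodes within max_len hops; special edges are not followed."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = []
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_len:
            continue
        for label, nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            result.append((label, nxt))
            if label not in SPECIAL:  # do not expand artificially added edges
                frontier.append((nxt, dist + 1))
    return result

print(neighbourhood("hit", 2))
```

The inferred pathway is reported but its own neighbours are not, so the initial view contains only the high-confidence connections.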
The method of navigating the knowledge graph that we describe avoids giving the user the full global picture, to prevent overwhelming the user with too much information. However, it still leaves room for the user to find all entities that are related (or probably related) and to navigate selectively towards the goal of the search. Furthermore, the user can easily switch back and forth between previous search states, and is thus always in control of the search.

VI. FUTURE WORK

We believe that our solution offers a complete framework (see Figure 4) for connecting and browsing RDF databases to answer users' queries. The basic ideas given here can be further extended. To facilitate exploratory searches, a dynamic faceted search interface can be incorporated. We have only considered a few graph features that can be used to predict links; further features could be added to increase the confidence of inferred links. We can also explore the possibility of incorporating machine learning methods such as [25], [26] to predict additional confident links.

VII. CONCLUSION

Working with heterogeneous data is an important aspect of biological research. In this paper, we have shown how heterogeneous biological databases in RDF format can be converted into a linked knowledge graph. We also showed how the graph can be queried using free-text search and navigated. Compared to existing methods, our proposed solution tries to infer relationships based on graph features and creates a text index that can be used to access graph nodes with a free-text search. Furthermore, the knowledge graph allows live searching of databases, thereby reducing the resources required to host it.
ACKNOWLEDGMENT

The authors would like to acknowledge financial support received from the ICT Fund (Ref: ICT Fund/CEO/L/21/152) of the Telecommunications Regulatory Authority, UAE.

Fig. 4. Heterogeneous data sources are converted to RDF, which has an inherent graph structure. These graphs are then connected to create the knowledge graph. The knowledge graph will also try to infer connections not directly available from the original data sources. Finally, the knowledge graph is indexed for text search and is accessed through a browser interface.

REFERENCES

[1] D. J. Rigden, X. M. Fernandez-Suarez, and M. Y. Galperin, "The 2016 database issue of Nucleic Acids Research and an updated molecular biology database collection," Nucleic Acids Res., vol. 44, no. D1, pp. D1–D6, 2016.
[2] D. Beckett and B. McBride, "RDF/XML Syntax Specification (Revised)," W3C Working Draft, Oct. 2003.
[3] I. Ermilov, S. Auer, and C. Stadler, "CSV2RDF: User-driven CSV to RDF mass conversion framework," in Proceedings of the ISEM, vol. 13, 2013, pp. 4–6.
[4] R. P. Reck, "Excel2RDF for Microsoft Windows," 2003.
[5] M. Grove, "Mindswap Convert To RDF Tool."
[6] L. Han, T. Finin, C. Parr, J. Sachs, and A. Joshi, "RDF123: from Spreadsheets to RDF," in International Semantic Web Conference, 2008, pp. 451–466.
[7] L. Zhang, Knowledge Graph Theory and Structural Parsing. Twente University Press, 2002.
[8] H. Chiba, H. Nishide, and I. Uchiyama, "Construction of an ortholog database using the Semantic Web technology for integrative analysis of genomic data," PLoS One, vol. 10, no. 4, 2015.
[9] F. Belleau, M. A. Nolin, N. Tourigny, P. Rigault, and J. Morissette, "Bio2RDF: Towards a mashup to build bioinformatics knowledge systems," J. Biomed. Inform., vol. 41, no. 5, pp. 706–716, 2008.
[10] V. Momtchev, D. Peychev, T. Primov, and G. Georgiev, "Expanding the pathway and interaction knowledge in Linked Life Data," Int. Semantic Web Challenge, 2009.
[11] K. M. Livingston, M. Bada, W. A. Baumgartner, and L. E. Hunter, "KaBOB: ontology-based semantic integration of biomedical databases," BMC Bioinformatics, vol. 16, no. 1, p. 126, 2015.
[12] K. Nishikata and T. Toyoda, "BioLOD.org: Ontology-based integration of biological linked open data," in Proceedings of the 4th International Workshop on Semantic Web Applications and Tools for the Life Sciences, 2012, pp. 92–93.
[13] "Longwell RDF Browser, SIMILE," 2005.
[14] D. F. Huynh and D. Karger, "Parallax and Companion: Set-based browsing for the data web," in WWW Conference, ACM, 2009.
[15] G. Kobilarov and I. Dickinson, "Humboldt: Exploring linked data," context, vol. 6, p. 7, 2008.
[16] P. Heim, J. Ziegler, and S. Lohmann, "gFacet: A browser for the web of data," in Proceedings of the International Workshop on Interacting with Multimedia Content in the Social Semantic Web (IMC-SSW'08), vol. 417, 2008, pp. 49–58.
[17] A. Russell, P. R. Smart, D. Braines, and N. R. Shadbolt, "NITELIGHT: A graphical tool for semantic query construction," in CEUR Workshop Proceedings, vol. 543, 2009.
[18] D. Schweiger, Z. Trajanoski, and S. Pabinger, "SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases," BMC Bioinformatics, vol. 15, no. 1, p. 279, 2014.
[19] S. S. e Zainab, A. Hasnain, M. Saleem, Q. Mehmood, D. Zehra, and S. Decker, "FedViz: A visual interface for SPARQL queries formulation and execution," VOILA Workshop, ISWC 2015, 2015.
[20] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin, "dbSNP: the NCBI database of genetic variation," Nucleic Acids Res., vol. 29, no. 1, pp. 308–311, 2001.
[21] The UniProt Consortium, "UniProt: a hub for protein information," Nucleic Acids Res., vol. 43, no. D1, pp. D204–D212, 2015.
[22] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa, "KEGG: Kyoto Encyclopedia of Genes and Genomes," Nucleic Acids Res., vol. 27, no. 1, pp. 29–34, 1999.
[23] N. Zaki, D. Efimov, and J. Berengueres, "Protein complex detection using interaction reliability assessment and weighted clustering coefficient," BMC Bioinformatics, vol. 14, no. 1, p. 163, 2013.
[24] D. Smiley and E. Pugh, Solr 1.4 Enterprise Search Server. 2009.
[25] M. Nickel, V. Tresp, and H.-P. Kriegel, "A three-way model for collective learning on multi-relational data," in ICML, 2011, pp. 809–816.
[26] M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki, "Link prediction using supervised learning," SDM'06 Workshop on Link Analysis ..., 2006.