User:TiagoLubiana/10.7554/eLife.52614

This is a wikification of the article "Science Forum: Wikidata as a knowledge graph for the life sciences" by Andra Waagmeester et al., available at https://elifesciences.org/articles/52614.

This article incorporates text available under the CC BY 4.0 license.

Abstract

Wikidata is a community-maintained knowledge base that has been assembled from repositories in the fields of genomics, proteomics, genetic variants, pathways, chemical compounds, and diseases, and that adheres to the FAIR principles of findability, accessibility, interoperability and reusability. Here we describe the breadth and depth of the biomedical knowledge contained within Wikidata, and discuss the open-source tools we have built to add information to Wikidata and to synchronize it with source databases. We also demonstrate several use cases for Wikidata, including the crowdsourced curation of biomedical ontologies, phenotype-based diagnosis of disease, and drug repurposing.

Introduction

Integrating data and knowledge is a formidable challenge in biomedical research. Although new scientific findings are being discovered at a rapid pace, a large proportion of that knowledge is either locked in data silos (where integration is hindered by differing nomenclature, data models, and licensing terms ^[1]) or locked away in free text. The lack of an integrated and structured version of biomedical knowledge hinders efficient querying or mining of that information, thus preventing the full utilization of our accumulated scientific knowledge.

Recently, the scientific community has placed growing emphasis on ensuring that all scientific data are FAIR (Findable, Accessible, Interoperable, and Reusable), and a consensus is emerging around a concrete set of principles to ensure FAIRness ^[1]^[2]. Widespread implementation of these principles would greatly advance efforts by the open-data community to build a rich and heterogeneous network of scientific knowledge. That knowledge network could, in turn, be the foundation for many computational tools, applications and analyses.

Most data- and knowledge-integration initiatives fall on either end of a spectrum. At one end, centralized efforts seek to bring multiple knowledge sources into a single database (see, for example, Mungall et al., 2017^[3]): this approach has the advantage of data alignment according to a common data model and of enabling high performance queries. However, centralized resources are difficult and expensive to maintain and expand ^[4]^[5], at least in part because of bottlenecks that are inherent in a centralized design.

Here we explore the use of Wikidata (www.wikidata.org; Vrandečić, 2012; Mora-Cantallops et al., 2019) as a platform for knowledge integration in the life sciences. Wikidata is an openly-accessible knowledge base that is editable by anyone. Like its sister project Wikipedia, the scope of Wikidata is nearly boundless, with items on topics as diverse as books, actors, historical events, and galaxies. Unlike Wikipedia, Wikidata focuses on representing knowledge in a structured format instead of primarily free text. As of September 2019, Wikidata's knowledge graph included over 750 million statements on 61 million items (tools.wmflabs.org/wikidata-todo/stats.php). Wikidata was also the first project run by the Wikimedia Foundation (which also runs Wikipedia) to have surpassed one billion edits, achieved by a community of 12,000 active users, including 100 active computational ‘bots’ (Figure 1—figure supplement 1).

As a knowledge integration platform, Wikidata combines several of the key strengths of the centralized and distributed approaches. A large portion of the Wikidata knowledge graph is based on the automated imports of large structured databases via Wikidata bots, thereby breaking down the walls of existing data silos. Since Wikidata is also based on a community-editing model, it harnesses the distributed efforts of a worldwide community of contributors, including both domain experts and bot developers. Anyone is empowered to add new statements, ranging from individual facts to large-scale data imports. Finally, all knowledge in Wikidata is queryable through a SPARQL query interface (query.wikidata.org/), which also enables distributed queries across other Linked Data resources.
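As an illustration of the SPARQL access described above, the following is a minimal Python sketch, not the authors' tooling. It assumes the public endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout. The item and property identifiers in the example query (Q7187 for gene, P703 for found in taxon, Q15978631 for Homo sapiens, P351 for Entrez Gene ID) are assumptions that should be checked on wikidata.org before use, and the function names and User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint behind query.wikidata.org (assumed here).
ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: a few human genes with their NCBI Gene (Entrez) IDs.
# Identifiers are assumptions to verify on wikidata.org:
# Q7187 = gene, P703 = found in taxon, Q15978631 = Homo sapiens, P351 = Entrez Gene ID.
HUMAN_GENES_QUERY = """
SELECT ?gene ?geneLabel ?entrez WHERE {
  ?gene wdt:P31 wd:Q7187 ;
        wdt:P703 wd:Q15978631 ;
        wdt:P351 ?entrez .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""


def build_request(query, user_agent="example-sparql-sketch/0.1"):
    """Build a GET request for the endpoint; the User-Agent value is a placeholder."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "User-Agent": user_agent,
            "Accept": "application/sparql-results+json",
        },
    )


def flatten_bindings(payload):
    """Turn a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return flatten_bindings(json.load(response))
```

Calling `run_query(HUMAN_GENES_QUERY)` would return one dictionary per result row, keyed by the query's variable names. Parsing is kept separate from the network call so that the result handling can be checked offline.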

In previous work, we seeded Wikidata with content from public and authoritative sources of structured knowledge on genes and proteins (Burgstaller-Muehlbacher et al., 2016) and chemical compounds (Willighagen et al., 2018). Here, we describe progress on expanding and enriching the biomedical knowledge graph within Wikidata, both by our team and by others in the community (Turki et al., 2019). We also describe several representative biomedical use cases that show how Wikidata can enable new analyses and improve the efficiency of research. Finally, we discuss how researchers can contribute to this effort to build a continuously-updated and community-maintained knowledge graph that epitomizes the FAIR principles.

The Wikidata Biomedical Knowledge Graph

The original effort behind this work focused on creating and annotating Wikidata items for human and mouse genes and proteins (Burgstaller-Muehlbacher et al., 2016), and was subsequently expanded to include microbial reference genomes from NCBI RefSeq (Putman et al., 2017). Since then, the Wikidata community (including our team) has significantly expanded the depth and breadth of biological information within Wikidata, resulting in a rich, heterogeneous knowledge graph (Figure 1). Some of the key new data types and resources are described below.

Genes and proteins: Wikidata contains items for over 1.1 million genes and 940 thousand proteins from 201 unique taxa. Annotation data on genes and proteins come from several key databases including NCBI Gene (Agarwala et al., 2018), Ensembl (Zerbino et al., 2018), UniProt (UniProt Consortium, 2019), InterPro (Mitchell et al., 2019), and the Protein Data Bank (Burley et al., 2019). These annotations include information on protein families, gene functions, protein domains, genomic location, and orthologs, as well as links to related compounds, diseases, and variants.

Genetic variants: Annotations on genetic variants are primarily drawn from CIViC (http://www.civicdb.org), an open and community-curated database of cancer variants (Griffith et al., 2017). Variants are annotated with their relevance to disease predisposition, diagnosis, prognosis, and drug efficacy. Wikidata currently contains 1502 items corresponding to human genetic variants, focused on those with a clear clinical or therapeutic relevance.

Chemical compounds including drugs: Wikidata has items for over 150 thousand chemical compounds, including over 3500 items which are specifically designated as medications. Compound attributes are drawn from a diverse set of databases, including PubChem (Wang et al., 2009), RxNorm (Nelson et al., 2011), the IUPHAR Guide to Pharmacology (Harding et al., 2018; Pawson et al., 2014; Southan et al., 2016), NDF-RT (National Drug File – Reference Terminology), and LIPID MAPS (Sud et al., 2007). These items typically contain statements describing chemical structure and key physicochemical properties, and links to databases with experimental data, such as MassBank (Horai et al., 2010; Wohlgemuth et al., 2016) and PDB Ligand (Shin, 2004), and toxicological information, such as the EPA CompTox Dashboard (Williams et al., 2017). Additionally, these items contain links to compound classes, disease indications, pharmaceutical products, and protein targets.

Pathways: Wikidata has items for almost three thousand human biological pathways, primarily from two established public pathway repositories: Reactome (Fabregat et al., 2018) and WikiPathways (Slenter et al., 2018). The full details of the different pathways remain with the respective primary sources. Our bots enter data for Wikidata properties such as pathway name, identifier, organism, and the list of component genes, proteins, and chemical compounds. Properties for contributing authors (via ORCID properties; Sprague, 2017), descriptions and ontology annotations are also being added for Wikidata pathway entries.

Diseases: Wikidata has items for over 16 thousand diseases, the majority of which were created based on imports from the Human Disease Ontology (Schriml et al., 2019), with additional disease terms added from the Monarch Disease Ontology (Mungall et al., 2017). Disease attributes include medical classifications, symptoms, and relevant drugs, as well as subclass relationships to higher-level disease categories. In instances where the Human Disease Ontology specifies a related anatomic region and/or a causative organism (for infectious diseases), corresponding statements are also added.

References: Whenever practical, the provenance of each statement added to Wikidata was also added in a structured format. References are part of the core data model for a Wikidata statement. References can either cite the primary resource from which the statement was retrieved (including details like version number of the resource), or they can link to a Wikidata item corresponding to a publication as provided by a primary resource (as an extension of the WikiCite project; Ayers et al., 2019), or both. Wikidata contains over 20 million items corresponding to publications across many domain areas, including a heavy emphasis on biomedical journal articles.
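The reference options described above (primary resource, publication item, or both) can be sketched as a small data structure. This is a simplified illustration only, not Wikidata's actual JSON serialization. The class names, field names, and placeholder identifiers such as `Q_EXAMPLE_DB` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Reference:
    """Provenance for one statement (simplified, hypothetical model)."""
    stated_in: Optional[str] = None         # item for the primary resource
    resource_version: Optional[str] = None  # e.g. release or version of that resource
    publication: Optional[str] = None       # item for a cited publication

    def kind(self) -> str:
        """Classify the reference as described in the text above."""
        has_resource = self.stated_in is not None
        has_publication = self.publication is not None
        if has_resource and has_publication:
            return "both"
        if has_resource:
            return "primary resource"
        if has_publication:
            return "publication"
        return "unreferenced"


@dataclass
class Statement:
    """A subject-property-value claim carrying zero or more references."""
    subject: str
    prop: str
    value: str
    references: List[Reference] = field(default_factory=list)

    def is_referenced(self) -> bool:
        """True if at least one reference names a resource or a publication."""
        return any(ref.kind() != "unreferenced" for ref in self.references)


# Usage with placeholder identifiers (not real Wikidata IDs).
statement = Statement(
    subject="Q_EXAMPLE_GENE",
    prop="P_EXAMPLE_PROPERTY",
    value="Q_EXAMPLE_DISEASE",
    references=[Reference(stated_in="Q_EXAMPLE_DB", resource_version="example-release")],
)
print(statement.is_referenced(), statement.references[0].kind())
```

Keeping references as a list on each statement mirrors the point that provenance belongs to the statement itself rather than to the item as a whole.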

Bot automation

TODO

References

[1] Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem; da Silva Santos, Luiz Bonino; Bourne, Philip E.; Bouwman, Jildau (2016-03-15). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data. 3 (1): 160018. doi:10.1038/sdata.2016.18. ISSN 2052-4463. PMC 4792175. PMID 26978244.
[2] Wilkinson, Mark D.; Dumontier, Michel; Sansone, Susanna-Assunta; Bonino da Silva Santos, Luiz Olavo; Prieto, Mario; Batista, Dominique; McQuilton, Peter; Kuhn, Tobias; Rocca-Serra, Philippe; Crosas, Mercè; Schultes, Erik (2019-09-20). "Evaluating FAIR maturity through a scalable, automated, community-governed framework". Scientific Data. 6 (1): 174. doi:10.1038/s41597-019-0184-5. ISSN 2052-4463. PMC 6754447. PMID 31541130.
[3] Mungall, Christopher J; McMurry, Julie A; Köhler, Sebastian; Balhoff, James P.; Borromeo, Charles; Brush, Matthew; Carbon, Seth; Conlin, Tom; Dunn, Nathan (2016-11-03). "The Monarch Initiative: An integrative data and analytic platform connecting phenotypes to genotypes across species". dx.doi.org. Retrieved 2022-02-18.
[4] Chandras, C.; Weaver, T.; Zouberakis, M.; Smedley, D.; Schughart, K.; Rosenthal, N.; Hancock, J. M.; Kollias, G.; Schofield, P. N.; Aidinis, V. (2009-10-23). "Models for financial sustainability of biological databases and resources". Database. 2009 (0): bap017. doi:10.1093/database/bap017. ISSN 1758-0463.
[5] Gabella, Chiara; Durinx, Christine; Appel, Ron (2018-03-22). "Funding knowledgebases: Towards a sustainable funding model for the UniProt use case". F1000Research. 6: 2051. doi:10.12688/f1000research.12989.2. ISSN 2046-1402.
