Award details

Linking data with Identifiers.org

Reference BB/K016946/1
Principal Investigator / Supervisor Mr Henning Hermjakob
Co-Investigators / Co-Supervisors Mr Camille Laibe
Institution EMBL - European Bioinformatics Institute
Department Computational Neurobiology Group
Funding type Research
Value (£) 119,457
Status Completed
Type Research Grant
Start date 26/08/2013
End date 25/11/2014
Duration 15 months

Abstract

Providing annotated life science data (which include cross-references to other sources of knowledge) has always been very important. For both human and tool consumers, the key requirement for such metadata is globally unique, perennial, resolvable and free identifiers. The MIRIAM Registry and Identifiers.org are tools providing such identifiers in the form of HTTP URIs. The aim of this proposal is to address our most important users' requests, in order to transform a prototype system into a full-featured and reliable service for the community. We will fully support semantic web applications by upgrading the services to follow the standards of this area, namely RDF and SPARQL. By making the data of the Registry available in RDF and supplying a SPARQL endpoint, we will enable users to efficiently integrate data coming from multiple sources, even if they do not use the same URIs for identifying the same concepts. Moreover, the services currently resolving Identifiers.org URLs lead only to HTML pages. This is very useful for users but needs to be extended for tools. It will require listing, for each physical location (or resource), the different formats it can provide. By doing so, tools only able to handle a given format could still traverse the various linked datasets. The system will also provide customised services, by allowing users to create 'profiles'. Those will record, for example, their resolving and format preferences. Web services and data export will make use of this information to provide targeted services. Also, to maintain in a sustainable way a Registry growing at an increasing rate, we will involve the community, and especially the data providers, in updating their own records. Finally, to provide more reliable services, we will migrate the infrastructure to new, fully redundant data centres, with the aim of ultimately providing a framework deployable on more mirrors and possibly cloud infrastructures.
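
As an illustration only, the format-aware resolution and 'profiles' described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the service's actual design: the Resource and Profile classes, the resolve function, the resource names and the example.org URL templates are all hypothetical and do not come from the Registry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical record for one physical location serving a collection."""
    name: str
    # Maps a format name (e.g. "html", "rdf") to a URL template with "{id}".
    patterns: Dict[str, str]


@dataclass
class Profile:
    """Hypothetical user profile holding resolving and format preferences."""
    preferred_formats: List[str]
    preferred_resources: List[str] = field(default_factory=list)


def resolve(resources: List[Resource], accession: str,
            profile: Profile) -> Optional[str]:
    """Return a URL for the first preferred format that any resource offers."""
    def rank(resource: Resource) -> int:
        # Resources named in the profile come first, in the profile's order.
        if resource.name in profile.preferred_resources:
            return profile.preferred_resources.index(resource.name)
        return len(profile.preferred_resources)

    for fmt in profile.preferred_formats:
        candidates = [r for r in resources if fmt in r.patterns]
        if candidates:
            best = min(candidates, key=rank)  # ties keep registry order
            return best.patterns[fmt].replace("{id}", accession)
    return None  # no resource provides any of the requested formats


# Made-up example data: two locations for the same collection.
RESOURCES = [
    Resource("primary", {"html": "https://primary.example.org/html/{id}"}),
    Resource("mirror", {"html": "https://mirror.example.org/html/{id}",
                        "rdf": "https://mirror.example.org/rdf/{id}"}),
]

if __name__ == "__main__":
    rdf_only_tool = Profile(preferred_formats=["rdf"])
    print(resolve(RESOURCES, "9606", rdf_only_tool))
```

With this made-up data, a tool that only handles RDF is sent to the mirror, the only location listing that format, while a profile that prefers HTML and names no resource falls back to the first location in registry order.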

Summary

Annotating life science datasets with cross-references to other sources of knowledge has always been very important. These metadata are often what separates valuable information from heaps of unusable data. With the advent of systems biology, the size and complexity of datasets shifted the balance from direct human interaction to automated computer processing. Such operations are greatly facilitated if the metadata is encoded following standard procedures and using controlled vocabularies. If those procedures and vocabularies are shared between different types of data, it becomes possible to align, compare and integrate different datasets. A key part of any cross-reference is the identifier of the resource it points to. This identifier must be unique, perennial, resolvable and free. Most data providers create identifiers for their own records; for example '9606' identifies 'Homo sapiens' in the Taxonomy, and '22140103' identifies the latest publication about Identifiers.org in PubMed. However, those identifiers are only unique within a given dataset, so their usefulness is limited when considering records in a wider context. Identifiers.org provides globally unique identifiers, and resolves them to the relevant dataset. To achieve this, it uses the information recorded in the MIRIAM Registry (http://www.ebi.ac.uk/miriam/). Each project therefore provides a distinct part of the final technical solution. Identifiers generated with the Registry make use of the accession numbers supplied by data providers, but also contain information about the collection they come from. All identifiers are unique, resolvable and robust. They allow people or software tools to directly access the identified pieces of data on the web, via alternative providers.
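
To make the idea of combining a collection with a provider's accession number concrete, here is a minimal sketch. It assumes a URI layout of base, collection namespace and accession, and the namespaces "taxonomy" and "pubmed" are assumptions for illustration; the text above gives only the accession numbers '9606' and '22140103'. The function name is hypothetical.

```python
from urllib.parse import quote

# Assumed base for Identifiers.org HTTP URIs (not spelled out in the text).
BASE_URI = "http://identifiers.org"


def identifiers_org_uri(collection: str, accession: str) -> str:
    """Compose a global identifier from a collection namespace and the
    accession number supplied by the data provider."""
    if not collection or not accession:
        raise ValueError("both collection and accession are required")
    # Percent-encode anything unsafe; keep ':' and '.' which are common
    # inside accession numbers.
    return "/".join([
        BASE_URI,
        quote(collection, safe=""),
        quote(accession, safe=":."),
    ])


if __name__ == "__main__":
    # '9606' alone is only unique within the Taxonomy; adding the
    # (assumed) collection namespace makes it globally unique.
    print(identifiers_org_uri("taxonomy", "9606"))
    print(identifiers_org_uri("pubmed", "22140103"))
```

The same accession number under two different collections yields two different URIs, which is what makes the identifier usable outside its original dataset.
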
Although a prototype, Identifiers.org has been adopted by a number of communities and projects, as it fulfils their need for perennial cross-references and removes their previous need to maintain and keep up to date long lists of ever-changing web links (or URLs). As more and more communities realise the benefits of using Identifiers.org URIs, new needs and use cases have appeared. This proposal seeks to strengthen and extend the services provided by the resource in order to respond to those new user requests. We will make the resource easier to use in automated procedures, especially for semantic web applications. This involves providing the content of the Registry in Resource Description Framework (RDF) format and supplying a SPARQL endpoint for query purposes. Users will be able to fine-tune the way identifiers are resolved, via the creation of 'profiles', which will record their preferences. The resource will allow the communities (especially the data providers themselves) to get involved in the maintenance of the Registry. This will take place via a system of "ownership" by data providers of their record in the Registry. Although we currently have automatic systems in place to detect obsolete information, having the actual data providers contribute to the maintenance would ensure better quality of the recorded information, and hence better quality of the services provided. Finally, we will improve and extend the underlying computing infrastructure. By deploying it in more data centres, we will provide more reliable services to an ever-growing number of users. The resulting resource will provide a way to seamlessly link all data annotated with the same URI to represent the same concept, a key step towards data integration. By providing a semantic glue between those datasets, Identifiers.org will facilitate data retrieval, comparison and integration, locally or through the semantic web.
It will also facilitate reasoning on the integrated datasets and lead to new, possibly automated, discoveries in the biomedical domain.

Impact Summary

The new developments of the Identifiers.org infrastructure presented in this project will have several major impacts. Firstly, they will extend the services provided as well as improve their usability. It will thus become easier and more convenient for anybody to use the system. Any data provider that has to record cross-references (that is, most of them) will benefit greatly. Having such a central resource prevents the duplication of effort when maintaining up-to-date lists of URLs. Moreover, it reduces the amount of code necessary for handling identifiers: the same identifier can be stored in the database and directly used in the user interface. This removes the need to convert internal identifiers into resolvable URLs. From the data providers' point of view, the whole cross-reference handling process will become much easier, more efficient and more reliable. Moreover, the proposed developments in semantic web technologies will directly benefit providers of open and linked data. These can be primary data providers (such as UniProt or Ensembl) or secondary providers (such as Bio2RDF or Pathway Commons), some of them already using MIRIAM URIs. Those efforts will gain much from using the same standardised URIs. This will make processes such as validation or integration easier, as entities in different datasets can be linked together, with Identifiers.org URIs providing a key element for linking disparate pieces of information. The discovery and reasoning capabilities added by the planned developments related to RDF and SPARQL will also improve those aspects. Users' specific needs will also be catered for by the provision of customised services. For instance, specific projects and consortia will be able to utilise only the subset of the Registry they need, and to make use of their own preferences.
Finally, as the system better fulfils users' needs (especially as these new features come directly from user requests), it should become more useful and attractive, since specific use cases are catered for. It is therefore expected that the community of users of the Identifiers.org URIs will expand. This will create a feedback loop: as more data providers use those unique URIs, the benefits of the system will increase, making more and more datasets interoperable. All the planned developments will be freely available to both academic and commercial users. This is especially important as numerous companies nowadays provide services based on the distribution, analysis or integration of publicly available datasets. Overall, this project should strengthen the economic competitiveness of the United Kingdom, helping it to be at the forefront of linked data technologies and provision, catering for the next challenge in this domain: the semantic web.
Committee Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme