DataMap is an interactive knowledge platform being developed to help scientists make sense of complex biological systems by leveraging network-based analysis. Built under the UK Dementia Research Institute (UK DRI), DataMap integrates a wide array of biomedical datasets into a unified knowledge graph. This graph-centric approach enables researchers to navigate connections between genes, proteins, diseases, and other biological entities in an intuitive way. The motivation for DataMap stems from the recognition that neurodegenerative diseases like Alzheimer’s, Parkinson’s, and related disorders involve multifaceted molecular interactions and pathways. By bringing diverse data together, DataMap provides a holistic view that can reveal hidden relationships and generate new hypotheses.
A core principle of DataMap is data democratisation – making complex datasets accessible and interpretable to a broad range of stakeholders (from laboratory researchers to computational biologists) through an easy-to-use web application. In parallel, the platform emphasises data integration, harmonising heterogeneous information sources into a single network. This combination of integrative data and user-friendly tools transforms how researchers can query and explore biological knowledge, fostering collaboration and discovery.
Data Integration: Facilitate the harmonisation and utilisation of diverse datasets across the UK DRI. DataMap serves as a hub that unifies multi-modal data – genomics, proteomics, clinical phenotypes, and more – within a single framework. By clustering information around genes in a network, the platform enables novel insights that might be missed in siloed analyses [1]. This gene-centric integration model means that every gene becomes a nexus connecting its associated proteins, functions, diseases, and other attributes.
Data Democratisation: Provide secure and easy access to integrated data for the widest range of stakeholders. The DataMap web interface is being designed to present complex data in an interpretable format, lowering barriers for users who may not have specialised bioinformatics training. Through a web application (secured via institutional login), UK DRI researchers and collaborators can freely explore the knowledge graph. This supports the democratisation of data, ensuring that valuable datasets generated across UK DRI centres are not locked away but instead readily available for exploration [2].
Investigation Tools: Empower researchers with interactive tools to explore the knowledge graph and probe its various layers. DataMap is not just a static database – it’s an analytical platform. Users will be able to search for entities (genes, proteins, pathways, diseases, etc.), visualise networks of interactions, and run queries to uncover relationships. For example, one can identify a set of genes associated with two different diseases or find which signalling pathways and cell types connect a genetic risk factor to clinical phenotypes. These investigation tools turn the underlying data into a dynamic resource for hypothesis generation and data-driven discovery.
One of DataMap’s strengths is the breadth of data sources it brings together. The platform aggregates information from established public databases as well as internal UK DRI research outputs, creating a rich, interconnected knowledge network. Key data sources integrated so far include:
Genomic and Protein Knowledgebases: Ensembl (for comprehensive human gene information, including gene loci and known transcripts) and UniProt (a detailed protein knowledgebase with sequences and functional annotations). These provide the foundational gene and protein identifiers used across the graph, ensuring that all data layers connect through common gene entries.
Molecular Interaction Networks: Protein–protein interaction data from sources like IntAct (experimentally validated interactions) and the Human Reference Interactome (HuRI). HuRI is a systematically generated map of binary protein interactions containing over 52,000 verified protein–protein interactions involving ~8,275 proteins [3], offering a reference network of which proteins directly interact. By incorporating these networks, DataMap allows users to trace molecular pathways and complexes relevant to neurodegenerative disease mechanisms.
Biological Ontologies: Functional and phenotypic context is added via Gene Ontology (GO) and Human Phenotype Ontology (HPO). GO provides standardised terms for the molecular functions, biological processes, and cellular components of genes/proteins, enabling functional enrichment analysis. HPO links genes to clinical phenotypic traits and diseases (especially useful for capturing known gene–disease associations in neurodegeneration). Through these ontologies, a researcher can find, for instance, which biological processes a particular Alzheimer’s-associated gene is involved in, or which genes are linked to phenotypic features like memory impairment or stroke.
Drug and Compound Data: DrugBank is integrated to connect small molecules and therapeutic compounds to their protein targets. This layer introduces pharmacological information into the graph – for any given protein or gene, users can identify known drugs or compounds that interact with it. Conversely, for a given drug, one can see all its documented targets. This is crucial for drug repurposing research: by linking disease-associated genes to drugs, DataMap can help suggest existing medications that might be investigated for treating neurodegenerative conditions.
Cell–Cell Communication Data: Data from CellChat analyses (an R toolkit for inferring cell–cell communication networks from single-cell RNA sequencing data) have been incorporated. This adds a layer of intercellular signalling information, mapping how different cell types (such as neurons, microglia, astrocytes, T-cells, etc.) communicate via ligand–receptor interactions. In the context of dementia research, this helps users explore hypotheses like how certain immune cells may influence neurons through secreted factors. Incorporating CellChat results means that the knowledge graph can link a gene (say, encoding a ligand) to the cell types that produce it and the target cell types that express the corresponding receptor, painting a picture of intercellular communication pathways.
ROSMAP single-nucleus co-expression (Lopes et al., 2025): a module–trait network analysis of single-nucleus RNA-seq data from the ROSMAP cohort. Lopes et al. identified modules of co-expressed genes in seven major brain cell types and linked these modules to Alzheimer’s disease (AD) traits using a Bayesian network [9][10]. DataMap integrates these module–trait networks to represent cell-type-specific gene modules and their associations with AD pathology and cognition.
CoExpROSMAP (bulk RNA-seq co-expression): co-expression networks derived from bulk RNA sequencing of ROSMAP brain tissue. CoExpROSMAP (a resource within the CoExpNets suite) provides four networks from frontal cortex representing control, mild cognitive impairment and AD groups [11]. These gene co-expression clusters and their annotations are incorporated to contextualise differential expression and cell-type inference in the graph.
UK DRI Internal Research Datasets: Importantly, DataMap also integrates bespoke datasets generated by UK DRI researchers. Notably, a comprehensive set of gene–trait association results from the Cardiff University centre has been added. These results come from MAGMA analyses of genome-wide association studies (GWAS) for various neurodegenerative and related traits. For example, gene association data for Alzheimer’s disease (AD), Parkinson’s disease (PD), stroke, white matter hyperintensities (WMH), schizophrenia (SCZ), and others have been included. These analyses not only identify which genes show significant genetic associations with each condition, but also extend to cell type-specific gene sets. In practice, this means DataMap contains layers where each “node set” represents the top genes associated with a particular trait in a particular cell type (e.g. AD risk genes in microglia, PD-associated genes in dopaminergic neurons, etc.). Each such gene set comes with an overall p-value indicating the significance of that cell type’s enrichment for disease-associated genes (from the MAGMA gene-set analysis). By integrating these data, DataMap allows researchers to compare genetic risk profiles across diseases and cell types. For instance, one could query which genes are common between AD and stroke risk (from GWAS) that are highly specific to microglial cells, thereby identifying shared cell-type-specific pathological mechanisms.
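The cell-type-specific comparison described in the last item can be sketched with plain Python sets. This is a minimal illustration, not DataMap's implementation: the `GeneSet` structure, the `shared_genes` helper, the gene names, and the p-values are all hypothetical placeholders rather than real MAGMA results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneSet:
    """One trait-by-cell-type layer (hypothetical structure)."""
    trait: str
    cell_type: str
    p_value: float      # set-level enrichment p-value for this cell type
    genes: frozenset    # top genes associated with the trait in this cell type


# Placeholder entries only; these are NOT real results.
GENE_SETS = [
    GeneSet("AD", "microglia", 1e-6, frozenset({"GENE_A", "GENE_B", "GENE_C"})),
    GeneSet("stroke", "microglia", 3e-3, frozenset({"GENE_B", "GENE_C", "GENE_D"})),
    GeneSet("PD", "dopaminergic neuron", 2e-4, frozenset({"GENE_E"})),
]


def shared_genes(gene_sets, trait_a, trait_b, cell_type, alpha=0.05):
    """Genes found in both traits' gene sets for one cell type.

    Only gene sets whose set-level p-value is below `alpha` are considered.
    """
    def lookup(trait):
        for gs in gene_sets:
            if (gs.trait == trait and gs.cell_type == cell_type
                    and gs.p_value < alpha):
                return gs.genes
        return frozenset()

    return sorted(lookup(trait_a) & lookup(trait_b))


print(shared_genes(GENE_SETS, "AD", "stroke", "microglia"))
```

Filtering on the set-level p-value before intersecting keeps the comparison restricted to cell types that are themselves enriched for each trait.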
All these diverse sources are interlinked in the knowledge graph. The integration is gene-centric: genes (and their protein products) act as key nodes connecting to various data types. For example, a single gene in DataMap will link to: its protein interactions (IntAct/HuRI), its GO terms, associated phenotypes (HPO or GWAS traits), drugs targeting its protein, expression in cell types or communication pathways (CellChat), and so on. This holistic linking enables traversing from one piece of knowledge to another seamlessly. The graph model is also inherently extensible – new nodes and relationship types can be added as additional databases or experimental results become available, without disrupting the existing structure [4]. This flexibility ensures that DataMap stays up-to-date with the latest science, accommodating emerging data (such as new GWAS results or additional omics layers) as the project evolves.
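To make the gene-centric, extensible design concrete, the sketch below stores typed edges in a flat list and collects everything attached to one gene. It is an in-memory toy, not the actual Neo4j schema: the node names are placeholders, and relationship names such as ENCODES and IN_MODULE are assumptions made for illustration.

```python
from collections import defaultdict

# (source, relationship type, target); every name here is a placeholder.
EDGES = [
    ("GENE_A", "ENCODES", "PROTEIN_A"),
    ("GENE_A", "HAS_FUNCTION", "GO_TERM_1"),
    ("GENE_A", "ASSOCIATED_WITH", "TRAIT_X"),
    ("PROTEIN_A", "INTERACTS_WITH", "PROTEIN_B"),
]


def gene_profile(edges, gene):
    """Group the direct neighbours of `gene` by relationship type."""
    profile = defaultdict(set)
    for source, rel_type, target in edges:
        if source == gene:
            profile[rel_type].add(target)
    return {rel_type: sorted(targets) for rel_type, targets in profile.items()}


print(gene_profile(EDGES, "GENE_A"))

# Extending the graph with a new data layer only appends edges;
# existing edges and the profile function stay unchanged.
extended = EDGES + [("GENE_A", "IN_MODULE", "MODULE_7")]
print(gene_profile(extended, "GENE_A"))
```

The second call shows the extensibility argument in miniature: a new relationship type appears in the profile without any change to the existing edges.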
DataMap’s backbone is a graph database (built with Neo4j technology) that stores the integrated knowledge graph. Neo4j allows efficient storage and querying of nodes (entities like genes, proteins, diseases, etc.) and relationships (edges) between them. By utilising a graph database, the platform can execute complex queries that traverse multiple data types with ease – a capability that traditional relational databases struggle with when data is highly interconnected. The UK DRI data science initiative specifically chose a Neo4j-based graph network to map relationships between traits, genetic factors, and other molecular data [5], reflecting the need for a highly relational data model.
Within the DataMap graph, genes are central nodes that cluster various data connections (hence the term “gene-centric” model). Surrounding a gene node, one will find linked nodes representing proteins (which often map one-to-one with genes), diseases/phenotypes, ontology terms, and other entities from the data sources listed above. The relationships are typed: for example, a gene has a HAS_FUNCTION relationship to a GO term, a protein INTERACTS_WITH another protein, and a gene is ASSOCIATED_WITH an Alzheimer’s disease node (based on GWAS evidence). This structured yet flexible schema allows research questions to be expressed directly as graph traversals, e.g. “find all genes associated with both Trait A and Trait B that have a protein–protein interaction between them” or “find drugs that target proteins which are in pathways enriched for a given cell type’s risk genes”.
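The first traversal quoted above could be written as a parameterised Cypher query along the following lines. This is a sketch under assumptions: the node labels (Gene, Protein, Trait), the `name` and `symbol` properties, and the ENCODES relationship are hypothetical, since the actual DataMap schema is not specified here.

```python
# Hypothetical schema: (:Gene)-[:ASSOCIATED_WITH]->(:Trait),
# (:Gene)-[:ENCODES]->(:Protein), (:Protein)-[:INTERACTS_WITH]-(:Protein)
SHARED_TRAIT_PPI_QUERY = """
MATCH (a:Trait {name: $trait_a})<-[:ASSOCIATED_WITH]-(g1:Gene)-[:ASSOCIATED_WITH]->(b:Trait {name: $trait_b})
MATCH (a)<-[:ASSOCIATED_WITH]-(g2:Gene)-[:ASSOCIATED_WITH]->(b)
MATCH (g1)-[:ENCODES]->(:Protein)-[:INTERACTS_WITH]-(:Protein)<-[:ENCODES]-(g2)
WHERE g1.symbol < g2.symbol
RETURN g1.symbol AS gene_1, g2.symbol AS gene_2
"""


def build_shared_trait_ppi_query(trait_a, trait_b):
    """Return (query, parameters) for genes linked to both traits whose
    protein products interact with each other.

    Trait names are passed as query parameters instead of being formatted
    into the query text, which avoids injection and allows plan caching.
    """
    if not trait_a or not trait_b:
        raise ValueError("both trait names are required")
    return SHARED_TRAIT_PPI_QUERY.strip(), {"trait_a": trait_a, "trait_b": trait_b}


query, params = build_shared_trait_ppi_query("Trait A", "Trait B")
print(query)
print(params)
```

With the official Neo4j Python driver, the pair would typically be passed to `session.run(query, params)`. The `g1.symbol < g2.symbol` condition simply reports each interacting gene pair once.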
The web application layer of DataMap sits atop this graph database and provides an interactive user interface. Although still under active development, the interface is envisioned to include search functionality, network visualisations, and analytical widgets. Users will be able to start by searching for an entity of interest (such as a gene name, a disease, or a keyword). The platform will then display a network view centred on that entity, showing connected nodes (and possibly using visual cues to indicate different types of relationships). Users can click on these connections to expand further, browsing the knowledge graph step by step. This is complemented by filters and layer controls, for instance toggling individual data layers (such as drug links or cell communication links) on or off. For those interested in specific analyses, the platform will offer query-building tools and preset queries. Non-programmers will benefit from a user-friendly query interface, while more advanced users (such as bioinformaticians) could directly input custom queries in Cypher (Neo4j’s query language) if they desire fine-grained control.
Beyond exploration, DataMap’s platform will integrate analytical tools. Examples include enrichment analysis (e.g. taking a set of genes and finding over-represented GO terms or pathways), network analytics (identifying hub proteins or subnetworks), and comparison tools for gene sets (useful for comparing, say, the genes associated with two different conditions). By coupling these tools with the underlying data, users can perform on-the-fly evaluations. The ultimate goal is that a researcher can go from a high-level question to a specific insight all within the DataMap environment – without needing to manually cross-reference multiple databases or run separate scripts.
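As an illustration of the enrichment analysis mentioned above, the following sketch computes a one-sided hypergeometric p-value for a single term using only the standard library. The source does not state which statistical test DataMap uses, so the choice of test is an assumption, and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, annotated, query_size, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    universe_size: number of genes in the background
    annotated:     background genes carrying the term (e.g. one GO term)
    query_size:    number of genes in the user's gene set
    overlap:       query genes that carry the term
    """
    upper = min(annotated, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(annotated, k) * comb(universe_size - annotated, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 8 of 50 query genes carry a term that annotates
# 200 of 20,000 background genes.
p = hypergeom_enrichment_p(20_000, 200, 50, 8)
print(f"enrichment p-value: {p:.3g}")
```

In practice the test is repeated for every term, so the resulting p-values would need a multiple-testing correction (for example Benjamini-Hochberg) before interpretation.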
Crucially, since DataMap consolidates data that might otherwise require separate bioinformatics workflows, it saves researchers time and ensures consistency. All data is pre-harmonised (for example, gene IDs are consistent across sources, and up-to-date reference versions are used), so users don’t have to perform the tedious work of data cleaning and integration themselves. This aspect is part of the data democratisation goal: even those less experienced in handling big data can trust the platform to deliver coherent information.
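The kind of identifier harmonisation described here can be sketched as a small normalisation step. This is an assumption about the general approach, not DataMap's pipeline: the helper name and the two-entry lookup table are illustrative, and a real table would be derived from an Ensembl release.

```python
# Tiny illustrative lookup; a real one would come from an Ensembl release.
SYMBOL_TO_ENSEMBL = {
    "APOE": "ENSG00000130203",
    "APP": "ENSG00000142192",
}


def to_ensembl_id(identifier, symbol_map):
    """Map a gene symbol or a (possibly versioned) Ensembl gene ID to an
    unversioned Ensembl gene ID. Returns None if no mapping is known."""
    token = identifier.strip().upper()
    if token.startswith("ENSG"):
        # Drop a version suffix such as ".10" so IDs from different
        # releases compare as equal.
        return token.split(".")[0]
    return symbol_map.get(token)


# Records as they might arrive from different sources.
raw = [" apoe ", "ENSG00000142192.21", "NOT_A_GENE"]
harmonised = [to_ensembl_id(r, SYMBOL_TO_ENSEMBL) for r in raw]
print(harmonised)
```

Identifiers that cannot be mapped come back as `None`, so they can be logged and reviewed instead of being silently dropped.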
DataMap is poised to support a variety of research inquiries in neurodegeneration and beyond. Some compelling use cases include:
Cross-Disease Gene Discovery: A dementia researcher might ask, “Which genes are implicated in both Alzheimer’s disease and stroke?” Using DataMap, they could quickly retrieve the list of genes with significant associations to both AD and stroke (from the integrated GWAS/MAGMA results). Going further, the researcher can inspect whether those overlapping genes are expressed in the same cell types or participate in the same biological pathways – information readily available via linked cell type data and GO annotations. This could highlight shared molecular mechanisms between seemingly distinct conditions, guiding further study into common therapeutic targets or risk factors.
Cell Type-Specific Pathway Exploration: Consider a scenario where an experimental finding points to microglial cells (a type of immune cell in the brain) as important in early Alzheimer’s pathology. With DataMap, one could select the MAGMA gene set “Alzheimer’s disease (AD) – Microglia” to see the top AD-associated genes within microglia-specific expression profiles. The platform can then show the protein–protein interaction network among those genes, revealing whether they cluster in particular sub-networks (for example, several might be part of the complement immune pathway or lipid metabolism). By overlaying GO terms on this gene set, the researcher might discover an over-representation of inflammation-related processes, which aligns with microglial biology. This level of analysis, done interactively, helps refine hypotheses about how specific cell types contribute to disease.
Drug Repurposing Queries: DataMap’s integration of DrugBank opens the door for pharmacological explorations. For example, a clinician-scientist might query, “Find drugs that target proteins involved in Parkinson’s disease and also show associations with amyloid pathology.” Within the graph, this translates to finding proteins that are nodes connecting a PD gene set and an AD pathology phenotype, then listing any drugs connected to those proteins. The result could be a set of candidate compounds that might influence both diseases. Such insights are valuable for drug repurposing – an existing drug for one condition might be tested for efficacy in another if a common molecular target is identified.
Knowledge Retrieval and Hypothesis Generation: Even straightforward queries can be powerful. If one searches DataMap for a gene (say APOE, a well-known Alzheimer’s risk gene), the platform will return a rich profile: APOE’s protein interactions (e.g. interactions with amyloid precursor protein), its GO-annotated functions (lipid transport, etc.), involvement in pathways, any known drug interactions (perhaps certain antibodies or small molecules targeting APOE), phenotypes linked to APOE (like lipid metabolism disorders or AD), and the cell types in which it appears among the top genes in GWAS analyses. By having all this information in one place, a user can generate hypotheses: for instance, seeing APOE’s connection to cardiovascular phenotypes might spark an idea about blood-brain barrier or vascular contributions in dementia. DataMap thus acts as a knowledge synthesis tool: it doesn’t answer research questions outright but provides the connected evidence needed to formulate hypotheses and support them.
The above examples illustrate how DataMap can be used in practice. Importantly, these scenarios would traditionally require consulting multiple databases and doing custom data analysis. With DataMap, the relevant evidence surfaces in a matter of clicks or a single query, underscoring the efficiency of integrated, graph-based data exploration.
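The drug repurposing use case above reduces to a two-step traversal: find proteins connected to both groups of interest, then list the drugs that target them. The sketch below shows that logic on in-memory dictionaries. All protein, drug, and group names are placeholders, and the data structures are assumptions made for illustration, not DataMap's storage model.

```python
# protein -> gene sets / phenotypes it is connected to (placeholder data)
PROTEIN_LINKS = {
    "PROTEIN_A": {"PD gene set", "amyloid pathology"},
    "PROTEIN_B": {"PD gene set"},
    "PROTEIN_C": {"amyloid pathology", "PD gene set"},
}

# drug -> proteins it targets (placeholder data)
DRUG_TARGETS = {
    "DRUG_1": {"PROTEIN_A", "PROTEIN_B"},
    "DRUG_2": {"PROTEIN_B"},
    "DRUG_3": {"PROTEIN_C", "PROTEIN_A"},
}


def candidate_drugs(protein_links, drug_targets, group_a, group_b):
    """Drugs that target at least one protein connected to both groups."""
    bridging = {
        protein
        for protein, linked in protein_links.items()
        if group_a in linked and group_b in linked
    }
    hits = {}
    for drug, targets in drug_targets.items():
        matched = sorted(bridging & targets)
        if matched:
            hits[drug] = matched
    return hits


print(candidate_drugs(PROTEIN_LINKS, DRUG_TARGETS,
                      "PD gene set", "amyloid pathology"))
```

Returning the matched proteins alongside each drug keeps the evidence trail visible, which matters when the output is a list of candidates for follow-up and not a conclusion.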
DataMap is an ongoing project, and its development is iterative and community-informed. In its current state, the platform has established the core infrastructure (the Neo4j graph database and a functional web interface) and has ingested initial datasets (as listed above). Over the coming months, several developments are anticipated:
Expanded Data Integration: Additional datasets are slated for inclusion. These may include transcriptomic data (e.g. gene expression matrices from bulk or single-cell studies), epigenetic data, neuroimaging-derived annotations, and more detailed clinical data. As new GWAS or multi-omics studies on neurodegenerative diseases are published, their findings will be incorporated to keep the knowledge graph current. The design of DataMap makes it straightforward to plug in new data sources – the team can map new data to the existing schema and rapidly expand the graph’s content.
Enhanced User Interface: User testing is guiding refinements of the web portal. Future versions will likely introduce customisable dashboards, the ability to save query results or network views, and integration with analysis notebooks for those who want to do deeper dives (for example, exporting a sub-network for further analysis in R or Python). There is also consideration for implementing natural language query support, so that users could ask questions in plain English which the system translates into graph queries – an approach that could further lower the barrier for non-technical users.
Collaboration and Sharing: As DataMap matures, features to facilitate collaboration will be introduced. Users may be able to share specific views or findings with colleagues through the platform, or annotate nodes/relationships with their own notes (e.g. flagging a novel finding or hypothesis related to a gene). Given the secure nature of the system (with login-based access for UK DRI affiliates), it can also host unpublished or sensitive data. In the future, there might be tiered access where certain data is public and some remain internal, balancing openness with privacy as needed.
Interoperability and Standards: The team is ensuring that DataMap follows data standards and is interoperable with other tools. For instance, all entries use standard identifiers (Ensembl IDs for genes, UniProt IDs for proteins, CURIEs for ontology terms, etc.), which makes it easier to cross-link with external resources. The knowledge graph could be made queryable via APIs or exportable in formats like RDF, allowing other applications or analyses to tap into DataMap’s content. This means a researcher could programmatically fetch DataMap results for use in their own pipelines, extending the platform’s utility beyond the web interface.
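To show how standard identifiers make an RDF export straightforward, the sketch below expands CURIEs into IRIs and writes edges as N-Triples lines. It is a hedged illustration only: the predicate namespace (`https://example.org/datamap/rel/`) is a hypothetical placeholder, the prefix table covers just four common prefixes, and the source does not describe DataMap's actual export format.

```python
# Commonly used IRI prefixes for the identifier schemes mentioned above.
PREFIXES = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "ensembl": "https://identifiers.org/ensembl:",
    "uniprot": "http://purl.uniprot.org/uniprot/",
}

# Hypothetical namespace for relationship types (an assumption).
PREDICATE_BASE = "https://example.org/datamap/rel/"


def expand_curie(curie):
    """Turn a CURIE such as 'GO:0006869' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or malformed CURIE: {curie!r}")
    return PREFIXES[prefix] + local_id


def to_ntriples(edges):
    """Serialise (subject CURIE, relationship, object CURIE) edges as N-Triples."""
    lines = [
        f"<{expand_curie(s)}> <{PREDICATE_BASE}{rel}> <{expand_curie(o)}> ."
        for s, rel, o in edges
    ]
    return "\n".join(lines)


# Illustrative edge: the APOE gene linked to the GO term for lipid transport.
edges = [("ensembl:ENSG00000130203", "HAS_FUNCTION", "GO:0006869")]
print(to_ntriples(edges))
```

Rejecting unknown prefixes up front is a deliberate choice: a silently malformed IRI would break cross-linking with external resources, which is the point of using standard identifiers.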
DataMap draws inspiration from successful knowledge integration efforts. For example, the Clinical Knowledge Graph (CKG) by Santos et al. demonstrated the power of an open-source platform with tens of millions of nodes representing experimental data, public databases, and literature, all connected to aid clinical proteomics interpretation [6]. Likewise, the AD Atlas project integrated over 20 large Alzheimer’s studies into a network covering ~20,000 genes and various omics, providing a global view of AD biology [7]. These projects show that network-based approaches can greatly enhance data interpretation and hypothesis generation. DataMap is building on this concept within the context of dementia research, aiming to be the go-to knowledge graph for UK DRI scientists. While CKG and AD Atlas are specific exemplars (proteomics-focused and AD-focused respectively), DataMap’s scope is both broader (covering multiple diseases and data types) and more tailored to UK DRI’s collaborative research environment. By learning from those efforts, DataMap is implementing best practices for data integration and user engagement, while focusing on the specific needs of neurodegeneration researchers.
In summary, DataMap represents a significant step towards connected science in the field of neurodegenerative disease. By consolidating data and providing interactive tools, it reduces the fragmentation of knowledge and empowers researchers to approach complex questions from multiple angles. The platform’s development is a collaborative effort, and feedback from its users – researchers, students, and data scientists – is actively shaping its features. As the project progresses, DataMap is expected to evolve into an indispensable resource that accelerates discoveries, fosters new collaborations, and ultimately contributes to a deeper understanding of diseases like Alzheimer’s and Parkinson’s. Through data integration and democratisation, it embodies a modern approach to biomedical research, where knowledge is not just stored but actively connected and put to work for the benefit of science and health.
[1] [2] Core Informatics Documentation Center | DataMap Documentation
https://wiki.datamap.ukdri.ac.uk/
[3] A reference map of the human binary protein interactome - PMC
https://pmc.ncbi.nlm.nih.gov/articles/PMC7169983/
[4] [6] A knowledge graph to interpret clinical proteomics data | Nature Biotechnology
https://www.nature.com/articles/s41587-021-01145-6
[5] Data science | UK DRI
https://www.ukdri.ac.uk/our-science/approaches/data-science
[7] Scholars@Duke publication: An Integrated Molecular Atlas of Alzheimer’s Disease
https://scholars.duke.edu/publication/1497234