Pathfinding approaches to ML-Guided knowledge exploration #9

Open
mbrush opened this issue Jan 6, 2018 · 2 comments

Comments

@mbrush
Contributor

mbrush commented Jan 6, 2018

Use an integrated neo4j database to explore how human and machine learning agents might collaborate to extract evidence from knowledge graphs to derive predictions and mechanistic hypotheses.

Tasks

  1. Load a neo4j instance with diverse data types from Monarch and SemMed DB databases.
  2. Define and optimize cypher 'pathfinding' query templates.
  3. Apply templates toward answering selected CQs - start with 'positive control' queries that look for paths through the graph providing evidence supporting a known fact/mechanism (e.g. ALDH2 as a known modifier of FA, or cyclodextrin as a successful re-purposing for Niemann-Pick disease).
  4. Manually explore query results by evaluating types of paths returned, defining rules/approaches to identify most meaningful evidence, and refining queries to hone in on these paths in the data.
  5. Explore machine learning approaches to automate this process, and derive evidence-based predictions from data in knowledge graphs.
  6. Explore approaches/interfaces for human intervention in this process - i.e. how to present underlying rationale for automated predictions in a way that allows human users to evaluate the evidence, refine and extend queries based on this, and inform new experiments and analyses.
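
As a rough illustration of Task 2, a pathfinding query template could be generated along these lines. This is only a sketch under assumptions: the `id` node property, the `build_path_query` helper, and the hop cap of 6 are hypothetical and not taken from the Monarch or SemMedDB schemas. To my knowledge Cypher does not accept query parameters for relationship types or variable-length bounds, so those are validated and interpolated into the text, while node identifiers and the row limit stay as parameters.

```python
import re

# Hypothetical property name used to identify nodes; the real graph may differ.
ID_PROPERTY = "id"


def build_path_query(max_hops, rel_types=None, shortest_only=False):
    """Return a Cypher query string for paths between $source and $target.

    Relationship types and the hop bound are checked and interpolated;
    node identifiers and the row limit remain query parameters.
    """
    if not isinstance(max_hops, int) or not 1 <= max_hops <= 6:
        raise ValueError("max_hops must be an integer between 1 and 6")
    rel_filter = ""
    if rel_types:
        for rel in rel_types:
            # Reject anything that is not a plain identifier to avoid injection.
            if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", rel):
                raise ValueError(f"unsafe relationship type: {rel!r}")
        rel_filter = ":" + "|".join(rel_types)
    pattern = f"(a)-[{rel_filter}*..{max_hops}]-(b)"
    if shortest_only:
        pattern = f"allShortestPaths({pattern})"
    return (
        f"MATCH (a {{{ID_PROPERTY}: $source}}), (b {{{ID_PROPERTY}: $target}})\n"
        f"MATCH p = {pattern}\n"
        "RETURN p LIMIT $limit"
    )


query = build_path_query(3, rel_types=["causes", "interacts_with"])
```

With the official neo4j Python driver, a string like this would typically be passed to `session.run(query, source=..., target=..., limit=...)`. Starting with `shortest_only=True` and a small hop bound is one way to keep early queries cheap before widening the search.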

Goals

  1. Understand data and modeling requirements for this type of approach
  2. Inform architectural requirements for BB and reasoner applications - particularly w.r.t automated/machine learning methods that can help weight evidence and make predictions, and interfaces for human intervention in refining and extending ML results.
  3. Provide end-to-end examples of what open-ended, ML-guided exploration and discovery in the Translator might look like in practice.

Valued Expertise

  1. Monarch/SemMedDB data
  2. Cypher query language and graph-based algorithms (e.g. for pathfinding, traversals, edge-weighting)
  3. Visualization of graph data and paths
  4. Machine learning approaches
@mbrush
Contributor Author

mbrush commented Jan 6, 2018

TL;DR (earlier/longer notes from which the above summary was condensed)


To date, most CQ notebooks have explored relatively simple retrieval and faceting type queries. But the real utility of the Translator will be supporting more open-ended exploration of data, enabling users to populate a blackboard with knowledge that drives serendipitous discovery and novel insight.

Given the graph-based nature of much of the knowledge in the Translator system, 'pathfinding' operations are one potentially useful approach for this type of exploration and blackboard construction. Here, the system would return paths through the graph connecting entities of interest, and allow users to filter and facet these paths to hone in on those representing meaningful evidence in support of their larger question or use case.

For example, given a set of candidate FA modifier genes, explore paths linking these genes to FA in the data to provide evidence for prioritizing/ranking these candidates and to suggest possible mechanisms of action. Here, we have positive controls that we can start with, as ALDH2, ADH5, and TGFbeta are known modifiers with established mechanisms. We will write pathfinding cypher queries that return all paths through the Monarch and SemMedDB data connecting these genes to FA, and explore requirements and approaches for refining/constraining these paths to hone in on those representing the most meaningful evidence.

Example: Return all paths between Aldh2 and FA -> filter, facet, and expand results to identify the most meaningful paths that support this known fact and that might have led to its hypothesis before its official discovery.
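
To make the "return all paths, then facet" idea concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the edges, node names, and edge types are invented placeholders (not real Monarch/SemMedDB content), and a real run would execute a query against neo4j rather than walking a Python dictionary. It enumerates simple paths between two nodes up to a hop limit, then groups them by their sequence of edge types so a user could inspect one facet at a time.

```python
from collections import Counter

# Toy, invented edges purely for illustration: (subject, edge_type, object).
# Node and edge names are placeholders, not real Monarch/SemMedDB content.
EDGES = [
    ("gene:G1", "interacts_with", "gene:G2"),
    ("gene:G2", "causes", "disease:D1"),
    ("gene:G1", "has_phenotype", "phenotype:P1"),
    ("disease:D1", "has_phenotype", "phenotype:P1"),
    ("gene:G1", "interacts_with", "gene:G3"),
    ("gene:G3", "causes", "disease:D1"),
]


def build_adjacency(edges):
    """Index edges in both directions so paths can ignore edge direction."""
    adjacency = {}
    for subj, edge_type, obj in edges:
        adjacency.setdefault(subj, []).append((edge_type, obj))
        adjacency.setdefault(obj, []).append((edge_type, subj))
    return adjacency


def simple_paths(adjacency, source, target, max_hops):
    """Yield every path without repeated nodes as [node, edge, node, ...]."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) // 2 >= max_hops:
            continue
        for edge_type, neighbor in adjacency.get(node, []):
            if neighbor not in path[::2]:  # nodes sit at even positions
                stack.append((neighbor, path + [edge_type, neighbor]))


def facet_by_edge_types(paths):
    """Count paths per edge-type sequence (one facet per sequence)."""
    return Counter(tuple(path[1::2]) for path in paths)


adjacency = build_adjacency(EDGES)
paths = list(simple_paths(adjacency, "gene:G1", "disease:D1", max_hops=3))
facets = facet_by_edge_types(paths)
```

On this toy graph, `facets` groups the three paths found into two edge-type sequences. On real data the number of paths grows quickly with `max_hops`, which is why faceting and filtering matter.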

Using these controls for pilot experiments, we can think about the types of evidence that would support inference to these answers, the types of data that would support these inferences, how the data would have to be modeled, queried, and presented to users to support such inference, and the tooling required to support these tasks.
e.g. for arriving at Aldh2 as an FA modifier:

  • what types of information would represent evidence for this?
  • can we present this to users given our knowledge sources?
  • what would this look like (e.g. are there paths through the data that could be presented to a user?)
  • how would the system help guide a user to the meaningful evidence amidst the noise?

Ultimately, we hope that this exercise will inform requirements for many aspects of Translator development:

  • With respect to the data, this can help us to understand where gaps are, and inform ingest of new data types and sources, and how the data should be modeled.
  • With respect to the blackboard system, it can inform requirements for query functionality, navigation and visualization interfaces, and computational methods (e.g. embedded analysis functionality, or machine learning based filtering or faceting support) that would be required to support meaningful insight and discovery. These will be useful for pilot systems like BT explorer, tk.bio, and NCATS bb prototype, as well as reasoner applications . . .
  • With respect to user engagement, it will provide an end-to-end example of what open-ended exploration and discovery in the Translator means and looks like in practice - critical for shared understanding of data and approach.

Tasks:

  1. Build Data Graph: Load a single neo4j database (on an ncats aws server) with data from Monarch-SciGraph, which contains semantically integrated data from a variety of curated biological/biomedical knowledgebases, with a focus on genotype-phenotype related resources. Possibly load additional/complementary graph databases (e.g. SemMedDB) into this neo4j instance, normalizing where possible to facilitate traversal across databases/sources. The goal here is to approximate the ability to do pathfinding across the distributed architecture currently being explored in the Translator.

  2. Evaluate Interface: Create a neo4j explorer interface to allow query and visualization of results

  3. Define/Optimize Pathfinding Queries: Use cypher query language to write 'pathfinding' queries (e.g. show all paths through the graph connecting entity1 to entity2). Likely to be computationally expensive, so will require optimization of the query and/or the data. Cypher may offer specific query constructs and algorithms for pathfinding analyses.

  4. Visualization of Results: Query results will be large numbers of 'paths', which are inherently difficult to display and process in a way that supports comprehensive understanding by humans. We will need to explore output formats and approaches to visualizing, summarizing, and operating on results so as to allow efficient and actionable human understanding.

  5. Query Refinement: A key task will be refining queries to hone in on the most meaningful paths through the data.

  6. Evaluation and Expansion: If successful, we will identify a small subset of paths that provide meaningful evidence for our query, which an expert can evaluate and use to seed further exploration of the data.

@mbrush changed the title from "Explore pathfinding approaches to Translator knowledge exploration" to "Pathfinding approaches to Translator knowledge exploration" on Jan 6, 2018
@stuppie
Contributor

stuppie commented Jan 6, 2018

Regarding Task 5 (but also probably 3 and 4), I'm thinking a machine learning approach may be useful here. It could work similarly to prediction in drug repurposing: given a set of known drug-disease pairs, the paths (and edge types) that connect these known "true" pairs are selected for and weighted more strongly than those that don't connect them (or are less useful for doing so). A technique like that could be applied here to refine queries and try to select more meaningful paths. I or @veleritas could look into this more deeply...
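
A deliberately simplified sketch of that weighting idea follows. It is a stand-in, not the method used in the drug repurposing work: it scores each edge-type sequence by a smoothed log-odds of appearing between known "true" pairs versus background pairs, where a real approach would more likely train a proper classifier on richer path features. The training data, the `learn_weights` and `rank_paths` helpers, and the smoothing constant are all invented for illustration.

```python
import math
from collections import Counter


def learn_weights(positive_paths, background_paths, smoothing=1.0):
    """Return a smoothed log-odds weight per edge-type sequence.

    Sequences seen relatively more often between known "true" pairs than
    between background pairs get positive weights; the reverse get negative.
    """
    pos = Counter(positive_paths)
    neg = Counter(background_paths)
    vocabulary = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocabulary)
    neg_total = sum(neg.values()) + smoothing * len(vocabulary)
    return {
        seq: math.log((pos[seq] + smoothing) / pos_total)
        - math.log((neg[seq] + smoothing) / neg_total)
        for seq in vocabulary
    }


def rank_paths(candidate_paths, weights):
    """Sort candidate sequences by learned weight; unseen sequences score 0."""
    return sorted(
        candidate_paths, key=lambda seq: weights.get(seq, 0.0), reverse=True
    )


# Invented training data: each item is the edge-type sequence of one path.
positive = [("interacts_with", "causes")] * 8 + [("coexists_with", "coexists_with")] * 2
background = [("interacts_with", "causes")] * 2 + [("coexists_with", "coexists_with")] * 8

weights = learn_weights(positive, background)
ranked = rank_paths(
    [("coexists_with", "coexists_with"), ("interacts_with", "causes")], weights
)
```

The learned weights could then feed back into query refinement, for example by restricting a pathfinding query to the edge types in the highest-weighted sequences.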

@mbrush changed the title from "Pathfinding approaches to Translator knowledge exploration" to "Pathfinding approaches to ML-Guided knowledge exploration" on Jan 8, 2018