Edge: a framework for developing collective understanding [EdgeProject]

dynamic sub-graph assembly versus data warehousing

Not sure if this discussion fits here or should wait for the larger, technical part of the proposal. Either way, we will need to explain why we think we should do this in a dynamic manner (query the web on demand) rather than building an all-encompassing data warehouse as the first step, and how we are realistically going to make this happen.

Why: The Web evolves constantly, so keeping a warehouse in sync is very difficult, and maintaining a very large knowledge graph (e.g. a SPARQL server) is expensive. If answers could be gathered automatically via distributed services that can go up and come down, the system would be more sustainable and more extensible.

How: The workflows would be constrained according to a set of important semantic types (genes, drugs, diseases). Based on these constraints, we would map out paths (stories) to compose the workflows. A plugin/registry system such as BioCatalogue would be used to identify functional services that would meet the requests. The outputs would be modeled in nanopublication RDF, integrated, and delivered to the client for rendering. Note that this would probably not be instantaneous.
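To make the path-mapping idea concrete, here is a minimal sketch of how workflows might be composed from a registry of services typed by semantic type. Everything in it is an assumption for illustration: the `Service` record, the in-memory `REGISTRY`, the service names, and the `find_paths` helper are hypothetical and do not reflect BioCatalogue's actual API or any existing Edge code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical registry entry: a service that maps one semantic type to another."""
    name: str
    input_type: str
    output_type: str


# Hypothetical stand-in for a registry lookup (assumed, not a real catalogue).
REGISTRY = [
    Service("gene_to_disease", "gene", "disease"),
    Service("drug_to_gene", "drug", "gene"),
    Service("disease_to_drug", "disease", "drug"),
    Service("drug_to_disease", "drug", "disease"),
]


def find_paths(registry, start, goal, max_steps=3):
    """Enumerate service chains ("stories") leading from start type to goal type.

    Depth-first search that never revisits a semantic type and stops
    after max_steps services, so the enumeration always terminates.
    """
    paths = []

    def extend(current_type, path, seen):
        if current_type == goal and path:
            paths.append(list(path))
            return
        if len(path) == max_steps:
            return
        for service in registry:
            if service.input_type == current_type and service.output_type not in seen:
                path.append(service)
                extend(service.output_type, path, seen | {service.output_type})
                path.pop()

    extend(start, [], {start})
    return paths


for path in find_paths(REGISTRY, "drug", "disease"):
    print(" -> ".join(service.name for service in path))
```

Each returned path is a candidate workflow. In a real system the services along a path would be called in order and their outputs converted to nanopublication RDF before integration; that step is left out here.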

Another point that we need to address is data cleaning. One reason for the warehouse approach is that testing procedures can be built in efficiently to clean the data. This is much more challenging in a dynamic approach. We can argue that we make use of sources that already have at least some level of data cleaning in place, and that we leave the rest to the user. By supplying users with tools to annotate and even change links themselves, we give them power.

Cite this as
Benjamin Good, Kristina Hettne (2016) dynamic sub-graph assembly versus data warehousing. Thinklab. doi:10.15363/thinklab.d172

Creative Commons License