DMTN-271

Butler management of quantum graph storage and execution

Abstract

This technote proposes new data structures and interfaces for storing and accessing graphs of predicted processing (QuantumGraph objects), with the goal of making them usable for reporting on provenance after processing has completed.

Note

This technote is a work-in-progress.

Current Status

QuantumGraph

The QuantumGraph class is our main representation of processing predictions. In addition to information about task executions (“quanta”) and the datasets they consume and produce, it stores dimension records and sometimes datastore records, allowing tasks to be executed with a QuantumBackedButler that avoids database operations until the entire graph has been executed. The QuantumGraph on-disk format has been carefully designed to be space-efficient while still allowing fast reads of individual quanta (even over object stores and HTTP), which is important for how the graph is used to execute tasks at scale.

In terms of supporting execution, the current QuantumGraph is broadly fine, but we would like to start using the same (or closely-related) interfaces and file format for provenance, i.e. reporting on the relationships between datasets and quanta after execution has completed. On this front the current QuantumGraph has some critical limitations:

  • It requires all task classes to be imported and all task configs to be evaluated when it is read from disk, and because of the way lsst.pex.config works, evaluating a config means executing code. This makes it very hard to keep a persisted QuantumGraph readable as the software stack changes, and in some contexts it could be a security problem. The PipelineGraph class (which did not exist when QuantumGraph was introduced) has a serialization system that avoids this problem, and QuantumGraph could delegate to it if we were to integrate the two more closely.

  • QuantumGraph is built on a directed acyclic graph of quantum-quantum edges; dataset information is stored separately. Provenance queries often treat the datasets as primary and the quanta as secondary, and for this a “bipartite” graph structure (with quantum-dataset and dataset-quantum edges) would be more natural and would let us better leverage third-party libraries like networkx.

  • For provenance queries, we often want a shallow but broad load of the graph, in which we read the relationships between many quanta and datasets in order to traverse the graph, but do not read the details of the vast majority of either. The current on-disk format is actually already well-suited for this, but the in-memory data structure and interface are not.

  • Provenance queries can sometimes span multiple RUN collections, and hence those queries may span multiple files on disk. QuantumGraph currently expects a one-to-one mapping between instances and files. While we could add a layer on top of QuantumGraph to facilitate queries over multiple files, it may make more sense to have a new in-memory interface that operates directly on the stored outputs of multiple runs. This gets particularly complicated when those runs have overlapping data IDs and tasks, e.g. a “rescue” run that was intended to fix problems in a previous one.

  • While the same graph structure and much of the same metadata (data IDs in particular) are relevant for both execution and provenance queries, some information is needed in the graph only for execution (dimension and datastore records, which are duplicated into the graph solely to avoid database hits during execution), and some status information can only be present in post-execution provenance graphs.
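The bipartite structure and the “shallow but broad” traversal described in the list above can be illustrated with a minimal sketch. It uses only the standard library; the class name, method names, and the quantum and dataset identifiers are all hypothetical and are not part of QuantumGraph or any proposed interface. Nodes carry only a kind and an identifier, so the relationships can be walked without loading any per-node detail.

```python
from collections import defaultdict, deque


class BipartiteProvenanceGraph:
    """Toy bipartite graph (hypothetical, for illustration only).

    Every edge runs either dataset -> quantum (an input) or
    quantum -> dataset (an output); there are no quantum-quantum edges.
    """

    def __init__(self):
        self._successors = defaultdict(set)
        self._predecessors = defaultdict(set)

    def add_quantum(self, quantum_id, inputs, outputs):
        """Record one quantum together with the datasets it reads and writes."""
        quantum = ("quantum", quantum_id)
        for dataset_id in inputs:
            dataset = ("dataset", dataset_id)
            self._successors[dataset].add(quantum)
            self._predecessors[quantum].add(dataset)
        for dataset_id in outputs:
            dataset = ("dataset", dataset_id)
            self._successors[quantum].add(dataset)
            self._predecessors[dataset].add(quantum)

    def ancestors(self, node):
        """Return every node upstream of `node` via breadth-first search."""
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for parent in self._predecessors.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen

    def upstream(self, dataset_id, kind):
        """Identifiers of upstream nodes of one kind ('dataset' or 'quantum')."""
        nodes = self.ancestors(("dataset", dataset_id))
        return sorted(ident for node_kind, ident in nodes if node_kind == kind)


# Made-up identifiers; real graphs would use UUIDs and data IDs.
graph = BipartiteProvenanceGraph()
graph.add_quantum("isr-1", inputs=["raw-1"], outputs=["postISR-1"])
graph.add_quantum("calibrate-1", inputs=["postISR-1"], outputs=["calexp-1"])
graph.add_quantum("isr-2", inputs=["raw-2"], outputs=["postISR-2"])

inputs_of_calexp = graph.upstream("calexp-1", "dataset")
quanta_of_calexp = graph.upstream("calexp-1", "quantum")
```

Here the query starts from a dataset and treats quanta as intermediate nodes, which is the dataset-primary view that the quantum-quantum structure makes awkward. A library such as networkx offers equivalent traversals on a directed graph; the hand-rolled version is used only to keep the sketch dependency-free.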

Finally, while not a direct problem for provenance, the QuantumGraph serialization system is currently complicated by a lot of data ID / dimension record [de]normalization logic. This was extremely important for storage and memory efficiency, but it’s something we think we can avoid in this rework almost entirely, largely by making lsst.daf.butler.Quantum instances only when needed, rather than using them as a key part of the internal representation.
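One way to picture “making instances only when needed” is a store that keeps compact rows internally and builds a richer object only on request. The sketch below is an assumption-laden illustration: `LazyQuantumStore`, `ExpandedQuantum`, and the dimension names are hypothetical stand-ins, and nothing here reflects the actual lsst.daf.butler.Quantum constructor or the proposed file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpandedQuantum:
    """Hypothetical stand-in for a rich, expensive-to-build quantum object."""

    task_label: str
    data_id: dict


class LazyQuantumStore:
    """Keeps compact rows internally; builds rich objects only on request."""

    def __init__(self, dimensions_by_task):
        # Dimension names are stored once per task, not once per quantum.
        self._dimensions_by_task = dimensions_by_task
        self._rows = {}
        self.expand_count = 0  # counts how many rich objects were built

    def add(self, quantum_id, task_label, values):
        """Store a quantum as a (task label, data ID values) row."""
        self._rows[quantum_id] = (task_label, tuple(values))

    def ids_for_task(self, task_label):
        """Cheap query that never constructs a rich object."""
        return sorted(
            qid for qid, (label, _) in self._rows.items() if label == task_label
        )

    def expand(self, quantum_id):
        """Build the rich object for a single quantum on demand."""
        task_label, values = self._rows[quantum_id]
        names = self._dimensions_by_task[task_label]
        self.expand_count += 1
        return ExpandedQuantum(task_label, dict(zip(names, values)))


store = LazyQuantumStore({"isr": ("instrument", "exposure", "detector")})
store.add("q1", "isr", ("HypoCam", 101, 4))
store.add("q2", "isr", ("HypoCam", 101, 5))

all_isr = store.ids_for_task("isr")  # broad query, no expansion
one = store.expand("q2")             # detail for a single quantum
```

Because the rich object is not part of the internal representation, there is no per-quantum structure that needs to be normalized on write and denormalized on read; only the rows that are actually inspected pay the expansion cost.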

QuantumProvenanceGraph

The QuantumProvenanceGraph class (which backs the pipetask report tool) is constructed from a sequence of already-executed QuantumGraph instances and a Butler. It traverses the graphs, queries the butler for dataset existence and task metadata, and assembles this into a summary of task and dataset provenance that is persisted to JSON, along with summary-of-the-summary tables of counts.

QuantumProvenanceGraph is not itself persistable (the summary that can be saved drops all relationship information), and it is focused mostly on tracking the evolution of particular problematic quanta and datasets across multiple runs. While it is already very useful, for the purposes of this technote it is best considered a testbed and prototyping effort for figuring out what kinds of provenance information one important class of users (campaign pilots, and developers acting as mini-campaign pilots) needs, and especially how best to classify the many combinations of statuses that can coexist in a multi-run processing campaign. Eventually we hope to incorporate all of the lessons learned into a new, more efficient provenance system in which the backing data is fully managed by the Butler.
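To make the multi-run classification problem concrete, the following sketch reduces per-run statuses for each quantum to a single campaign-level label. The status names, the labels (“successful”, “recovered”, “unresolved”), and the function names are assumptions chosen for illustration; they are not the categories QuantumProvenanceGraph actually uses. Quanta are keyed by task label and a data ID reduced to a single visit number.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical per-run quantum statuses, for illustration only."""

    SUCCESSFUL = "successful"
    FAILED = "failed"
    BLOCKED = "blocked"  # never ran because an upstream quantum failed


def collect_history(runs):
    """Group statuses by quantum key across runs.

    `runs` is a list of (run_name, {quantum_key: Status}) pairs ordered
    from oldest to newest.
    """
    history = {}
    for run_name, statuses in runs:
        for key, status in statuses.items():
            history.setdefault(key, []).append((run_name, status))
    return history


def classify(attempts):
    """Reduce one quantum's attempts to a campaign-level label."""
    statuses = [status for _, status in attempts]
    if statuses[-1] is not Status.SUCCESSFUL:
        return "unresolved"
    if any(status is not Status.SUCCESSFUL for status in statuses[:-1]):
        return "recovered"
    return "successful"


# An original run followed by a "rescue" run over overlapping quanta.
runs = [
    ("run/original", {
        ("isr", 1): Status.SUCCESSFUL,
        ("isr", 2): Status.FAILED,
        ("calibrate", 2): Status.BLOCKED,
    }),
    ("run/rescue", {
        ("isr", 2): Status.SUCCESSFUL,
        ("calibrate", 2): Status.FAILED,
    }),
]

history = collect_history(runs)
summary = {key: classify(attempts) for key, attempts in history.items()}
```

Even with three statuses and two runs, a quantum that was blocked and then failed is a different situation from one that failed twice; a real classification would need to preserve that distinction, which this sketch deliberately flattens.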

References