Summary: | Distributed and parallel applications not only have distributed state but are often
inherently non-deterministic, making them significantly more challenging to monitor and
debug. A further challenge in working with such applications is the fundamental
requirement of determining the order in which the application performs certain
actions. A naive approach to ordering actions
would be to impose a single total order on all actions, i.e., given any two actions or events,
one must happen before the other. Such a global order, however, is often misleading: two
events in two different processes may be causally independent even though one occurred
before the other in time. A partial order of events, therefore, serves as the fundamental
data structure for ordering events in distributed and parallel applications.
Traditionally, Fidge/Mattern vector timestamps have been used for representing event
partial orders. The size of each vector timestamp depends on the number of parallel entities
(traces) in the application, e.g., processes or threads. A major limitation of Fidge/Mattern
timestamps is that their total size does not scale to large systems with hundreds
or thousands of traces. Taylor proposed an efficient offset-based scheme that represents
large event partial orders by encoding the deltas between timestamps of successive events.
The offset-based schemes have been shown to be significantly more space-efficient when
the traces that communicate the most are placed close to each other in the trace order used
for generating the deltas (offsets).
In Taylor’s offset-based schemes, the optimal order of traces is computed offline. In this
work, we adapt the offset-based schemes to dynamically reorder traces and demonstrate
that very efficient, scalable representations of event partial orders can be generated in an
online setting, requiring as few as 100 bytes/event for storing partial order event data for
applications with around 1000 processes.
|
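
As a rough illustration of the ideas summarized above, the following Python sketch shows standard vector timestamps, the causal-precedence test they support, and a simple delta between successive timestamps. It is a minimal sketch under stated assumptions: the function names (`local_event`, `receive_event`, `happened_before`, `concurrent`, `sparse_delta`, `apply_delta`) are hypothetical, and the delta shown is a plain component-wise difference chosen for illustration. It is not Taylor's offset encoding, nor the dynamic reordering scheme developed in this work.

```python
from typing import Dict, List

Timestamp = List[int]  # one counter per trace (process or thread)


def local_event(clock: Timestamp, trace: int) -> Timestamp:
    """Timestamp for a local (or send) event on `trace`."""
    new = list(clock)
    new[trace] += 1
    return new


def receive_event(clock: Timestamp, trace: int, msg_ts: Timestamp) -> Timestamp:
    """Timestamp for a receive event: component-wise max, then tick."""
    merged = [max(x, y) for x, y in zip(clock, msg_ts)]
    merged[trace] += 1
    return merged


def happened_before(a: Timestamp, b: Timestamp) -> bool:
    """True if the event stamped `a` causally precedes the one stamped `b`."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a: Timestamp, b: Timestamp) -> bool:
    """True if neither event causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)


def sparse_delta(prev: Timestamp, curr: Timestamp) -> Dict[int, int]:
    """Illustrative delta (an assumption, not Taylor's scheme):
    keep only the components that changed between successive events."""
    return {i: c - p for i, (p, c) in enumerate(zip(prev, curr)) if c != p}


def apply_delta(prev: Timestamp, delta: Dict[int, int]) -> Timestamp:
    """Rebuild the full timestamp from its predecessor and a sparse delta."""
    curr = list(prev)
    for i, d in delta.items():
        curr[i] += d
    return curr


# Three traces: trace 0 sends a message that trace 1 receives;
# trace 2 acts independently.
zero = [0, 0, 0]
a = local_event(zero, 0)       # send on trace 0
b = local_event(zero, 1)       # local event on trace 1
c = receive_event(b, 1, a)     # trace 1 receives the message from trace 0
d = local_event(zero, 2)       # unrelated event on trace 2

print(happened_before(a, c))   # True: the send precedes the receive
print(concurrent(a, d))        # True: causally independent events
print(sparse_delta(b, c))      # {0: 1, 1: 1}
```

A full timestamp costs one integer per trace for every event, which is the scaling problem noted above, whereas a delta stores only the components that changed. In this sketch the delta is indexed by trace number, so it does not capture the effect of trace ordering on offset size that the offset-based schemes exploit.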