Samza, Cycles, and Streaming for People
Today, I’m delighted to announce the 0.2 release of the
coast project: a high-level streaming toolkit written in Scala.
coast is designed around Kafka’s partitioned log model, and supports complex streaming topologies with unusually strong messaging guarantees and no need for a central coordinator. The current release includes a new backend that compiles to Samza and supports exactly-once semantics for messages and state, support for cyclic dataflow graphs, and a bunch of improvements to the core library and documentation.
I’m pretty excited about this release; the core of the library has taken shape, and the new Samza backend is growing close to the guarantees and performance I’ve been aiming for from the very beginning. There’s plenty to do still – but if you’re interested, it’s a great time to get involved.
A Little Motivation
I’ve been thinking back a bit on the core ideas in
coast – where they came from, and why I think they matter. I hope to write more on this in the next few weeks… but for now, I’ll leave you with a quick example.
Connected-component labelling – where you label all the nodes in a graph, such that two nodes have the same label iff there’s a path between them – is one of the classic problems in graph theory. It’s also one of the staples of the MapReduce literature: it’s one of the simpler problems that takes repeated iterations to solve, and there are some elegant and performant approaches that make good use of MapReduce’s distributed computation style. In a recent talk, Nathan Marz uses it as an example of the kind of processing that’s best left to a batch system; streaming systems can build on the batch output, supplementing it with approximate but up-to-the-minute results.
Last night, I put together a brand-new streaming implementation of connected components adapted from MapReduce-based algorithm above. It’s not significantly more complex than any batching implementation I’ve written, and it’s completely incremental: the node-to-component mapping is updated as new edges come in, and the new labellings are published out on another stream. If another system’s interested in the results, it can just subscribe to that output stream and react to any changes instead of polling or recalculating from scratch.
Now, this implementation’s far from perfect. I’m not quite happy with the graph structure, and I’m pretty sure there are ways to send fewer messages and keep less state. On the other hand, I’m excited to be able to introduce this classic batch-processing algorithm to the streaming world – and in a way that doesn’t sacrifice the guarantees that have made MapReduce-style frameworks so useful.
I’ve long felt that most of the challenges in building streaming systems come from the immaturity of the tools, not the inherent complexity of the problems we’re solving. With
coast, I’d like to narrow that gap a bit.