Barely Functional #1
Writing a Real Program in Haskell
A lot of folks have been complaining about the gap between knowing elementary Haskell — the sort you learn in language tutorials — and real-world programming. Even when you’re comfortable with the core language, it’s not always obvious where to go from there, or how to translate those ideas into a larger system.
Me, I’m just past this point. I’ve written little bits of Haskell off and on over the last year or so, but in the last weeks, I’ve been working on what could fairly be described as ‘real code’. It’s still fairly early on, but I think there’s enough there that someone might find it interesting. I’ll talk a little bit about the project itself, but mostly I plan to write about the experience of translating ideas into working Haskell code.
I’m hoping this will be of interest to those who have read one of the many excellent introductory guides and are looking at what the next steps are like, or experienced Haskellers who are interested in what the late-beginner experience looks like these days. You might even be able to get something out of it without any Haskell experience, if you’re willing to squint past the code examples.
Ethereum
This story revolves around my ongoing Haskell reimplementation of Ethereum.
The Ethereum project describes itself as a ‘platform for decentralized applications’. In a nutshell, it takes the blockchain concept that Bitcoin developed and extends it, such that you can have the same guarantees for arbitrary computations and state as Bitcoin gives you for financial transactions. It’s pretty early on, and there are a lot of issues still to be worked out, but I’m excited to see if the ideas that made Bitcoin work can be extended to a general-purpose infrastructure.
Crypto-geekery aside, there are a few other factors that made this a nice language-learning project:
Ethereum’s internals are very well documented, and the yellow paper explains the core datastructures and algorithms in a language-independent way. Porting code from one language to another is simpler when you try and maintain the same structure in the source and target languages, but that produces unidiomatic and foreign-feeling code. Writing against a spec makes it easier to avoid these compromises.
Ethereum is already a polyglot project, with implementations in C++, Go, Python, Java, and JavaScript. A lot of design issues just aren’t obvious in a blog post code example, and a second implementation by the same author is often cleaner in any language. With the same core ideas written independantly in multiple languages, I had a chance to see how these differences work themselves out over an entire codebase.
The Etherium spec covers a lot of ground, from parsing and serialization to datastructures, crypto, networking, and virtual machines. This provides an opportunity to get a sense of how the language holds up in a variety of different situations. On a practical note, it makes it a little harder to get bored. (A factor widely undervalued in side projects.)
Learning a language is tough enough without also having to decide what to implement. Emulating the choices of an established project — even when they’re not choices I would have make — lets me worry about the how, instead of the what.
Ethereum’s a neat project, and I’ve enjoyed going through what has turned out to be a pretty careful and well-considered spec. However, the goal of these posts is not an intro to Ethereum, so I’ll be skipping over a lot of details. If you’re interested in this stuff, feel free to follow the links and references scattered throughout.
For now, we’ll start where I started: parsing and serialization.
RLP Encoding
Recursive Length Prefix, or RLP, is the standard data serialization mechanism in Ethereum. Like JSON, it defines an intermediate data representation with minimal structure and a canonical serialization. If you define a way to represent your datatypes in this format, you get a serialization and some structural transformations ‘for free’. Unline JSON, RLP is optomized for simple binary messages; it defines fewer types, and they serialize to compact binaries instead of human-readable strings.
An RLP value is called an ‘item’. An item may be either a string of bytes or a list of items, like so:
data Item = String ByteString | List [Item]
And here’s a handful of items, for flavour:
"foo"
[]
[[], "bar", ["baz", ""]]
""
[[["\NUL"],["",""],"one"],[["two"],[]],["six"],"4531", []]
I’m using the text-style notation for strings, but keep in mind that they’re just a flat sequence of bytes.
The rest of the spec goes on to define a canonical serialization for any item. In short:
- Strings are serialized on the wire without escaping.
- A string containing a single small byte takes no prefix. All other strings are prefixed by length and a type flag.
- A list of items is encoded by concatenating the serialized elements, and prefixing that with its length and type. (This is the eponymous Recursive Length Prefixing at work.)
There’s a couple other nuances, but overall things are pretty simple. Here’s my serialization function, in full:
encode :: Item -> ByteString
encode x = case x of
String bytes ->
if BS.length bytes == 1 && BS.head bytes < 0x80 then bytes
else encodeLength 0x80 bytes
List children -> encodeLength 0xc0 (mconcat $ map encode children)
where
encodeLength :: Word8 -> ByteString -> ByteString
encodeLength offset bytes
| len <= 55 = prefix len <> bytes
| otherwise =
let lenBytes = encodeInt len
-- offsets from 0x80 + 55 = 0xb7 for strings
-- and 0xc0 + 55 = 0xF7 for lists
lenLen = BS.length lenBytes + 55
in prefix lenLen <> lenBytes <> bytes
where
len = BS.length bytes
prefix n = BS.singleton $ offset + fromIntegral n
Deserialization’s a bit more involved, and uses the attoparsec library for the parsing. You can see the complete implementation on GitHub. For the rest of the spec, check the wiki or Appendix C in the yellow paper. (The implementation above mirrors the sample python implementation in the wiki quite closely.)
General Comments
Overall, this implementation came pretty smoothly. I spent a little too much time trying to pick a library for binary parsing / formatting: I’m aware of three different libraries for writing data to ByteStrings, and three for reading data back, and they’re not the same three. binary
and cereal
in particular are very similar. In the end, I went with raw ByteString concatenation and attoparsec. The code turned out nicely enough, so I doubt I’ll be revisiting that choice unless a performance issue crops up.
In retrospect, attoparsec is probably overkill for this simple format, but I was glad to get an opportunity to play around with it. I found the API very easy to pick up: there’s a few functions to define basic parsers, a couple to run the entire parser, and the rest of the work is done by the standard Haskell typeclasses. (Combining parsers together, defining parsers that take no input, etc.) One thing I’ve come to like about Haskell is that, while the experience of programming ‘inside a monad’ feels odd at first, all monads feel more or less the same. This makes it easy to walk up and down the ladder of abstraction: the code talks about parsing logic where it needs to, and everything else feels more or less like business as usual.
This turned out to be a pretty nice place to start the project: it’s a well-encapsulated problem, and the solution feels pleasant in Haskell and not too foreign. I’ve also been happy with the approach of building the program ‘inside out’, starting with a few small pieces and then composing them into larger and larger units. I’ve had to make very few changes to this code as the program grew around it, but having this code functional has helped a lot in deciding what the higher-level pieces need to look like.
Testing
The first implementation I put together had several bugs. We could chalk that up to sloppiness, but I prefer (for my personal education and self-esteem) to try and figure out why a bit of code was particularly buggy. Sometimes it turns out I just need to be more careful, but there’s often a more fundamental issue there somewhere.
Anyone with exposure to Haskell will be familiar with the “once the code typechecks, it usually works” meme. I also believe this: in a type-rich environment, most errors seem to be caught at compile time. This code, however, has very little type information floating around: it’s building up datastructures from raw bits, and all bits look more or less the same. There’s no way for the compiler to know that my particular interpretation of those bits doesn’t match the spec. In languages with a pleasant type-system, many programmers know the feeling where the types are guiding them to a particular solution. In cases like this, though, the environment is much more homogenous — and that’s where these tools are less useful, and we need to be more careful.
Or, we could just use QuickCheck.
I feel like maybe QuickCheck gets lost in the whirlwind of new terminology that young Haskellers are exposed to, which is sad, because it’s astonishingly useful. It has a much better payoff-per-unit-effort than ordinary unit testing, and it’s also a useful conceptual tool: it encourages you to take the properties your implementation ought to have and make them explicit, even when you can’t pack that information into the typesystem. In the future, I wouldn’t consider code like this ‘finished’ without a couple high-level property checks like this.
A full QuickCheck tutorial is somewhat out of scope here, but here’s my code for generating arbitrary RLP items, with some size management so we don’t recurse forever:
instance Arbitrary RLP.Item where
arbitrary = sized item
where
item n
| n <= 2 = strGen
| otherwise = oneof [strGen, resize (n `div` 2) listGen ]
strGen = RLP.String <$> arbitrary
listGen = RLP.List <$> arbitrary
And two of the 5-10 properties I’m checking:
it "should decode what it encodes" $ property $ \rlp ->
(RLP.decode . RLP.encode $ rlp) `shouldBe` Just rlp
it "should encode small bytes as themselves" $ property $ \byte ->
byte < 0x80 ==>
let oneByte = BS.pack [byte]
in RLP.encode (RLP.String oneByte) `shouldBe` oneByte
This turns out to be much more effective than unit testing. My first compiling implementation worked against all the examples from the wiki, but adding just the two properties above flushed out three or four latent issues. I’ve added around twenty-five new test inputs since then (from the shared Ethereum test suite) with no new failures. Some of the bugs QuickCheck found would only show up on very specific examples — like a list containing a single string, which consists of a single byte with a value less than 128. Testing at that level of granularity is infeasible to do by hand, so the automation is more than welcome.
Other Implementations
I’ve briefly compared my code with the official Go and Python. I chose these two because they’re two of the most actively developed, and I have some passing familiarity with the respective languages. I’ve tried not to go too far out on a limb, but please let me know if anything’s off base.
The implementation of decoding / encoding in Haskell is about half the size of the equivalent Python and Go. It looks like most of the difference is made up in the parser code. (Attoparsec gets a lot of credit for this.)
Both the Python and Go implementations of
encode
can throw a type error at runtime. For Python, this is, of course, perfectly normal. In Go, it’s because the parameter is passed as aninterface{}
, which can contain any value. (Go has no way to declare a sum or ‘union’ type.) Based on some messages in the Go mailing list, this is considered not a big deal, is unlikely to change, and switching on the type tag is the idiomatic solution.Interestingly, this switch statement also includes about a bunch of other types, like various numeric types, so the set of all legal inputs is very large. Conversely, the Python code will fail if passed anything other than a string or list. This makes the Python more strict about typing than the Go, which is a bit surprising.
Overall, the differences between the implementations were pretty minor: they all have roughly the same structure, and they all expose similar APIs. I like my own implementation, naturally, but I don’t think there are any very compelling differences here yet.
Bootstrapping a Project
Sadly, much of the time it took to get this code working was spent on project setup. There’s not a lot of great information out there about bootstrapping a new Haskell/Cabal project, and that makes much more time-consuming than it needs to be. Some tips from hard-won experience:
- Use Cabal’s sandboxes. Everything you read about cabal-dev is out of date. However, you’ll probably need a more recent version of cabal than your package manager provides. Fortunately,
cabal-install
can install itself; runcabal install cabal-install
and check it’s showing up on your path correctly. - The Cabal documentation suggests using the
detailed
test runner. Sadly, this does not actually work with any of the popular testing frameworks. This may change one day, but for now, stick withexitcode-stdio
. - Use
HSpec
for unit tests. It has a superset of the functionality in the other frameworks, including inline QuickCheck properties, and produces well-organized test suites with attractive and readable output.
This post has some other pointers for setting up a project, though you may not need all of that when you’re just getting started. I like the suggestion of using a project template, though I have not tried it.
Moving Along
Of course, this is still only half the work… we can convert from RLP to raw bytes, but we still need to convert our datatypes to RLP. This turns out to be more interesting than it sounds, and is where the languages’ implementations really start to diverge. It’s also a nice chance to mess around with DeriveGeneric
, and discuss the merits of typeclasses vs. explicit dictionary-passing.