Barely Functional #2

Structuring Data and Scrapping Boilerplate

This is my second post on a Haskell codebase I’ve been working on – a reimplementation of the Ethereum cryptocurrency / application platform in Haskell. (If you missed it, you might want to read the original post first.) Like last time, we’ll take some ideas from the original project, look at how they translate into Haskell, and compare things with one of the official implementations.

Last time we looked at the ‘RLP encoding’, a JSON-like data representation format that’s used heavily in Ethereum. As I’ve fleshed out the rest of the project, I’ve also built up a small library of support code that makes this representation much easier to deal with. In this post we’ll go through a few different iterations of the library, with a short walkthrough of GHC.Generics and some comments on working with type classes.

Round 1: Type Classes and Convertibles

The last post revolved around a simple data type called Item:

data Item = String ByteString | List [Item]

We also looked at a couple functions that translate back and forth between Items and strings of bytes:

encode :: Item -> ByteString
decode :: ByteString -> Maybe Item

Together, these make up the RLP encoding. RLP is used everywhere in Ethereum: to define the messaging protocol, to calculate hashes of data, and as part of the mining algorithm itself. Almost every structure mentioned in the specification comes with a canonical representation in RLP. This is a lot like the way JSON is often used: at the ‘edges’ of a system, it’s often useful to be able to handle many data types generically, and this sort of intermediate representation is very useful.

While JSON’s a common representation on the wire, it’s very rarely a nice structure to use internally, being simultaneously too specific (supporting only a small set of types) and too general (allowing more structures than our application expects). It’s therefore common to validate JSON at the edges of our system, converting it to more appropriate structures for internal use. In Haskell, with its culture of expressive types and static checking, this is even more important – no use in having a nice typesystem if everything has the same type!

For this to work, though, moving back and forth needs to be easy, or you risk drowning out your important logic in a sea of manual conversions. In the wildly popular aeson library, this need is handled by a couple of classes: ToJSON and FromJSON. To implement these classes for a type, you need to provide functions to convert back and forth between your type and aeson’s representation of a JSON value. In return, you get a standard conversion interface across all types; you also get to write functionality that works for everything with a JSON representation, which is occasionally very useful indeed.

I started off with a slightly simplified version of that approach. Since I didn’t need the flexibility of separate ‘to’ and ‘from’ classes, I collapsed things down to one:

class Convertible a where
  toItem :: a -> Item
  fromItem :: Item -> Maybe a

This looks a lot like the types of the encode and decode functions above: toItem converts our type a into an RLP Item; and fromItem tries to convert it back, returning Nothing if it can’t make sense of the input.

A few instances are particularly obvious:

-- We can represent an Item as itself
instance Convertible Item where
  toItem item = item
  fromItem item = Just item

-- A ByteString just needs to be wrapped / unwrapped
instance Convertible ByteString where
  toItem bs = String bs
  fromItem (String bs) = Just bs
  fromItem _ = Nothing

-- If we can convert the elements, we can convert a list
instance Convertible a => Convertible [a] where
  toItem xs = List $ map toItem xs
  fromItem (List xs) = mapM fromItem xs
  fromItem _ = Nothing

We can do the same for any type which has a well-defined representation in RLP. Here’s a quick one for a 32-byte hash:

newtype Digest = Digest ByteString

instance Convertible Digest where
  toItem (Digest bytes) = toItem bytes
  fromItem item = do
    bytes <- fromItem item
    guard $ BS.length bytes == 32
    return $ Digest bytes

Notice how we can use our existing Convertible instance for ByteString in our definition for Digest.

Many libraries – aeson, binary, cereal, and so on down the alphabet – use type classes for this kind of conversion; the common structure and interface makes them simpler and more uniform to use. I was a bit surprised, though, at how it makes them significantly easier to write. Like Digest above, most of the conversions in the project lean heavily on other conversions for simpler types. Using the type class, the compiler chooses the right conversions for those pieces automatically – and the usual methods of composition in Haskell make it easy to put the whole thing together.

Round 2: Scrapping Boilerplate with GHC.Generics

With those basic building blocks in place, I started translating more data types and their RLP representations from the specification into Haskell. Here’s one that represents a peer in the Ethereum network:

data Peer = Peer
  { peerAddress :: HostAddress
  , peerPort :: Int
  , peerUniqueId :: Digest
  }

This has a pretty direct encoding in RLP: our three-field structure translates to a three-element list, and each element translates using its own canonical representation. In a notation where RLP(x) is a function from x to its representation, this can be written a bit more formally.

RLP(peer) = [ RLP(address), RLP(port), RLP(id) ]

Assuming we’ve already defined Convertible for HostAddress, Int, and so on, we can write this out directly.

instance Convertible Peer where
  toItem (Peer address port id) = 
    toItem [toItem address, toItem port, toItem id]
  fromItem (List [address, port, id]) = 
    Peer <$> fromItem address <*> fromItem port <*> fromItem id
  fromItem _ = Nothing

In some respects this is pretty nice; it’s easy to see how the Haskell data type corresponds to the RLP version, and the conversions for the field types are easy to reuse.

After writing several such Convertible instances, though, this becomes less appealing. It turns out that most of the structures Ethereum defines use the equivalent encoding, where a n-field structure becomes an n-element list. Writing these instances requires very little thought – the translation between Peer and the corresponding Item is perfectly mechanical – but we still need to mention each field four times in a consistent order. This all adds up to quite a bit of code, with a correspondingly large error surface.

I had hoped to sit down, hammer out all the instances, and move on to more fun and interesting things… but it turned out to be just too much, and I started looking for ways to abstract this out. In other languages we might reach for macros, or code generation, or runtime reflection. Haskell, on the other hand, has its own little cottage industry of ways to scrap exactly this variety of boilerplate. Right now, the most fashionable of these appears to be GHC.Generics.

GHC.Generics is composed of a few things:

A small set of data types that, taken together, are rich enough to represent Haskell’s sum-of-products data declarations.
A type class, Generic a, that allows you to convert between a type a and its generic representation. It’s like our Convertible type class above, with forward and backward conversion functions, except the type of the generic representation may vary depending on the type of a.
A language extension, DeriveGeneric, that enables GHC to derive a correct representation automatically.

Together, those are enough to translate from a custom data type to a generic version with very little effort. It’s still up to the programmer to write logic to use that representation for their specific task – but once that’s done, you have a function that works over a whole universe of data types.

If you’re not familiar with GHC.Generics, you’ll want to peruse the fairly extensive haddocks, since I’ll address some things only glancingly. (It may help to follow along with their serialization example as we go, since the structure is similar to what we’re doing here.) If you’re not interested in the minutae of generic programming just yet, feel free to skip on ahead to the next section.

The usual way to implement a new piece of generic functionality is to create a new type class with instances for each of the primitive Generics types; the compiler can then put these instances together to match the precise shape of a full generic representation. We’ll look at a generic function that captures the pattern we noticed above, where n-field constructors are converted to n-element lists and all the fields use their own canonical conversions. Since the constructor is the product in the sum-of-products representation, I’ve unimaginatively labelled this the ‘product’ representation for a type.

Our new type class is based on the existing Convertible class, but with a few necessary changes. First, since our representation always has a list at the top level, the type class can be specialized a bit: the decoding function takes a list of items, and the encoding function returns a list. While decoding, we’ll sometimes find ourselves with both a partially-decoded value and some leftover input – this code takes the easy way out and passes the leftovers around explicitly. (Haskell’s standard reads function does the something similar.) At some points we’ll expect the input to be exhausted, though, so there’s a helper function that checks this and returns just the result.

To be able to define the instances properly for the generic-representation types, the class also needs to be higher-kinded. This complicates type signatures slightly, but it has no effect on the implementation.

class ConvertProduct f where
  -- Given a list of items, try and decode a value. We
  -- pass along any unused elements in the tail.
  partialFromItems :: [Item] -> Maybe (f a, [Item])
  -- Convert a value into a list of items.
  productToItems :: f a -> [Item]

-- Like 'partialFromItems' above, but requires that 
-- all input is used.
productFromItems :: ConvertProduct f => [Item] -> Maybe (f a)
productFromItems list = partialFromItems list >>= complete
  where
    complete (x, []) = Just x
    complete _ = Nothing

Now, we need to handle each case one-by-one. Let’s start with the simplest case: U1, which represents a constructor with zero fields. This maps pretty nicely to the empty list, and can ignore any input.

instance ConvertProduct U1 where
  partialFromItems x = Just (U1, x)
  productToItems _ = []

K1 i a represents a single field of type a. We want to use the canonical representations for the fields, so we require a Convertible instance for a and use it to handle a single element of our product.

instance Convertible a => ConvertProduct (K1 i a) where
  partialFromItems (item : rest) = do
    x <- fromItem item
    return (K1 x, rest)
  partialFromItems [] = Nothing
  productToItems (K1 x) = [toItem x]

a :*: b is used to encode products with more that one field; it’s isomorphic to the type of pairs (a, b). (We can get products of more than two elements by nesting pairs.) When encoding, we just need to encode each half and concatenate. When decoding, we do first the left, then the right, threading the leftovers through.

instance (ConvertProduct a, ConvertProduct b) => ConvertProduct (a :*: b) where
  partialFromItems list = do
    (a, rest0) <- partialFromItems list
    (b, rest1) <- partialFromItems rest0
    return (a :*: b, rest1)
  productToItems (a :*: b) = productToItems a ++ productToItems b

a :+: b is used to encode sum types (ie. types with more than one constructor) – it’s isomorphic to Either a b. As with pairs, data types with more than two constructors can be handled by nesting. This instance works by tring the first element in the sum first, falling back to the other one if that fails. Since the sum is always outermost, we expect no leftovers here.

instance (ConvertProduct a, ConvertProduct b) => ConvertProduct (a :+: b) where
  partialFromItems items = do
    let left = L1 <$> productFromItems items
        right = R1 <$> productFromItems items
    product <- left <|> right
    return (product, [])
  productToItems (L1 left) = productToItems left
  productToItems (R1 right) = productToItems right

Finally, M1 is just a carrier of metadata, with no associated structure. In our case, we can just ignore it.

instance ConvertProduct a => ConvertProduct (M1 i c a) where
  partialFromItems x = do
    (y, rest) <- partialFromItems x
    return (M1 y, rest)
  productToItems (M1 x) = productToItems x

Once all those instances have been defined, we can use this ConvertProduct class to provide a default implementation for Convertible. (Since that requires an extra couple classes, we need to add an explicit signature for the default implementation.)

{-# LANGUAGE DefaultSignatures #-} 
import GHC.Generics as G

class Convertible a where

  toItem :: a -> Item
  default toItem :: (Generic a, ConvertProduct (Rep a))
                 => a -> Item
  toItem x = List $ productToItems $ G.from x

  fromItem :: Item -> Maybe a
  default fromItem :: (Generic a, ConvertProduct (Rep a))
                   => Item -> Maybe a
  fromItem (List list) = G.to <$> productFromItems list
  fromItem _ = Nothing

Whew! That’s it for the generic implementation. Now all we need to do in our data declarations is derive Generic and instantiate the class.

{-# LANGUAGE DeriveGeneric #-} 
import GHC.Generics(Generic)

data Peer = Peer
  { peerAddress :: HostAddress
  , peerPort :: Int
  , peerUniqueId :: Digest
  } deriving Generic

instance Convertible Peer

That’s much better – the whole conversion is determined by the shape of the data and the conversions we’ve defined for the field types elsewhere, so there’s nothing extra to specify. If that default implementation doesn’t do what we want, it’s as easy as always to provide our own implementations.

I had been vaguely familiar with GHC.Generics before starting on this – and I’d used DeriveGeneric once to derive some aeson instances for a record, which gave me some confidence I could do the same for RLP. I had almost no idea how it worked internally, though – most of the complexity of the generic functions is hidden from the client – so I was learning all this pretty much from scratch.

I found (and still find) some of the types difficult to follow. All the representation types are higher-kinded, but the type argument is never used; many of the type constructors have several additional type parameters and a few type synonyms, but I needed only a small subset of them. I understand why this has to be there – that extra information is occasionally necessary – but it makes intimidating reading for a newcomer.

Thankfully, a few things made this simpler than it could have been. Perhaps most importantly, the serialization example included in the docs is precisely my use-case: it takes advantage of the structure, but ignores the field names and other metadata. It was much easier to adapt that code than to write it up from scratch.

I also got quite a bit of help from the compiler. While the variety of type parameters and aliases are daunting, they’re based on just a few concrete data types. There are only a few obvious ways to line these up – and once I got the structure right, type inference took care of the rest. It was also helpful that GHC was deriving the actual representation types: if I was missing a ConvertProduct instance, the compiler would complain and spit out exactly the bit of the representation it couldn’t handle yet. This made the correspondence between my data types and their generic representations more obvious, and made it simpler to build up the implementation piecemeal.

All this work paid off – the code simplification elsewhere was dramatic. In the manual Convertible implementation for Peer above, we have to mention each field at least three times; in the second one it’s only mentioned once, right when it’s defined. Clearing out that redundant information leaves more space on the screen – and in the mind – for the code that actually matters.

Round 3: Splitting Out a Value-Level Converter

This already covers a lot of ground, but as I got deeper into the spec, I found some important cases that it doesn’t cover. To tease these out, we’ll look at a simplified example from Ethereum’s messaging protocol. I’ve removed a handful of message types and fields, but all the relevant details should be there.

The Hello message is sent as part of the handshake, and gives the peer some information about the sending host:

data Hello = Hello
  { protocolVersion :: Int
  , networkId :: Int
  -- etc.
  }

GetPeers message asks the connected peer to info on other peers to connect to:

data GetPeers = GetPeers

The Peers message is the response, containing a list of known peers. (That Peer type in the list is the one we defined above.)

data Peers = Peers [Peer]

And finally, the Message type has a union of the three. Incoming messages are typed as Message; we can pattern-match on the different constructors to recover the appropriate type.

data Message =
    HelloMsg Hello
  | GetPeersMsg GetPeers
  | PeersMsg Peers
  -- etc.

We can’t use our existing deriving mechanism here for a couple of reasons:

To help discriminate between messages, Ethereum requires that each message is prepended with a distinct integer. (eg. Hello encodes as [ 0x00, RLP(version), RLP(network id) ], and GetPeers as [ 0x10 ]. This is almost the same as the product representation we wrote above, but that code’s locked up in the type class.
For some types with single-field constructors, we need to use just the encoding of that field. For example, HelloMsg hello should have precisely the same encoding as hello does, without any additional wrapping. (Our ‘product’ encoding would wrap the value up in a single-element list, but that doesn’t match the representation in the spec.)
In some cases, we need both these behaviours. The Peers type is encoded as:

[ 0x11, RLP(peer_0), RLP(peer_1), ... ]

This is the RLP encoding of the underlying [Peer] list, with the integer prefix jammed on to that. We’ll need to combine the above two behaviours together somehow.

While it would be possible – if tedious – to implement these cases by hand, it turns out it’s also possible to extend the Convertible machinery to make these definitions easy, and without making the common case more painful.

The Convertible class, like all type classes in Haskell, does a few things: it groups a set of values into a ‘dictionary’, it ensures only one implementation exists for each type, and it passes that dictionary behind-the-scenes to functions that require it. The grouping of related functionality is useful: the encoding and decoding methods always come in pairs, so it makes sense to pass them around and operate on them together. Having an implementation that’s implicitly passed around is also pretty handy: Ethereum defines a canonical serialization for each type, so it’s useful that a client can just say toItem whatever and have the right conversion picked out automatically.

On the other hand, the uniqueness restriction is a pain. Consider our Hello message, where the specified RLP encoding is the exactly like the ‘product’ encoding above, but with a single tag at the front of the list: it would be nice if we could just write a function that took a Convertible instance as input, returning a new one which prepends and strips the tag as necessary. But the only way to pass a type class as an argument is implicitly – and returning one is impossible.

Like many problems in computing, this tension can be resolved by adding another layer of indirection. We define a new Converter data type which holds our two transformation functions, and reduce the Convertible class to a single function that returns a Converter. In this way, we can define as many Converters as we like, and still mark one out as the canonical one. While we’re at it, we can factor out the default implementation to a top-level binding as well. Here’s the whole thing:

data Converter a = Converter 
  { convertToRLP :: a -> Item
  , convertFromRLP :: Item -> Maybe a
  }

-- We can factor out the default 'product' conversion
asProduct :: (Generic a, ConvertProduct (Rep a))
          => Converter a
asProduct = Converter to from
  where
    to input = List $ productToItems $ G.from input
    from (List list) = G.to <$> productFromItems list
    from _ = Nothing

-- ...and use that Converter as the default implementation
class Convertible a where
  converter :: Converter a
  default converter :: (Generic a, ConvertProduct (Rep a)) 
                    => Converter a
  converter = asProduct

The cost of this is a little extra work when implementing Convertible from scratch. When implementing the old Convertible class, we just had two write to functions; in the new version, we need to write a third function that returns a Converter with our two functions in it. Thankfully, we can define our old toItem and fromItem methods in terms of these new abstractions, so none of the usages need to change.

toItem :: Convertible a => a -> Item
toItem = convertToRLP converter

fromItem :: Convertible a => Item -> Maybe a
fromItem = convertFromRLP converter

With that machinery in place, we can write a simple function that takes a converter and returns a new one that handles tagged representations. When the original converter encodes a value as an RLP List, it prepends the tag to the list and passes it on. When decoding, it checks that the tag matches – if it matches, it strips the tag and delegates the rest of the decoding to the underlying converter.

tagged :: Int -> Converter a -> Converter a
tagged n conv = Converter to from
  where
    tag = toItem n
    to input = case convertToRLP conv input of
      List list -> List (tag:list)
      other -> other
    from (List (x:xs)) | x == tag = 
      convertFromRLP conv $ List xs
    from (List _) = Nothing
    from item = convertFromRLP conv item

And with that, we can now write a lightweight instance for Hello:

instance Convertible Hello where 
  converter = tagged 0x00 asProduct

There’s only one bit of functionality missing – we’d like some help defining conversions for types like our Message type, where every constructor has a single field, and we just want to use the representation for that field without wrapping. It looks quite a bit like our first generic function, and this post is already creaking under the weight of all the code, so I’ll leave it out. (If you’re interested, you can find it in Convert.hs on GitHub.) In the end, though, we get a type signature suspiciously similar to asProduct’s:

asUnderlying :: (Generic a, ConvertUnderlying (Rep a)) 
             => Converter a

…and the rest of our instance declarations fall into place:

instance Convertible GetPeers where 
  converter = tagged 0x10 asProduct

instance Convertible Peers where 
  converter = tagged 0x11 asUnderlying

instance Convertible Message where 
  converter = asUnderlying

I think that’s about as declarative as this is going to get.

My first attempt at supporting this functionality involved complicating the original ConvertProduct type class – I added an extra function for the tag, which needed a phantom type, and I special-cased the single-field constructor, and I still had to write a bunch of stuff by hand. Splitting out the old type class members into a regular data type made the code much easier to follow, and made these new features much easier to implement. Reasonable people have pointed out that some things just work better at the value level than at the type class level, and that seems to be the case here. It’s a pain we have to do this manually, since GHC is supposedly doing something quite similar behind the scenes, but I don’t see any obvious alternatives that don’t cause problems elsewhere.

General Thoughts

This turned out to be a lot more library than I was expecting to need – mostly because I grossly underestimated the number and repetitiveness of the data types that the spec defined. GHC extensions can also be a bit of a rabbit hole, so I’d meant to avoid learning more until I’d written a bit more code. In fact, pretty much everything after that first type class was stuff that I had been hoping to get away without.

Still, I had to write all those instances somehow – and while it’s hard to quantify, I feel this library has more than paid for itself. Judging by the few handwritten Convertible instances, doing everything manually would have taken significantly more code – the library is bigger, but the instances are dramatically smaller. Shifting that code out of the main codebase leaves more space and focus for the code that does interesting work, or the few data types where the representation in RLP is unusual or complex. Making patterns into actual language-level values, like we did with our ‘product’ representation, is also inherently useful: you just need to get things correct once, and the rest if the codebase can share the benefits.

Unfortunately, my first generic implementation was not correct: given a very particular shape of data type and the right RLP data, decoding failed when it should have succeeded. None of my existing types had that shape, nor did any types in my unit tests – I only caught this one with a close reading. I feel a bit of a hypocrite after singing the praises of QuickCheck last time around, since covering odd cases like this is exactly what property tests is for, and I haven’t managed to write any property tests yet. This is not for lack of trying, though; I haven’t been able to work around the fact that the generic function needs to be tested against values of many different types, but an individual QuickCheck generator always returns values of the same type. It seems like there ought to be a solution for this, but I haven’t come across it yet.

Other Implementations

In the last post, we looked quickly at the standard Go and Python implementation of the RLP encoding, comparing them to each other and to the Haskell version. I was hoping to do the same again, but it’s a bit trickier this time: the implementations differ much more than last time, and the interesting bits are spread across many files. To simplify things, I’ll focus on the implementation in Go. As always, code comparisons should be taken with several grains of salt; I encourage you to follow the links to the original code and draw your own conclusions.

The rough equivalent of our RLP Item is the Value type in value.go. While our definition of Item closely mirrors the spec, the Value class can contain a variety of numeric types as well. (I assume this is an optimization, but I haven’t found how or what it helps.) Most of the code consists of dynamic type checks and casts, as well as some methods for indexing into and taking ranges of lists. (If you make a nonsensical cast, or ask for a missing field, the Value type returns a default value instead of an error; this means that even nonsensical input will often be ‘successfully’ parsed.)

The Go implementation has no direct equivalent to the Convertible type class. Conversions to the handful of primitive types (numeric types, byte strings) live as methods on the Value type, and the RLP serialization method handles them directly. The equivalent of the Message type is in messaging.go, but that only handles the type tag and leaves the rest of the fields in their encoded form – the conversions for the message data are mostly in another file, peer.go, interleaved with some networking logic. Except for numeric conversions, which are handled within the Value type, there’s no consistent way to move back and forth between a type and it’s RLP representation. There’s also no Go equivalent to our use of GHC.Generics to capture composite structures. Sometimes, conversions between a Go class and the equivalent RLP are written manually; often there’s no explicit Go class at all, and the Value type is used raw.

In general, the eth-go code works at a low level of abstraction, converting between representations by hand and leaving the expected structure implicit. I’d expect a higher-level interface to look something like Go’s standard JSON library – which uses a mix of interfaces, reflection, and mutation to automatically translate back and forth between data types and JSON – but I haven’t worked out enough details to know how feasible / easy that would be, or to compare the convenience of that approach with the Haskell version above.

It’s difficult to make a fair comparison between the two codebases, since the two approaches diverge so much, but some general properties are fairly clear. Counting just the core type-agnostic abstractions (including the generic programming bits) the Haskell code is maybe half the size of the Go implementation; the gap widens when you include the conversions for specific types, since GHC is doing a bunch of work that the Go programmer has to do by hand. However, the Go code is less abstract, and it involves less conceptual machinery.

It’s interesting to see how these basic infrastructural things have a big impact on the rest of their respective codebases. Hopefully, we’ll get a chance to go through some of the bigger ones in future posts.

Conclusions

Now that this infrastructure’s out of the way, we can start digging into the algorithmic heart of Ethereum. Next time, I expect we’ll look at the Merkle trees used for cryptographically authenticated state – with brief digressions into free monads, initial vs. final encodings, and other fun and interesting things.