Friday, August 08, 2008

Google Protocol Buffers and Streaming

Okay, I'm not going to give some huge rundown on RPC or services or wire size vs. CPU efficiency. Just one observation that gave me a bit of a "huh?" moment the first time I used Google Protocol Buffers (GPB) for something.

Essentially, the way that Google Protocol Buffers are decoded can be seen as a small stack-based state machine that runs inside a Builder. The Builder holds the essential state of a particular message representation (such as an Address or AddressBook or something), and runs through the bytes in your wire representation, updating its current state for each field it encounters. When you think you've consumed everything, you then extract your Address or AddressBook or whatever from the Builder.

The commands of this state machine are pretty simple:
  • Set the value of Field Number X to value Y (encoded with type Z)
  • Push a new context onto the stack for a sub-Message
  • Pop the context off the stack to go back to the parent message
It's quite clever, but it can lead to some interesting situations. For example, it's entirely valid for a non-repeated field to be set multiple times, so if you're interpreting the network commands, you can get "Set Name to Kirk Wylie" followed by "Set Name to Wylie, Kirk", and (at least in the implementations that I worked with) you get the final value set to "Wylie, Kirk".
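To make the "last value wins" behaviour concrete, here's a minimal Java sketch. It is not the real library's Builder: it hand-rolls just enough of the wire format to handle one string field, and it assumes that Name is field number 1 and that every value is under 128 bytes (so the length fits in a single byte). The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the protobuf library: one string field, numbered 1.
class LastValueWins {
    // Tag byte for field 1 with wire type 2 (length-delimited): (1 << 3) | 2 == 0x0A.
    static final int NAME_TAG = (1 << 3) | 2;

    // Appends a "Set Name to <name>" command to the wire bytes.
    static void writeName(ByteArrayOutputStream out, String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch only handles one-byte lengths");
        }
        out.write(NAME_TAG);
        out.write(utf8.length); // a varint below 128 is a single byte
        out.write(utf8, 0, utf8.length);
    }

    // Runs through the bytes until they are exhausted. Every occurrence of
    // the field overwrites the previous value, so the last one wins.
    static String parseName(byte[] wire) {
        String name = null;
        int pos = 0;
        while (pos < wire.length) {
            int tag = wire[pos++] & 0xFF;
            if (tag != NAME_TAG) {
                throw new IllegalStateException("unexpected tag " + tag);
            }
            int length = wire[pos++] & 0xFF;
            name = new String(wire, pos, length, StandardCharsets.UTF_8);
            pos += length;
        }
        return name;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeName(out, "Kirk Wylie");
        writeName(out, "Wylie, Kirk");
        System.out.println(parseName(out.toByteArray()));
    }
}
```

Running main prints Wylie, Kirk. Nothing in the bytes says "stop here", so the loop only ends when it runs out of input.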

That seems like a little piece of trivia, until you realize that Google Protocol Buffers, unlike every other network message representation that I've ever worked with, lack both a sizing prefix and a terminator. Remember, it's a state machine, so it's just going to keep processing.

Again, trivia.

Until you try to write a sequence of discrete messages to a file or send them over a socket. If you don't explicitly do your own termination or size prefixing, the Builder will just keep processing commands: you'll consume the entire stream and get a single message out, holding only the final value of each field. So if I'm trying to save two Address messages, the first having a name of "Kirk Wylie" and the second having a name of "Wylie, Kirk", I'll only get one output, with "Wylie, Kirk".

In Java, this also has the side effect of forcing you to do an otherwise unnecessary byte copy. You have to read the size prefix of the next message (and computing that size before you serialize costs you CPU time in the first place), extract the next N bytes from the stream into a byte array, and then have your Builder parse that byte array.
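Here's a rough sketch of that do-it-yourself framing in plain Java. The 4-byte big-endian size prefix is an arbitrary choice for this example, and the class and method names are hypothetical. The payloads here are just UTF-8 strings standing in for serialized messages; with real generated classes you'd presumably frame the output of toByteArray() and hand each extracted array to something like Address.parseFrom(byte[]).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: size-prefixed framing around opaque message bytes.
class LengthPrefixedFrames {
    // Writes the size prefix, then the already-serialized message.
    static void writeFrame(DataOutputStream out, byte[] message) throws IOException {
        out.writeInt(message.length); // 4-byte big-endian prefix (assumed format)
        out.write(message);
    }

    // Returns the next message, or null if the stream ends before a full prefix.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int size;
        try {
            size = in.readInt();
        } catch (EOFException noMorePrefixes) {
            return null;
        }
        if (size < 0) {
            throw new IOException("negative frame size: " + size);
        }
        byte[] message = new byte[size]; // this is the extra copy
        in.readFully(message);
        return message;
    }

    // Frames every message into one byte stream.
    static byte[] frameAll(List<byte[]> messages) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            for (byte[] message : messages) {
                writeFrame(out, message);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Splits a framed byte stream back into the individual messages.
    static List<byte[]> readAll(byte[] stream) {
        List<byte[]> messages = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream))) {
            for (byte[] m = readFrame(in); m != null; m = readFrame(in)) {
                messages.add(m);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return messages;
    }

    public static void main(String[] args) {
        byte[] first = "Kirk Wylie".getBytes(StandardCharsets.UTF_8);
        byte[] second = "Wylie, Kirk".getBytes(StandardCharsets.UTF_8);
        byte[] stream = frameAll(Arrays.asList(first, second));
        for (byte[] message : readAll(stream)) {
            System.out.println(new String(message, StandardCharsets.UTF_8));
        }
    }
}
```

With the prefix in place, both payloads come back out as separate messages instead of being merged into one.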

All annoyances more than anything else. But probably useful for other people to know.