MPack 1.1.1
A C encoding/decoding library for the MessagePack serialization format.
|
The Reader API is used to parse MessagePack incrementally with no per-element memory usage. Elements can be read one-by-one, and the content of strings and binary blobs can be read in chunks.
Reading incrementally is much more difficult and verbose than using the Node API. If you are not constrained in memory or performance, you should use the Node API.
A reader is first initialized against a data source. This can be a chunk of data in memory, or it can be a file or stream. A reader that uses a file or stream will read data in chunks into an internal buffer. This allows parsing very large messages efficiently with minimal memory usage.
Once initialized, the fundamental operation of a reader is to read a tag. A tag is a value struct of type mpack_tag_t
that contains the value of a single element, or the metadata for a compound element.
Here's a minimal example that parses the first element out of a chunk of MessagePack data.
The struct mpack_tag_t
contains the element type, accessible with mpack_tag_type()
. Depending on the type, you can access the value with tag accessor functions such as mpack_tag_bool_value()
or mpack_tag_array_count()
.
Of course, parsing single values isn't terribly useful. You'll need to know how to parse strings, maps and arrays.
Compound types are either containers (map, array) or data chunks (strings, binary blobs). For any compound type, the tag only contains the compound type's size. You still need to read the contained data, and then you need to let the reader know that you're done reading the element (so it can check that you did it correctly.)
To parse a container, you must read as many additional elements as are specified by the tag's count. Note that a map tag specifies the number of key value pairs it contains, so you must actually read double the tag's count. You must then call mpack_done_array()
or mpack_done_map()
so that the reader can verify that you read the correct number of elements.
Here's an example of a function that reads a tag from a reader, then reads all of the contained elements recursively:
WARNING: It is critical that we check for errors during each iteration of our array and map loops. If we skip these checks, malicious data could claim to contain billions of elements, throwing us into a very long loop. These checks allow us to break out as soon as we run out of data. Since this function is recursive, we've also added a depth limit to prevent bad data from causing a stack overflow.
(The Node API is not vulnerable to such problems. In fact, it can break out of such bad data even sooner. The Node API will automatically flag an error as soon as it realizes that there aren't enough bytes left in the message to fill the remaining elements of all open compound types. If possible, consider using the Node API.)
Unfortunately, the above code has a problem: it does not handle strings or binary blobs. If it encounters them, it will fail because it's not reading their contents.
The data for chunks such as strings and binary blobs must be read separately. Bytes can be read across multiple read calls, as long the total number of bytes you read matches the number of bytes contained in the element. And as with containers, you must call mpack_done_str()
or mpack_done_bin()
so that the reader can verify that you read the correct total number of bytes.
We can add support for strings to the previous example with the below code. In this example, we read the string incrementally in 128 byte chunks to keep memory usage at an absolute minimum:
As above, we need a safety check to ensure that bad data cannot get us stuck in a loop.
Code for mpack_type_bin
is identical except that the type and function names contain bin
instead of str
.
The MPack reader supports accessing the data contained in compound data types directly out of the reader's buffer. This can provide zero-copy access to strings and binary blobs, provided they fit within the buffer.
To access string or bin data in-place, use mpack_read_bytes_inplace()
(or mpack_read_utf8_inplace()
to also check a string's UTF-8 encoding.) This provides a pointer directly into the reader's buffer, moving data around as necessary to make it fit. The previous code could be replaced with:
There's an important caveat here though: this will flag an error if the string does not fit in the buffer. If you are decoding a chunk of MessagePack data in memory (without a fill function), then this is not a problem, as it would simply mean that the message was truncated. But if you are decoding from a file or stream, you need to account for the fact that the string may not fit in the buffer.
To work around this, you can provide two paths for reading data depending on the size of the string. MPack provides a helper mpack_should_read_bytes_inplace()
to tell you if it's a good idea to read in-place. Here's another example that uses zero-copy strings where possible, and falls back to an allocation otherwise:
When parsing data of expected types from a dynamic stream, you will likely want to hardcode type checks before each access. For example:
MPack provides helper functions that perform these checks in the Expect API. The above code for example could be replaced by a single call to mpack_expect_bool()
. If the value is not a bool, it will flag an error and return false
. This is the incremental reader analogue to mpack_write_bool()
.
These helpers are strongly recommended because they will perform range checks and lossless conversions where needed. For example, suppose you want to read an int
. A positive integer could be stored in a tag as two different types (mpack_type_int
or mpack_type_uint
), and the value of that integer could be outside the range of an int
. The functions mpack_expect_int()
supports both types and performs the appropriate range checks. It will convert losslessly as needed and flag an error if the value doesn't fit.
Note that when using the expect functions, you still need to read the contents of compound types and call the corresponding done function. A call to an expect function only replaces a call to mpack_read_tag()
. For example:
If you want to always check that types match what you expect, head on over to the Expect API and use these type checking functions. But if you are still interested in handling the dynamic runtime type of elements, read on.
In this example, we'll implement a SAX-style event-based parser for blobs of MessagePack using the Reader API. This is a good example of non-trivial MPack usage that handles the dynamic type of incrementally parsed MessagePack data. It's easy to convert any incremental parser to an event-based parser, as we'll soon see.
First let's define a header file with a list of callbacks for our parser, and our single function to launch the parser:
Next we'll define our parse_messagepack()
function in a corresponding source file to set our our reader. This will wrap another function called parse_element()
.
We'll make parse_element()
recursive to keep things simple. This makes it extremely straightforward to implement. We just parse a tag, switch on the type, and call the appropriate callback function.
Note that we can access all strings in-place because the source data is a single chunk of contiguous memory.
As above, the error checks within loops are critical to keep the parser safe against untrusted data.
This is all that is needed to convert MPack into an event-based parser.