Does anyone care about protobufs?

There’s library(protobufs), but it seems to be unmaintained and doesn’t conform with the spec in various ways. I’ve started a bit of clean-up on the code, but before I put more work into it, I’d like to know if anyone other than me is interested in protobufs.

Protobufs have a big performance advantage over more verbose serialization methods – apparently the pure-JavaScript implementation is 6x faster than JSON, and I’d anticipate a similar speed difference with a Prolog implementation.

I’m thinking of implementing something with a different philosophy from the current code by making the equivalent of protoc --prolog_out= (but using a somewhat different technique), that would allow accessing the fields in the protobuf in a more natural way and would also allow the implementation to align with the spec.

Any thoughts?

Is this a kind of DSL - code generator thingy … supported by a library ported across languages to do the heavy lifting

But, i don’t get it … how can the message keyword appear across all languages

Is this being preprocessed somehow … do all supported language environments need to support preprocessing

It’s, btw, amazing in how many ways the same problem (data description) is addressed – there is XSD for XML and i guess a schema language for JSON as well, and now there is a Protocol Buffer DSL

Dan

A bit of explanation … protobufs are heavily used at Google both for transmitting and for storing data; and are about as lightweight as possible. For example, if you wanted to send data like this (in JSON):

{"person_name": {"first": "Freddy", "last": "Flintstone"}, "number": 1000}

there would be a description of this message (in a .proto file):

message PersonAndNumber {
  message PersonName {
    string last = 3;
    string first = 4;
  }
  PersonName person_name = 1;
  int32 number = 2;
}

and the message would be serialized somewhat like this:
1(20)4(6)Freddy3(10)Flintstone2(1000).
The 1,2,3,4 are the field numbers(*) and the items in parentheses are lengths or numbers.
The actual encoding is a bit different (it’s more compact and not human-readable), but follows the same idea.
The ordering of fields is undefined; this is also a correct serialization:
2(1000)1(20)3(10)Flintstone4(6)Freddy.

Notice that there’s no metadata in the message: both sender and receiver have to agree in advance that they’re communicating via a PersonAndNumber message.
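To make the "actual encoding" a bit more concrete, here is a minimal Python sketch that hand-encodes the example message into real wire-format bytes. It is only an illustration: the helper names are made up for this example, it handles only strings, nested messages and non-negative integers, and it is not how protoc-generated code is structured.

```python
# Hand-rolled illustration of the protobuf wire format; helper names are made up.
VARINT, LEN = 0, 2  # wire types: 0 = varint, 2 = length-delimited


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def key(field_number: int, wire_type: int) -> bytes:
    """A field key packs the field number and the wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Strings and nested messages are both written as key, length, payload."""
    return key(field_number, LEN) + encode_varint(len(payload)) + payload


def varint_field(field_number: int, value: int) -> bytes:
    """An integer field is written as key, then the value as a varint."""
    return key(field_number, VARINT) + encode_varint(value)


# Field numbers follow the example .proto above: first = 4, last = 3.
person_name = (length_delimited(4, "Freddy".encode("utf-8"))
               + length_delimited(3, "Flintstone".encode("utf-8")))
wire = length_delimited(1, person_name) + varint_field(2, 1000)

print(len(person_name))  # 20, matching the "(20)" in the readable form above
print(wire.hex(" "))
```

The nested message is 20 bytes because each string costs one key byte and one length byte on top of its characters; the whole message is 25 bytes.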

The protobuf compiler (protoc) generates code so that, for example, in Python you can write:

import PersonAndNumber_pb2  # generated by protoc from the .proto file
p_a_n = PersonAndNumber_pb2.PersonAndNumber()
p_a_n.ParseFromString(receive_serialized_msg())  # receive_serialized_msg() is a placeholder
print(f"{p_a_n.person_name.first} {p_a_n.person_name.last}'s number is {p_a_n.number}")

and similarly, the message can be created something like this (the details are slightly different):

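p_a_n = PersonAndNumber_pb2.PersonAndNumber()
name = p_a_n.person_name  # the nested message is reached as an attribute
name.first = "Freddy"
name.last = "Flintstone"
p_a_n.number = 1000
send_serialized_msg(p_a_n.SerializeToString())  # send_serialized_msg() is a placeholder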

So, in C++, Java, Python, the protobuf contents can be handled almost like a native data type, but the data can be transmitted in a very compact form; and the two ends of the communication can be written in whatever languages you want.

Also, it’s easy to add new fields in a message definition (.proto file) - older programs will just ignore the new fields. (That’s why the field numbers are included in the message; they also allow leaving out fields with default values.)

(*) A real .proto file would label the fields from 1 in each message; I’ve given every field a different number to make it easier to understand how the message is serialized.
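The point above about older programs ignoring new fields can be sketched in a few lines of Python. This is a simplified reader written for illustration (the function names and the "nickname" field are hypothetical, and only wire types 0 and 2 are handled): it decodes each field key, and simply drops any field number it has no name for.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, new_pos) for the varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_known_fields(buf: bytes, known: dict[int, str]) -> dict[str, object]:
    """Decode varint (0) and length-delimited (2) fields, skipping unknown numbers."""
    out: dict[str, object] = {}
    pos = 0
    while pos < len(buf):
        field_key, pos = read_varint(buf, pos)
        number, wire_type = field_key >> 3, field_key & 7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        if number in known:  # fields with unknown numbers are silently dropped
            out[known[number]] = value
    return out


# A "newer" sender added field 5 (a string); the "older" reader only knows field 2.
new_msg = b"\x10\xe8\x07" + b"\x2a\x02" + b"hi"
old_reader = read_known_fields(new_msg, {2: "number"})
print(old_reader)                      # {'number': 1000}
print(old_reader.get("nickname", ""))  # a field that is absent falls back to a default
```

The reader can skip a field it does not understand because the wire type in the key tells it how many bytes the value occupies.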

Thank you Peter, for clarifying.

What is the (“cost”) context in which architecturally protobuf becomes a significant advantage over say JSON – i.e. when does it start to shine …

I imagine at Google scale, every saved byte makes a significant impact – but, for us mere mortals – when does the benefit kick in …

Edit:

I guess, there is also the flip side that no metadata is included in the message – although, i would imagine that metadata has limited use without at least (tacit) semantic agreement on the meaning of field names, on both sides of the pipe.

The cost of serializing/deserializing is apparently 6x lower in JavaScript than with JSON.stringify() and JSON.parse(). And only data is transmitted: none of the field names are transmitted, so it’s very compact.

The other advantage is that it’s language-neutral: you can have a Python client and a Java server; and you could reimplement the Java server in C++ with no changes to the client. Admittedly, JSON and XML also have this property; but they’re much more heavyweight.

Anyway, I’m not advocating it as a replacement for reading/writing Prolog terms, nor as a replacement for JSON (unless performance is an issue); the main use would probably be with systems that already use protobufs, such as anything that uses gRPC. Here are some adopters: Square, Netflix, IBM, CoreOS, Docker, CockroachDB, Cisco, Juniper Networks, Spotify, Zalando, Dropbox (gRPC - Wikipedia)

For what it is worth, this relates pretty well to my work on ROS2. ROS types are described in .msg files. A .msg file basically describes a struct (object/dict/map/…) where the fields have a name and a type that is either another .msg file (nested struct), a primitive (various sized integers, floats and strings), or an array thereof (where we have unbounded, fixed-size and upper-bounded arrays).

Normally these types are compiled to C++ and Python classes. These are the two core languages of ROS. Additional languages have two options: (1) extend the message compiler infrastructure to generate code for the new language or (2) use the type introspection. I use the second approach for the Prolog interface. This implies:

  • Given some type T, load shared objects (DLLs) that provide both helper functions and data structures. These symbols have names you can compute from the type name. This provides:
    • A function to allocate a default message for the type. All numbers are zero, strings and arrays are empty. Fixed arrays are filled with zeroed members. If some message fields have a default this is filled.
    • A function to destroy (free) a message.
    • A data structure that describes a struct, providing the field names, types and offsets in memory.

On top of that I wrote a C wrapper that walks over the type description data structure and fills the fields from the Prolog data or builds a Prolog term (dict) from the fields. The Prolog representation is a (nested) dict, where arrays are lists, numbers are mapped easily and strings are mapped to Prolog strings. When a Prolog struct needs to be translated to a message we allocate a default message and fill those fields for which there is a key in the dict. If there is no key we leave the field at the default. This is pretty efficient and easy to use from Prolog.
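The "allocate a default message, then fill only the keys that are present" idea can be illustrated without any C or ROS code. The following is a pure-Python sketch under stated assumptions: the `TYPES` table, the type names and both functions are hypothetical stand-ins for the introspection data and the wrapper described above, and arrays of nested messages are not handled.

```python
# Hypothetical type descriptions: message type -> list of (field name, field type).
TYPES = {
    "Point": [("x", "float64"), ("y", "float64")],
    "Pose": [("label", "string"), ("position", "Point"), ("history", "float64[]")],
}
PRIMITIVE_DEFAULTS = {"float64": 0.0, "int32": 0, "string": ""}


def default_message(type_name: str) -> dict:
    """Build a message with every field at its default, recursing into nested types."""
    msg = {}
    for name, ftype in TYPES[type_name]:
        if ftype.endswith("[]"):
            msg[name] = []  # arrays start out empty
        elif ftype in PRIMITIVE_DEFAULTS:
            msg[name] = PRIMITIVE_DEFAULTS[ftype]
        else:
            msg[name] = default_message(ftype)
    return msg


def fill(type_name: str, data: dict) -> dict:
    """Start from the default message and overwrite only the fields present in data."""
    msg = default_message(type_name)
    for name, ftype in TYPES[type_name]:
        if name not in data:
            continue  # no key in the dict: the field keeps its default
        if ftype in TYPES:
            msg[name] = fill(ftype, data[name])
        elif ftype.endswith("[]"):
            msg[name] = list(data[name])
        else:
            msg[name] = data[name]
    return msg


print(fill("Pose", {"position": {"x": 1.5}}))
# {'label': '', 'position': {'x': 1.5, 'y': 0.0}, 'history': []}
```

The caller only has to mention the fields it cares about; everything else comes from the type description.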

You can probably realise a similar design for Protobufs. Possibly merely by compiling the .proto file to a (C) data structure and then do the translation as outlined above?

That more-or-less summarizes protobufs. The important thing is that the field names don’t appear in the wire format, only the field “tags” (numbers). If someone changes the names in a .proto file, nothing changes in the wire format but of course the code has to change to reflect the changed names.

That’s essentially how I intend to process protobufs in Prolog. (The current implementation in library(protobufs) - which is inadequate for my needs - doesn’t use the .proto files at all, but provides a “high level” interface to the wire format.) The protobuf compiler (protoc) can output a file with all the information from a .proto file (in protobuf wire format, of course), and I can turn that into Prolog facts that can be used to interpret wire format data. (The current code parses the ASCII interpretation of this file; but that’s a temporary solution while I bootstrap the implementation.)

I’ve also written code to segment the protobuf wire format into nested messages (without knowing what the “types” of the individual pieces are); the remaining work is to walk that data and map field tags to message types.
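Segmenting wire-format data without a schema might look roughly like the Python sketch below. It is not the Prolog code referred to above, only an illustration of the idea with made-up names: every length-delimited chunk is first tried as a nested message, then as a UTF-8 string, and otherwise kept as raw bytes. That guess can be wrong (a short string may happen to parse as a message, and an empty chunk looks like an empty message), which is why a schema is still needed to settle the types.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, new_pos); raise ValueError if the buffer ends mid-varint."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def guess_length_delimited(chunk: bytes) -> tuple:
    """Heuristic: nested message first, then UTF-8 string, then raw bytes."""
    try:
        return ("message", segment(chunk))
    except ValueError:
        pass
    try:
        return ("string", chunk.decode("utf-8"))
    except UnicodeDecodeError:
        return ("bytes", chunk)


def segment(buf: bytes) -> list:
    """Split buf into (field_number, kind, value) triples without using a schema."""
    segments = []
    pos = 0
    while pos < len(buf):
        field_key, pos = read_varint(buf, pos)
        number, wire_type = field_key >> 3, field_key & 7
        if number == 0:
            raise ValueError("field number 0 is not valid")
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
            segments.append((number, "varint", value))
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated length-delimited field")
            segments.append((number,) + guess_length_delimited(buf[pos:pos + length]))
            pos += length
        elif wire_type in (1, 5):
            size = 8 if wire_type == 1 else 4  # fixed 64-bit or fixed 32-bit
            if pos + size > len(buf):
                raise ValueError("truncated fixed-width field")
            segments.append((number, f"fixed{size * 8}", buf[pos:pos + size]))
            pos += size
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return segments


# The PersonAndNumber example from earlier in the thread, as wire-format bytes.
wire = b"\x0a\x14" + b"\x22\x06Freddy" + b"\x1a\x0aFlintstone" + b"\x10\xe8\x07"
print(segment(wire))
# [(1, 'message', [(4, 'string', 'Freddy'), (3, 'string', 'Flintstone')]), (2, 'varint', 1000)]
```

The remaining step, mapping field numbers to names and types, is where the information from the .proto file comes in.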

The .proto data structure already exists (I can tell protoc to output it), and I’ve got code that turns it into a bunch of Prolog facts (contrib-protobufs/descriptor_proto.pl at b46260d3ddc4aa8ec20623a92d40f8027b31d021 · kamahen/contrib-protobufs · GitHub).

I still have to work out a higher-level API, both for creating protobufs and reading their contents. This is a non-trivial exercise; the current JavaScript interface is pretty bad, for example (and the original Python interface was utterly awful). There are also some potential trade-offs between performance and convenience that I need to think about.

If the design is just for my needs, the API design shouldn’t be too difficult. But if others want to use the code, I need to think a bit deeper - and also get use cases from other people. So … does anyone currently use protobufs and what are your needs?

@peter.ludemann, how does protobufs compare with msgpack? It would be great if you had a table with a list of features/advantages listing what msgpack has and what protobufs has. I am not looking so much for “protobufs is better than X” but rather for a feature by feature comparison, because for most technologies X is better than Y in some cases and Y is better than X in other cases (for example msgpack is schemaless which is useful in some cases, whereas protobufs is schema-based which is useful in other cases, we should let the user decide), so a table summarizing features and lack of features would be good. Matrix is using CBOR so it might be good to include CBOR in the table.

My general idea is that we could have a library(serialization) that provides protobufs, msgpack and some of the other major serialization formats out there.

No idea. You can find out by using your favourite search engine with [msgpack vs protobuf]. And for extra fun: Comparison of data-serialization formats - Wikipedia (which is missing at least 2: RFC 3423 (CRANE) and RFC 3954 (NetFlow)).

My whole intent in improving library(protobufs) is to allow interoperability with systems that already use protobufs (I presume that some parts of GCE are included in this) and also offer a way of serializing data that’s faster than JSON, which might be important if you’re outputting/transmitting large amounts of data (in some of my code, the slowest components are inputting/outputting json_read_dict/3 and read_term/2 … the latter I’ll probably fix by using fast_read/2 and the former by switching to protobufs).

Don’t let me stop you. :grinning:
But I haven’t settled on an API for protobufs yet; and I suspect that creating an API that works across multiple serialization formats will be a challenge. :rofl: :sunglasses:

I also think that is not feasible. What I meant was that within the library we have different APIs for the different formats. Perhaps (library(serialization/msgpack), library(serialization/protobufs), etc.).
We already have msgpack support in "msgpack" pack for SWI-Prolog and it could be added. That was my thought.

That’s useful, thanks

EDIT: Interesting article here:

I’ve had no interaction with Protobuf, but I do care about JSON Schema, so I’m interested in your approach to relating schema to data, @peter.ludemann. I just wrote a JSON parser/generator for Scryer Prolog, and reasoning about JSON schemas is my next step.

Internally inside Prolog, will you keep the data and the schema as two completely different objects, or somehow inject the schema into the Prolog object representation?

We have library(http/json) but read_term/2 and write_term/2 are “builtin”. I don’t see a huge advantage in a more hierarchical library - the namespace needs to get a lot larger to require something like that. (The current library has 137 modules in clp, dialect, http and 188 modules at the top level.)

I would keep the schema and data separately. My initial thoughts on the schema are given here, where names are kept in a “fully qualified” form: contrib-protobufs/descriptor_proto.pl at eb70339580ce320cf07ca7cd6b03ced0662e571b · SWI-Prolog/contrib-protobufs · GitHub … an example of “fully qualified name” is google.protobuf.Timestamp in protobuf/addressbook.proto at 7956ad20d6a1a43d2b2f7758636b72d4427681c7 · protocolbuffers/protobuf · GitHub

The fully qualified form allows a quick lookup to resolve the type of a field … my idea for processing is to pass around a context of the current nested type and use that for looking up field types.
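As a rough illustration of looking up a field's type from the current nested context, here is a small Python sketch. It is a simplification of protobuf's real name-resolution rules, and the leading-dot convention, the function name and the set of known types (modelled loosely on the addressbook example) are assumptions made for this example only.

```python
def resolve_type(name: str, scope: str, known: set[str]) -> str:
    """Resolve a type name as seen from inside scope, trying the innermost scope first."""
    if name.startswith("."):  # already fully qualified
        if name in known:
            return name
        raise KeyError(name)
    parts = [p for p in scope.split(".") if p]
    while True:
        candidate = "." + ".".join(parts + [name])
        if candidate in known:
            return candidate
        if not parts:
            raise KeyError(name)
        parts.pop()  # widen the search to the enclosing scope


# Hypothetical set of fully qualified message names.
known_types = {
    ".tutorial.Person",
    ".tutorial.Person.PhoneNumber",
    ".tutorial.AddressBook",
    ".google.protobuf.Timestamp",
}

print(resolve_type("PhoneNumber", ".tutorial.Person", known_types))
# .tutorial.Person.PhoneNumber
print(resolve_type("Person", ".tutorial.AddressBook", known_types))
# .tutorial.Person
print(resolve_type("google.protobuf.Timestamp", ".tutorial.Person", known_types))
# .google.protobuf.Timestamp
```

Once every name is stored in fully qualified form, the lookup itself is a single set or table probe; the scope walk is only needed when turning a relative name into a qualified one.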

I haven’t decided whether to continue the current implementation’s “declarative” flavor or to have distinct input and output predicates … probably the latter. The current implementation would be left around as a lower-level (and probably faster) way of creating wire format (I don’t think its method for reading wire format is very useful, and I’m not planning on modifying it) and there would be higher level predicates that use the schema, working with something very similar to the current JSON-dict form, except the dict tags would probably be meaningful. (There are some slightly tricky bits, because the protobuf wire format allows the rough equivalent of {a:1, b:2, a:3}, but also with arrays; in fact, the array elements can be interspersed with other fields.)
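The "tricky bit" with `{a:1, b:2, a:3}` can be shown with a short Python sketch. It assumes the fields have already been decoded into (name, value) pairs and that the schema tells us which names are repeated; the function name is hypothetical. For a repeated field the values accumulate in arrival order, even when other fields are interspersed; for a non-repeated scalar field the last value wins. Merging of duplicated embedded messages is deliberately left out.

```python
def collect_fields(pairs, repeated):
    """Fold (field_name, value) pairs into a dict.

    Names listed in `repeated` always map to a list (empty if the field never
    appears); for every other name the last value seen wins.
    """
    out = {name: [] for name in repeated}
    for name, value in pairs:
        if name in repeated:
            out[name].append(value)
        else:
            out[name] = value
    return out


pairs = [("a", 1), ("b", 2), ("a", 3)]
print(collect_fields(pairs, repeated={"a"}))   # {'a': [1, 3], 'b': 2}
print(collect_fields(pairs, repeated=set()))   # {'a': 3, 'b': 2}
```

Without the schema the two readings cannot be told apart, which is one more reason to keep the schema available while reading.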

It seems that the idea behind the current library(protobufs) is to not have separate schemas at all, but instead automatically map Prolog scalar value types to Protobuf scalar value types? For example, a Prolog integer would become a Protobuf int64 and so on?

However that approach doesn’t allow for things like enumerations and other more complex objects. I’m guessing that’s your primary motivation for extending the library.

And it looks like you will attempt to use actual schemas, defined as SWI-Prolog dicts, in your implementation? And your first step is to create a meta-schema for Protobuf itself? That’s what your descriptor_proto looks like to me.

It does have enumerations, but not well documented (and the implementation needs to be updated to take modules into account better). Similarly, it can embed more complex objects. But it doesn’t use any information from the .proto file.

Yes, I would use the actual schema - the “reflection” would be predicates such as

proto_package(Package, FileName, Options)
proto_message_type(Fqn, Package, Name)
proto_field_name(FqnName, Name)
proto_field_number(FqnName, Number)
    ...

These would be used on input to create a term similar to the one created by json_read_dict/2; for output, I’m still thinking about how best to do things – possibly a higher-level flavor of the existing design, with the schema information used to look up the field number (from the field name), handle enumerations, and map from a Prolog primitive type (int, float, atom, etc.) to the appropriate protobuf wire form (e.g., fixed or variable length integer).
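To show why the schema is needed when writing an integer, here is a small Python sketch of how the same value ends up as different bytes depending on the declared protobuf type. The function names are made up for illustration, only the value bytes are produced (no field key), and only a handful of the integer types are covered.

```python
import struct


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def zigzag(n: int) -> int:
    """Map signed to unsigned so small negatives stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def encode_integer(value: int, proto_type: str) -> bytes:
    """Encode the value bytes of an integer according to the type in the schema."""
    if proto_type in ("uint32", "uint64"):
        return encode_varint(value)
    if proto_type in ("int32", "int64"):
        # Negative values are sent as 64-bit two's complement, i.e. ten bytes.
        return encode_varint(value & 0xFFFFFFFFFFFFFFFF)
    if proto_type in ("sint32", "sint64"):
        return encode_varint(zigzag(value))
    if proto_type == "fixed32":
        return struct.pack("<I", value)  # always 4 bytes, little-endian
    if proto_type == "fixed64":
        return struct.pack("<Q", value)  # always 8 bytes, little-endian
    raise ValueError(f"type {proto_type!r} is not covered by this sketch")


print(encode_integer(1000, "int32").hex())       # e807
print(encode_integer(1000, "fixed32").hex())     # e8030000
print(encode_integer(-1, "sint32").hex())        # 01
print(len(encode_integer(-1, "int32")))          # 10
```

A Prolog integer by itself does not say which of these forms is wanted, so the choice has to come from the field's declared type.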

BTW, the “schema” I gave here is not very good … I’m still thinking about a better representation.
My test case is the output from

    protoc --include_imports --descriptor_set_out=descriptor.proto.msg \
        -I$HOME/src/protobuf/src/google/protobuf \
        descriptor.proto

My name is Jeffrey Rosenwald. I am the original author/contributor of the library(protobufs).

This library was never intended to provide a comprehensive emulation of the original Google idea (e.g. Java synchronous RPCs over byte streams). It was intended for marshaling wire-streams to/from structured data (e.g. Prolog terms), for the purpose of asynchronous Inter Process Communications (IPC) over byte stream socket connections: Telemetry, Command & Control. Protobuf Wire Types are lightweight, compact, lossless, machine independent, and instantaneously recognizable.

I wasn’t dancing to Google’s tune. The implementation was intended explicitly for the SWIPL ecosystem. It supports every data type that SWIPL supports, including its infinite-precision integer. In our implementation, there is a homomorphic mapping for all numeric data types, codes are octets, and strings are UTF8 encoded characters. Packed-repeated (lists) values are not supported. Its utility and extensibility are quite remarkable. The fact that it can interwork with so many other languages is simply a “cherry on top.”

Looking at the table below, we can see that the wire types for the ‘proto2’ wire-streams have not changed in a very long time.

It was originally written in 2008, so it’s nearly 13 years old now – donkey’s years for a piece of software – and the fact that it remains relevant today suggests it is still a good idea. Whether it’s good enough for present-day needs is clearly in the eye of the beholder – Caveat Emptor.

Kindest Regards.

Thank you Jeffrey – I really appreciate the work you put into this “donkey’s years” ago. I doubt that I would have done as good a design.

If you don’t mind taking a look at the code changes I’ve made, I’d appreciate that: Documentation/test clean-up; add protobuf_segment_message/2 by kamahen · Pull Request #3 · SWI-Prolog/contrib-protobufs · GitHub, particularly protobuf_segment_message/2, which follows your style of strings as UTF8 and codes as octets (and also has a fall-back in protobuf_segment_convert/2, in case a string is mis-interpreted as a message). I intend to use this as part of the automated conversion between Prolog terms and protobufs.

Like you, I don’t intend to implement everything from Google’s spec (e.g., not handling the “group” types, which are deprecated and hopefully no longer used), nor the serialization details that allow merging protobufs; but I intend to add packed-repeated (lists) and a few other things.

(BTW, I wrote code that used protobufs for nearly 10 years at Google … some of the original implementations, especially for Python, were truly awful. But because I’ve used those, I’ve had trouble switching my point-of-view to a different design, which is probably why I’ve had some problems understanding the finer points of your design. In the process of learning the different point of view, I’m trying to extend the documentation with what I’ve learned.)

Hi, I’m working on an LSP implementation, including a version of it that uses protobufs.