Does anyone care about protobufs?

Do you need protobufs for input, output, or both? Are there any changes you’d like to the current implementation?

Are there any changes you’d like to the current implementation?
I dunno at the moment.

protoc --prolog_out should generate Prolog code, as it does for other languages

Do you need protobufs for input, output, or both?

I think both – the data will be sent from client to server and back

That would require adding a plug-in for the compiler.
But I can do a 2-step process, which gets the same effect – protoc --descriptor_set_out=... and then process that file to generate the meta-data.
Is that an acceptable method? (TBH, I don’t know how much work it would be to go the full plug-in route)

I currently have code that creates metadata from the descriptor_set_out file (it’s actually a 2-step process: it uses protoc --decode to make a human-readable form, then runs a DCG over that; but I intend to bootstrap to a 1-step process), and I have preliminary code that uses the generated metadata to read protobuf wire format and turn it into a Prolog term using dicts.
The next step is to do the same thing for output and to write a bunch of unit tests; and then finish the bootstrap.
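A minimal sketch of that 2-step pipeline, assuming protoc can find google/protobuf/descriptor.proto on its include path (the file names and predicate names are my own illustration, not the code in the repository):

:- use_module(library(process)).

% Step 1: ask protoc to write a binary FileDescriptorSet.
descriptor_set(ProtoFile, DescFile) :-
    format(atom(OutOpt), '--descriptor_set_out=~w', [DescFile]),
    process_create(path(protoc),
                   ['--include_imports', OutOpt, ProtoFile], []).

% Step 2: ask protoc --decode to render it in human-readable form,
% which is what the DCG then parses.
decode_descriptor_set(DescFile, TextFile) :-
    format(atom(Cmd),
           'protoc --decode=google.protobuf.FileDescriptorSet google/protobuf/descriptor.proto <~w >~w',
           [DescFile, TextFile]),
    shell(Cmd).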
If you’re interested in my progress or have comments on my design, the work is in https://github.com/kamahen/contrib-protobufs/tree/fields2 … I’m currently putting the code in the demo directory, but it’ll be moved into the protobufs.pl file when it’s in more polished form. (Right now, it’s rather rough.)

Don’t worry about doing too much work.
Do what you can.
I didn’t count on any help at all, so it’s just a gift for me.

Btw, as a first step it is acceptable not to mess around with protoc (at least writing a protoc plugin may be delayed) and instead make use of the “interpreted mode” of protobufs, using DynamicMessage and SelfDescribingMessage

In 10 years at Google, I never saw anyone use one of those. :wink:
I suppose they’re fine if you can define the protobufs yourself, but I don’t see how to use them if someone else controls the definition.

Anyway, I’m working on packed-repeated right now (which almost every protobuf seems to use, and which was missing from the original implementation).
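For anyone following along: a packed repeated field stores all its elements in a single length-delimited field whose payload is the concatenated encoded values. A rough sketch of decoding such a payload of varints (my own illustration, not the library’s code):

packed_varints([])     --> [].
packed_varints([V|Vs]) --> varint(V), packed_varints(Vs).

% Base-128 varint, least-significant 7-bit group first;
% the high bit of each byte says "more bytes follow".
varint(V) --> varint_(0, 0, V).
varint_(Shift, Acc0, V) -->
    [Byte],
    { Low is Byte /\ 0x7f,
      Acc is Acc0 \/ (Low << Shift) },
    (   { Byte /\ 0x80 =\= 0 }
    ->  { Shift1 is Shift + 7 },
        varint_(Shift1, Acc, V)
    ;   { V = Acc }
    ).

For example, phrase(packed_varints(Vs), [0x01, 0x02, 0xac, 0x02]) yields Vs = [1, 2, 300].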
BTW, it appears that I messed up and didn’t update protobufs.html (which shows as section('packages/protobufs.html')) … I’ll try to get that fixed soon.

From the documentation, it appears to be quite straightforward to write a plugin (in Prolog, even); and it builds on my current approach of using the --descriptor_set_out file, so a bit of extra effort will result in protoc --prolog_out=
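For reference, the plugin protocol itself is simple: when given --prolog_out, protoc runs an executable named protoc-gen-prolog, passing a serialized CodeGeneratorRequest on stdin and expecting a serialized CodeGeneratorResponse on stdout. A very rough sketch (request_term/2 and response_codes/2 are hypothetical helpers built on the descriptor meta-data):

#!/usr/bin/env swipl
:- use_module(library(readutil)).
:- initialization(main, main).

main(_Argv) :-
    set_stream(user_input, type(binary)),
    set_stream(user_output, type(binary)),
    read_stream_to_codes(user_input, RequestWire),
    request_term(RequestWire, Request),      % decode CodeGeneratorRequest (hypothetical)
    response_codes(Request, ResponseWire),   % encode CodeGeneratorResponse (hypothetical)
    format(user_output, '~s', [ResponseWire]).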

@peter.ludemann would love to get your thoughts on this idea I just had for applying .proto schemas to raw Protobuf messages.

A .proto schema gets compiled to a pushdown automaton – something like the very rough draft below:

Then a DCG protobuf_chars(ParsedTerm, SchemaAutomaton, PushdownStack) --> ... would consume raw Protobuf data while navigating the pushdown automaton and modifying the stack as it goes.

Though after sleeping on this, it seems that one might as well compile straight to a DCG and skip the intermediate steps…
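To make that concrete, such a compiler might emit one DCG nonterminal per message type, something like this (entirely hypothetical names; field//3 would dispatch on the field number and wire type from the schema):

% Hypothetical sketch of an emitted clause.
message_codes(someMessage(First, Second)) -->
    field(1, int32, First),
    field(2, string, Second).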

There doesn’t seem to be any need to parse .proto files – protoc already does that and creates a protobuf with all the information (protobuf/descriptor.proto at 5e84a6169cf0f9716c9285c95c860bcb355dbdc1 · protocolbuffers/protobuf · GitHub)
This is nice, because it avoids having to deal with both proto2 and proto3 syntax and the subtle differences.

My intention is that for each .proto file, I’ll create a corresponding .pl file with a module that exports one predicate for each message type in the .proto file. The meta-information would also be available (and used by the serialization/deserialization code). It’s probably easier to show this by an example, but I won’t have that right away – at the moment, I’m updating the documentation and adding test cases (to make sure I understand the existing code), and my next piece of work will be support for packed repeated fields.
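A rough guess at the shape of such a generated module (all names here are hypothetical, not the actual generated code):

:- module(my_protobuf_pb, ['SomeMessage'/2]).

% One exported predicate per message type in the .proto file;
% term_wire/3 is a hypothetical internal (de)serializer that
% consults the meta-data facts below.
'SomeMessage'(Term, WireCodes) :-
    term_wire('my.protobuf.SomeMessage', Term, WireCodes).

% Meta-data: field(MessageType, FieldNumber, FieldName, Type, Label).
field('my.protobuf.SomeMessage', 1, first,  int32,  optional).
field('my.protobuf.SomeMessage', 2, second, string, optional).
field('my.protobuf.SomeMessage', 3, third,  string, repeated).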

I’ll leave the existing protobuf_message/2 as-is (with support for packed repeated), as a lower-level interface, at least for outputting protobufs. (It’s not clear to me how useful it is for inputting – it seems to require a “template” that has the fields in the same order as the wire format, and it also needs the template to specify “repeated” for any optional field if the wire format doesn’t provide all the fields.)
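For comparison, output with the existing low-level interface looks roughly like this (check the pack documentation for the exact template types; the details here are my best guess):

Template = protobuf([ unsigned(1, 100),
                      string(2, "abcd"),
                      repeated(3, string(["foo", "bar"]))
                    ]),
protobuf_message(Template, WireCodes)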


One .pl file per one .proto file, that makes sense to me. I’m starting to lean in the same direction for serializing/deserializing JSON with schemas. A JSON schema would be compiled to a .pl file that contains the definition of a DCG that can both parse and generate JSON with that schema applied.

My intention is that this .proto

syntax = "proto2";
package my.protobuf;
message SomeMessage {
  optional int32 first = 1;  // or int64, uint32, uint64
  optional string second = 2;
  repeated string third = 3;
  optional bool fourth = 4;
  message NestedMessage {
    optional sint32 value = 1;
    optional string text = 2;
  }
  optional NestedMessage fifth = 5;
  repeated NestedMessage sixth = 6;
  repeated sint32 seventh = 7;
}

would allow generating or analyzing messages to/from a term like this (the dict tags don’t need to be specified when outputting; they’d be looked up in the meta-data):

Msg = 'SomeMessage'{
    first: 100,
    second: "abcd",
    third: ["foo", "bar"],
    fourth: true,
    fifth: 'NestedMessage'{
                value: -666,
                text: "negative 666"
           },
    sixth: ['NestedMessage'{
                 value: 1234,
                 text: "onetwothreefour"
            },
            'NestedMessage'{
                  value: 2222,
                  text: "four twos"
            }
           ],
    seventh: [1, 2, 3, 4]
},
'SomeMessage':protobuf_term_wire(Msg, WireCodes),
open('some_message.wire', write, Stream, [encoding(octet),type(binary)]),
format(Stream, '~s', [WireCodes]),
close(Stream).

Edit:
For output using Prolog, you’d just create the term by normal means; the Python interface allows something similar (although there are “dunder methods” behind the scenes):

import some_message_pb2  # generated by: protoc --python_out=. some_message.proto
msg = some_message_pb2.SomeMessage(
    first=100,
    second="abcd",
    third=["foo", "bar"],
    fourth=True,
    fifth=some_message_pb2.SomeMessage.NestedMessage(value=-666, text="negative 666"),
    sixth=[some_message_pb2.SomeMessage.NestedMessage(value=1234, text="onetwothreefour"),
           some_message_pb2.SomeMessage.NestedMessage(value=2222, text="four twos")],
    seventh=[1, 2, 3, 4],
)
with open("some_message.wire", "wb") as wire:
    wire.write(msg.SerializeToString())

Had to look this up.

Is this the best (fastest) way of writing codes to a file? (There’s read_file_to_codes/3, whose core is written in C, but no corresponding write_file_from_codes/3.)

This makes sense to me, the only thing I’d do differently is use native Prolog terms rather than dicts to represent the deserialized message.

Msg = 'SomeMessage'(
    first(100),
    second("abcd"),
    third(["foo", "bar"]),
    fourth(true),
    fifth('NestedMessage'(
                value(-666),
                text("negative 666")
           )
    ),
    sixth(['NestedMessage'(
                 value(1234),
                 text("onetwothreefour")
            ),
            'NestedMessage'(
                  value(2222),
                  text("four twos")
            )
           ]
    ),
    seventh([1, 2, 3, 4])
),
...

I would consider that as an option; but native Prolog terms require using something like member/2 to extract items, whereas dicts allow direct access.

Msg.fifth.value

Also, if using “native Prolog terms”, I’d go with lists and pairs, because of the possibility of missing fields:

Msg = 'SomeMessage'([
    first-100,
    second-"abcd",
    third-["foo", "bar"],
    fourth-true,
    fifth-'NestedMessage'([
                value- (-666),
                text-"negative 666"]),
... ]),
Msg = 'SomeMessage'(Items),
member(fifth-'NestedMessage'(FifthItems), Items), 
member(value-Value, FifthItems)

EDIT: I was wrong to assume that dicts need to be fully instantiated to work properly. Dicts are actually perfectly capable of dealing with uninstantiated values (keys must be instantiated though). I now agree with @peter.ludemann that using dicts is the best approach.
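For instance, a quick check at the SWI-Prolog toplevel shows a dict with an unbound value working fine:

?- D = point{x: X, y: 2}, X = 1.
D = point{x:1, y:2},
X = 1.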

Original reply:

I still prefer custom schema-specific Prolog terms. Missing fields can be explicitly set to a special null atom.

Consider the following Protobuf schema describing a binary tree of integers:

syntax = "proto2";
message Node {
  message Leaf {
    optional int32 value = 1;
  }
  optional Leaf left = 2;
  optional Leaf right = 3;
}

What’s the best way to represent the following binary tree in Prolog?

Dict:

'Node'{
  left: 'Node'{
    left: 'Node'{
      left: 'Leaf'{
        value: 1
      },
      right: 'Leaf'{
        value: 2
      }
    },
    right: 'Leaf'{
      value: 3
    }
  },
  right: 'Node'{
    left: 'Leaf'{
      value: 4
    },
    right: 'Leaf'{
      value: 5
    }
  }
}

Pairs:

'Node'([
  left-'Node'([
    left-'Node'([
      left-'Leaf'([
        value-1
      ]),
      right-'Leaf'([
        value-2
      ])
    ]),
    right-'Leaf'([
      value-3
    ])
  ]),
  right-'Node'([
    left-'Leaf'([
      value-4
    ]),
    right-'Leaf'([
      value-5
    ])
  ])
])

Schema-specific terms with explicit naming:

'Node'(
  left('Node'(
    left('Node'(
      left('Leaf'(
        value(1)
      )),
      right('Leaf'(
        value(2)
      ))
    )),
    right('Leaf'(
      value(3)
    ))
  )),
  right('Node'(
    left('Leaf'(
      value(4)
    )),
    right('Leaf'(
      value(5)
    ))
  ))
)

Schema-specific terms without explicit naming:

'Node'(
  'Node'(
    'Node'(
      'Leaf'(1),
      'Leaf'(2)
    ),
    'Leaf'(3)
  ),
  'Node'(
    'Leaf'(4),
    'Leaf'(5)
  )
)

The first thing that jumps out at me is that the pairs syntax not only looks the worst, but is also the most difficult to reason about, with infinite-list and duplicate-key minefields everywhere unless it’s completely instantiated. So pairs are out.
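A quick illustration of the infinite-list minefield: if the list’s tail is unbound, member/2 “finds” any key you ask for by inventing a new cell, and keeps extending the list on backtracking (output abbreviated):

?- Items = [first-100|_Tail], member(second-S, Items).
Items = [first-100, second-S|_] .   % the entry was invented, not found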

The last syntax is the shortest and cleanest, and also the way most Prolog programmers would represent a binary tree outside the context of serialization. It has tons of nice properties – no instantiation requirements, easy pattern matching, no search-space-explosion minefields like pairs have, cleanliness and purity. The fact that you can’t use 'Node'.left.left.right when dealing with binary trees in Prolog hasn’t bothered anyone yet.

However, it may be difficult for humans to keep track of dozens of fields by position alone, so using explicit naming as in the syntax I originally proposed is the best compromise. We still retain the Prolog-specific advantages of allowing partial or no instantiation, we can still use pattern matching, our terms are clean and can be reasoned about in a pure manner, and we don’t have to worry about member/2 generating duplicate keys or infinite lists on backtracking.
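For example, extraction by pattern matching stays direct (a small illustration of my own):

% Sum all leaf values, by pattern matching on the
% explicitly-named representation.
tree_sum('Leaf'(value(V)), V).
tree_sum('Node'(left(L), right(R)), Sum) :-
    tree_sum(L, LSum),
    tree_sum(R, RSum),
    Sum is LSum + RSum.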

Just to be clear – the current protobuf_message/2 predicate must follow the structure of the .proto file. The main difference from the protobuf implementation for other languages is that it uses field numbers (called “tags” in the documentation) and it doesn’t enforce following the .proto definition – but if you don’t follow the .proto definition (that is, if you don’t use the appropriate Prolog types as documented), things are likely to break in strange ways.

I’m working on a design that will automate using the .proto definition and will prevent using the wrong type.

BTW, I have also exposed the predicates for the low-level marshalling and unmarshalling (e.g., float64_codes/2) because they might be useful in other situations (not necessarily protobufs); in general, though, they shouldn’t be used – the “templates” should be used instead.

How will you handle the situation where you want to send the string “null”? If you use special values, then you need to wrap things appropriately to handle such edge cases.

I don’t know about JSON schema, but for protobufs, it’s possible to add new fields – if you send a message with new fields to a program that’s using an older version of the message specification (.proto file), the new fields are simply ignored. Also, it’s common to have a lot of empty fields – these take up zero space in a protobuf.

The typical use cases for protobufs aren’t binary trees; the use cases look more like COBOL records … dozens and dozens of fields – descriptor.proto is typical, nearly 1000 lines long (including comments), defining 54 different message types and 211 fields. You really don’t want to have to write out all of these every time; and you don’t want to update all the programs that use these whenever a field is added.

I’ve seen very few programs that “reason” about protobufs – there are some self-describing messages, but they’re not common. Protobufs are just dumb data containers. There’s a joke that a Google programmer is a machine for turning coffee into programs that transform one protocol buffer into another. (Based on the older joke by the Hungarian mathematician Alfréd Rényi that “a mathematician is a machine for turning coffee into theorems.”) There’s even a T-shirt logo for the “larry & sergey protobuf moving company” – you can see it here: CppCon 2019: There Are No Zero-cost Abstractions--Chandler Carruth : Standard C++.


How will you handle the situation where you want to send the string “null”?

  • Internally in Prolog terms: I use character lists and never atoms to represent strings, so no danger of conflict there
  • In Protobuf: Just omit the null values from serialization

if you send a message with new fields to a program that’s using an older version of the message specification (.proto file), the new fields are simply ignored.

That’s the worst thing I’ve read all day. :joy: I’m sure this behavior has been the cause of countless bugs. And no, that’s not normal - I’ve only seen people not validate at all or validate strictly if they actually bothered to set up a schema.

I’ve seen very few programs that “reason” about protobufs

There’s a joke that a Google programmer is a machine for turning coffee into programs that transform one protocol buffer into another.

The two issues are related. If Protobuf schemas were enforced strictly, it would be possible to transform them bidirectionally between different versions and even different message types by reasoning about them. There are emerging technologies that make dealing with schema evolution easier like Project Cambria. I’m not suggesting you break spec, but it’s just a crying shame that schemas are not enforced strictly in Protobuf because bidirectional schema evolution would be so natural to write in Prolog.

The dict format is exactly what I use for ROS. The tag is the message type name. When sending a message, the tag is ignored. ROS defines a default message for each type. We first create the default message and then walk the dict to set the specified values.
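Roughly like this (a paraphrase of the above; default_message/2 and set_field/3 stand in for the actual ROS interface predicates):

:- use_module(library(apply)).  % foldl/4

dict_to_ros_message(Dict, Type, Msg) :-
    default_message(Type, Msg0),         % start from the default message (hypothetical)
    dict_pairs(Dict, _Tag, Pairs),       % the dict tag is ignored here
    foldl(set_field, Pairs, Msg0, Msg).  % set each specified value (hypothetical)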

You’ve been talking to someone, I guess :wink: It is a valid approach, but the “string” datatype does have its uses too. There is also the somewhat opinionated “chars vs codes” discussion. My message: being dogmatic about this is not necessarily the pragmatic approach in all practical cases.

Could you give a concrete example of some of the problems? I used to think that a pair is neither fundamentally nor superficially different from any other term with arity 2.