| .. _seed-0103: |
| |
| ============================================ |
| 0103: pw_protobuf: Past, present, and future |
| ============================================ |
| .. seed:: |
| :number: 0103 |
| :name: pw_protobuf: Past, present, and future |
| :status: Accepted |
| :proposal_date: 2023-08-16 |
| :cl: 133971 |
| :authors: Alexei Frolov |
| :facilitator: Armando Montanez |
| |
| ------- |
| Summary |
| ------- |
| ``pw_protobuf`` is one of Pigweed's oldest modules and has become a foundational |
| component of Pigweed and Pigweed-based projects. At its core, ``pw_protobuf`` |
| provides a compact and efficient `protobuf <https://protobuf.dev>`_ wire format |
| encoder and decoder, but as third-party usage has grown, additional higher-level |
| APIs have sprung up, many of which were contributed by third-party developers to |
| address use cases within their own projects. |
| |
| The growth of ``pw_protobuf`` was not entirely controlled, which has resulted in |
| a lack of cohesion among its components, incomplete implementations, and |
| implicit, undocumented limitations. This has made the module difficult to |
| approach for new users and put a lasting maintenance burden on the core Pigweed |
| team. |
| |
| This document explores the state of ``pw_protobuf`` and proposes a plan to |
| resolve the issues present in the module, both in the immediate short term and |
| a longer term vision. |
| |
| --------------------------- |
| Summary of Proposed Changes |
| --------------------------- |
| The table below summarizes the states of the different ``pw_protobuf`` |
| components following acceptance of this SEED. The reasoning behind these changes |
| is explained in further detail throughout the rest of the SEED. |
| |
| .. list-table:: |
| :header-rows: 1 |
| |
| * - Component |
| - Status |
| - Details |
| * - Wire format encoder/decoder |
| - Supported |
| - * ``pw_protobuf``'s primary API. |
| * Codegen helpers for convenient use. |
| * Works with streams and direct buffers. |
| * Recommended for compact and efficient protobuf operations. |
| * - Find API |
| - Supported |
| - * Useful for extracting fields from messages without having to set up a |
| decoder. |
| * Recommended as an alternative for in-memory objects for small, simple |
| messages. |
| * Will be expanded with better support for repeated fields. |
| * - Nanopb integration (build system / RPC) |
| - Supported |
| - * Recommended for newer projects that want a complete object model for |
| their protobuf messages. |
| * Recommended by default for RPC services. |
| * Can easily be used alongside lower-level ``pw_protobuf`` APIs in cases |
| where more control is required. |
| * - Message API (``message.h``) |
| - Deprecated |
| - * Superseded by other APIs. |
| * Only used by one project. |
| * Code will be removed. |
| * - Message structures |
| - **Short-term:** Discouraged |
| |
| **Long-term:** Deprecated |
| - * Will remain supported for existing users indefinitely, though no new |
| features will be added. |
| * Docs will be updated to clearly detail its limitations. |
| * Not recommended to new users; Nanopb or the low-level APIs should be |
| preferred. |
| * Will be replaced with a newer ``pw_protobuf`` object model at an |
| unspecified future point. |
| * Code will remain until the new model is fully implemented and existing |
| users have had time to migrate (with Pigweed assistance for internal |
| customers). |
| * - ``pwpb_rpc`` |
| - **Short-term:** Discouraged |
| |
| **Long-term:** Deprecated |
| - * Will remain supported for existing users indefinitely, though no new |
| features will be added. |
| * Not recommended to new users; ``nanopb_rpc`` and/or raw methods should |
| be preferred. |
| * When the new ``pw_protobuf`` object model is added, it will come with |
| updated RPC integration. |
| * Code will remain until the new model is fully implemented and existing |
| users have had time to migrate (with Pigweed assistance for internal |
| customers). |
| * - New ``pw_protobuf`` object model |
| - **Long-term:** Planned |
| - * Intended to replace existing message structures as the premier |
| in-memory object model, with a more complete implementation of the |
| protobuf spec. |
| * Investigation and design will be examined in a future SEED. |
| |
| ---------------------------- |
| Background and Current State |
| ---------------------------- |
| |
| Protobuf Components |
| =================== |
| ``pw_protobuf`` today consists of several different layered APIs, which are |
| explored below. |
| |
| Core encoder and decoder |
| ------------------------ |
| ``pw_protobuf``'s core low-level APIs interact directly with the |
| `Protobuf wire format <https://protobuf.dev/programming-guides/encoding/>`_, |
| processing each field appearing in a message individually without any notion of |
| higher-level message semantics such as repeated or optional fields. These APIs |
| are compact and highly-capable; they are able to construct any valid protobuf |
| message, albeit by pushing much of the burden onto users to ensure that they do |
| not encode fields in violation of their messages' schemas. |
| |
| Origin |
| ^^^^^^ |
| The idea for direct wire encoding originated prior to the inception of Pigweed, |
| when the team was setting up crash reporting for a project. Crash diagnostic |
| data was transmitted from each device as a protobuf message, which was encoded |
| using `nanopb <https://jpa.kapsi.fi/nanopb/>`_, a popular lightweight, |
| embedded-friendly protobuf library for serializing and deserializing protobuf |
| data to and from C structs. |
| |
| To send crash reports, a single, statically-allocated crash message struct was |
| populated by the device's various subsystems, before being serialized to a |
| buffer and queued for transmission over the appropriate interface. The fields of |
| this struct ranged from single integers to complex nested messages. The nature |
| of nanopb in a static memory environment required each variable-length field in |
| the generated message to be reserved for its maximum allowable size, which |
| quickly blew up in the cases of large strings and repeated submessages. All in |
| all, the generated crash struct clocked in at around 12KB --- several times |
| larger than its encoded size --- a high price to pay for such a |
| memory-constrained device. |
| |
| This large overhead raised the question of whether it was necessary to store the |
| crash data in an intermediate format, or if this could be eliminated. By the |
| nature of the protobuf wire format, it is possible to build up a message in |
| parts, writing one field at a time. Due to this, it would be possible for each |
| subsystem to be passed some serializer which would allow them to write their |
| fields directly to the final output buffer, avoiding any additional in-memory |
| storage. This would be especially beneficial for variable-length fields, where |
| systems could write only as much data as they had at the moment, avoiding the |
| overhead of worst-case reservations. ``pw_protobuf`` was conceptualized as this |
| type of wire serializer, providing a convenient wrapper around direct |
| field-by-field serialization. |
| |
| While the project ended up shipping with their original ``nanopb`` setup, a |
| prototype of this serializer was written as a proof of concept, and ended up |
| being refined to support all basic protobuf operations as one of the first |
| modules offered by the newly-started Pigweed project. |
| |
| Implementation |
| ^^^^^^^^^^^^^^ |
| The core encoders have undergone several iterations over time. The |
| :cs:`original implementation <bbf164c985576a348f3bcd4c48b3e9fd8a464a66:pw_protobuf/public/pw_protobuf/encoder.h;l=25>` |
| offered a simple API to directly serialize single protobuf fields to an |
| in-memory buffer through a series of typed ``Encode`` functions. Message |
| nesting was handled manually by the user, calling a ``Push`` function to begin |
| writing fields to a submessage, followed by ``Pop`` on completion. |
| |
| The decoder was a |
| :cs:`later addition <6d9b9b447b84afb60e714ebd97523ee55b93c9a6:pw_protobuf/public/pw_protobuf/decoder.h;l=23>`, |
| initially invoking a callback on each field in the serialized message with its |
| field number, giving the users the ability to extract the field by calling the |
| appropriate typed ``Decode`` function. This was implemented via a |
| ``DecodeHandler`` virtual interface, and it persists to this day as |
| ``CallbackDecoder``. However, this proved to be too cumbersome to use, so the |
| main decoder was :cs:`rewritten <fe9723cd67796e9236022cde6ef42cda99682d77>` |
| in the style of an iterator where users manually advanced it through the |
| serialized fields, decoding those which they cared about. |
| |
| Streaming enhancement |
| ^^^^^^^^^^^^^^^^^^^^^ |
| The original encoder and decoder were designed to operate on messages which fit |
| into buffers directly in memory. However, as the ``pw_stream`` interface was |
| stabilized and adopted, there was interest in processing protobuf messages whose |
| data was not fully available (for example, reading out of flash |
| sector-by-sector). This prompted another rewrite of the core classes to make |
| ``pw::Stream`` the interface to the serialized data. This was done differently |
| for the encoder and decoder: the encoder only operates on streams, with |
| ``MemoryEncoder`` becoming a shallow wrapper instantiating a ``MemoryWriter`` on |
| top of a buffer, whereas the decoder ended up having two separate, parallel |
| ``StreamDecoder`` and ``MemoryDecoder`` implementations. |
| |
| The reason for this asymmetry has to do with the manner in which the two were |
| implemented. The encoder was |
| :cs:`rewritten first <0ed221cbb8b943205dea4ac315fe1d4b1e6b7371>`, |
| and carefully designed to function on top of the limited semantic guarantees |
| offered by ``pw_stream``. Following this redesign, it seemed obvious and natural |
| to use the existing MemoryStream to provide the previous encoding functionality |
| nearly transparently. However, when reviewing this implementation with the |
| larger team, several potential issues were noted. What was previously a simple |
| memory access to write a protobuf field became an expensive virtual call which |
| could not be elided. The common use case of serializing a message to a buffer |
| had become significantly less performant, prompting concerns about the impact of |
| the change. Additionally, it was noted that this performance impact would be far |
| worse on the decoding side, where serialized varints had to be read one byte at |
| a time. |
| |
| As a result, it was decided that a larger analysis was required. To aid this, |
| the stream-based decoder would be implemented separately to the existing memory |
| decoder so that direct comparisons could be made between the two |
| implementations. Unfortunately, the performance of the two implementations was |
| never properly analyzed as the team became entangled in higher priority |
| commitments. |
| |
| .. code-block:: c++ |
| |
| class StreamEncoder { |
| public: |
| constexpr StreamEncoder(stream::Writer& writer, ByteSpan scratch_buffer); |
| |
| Status WriteUint32(uint32_t field_number, uint32_t value); |
| Status WriteString(uint32_t field_number, std::string_view value); |
| }; |
| |
| *A subset of the StreamEncoder API, demonstrating its low-level field writing |
| operations.* |
| |
| Wire format code generation |
| --------------------------- |
| ``pw_protobuf`` provides lightweight generated code wrappers on top of its core |
| wire format encoder and decoder which eliminate the need to provide the correct |
| field number and type when writing/reading serialized fields. Each generated |
| function calls directly into the underlying encoder/decoder API, in theory |
| making them zero-overhead wrappers. |
| |
| The encoder codegen was part of the original implementation of ``pw_protobuf``. |
| It constituted a ``protoc`` plugin written in Python, and several GN build |
| templates to define protobuf libraries and invoke ``protoc`` on them to create |
| a C++ target which could be depended on by others. The build integration was |
| added separately to the main protobuf module, as ``pw_protobuf_compiler``, and |
| has since expanded to support many different protobuf code generators in various |
| languages. |
| |
| The decoder codegen was added at a much later date, alongside the struct object |
| model. Like the encoder codegen, it defines wrappers around the underlying |
| decoder functions which populate values for each of a message's fields, though |
| users are still required to manually iterate through the message and extract |
| each field. |
| |
| .. code-block:: c++ |
| |
| class FooEncoder : public ::pw::protobuf::StreamEncoder { |
| Status WriteBar(uint32_t value) { |
| return ::pw::protobuf::StreamEncoder::WriteUint32( |
| static_cast<uint32_t>(Fields::kBar), value); |
| } |
| }; |
| |
| *An example of how a generated encoder wrapper calls into the underlying |
| operation.* |
| |
| Message API |
| ----------- |
| The ``Message`` API was the first attempt at providing higher-level semantic |
| wrappers on top of ``pw_protobuf``'s direct wire serialization. It was developed |
| in conjunction with the implementation of Pigweed's software update flow for a |
| project and addressed several use cases that came up with the way the project |
| stored its update bundle metadata. |
| |
| This API works on the decoding side only, giving users easier access to fields |
| of a serialized message. It provides functions which scan a message for a field |
| using its field number (similar to the ``Find`` APIs discussed later). However, |
| instead of deserializing the field and returning its data directly, these APIs |
| give the user a typed handle to the field which can be used to read it. |
| |
| These field handles apply protobuf semantics beyond the field-by-field iteration |
| of the low level decoder. For example, a field can be accessed as a repeated |
| field, whose handle provides a C++ iterator over each instance of the field in |
| the serialized message. Additionally, ``Message`` is the only API currently in |
| ``pw_protobuf`` which allows users to work directly with protobuf ``map`` |
| fields, reading key-value pairs from a message. |
| |
| .. code-block:: c++ |
| |
| // Parse repeated field `repeated string rep_str = 5;` |
| RepeatedStrings rep_str = message.AsRepeatedString(5); |
| // Iterate through the entries. For iteration |
| for (String element : rep_str) { |
| // Process str |
| } |
| |
| // Parse map field `map<string, bytes> str_to_bytes = 7;` |
| StringToBytesMap str_to_bytes = message.AsStringToBytesMap(7); |
| // Access the entry by a given key value |
| Bytes bytes_for_key = str_to_bytes["key"]; |
| // Or iterate through map entries |
| for (StringToBytesMapEntry entry : str_to_bytes) { |
| String key = entry.Key(); |
| Bytes value = entry.Value(); |
| // Process entry |
| } |
| |
| *Examples of reading repeated and map fields from a serialized protobuf using |
| the Message API.* |
| |
| Message structures |
| ------------------ |
| ``pw_protobuf``'s message structure API is its premier high-level, in-memory |
| object model. It was contributed by an external team with some guidance from |
| Pigweed developers and was driven largely by a desire to work conveniently with |
| protobufs in RPC methods without the burden of a third-party dependency in |
| ``nanopb`` (the only officially supported protobuf library in RPC at the time). |
| |
| Message structures function similarly to more conventional protobuf libraries, |
| where every definition in a ``.proto`` file generates a corresponding C++ |
| object. In the case of ``pw_protobuf``, these objects are defined as structs |
| containing the fields of their protobuf message as members. Functions are |
| provided to encode from or decode to one of these structs, removing the manual |
| per-field processing from the lower-level APIs. |
| |
| Each field in a protobuf message becomes an inline member of its generated |
| struct. Protobuf types are mapped to C++ types where possible, with special |
| handling of protobuf specifiers and variable-length fields. Fields labeled as |
| optional are wrapped in a ``std::optional`` from the STL. Fields labeled as |
| ``oneof`` are not supported (in fact, the code generator completely ignores the |
| keyword). Variable-length fields can either be inlined or handled through |
| callbacks invoked by the encoder or decoder when processing the message. If |
| inlined, a container sized to a user-specified maximum length is generated. For |
| strings, this is a ``pw::InlineString`` while most other fields use a |
| ``pw::Vector``. |
| |
| Similar to nanopb, users can pass options to the ``pw_protobuf`` generator |
| through the protobuf compiler to configure their generated message structures. |
| These allow specifying the maximum size of variable-length fields, setting a |
| fixed size, or forcing the use of callbacks for encoding and decoding. Options |
| maybe be specified inline in the proto file or listed in a separate file |
| (conventionally named ``.options``) to avoid leaking ``pw_protobuf``-specific |
| metadata into protobuf files that may be shared across multiple languages and |
| protobuf compiler contexts. |
| |
| Unlike the lower-level generated classes which require custom per-field encoding |
| and decoding functions, message serialization is handled generically through the |
| use of a field descriptor table. The descriptor table for a message contains an |
| entry for each of its fields, storing its type, field number, and other metadata |
| alongside its offset within the generated message structure. This table is |
| generated once per message defined in a protobuf file, trading a small |
| additional memory overhead for reduced code size when serializing and |
| deserializing data. |
| |
| .. code-block:: proto |
| |
| message Customer { |
| int32 age = 1; |
| string name = 2; |
| optional fixed32 loyalty_id = 3; |
| } |
| |
| .. code-block:: c++ |
| |
| struct Customer::Message { |
| int32_t age; |
| pw::InlineString<32> name; |
| std::optional<uint32_t> loyalty_id; |
| }; |
| |
| *Example of how a protobuf message definition is converted as a C++ struct.* |
| |
| Find API |
| -------- |
| ``pw_protobuf``'s set of ``Find`` APIs constitute functions for extracting |
| single fields from serialized messages. The functions scan the message for a |
| field number and decode it as a specified protobuf type. Like the core |
| serialization APIs, there are two levels to ``Find``: direct low-level typed |
| functions, and generated code functions that invoke these for named protobuf |
| fields. |
| |
| Extracting a single field is a common protobuf use case, and was envisioned |
| early in ``pw_protobuf``'s development. An initial version of ``Find`` was |
| started shortly after the original callback-based decoder was implemented, |
| providing a ``DecodeHandler`` to scan for a specific field number in a message. |
| This version was never fully completed and did not see any production use. More |
| recently, the ``Find`` APIs were revisited and reimplemented on top of the |
| iterative decoder. |
| |
| .. code-block:: c++ |
| |
| pw::Result<uint32_t> age = Customer::FindAge(serialized_customer); |
| if (age.ok()) { |
| PW_LOG_INFO("Age is %u", age.value()); |
| } |
| |
| *An example of using a generated Find function to extract a field from a |
| serialized protobuf message.* |
| |
| RPC integration |
| --------------- |
| Pigweed RPC exchanges data in the form of protobufs and was designed to allow |
| users to implement their services using different protobuf libraries, with some |
| supported officially. Supporting the use of ``pw_protobuf`` had been a goal from |
| the beginning, but it was never implemented on top of the direct wire encoders |
| and decoders. Despite this, several RPC service implementations in Pigweed and |
| customer projects ended up using ``pw_protobuf`` on top of the raw RPC method |
| API, manually decoding and encoding messages. |
| |
| When message structures were contributed, they came with an expansion of RPC to |
| allow their usage in method implementations, becoming the second officially |
| supported protobuf library. ``pw_protobuf`` methods are structured and behave |
| similarly to RPC's nanopb-based methods, automatically deserializing requests |
| from and serializing responses to their generated message structures. |
| |
| What Works Well |
| =============== |
| Overall, ``pw_protobuf`` has been a largely successful module despite its |
| growing pains. It has become an integral part of Pigweed, used widely upstream |
| across major components of the system, including logging and crash reporting. |
| Several Pigweed customers have also shown to favor ``pw_protobuf``, choosing it |
| over other embedded protobuf libraries like nanopb. |
| |
| The list below summarizes some of ``pw_protobuf``'s successes. |
| |
| **Overall** |
| |
| * Widespread adoption across Pigweed and Pigweed-based projects. |
| |
| * Easy to integrate into a project which uses Pigweed's build system. |
| |
| * Often comes at a minimal additional cost to projects, as the core of |
| ``pw_protobuf`` is already used by popular upstream modules. |
| |
| **Core wire format encoders/decoders** |
| |
| * Simple, intuitive APIs which give users a lot of control over the structure |
| of their messages. |
| |
| * Lightweight in terms of code size and memory use. |
| |
| **Codegen general** |
| |
| * Build system integration is extensive and generally simple to use. |
| |
| * Low-level codegen wrappers are convenient to use without sacrificing the |
| power of the underlying APIs. |
| |
| **Message API** |
| |
| * Though only used by a single project, it works well for their needs and |
| gives them extensive semantic processing of serialized messages without the |
| overhead of decoding to a full in-memory object. |
| |
| * More capable processing than the Find APIs: for example, allowing iteration |
| over elements of a repeated field. |
| |
| * As the entire API is stream-based, it permits useful operations such as |
| giving the user a bounded stream over a bytes field of the message, |
| eliminating the need for an additional copy of data. |
| |
| * Support for protobuf maps, something which is absent from any other |
| ``pw_protobuf`` API. |
| |
| **Message Structures** |
| |
| * Message structures work incredibly well for the majority of simple use cases, |
| making protobufs easy to use without having to understand the details of the |
| wire format. |
| |
| * Adoption of ``pw_protobuf`` increased following the addition of this API and |
| corresponding RPC support, indicating that it is more valuable to a typical |
| user who is not concerned with the minor efficiencies offered by the |
| lower-level APIs. |
| |
| * Encoding and decoding messages is efficient due to the struct model's generic |
| table-based implementation. Users do not have to write custom code to process |
| each message as they would with the lower-level APIs, resulting in reduced |
| overall code size in some cases. |
| |
| * Nested messages are far easier to handle than in any other API, which require |
| additional setup creating sub-encoders/decoders. |
| |
| * The use of containers such as ``pw::Vector`` for repeated fields simplifies |
| their use and avoids the issues of similar libraries such as nanopb, where |
| users have to remember to manually set their length. |
| |
| **Find API** |
| |
| * Eliminates a lot of boilerplate in the common use case where only a single |
| field from a message needs to be read. |
| |
| **RPC integration** |
| |
| * Has seen a high rate of adoption as it provides a convenient API to read and |
| write requests and responses without requiring the management of a third-party |
| library dependency. |
| |
| * ``pw_protobuf``-based RPC services can still fall back on the raw RPC API in |
| instances where more flexible handling is required. |
| |
| The Issues |
| ========== |
| |
| Overview |
| -------- |
| This section shows a summary of the known issues present at each layer of the |
| current ``pw_protobuf`` module. Several of these issues will be explored in |
| further detail later. |
| |
| **Overall** |
| |
| * Lack of an overall vision and cohesive story: What is ``pw_protobuf`` trying |
| to be and what kinds of users does it target? Where does it fit into the |
| larger protobuf ecosystem? |
| |
| * Documentation structure doesn't clearly guide users. Should be addressed in |
| conjunction with the larger :ref:`SEED-0102 <seed-0102>` effort. |
| |
| * Too many overlapping implementations. We should focus on one model with a |
| clear delineation between its layers. |
| |
| * Despite describing itself as a lightweight and efficient protobuf library, |
| little size reporting and performance statistics are provided to substantiate |
| these claims. |
| |
| **Core wire format encoders/decoders** |
| |
| * Parallel memory and stream decoder implementations which don't share any code. |
| They also have different APIs, e.g. using ``Result`` (stream decoder) vs. a |
| ``Status`` and output pointer (memory decoder). |
| |
| * Effectively-deprecated APIs still exist (e.g. ``CallbackDecoder``). |
| |
| * Inefficiencies when working with varints and streams. When reading a varint |
| from a message, the ``StreamDecoder`` consumes its stream one byte at a time, |
| each going through a potentially costly virtual call to the underlying |
| implementation. |
| |
| **Codegen general** |
| |
| * The headers generated by ``pw_protobuf`` are poorly structured. Some users |
| have observed large compiler memory usage parsing them, which may be related. |
| |
| * Each message in a ``.proto`` file generates a namespace in C++, in which its |
| generated classes appear. This is unintuitive and difficult to use, with most |
| users resorting to a mess of using statements at the top of each file that |
| works with protobufs. |
| |
| * Due to the way ``pw_protobuf`` appends its own namespace to users' proto |
| packages, it is not always possible to deduce where this namespace will exist |
| in external compilation units. To work around this, a somewhat hacky approach |
| is used where every generated ``pw_protobuf`` namespace is aliased within a |
| root-level namespace scope. |
| |
| * While basic codegen works in all build systems, only the GN build supports |
| the full capabilities of ``pw_protobuf``. Several essential features, such as |
| options files, are missing from other builds. |
| |
| * There appear to be issues with how the codegen steps are exposed to the CMake |
| build graph, preventing protobuf files from being regenerated as a result of |
| some codegen script modifications. |
| |
| * Protobuf editions, the modern replacement for the proto2 and proto3 syntax |
| options, are not supported by the code generator. Files using them fail to |
| compile. |
| |
| **Message API** |
| |
| * The message API as a whole has been superseded by the structure API, and there |
| is no reason for it to be used. |
| |
| **Message structures** |
| |
| * Certain types of valid proto messages are impossible to represent due to |
| language limitations. For example, as message structs directly embed |
| submessages, a circular dependency between nested messages cannot exist. |
| |
| * Optional proto fields are represented in C++ by ``std::optional``. This has |
| several issues: |
| |
| * Memory overhead as a result of storing each field's presence flag |
| individually. |
| |
| * Inconsistent with how other protobuf libraries function. Typically, field |
| presence is exposed through a separate API, with accessors always |
| returning a value (the default if absent). |
| |
| * Not all types of fields are supported. Optional strings and optional |
| submessages do not work (the generator effectively ignores the ``optional`` |
| specifier). ``oneof`` fields do not work. |
| |
| * Not all options work for all fields. Fixed/max size specifiers to inline |
| repeated fields generally only work for simple field types --- callbacks must |
| be used otherwise. |
| |
| * In cases where the generator does not support something, it often does not |
| indicate this to the user, silently outputting incorrect code instead. |
| |
| * Options files share both a filename and some option names with other protobuf |
| libraries, namely Nanopb. This can cause issues when trying to use the same |
| protobuf definition in different contexts, as the options do not always work |
| the same way in both. |
| |
| **Find API** |
| |
| * Lack of support for repeated fields. Only the first element will be found. |
| |
| * Similarly, does not support recurring non-repeated fields. The protobuf |
| specification requires that scalar fields are overridden if they reappear, |
| while string, bytes, or submessage fields are merged. |
| |
| * Only one layer of searching is supported; it is not possible to look up a |
| nested field. |
| |
| * The stream version of the Find API does not allow scanning for submessages due |
| to limitations with the ownership and lifetime of its decoder. |
| |
| **RPC integration** |
| |
| * RPC creates and runs message encoders and decoders for the user. Therefore, it |
| is not possible to use any messages with callback-based fields in RPC method |
| implementations. |
| |
| Deep dive on selected issues |
| ---------------------------- |
| |
| Generated namespaces |
| ^^^^^^^^^^^^^^^^^^^^ |
| ``pw_protobuf``'s generator was written to output a namespace for each message |
| in a file from its first implementation, on top of which all subsequent |
| generated code was added. |
| |
| The reason for this unusual design choice was to work around C++'s |
| declaration-before-definition rule to allow circularly-referential protobuf |
| messages. Each message's associated generated classes are first forward-declared |
| at the start of the generated header, and later defined as necessary. |
| |
| For example, given a message ``Foo``, the following code is generated: |
| |
| .. code-block:: c++ |
| |
| namespace Foo { |
| |
| // Message field numbers. |
| enum Fields; |
| |
| // Generated struct. |
| struct Message; |
| |
| class StreamEncoder; |
| class StreamDecoder; |
| |
| // Some other metadata omitted. |
| |
| } // namespace Foo |
| |
| The more intuitive approach of generating a struct/class directly for each |
| message is difficult, if not impossible, to cleanly implement under the current |
| ``pw_protobuf`` object model. There are several reasons why this is, with the |
| primary being that cross-message dependencies cannot easily be generated due to |
| the aforementioned declaration issues. C++ does not allow forward-declaring a |
| subclass, so certain types of nested message relationships are not directly |
| representable. Some potential workarounds have been suggested for this, such as |
| defining struct members as aliases to internally-generated types, but we have |
| been unable to get this correctly working following a timeboxed prototyping |
| session. |
| |
| Message structures |
| ^^^^^^^^^^^^^^^^^^ |
| Many of the issues with message structs stem from the same language limitations |
| as those described above with namespacing. As the generated structures' members |
| are embedded directly within them and publicly exposed, it is not possible to |
| represent certain types of valid protobuf messages. Additionally, the way |
| certain types of fields are generated is problematic, as described below. |
| |
| **Optional fields** |
| |
| A field labeled as ``optional`` in a proto file generates a struct member |
| wrapped in a ``std::optional`` from the C++ STL. This choice is semantically |
| inconsistent with how the official protobuf libraries in other languages are |
| designed. Typically, accessing a field will always return a valid value. In the |
| case of absence, the field is populated with its default value (the zero value |
| unless otherwise specified). Presence checking is implemented as a parallel API |
| for users who require it. |
| |
| This choice also results in additional memory overhead, as each field's presence |
| flag is stored within its optional wrapper, padding what could otherwise be a |
| single bit to a much larger aligned size. In the conventional disconnected model |
| of field presence, the generated object could instead store a bitfield with an |
| entry for each of its members, compacting its overall size. |
| |
| Optional fields are not supported for all types. The compiler ignores the |
| ``optional`` specifier when it is set on string fields, as well as on nested |
| messages, generating the member as a regular field and serializing it per |
| standard ``proto3`` rules, omitting a zero-valued entry. |
| |
| Implementing ``optional`` the typical way would require hiding the members of |
| each generated message, instead providing accessor functions to modify them, |
| checking for presence and inserting default values where appropriate. |
| |
| **Oneof fields** |
| |
| The ``pw_protobuf`` code generator completely ignores the ``oneof`` specifier |
| when processing a message. When multiple fields are listed within a ``oneof`` |
| block in a ``.proto`` file, the generated struct will contain all of them as |
| direct members without any notion of exclusivity. This permits ``pw_protobuf`` |
| to encode semantically invalid protobuf messages: if multiple members of a |
| ``oneof`` are set, the encoder will serialize all of them, creating a message |
| that is unprocessable by other protobuf libraries. |
| |
| For example, given the following protobuf definition: |
| |
| .. code-block:: proto |
| |
| message Foo { |
| oneof variant { |
| uint32 a = 1; |
| uint32 b = 2; |
| } |
| } |
| |
| The generator will output the following struct, allowing invalid messages to be |
| written. |
| |
| .. code-block:: c++ |
| |
| struct Foo::Message { |
| uint32_t a; |
| uint32_t b; |
| }; |
| |
| // This will work and create a semantically invalid message. |
| Foo::StreamEncoder encoder; |
| encoder.Write({.a = 32, .b = 100}); |
| |
| Similarly to ``optional``, the best approach to support ``oneof`` would be to |
| hide the members of each message and provide accessors. This would avoid the |
| risk of incorrectly reading memory (such as a wrong ``union`` member) and not |
| require manual bookkeeping as in nanopb. |
| |
| -------- |
| Proposal |
| -------- |
| |
| Short-term Plan |
| =============== |
| A full rework of ``pw_protobuf`` does not seem feasible at this point in time |
| due to limited resourcing. As a result, the most reasonable course of action is |
| to tie up the loose ends of the existing code, and leave the module in a state |
| where it functions properly in every supported use case, with unsupported use |
| cases made explicit. |
| |
| The important steps to making this happen are listed below. |
| |
| * Restructure the module documentation to help users select which protobuf API |
| is best suited for them, and add a section explicitly detailing the |
| limitations of each. |
| |
| * Deprecate and hide the ``Message`` API, as it has been superseded by the |
| ``Find`` APIs. |
| |
| * Discourage usage of message structures in new code, while providing a |
| comprehensive upfront explanation of their limitations and unsupported use |
| cases, including: |
| |
| * ``oneof`` cannot be used. |
| |
| * Inlining some types of repeated fields such as submessages is not possible. |
| Callbacks must be used to encode and decode them. |
| |
| * The use of ``optional`` only generates optional struct members for simple |
| scalar fields. More complex optional fields must be processed through |
| callbacks. |
| |
| * Update the code generator to loudly fail when it encounters an unsupported |
| message or field structure. |
| |
| * Discourage the use of the automatic ``pw_protobuf`` RPC generator due to the |
| limitations with message structures. ``nanopb`` or manually processed ``raw`` |
| methods should be used instead. |
| |
| Similarly, clearly document the limitations around callback-based messages in |
| RPCs methods, and provide examples of how to fall back to raw RPC encoding and |
| decoding. |
| |
| * Move all upstream usage of ``pw_protobuf`` away from message structures and |
| ``pwpb_rpc`` to the lower-level direct wire APIs, or rarely Nanopb. |
| |
| * Rename the options files used by ``pw_protobuf``'s message structs to |
| distinguish them from Nanopb options. |
| |
| * Make the ``pw_protobuf`` code generator aware of the protobuf edition option |
| so that message definitions using it can be compiled. |
| |
| * Extend full protobuf code generation support to the Bazel and CMake builds, as |
| well as the Android build integration. |
| |
| * Minimize the amount of duplication in the code generator to clean up generated |
| header files and attempt to reduce compiler memory usage. |
| |
| * Extend the ``Find`` APIs with support for repeated fields to bring them closer |
| to the Message API's utility. |
| |
| Long-term Plan |
| ============== |
| This section lays out a long term design vision for ``pw_protobuf``. There is no |
| estimated timeframe on when this work will happen, but the ideas are collected |
| here for future reference. |
| |
| Replace message structures |
| -------------------------- |
| As discussed above, most issues with message structures stem from having members |
| exposed directly. Obscuring the internal details of messages and providing |
| public accessor APIs gives the flexibility to fix the existing problems without |
| running against language limitations or exposing additional complexity to users. |
| |
| By doing so, the internal representation of a message is no longer directly tied |
| to C++'s type system. Instead of defining typed members for each field in a |
| message, the entire message structure could consist of an intermediate binary |
| representation, with fields located at known offsets alongside necessary |
| metadata. This avoids the declaration and aliasing issues, as types are now only |
| required to be defined at access rather than storage. |
| |
| This would require a complete rewrite which would be incompatible with the |
| current APIs. The least invasive way to handle it would be to create an entirely |
| new code generator, port over the core lower level generator functionality, and |
| build the new messages on top of it. The old API would then be fully deprecated, |
| and users could migrate over one message at a time, with assistance from the |
| Pigweed team for internal customers. |
| |
| Investigate standardization of wire format operations |
| ----------------------------------------------------- |
| ``pw_protobuf`` is one of many libraries, both within Google and externally, |
| that re-implements protobuf wire format processing. At the time it was written, |
| this made sense, as there was no convenient option that fit the niche that |
| ``pw_protobuf`` targeted. However, since then, the core protobuf team has |
| heavily invested in the development of `upb <https://github.com/protocolbuffers/upb>`_: |
| a compact, low-level protobuf backend intended to be wrapped by higher-level |
| libraries in various languages. Many of upb's core design goals align with the |
| initial vision for ``pw_protobuf``, making it worthwhile to coordinate with its |
| developers to see it may be suitable for use in Pigweed. |
| |
| Preliminary investigations into upb have shown that, while small in size, it is |
| still larger than the core of ``pw_protobuf`` as it is a complete protobuf |
| library supporting the entire protobuf specification. Not all of that is |
| required for Pigweed or its customers, so any potential reuse would likely be |
| contingent on the ability to selectively remove unnecessary parts. |
| |
| At the time of writing, upb does not have a stable API or ABI. While this is |
| okay for first-party consumers, shipping it as part of Pigweed may present |
| additional maintenance issues. |
| |
| Nonetheless, synchronizing with upb to share learnings and potentially reduce |
| duplicated effort should be an essential step in any future ``pw_protobuf`` |
| work. |