Original Author:
Ben Olmstead (aka reventlov, aka Dmitri Prime), original designer and author of Emboss
There are many, many books, articles, talks, classes, and exercises on good software design, and most general design principles apply to Emboss. In this section, I will only cover the “most important” principles and those that I do not see highlighted in many other places.
In order to avoid “second system effect,” designs that do not work in practice, and wasted effort, it is best to design to a specific problem — preferably a few instances of that problem, so that your design is more likely to solve a wide range of real world problems.
For example, in Emboss, if you wait until you have a specific data structure that is awkward or impossible to express, then try to find examples of other structures that are awkward in the same way, and then design a feature to handle those data structures, you are much more likely to come up with a solution that a) will actually be used, and b) will be used in more than one place.
Often, users will have a problem, think “I could solve this if I could do X,” and then ask for a feature for X without mentioning their original problem. As a software designer, one of the first things you should do is try to figure out the original problem — usually by asking the user some probing questions — so that you can design to the problem, not to the user's solution.
(Note that this is sometimes true even if you are the user: it is easy to get tunnel vision about a solution you came up with. Sometimes you need to step back and try to find a different solution.)
Avoid the temptation to cover every possible use case, even if some of those would generally fit within the domain of your project. A project like Emboss will attract extremely specific requests — requests whose solutions do not generalize.
Instead of trying to cover every use case for every user, leave “escape hatches” in your design, so that users can use Emboss for the cases it covers, and integrate their own solutions in the places that Emboss does not cover.
There will always be formats that Emboss cannot handle without becoming an actual programming language — even something as “basic” as compression is generally beyond what Emboss is meant to be capable of.
Emboss has strong backwards-compatibility guarantees: in particular, once a feature is “released,” support for that feature is guaranteed more or less forever. Because of this, new features should be narrow, even if there are “obvious” expansions, and even if narrowing the feature actually takes more code in the compiler. You can always expand a feature later, but narrowing it or cutting it out would break Emboss's support guarantees.
Although this principle is very standard for professional, publicly-released software, it may be a culture shock to developers who are used to “monorepo”[^mono] environments such as Google — it is not possible to just update all users in the real world! Note that even many of Google's open source projects, such as Abseil, require their users to periodically update their code to the latest conventions, which imposes a cost on users of those projects. Emboss is intended for smaller developers and embedded systems, which often do not have the resources for such migrations.
[^mono]: In the several years that Emboss spent inside Google's monorepo it underwent many large, backwards-incompatible changes that made the current language significantly better. Early incubation in a controlled environment can be valuable for a new language!
Emboss uses `$` in many keyword names, but does not allow `$` to be used in user identifiers — this lets Emboss add `$` keywords without worrying about colliding with identifiers in existing code. (This is in direct contrast to most programming languages, where introducing new keywords often breaks existing code.)
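For illustration, a small sketch (the struct and field names are invented; `$next` is an existing Emboss keyword):

```
struct Record:
  # User identifiers such as "payload_size" may never contain "$", so a new
  # "$" keyword (like the existing $next) cannot collide with user code.
  0      [+1]  UInt  payload_size
  $next  [+1]  UInt  flags
```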
As another example, Emboss disallows identifiers that collide with keywords in many programming languages — this gives room for Emboss to add back ends for those programming languages later, without having to figure out a convention for mangling identifiers that collide. As a real-world counterexample, Protocol Buffers had to figure out a convention for handling field names that collide with C++ identifiers such as `class` — and `protoc` still generates broken C++ code if you have two fields named `class` and `class_` in the same `message`.
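A hedged sketch of what this looks like from the user's side (the struct and field names are invented):

```
struct Example:
  # A field named "class" would be rejected by embossc: "class" is a keyword
  # in C++, Java, and other plausible back-end languages.
  #   0 [+1]  UInt  class
  0 [+1]  UInt  message_class  # OK: collides with no back-end keyword.
```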
An “extension point” is a place where someone should be able to hook into the system without changing the system. This can be an API, a “hook,” a defined data format, or something else entirely, but the defining factor is that it is a way to add new features or alter behavior without changing the existing software.
In practice, many extension points won't “just work” until there are at least a few things using them, due to bugs or unexpected coupling, but in principle they should not require any modification.
One extension point in the Emboss compiler is the full separation between front and back ends, so that future back ends (such as Rust, Protocol Buffers, PDF documentation, etc.) can be added without changing the overall design or (theoretically) any of the existing compiler.[^ext]
[^ext]: This is not unique or original to Emboss: separate front and back ends are totally standard in modern compiler design.
In the physical world, an electrical outlet or a network port is an extension point — there is nothing there right now, but there is a defined place for something to be added later.
A “line of cleavage” is similar to an extension point, except that instead of being a ready-to-go place to add something new, it's a place where the major work was done, but there are still some pieces that need to be fixed up.
A line of cleavage in the Emboss compiler is the use of a special `.emb` file (`prelude.emb`) to define “built-in” types, with the aim of eventually allowing end users to define their own types at the same level. This feature still has some open design decisions, but these are relatively minor compared to the larger question of “how can Emboss allow end users to define their own basic types?”
In software, lines of cleavage are usually invisible to end users, and can be difficult to see even for developers working on the code.
In the physical world, an example of this is putting empty conduit into walls or ceilings: that way, new electrical or communication wires or pneumatic tubes can be pulled through the conduit and attached to new outlets, without having to open up all the walls.
Every complex software system has a cloud of potential features around it: features which, for one reason or another, have not been implemented yet, but which some stakeholder(s) want. These features usually exist at every stage from “idle thought in a developer's mind” to “partially implemented, but not finished,” and the likelihood of each one becoming a finished feature covers an equally wide range.
When designing a new feature there are very good reasons to think about these potential features:
First, you should ensure that your new feature does not make another highly-desirable feature impossible. In Emboss, for example, if your new feature made it impossible to support a string type, that would be a very good reason to redesign your feature (or abandon it, if it is fundamentally incompatible).
Second, sometimes you can tweak your design so that a potential feature becomes obsolete: fundamentally, every feature request exists to solve a problem, and often it is not the only way to solve that problem. If you can solve it in a different way, you can make users happy and avoid some future work. (Though be careful: it can be difficult to infer the full scope of a user's problem(s) from a feature request.)
Third, thinking about specific potential features can help narrow the amount of “future design space” that you need to consider, which makes it easier to put extension points and lines of cleavage in your design in places where they will actually be used.
In contrast to general software design principles, there are far fewer sources on good language design. I speculate that this is because there are far fewer language designers than software designers. (There are tens of millions of software developers, but only tens of thousands of programming, markup, and data definition languages — and of those, maybe two thousand or so are “serious” languages with significant real-world use.)
Luckily, there are many publicly available and documented languages to learn from directly.
Language design can be very roughly divided into syntactic and semantic concerns: syntax is how the language looks (what symbols and keywords are used, and in what order), while semantics cover how the language works (what actually happens). It might seem like semantics are more important, but syntax has a huge effect on how easy it is to understand existing code and to write correct code, which are both incredibly important in real-world use.
In this section, I will try to outline language design principles that I have found or developed, particularly when they are useful for Emboss.
Turing-complete languages cannot be fully analyzed. This is one of the reasons that languages like HTML and CSS are not programming languages: the more expressive a language is, the more difficult it is to analyze.
The `.emb` format is intended to be more on the declarative side, so that definitions can be analyzed and transformed as necessary.
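For example, a minimal definition (invented for illustration) that a tool can fully analyze without executing anything:

```
struct Point:
  # Declarative: the definition says what the bytes mean, not how to read
  # them, so sizes, offsets, and overlaps can all be checked statically.
  0 [+1]  UInt  x
  1 [+1]  UInt  y
```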
Although Emboss is a data definition language (DDL), not a programming language, many lessons and principles from programming language design can be applied, as well as lessons from other DDLs, and sometimes even interface definition languages (IDLs), as well as markup and query languages.
In particular, for Emboss it is often worth looking at:
Popular programming languages: C, C++, Rust, JavaScript, TypeScript, C#, Java, Go, Python 3, Swift, Objective C, Lua. “Systems” programming languages such as C, C++, and Rust are usually the most relevant of these, but it is useful to survey all the popular languages because many Emboss users will be familiar with them. Note that Lua is used for Wireshark packet definitions.
Selected “interesting” programming languages: Wuffs, Haskell, Ocaml, Agda, Coq. These have some lessons for Emboss, especially its expression system — in particular, they're all much more principled than “standard” programming languages about how they handle types and values. There are many other programming languages that have interesting ideas (FORTH, Prolog, D, Perl, Logo, Scratch, APL, so-called “esoteric” programming languages), but they usually are not relevant to Emboss.
DDLs: Kaitai Struct, Protocol Buffers, Cap'n Proto, SQL-DDL. Kaitai Struct is the closest of these to solving the same problem as Emboss (though it has some fundamentally different design decisions which make it far worse for embedded systems), but all have some lessons. Some higher-level schema languages like DTD, XML Schema, or JSON Schema tend to be less relevant to Emboss. Note that there are a number of DDLs that are also IDLs: in actual use, some of them (Protocol Buffers) are used more often for their DDL features, while others (XPIDL, COM) are used more for their IDL features.
Many (most?) languages are designed by people who have minimal knowledge of the academic theories of how programming languages work — for Emboss, Category Theory is particularly useful, and the computer science of parsers (especially LR(1) parsers) is useful for tweaking the parser generator or adding new syntax.
This is a case where a little bit of learning goes a long way: you do not need to learn a lot about parsers or Category Theory to benefit from them.
Many of the academic topics related to programming language design have corresponding industrial knowledge, and there are practical concerns that have very little to do with academic theory.
The Emboss compiler is (loosely) based on the design of LLVM, with a series of transformation passes that operate somewhat independently, and independent back end code generators.[^designoops]
[^designoops]: After many years of experience with this, I think that this is not quite the right design for Emboss, and I would make two major changes: first (and simplest), I would divide the current “front end” into a true front end that only handled syntax and some types of syntax sugar, and a “middle end” that handled all of the symbol resolution, bounds analysis, constraint checking, etc. Second, I would use a “compute-on-demand” (lazy evaluation) approach in the middle end, which would allow certain operations to be decoupled. The LLVM design is more suited for independent optimization passes, not for the kind of gradual annotation process in the Emboss middle end.
As another example, understanding how (and how well) Clang, GCC, and MSVC can optimize C++ code is crucial to generating high-performance code from Emboss (and Emboss leans very heavily on the C++ compiler to optimize its output).
Some bits of practical knowledge are tiny little bits of almost-trivia. For example, if you have C or C++ code in a (text) template, and you use `$` to indicate substitution variables (as in `$var` or `$var$`), then most editors and code formatters will treat your substitution variables as normal identifiers. This is because almost every C and C++ compiler allows you to use `$` in identifiers, even though there has never been a C or C++ standard that allows those names, and it is rarely noted in any compiler, editor, or formatter's documentation.
Emboss pulls many conventions from programming, data definition, and markup languages. In general, if there is a feature in Emboss that works in a way that is the same as in other languages, it is best to pull syntax from elsewhere — ideally, pull in the most common syntax. Many examples of this in Emboss are so common you might not even think about them:

*   Infix arithmetic operators (`+`, `-`, `*`)
*   Operator precedence (`*` binds more tightly than `+` and `-`, but also: see the next section)

Other examples are more specific, with no universal convention:

*   The `: Type` syntax for type annotation (TypeScript, Python, Ocaml, Rust, ...)

This is especially important for Emboss, because most people reading or writing Emboss code will not want to spend much time becoming an “Emboss expert” — where someone might be willing to spend days or weeks to learn how to write Rust code, they are more likely to spend hours or minutes learning to write Emboss.
However, there are three main reasons to avoid using existing syntax:
The “standard” syntax is error prone. One example of this is operator precedence in most programming languages: errors related to not knowing the relative precedence of `&&` and `||` are so common that most compilers have an option to warn if they are mixed without parentheses. Emboss handles this — and a few other error-prone constructs — by having a partial ordering for precedence instead of the standard total ordering, and making it a syntax error to mix operators such as `&&` and `||` that have incomparable (neither equal, less than, nor greater than) precedence. As far as I can tell, this is an innovation original to Emboss: there is no precedent (no pun intended) whatsoever for a partial precedence order.
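A sketch of how this shows up in an Emboss expression (the struct and fields are invented):

```
struct Message:
  0 [+1]  UInt  kind
  1 [+1]  UInt  flags
  # OK: parentheses make the grouping explicit.
  let ok = (kind == 1 && flags == 0) || kind == 2
  # Syntax error: && and || have incomparable precedence, so mixing them
  # without parentheses is rejected.
  #   let bad = kind == 1 && flags == 0 || kind == 2
```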
When avoiding syntax in this way, it is ideal to make the standard syntax into a syntax error (so that no one can use it accidentally) and to add an error message to the compiler that suggests the correct syntax.
The existing syntax is not used consistently: if multiple programming languages use the same syntax for slightly different semantics, it is usually worth avoiding the syntax. For example, `/` has quite a few different semantics — in many languages, it is a type-parameterized division, where the numeric result depends on the (static or dynamic) types of its operands, and across languages, the “integer division” flavor is not consistent — in most programming languages it is truncating division (`-7 / 3 == -2`), but in some programming languages it is flooring division (`-7 / 3 == -3`).
The semantics do not match: if an Emboss feature is almost, but not quite equivalent to a feature in other languages, it is best to avoid making the Emboss feature look like the other feature.
When designing a new feature, try to come up with several alternatives and poll Emboss users (or sometimes non-Emboss-using programmers) as to which one they prefer.
For syntax, one especially powerful technique is to show an example of the proposed syntax to people who have never seen it, and ask “what do you think this means?” without any hinting or prompting. This is the “gold standard” way of finding out whether your syntax is clear or not.
Computing now has roughly seventy years of experience with artificial languages (in programming, markup, data definition, query, etc. flavors), and we have learned a lot about what kinds of constructs are error-prone for humans to use. Avoid these, where possible! Some examples include:
Large semantic differences should not have small, easily-overlooked syntactic differences. For example, allowing single- and double-character operators (`=` and `==`, `|` and `||`, etc.) in the same contexts: a classic C-family programming error is to use `=` in a condition instead of `==`. Many modern languages either force `=` to be used only in “statement context” (and some, like C#, also ban side-effectless statements such as `x == y;`) or use a different operator like `:=` for assignment. (Or both, as in Python, which allows `:=` but not `=` for “expression assignment.”)
Syntax should have consistent semantic meaning. For example, in JavaScript these two snippets mean the same thing:

```
return f() + 10;
```

```
return f() +
    10;
```

but this one is different (it returns `undefined`, thanks to JavaScript's automatic `;` insertion):

```
return
    f() + 10;
```

A small difference in the placement of the line break leads to totally different semantics!
C++ has a number of places where identical syntax can have wildly different semantics, especially (ab)use of operator overloads and the most vexing parse.
Hoare calls “null” his “billion-dollar mistake,” and the way that null pointers are handled in most programming languages, especially C and C++, is particularly error-prone. (But note that it isn’t really “null” itself that is problematic — it’s that there is no way to mark a pointer as “not null,” and that doing anything with a null pointer leads to undefined behavior. However, some popular language features, such as the `?.` operator found in several programming languages and the `std::optional<>` type in C++, show that there is some utility to nullable types, as long as there is language support for enforcing null checks and/or allowing null to propagate in the same way that NaN can.)
Edge cases, such as integer overflow, are difficult for humans to reason about. In systems programming languages like C and C++, this leads to a significant percentage of security flaws. (C and C++ compilers use the “integer overflow is undefined” rule extensively in optimization, so there are pragmatic trade-offs in general. Emboss is used in smaller contexts with tighter safety guarantees.)
Emboss sits in a section of design space that has very few alternatives, and as a result there are things to think about when designing Emboss features that do not apply to many other languages.
Also, because Emboss already exists, there are a number of systems within Emboss-the-language that may interact with new features.
And finally, if you want your feature to actually be implemented, it is necessary to consider how difficult it would be to implement in `embossc`.
Maybe the least fun (at least for me[^unfun]) part of designing Emboss features is reading through data sheets, programming manuals, RFCs, and user guides to understand the data formats used in the real world, so that any new feature can handle a reasonable subset of those formats. Some sources to consider:
[^unfun]: One of my original motivations for creating Emboss is that I find reading data sheets and implementing code to read/write the data formats therein to be extremely tedious.
The “heart” of Emboss is what may be called the “structure layout system”: the engine that determines which bits to read and write in order to produce or commit the values of fields. When designing, consider:
Does this feature require reaching “outside” of a scope? For example, referencing a sibling field from within a field's scope is currently impossible, because each field has its own scope. Allowing `[requires: this == sibling]` means expanding that scope (see the sketch after these questions).
Does this feature require information that is not (currently) available to the layout engine, or not available at the right place or time? For example, if you are designing a feature to allow field sizes to be `$auto`, how does that interact with structures that are variable size?
Does this feature require information that is potentially circular, or would it interact with another potential feature to require circular information, and is there a way to resolve that? For example: if you are designing a feature to allow field sizes to be `$auto`, inferring their size from their type, how will that interact with the potential feature to allow `struct`s that grow to the size they are given?
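To make the first question concrete, here is a hypothetical sketch (this is not valid Emboss today, precisely because a field's `[requires: ...]` scope cannot see its siblings):

```
struct Frame:
  0 [+1]  UInt  header_length
  1 [+1]  UInt  payload_length
    # Hypothetical: "this" is in scope, but the sibling field "header_length"
    # is not, so accepting this constraint means expanding the field's scope.
    [requires: this >= header_length]
```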
Although most expressions in Emboss definitions are simple (such as `x*4` or even just `0`), the expression system in Emboss tracks a lot of information, such as the type of each subexpression and, for integer expressions, the range of values it can take.
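For example (a small sketch with invented names):

```
struct Sample:
  0 [+1]  UInt  raw_value
  # The expression system knows the type of each subexpression and the range
  # of possible values: raw_value is 0..255, so scaled_value is 0..2550.
  let scaled_value = raw_value * 10
```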
When designing a feature, consider:
Will the values handled by the new feature be `opaque` to the expression system, or will it be possible to perform operations on them? If they are `opaque` for now, will they stay that way, or will it be possible to manipulate them in the future? For example, adding a string type in Emboss might start as `opaque`, but allow operations like “value at index” or “substring” in the future.

Note that the entire point of Emboss is to provide a bridge between physical data layout (as defined in the structure layout system) and abstract values with no specific representation (as exposed through the expression system).
Any new syntax has to be added to the parser. Aside from the language design considerations for new syntax (see the “General Language Design Principles” section), there are a few levels of concern for the actual implementation:
The first consideration is more of a general language design consideration: if your language design says “users will be able to specify their program in English,” that is not really feasible (or unambiguous). (Not that it hasn't been tried, many times.)
The second consideration — can you add this syntax to `embossc`? — is the most practical and important consideration for Emboss. LR(1) grammars are pretty restrictive (though shift-reduce parsers have advantages — there are reasons Emboss is using one), and even when it is possible to express a particular syntactic construct in LR(1)[^zimm], it may be difficult for most programmers to actually do so. As a practical matter, I recommend trying to actually add your syntax to `module_ir.py`.
[^zimm]: I (Ben Olmstead) think it would be awesome to implement [Zimmerman, 2022] plus a few extensions of my own devising in Emboss's shift-reduce engine, which would make the grammar design space significantly larger. I would also separate the parser generator engine into its own project.
The third consideration is more future-focused and abstract: does this syntax lock Emboss into using a shift-reduce parser in the future? Ideally, no. Luckily(?), LR(1) grammars are one of the more restrictive types of grammars in common use, so it is likely that anything that can be handled by the current parser can be handled by many other types of parsers.
Right now, there is only the generated C++ code, but there should be other back ends in the future. Some new features are pure syntax sugar (e.g., `$next` or `a < b < c`) that are replaced in the IR long before it reaches the back end (e.g., with the offset+length of the syntactically-previous field, or the IR equivalent of `a < b && b < c`), while others require extensive changes to how code is generated.
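For example, a sketch of the `$next` sugar (the struct and field names are invented):

```
struct Packet:
  0      [+1]  UInt  header
  # $next is replaced by the front end with the offset plus length of the
  # syntactically-previous field (here, 0 + 1 == 1), so no back end ever
  # needs to know that $next exists.
  $next  [+1]  UInt  body
  $next  [+1]  UInt  checksum   # Becomes offset 2.
```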
Can the generated `.h` code handle this feature in its current design?

The runtime library will be included with every program that touches Emboss, so it is important to make it efficient. When adding features, consider:
Can the feature be added in such a way that it does not cost anything for programs that do not use the feature? A standalone C++ template will not be included in a program unless the program instantiates the template, but if the new code is used from somewhere in an existing function, it may be included in programs that do not use it directly.
Can the feature be added without allocating any heap memory? Can it be added with O(1) stack memory use? Both of these are important for some embedded systems, such as OS-less microcontroller and hard-real-time environments. Some features may intrinsically require memory allocation, in which case it is best if they can be separated: for example, Emboss structure-to-string conversion requires allocation, and even `#include`'ing the appropriate headers can be too much for some environments, even if the serialization code is never included in the final binary.
How much can you rely on the C++ compiler to optimize things? If you have to implement your own optimizations, that will cost more development time and add more complexity to the standard library.
The Emboss compiler is already quite complex, and has many subsystems that interact; it can be quite difficult to reason about some of those interactions.
Can the feature be added at an “edge” of the compiler? For example, if you can implement your feature as syntax sugar that converts the new feature to existing IR early in the compilation process, it is much easier to verify that it will not cause problematic interactions. Similarly, if you can implement your feature entirely in the back end or in the runtime library, you do not need to worry about interactions inside the front end.
If a feature cannot be added at an edge, how can you design it to minimize the complexity? (Ideally, you could even unify existing systems in such a way that the overall complexity of the compiler is lower at the end.)
It is important to have some idea of how any feature would be implemented against future back ends.
Some features may be difficult to implement in other languages. For example, Python does not have a native `switch` statement, so any `switch`-like feature in Emboss may be awkward to implement — but this does not necessarily mean that Emboss should not have a `switch`.
As a rule of thumb, languages can be grouped into tiers:
(It may be difficult to classify some languages, such as FORTRAN, which is still hanging around in 2024.)
Remember that other back ends may have different requirements and guarantees than the C++ back end: for example, it would be unreasonable for a Java back end to promise “no dynamic memory allocation.”
These back ends would translate binary structures into alternate representations that are easier for some tools to use: for example, Google has many, many tools for processing Protocol Buffers, and JSON is popular in the open-source world.
Most other formats have limitations that may make some kinds of Emboss constructs difficult or impossible to correctly reproduce: for example, Emboss already supports “infinitely nested” `struct` types, like:

```
struct Foo:
  0 [+10]  Foo  child_foo
```

Formats like Protobuf or JSON, which do not have any way of representing loops in their data graph, cannot handle this.
Until the most recent versions of Protobuf, mismatches between Protobuf `enum` and Emboss `enum` made it functionally impossible to map any Emboss `enum` types onto Protobuf `enum` types: Emboss `enum` types are open (allow any value, even ones that are not listed in the `enum`), whereas all Protobuf `enum` types were closed (only allowed known values). (The most recent Protobuf versions, Proto3 and Editions, allow you to have open `enum` types.)
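For reference, a sketch of what “open” means for an Emboss `enum` (the names are invented):

```
enum Color:
  BLACK = 0
  WHITE = 1
# A Color field may still hold 2, 7, or any other value that fits in the
# field: unknown values are preserved rather than rejected.
```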
Generally, it is not worth blocking an Emboss feature because of these kinds of mismatches, but it is worth thinking about how to avoid them, if possible.
These back ends would translate `.emb` files to a form of human-readable documentation, intended for publication on a web site, in an RFC, or as part of a PDF datasheet. This type of back end is the motivation for having both `--` documentation blocks and `#` comments in Emboss.
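To show the distinction (a small sketch; the struct is invented):

```
struct Temperature:
  -- Documentation blocks ("--") are part of the definition, and a
  -- documentation back end would render them for readers of the format.
  0 [+1]  Int  celsius
    -- The measured temperature, in degrees Celsius.
  # Plain "#" comments are for maintainers of the .emb file and would not
  # appear in the generated documentation.
```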
Since the output from these back ends would be intended for human consumption, for the most part you would only need to ensure that your feature can be understood by humans.