Currently, Emboss does not support division, which limits what it can express in its expression language. In particular, it is awkward to handle things like arrays of 16-bit values whose size is specified in bytes.
The reason Emboss does not yet have support for division is that division is tricky to handle correctly. This document attempts to explain the pitfalls and the proper solutions.
Note that Emboss does not support non-integer arithmetic, so this document only talks about integer division.
Many programming languages use /
for all division operations, whether they are “integer” division, floating point division, rational arithmetic division, matrix division, or anything else.
Very, very few languages (Haskell) use an operator other than %
for modulus/remainder, and no popular language overloads %
with a second meaning.
(Note that Haskell defines %
as rational number division. mod
and rem
are used for modulus and remainder, respectively, and div
and quot
are used for truncating and flooring division.)
%
is technically modulus in some languages, and remainder in others, but in all cases the equality x == (x / n * n) + x % n
holds, and does not seem to confuse programmers.
All of these languages use /
for multiple kinds of division, including integer division.
Python 2 deprecated /
for integer division; in a Python 2 program with from __future__ import division
, /
is floating-point division.
C++ and Python both allow /
to be overloaded for user-defined types.
These languages use /
for floating-point division, and do not directly support integer division; both operands and result must be cast to int:
((l | 0) / (r | 0)) | 0 // JS, TS int(int($l) / int($r)) # Perl
These languages use separate operators for integer and non-integer division.
Integer division:
l `div` r -- Haskell l / r (* OCaml *) l // r # Python 3
Non-integer division:
l / r -- Haskell l /. r (* OCaml *) l / r # Python 3
Emboss should use //
for division, following Python 3's example.
This is the only unambiguous integer division operator used by a TIOBE top-20 language, and Python 3's //
has the same division semantics (flooring division) as Emboss -- most other programming languages use truncating (round-toward-0) division.
If Emboss ever gets support for floating-point arithmetic, it should probably use OCaml's /.
operator for division on floats.
Because /
does different things in different popular languages (flooring integer division, truncating integer division, floating-point division, or multiple types, depending on operand types), Emboss should not use it.
Emboss should use %
for modulus, following the example of almost every other language.
In most programming languages, division is a left-associative operator with equal precedence to multiplication; i.e. this:
a * b / c * d
is equivalent to this:
((a * b) / c) * d
However, a nontrivial percentage of programmers tend to read it as:
(a * b) / (c * d)
This may be because, after grade school, most division is expressed in fraction form, like:
a * b ----- c * d
Interestingly, fewer programmers seem to mis-read extra division:
a * b / c / d
is usually (correctly) parsed as:
((a * b) / c) / d
Given that confusion, in Emboss, the construct:
a * b / c * d
should be a syntax error.
The constructs:
a * b / c / d a * b / c + d
could be syntax errors, although the second one (... + d
), in particular, seems to get parsed correctly by all programmers in my polling.
This is in keeping with two general rules of Emboss design:
.emb
files slightly wordier. (See, for example, the “separate-but-equal” precedence of the &&
and ||
operators in Emboss.)Roughly speaking, there are two common forms of “integer division:” flooring (sometimes called Euclidean) and truncating (sometimes called round-toward-0) division.
Operation | Flooring Result | Truncating Result |
---|---|---|
+8 / +3 | 2 | 2 |
+8 / -3 | -3 | -2 |
-8 / +3 | -3 | -2 |
-8 / -3 | -3 | 2 |
+8 % +3 | 2 | 2 |
+8 % -3 | -1 | 2 |
-8 % +3 | 1 | -2 |
-8 % -3 | -2 | -2 |
Most programming languages and most CPUs implement truncating division, however, truncating division is irregular around 0, which prevents some kinds of expression rewrites (e.g., (x + 6) / 3
is not the same as x / 3 + 2
when x == -4
) and would force Emboss's bounds tracking to have wider bounds in some common cases, such as modulus by a known constant, where the dividend could be either positive or negative.
Emboss should use flooring division.
Division is the first operation in Emboss's expression language which can have an undefined result: all other operations are total.
Emboss does have some notion of “invalid value,” which it uses at runtime to handle fields whose backing store does not exist or whose existence condition is false, however, “invalid value” is not fully propagated in the expression type system, and it has a somewhat different meaning than “undefined.”
Currently, integer types in the Emboss expression language are sets of the form:
{x ∈ ℤ | (min ≤ x ∨ min = -infinity) ∧ (x ≤ max ∨ max = infinity) ∧ x MOD m = n ∧ 0 ≤ n < m ∧ min ∈ ℤ ∪ {-infinity} ∧ max ∈ ℤ ∪ {infinity} ∧ m ∈ ℤ ∪ {infinity} ∧ n ∈ ℤ}
where min
, max
, m
, and n
are parameters. The special values infinity
and -infinity
for max
and min
indicate that there is no known upper or lower bound, and the special value infinity
for m
indicates that x
has a constant value equal to n
.
This will have to change to something like:
{x ∈ ℤ ∪ {undefined} | ((can_be_undefined ∧ x = undefined) ∨ (min ≤ x ≤ max ∧ x MOD m = n)) ∧ 0 ≤ n < m ∧ can_be_undefined ∈ 𝔹 ∧ min ∈ ℤ ∪ {-infinity, undefined} ∧ max ∈ ℤ ∪ {infinity, undefined} ∧ m ∈ ℤ ∪ {infinity} ∧ n ∈ ℤ ∪ {undefined}}
where can_be_undefined
is a new boolean parameter.
This also leaks out into boolean types, which are currently either {true, false}, {true}, or {false}: they can now be {true, false}, {true}, {false}, {true, false, undefined}, {true, undefined}, {false, undefined}, or {undefined}.
For an expression x of the form l // r
, Emboss must figure out the five parameters of the type of x (x_min
, x_max
, x_m
, x_n
, x_can_be_undefined
) from the five parameters of each of the types of l
and r
.
x_can_be_undefined = l_can_be_undefined ∨ r_can_be_undefined ∨ r_min ≤ 0 ≤ r_max
That is, x
can be undefined if either input is undefined, or if there could be division by zero.
If r_min = 0 = r_max
, then:
x_max = undefined x_min = undefined x_m = infinity x_n = undefined
Otherwise, the remaining parameters can be computed by parts, although for x_max
and x_min
it is simpler to simply check all pairs of {l_min
, -1
, 1
, l_max
} // {r_min
, -1
, 1
, r_max
} (removing zeroes from the right-hand side, and removing -1
and 1
if they do not fall between the min
and max
).
x_m
is:
infinity if l_m = r_m = infinity else GCD(l_m, r_n) if r_m = infinity else 1
x_n
is
l_n // r_n if l_m = r_m = infinity else (l_n // r_n) MOD GCD(l_m, r_n) if r_m = infinity else 0
Modulus is somewhat simpler.
x_can_be_undefined = l_can_be_undefined ∨ r_can_be_undefined ∨ r_min ≤ 0 ≤ r_max
If l_m = r_m = infinity
, x_m = infinity
and x_min = x_max = x_n = l_n % r_n
.
Otherwise:
x_max = max(0, r_max - 1) x_min = min(0, r_min + 1) x_m = 1 x_n = 0
Note that, as with some other bounds calculations in Emboss, there are some special cases where l
or r
is a very small set where it is technically possible to find a narrower bound: for example, if 6 ≤ l ≤ 7, l % 4 could have the bounds x_max = 3
and x_min = 2
. However, Emboss does not need to find the absolute tightest bound; it only needs to find a reasonable bound.
IsComplete()
The intention of IsComplete()
is that it should return true
iff adding more bytes cannot change the result of Ok()
-- that is, iff the structure is “complete” in the sense that there are enough bytes to hold the structure, even if the structure is broken in some other way.
If the size of the structure is a dynamic value which involves division by zero, the second definition of IsComplete()
becomes ill-defined, in that it becomes impossible to know if there are enough bytes for the structure:
struct Foo: 0 [+2] UInt divisor 2 [+512 // divisor] UInt:8[] payload
If divisor == 0
, the size of payload
is undefined, so the structure's completeness is undefined.
On the other hand, if divisor == 0
, then the structure can never be Ok()
, so in that sense it is ‘complete’: there is no way to add more bytes and get a structure that is functional. There may be another way for the client software to recover, such as scanning an input stream for a new start-of-message marker.
I (bolms@) think that the best option is:
Maybe<bool> IsUndefined()
to virtual field views (and maybe to UIntView
, IntView
, and FlagView
).IntrinsicSizeInBytes().IsUndefined()
, IsComplete()
should return true
. (Note that IntrinsicSizeInBytes()
is just the way the $size_in_bytes
implicit virtual field is spelled in C++.)Maybe<T>
to have a ‘why not’ reason when there is no T
.IsComplete()
logic.IntrinsicSizeInBytes()
and IsComplete()
to note this caveat.In practice, it is rare to have a dynamic value that could be zero as a divisor in a field position or length: such divisions are almost always either division by a constant or division by one of a small set of known divisors, often powers of 2 (and in many of those cases, a shift operation would fit better).