Design Sketch: Integer Division and Modulus Operators

Overview

Currently, Emboss does not support division, which limits what it can express in its expression language. In particular, it is awkward to handle things like arrays of 16-bit values whose size is specified in bytes.

The reason Emboss does not yet have support for division is that division is tricky to handle correctly. This document attempts to explain the pitfalls and the proper solutions.

Note that Emboss does not support non-integer arithmetic, so this document only talks about integer division.

Syntax

Symbol

Many programming languages use / for all division operations, whether they are “integer” division, floating point division, rational arithmetic division, matrix division, or anything else.

Very, very few languages (Haskell) use an operator other than % for modulus/remainder, and no popular language overloads % with a second meaning.

(Note that Haskell defines % as rational number division. mod and rem are used for modulus and remainder, respectively, and div and quot are used for truncating and flooring division.)

% is technically modulus in some languages, and remainder in others, but in all cases the equality x == (x / n * n) + x % n holds, and does not seem to confuse programmers.

C, C++, C#, Java, Go, Python 2, Rust, SQL, ...

All of these languages use / for multiple kinds of division, including integer division.

Python 2 deprecated / for integer division; in a Python 2 program with from __future__ import division, / is floating-point division.

C++ and Python both allow / to be overloaded for user-defined types.

JavaScript, TypeScript, Perl 5

These languages use / for floating-point division, and do not directly support integer division; both operands and result must be cast to int:

((l | 0) / (r | 0)) | 0  // JS, TS

int(int($l) / int($r))    # Perl

OCaml, Haskell, Python 3

These languages use separate operators for integer and non-integer division.

Integer division:

l `div` r    -- Haskell
l / r        (* OCaml *)
l // r        # Python 3

Non-integer division:

l / r        -- Haskell
l /. r       (* OCaml *)
l / r         # Python 3

Emboss

Emboss should use // for division, following Python 3's example.

This is the only unambiguous integer division operator used by a TIOBE top-20 language, and Python 3's // has the same division semantics (flooring division) as Emboss -- most other programming languages use truncating (round-toward-0) division.

If Emboss ever gets support for floating-point arithmetic, it should probably use OCaml's /. operator for division on floats.

Because / does different things in different popular languages (flooring integer division, truncating integer division, floating-point division, or multiple types, depending on operand types), Emboss should not use it.

Emboss should use % for modulus, following the example of almost every other language.

Precedence

In most programming languages, division is a left-associative operator with equal precedence to multiplication; i.e. this:

a * b / c * d

is equivalent to this:

((a * b) / c) * d

However, a nontrivial percentage of programmers tend to read it as:

(a * b) / (c * d)

This may be because, after grade school, most division is expressed in fraction form, like:

a * b
-----
c * d

Interestingly, fewer programmers seem to mis-read extra division:

a * b / c / d

is usually (correctly) parsed as:

((a * b) / c) / d

Given that confusion, in Emboss, the construct:

a * b / c * d

should be a syntax error.

The constructs:

a * b / c / d
a * b / c + d

could be syntax errors, although the second one (... + d), in particular, seems to get parsed correctly by all programmers in my polling.

This is in keeping with two general rules of Emboss design:

  1. It is better to lock things down at first and gradually ease restrictions than the other way around.
  2. Emboss syntax should be as unambiguous as possible, even when it makes implementation trickier or .emb files slightly wordier. (See, for example, the “separate-but-equal” precedence of the && and || operators in Emboss.)

Semantics

Flooring Division vs Truncating Division

Roughly speaking, there are two common forms of “integer division:” flooring (sometimes called Euclidean) and truncating (sometimes called round-toward-0) division.

OperationFlooring ResultTruncating Result
+8 / +322
+8 / -3-3-2
-8 / +3-3-2
-8 / -3-32
+8 % +322
+8 % -3-12
-8 % +31-2
-8 % -3-2-2

Most programming languages and most CPUs implement truncating division, however, truncating division is irregular around 0, which prevents some kinds of expression rewrites (e.g., (x + 6) / 3 is not the same as x / 3 + 2 when x == -4) and would force Emboss's bounds tracking to have wider bounds in some common cases, such as modulus by a known constant, where the dividend could be either positive or negative.

Emboss should use flooring division.

Undefined Results

Division is the first operation in Emboss's expression language which can have an undefined result: all other operations are total.

Emboss does have some notion of “invalid value,” which it uses at runtime to handle fields whose backing store does not exist or whose existence condition is false, however, “invalid value” is not fully propagated in the expression type system, and it has a somewhat different meaning than “undefined.”

Currently, integer types in the Emboss expression language are sets of the form:

{x ∈ ℤ | (min ≤ x ∨ min = -infinity) ∧
         (x ≤ max ∨ max = infinity) ∧
         x MOD m = n ∧
         0 ≤ n < m ∧
         min ∈ ℤ ∪ {-infinity} ∧
         max ∈ ℤ ∪ {infinity} ∧
         m ∈ ℤ ∪ {infinity} ∧
         n ∈ ℤ}

where min, max, m, and n are parameters. The special values infinity and -infinity for max and min indicate that there is no known upper or lower bound, and the special value infinity for m indicates that x has a constant value equal to n.

This will have to change to something like:

{x ∈ ℤ ∪ {undefined} | ((can_be_undefined ∧ x = undefined) ∨
                        (min ≤ x ≤ max ∧ x MOD m = n)) ∧
                       0 ≤ n < m ∧
                       can_be_undefined ∈ 𝔹 ∧
                       min ∈ ℤ ∪ {-infinity, undefined} ∧
                       max ∈ ℤ ∪ {infinity, undefined} ∧
                       m ∈ ℤ ∪ {infinity} ∧
                       n ∈ ℤ ∪ {undefined}}

where can_be_undefined is a new boolean parameter.

This also leaks out into boolean types, which are currently either {true, false}, {true}, or {false}: they can now be {true, false}, {true}, {false}, {true, false, undefined}, {true, undefined}, {false, undefined}, or {undefined}.

Expression Bound Calculations

Division

For an expression x of the form l // r, Emboss must figure out the five parameters of the type of x (x_min, x_max, x_m, x_n, x_can_be_undefined) from the five parameters of each of the types of l and r.

x_can_be_undefined = l_can_be_undefined ∨
                     r_can_be_undefined ∨
                     r_min ≤ 0 ≤ r_max

That is, x can be undefined if either input is undefined, or if there could be division by zero.

If r_min = 0 = r_max, then:

x_max = undefined
x_min = undefined
x_m = infinity
x_n = undefined

Otherwise, the remaining parameters can be computed by parts, although for x_max and x_min it is simpler to simply check all pairs of {l_min, -1, 1, l_max} // {r_min, -1, 1, r_max} (removing zeroes from the right-hand side, and removing -1 and 1 if they do not fall between the min and max).

x_m is:

infinity      if l_m = r_m = infinity else
GCD(l_m, r_n) if r_m = infinity       else
1

x_n is

l_n // r_n                     if l_m = r_m = infinity else
(l_n // r_n) MOD GCD(l_m, r_n) if r_m = infinity       else
0

Modulus

Modulus is somewhat simpler.

x_can_be_undefined = l_can_be_undefined ∨
                     r_can_be_undefined ∨
                     r_min ≤ 0 ≤ r_max

If l_m = r_m = infinity, x_m = infinity and x_min = x_max = x_n = l_n % r_n.

Otherwise:

x_max = max(0, r_max - 1)
x_min = min(0, r_min + 1)
x_m = 1
x_n = 0

Note that, as with some other bounds calculations in Emboss, there are some special cases where l or r is a very small set where it is technically possible to find a narrower bound: for example, if 6 ≤ l ≤ 7, l % 4 could have the bounds x_max = 3 and x_min = 2. However, Emboss does not need to find the absolute tightest bound; it only needs to find a reasonable bound.

Other Notes

IsComplete()

The intention of IsComplete() is that it should return true iff adding more bytes cannot change the result of Ok() -- that is, iff the structure is “complete” in the sense that there are enough bytes to hold the structure, even if the structure is broken in some other way.

If the size of the structure is a dynamic value which involves division by zero, the second definition of IsComplete() becomes ill-defined, in that it becomes impossible to know if there are enough bytes for the structure:

struct Foo:
  0 [+2]               UInt      divisor
  2 [+512 // divisor]  UInt:8[]  payload

If divisor == 0, the size of payload is undefined, so the structure's completeness is undefined.

On the other hand, if divisor == 0, then the structure can never be Ok(), so in that sense it is ‘complete’: there is no way to add more bytes and get a structure that is functional. There may be another way for the client software to recover, such as scanning an input stream for a new start-of-message marker.

I (bolms@) think that the best option is:

  1. Add a new method Maybe<bool> IsUndefined() to virtual field views (and maybe to UIntView, IntView, and FlagView).
  2. If IntrinsicSizeInBytes().IsUndefined(), IsComplete() should return true. (Note that IntrinsicSizeInBytes() is just the way the $size_in_bytes implicit virtual field is spelled in C++.)
    • This will require a bunch of plumbing into the whole expression generation logic -- probably changing Maybe<T> to have a ‘why not’ reason when there is no T.
  3. Otherwise, proceed with the current IsComplete() logic.
  4. Update the documentation for IntrinsicSizeInBytes() and IsComplete() to note this caveat.

In practice, it is rare to have a dynamic value that could be zero as a divisor in a field position or length: such divisions are almost always either division by a constant or division by one of a small set of known divisors, often powers of 2 (and in many of those cases, a shift operation would fit better).