Entropy collection TODOs

I’m writing this at the end of my internship to record some of the things I didn’t get to.

Proper use of RdRand

On x86, RdRand reads from a deterministic CPRNG (which is seeded from a hardware entropy source). The newer RdSeed instruction reads from the underlying entropy source directly (well, with some post-processing). Currently, we prefer RdSeed, but if it isn't available we fall back on RdRand. However, we just draw random bits directly from RdRand, in contravention of the Intel HWRNG guide (online here; see section 4.2.5 “Guaranteeing DBRG Reseeding”), which asks callers who need seed-grade output to draw enough RdRand output to force the internal generator to reseed. We should fix that.
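
To make the fix concrete, here is a minimal sketch of the "draw enough to span a reseed, then condense" idea. It rests on assumptions that should be checked against the guide: that the generator reseeds after at most 511 128-bit outputs (so 512 such outputs, i.e. 1024 64-bit reads, cover at least one reseed), and that condensing with SHA-256 is an acceptable stand-in for the AES-based mixing I remember the guide describing. `read_rdrand64` is a hypothetical wrapper around the instruction, not an existing function.

```python
import hashlib
import struct
from typing import Callable

# Assumption (verify against the Intel guide): the generator reseeds after at
# most 511 128-bit outputs, so 512 outputs span at least one reseed.
SAMPLES_128 = 512
READS_64 = SAMPLES_128 * 2  # each 128-bit sample is two 64-bit reads


def reseed_guaranteed_seed(read_rdrand64: Callable[[], int]) -> bytes:
    """Condense READS_64 RdRand outputs into one 32-byte seed.

    read_rdrand64 is a hypothetical wrapper that returns one 64-bit value
    (0 <= v < 2**64) and raises if the instruction reports failure.
    """
    h = hashlib.sha256()
    for _ in range(READS_64):
        h.update(struct.pack("<Q", read_rdrand64()))
    return h.digest()
```

The point is only that a seed should depend on a whole batch of RdRand outputs rather than on a single read.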

Reseeding the CPRNG during runtime

My hacky virtio driver reseeds the CPRNG on QEMU (on a recurring five-minute timer). I think that's the only entropy source currently used to reseed after system startup.

As a start, we should be able to use the entropy sources built into the kernel (RdRand and jitterentropy). Just running these on a periodic timer would improve our reseeding story. Note that once every 5 minutes is probably more often than we need.

We've talked about reseeding more often if large amounts of data have been drawn from the CPRNG (on the order of 2^48 bits, I think).
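
A rough sketch of a policy combining both triggers (elapsed time and volume drawn) might look like the following. The class name, the one-hour default interval, and the method names are all made up for illustration; only the 2^48-bit figure comes from the discussion above, and even that is from memory.

```python
from dataclasses import dataclass


@dataclass
class ReseedPolicy:
    """Hypothetical policy: reseed on a timer or after heavy use."""
    max_interval_s: float = 3600.0   # placeholder, not a recommendation
    max_bits_drawn: int = 2 ** 48    # figure mentioned above, unverified
    last_reseed_s: float = 0.0
    bits_since_reseed: int = 0

    def record_draw(self, num_bytes: int) -> None:
        # Called whenever the CPRNG hands out num_bytes of output.
        self.bits_since_reseed += 8 * num_bytes

    def should_reseed(self, now_s: float) -> bool:
        timed_out = now_s - self.last_reseed_s >= self.max_interval_s
        drained = self.bits_since_reseed >= self.max_bits_drawn
        return timed_out or drained

    def mark_reseeded(self, now_s: float) -> None:
        self.last_reseed_s = now_s
        self.bits_since_reseed = 0
```

For example, `ReseedPolicy(max_interval_s=100, max_bits_drawn=64)` asks for a reseed after a single 8-byte draw, or after 100 seconds with no draws at all.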

Monitoring entropy sources

Entropy sources can potentially fail, either totally or partially.

Total failures like “the device was unplugged” or “the device is not responding to I/O” will hopefully be reported by the hardware layer.

Partial failures, where the device returns data but with less entropy than expected, are scarier. We should run simple health tests to try to detect partial failures. See for example the continuous health tests in NIST SP800-90B, section 4.4. The health tests there are pretty simple and require minimal resources. They do require storing some statistics about recent entropy source outputs, which presents some security risk.
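
As an illustration of how small these tests are, here is a sketch of a repetition count test in the style of SP800-90B. I'm assuming the cutoff formula C = 1 + ceil(-log2(alpha) / H), with a false-positive rate alpha = 2^-20 and H the claimed min-entropy per sample in bits; check that against the spec before relying on it. The class name is hypothetical.

```python
import math
from typing import Optional


class RepetitionCountTest:
    """Flags a source that emits the same sample too many times in a row."""

    def __init__(self, min_entropy_per_sample: float, alpha_exp: int = 20):
        # Assumed cutoff: C = 1 + ceil(-log2(alpha) / H), alpha = 2**-alpha_exp.
        self.cutoff = 1 + math.ceil(alpha_exp / min_entropy_per_sample)
        self.last: Optional[int] = None
        self.run = 0

    def feed(self, sample: int) -> bool:
        """Return True while healthy, False once a run reaches the cutoff."""
        if sample == self.last:
            self.run += 1
        else:
            self.last = sample
            self.run = 1
        return self.run < self.cutoff
```

The only state kept is the most recent sample and a run length, which is the kind of stored statistic the security concern above refers to.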

The NIST SP also suggests (well, requires, but I'm not aware of any immediate plans for certification) running startup tests. The NIST startup tests involve running the continuous tests over at least 4096 samples (see section 4.3 #12), after which these samples may be reused to seed the CPRNG.
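
A startup test along those lines could be sketched as below. The function and parameter names are hypothetical; the 4096 default is just the figure quoted above, and each health check is modeled as a callable that takes one sample and returns True while the source looks healthy.

```python
from typing import Callable, Iterable, List, Optional

STARTUP_SAMPLES = 4096  # figure quoted in the text above


def run_startup_test(source: Callable[[], int],
                     health_checks: Iterable[Callable[[int], bool]],
                     num_samples: int = STARTUP_SAMPLES) -> Optional[List[int]]:
    """Feed num_samples samples through every check.

    Returns the collected samples (which may then be used for seeding) if all
    checks pass, or None as soon as any check reports a failure.
    """
    checks = list(health_checks)
    samples: List[int] = []
    for _ in range(num_samples):
        sample = source()
        samples.append(sample)
        # Run every check on every sample so stateful checks stay in sync.
        results = [check(sample) for check in checks]
        if not all(results):
            return None
    return samples
```

Returning the samples on success reflects the reuse allowance mentioned above; returning None keeps a failing source from contributing any seed material.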

Once monitoring is in place, we need to decide how to respond to entropy source failures. If one of six different entropy sources fails, we might treat that as a minor hardware failure that gets logged. If the system has only one entropy source and it fails, we need to take more drastic action (on the order of shutting off the CPRNG or halting the system).
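
One way to phrase that decision, purely as a sketch: the action names and the `min_healthy` threshold below are invented for illustration, and the real policy would need more nuance (for example, weighting sources by how much we trust them).

```python
from enum import Enum


class Action(Enum):
    NONE = "none"   # nothing failed
    LOG = "log"     # minor hardware failure, other sources remain
    HALT = "halt"   # too few healthy sources left: stop the CPRNG or system


def respond_to_failures(total_sources: int, failed_sources: int,
                        min_healthy: int = 1) -> Action:
    """Pick a response from the number of sources that are still healthy."""
    if failed_sources == 0:
        return Action.NONE
    healthy = total_sources - failed_sources
    if healthy < min_healthy:
        return Action.HALT
    return Action.LOG
```

With the default threshold, one failure out of six sources yields `Action.LOG`, while the failure of a system's only source yields `Action.HALT`.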

Userspace RNG drivers

Once the DDK settles down, we should add to and improve our RNG drivers. Currently, there are two RNG-related drivers: TPM and virtio-rng.

An important requirement is to restrict access to the zx_cprng_add_entropy syscall, via a Resource or similar mechanism. We should also use this to differentiate between the devices providing entropy, for monitoring purposes. It would also be nice if the kernel could send start/stop signals to the drivers through this Resource.
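
To show the shape of what I have in mind, here is a toy model in which an opaque token stands in for the Resource. Nothing here corresponds to an existing kernel interface: `EntropyGate`, `grant`, and the per-source byte counter are hypothetical, and SHA-256 is just a placeholder for whatever the CPRNG does with the bytes.

```python
import hashlib
from typing import Dict


class EntropyGate:
    """Toy model: only holders of a granted token may add entropy."""

    def __init__(self) -> None:
        self._sources: Dict[object, str] = {}    # token -> source name
        self.bytes_by_source: Dict[str, int] = {}
        self._pool = hashlib.sha256()            # placeholder for the CPRNG

    def grant(self, name: str) -> object:
        # The returned token plays the role of the Resource handle.
        token = object()
        self._sources[token] = name
        self.bytes_by_source[name] = 0
        return token

    def add_entropy(self, token: object, data: bytes) -> bool:
        name = self._sources.get(token)
        if name is None:
            return False  # caller does not hold a valid token
        self._pool.update(data)
        self.bytes_by_source[name] += len(data)
        return True
```

Because each token maps to a named source, the same check that restricts access also gives monitoring a per-device record of contributions.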

Here are some currently unused entropy sources to consider:

There's an existing TPM driver, which calls cprng_add_entropy in its bind() callback. We should add support for TPM 2.0, for better coverage.
There are plenty of commercially available hardware RNGs, often connecting over USB. We could add drivers for those, but it probably makes sense to expect third party drivers instead.
There’s also apparently a hardware RNG built into the SoC in Raspberry Pis, according to the Raspberry Pi forums. In general we could check other specific targets (i.e. not “pc-x86-64”) for hardware RNGs and wire those up. If we’re lucky, many of these will be accessible from the kernel for use during or immediately after boot.
Finally, we could record entropy from hardware IRQs, especially for hard disks, network cards, input devices, and other classic entropy sources. This won’t be anywhere near as fast as a dedicated hardware RNG, but it’s attractive since a few lines of code added in the right places in our driver stack should enable entropy collection from a wide variety of very common devices.
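
For the IRQ idea in the last item, the "few lines of code" might amount to something like this sketch. The collector class, the `on_irq` hook, and especially the per-event credit are assumptions for illustration; the credit is a placeholder, not a measured entropy estimate.

```python
import hashlib
import struct
from typing import Optional


class IrqEntropyCollector:
    """Mixes IRQ numbers and timestamps into a hash-based pool."""

    def __init__(self, credit_millibits_per_event: int = 500):
        # Placeholder credit: half a bit per event, NOT a measured value.
        self._credit = credit_millibits_per_event
        self._hash = hashlib.sha256()
        self._last_ticks: Optional[int] = None
        self.credited_millibits = 0

    def on_irq(self, irq: int, ticks: int) -> None:
        # Hypothetical hook a driver would call from its IRQ path.
        self._hash.update(struct.pack("<IQ", irq, ticks))
        # Only credit events whose timestamp differs from the previous one.
        if self._last_ticks is not None and ticks != self._last_ticks:
            self.credited_millibits += self._credit
        self._last_ticks = ticks

    def extract(self) -> bytes:
        """Return 32 bytes for reseeding and restart the credit count."""
        digest = self._hash.digest()
        self._hash = hashlib.sha256(digest)  # chain forward from old state
        self.credited_millibits = 0
        return digest
```

A driver would call `on_irq` with the interrupt number and a cycle counter reading, and the reseeding code would call `extract` once enough credit has built up.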

Jitterentropy

Replace the noise-generating functions with assembly, and remove ‘-O0’

Right now, jitterentropy is compiled at optimization level -O0 (as per the author's documentation). The reason is the two noise-generating functions: jent_lfsr_time and jent_memaccess. We should replace these C functions with assembly code (probably generated by compiling them with flags -S -O0), then compile the rest of jitterentropy with optimizations enabled. After this, we should re-test to make sure our entropy estimates remain accurate.

Test jitterentropy more thoroughly

I've been testing on the same handful of physical devices. We should test jitterentropy on a few other PCs, RPis, etc.

Test jitterentropy at runtime

Right now, jitterentropy only runs (and was only tested) during the single-core part of the boot sequence. We should test jitterentropy during SMP runtime, and consider whether we need to (say) disable interrupts or pin ourselves to a CPU inside jitterentropy.

More tuning

See the tuning doc. The current universally hard-coded parameters seem to be decent, so this probably isn’t incredibly urgent. Still, since jitterentropy is on the critical path for every single boot and since it will run at runtime as well (hopefully soon!), it’s probably worth optimizing at some point.

We should probably at least tune jitterentropy on a per-architecture basis, and ideally per-target. Note that right now, the entropy_per_1000_bytes statistic in kernel/lib/crypto/entropy/jitterentropy_collector.cpp is hard-coded and not arch/target dependent. That should probably also be configurable.
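
To illustrate what a configurable estimate would feed into, here is a small sketch. I'm assuming entropy_per_1000_bytes means bits of entropy per 1000 bytes of collector output; the target names and every number in the table are placeholders, not measurements or the current hard-coded value.

```python
# All values are placeholders for illustration, not measured estimates.
DEFAULT_ENTROPY_PER_1000_BYTES = 50
ENTROPY_PER_1000_BYTES = {
    "hypothetical-target-a": 50,
    "hypothetical-target-b": 200,
}


def bytes_needed(bits_wanted: int, target: str) -> int:
    """Bytes to draw from the collector to gather bits_wanted of entropy."""
    rate = ENTROPY_PER_1000_BYTES.get(target, DEFAULT_ENTROPY_PER_1000_BYTES)
    # Integer ceiling of bits_wanted * 1000 / rate.
    return (bits_wanted * 1000 + rate - 1) // rate
```

With these placeholder numbers, a 256-bit seed costs 5120 bytes of output on the first target and 1280 on the second, which is why a per-target estimate could noticeably change boot-time cost.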

Cloning the NIST test suite

We may want to clone the NIST test suite into Fuchsia third_party. This would help us automate the testing and analysis of our entropy sources (jitterentropy in particular).