Updated comments
diff --git a/demos/timing_array.cc b/demos/timing_array.cc
index 38cd784..683264e 100644
--- a/demos/timing_array.cc
+++ b/demos/timing_array.cc
@@ -78,60 +78,36 @@
// - Reads _at or below_ the threshold _almost certainly_ came from cache
// - Reads _above_ the threshold _probably_ came from memory
//
-// There are a lot of ways to try find such a threshold value. Our current
+// There are a lot of ways to arrive at such a threshold value. Our current
// approach is, roughly:
// 1. Flush all array elements from cache.
// 2. Read all elements into cache, noting the fastest read.
// 3. Read all elements again, now from cache, noting the slowest read.
-// 4. Repeat (1)-(3) many times to get a lot of "fastest uncached" and
-// "slowest cached" datapoints.
-// 5. Take a low-percentile value from each distribution and return their
-// midpoint.
+//    Take the midpoint of the "fastest uncached" and "slowest cached"
+//    and call it a "candidate threshold".
+// 4. Repeat (1)-(3) many times to get a distribution of candidate thresholds.
+// 5. Choose a threshold from the low side of that distribution.
//
// In a previous attempt, we just collected the "slowest cached" distribution
// and sampled the threshold from there. That worked in many cases, but failed
// when the observed latency of cached reads could change over time -- for
// example, on laptops with aggressive dynamic frequency scaling.
//
-// Now we also track "fastest uncached" and choose a threshold that falls
-// between the two distributions, adding some margin without sacrificing
-// precision.
-//
-// We have to measure reads across all elements of the array, since that's what
-// the code that later *uses* our computed threshold will be doing. Certain
-// measurement artifacts only appear when reading a lot of values:
-// - TimingArray forces values onto different pages, which introduces TLB
-// pressure. After re-reading ~64 elements of a 256-element array on an
-// Intel Xeon processor, we see latency increase as we start hitting the L2
-// TLB. The 7-Zip LZMA benchmarks have some useful measurements on this:
-// https://www.7-cpu.com/cpu/Skylake.html
-// - Our first version of TimingArray didn't implement cache coloring and all
-// elements were contending for the same cache set. As a result, after
-// reading 256 values, we had evicted all but the first ~8 from the L1
-// cache. Our threshold computation took this into account. If we had just
-// looped over reading one memory value, the computed threshold would have
-// been too low to classify reads from L2 or L3 cache.
-//
-// Repeating the experiment lets us
-// Repeating the experiment
-// - A context switch might happen right before a measurement, evicting array
-// elements from the cache; or one could happen *during* a measurement,
-// adding arbitrary extra time to the observed latency.
-// - A coscheduled hyperthread might introduce cache contention, forcing some
-// reads to go to memory.
-// - We might not always succeed in defeating the prefetcher, and a read we
-// expected to go to memory might instead come from cache.
-//
-//
-//
-//
-// The idea of taking low-percentile values to reduce noise is inspired in part
-// by observations from "Opportunities and Limits of Remote Timing Attacks"[1].
-//
-// [1] https://www.cs.rice.edu/~dwallach/pub/crosby-timing2009.pdf
+// Now we also measure "fastest uncached" and use it to bias the threshold up.
+// The hope is that we add margin to more reliably recognize true positives
+// without meaningfully increasing the likelihood of false positives.
uint64_t TimingArray::FindCachedReadLatencyThreshold() {
const int iterations = 10000;
+ // A few considerations for choosing the percentile:
+ // - Lower means fewer false positives
+ // - Too low and we see outliers from uncached reads that were "too fast",
+ // e.g. places where we failed to defeat the prefetcher
+ // - Too high and we see outliers from slow cached *or* uncached reads,
+ // e.g. measurements that were interrupted by context switches, or
+ // where a coscheduled hyperthread introduced cache contention.
+ const int percentile = 10;
+
// For testing, allow the threshold to be specified as an environment
// variable.
const char *threshold_from_env = getenv("CACHED_THRESHOLD");
@@ -143,6 +119,21 @@
for (int n = 0; n < iterations; ++n) {
FlushFromCache();
+ // We have to measure reads across all elements of the array, since that's
+ // what the code that later *uses* our computed threshold will be doing.
+ // Certain measurement artifacts only appear when reading many values:
+ // - TimingArray forces values onto different pages, which introduces TLB
+ // pressure. After re-reading ~64 elements of a 256-element array on an
+ // Intel Xeon processor, we see latency increase as we start hitting the
+ // L2 TLB. The 7-Zip LZMA benchmarks have some useful measurements on
+ // this: https://www.7-cpu.com/cpu/Skylake.html
+ // - Our first version of TimingArray didn't implement cache coloring and
+ // all elements were contending for the same cache set. As a result,
+ // after reading 256 values, we had evicted all but the first ~8 from
+ // the L1 cache. Our threshold computation took this into account. If we
+ // had just looped over reading one memory value, the computed threshold
+ // would have been too low to classify reads from L2 or L3 cache.
+
uint64_t fastest_uncached = std::numeric_limits<uint64_t>::max();
for (int i = 0; i < size(); ++i) {
fastest_uncached =
@@ -155,10 +146,12 @@
std::max(slowest_cached, MeasureReadLatency(&ElementAt(i)));
}
+ // The candidate threshold is the midpoint.
candidate_thresholds.push_back((fastest_uncached + slowest_cached) / 2);
}
- // Return a candidate threshold on the lower side.
+  // Return the candidate threshold at the chosen percentile.
std::sort(candidate_thresholds.begin(), candidate_thresholds.end());
- return candidate_thresholds[iterations * 0.2];
+  int index = static_cast<int>((percentile / 100.0) * (iterations - 1));
+ return candidate_thresholds[index];
}
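The threshold-selection logic added above can be sketched in isolation. This is a simplified illustration, not the actual TimingArray code: `CandidateThreshold` and `PickPercentile` are hypothetical helpers, and the real implementation derives its inputs from `MeasureReadLatency` over the array rather than taking them as arguments.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Steps (1)-(3): one experiment yields a candidate threshold, the midpoint
// of the fastest uncached read and the slowest cached read.
uint64_t CandidateThreshold(uint64_t fastest_uncached,
                            uint64_t slowest_cached) {
  return (fastest_uncached + slowest_cached) / 2;
}

// Steps (4)-(5): given many candidates, sort them and pick a value from the
// low side of the distribution, using the same index formula as the diff.
uint64_t PickPercentile(std::vector<uint64_t> candidates, int percentile) {
  std::sort(candidates.begin(), candidates.end());
  int index = static_cast<int>((percentile / 100.0) * (candidates.size() - 1));
  return candidates[index];
}
```

Picking a low percentile biases the threshold downward, trading a few false negatives (cached reads misclassified as uncached) for fewer false positives, which matches the trade-off described in the comment on `percentile`.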