[perfcompare] Do multiple launches of each test process instead of just one

We've found that the mean running times of performance tests tend to
vary between launches of the same process and between reboots.

We want to calculate the mean across process launches and reboots, so
we need to sample accordingly.  This change implements sampling across
process relaunches.  Sampling across reboots will be done later.

 * Before: We launch the perftest process once, and (for each test
   case) do 1000 test runs in that process.  We treat this as a sample
   of size 1000 when calculating confidence intervals.

 * After: We launch the perftest process 30 times, and (for each test
   case) do 100 test runs in each process.  We treat this as a sample
   of size 30 when calculating confidence intervals (see the sketch
   below).
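
As a rough illustration of the "After" scheme, here is a sketch in
Python.  This is not the perfcompare code itself: the data is
simulated, and the real interval calculation lives in perfcompare.py
rather than being reproduced exactly here.

    import math
    import random

    def Mean(values):
        return float(sum(values)) / len(values)

    # Simulated data: 30 process launches, 100 test runs per launch.
    runs_per_process = [[random.gauss(1000, 100) for _ in range(100)]
                        for _ in range(30)]

    # One mean running time per process launch: a sample of size 30.
    process_means = [Mean(runs) for runs in runs_per_process]

    # z-test confidence interval over the per-process means (with no
    # Bessel's correction, like MeanAndStddev() in perfcompare.py).
    mean_val = Mean(process_means)
    stddev = math.sqrt(sum((x - mean_val) ** 2 for x in process_means)
                       / len(process_means))
    offset = -2.5758293035489008  # same value as Z_TEST_OFFSET
    half_width = -offset * stddev / math.sqrt(len(process_means))
    print((mean_val - half_width, mean_val + half_width))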

Bug: IN-646
Test: "python perfcompare_test.py" + perfcompare trybot
Change-Id: Ic0176dcc394e87015302bbe3d8588a68408fca05
diff --git a/garnet/bin/perfcompare/perfcompare.py b/garnet/bin/perfcompare/perfcompare.py
index 18bfee7..ae4da1b 100644
--- a/garnet/bin/perfcompare/perfcompare.py
+++ b/garnet/bin/perfcompare/perfcompare.py
@@ -18,20 +18,37 @@
 # confidence intervals are non-overlapping, we conclude that the
 # performance has improved or regressed for this test.
 #
-# We make the simplifying assumption that the running time for each
-# run of a performance test is normally distributed.  (In practice,
-# this assumption is not true.  We will need to check how much tests
-# deviate from this assumption, and how much that affects the
-# comparison we are doing here.)
+# Data is gathered from a 3-level sampling process:
 #
-# With that assumption, ideally we should use Student's t-distribution
-# for calculating the confidence intervals for the means.  That is
-# easy if the SciPy library is available.  However, this code runs
-# using infra's copy of Python, which doesn't make SciPy available.
-# For now, we instead use the normal distribution for calculating the
-# confidence intervals (giving a Z-test instead of a t-test).
-# Fortunately, for large samples (i.e. large numbers of test runs),
-# the difference between the two is small.
+#  1) Boot Fuchsia one or more times.  (Currently we only boot once.)
+#  2) For each boot, launch the perf test process one or more times.
+#  3) For each process launch, instantiate the performance test and
+#     run the body of the test some number of times.
+#
+# This is intended to account for variation across boots and across process
+# launches.
+#
+# Currently we use z-test confidence intervals.  In future we should
+# either use t-test confidence intervals or (preferably) bootstrap
+# confidence intervals.
+#
+#  * We apply the z-test confidence interval to the mean running times from
+#    each process instance (from level #3).  This means we treat the sample
+#    size as being the number of process launches.  This is rather
+#    ad-hoc: it assumes that there is a lot of between-process variation
+#    and that we need to widen the confidence intervals to reflect that.
+#    Using bootstrapping with resampling across the 3 levels above should
+#    account for that variation without making ad-hoc assumptions.
+#
+#  * This assumes that the values we apply the z-test to are normally
+#    distributed, or approximately normally distributed.  Using
+#    bootstrapping instead would avoid this assumption.
+#
+#  * t-test confidence intervals would be better than z-test confidence
+#    intervals, especially for smaller sample sizes.  The former is easier
+#    to do if the SciPy library is available.  However, this code runs
+#    using infra's copy of Python, which doesn't make SciPy available at
+#    the moment.
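+#
+# As a rough illustration of the z-test interval described above (this is
+# illustrative only, not the exact code used elsewhere in this file):
+# given per-process mean running times x_1..x_N from N process launches,
+# the interval for the overall mean is approximately
+#
+#     mean(x) +/- |Z_TEST_OFFSET| * stddev(x) / sqrt(N)
+#
+# where Z_TEST_OFFSET is the normal-distribution quantile defined below.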
 
 
 # ALPHA is a parameter for calculating confidence intervals.  It is
@@ -48,11 +65,16 @@
 # This is the value of scipy.stats.norm.ppf(ALPHA / 2).
 Z_TEST_OFFSET = -2.5758293035489008
 
+
+def Mean(values):
+    return float(sum(values)) / len(values)
+
+
 # Returns the mean and standard deviation of a sample.  This does the
 # same as scipy.stats.norm.fit().  This does not apply Bessel's
 # correction to the calculation of the standard deviation.
 def MeanAndStddev(values):
-    mean_val = float(sum(values)) / len(values)
+    mean_val = Mean(values)
     sum_of_squares = 0.0
     for val in values:
         diff = val - mean_val
@@ -83,17 +105,17 @@
 
 def ResultsFromDir(dir_path):
     results_map = {}
-    # Sorting the result of os.listdir() is not essential.  Currently
-    # it just makes error handling of duplicates more deterministic.
+    # Sorting the result of os.listdir() is not essential, but it makes any
+    # behaviour that depends on the file ordering more deterministic.
     for filename in sorted(os.listdir(dir_path)):
         if filename == 'summary.json':
             continue
         if filename.endswith('.json'):
             file_path = os.path.join(dir_path, filename)
             for data in ReadJsonFile(file_path):
-                assert data['label'] not in results_map
-                results_map[data['label']] = Stats(data['values'])
-    return results_map
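+                # Each results file is expected to come from a single
+                # launch of the perf test process, so the mean taken
+                # here is a per-process-launch mean for this test case.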
+                new_value = Mean(data['values'])
+                results_map.setdefault(data['label'], []).append(new_value)
+    return {name: Stats(values) for name, values in results_map.iteritems()}
 
 
 def FormatTable(rows, out_fh):
diff --git a/garnet/bin/perfcompare/perfcompare_test.py b/garnet/bin/perfcompare/perfcompare_test.py
index 281f06d..61d9c50 100644
--- a/garnet/bin/perfcompare/perfcompare_test.py
+++ b/garnet/bin/perfcompare/perfcompare_test.py
@@ -34,6 +34,11 @@
             func()
 
 
+def WriteJsonFile(filename, json_data):
+    with open(filename, 'w') as fh:
+        json.dump(json_data, fh)
+
+
 def ReadGoldenFile(filename):
     data = open(filename, 'r').read()
     matches = list(re.finditer('\n\n### (.*)\n', data, re.M))
@@ -116,17 +121,18 @@
 
     def ExampleDataDir(self, mean=1000, stddev=100, drop_one=False):
         dir_path = self.MakeTempDir()
-        results = [{'label': 'ClockGetTimeExample',
-                    'test_suite': 'fuchsia.example.perf_test',
-                    'unit': 'nanoseconds',
-                    'values': GenerateTestData(mean, stddev)}]
+        results = [('ClockGetTimeExample', GenerateTestData(mean, stddev))]
         if not drop_one:
-            results.append({'label': 'SecondExample',
-                            'test_suite': 'fuchsia.example.perf_test',
-                            'unit': 'nanoseconds',
-                            'values': GenerateTestData(2000, 300)})
-        with open(os.path.join(dir_path, 'example.perf_test.json'), 'w') as fh:
-            json.dump(results, fh)
+            results.append(('SecondExample', GenerateTestData(2000, 300)))
+
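+        # Write each value to a separate JSON file so that each file
+        # stands in for the results of one perf test process launch.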
+        for test_name, values in results:
+            for idx, value in enumerate(values):
+                WriteJsonFile(
+                    os.path.join(dir_path, '%s_%d.json' % (test_name, idx)),
+                    [{'label': test_name,
+                      'test_suite': 'fuchsia.example.perf_test',
+                      'unit': 'nanoseconds',
+                      'values': [value]}])
 
         # Include a summary.json file to check that we skip reading it.
         with open(os.path.join(dir_path, 'summary.json'), 'w') as fh:
diff --git a/garnet/tests/benchmarks/benchmarks_perfcompare.cc b/garnet/tests/benchmarks/benchmarks_perfcompare.cc
index 00ca7dd..d1da0ec 100644
--- a/garnet/tests/benchmarks/benchmarks_perfcompare.cc
+++ b/garnet/tests/benchmarks/benchmarks_perfcompare.cc
@@ -11,6 +11,7 @@
 // timeout.
 
 #include "garnet/testing/benchmarking/benchmarking.h"
+#include "src/lib/fxl/strings/string_printf.h"
 
 int main(int argc, const char** argv) {
   auto maybe_benchmarks_runner =
@@ -21,16 +22,27 @@
 
   auto& benchmarks_runner = *maybe_benchmarks_runner;
 
-  // Performance tests implemented in the Zircon repo.
-  benchmarks_runner.AddLibPerfTestBenchmark(
-      "zircon.perf_test",
-      "/pkgfs/packages/garnet_benchmarks/0/test/sys/perf-test");
+  // Reduce the number of iterations of each perf test within each process
+  // given that we are launching each process multiple times.
+  std::vector<std::string> extra_args = {"--runs", "100"};
 
-  // Performance tests implemented in the Garnet repo (the name
-  // "zircon_benchmarks" is now misleading).
-  benchmarks_runner.AddLibPerfTestBenchmark(
-      "zircon_benchmarks",
-      "/pkgfs/packages/zircon_benchmarks/0/test/zircon_benchmarks");
+  // Run these processes multiple times in order to account for
+  // between-process variation in results (e.g. due to the memory layout
+  // chosen when a process starts).
+  for (int process = 0; process < 30; ++process) {
+    // Performance tests implemented in the Zircon repo.
+    benchmarks_runner.AddLibPerfTestBenchmark(
+        fxl::StringPrintf("zircon.perf_test_process%06d", process),
+        "/pkgfs/packages/garnet_benchmarks/0/test/sys/perf-test",
+        extra_args);
+
+    // Performance tests implemented in the Garnet repo (the name
+    // "zircon_benchmarks" is now misleading).
+    benchmarks_runner.AddLibPerfTestBenchmark(
+        fxl::StringPrintf("zircon_benchmarks_process%06d", process),
+        "/pkgfs/packages/zircon_benchmarks/0/test/zircon_benchmarks",
+        extra_args);
+  }
 
   benchmarks_runner.Finish();
 }