Add a loop alignment directive to work around a performance regression.

We found that LLVM upstream change rL310792 degraded the zippy benchmark
by ~3%. Performance analysis showed the regression was caused by a side
effect of that change: an incidental loop alignment change (from 32
bytes to 16 bytes) increased branch mispredictions, causing the
regression. The regression was reproducible on several Intel
microarchitectures, including Sandy Bridge, Haswell, and Skylake. Sadly,
we still do not have a good understanding of the internals of Intel's
branch predictor and cannot explain why branch mispredictions increase
when the loop alignment changes, so we cannot make a real fix here. The
workaround in this patch is to add a directive that aligns the hot loop
to 32 bytes, which restores the performance. This is in order to unblock
flipping the default compiler to LLVM.
diff --git a/snappy.cc b/snappy.cc
index 23f948f..fd519e5 100644
--- a/snappy.cc
+++ b/snappy.cc
@@ -685,6 +685,13 @@
         }
 
     MAYBE_REFILL();
+    // Add a loop alignment directive. Without this directive, we observed
+    // significant performance degradation on several Intel architectures
+    // in the snappy benchmark built with LLVM. The degradation was caused
+    // by increased branch misprediction.
+#if defined(__clang__) && defined(__x86_64__)
+    asm volatile (".p2align 5");
+#endif
     for ( ;; ) {
       const unsigned char c = *(reinterpret_cast<const unsigned char*>(ip++));