Enable stitched x86-64 AES-NI AES-GCM implementation.

Measured on a SkyLake processor:

Before:

Did 11373750 AES-128-GCM (16 bytes) seal operations in 1016000us (11194635.8 ops/sec): 179.1 MB/s
Did 2253000 AES-128-GCM (1350 bytes) seal operations in 1016000us (2217519.7 ops/sec): 2993.7 MB/s
Did 453750 AES-128-GCM (8192 bytes) seal operations in 1015000us (447044.3 ops/sec): 3662.2 MB/s
Did 10753500 AES-256-GCM (16 bytes) seal operations in 1016000us (10584153.5 ops/sec): 169.3 MB/s
Did 1898750 AES-256-GCM (1350 bytes) seal operations in 1015000us (1870689.7 ops/sec): 2525.4 MB/s
Did 374000 AES-256-GCM (8192 bytes) seal operations in 1016000us (368110.2 ops/sec): 3015.6 MB/s

After:

Did 11074000 AES-128-GCM (16 bytes) seal operations in 1015000us (10910344.8 ops/sec): 174.6 MB/s
Did 3178250 AES-128-GCM (1350 bytes) seal operations in 1016000us (3128198.8 ops/sec): 4223.1 MB/s
Did 734500 AES-128-GCM (8192 bytes) seal operations in 1016000us (722933.1 ops/sec): 5922.3 MB/s
Did 10394750 AES-256-GCM (16 bytes) seal operations in 1015000us (10241133.0 ops/sec): 163.9 MB/s
Did 2502250 AES-256-GCM (1350 bytes) seal operations in 1016000us (2462844.5 ops/sec): 3324.8 MB/s
Did 544500 AES-256-GCM (8192 bytes) seal operations in 1015000us (536453.2 ops/sec): 4394.6 MB/s

Change-Id: If058935796441ed4e577b9a72d3aa43422edba58
Reviewed-on: https://boringssl-review.googlesource.com/7273
Reviewed-by: Adam Langley <alangley@gmail.com>
diff --git a/BUILDING.md b/BUILDING.md
index 8e226c0..c87e8b5 100644
--- a/BUILDING.md
+++ b/BUILDING.md
@@ -33,7 +33,7 @@
     executable may be configured explicitly by setting `GO_EXECUTABLE`.
 
   * To build the x86 and x86\_64 assembly, your assembler must support AVX2
-    instructions. If using GNU binutils, you must have 2.22 or later.
+    instructions and MOVBE. If using GNU binutils, you must have 2.22 or later.
 
 ## Building
 
diff --git a/crypto/modes/asm/aesni-gcm-x86_64.pl b/crypto/modes/asm/aesni-gcm-x86_64.pl
index f777a6e..e329741 100644
--- a/crypto/modes/asm/aesni-gcm-x86_64.pl
+++ b/crypto/modes/asm/aesni-gcm-x86_64.pl
@@ -41,17 +41,24 @@
 ( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
 die "can't locate x86_64-xlate.pl";
 
-# This must be kept in sync with |$avx| in ghash-x86_64.pl; otherwise tags will
+# |$avx| in ghash-x86_64.pl must be set to at least 1; otherwise tags will
 # be computed incorrectly.
 #
 # In upstream, this is controlled by shelling out to the compiler to check
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
-$avx = 0;
+#
+# The upstream code uses the condition |$avx>1| even though no AVX2
+# instructions are used, because it assumes MOVBE is supported by the assembler
+# if and only if AVX2 is also supported by the assembler; see
+# https://marc.info/?l=openssl-dev&m=146567589526984&w=2.
+$avx = 2;
 
 open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
 *STDOUT=*OUT;
 
+# See the comment above regarding why the condition is ($avx>1) when there are
+# no AVX2 instructions being used.
 if ($avx>1) {{{
 
 ($inp,$out,$len,$key,$ivp,$Xip)=("%rdi","%rsi","%rdx","%rcx","%r8","%r9");
diff --git a/crypto/modes/asm/ghash-x86_64.pl b/crypto/modes/asm/ghash-x86_64.pl
index df8546c..d7471e2 100644
--- a/crypto/modes/asm/ghash-x86_64.pl
+++ b/crypto/modes/asm/ghash-x86_64.pl
@@ -90,13 +90,13 @@
 ( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
 die "can't locate x86_64-xlate.pl";
 
-# This must be kept in sync with |$avx| in aesni-gcm-x86_64.pl; otherwise tags
-# will be computed incorrectly.
+# See the notes about |$avx| in aesni-gcm-x86_64.pl; otherwise tags will be
+# computed incorrectly.
 #
 # In upstream, this is controlled by shelling out to the compiler to check
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
-$avx = 0;
+$avx = 1;
 
 open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
 *STDOUT=*OUT;