blake2b: fix AVX performance problems on amd64

On some amd64 CPUs (Xeon E5-2680v4 / E5-2620v3), mixing SSE and AVX
instructions leads to very low performance.
On an i7-6500U the mixed SSE/AVX code performs as follows:

AVX2:
name        time/op
Write128-4    165ns ± 0%
Write1K-4    1.20µs ± 0%
Sum128-4      189ns ± 1%
Sum1K-4      1.22µs ± 0%

name        speed
Write128-4  773MB/s ± 1%
Write1K-4   855MB/s ± 0%
Sum128-4    675MB/s ± 1%
Sum1K-4     838MB/s ± 0%

while the same code achieves less than 65MB/s on a Xeon E5-2620v3.

Replacing the SSE instructions `MOVQ` and `PINSRQ` with their AVX counterparts
`VMOVQ` and `VPINSRQ` brings the AVX/AVX2 code up to the expected performance:

name         old time/op    new time/op     delta
Write128-12    2.20µs ±10%     0.22µs ± 9%    -90.00%  (p=0.029 n=4+4)
Write1K-12     16.2µs ± 0%      1.1µs ± 0%    -93.07%  (p=0.029 n=4+4)
Sum128-12      2.10µs ± 0%     0.22µs ± 0%    -89.47%  (p=0.029 n=4+4)
Sum1K-12       16.3µs ± 0%      1.2µs ± 0%    -92.65%  (p=0.029 n=4+4)

name         old speed      new speed       delta
Write128-12  58.5MB/s ±10%  582.8MB/s ±10%   +897.08%  (p=0.029 n=4+4)
Write1K-12   63.1MB/s ± 0%  909.8MB/s ± 0%  +1341.40%  (p=0.029 n=4+4)
Sum128-12    60.8MB/s ± 0%  576.3MB/s ± 0%   +847.84%  (p=0.029 n=4+4)
Sum1K-12     62.8MB/s ± 0%  855.2MB/s ± 0%  +1260.78%  (p=0.029 n=4+4)
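
The speed columns follow from the time/op columns: throughput is the number
of bytes hashed per operation divided by the time per operation. A minimal
sketch of that arithmetic, assuming the Write1K/Sum1K benchmarks process 1024
bytes per op and the Write128/Sum128 benchmarks 128 bytes per op (the helper
name throughputMBps is hypothetical and not part of the package):

```go
package main

import "fmt"

// throughputMBps converts bytes per op and nanoseconds per op into MB/s,
// where 1 MB = 1e6 bytes (the unit used by Go benchmark output).
// Hypothetical helper for illustration only.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	// bytes/ns equals GB/s; multiply by 1e3 to get MB/s.
	return float64(bytesPerOp) / nsPerOp * 1e3
}

func main() {
	// Old Write1K time on the 12-thread machine: 16.2µs = 16200ns.
	fmt.Printf("old Write1K: %.1f MB/s\n", throughputMBps(1024, 16200))
	// New Write1K time: 1.1µs = 1100ns (rounded in the table above).
	fmt.Printf("new Write1K: %.1f MB/s\n", throughputMBps(1024, 1100))
}
```

For the old Write1K result, 1024 bytes in 16.2µs gives roughly 63 MB/s,
consistent with the reported 63.1MB/s. The times in the table are rounded, so
recomputed speeds match the reported ones only approximately.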

The AVX/AVX2 code now uses only AVX (no SSE) instructions.

Fixes golang/go#18563.

Change-Id: I1961dd8fa02014642587523b7f099816a263c9f5
Reviewed-on: https://go-review.googlesource.com/34993
Reviewed-by: Adam Langley <agl@golang.org>