Optimize asm for decoding copy fragments some more.

Relative to the previous commit:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     518.80       508.74       0.98x
BenchmarkWordsDecode1e2-8     871.43       962.52       1.10x
BenchmarkWordsDecode1e3-8     1411.32      1435.51      1.02x
BenchmarkWordsDecode1e4-8     1469.60      1514.02      1.03x
BenchmarkWordsDecode1e5-8     771.07       807.73       1.05x
BenchmarkWordsDecode1e6-8     872.19       892.24       1.02x
Benchmark_UFlat0-8            1129.79      2200.22      1.95x
Benchmark_UFlat1-8            1075.37      1446.09      1.34x
Benchmark_UFlat2-8            15617.45     14706.88     0.94x
Benchmark_UFlat3-8            1438.15      1787.82      1.24x
Benchmark_UFlat4-8            4838.37      10683.24     2.21x
Benchmark_UFlat5-8            1075.46      1965.33      1.83x
Benchmark_UFlat6-8            811.70       833.52       1.03x
Benchmark_UFlat7-8            781.87       792.85       1.01x
Benchmark_UFlat8-8            819.38       854.75       1.04x
Benchmark_UFlat9-8            724.43       730.21       1.01x
Benchmark_UFlat10-8           1193.70      2775.98      2.33x
Benchmark_UFlat11-8           879.15       1037.94      1.18x

As with other recent commits, the new asm code is covered by the existing
tests: TestDecode, TestDecodeLengthOffset and TestDecodeGoldenInput.
There is also a new test for the slowForwardCopy algorithm, sketched below.
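
For reference, the slowForwardCopy path handles the overlapping case
where a copy's offset is smaller than its length, so the bytes already
written act as a repeating pattern. The following is only an
illustrative Go sketch of that forward-copy behaviour (byte at a time;
the asm does the same work in larger chunks, and the function name here
is hypothetical, not the asm label):

package main

import "fmt"

// forwardCopy copies length bytes within dst, from dst[d-offset:] to
// dst[d:]. When offset < length the ranges overlap, and the already
// written prefix repeats into the output.
func forwardCopy(dst []byte, d, offset, length int) int {
	for end := d + length; d != end; d++ {
		dst[d] = dst[d-offset]
	}
	return d
}

func main() {
	dst := make([]byte, 12)
	d := copy(dst, "ab")           // 2 bytes already decoded
	d = forwardCopy(dst, d, 2, 10) // copy 10 bytes at offset 2
	fmt.Printf("%q\n", dst[:d])    // "abababababab"
}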

As a data point, the "new MB/s" numbers are now in the same ballpark as
the benchmark numbers that I get from the C++ snappy implementation on
the same machine:

BM_UFlat/0   2.4GB/s    html
BM_UFlat/1   1.4GB/s    urls
BM_UFlat/2   21.1GB/s   jpg
BM_UFlat/3   1.5GB/s    jpg_200
BM_UFlat/4   10.2GB/s   pdf
BM_UFlat/5   2.1GB/s    html4
BM_UFlat/6   990.6MB/s  txt1
BM_UFlat/7   930.1MB/s  txt2
BM_UFlat/8   1.0GB/s    txt3
BM_UFlat/9   849.7MB/s  txt4
BM_UFlat/10  2.9GB/s    pb
BM_UFlat/11  1.2GB/s    gaviota

As another data point, here is the amd64 asm code as of this commit
compared to the most recent pure Go implementation, revision 03ee571c:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     498.83       508.74       1.02x
BenchmarkWordsDecode1e2-8     445.12       962.52       2.16x
BenchmarkWordsDecode1e3-8     530.29       1435.51      2.71x
BenchmarkWordsDecode1e4-8     361.08       1514.02      4.19x
BenchmarkWordsDecode1e5-8     270.69       807.73       2.98x
BenchmarkWordsDecode1e6-8     290.91       892.24       3.07x
Benchmark_UFlat0-8            543.87       2200.22      4.05x
Benchmark_UFlat1-8            449.84       1446.09      3.21x
Benchmark_UFlat2-8            15511.96     14706.88     0.95x
Benchmark_UFlat3-8            873.92       1787.82      2.05x
Benchmark_UFlat4-8            2978.58      10683.24     3.59x
Benchmark_UFlat5-8            536.04       1965.33      3.67x
Benchmark_UFlat6-8            278.33       833.52       2.99x
Benchmark_UFlat7-8            271.63       792.85       2.92x
Benchmark_UFlat8-8            288.86       854.75       2.96x
Benchmark_UFlat9-8            262.13       730.21       2.79x
Benchmark_UFlat10-8           640.03       2775.98      4.34x
Benchmark_UFlat11-8           356.37       1037.94      2.91x

The UFlat2 case decodes a Snappy-compressed JPEG file. JPEG is already a
compressed binary format, so Snappy gets little extra compression and
the stream consists almost entirely of literal elements. Decompression
therefore collapses to not much more than repeatedly invoking
runtime.memmove, so optimizing the snappy code per se doesn't have a
huge impact on that particular benchmark number.
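
To illustrate, here is a hypothetical, much-simplified decode loop for a
stream that is nothing but literal elements: each payload is copied
verbatim to the output, which is exactly the work that runtime.memmove
(via the built-in copy) ends up doing. The real decoder also handles
copy elements and varint-encoded lengths.

package main

import "fmt"

// decodeLiterals sketches the literal-only case: decoding degenerates to
// copying each payload straight into the output buffer.
// litLens is a hypothetical pre-parsed list of literal lengths.
func decodeLiterals(dst, src []byte, litLens []int) int {
	d, s := 0, 0
	for _, n := range litLens {
		d += copy(dst[d:], src[s:s+n]) // compiles down to runtime.memmove
		s += n
	}
	return d
}

func main() {
	src := []byte("already-compressed bytes, e.g. JPEG data")
	dst := make([]byte, len(src))
	n := decodeLiterals(dst, src, []int{16, len(src) - 16})
	fmt.Println(string(dst[:n]))
}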