Optimize asm for decoding copy fragments.

Relative to the previous commit:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     518.05       518.80       1.00x
BenchmarkWordsDecode1e2-8     776.28       871.43       1.12x
BenchmarkWordsDecode1e3-8     995.41       1411.32      1.42x
BenchmarkWordsDecode1e4-8     615.92       1469.60      2.39x
BenchmarkWordsDecode1e5-8     453.95       771.07       1.70x
BenchmarkWordsDecode1e6-8     453.74       872.19       1.92x
Benchmark_UFlat0-8            863.12       1129.79      1.31x
Benchmark_UFlat1-8            766.01       1075.37      1.40x
Benchmark_UFlat2-8            15463.36     15617.45     1.01x
Benchmark_UFlat3-8            1388.63      1438.15      1.04x
Benchmark_UFlat4-8            4367.79      4838.37      1.11x
Benchmark_UFlat5-8            844.84       1075.46      1.27x
Benchmark_UFlat6-8            442.42       811.70       1.83x
Benchmark_UFlat7-8            437.68       781.87       1.79x
Benchmark_UFlat8-8            458.19       819.38       1.79x
Benchmark_UFlat9-8            423.36       724.43       1.71x
Benchmark_UFlat10-8           1023.05      1193.70      1.17x
Benchmark_UFlat11-8           507.18       879.15       1.73x
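
For readers unfamiliar with the term, the following is a minimal pure-Go sketch of what decoding a copy fragment involves, assuming the usual LZ77-style meaning: a copy is an (offset, length) pair that re-emits bytes already written to the output. The name expandCopy and its signature are hypothetical, and this is not the assembly changed by this commit; it only illustrates the operation the benchmarks above exercise.

```go
package main

// expandCopy is a hypothetical helper written for illustration only.
// It appends length bytes to dst, reading them starting offset bytes
// before the current end of dst. It reports false if the offset or
// length is invalid for the current output.
func expandCopy(dst []byte, offset, length int) ([]byte, bool) {
	if offset <= 0 || offset > len(dst) || length < 0 {
		return dst, false
	}
	// Copy one byte at a time, front to back. When offset < length the
	// source and destination ranges overlap, and the forward order is
	// what makes a short pattern repeat itself.
	for i := 0; i < length; i++ {
		dst = append(dst, dst[len(dst)-offset])
	}
	return dst, true
}
```

With an output of "ab", a copy with offset 2 and length 5 yields "abababa". The overlapping case is why a plain block move is not always a safe substitute for this loop.
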
1 file changed