Rewrite the core of the decoder in asm.

This is an experiment. A future commit may roll back this commit if it
turns out that the complexity and inherent unsafety of asm code
outweigh the performance benefits.

The new asm code is covered by existing tests: TestDecode,
TestDecodeLengthOffset and TestDecodeGoldenInput. These tests were
checked in by previous commits, to make it clear that they pass both
before and after this new implementation. This commit is purely an
optimization; there should be no other change in behavior.

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     498.83       519.36       1.04x
BenchmarkWordsDecode1e2-8     445.12       691.63       1.55x
BenchmarkWordsDecode1e3-8     530.29       858.97       1.62x
BenchmarkWordsDecode1e4-8     361.08       581.86       1.61x
BenchmarkWordsDecode1e5-8     270.69       380.78       1.41x
BenchmarkWordsDecode1e6-8     290.91       403.12       1.39x
Benchmark_UFlat0-8            543.87       784.21       1.44x
Benchmark_UFlat1-8            449.84       625.49       1.39x
Benchmark_UFlat2-8            15511.96     15366.67     0.99x
Benchmark_UFlat3-8            873.92       1321.47      1.51x
Benchmark_UFlat4-8            2978.58      4338.83      1.46x
Benchmark_UFlat5-8            536.04       770.24       1.44x
Benchmark_UFlat6-8            278.33       386.10       1.39x
Benchmark_UFlat7-8            271.63       376.79       1.39x
Benchmark_UFlat8-8            288.86       400.47       1.39x
Benchmark_UFlat9-8            262.13       362.89       1.38x
Benchmark_UFlat10-8           640.03       943.89       1.47x
Benchmark_UFlat11-8           356.37       493.98       1.39x

The numbers above are pure Go vs the new asm; about a 1.4x improvement.
As a data point, the numbers below are pure Go vs pure Go with bounds
checking disabled:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     498.83       516.68       1.04x
BenchmarkWordsDecode1e2-8     445.12       495.22       1.11x
BenchmarkWordsDecode1e3-8     530.29       612.44       1.15x
BenchmarkWordsDecode1e4-8     361.08       374.12       1.04x
BenchmarkWordsDecode1e5-8     270.69       300.66       1.11x
BenchmarkWordsDecode1e6-8     290.91       325.22       1.12x
Benchmark_UFlat0-8            543.87       655.85       1.21x
Benchmark_UFlat1-8            449.84       516.04       1.15x
Benchmark_UFlat2-8            15511.96     15291.13     0.99x
Benchmark_UFlat3-8            873.92       1063.07      1.22x
Benchmark_UFlat4-8            2978.58      3615.30      1.21x
Benchmark_UFlat5-8            536.04       639.51       1.19x
Benchmark_UFlat6-8            278.33       309.44       1.11x
Benchmark_UFlat7-8            271.63       301.89       1.11x
Benchmark_UFlat8-8            288.86       322.38       1.12x
Benchmark_UFlat9-8            262.13       289.92       1.11x
Benchmark_UFlat10-8           640.03       787.34       1.23x
Benchmark_UFlat11-8           356.37       403.35       1.13x

In other words, eliminating bounds checking gets you about a 1.15x
improvement. All the other benefits of hand-written asm get you another
1.2x over and above that.

For reference, I've copy/pasted the "go tool compile -S -B -o /dev/null
main.go" output at http://play.golang.org/p/vOs4Z7Qf1X