Write the encoder's emitCopy in asm.

name              old speed      new speed      delta
WordsEncode1e1-8   690MB/s ± 0%   665MB/s ± 0%  -3.64%  (p=0.008 n=5+5)
WordsEncode1e2-8  83.7MB/s ± 1%  83.8MB/s ± 1%    ~     (p=0.421 n=5+5)
WordsEncode1e3-8   230MB/s ± 1%   231MB/s ± 1%    ~     (p=0.421 n=5+5)
WordsEncode1e4-8   233MB/s ± 1%   232MB/s ± 1%    ~     (p=0.151 n=5+5)
WordsEncode1e5-8   212MB/s ± 0%   212MB/s ± 1%    ~     (p=1.000 n=5+5)
WordsEncode1e6-8   255MB/s ± 0%   257MB/s ± 0%  +0.57%  (p=0.008 n=5+5)
RandomEncode-8    13.2GB/s ± 1%  13.2GB/s ± 1%    ~     (p=0.151 n=5+5)
_ZFlat0-8          623MB/s ± 0%   629MB/s ± 0%  +0.93%  (p=0.008 n=5+5)
_ZFlat1-8          319MB/s ± 1%   324MB/s ± 0%  +1.65%  (p=0.008 n=5+5)
_ZFlat2-8         13.9GB/s ± 1%  13.9GB/s ± 1%    ~     (p=0.548 n=5+5)
_ZFlat3-8          176MB/s ± 0%   176MB/s ± 1%    ~     (p=0.690 n=5+5)
_ZFlat4-8         6.05GB/s ± 0%  6.12GB/s ± 0%  +1.20%  (p=0.008 n=5+5)
_ZFlat5-8          603MB/s ± 0%   614MB/s ± 0%  +1.71%  (p=0.008 n=5+5)
_ZFlat6-8          228MB/s ± 0%   230MB/s ± 0%  +0.83%  (p=0.008 n=5+5)
_ZFlat7-8          212MB/s ± 0%   214MB/s ± 0%  +0.74%  (p=0.008 n=5+5)
_ZFlat8-8          242MB/s ± 0%   244MB/s ± 0%  +0.99%  (p=0.008 n=5+5)
_ZFlat9-8          199MB/s ± 1%   200MB/s ± 0%  +0.57%  (p=0.008 n=5+5)
_ZFlat10-8         796MB/s ± 1%   797MB/s ± 0%    ~     (p=1.000 n=5+5)
_ZFlat11-8         348MB/s ± 0%   351MB/s ± 1%    ~     (p=0.056 n=5+5)

I'm not overly worried about the WordsEncode1e1-8 regression: its time/op
is around 15 nanoseconds, so a 3.64% slowdown costs roughly half a
nanosecond per op. In comparison, _ZFlat0-8 takes around 163 microseconds
per op (note µs, not ns).