Zero out only the part of the hash table that is in use.

The biggest gains are for the smallest inputs, where zeroing the full
32768-byte table was a large part of the encoding cost.

The WordsEncode1e1-8 benchmark should be unaffected (except possibly by
noise) because it never calls into encodeBlock.

name              old speed      new speed       delta
WordsEncode1e1-8   678MB/s ± 0%    667MB/s ± 0%    -1.74%  (p=0.008 n=5+5)
WordsEncode1e2-8  90.1MB/s ± 0%  352.9MB/s ± 1%  +291.85%  (p=0.008 n=5+5)
WordsEncode1e3-8   295MB/s ± 0%    383MB/s ± 1%   +29.67%  (p=0.008 n=5+5)
WordsEncode1e4-8   276MB/s ± 0%    277MB/s ± 1%      ~     (p=0.310 n=5+5)
WordsEncode1e5-8   248MB/s ± 0%    248MB/s ± 0%      ~     (p=0.841 n=5+5)
WordsEncode1e6-8   295MB/s ± 0%    296MB/s ± 0%      ~     (p=0.548 n=5+5)
RandomEncode-8    14.4GB/s ± 1%   14.4GB/s ± 2%      ~     (p=1.000 n=5+5)
_ZFlat0-8          749MB/s ± 0%    748MB/s ± 0%      ~     (p=0.151 n=5+5)
_ZFlat1-8          405MB/s ± 0%    406MB/s ± 0%    +0.18%  (p=0.032 n=4+5)
_ZFlat2-8         16.2GB/s ± 1%   16.1GB/s ± 1%      ~     (p=0.421 n=5+5)
_ZFlat3-8          202MB/s ± 1%    604MB/s ± 0%  +198.86%  (p=0.008 n=5+5)
_ZFlat4-8         7.59GB/s ± 1%   7.62GB/s ± 1%      ~     (p=0.548 n=5+5)
_ZFlat5-8          728MB/s ± 1%    729MB/s ± 0%      ~     (p=0.548 n=5+5)
_ZFlat6-8          266MB/s ± 1%    267MB/s ± 0%      ~     (p=0.548 n=5+5)
_ZFlat7-8          248MB/s ± 0%    248MB/s ± 0%      ~     (p=0.881 n=5+5)
_ZFlat8-8          282MB/s ± 0%    282MB/s ± 0%      ~     (p=0.556 n=4+5)
_ZFlat9-8          231MB/s ± 0%    231MB/s ± 0%    +0.23%  (p=0.032 n=5+5)
_ZFlat10-8         970MB/s ± 0%    972MB/s ± 0%      ~     (p=0.421 n=5+5)
_ZFlat11-8         402MB/s ± 0%    401MB/s ± 0%      ~     (p=0.730 n=5+5)
diff --git a/encode_amd64.s b/encode_amd64.s
index 56631f3..92f0a39 100644
--- a/encode_amd64.s
+++ b/encode_amd64.s
@@ -262,8 +262,11 @@
 varTable:
 	// var table [maxTableSize]uint16
 	//
-	// sizeof(table) is 32768 bytes, which is 2048 16-byte writes.
-	MOVQ $2048, DX
+	// In the asm code, unlike the Go code, we can zero-initialize only the
+	// first tableSize elements. Each uint16 element is 2 bytes and each MOVOU
+	// writes 16 bytes, so we can do only tableSize/8 writes instead of the
+	// 2048 writes that would zero-initialize all of table's 32768 bytes.
+	SHRQ $3, DX
 	LEAQ table-32768(SP), BX
 	PXOR X0, X0