poly1305: simplify reference implementation

Reduce code complexity by replacing the floating-point implementation
with one based on 32-bit integer arithmetic.

The new implementation is also considerably faster on 386:

name            old time/op    new time/op    delta
64-2              972ns ± 2%     350ns ± 1%   -64.04%  (p=0.029 n=4+4)
1K-2             10.9µs ± 3%     4.2µs ± 1%   -61.11%  (p=0.029 n=4+4)
64Unaligned-2     969ns ± 2%     354ns ± 2%   -63.44%  (p=0.029 n=4+4)
1KUnaligned-2    10.8µs ± 3%     4.2µs ± 1%   -61.15%  (p=0.029 n=4+4)

name            old speed      new speed      delta
64-2            65.8MB/s ± 2%  182.9MB/s ± 1%  +177.93%  (p=0.029 n=4+4)
1K-2            94.3MB/s ± 3%  242.3MB/s ± 1%  +157.08%  (p=0.029 n=4+4)
64Unaligned-2   66.0MB/s ± 2%  180.4MB/s ± 2%  +173.32%  (p=0.029 n=4+4)
1KUnaligned-2   94.4MB/s ± 3%  243.0MB/s ± 1%  +157.36%  (p=0.029 n=4+4)

Optimized versions already exist for amd64 and arm, and an optimized
version for s390x appears to be planned.
See: https://go-review.googlesource.com/#/c/32812/

Change-Id: I7a5ac62ae33727b0e6060cb966de73a468317e30
Reviewed-on: https://go-review.googlesource.com/35294
Reviewed-by: Michael Munday <munday@ca.ibm.com>
Reviewed-by: Adam Langley <agl@golang.org>
1 file changed