refs/heads/upstream/firmware-rex-15709.B - third_party/vboot_reference

commit	ae3ca0f05a87a80afc298d45743bb3a34dd1da6f
author	Muhammad Monir Hossain <muhammad.monir.hossain@intel.com>	Fri Oct 13 15:45:27 2023 -0700
committer	YH Lin <yueherngl@chromium.org>	Thu Dec 28 16:49:00 2023 +0000
tree	7496885d6433d43af8fdc416c8a01412df5f77b1
parent	7f98141e69161ebd646f3c193b8a8ca4c9816dae

firmware/2lib: Use SSE2 to speed-up Montgomery multiplication

This commit implements Algorithm 2 from "Montgomery Multiplication
Using Vector Instructions" (August 20, 2013, cf.
https://eprint.iacr.org/2013/519.pdf). This variant of Montgomery
multiplication decouples some arithmetic steps so that they can be
performed in parallel.
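
To illustrate the decoupling, here is a minimal scalar sketch based on
our reading of Algorithm 2 in the paper. It is not the SSE2 code added
by this commit: the names mont_mu(), mont_mul_2lane() and
MONT_MAX_WORDS are hypothetical, and the two "lanes" are plain C
variables. The point is that the accumulation of a[j] * b (lane 0) and
of q * m (lane 1) never read each other's carries inside the inner
loop, which is what lets a vector unit process them side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MONT_MAX_WORDS 64	/* hypothetical cap: 64 x 32 bits = 2048 bits */

/* mu = m0^-1 mod 2^32 for odd m0, computed by Newton iteration. */
static uint32_t mont_mu(uint32_t m0)
{
	uint32_t x = m0;	/* m0 * m0 == 1 mod 8: 3 correct bits */

	for (int i = 0; i < 4; i++)
		x *= 2u - m0 * x;	/* each step doubles the correct bits */
	return x;
}

/*
 * c = a * b * 2^(-32 * n) mod m, with a, b < m and m odd.
 * All numbers are arrays of n little-endian 32-bit words.
 */
static void mont_mul_2lane(uint32_t *c, const uint32_t *a, const uint32_t *b,
			   const uint32_t *m, size_t n)
{
	assert(n > 0 && n <= MONT_MAX_WORDS);

	uint32_t d[MONT_MAX_WORDS] = {0}, e[MONT_MAX_WORDS] = {0};
	const uint32_t mu = mont_mu(m[0]);
	const uint32_t mu_b0 = mu * b[0];

	for (size_t j = 0; j < n; j++) {
		/* q makes the low words of d + a[j]*b and e + q*m equal. */
		uint32_t q = mu_b0 * a[j] + mu * (d[0] - e[0]);
		/* Lane 0 computes d = (d + a[j] * b) >> 32. */
		uint64_t t0 = ((uint64_t)a[j] * b[0] + d[0]) >> 32;
		/* Lane 1 computes e = (e + q * m) >> 32. */
		uint64_t t1 = ((uint64_t)q * m[0] + e[0]) >> 32;

		for (size_t i = 1; i < n; i++) {
			uint64_t p0 = (uint64_t)a[j] * b[i] + t0 + d[i];
			uint64_t p1 = (uint64_t)q * m[i] + t1 + e[i];

			d[i - 1] = (uint32_t)p0;
			e[i - 1] = (uint32_t)p1;
			t0 = p0 >> 32;
			t1 = p1 >> 32;
		}
		d[n - 1] = (uint32_t)t0;
		e[n - 1] = (uint32_t)t1;
	}

	/* c = d - e, then add m back if the subtraction borrowed. */
	uint64_t borrow = 0;
	for (size_t i = 0; i < n; i++) {
		uint64_t s = (uint64_t)d[i] - e[i] - borrow;

		c[i] = (uint32_t)s;
		borrow = (s >> 32) & 1;
	}
	if (borrow) {
		uint64_t carry = 0;

		for (size_t i = 0; i < n; i++) {
			carry += (uint64_t)c[i] + m[i];
			c[i] = (uint32_t)carry;
			carry >>= 32;
		}
	}
}
```

Because both lanes perform the same multiply-accumulate with different
operands, each inner-loop step maps naturally onto one two-lane
32x32->64 vector multiplication.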

This implementation uses the SSE2 instruction set to perform these
arithmetic operations in parallel. It is selected by setting the
`X86_SSE2_INSTR' compilation flag and can deliver a significant
performance boost to RSA signature verification.
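
As a rough sketch of what compile-time selection can look like, the
hypothetical helper below (mul_2x32() is not a function from this
commit) computes two 32x32->64 products, using the SSE2 PMULUDQ
instruction when X86_SSE2_INSTR is visible as a preprocessor macro and
plain C otherwise. That the flag reaches the code as a macro of the
same name is an assumption made for this example.

```c
#include <stdint.h>

#if defined(X86_SSE2_INSTR)
#include <emmintrin.h>	/* SSE2 intrinsics */
#endif

/* out[k] = (uint64_t)a[k] * b[k] for k = 0, 1. Hypothetical helper. */
static void mul_2x32(uint64_t out[2], const uint32_t a[2],
		     const uint32_t b[2])
{
#if defined(X86_SSE2_INSTR)
	/* Place each operand in the low half of a 64-bit lane. */
	__m128i va = _mm_set_epi32(0, (int)a[1], 0, (int)a[0]);
	__m128i vb = _mm_set_epi32(0, (int)b[1], 0, (int)b[0]);

	/* PMULUDQ: two unsigned 32x32->64 multiplications at once. */
	_mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
#else
	/* Portable fallback: the same two products, one after the other. */
	out[0] = (uint64_t)a[0] * b[0];
	out[1] = (uint64_t)a[1] * b[1];
#endif
}
```

Both paths return the same values, so callers do not need to know
which one was compiled in.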

For instance, on a Meteor Lake rex0 board we measured a 56% reduction
in the execution time of the modular exponentiation (timings in ms):

| modpow() function call during boot | original | SSE2 Algorithm 2 |
|------------------------------------+----------+------------------|
| coreboot/verstage 1                |    6.531 |            2.965 |
| coreboot/verstage 2                |    1.854 |            0.750 |
| coreboot/verstage 3                |    1.841 |            0.751 |
| depthcharge 1                      |    0.547 |            0.273 |
| depthcharge 2                      |    0.152 |            0.079 |
| depthcharge 3                      |    0.164 |            0.077 |
|------------------------------------+----------+------------------|
| Total (ms)                         |   11.089 |            4.895 |
| Ratio compared to original         |    100 % |             44 % |

However, on some SoCs, SSE2 may not perform well before the silicon
is fully initialized. For example, on a Raptor Lake board we observed
a performance degradation before FSP-S and an improvement after
FSP-S.

On a Raptor Lake brya0 board:

| modpow() function call during boot | original | SSE2 Algorithm 2 |
|------------------------------------+----------+------------------|
| coreboot/verstage 1                |    5.642 |           10.145 |
| coreboot/verstage 2                |    1.560 |            2.412 |
| coreboot/verstage 3                |    1.548 |            2.413 |
| depthcharge 1                      |    0.693 |            0.248 |
| depthcharge 2                      |    0.172 |            0.065 |
| depthcharge 3                      |    0.223 |            0.067 |
|------------------------------------+----------+------------------|
| Total (ms)                         |    9.838 |           15.350 |
| Ratio compared to original         |    100 % |            156 % |

Hence the optimized code should only be enabled on platforms and at
boot stages where it has been measured to be faster.
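
The per-stage effect on brya0 is easier to see when the timings are
summed by boot stage. The sketch below does that arithmetic with the
values from the Raptor Lake table. It assumes the verstage calls run
before FSP-S and the depthcharge calls after it, which is what the
table suggests. The helper names are hypothetical.

```c
#include <stddef.h>

/* modpow() timings in ms on brya0, copied from the table above. */
static const double brya_verstage_orig[] = { 5.642, 1.560, 1.548 };
static const double brya_verstage_sse2[] = { 10.145, 2.412, 2.413 };
static const double brya_depthcharge_orig[] = { 0.693, 0.172, 0.223 };
static const double brya_depthcharge_sse2[] = { 0.248, 0.065, 0.067 };

/* Sum of n timings in milliseconds. */
static double sum_ms(const double *t, size_t n)
{
	double total = 0.0;

	for (size_t i = 0; i < n; i++)
		total += t[i];
	return total;
}

/* SSE2 time as a percentage of the original time for one stage. */
static double stage_ratio_percent(const double *orig, const double *sse2,
				  size_t n)
{
	return 100.0 * sum_ms(sse2, n) / sum_ms(orig, n);
}
```

With these numbers, verstage comes out at roughly 171 % of the
original time (8.750 ms to 14.970 ms), while depthcharge drops to
roughly 35 % (1.088 ms to 0.380 ms).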

BUG=b:312709384
BRANCH=main
TEST=modpow time duration is divided by two on rex board

Cq-Depend: chromium:5146107
Change-Id: I8739f94c0f6423f6e848bcae10b56e8be8e352ce
Signed-off-by: Jeremy Compostella <jeremy.compostella@intel.com>
Signed-off-by: Muhammad Monir Hossain <muhammad.monir.hossain@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/vboot_reference/+/5055254
Commit-Queue: Julius Werner <jwerner@chromium.org>
Tested-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Julius Werner <jwerner@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/vboot_reference/+/5149259
Commit-Queue: YH Lin <yueherngl@chromium.org>
Tested-by: Subrata Banik <subratabanik@chromium.org>
Reviewed-by: YH Lin <yueherngl@chromium.org>

9 files changed
