
Commit 4b2356b

poly1305: modify s390x assembly to implement MAC interface
The vector (vx) implementation has been updated to read in the state
and update it, as opposed to being a single-shot function. This has
allowed the new MAC interface to be implemented.

For performance reasons s390x uses a larger buffer than the generic
implementation. There is a relatively high fixed cost to read the
state, calculate the key coefficients and serialize the state, so it
makes sense to buffer more blocks before calling it.

For now I've had to remove the faster VMSL implementation. It is too
complex for me to update in time for Go 1.15. At some point I'd like
to revisit it, but for now it looks like using the MAC interface is
more of a win than using VMSL.

The benchmarks show considerable improvements when using the MAC
interface. The Sum benchmarks show a slowdown due to a combination of
the removal of the VMSL implementation and the added overhead from
splitting the summation function into multiple parts.

poly1305:

name               old speed      new speed      delta
64                 1.33GB/s ± 0%  0.80GB/s ± 1%   -39.51%  (p=0.000 n=16+20)
1K                 4.04GB/s ± 0%  2.97GB/s ± 0%   -26.46%  (p=0.000 n=19+19)
2M                 5.32GB/s ± 1%  3.63GB/s ± 0%   -31.76%  (p=0.000 n=20+19)
64Unaligned        1.33GB/s ± 0%  0.80GB/s ± 0%   -39.80%  (p=0.000 n=19+18)
1KUnaligned        4.09GB/s ± 1%  2.94GB/s ± 0%   -28.23%  (p=0.000 n=19+18)
2MUnaligned        5.33GB/s ± 1%  3.52GB/s ± 0%   -34.04%  (p=0.000 n=20+19)
Write64            1.03GB/s ± 1%  1.49GB/s ± 1%   +44.34%  (p=0.000 n=20+20)
Write1K            1.21GB/s ± 0%  3.24GB/s ± 0%  +169.02%  (p=0.000 n=20+17)
Write2M            1.24GB/s ± 1%  3.63GB/s ± 0%  +192.36%  (p=0.000 n=20+19)
Write64Unaligned   1.04GB/s ± 1%  1.50GB/s ± 0%   +44.16%  (p=0.000 n=19+14)
Write1KUnaligned   1.21GB/s ± 0%  3.20GB/s ± 0%  +164.55%  (p=0.000 n=20+16)
Write2MUnaligned   1.24GB/s ± 1%  3.51GB/s ± 0%  +183.96%  (p=0.000 n=20+19)

chacha20poly1305 (this vs. using generic MAC interface - post CL 206977):

name         old speed      new speed      delta
Open-64      147MB/s ± 2%   156MB/s ± 1%    +6.15%  (p=0.000 n=20+19)
Seal-64      151MB/s ± 0%   164MB/s ± 1%    +8.86%  (p=0.000 n=19+16)
Open-64-X    104MB/s ± 2%   111MB/s ± 1%    +6.24%  (p=0.000 n=20+20)
Seal-64-X    109MB/s ± 2%   111MB/s ± 1%    +2.11%  (p=0.000 n=20+19)
Open-1350    555MB/s ± 0%   751MB/s ± 1%   +35.19%  (p=0.000 n=20+20)
Seal-1350    557MB/s ± 0%   759MB/s ± 0%   +36.23%  (p=0.000 n=20+20)
Open-1350-X  517MB/s ± 1%   683MB/s ± 1%   +31.97%  (p=0.000 n=20+20)
Seal-1350-X  511MB/s ± 0%   683MB/s ± 0%   +33.77%  (p=0.000 n=18+19)
Open-8192    672MB/s ± 0%  1013MB/s ± 0%   +50.65%  (p=0.000 n=19+19)
Seal-8192    674MB/s ± 0%  1018MB/s ± 0%   +50.98%  (p=0.000 n=18+20)
Open-8192-X  663MB/s ± 0%   979MB/s ± 0%   +47.57%  (p=0.000 n=20+20)
Seal-8192-X  658MB/s ± 0%   985MB/s ± 0%   +49.62%  (p=0.000 n=18+20)

name         old allocs/op  new allocs/op  delta
Open-64      0.00           0.00           ~  (all equal)
Seal-64      0.00           0.00           ~  (all equal)
Open-64-X    0.00           0.00           ~  (all equal)
Seal-64-X    0.00           0.00           ~  (all equal)
Open-1350    0.00           0.00           ~  (all equal)
Seal-1350    0.00           0.00           ~  (all equal)
Open-1350-X  0.00           0.00           ~  (all equal)
Seal-1350-X  0.00           0.00           ~  (all equal)
Open-8192    0.00           0.00           ~  (all equal)
Seal-8192    0.00           0.00           ~  (all equal)
Open-8192-X  0.00           0.00           ~  (all equal)
Seal-8192-X  0.00           0.00           ~  (all equal)

chacha20poly1305 (this vs. using asm Sum interface - pre CL 206977):

name         old speed      new speed      delta
Open-64      144MB/s ± 0%   156MB/s ± 1%   +8.16%  (p=0.000 n=20+19)
Seal-64      150MB/s ± 0%   164MB/s ± 1%   +9.35%  (p=0.000 n=20+16)
Open-64-X    104MB/s ± 1%   111MB/s ± 1%   +6.15%  (p=0.000 n=19+20)
Seal-64-X    109MB/s ± 1%   111MB/s ± 1%   +1.43%  (p=0.000 n=19+19)
Open-1350    702MB/s ± 1%   751MB/s ± 1%   +6.98%  (p=0.000 n=20+20)
Seal-1350    715MB/s ± 0%   759MB/s ± 0%   +6.09%  (p=0.000 n=19+20)
Open-1350-X  642MB/s ± 0%   683MB/s ± 1%   +6.37%  (p=0.000 n=19+20)
Seal-1350-X  639MB/s ± 0%   683MB/s ± 0%   +6.98%  (p=0.000 n=20+19)
Open-8192    994MB/s ± 0%  1013MB/s ± 0%   +1.85%  (p=0.000 n=20+19)
Seal-8192    1.00GB/s ± 0%  1.02GB/s ± 0%  +1.90%  (p=0.000 n=20+20)
Open-8192-X  965MB/s ± 0%   979MB/s ± 0%   +1.43%  (p=0.000 n=19+20)
Seal-8192-X  962MB/s ± 0%   985MB/s ± 0%   +2.39%  (p=0.000 n=20+20)

name         old allocs/op  new allocs/op  delta
Open-64      1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Seal-64      1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Open-64-X    1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Seal-64-X    1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Open-1350    1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Seal-1350    1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Open-1350-X  1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Seal-1350-X  1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Open-8192    1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Seal-8192    1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Open-8192-X  1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)
Seal-8192-X  1.00 ± 0%      0.00           -100.00%  (p=0.000 n=20+20)

Updates golang/go#25219.

Change-Id: Ib491e3a47b6b3ec8bbbe1f41f7bf42ad82f5c249
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/219057
Run-TryBot: Michael Munday <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Filippo Valsorda <[email protected]>
1 parent 729f1e8 commit 4b2356b

9 files changed: +548 -1222 lines changed


poly1305/mac_noasm.go

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 // Use of this source code is governed by a BSD-style
 // license that can be found in the LICENSE file.
 
-// +build !amd64,!ppc64le gccgo purego
+// +build !amd64,!ppc64le,!s390x gccgo purego
 
 package poly1305
 
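The effect of adding `!s390x` to the constraint can be checked mechanically. The following is a minimal sketch, assuming a Go toolchain that ships the standard `go/build/constraint` package; the helper name `genericSelected` is hypothetical and not part of this commit. It evaluates the old and new `// +build` lines against a set of tags to show when the generic (noasm) file would be compiled.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// Build lines copied from the diff above.
const (
	oldLine = "// +build !amd64,!ppc64le gccgo purego"
	newLine = "// +build !amd64,!ppc64le,!s390x gccgo purego"
)

// genericSelected (hypothetical helper) reports whether a file carrying the
// given build line would be compiled when exactly the listed tags are set.
func genericSelected(line string, tags ...string) bool {
	expr, err := constraint.Parse(line)
	if err != nil {
		panic(err)
	}
	set := make(map[string]bool, len(tags))
	for _, t := range tags {
		set[t] = true
	}
	return expr.Eval(func(tag string) bool { return set[tag] })
}

func main() {
	// Before this commit the generic MAC was used on s390x.
	fmt.Println(genericSelected(oldLine, "s390x")) // true
	// After it, s390x uses its own MAC unless purego (or gccgo) is set.
	fmt.Println(genericSelected(newLine, "s390x"))           // false
	fmt.Println(genericSelected(newLine, "s390x", "purego")) // true
}
```

In `// +build` syntax a comma means AND and a space means OR, which is why `purego` alone is enough to fall back to the generic code on any architecture.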

poly1305/poly1305.go

Lines changed: 3 additions & 1 deletion
@@ -26,7 +26,9 @@ const TagSize = 16
 // 16-byte result into out. Authenticating two different messages with the same
 // key allows an attacker to forge messages at will.
 func Sum(out *[16]byte, m []byte, key *[32]byte) {
-	sum(out, m, key)
+	h := New(key)
+	h.Write(m)
+	h.Sum(out[:0])
 }
 
 // Verify returns true if mac is a valid authenticator for m with the given key.

poly1305/poly1305_test.go

Lines changed: 35 additions & 3 deletions
@@ -6,6 +6,7 @@ package poly1305
 
 import (
 	"crypto/rand"
+	"encoding/binary"
 	"encoding/hex"
 	"flag"
 	"testing"
@@ -15,9 +16,10 @@ import (
 var stressFlag = flag.Bool("stress", false, "run slow stress tests")
 
 type test struct {
-	in  string
-	key string
-	tag string
+	in    string
+	key   string
+	tag   string
+	state string
 }
 
 func (t *test) Input() []byte {
@@ -48,9 +50,33 @@ func (t *test) Tag() [16]byte {
 	return tag
 }
 
+func (t *test) InitialState() [3]uint64 {
+	// state is hex encoded in big-endian byte order
+	if t.state == "" {
+		return [3]uint64{0, 0, 0}
+	}
+	buf, err := hex.DecodeString(t.state)
+	if err != nil {
+		panic(err)
+	}
+	if len(buf) != 3*8 {
+		panic("incorrect state length")
+	}
+	return [3]uint64{
+		binary.BigEndian.Uint64(buf[16:24]),
+		binary.BigEndian.Uint64(buf[8:16]),
+		binary.BigEndian.Uint64(buf[0:8]),
+	}
+}
+
 func testSum(t *testing.T, unaligned bool, sumImpl func(tag *[TagSize]byte, msg []byte, key *[32]byte)) {
 	var tag [16]byte
 	for i, v := range testData {
+		// cannot set initial state before calling sum, so skip those tests
+		if v.InitialState() != [3]uint64{0, 0, 0} {
+			continue
+		}
+
 		in := v.Input()
 		if unaligned {
 			in = unalignBytes(in)
@@ -140,6 +166,9 @@ func testWriteGeneric(t *testing.T, unaligned bool) {
 			input = unalignBytes(input)
 		}
 		h := newMACGeneric(&key)
+		if s := v.InitialState(); s != [3]uint64{0, 0, 0} {
+			h.macState.h = s
+		}
 		n, err := h.Write(input[:len(input)/3])
 		if err != nil || n != len(input[:len(input)/3]) {
 			t.Errorf("#%d: unexpected Write results: n = %d, err = %v", i, n, err)
@@ -165,6 +194,9 @@ func testWrite(t *testing.T, unaligned bool) {
 			input = unalignBytes(input)
 		}
 		h := New(&key)
+		if s := v.InitialState(); s != [3]uint64{0, 0, 0} {
+			h.macState.h = s
+		}
 		n, err := h.Write(input[:len(input)/3])
 		if err != nil || n != len(input[:len(input)/3]) {
 			t.Errorf("#%d: unexpected Write results: n = %d, err = %v", i, n, err)
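`InitialState` reverses the limb order because the test vectors store the state as one big-endian number, while the accumulator keeps the least significant 64 bits in index 0. A standalone sketch of that decoding follows. The function name `decodeState` and the sample value (2¹³⁰ - 5 written as 24 big-endian bytes) are illustrative assumptions and are not taken from the test data in this commit.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// decodeState (hypothetical helper) turns a 24-byte big-endian hex string
// into three 64-bit limbs, least significant limb first.
func decodeState(s string) ([3]uint64, error) {
	buf, err := hex.DecodeString(s)
	if err != nil {
		return [3]uint64{}, err
	}
	if len(buf) != 3*8 {
		return [3]uint64{}, fmt.Errorf("incorrect state length %d", len(buf))
	}
	return [3]uint64{
		binary.BigEndian.Uint64(buf[16:24]), // bits 0..63
		binary.BigEndian.Uint64(buf[8:16]),  // bits 64..127
		binary.BigEndian.Uint64(buf[0:8]),   // bits 128..191
	}, nil
}

func main() {
	// Illustrative value only: 2^130 - 5 as a 192-bit big-endian number.
	h, err := decodeState("0000000000000003fffffffffffffffffffffffffffffffb")
	if err != nil {
		panic(err)
	}
	// The limbs are h[0] = 0xfffffffffffffffb, h[1] = 0xffffffffffffffff
	// and h[2] = 0x3.
	fmt.Printf("%#x %#x %#x\n", h[0], h[1], h[2])
}
```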

poly1305/sum_generic.go

Lines changed: 2 additions & 1 deletion
@@ -41,7 +41,8 @@ func newMACGeneric(key *[32]byte) macGeneric {
 // the value of [x0, x1, x2] is x[0] + x[1] * 2⁶⁴ + x[2] * 2¹²⁸.
 type macState struct {
 	// h is the main accumulator. It is to be interpreted modulo 2¹³⁰ - 5, but
-	// can grow larger during and after rounds.
+	// can grow larger during and after rounds. It must, however, remain below
+	// 2 * (2¹³⁰ - 5).
 	h [3]uint64
 	// r and s are the private key components.
 	r [2]uint64
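The new comment bounds the accumulator by 2 * (2¹³⁰ - 5), which equals 2¹³¹ - 10. As an illustration of the limb representation described above, here is a small sketch that checks the bound with `math/big`. The helper names `limbsToInt` and `withinBound` are hypothetical, and nothing like this check appears in the package itself.

```go
package main

import (
	"fmt"
	"math/big"
)

// limbsToInt interprets [x0, x1, x2] as x0 + x1*2^64 + x2*2^128.
func limbsToInt(x [3]uint64) *big.Int {
	v := new(big.Int)
	for i := 2; i >= 0; i-- {
		v.Lsh(v, 64)
		v.Add(v, new(big.Int).SetUint64(x[i]))
	}
	return v
}

// withinBound reports whether x is strictly below 2 * (2^130 - 5).
func withinBound(x [3]uint64) bool {
	p := new(big.Int).Lsh(big.NewInt(1), 130)
	p.Sub(p, big.NewInt(5))          // p = 2^130 - 5
	limit := new(big.Int).Lsh(p, 1) // limit = 2 * p = 2^131 - 10
	return limbsToInt(x).Cmp(limit) < 0
}

func main() {
	max := ^uint64(0)
	// 2^131 - 11, the largest value inside the bound.
	fmt.Println(withinBound([3]uint64{0xfffffffffffffff5, max, 7})) // true
	// 2^131 - 10, the bound itself, which is excluded.
	fmt.Println(withinBound([3]uint64{0xfffffffffffffff6, max, 7})) // false
}
```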

poly1305/sum_noasm.go

Lines changed: 0 additions & 18 deletions
This file was deleted.

poly1305/sum_s390x.go

Lines changed: 54 additions & 18 deletions
@@ -2,38 +2,74 @@
 // Use of this source code is governed by a BSD-style
 // license that can be found in the LICENSE file.
 
-// +build go1.11,!gccgo,!purego
+// +build !gccgo,!purego
 
 package poly1305
 
 import (
 	"golang.org/x/sys/cpu"
 )
 
-// poly1305vx is an assembly implementation of Poly1305 that uses vector
+// updateVX is an assembly implementation of Poly1305 that uses vector
 // instructions. It must only be called if the vector facility (vx) is
 // available.
 //go:noescape
-func poly1305vx(out *[16]byte, m *byte, mlen uint64, key *[32]byte)
+func updateVX(state *macState, msg []byte)
 
-// poly1305vmsl is an assembly implementation of Poly1305 that uses vector
-// instructions, including VMSL. It must only be called if the vector facility (vx) is
-// available and if VMSL is supported.
-//go:noescape
-func poly1305vmsl(out *[16]byte, m *byte, mlen uint64, key *[32]byte)
+// mac is a replacement for macGeneric that uses a larger buffer and redirects
+// calls that would have gone to updateGeneric to updateVX if the vector
+// facility is installed.
+//
+// A larger buffer is required for good performance because the vector
+// implementation has a higher fixed cost per call than the generic
+// implementation.
+type mac struct {
+	macState
+
+	buffer [16 * TagSize]byte // size must be a multiple of block size (16)
+	offset int
+}
 
-func sum(out *[16]byte, m []byte, key *[32]byte) {
-	if cpu.S390X.HasVX {
-		var mPtr *byte
-		if len(m) > 0 {
-			mPtr = &m[0]
+func (h *mac) Write(p []byte) (int, error) {
+	nn := len(p)
+	if h.offset > 0 {
+		n := copy(h.buffer[h.offset:], p)
+		if h.offset+n < len(h.buffer) {
+			h.offset += n
+			return nn, nil
 		}
-		if cpu.S390X.HasVXE && len(m) > 256 {
-			poly1305vmsl(out, mPtr, uint64(len(m)), key)
+		p = p[n:]
+		h.offset = 0
+		if cpu.S390X.HasVX {
+			updateVX(&h.macState, h.buffer[:])
 		} else {
-			poly1305vx(out, mPtr, uint64(len(m)), key)
+			updateGeneric(&h.macState, h.buffer[:])
 		}
-	} else {
-		sumGeneric(out, m, key)
 	}
+
+	tail := len(p) % len(h.buffer) // number of bytes to copy into buffer
+	body := len(p) - tail          // number of bytes to process now
+	if body > 0 {
+		if cpu.S390X.HasVX {
+			updateVX(&h.macState, p[:body])
+		} else {
+			updateGeneric(&h.macState, p[:body])
+		}
+	}
+	h.offset = copy(h.buffer[:], p[body:]) // copy tail bytes - can be 0
+	return nn, nil
+}
+
+func (h *mac) Sum(out *[TagSize]byte) {
+	state := h.macState
+	remainder := h.buffer[:h.offset]
+
+	// Use the generic implementation if we have 2 or fewer blocks left
+	// to sum. The vector implementation has a higher startup time.
+	if cpu.S390X.HasVX && len(remainder) > 2*TagSize {
+		updateVX(&state, remainder)
+	} else if len(remainder) > 0 {
+		updateGeneric(&state, remainder)
+	}
+	finalize(out, &state.h, &state.s)
 }
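The buffering in `Write` and `Sum` can be exercised in isolation. The sketch below copies the control flow of `mac.Write` but replaces the Poly1305 state with a toy, order-sensitive checksum (`toyState` and `update` are hypothetical stand-ins and are not cryptographic). It shows two properties the commit message relies on: the result does not depend on how the input is split across `Write` calls, and the expensive update routine is invoked far less often than once per 16-byte block.

```go
package main

import "fmt"

const blockSize = 16

// toyState stands in for macState. It is NOT Poly1305, only an
// order-sensitive checksum plus a counter of update calls.
type toyState struct {
	acc   uint64
	calls int
}

// update stands in for updateVX/updateGeneric.
func update(s *toyState, msg []byte) {
	s.calls++
	for _, b := range msg {
		s.acc = s.acc*31 + uint64(b)
	}
}

type buffered struct {
	toyState
	buffer [16 * blockSize]byte
	offset int
}

// Write follows the same steps as mac.Write in the diff above.
func (h *buffered) Write(p []byte) (int, error) {
	nn := len(p)
	if h.offset > 0 {
		n := copy(h.buffer[h.offset:], p)
		if h.offset+n < len(h.buffer) {
			h.offset += n // buffer not yet full, nothing to process
			return nn, nil
		}
		p = p[n:]
		h.offset = 0
		update(&h.toyState, h.buffer[:]) // flush the full buffer
	}
	tail := len(p) % len(h.buffer) // bytes to keep for later
	body := len(p) - tail          // bytes to process now
	if body > 0 {
		update(&h.toyState, p[:body])
	}
	h.offset = copy(h.buffer[:], p[body:])
	return nn, nil
}

// finish processes the buffered remainder on a copy of the state, as
// mac.Sum does, so further writes remain possible.
func (h *buffered) finish() uint64 {
	state := h.toyState
	update(&state, h.buffer[:h.offset])
	return state.acc
}

func main() {
	msg := make([]byte, 1000)
	for i := range msg {
		msg[i] = byte(i)
	}
	var oneShot, chunked buffered
	oneShot.Write(msg)
	for i := 0; i < len(msg); i += 7 {
		end := i + 7
		if end > len(msg) {
			end = len(msg)
		}
		chunked.Write(msg[i:end])
	}
	fmt.Println(oneShot.finish() == chunked.finish()) // true
	fmt.Println(oneShot.calls, chunked.calls)         // 1 3
}
```

With 7-byte writes, an unbuffered design that processed every complete 16-byte block as soon as it arrived would call the update routine 62 times for this input; the 256-byte buffer reduces that to 3, which matters when each call has a high fixed cost.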
