Description
What version of Go are you using (go version
)?
$ go version go version go1.20.1 windows/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env
)?
go env
Output
$ go env PS C:\Users\jack\dev\golang-windows-dupreads> go env set GO111MODULE= set GOARCH=amd64 set GOBIN= set GOCACHE=C:\Users\jack\AppData\Local\go-build set GOENV=C:\Users\jack\AppData\Roaming\go\env set GOEXE=.exe set GOEXPERIMENT= set GOFLAGS= set GOHOSTARCH=amd64 set GOHOSTOS=windows set GOINSECURE= set GOMODCACHE=C:\Users\jack\go\pkg\mod set GONOPROXY= set GONOSUMDB= set GOOS=windows set GOPATH=C:\Users\jack\go set GOPRIVATE= set GOPROXY=https://proxy.golang.org,direct set GOROOT=C:\Program Files\Go set GOSUMDB=sum.golang.org set GOTMPDIR= set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64 set GOVCS= set GOVERSION=go1.20.1 set GCCGO=gccgo set GOAMD64=v1 set AR=ar set CC=gcc set CXX=g++ set CGO_ENABLED=0 set GOMOD=C:\Users\jack\dev\golang-windows-dupreads\go.mod set GOWORK= set CGO_CFLAGS=-O2 -g set CGO_CPPFLAGS= set CGO_CXXFLAGS=-O2 -g set CGO_FFLAGS=-O2 -g set CGO_LDFLAGS=-O2 -g set PKG_CONFIG=pkg-config set GOGCCFLAGS=-m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\Users\jack\AppData\Local\Temp\go-build129281977=/tmp/go-build -gno-record-gcc-switches PS C:\Users\jack\dev\golang-windows-dupreads>
What did you do?
- Create a TCP server on localhost.
- Connect to that server.
- Wrap the client conn in a wrapper that records all
Write
data and the server conn in a wrapper that records allRead
data. - Read and write a bunch of data in a very particular way (details below).
- Compare what was written to what was read.
What did you expect to see?
There should be no different in the bytes that the client sent and the bytes the server received.
What did you see instead?
At some point after ~2.7 MB a chunk of data is received twice.
Details
This problem was originally encountered in the tests for https://github.com/jackc/pgx. pgx has need of non-blocking IO (see #36973 and #15735). It also needs this supported when using TLS. Since the desired functionality doesn't presently exist in Go it has an internal package nbconn
that wraps up a lot of complicated code in a way that presents a net.Conn
that supports non-blocking IO even over TLS.
One of the tests consistently encounters a tls: bad record MAC
on Windows but not on Linux or Mac. Given the complicated code it would be obvious to assume that the problem is in that nbconn
package. However, after many hours of debugging I believe I have ruled that out. I have a reproducible test case that creates a net.Conn
for a client and server and records everything written and read for the client to server stream. The TLS connection and the nbconn connection are on top of / use those recording connections. The recorded connections observe the repeated data (which obviously will corrupt the TLS connection).
Given that the repeated data is observed below the TLS connection I don't believe it is plausible that the bug exists in either pgx or the tls package. It seems likelier that it is a problem in the net package, the Go runtime, or Windows itself.
I have a test case at https://github.com/jackc/golang-windows-dupreads. It always fails on Windows but succeeds on other platforms. I've reduced the test as much as possible, but unfortunately I am unable to reproduce it without the particular pattern of reads and writes produced by nbconn
and tls
.
Here is the core of the test. It creates a server on localhost and establishes a connection to it. It wraps both connections in recorders. Then it performs some activity. Lastly, it checks to see if the bytes written and the bytes read match:
func TestRepeatedMessage(t *testing.T) {
// Make a connected pair of TCP connections.
tcpClientConn, tcpServerConn, err := makeTCPConns()
if err != nil {
t.Errorf("makeTCPConns failed: %v", err)
}
// Wrap the connections in recording connections.
recordedClientConn, recordedServerConn := makeRecordedConns(tcpClientConn, tcpServerConn)
// Establish a TLS connection from the client to the server.
clientConn, serverConn, err := makeTSLConnection(recordedClientConn, recordedServerConn)
if err != nil {
t.Errorf("makeTLSConnection failed: %v", err)
}
defer clientConn.Close()
defer serverConn.Close()
// This message size should be large enough to block a Write until the other side Reads some of it. Both the server
// and the client try to Write a large message without reading.
const messageSize = 4 * 1024 * 1024
// The server writes a large message before reading.
errChan := make(chan error, 1)
go func() {
remoteWriteBuf := make([]byte, messageSize)
_, err := serverConn.Write(remoteWriteBuf)
if err != nil {
errChan <- err
return
}
readBuf := make([]byte, messageSize)
_, err = io.ReadFull(serverConn, readBuf)
errChan <- err
}()
// This Write will always succeed due to the nbconn wrapper buffering all writes. This is necessary to prevent the
// TLS connection from being broken by a temporarily failing write.
writeBuf := make([]byte, messageSize)
n, err := clientConn.Write(writeBuf)
if err != nil {
t.Errorf("clientConn.Write failed: %v", err)
}
if messageSize != n {
t.Errorf("clientConn.Write wrote wrong number of bytes: %d instead of %d", n, messageSize)
}
readBuf := make([]byte, messageSize)
// This Read is what will actually flush the previous Write. Since the server goroutine above also writes a large
// message before reading this means that both sides are writing enough bytes to block until the other side reads.
// This will deadlock unless the internal non-blocking and buffering logic is working.
_, err = io.ReadFull(clientConn, readBuf)
if err != nil {
t.Errorf("io.ReadFull(clientConn, readBuf) failed: %v", err)
}
err = <-errChan
if err != nil {
t.Errorf("serverConn goroutine encountered an error: %v", err)
}
// Now compare what was written by the client connection and what was read by the server connection.
recordedServerConn.readLock.Lock()
defer recordedServerConn.readLock.Unlock()
readLog := recordedServerConn.readLog
recordedClientConn.writeLock.Lock()
defer recordedClientConn.writeLock.Unlock()
writeLog := recordedClientConn.writeLog
if readLog.Len() != writeLog.Len() {
t.Errorf("readLog.Len(): %d, writeLog.Len(): %d\n", readLog.Len(), writeLog.Len())
}
for i := 0; i < readLog.Len(); i++ {
if readLog.Bytes()[i] != writeLog.Bytes()[i] {
t.Errorf("mismatch at: %d", i)
t.Logf("readLog.Bytes()[i-20 : i+20] %v", readLog.Bytes()[i-20:i+20])
t.Logf("writeLog.Bytes()[i-20 : i+20] %v", writeLog.Bytes()[i-20:i+20])
firstOccuranceOfMismatchDataIndex := bytes.Index(writeLog.Bytes(), readLog.Bytes()[i:i+100])
if firstOccuranceOfMismatchDataIndex < i {
t.Logf("mismatch data appears earlier in stream at %d which is %d bytes earlier", firstOccuranceOfMismatchDataIndex, i-firstOccuranceOfMismatchDataIndex)
}
break
}
}
}
Run with go test
PS C:\Users\jack\dev\golang-windows-dupreads> go test
--- FAIL: TestRepeatedMessage (0.14s)
nbconn_test.go:229: serverConn goroutine encountered an error: local error: tls: bad record MAC
nbconn_test.go:242: readLog.Len(): 2799033, writeLog.Len(): 4200393
nbconn_test.go:247: mismatch at: 2766292
nbconn_test.go:248: readLog.Bytes()[i-20 : i+20] [33 53 47 198 118 85 105 211 192 151 29 220 243 125 77 23 3 3 64 17 75 110 182 199 210 112 201 152 118 93 91 4 251 122 27 61 92 41 94 214]
nbconn_test.go:249: writeLog.Bytes()[i-20 : i+20] [33 53 47 198 118 85 105 211 192 151 29 220 243 125 77 23 3 3 64 17 94 72 252 16 226 122 111 94 77 211 208 12 105 22 3 27 13 109 105 190]
nbconn_test.go:252: mismatch data appears earlier in stream at 2749886 which is 16406 bytes earlier
FAIL
exit status 1
FAIL dupreads 0.350s
Observations:
- The mismatch / repeated message is always 16406 bytes earlier. This appears to be the size of the Writes emitted by the tls package.
- The error never occurs before 2.7MB.
- While going back and forth between Windows and Mac while testing this issue I observed that Windows seems to have a larger TCP buffer or window. i.e. You can write more bytes to a connection without the other side reading before the write blocks. This leads this test to generate a different pattern of reads and writes on the different platforms.