Skip to content

net: TCP connection erroneously duplicates message on Windows  #58764

Open
@jackc

Description

@jackc

What version of Go are you using (go version)?

$ go version
go version go1.20.1 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
PS C:\Users\jack\dev\golang-windows-dupreads> go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\jack\AppData\Local\go-build
set GOENV=C:\Users\jack\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\jack\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\jack\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.20.1
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=C:\Users\jack\dev\golang-windows-dupreads\go.mod
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\Users\jack\AppData\Local\Temp\go-build129281977=/tmp/go-build -gno-record-gcc-switches
PS C:\Users\jack\dev\golang-windows-dupreads>

What did you do?

  1. Create a TCP server on localhost.
  2. Connect to that server.
  3. Wrap the client conn in a wrapper that records all Write data and the server conn in a wrapper that records all Read data.
  4. Read and write a bunch of data in a very particular way (details below).
  5. Compare what was written to what was read.

What did you expect to see?

There should be no different in the bytes that the client sent and the bytes the server received.

What did you see instead?

At some point after ~2.7 MB a chunk of data is received twice.

Details

This problem was originally encountered in the tests for https://github.com/jackc/pgx. pgx has need of non-blocking IO (see #36973 and #15735). It also needs this supported when using TLS. Since the desired functionality doesn't presently exist in Go it has an internal package nbconn that wraps up a lot of complicated code in a way that presents a net.Conn that supports non-blocking IO even over TLS.

One of the tests consistently encounters a tls: bad record MAC on Windows but not on Linux or Mac. Given the complicated code it would be obvious to assume that the problem is in that nbconn package. However, after many hours of debugging I believe I have ruled that out. I have a reproducible test case that creates a net.Conn for a client and server and records everything written and read for the client to server stream. The TLS connection and the nbconn connection are on top of / use those recording connections. The recorded connections observe the repeated data (which obviously will corrupt the TLS connection).

Given that the repeated data is observed below the TLS connection I don't believe it is plausible that the bug exists in either pgx or the tls package. It seems likelier that it is a problem in the net package, the Go runtime, or Windows itself.

I have a test case at https://github.com/jackc/golang-windows-dupreads. It always fails on Windows but succeeds on other platforms. I've reduced the test as much as possible, but unfortunately I am unable to reproduce it without the particular pattern of reads and writes produced by nbconn and tls.

Here is the core of the test. It creates a server on localhost and establishes a connection to it. It wraps both connections in recorders. Then it performs some activity. Lastly, it checks to see if the bytes written and the bytes read match:

func TestRepeatedMessage(t *testing.T) {
	// Make a connected pair of TCP connections.
	tcpClientConn, tcpServerConn, err := makeTCPConns()
	if err != nil {
		t.Errorf("makeTCPConns failed: %v", err)
	}

	// Wrap the connections in recording connections.
	recordedClientConn, recordedServerConn := makeRecordedConns(tcpClientConn, tcpServerConn)

	// Establish a TLS connection from the client to the server.
	clientConn, serverConn, err := makeTSLConnection(recordedClientConn, recordedServerConn)
	if err != nil {
		t.Errorf("makeTLSConnection failed: %v", err)
	}
	defer clientConn.Close()
	defer serverConn.Close()

	// This message size should be large enough to block a Write until the other side Reads some of it. Both the server
	// and the client try to Write a large message without reading.
	const messageSize = 4 * 1024 * 1024

	// The server writes a large message before reading.
	errChan := make(chan error, 1)
	go func() {
		remoteWriteBuf := make([]byte, messageSize)
		_, err := serverConn.Write(remoteWriteBuf)
		if err != nil {
			errChan <- err
			return
		}

		readBuf := make([]byte, messageSize)
		_, err = io.ReadFull(serverConn, readBuf)
		errChan <- err
	}()

	// This Write will always succeed due to the nbconn wrapper buffering all writes. This is necessary to prevent the
	// TLS connection from being broken by a temporarily failing write.
	writeBuf := make([]byte, messageSize)
	n, err := clientConn.Write(writeBuf)
	if err != nil {
		t.Errorf("clientConn.Write failed: %v", err)
	}
	if messageSize != n {
		t.Errorf("clientConn.Write wrote wrong number of bytes: %d instead of %d", n, messageSize)
	}

	readBuf := make([]byte, messageSize)
	// This Read is what will actually flush the previous Write. Since the server goroutine above also writes a large
	// message before reading this means that both sides are writing enough bytes to block until the other side reads.
	// This will deadlock unless the internal non-blocking and buffering logic is working.
	_, err = io.ReadFull(clientConn, readBuf)
	if err != nil {
		t.Errorf("io.ReadFull(clientConn, readBuf) failed: %v", err)
	}

	err = <-errChan
	if err != nil {
		t.Errorf("serverConn goroutine encountered an error: %v", err)
	}

	// Now compare what was written by the client connection and what was read by the server connection.
	recordedServerConn.readLock.Lock()
	defer recordedServerConn.readLock.Unlock()
	readLog := recordedServerConn.readLog

	recordedClientConn.writeLock.Lock()
	defer recordedClientConn.writeLock.Unlock()
	writeLog := recordedClientConn.writeLog

	if readLog.Len() != writeLog.Len() {
		t.Errorf("readLog.Len(): %d, writeLog.Len(): %d\n", readLog.Len(), writeLog.Len())
	}

	for i := 0; i < readLog.Len(); i++ {
		if readLog.Bytes()[i] != writeLog.Bytes()[i] {
			t.Errorf("mismatch at: %d", i)
			t.Logf("readLog.Bytes()[i-20 : i+20]  %v", readLog.Bytes()[i-20:i+20])
			t.Logf("writeLog.Bytes()[i-20 : i+20] %v", writeLog.Bytes()[i-20:i+20])
			firstOccuranceOfMismatchDataIndex := bytes.Index(writeLog.Bytes(), readLog.Bytes()[i:i+100])
			if firstOccuranceOfMismatchDataIndex < i {
				t.Logf("mismatch data appears earlier in stream at %d which is %d bytes earlier", firstOccuranceOfMismatchDataIndex, i-firstOccuranceOfMismatchDataIndex)
			}
			break
		}
	}
}

Run with go test

PS C:\Users\jack\dev\golang-windows-dupreads> go test
--- FAIL: TestRepeatedMessage (0.14s)
    nbconn_test.go:229: serverConn goroutine encountered an error: local error: tls: bad record MAC
    nbconn_test.go:242: readLog.Len(): 2799033, writeLog.Len(): 4200393
    nbconn_test.go:247: mismatch at: 2766292
    nbconn_test.go:248: readLog.Bytes()[i-20 : i+20]  [33 53 47 198 118 85 105 211 192 151 29 220 243 125 77 23 3 3 64 17 75 110 182 199 210 112 201 152 118 93 91 4 251 122 27 61 92 41 94 214]
    nbconn_test.go:249: writeLog.Bytes()[i-20 : i+20] [33 53 47 198 118 85 105 211 192 151 29 220 243 125 77 23 3 3 64 17 94 72 252 16 226 122 111 94 77 211 208 12 105 22 3 27 13 109 105 190]
    nbconn_test.go:252: mismatch data appears earlier in stream at 2749886 which is 16406 bytes earlier
FAIL
exit status 1
FAIL    dupreads        0.350s

Observations:

  1. The mismatch / repeated message is always 16406 bytes earlier. This appears to be the size of the Writes emitted by the tls package.
  2. The error never occurs before 2.7MB.
  3. While going back and forth between Windows and Mac while testing this issue I observed that Windows seems to have a larger TCP buffer or window. i.e. You can write more bytes to a connection without the other side reading before the write blocks. This leads this test to generate a different pattern of reads and writes on the different platforms.

Metadata

Metadata

Assignees

No one assigned

    Labels

    NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.OS-Windows

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions