[reader] Fix compressed reader retries (#432)

Note: this introduces a behavior change in addition to the bug fix described below. After `compressedSeeker.Read()` returns an error, subsequent calls continue to drain the internal buffer until it announces EOF, whereas before the method short-circuited and reported the last error on subsequent calls. Since this is an internal package, it's unlikely that there have been any external dependencies on this behavior.

There was a subtle bug in the `compressedSeeker` type that caused retried upload requests to occasionally fail under a very specific set of conditions:

1. The file being uploaded must be greater than the client's `MaxBatchSize` so it's not batched together with other files (compression is always disabled for files that are batched together).
2. Upload compression must be enabled for the file, e.g. by setting `client.CompressedBytestreamThreshold(x)` where `x` is less than the file size. This causes the upload to enter the `compressedSeeker` code path.

Here's the process by which the bug happens (the relevant code is mostly in `go/pkg/client/bytestream.go` in the `writeChunked()` function):
1. All chunks of the file upload normally up until the final chunk.
2. `ch.Next()` is called and retrieves the final chunk of the file.
3. The `reader.compressedSeeker` underlying the `chunker.Chunker` that was passed into `writeChunked()` gets an `io.EOF` error when reading from its underlying reader. This is completely normal and expected since the end of the file has been reached.
4. `reader.compressedSeeker.Read()` correctly propagates the `io.EOF` error but also caches it to its `cfs.err` field, so all subsequent calls to `Read()` will return `io.EOF`.
5. The `WriteRequest` to upload the final chunk of the file fails transiently, e.g. due to an internal server error.
6. The entire upload gets retried due to the `c.Retrier.Do()` call within `writeChunked()`.
7. `ch.Reset()` gets called to reset the chunker to the beginning of the file, calling `compressedSeeker.SeekOffset(0)` under the hood. This correctly seeks to the beginning of the file but *does not* unset the cached `err`.
8. The first call to `ch.Next()` tries to call `reader.compressedSeeker.Read()`, expecting to read from the beginning of the file again, but it immediately returns the cached `io.EOF` with a length of zero bytes, suggesting that the file is now empty.
9. `writeChunked()` dutifully tries to upload the apparently empty file, but under the digest of the file's actual contents.
10. This results in a server-side error when the server validates that the uploaded file matches the digest under which it is being stored, something like: `Uploaded blob had calculated digest 19f3/0 but was being stored to 34ba/<size>`, where `<size>` is the actual size of the file in bytes. This is the same error that occurs when a file is changed by another process between the first and second upload attempts, but here it happens even though the file is never modified on disk. The error is not considered transient, so it is not retried and bubbles up to the caller.

I could find no indication of why errors needed to be cached on the `compressedSeeker`, and since the error caching was the root cause of the bug (reproduced in a test case), it was simplest to just remove the error caching logic.