# Trace Processor Parse Cache
**Authors:** @lalitm
**Status:** Draft
## Problem
Loading large traces (multi-GB) into Trace Processor is slow. Parsing
protobuf trace data, sorting events, and populating tables can take minutes
for large traces. This cost is paid on every load, even when the trace
hasn't changed.
This is painful in several workflows:
1. **Interactive analysis:** Reopening the same trace in the UI after a page
refresh or browser restart.
2. **AI agents:** Repeated open/query/close cycles on the same trace, where
each cycle re-parses from scratch.
3. **Iterative scripting:** Python notebooks or scripts that create a new
`TraceProcessor` instance on the same trace during development.
The parsed table representation is deterministic for a given trace + TP
version + flags. Re-parsing is pure waste when the inputs haven't changed.
## Decision
Add a parse cache that serializes parsed tables into an opaque binary
format and transparently loads from cache on subsequent opens. The caching
logic lives entirely in the shell layer — the TP core remains IO-free.
## Design
### Cache format
The parse cache format is an **internal implementation detail** of Trace
Processor. It is not stable, not versioned for external consumers, and
carries **no forwards or backwards compatibility guarantees**. The format
may change between any two TP releases without notice.
Currently, the cache is a TAR archive of Arrow IPC files (one per intrinsic
table), reusing the existing `ExportToArrow()` / Arrow TAR import
infrastructure. This is a convenient implementation choice, not a public
contract.
A separate, stable Arrow export feature (with explicit versioning and
compatibility guarantees) will be provided independently for users who need
a durable, interoperable representation. The parse cache is not that — it
is purely an acceleration mechanism tied to a specific TP build.
### Cache key
The cache is identified by:
```
SHA256(tp_version + sorted_relevant_global_flags + trace_identity)
```
Where `trace_identity` is:
| Client | Identity |
| ------------- | ------------------------------------------------- |
| CLI shell | file path + size + mtime (from `stat()`) |
| Browser/UI | `File.name` + `File.size` + `File.lastModified` |
| Python | file path + size + mtime (from `os.stat()`) |
This is fast to compute (no file content hashing) and sufficient for
practical invalidation. The chance of a false cache hit (different trace
content with identical name + size + mtime) is negligible.
"Relevant global flags" are those that affect parsing behavior:
`--full-sort`, `--no-ftrace-raw`, `--crop-track-events`, etc.
The inclusion of `tp_version` in the cache key means that upgrading TP
automatically invalidates all existing caches. This is intentional — schema
changes between versions make old caches unsafe to load.
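The cache key computation described above can be sketched in a few lines. This is an illustrative Python sketch, not the shell's actual implementation; the exact concatenation/encoding of the key material is an assumption.

```python
import hashlib
import os

def parse_cache_key(tp_version: str, flags: list[str], trace_path: str) -> str:
    """Sketch of the cache key: SHA256 over TP version, sorted relevant
    flags, and the stat-based trace identity (path + size + mtime)."""
    st = os.stat(trace_path)
    trace_identity = f"{trace_path}:{st.st_size}:{st.st_mtime_ns}"
    material = tp_version + "".join(sorted(flags)) + trace_identity
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Note that any change to the file's size or mtime, to the TP version, or to a parsing-relevant flag yields a different hash and therefore a cache miss.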
### Cache location
`~/.cache/perfetto/parse-cache/<hash>` by default, overridable via
`--parse-cache-dir`. The cache directory is created on first write.
### CLI: `--parse-cache` global flag
```bash
trace_processor_shell --parse-cache query -c "SELECT ..." trace.pb
```
Behavior:
1. Compute cache key from the trace file's stat metadata + TP version +
flags.
2. If a valid cache exists, read it and feed it to `Parse()` instead of
the original trace. Skip to step 5.
3. Otherwise, read the original trace and feed it to `Parse()` as normal.
4. After parsing completes, start writing the parse cache in a background
thread via `ExportToArrow()`.
5. Execute the user's command (query, repl, serve, etc.).
6. On exit, wait for the background cache write to complete before
terminating.
The background write means the first load has no added latency for queries.
The wait-on-exit means the cache is guaranteed to exist after the first
invocation completes.
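The steps above can be sketched as follows. This is a hedged Python illustration of the control flow, assuming `parse` and `export_cache` as stand-ins for the shell's `Parse()` and `ExportToArrow()` paths; a real implementation would also validate the cache key, not just check for the file's existence.

```python
import os
import threading

def open_with_parse_cache(trace_path, cache_path, parse, export_cache):
    """Illustrative sketch of steps 1-6 of the --parse-cache flow."""
    if os.path.exists(cache_path):
        # Step 2: cache hit, feed the cached bytes to Parse() instead.
        with open(cache_path, "rb") as f:
            parse(f.read())
        return None
    # Step 3: cache miss, parse the original trace as normal.
    with open(trace_path, "rb") as f:
        parse(f.read())
    # Step 4: write the cache off the critical path, in the background.
    writer = threading.Thread(target=export_cache, args=(cache_path,))
    writer.start()
    # Step 6: the caller joins this thread before exiting, so the cache
    # is guaranteed to exist once the invocation completes.
    return writer
```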
### CLI: `parse-cache` subcommand
For explicit cache management by human users:
```bash
# Create a parse cache for a trace
trace_processor_shell parse-cache create trace.pb
# Show cache status for a trace (exists, size, staleness)
trace_processor_shell parse-cache info trace.pb
# Delete cache for a specific trace
trace_processor_shell parse-cache clear trace.pb
# Delete all parse caches
trace_processor_shell parse-cache clear --all
```
### RPC: `/parse` with trace identity
To support caching for RPC clients (UI, Python), the first `/parse` call
gains an optional `trace_identity` field:
```protobuf
message ParseRequest {
  optional bytes data = 1;

  // Optional. Sent only on the first /parse call. If the shell has a
  // valid parse cache for this identity, it loads from cache and ignores
  // subsequent /parse data.
  optional string trace_identity = 2;
}

message ParseResult {
  optional string error = 1;

  // If true, the shell loaded from cache. The client may skip sending
  // further /parse chunks. Old clients that ignore this field and
  // continue sending data are fine — the shell discards the bytes.
  optional bool skip_further_parse = 2;
}
```
Flow:
1. Client sends first `/parse` chunk with `trace_identity` set.
2. Shell checks cache:
- **Cache hit:** Load cached data, respond with
`skip_further_parse = true`. Discard data from this and all subsequent
`/parse` calls.
- **Cache miss:** Parse the data normally, respond with
`skip_further_parse = false`. Continue accepting `/parse` chunks.
3. After `notify_eof`, if the cache was missed, the shell writes the cache
   in the background.
4. Old clients that don't send `trace_identity`: no caching, existing
behavior unchanged.
5. New clients that ignore `skip_further_parse` and keep sending: shell
discards bytes, everything still works.
This is fully backwards compatible in both directions.
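The client side of this flow can be sketched as below. This is an illustrative Python sketch, assuming a hypothetical `rpc.parse` helper standing in for the `/parse` endpoint; the chunk size and return-value convention are arbitrary choices for the example.

```python
import os

def stream_trace(rpc, path, chunk_size=32 * 1024 * 1024):
    """Cache-aware client: send trace_identity on the first /parse call,
    and stop streaming early if the shell reports a cache hit."""
    st = os.stat(path)
    identity = f"{path}:{st.st_size}:{st.st_mtime_ns}"
    with open(path, "rb") as f:
        first = f.read(chunk_size)
        result = rpc.parse(data=first, trace_identity=identity)
        if result.get("skip_further_parse"):
            return True  # cache hit: the remaining chunks are unnecessary
        while chunk := f.read(chunk_size):
            rpc.parse(data=chunk)
    return False  # cache miss: full trace was streamed
```

A client that ignores the return value and keeps streaming still works, since the shell simply discards the extra bytes.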
### Python API
Caching is configured via `TraceProcessorConfig`:
```python
from perfetto.trace_processor import TraceProcessor, TraceProcessorConfig
config = TraceProcessorConfig(parse_cache=True)
tp = TraceProcessor(trace="large.pb", config=config)
```
Implementation: `parse_cache=True` causes the Python wrapper to pass
`--parse-cache` to the shell subprocess. The Python client also sends
`trace_identity` (from `os.stat()`) on the first `/parse` call so the
shell can check its cache. If the response has `skip_further_parse = true`,
Python skips sending remaining chunks.
No explicit cache management API in Python. Users who want manual control
can use the CLI `parse-cache` subcommand.
### What is NOT cached
- **Derived state:** Views, PerfettoSQL module outputs, and custom SQL are
recomputed on top of the cached tables. This is expected and fast.
- **Session-specific tables:** `__intrinsic_trace_file`,
`__intrinsic_trace_import_logs`, and similar metadata tables are excluded
from the cache.
- **Empty tables:** Tables with zero rows are not written to the cache.
- **Stdin/pipe traces:** When there is no file identity (data piped from
stdin or generated programmatically), caching is silently skipped.
### Disk space
The cache size is roughly comparable to the parsed table data, which can be
a significant fraction of the original trace size. Mitigations:
- Caching is always opt-in (`--parse-cache` flag or `parse-cache create`).
Users make a conscious choice to spend disk space.
- `parse-cache info` shows cache sizes.
- `parse-cache clear --all` provides easy cleanup.
- After writing, the shell prints the cache size as a courtesy:
`Parse cache written: 8.2 GB at ~/.cache/perfetto/parse-cache/a1b2c3`
### Concurrency
- Cache files are written atomically: write to a temporary file, then
rename. This prevents partial reads from concurrent TP instances.
- Multiple TP instances loading the same trace simultaneously may each
write a cache. The last rename wins, which is fine since the content is
identical.
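The write-to-temp-then-rename pattern looks roughly like this (a Python sketch of the technique, not the shell's C++ code). On POSIX filesystems, `rename` within the same directory is atomic, so a concurrent reader sees either the old cache, the new cache, or no cache, never a partial file.

```python
import os
import tempfile

def write_cache_atomically(cache_path: str, data: bytes) -> None:
    """Write to a temp file in the target directory, then rename over
    the final path. The last concurrent writer wins."""
    dir_ = os.path.dirname(cache_path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_, prefix=".parse-cache-tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure data hits disk before the rename
        os.replace(tmp, cache_path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory (and therefore the same filesystem) as the final path, since `rename` across filesystems is not atomic.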
### Architecture summary
```
+------------------+ +-------------------+ +---------------+
| Clients | | Shell layer | | TP core |
| | | (IO lives here) | | (IO-free) |
| CLI user |---->| | | |
| Browser/UI |---->| Cache check/write |---->| Parse() |
| Python API |---->| File I/O | | ExportToArrow |
| | | Background thread | | |
+------------------+ +-------------------+ +---------------+
```
TP core provides the primitives (`ExportToArrow`, import via `Parse`). The
shell layer orchestrates caching. RPC clients provide trace identity
metadata. No caching logic in TP core or the RPC layer itself.
## Alternatives considered
### 1. Cache in TP core
Add cache awareness to `TraceProcessor` directly (e.g., in
`NotifyEndOfFile()` or `Config`).
Pro:
* Single implementation covers all callers.
Con:
* Violates TP's IO-free design. Implicit file writes in a library are
surprising and hard to control.
* TP doesn't know about file paths, mtime, or disk layout.
### 2. SQLite as cache format
Use the existing `ExportTraceToDatabase()` as the cache format instead.
Pro:
* Already implemented.
* Self-describing, queryable directly.
Con:
* Slower to load (SQL overhead, row-oriented).
* Larger on disk for numeric-heavy data.
* Not designed for fast bulk table restoration.
### 3. Automatic caching by default
Enable caching automatically without user opt-in.
Pro:
* Zero-friction fast reloads.
Con:
* Disk bloat can be tremendous and surprising. A 10 GB trace produces a
comparably sized cache silently.
* Dangerous on servers, shared filesystems, CI environments.
* Users should make a conscious choice to spend disk space.
### 4. Content hashing for cache key
Use SHA256 of the full trace content as the cache key instead of
path + size + mtime.
Pro:
* Correct even if a file is modified without changing mtime.
Con:
* Hashing a multi-GB file is slow and defeats the purpose of caching.
* For RPC clients (browser), reading the full file just to compute a
hash is wasteful.
* The mtime + size heuristic is sufficient in practice.
## Open questions
* Should there be a maximum cache directory size with automatic eviction
(LRU)? Or is manual `parse-cache clear` sufficient?
* Should `parse-cache create` support creating caches for multiple traces
at once (e.g., `parse-cache create *.pb`)?
* For the UI: should the cache directory be configurable via the httpd
endpoint, or is `~/.cache/perfetto/parse-cache/` always correct?