Data Organization

This document describes recommended practices for structuring data that is stored in Ledger. We assume that the reader is already familiar with Ledger API surface.

Pages

Data stored in Ledger is grouped into separate independent key-value stores called pages.

Deciding how to split data into pages has the following implications:

  • atomic changes across multiple values are possible only within one page, so if there's data that has to be modified together, it must belong to the same page
  • the current state of each page is either entirely present or entirely absent on disk (save for lazy values), hence splitting data into multiple pages allows granular optimization of which data to keep on disk on each device

There are no restrictions on how many pages a client application can use.

Entries

Each page stores an ordered list of key-value pairs, also called entries.

Both keys and values are arbitrary byte arrays - the client application is free to construct the keys and values as needed based on their needs, but see some guidance below.

Key ordering

Entries are sorted by key in the lexicographic order of the byte array. Entry retrieval methods always return them in order, it is also possible to use a range query to retrieve only a part of the ordered list.

The lexicographic byte array ordering of the keys implies the following caveats:

  • number ordering: a key of [1, 0, 0] is ordered before the key of [2]. If ordering between the items is important, it is essential to add padding to any numbers that are part of the key, so that their width is fixed
  • string ordering: ordering of unicode strings in a way that matches human expectations is non-trivial and byte-array ordering of utf8/16/32 strings is unlikely to yield good results.

Range queries

The entries stored in a page are sorted by keys, and can be retrieved using either exact matching or range queries.

Because of the range query support, it might be desirable to construct the keys in a way that matches the querying needs of the application. For example, a messaging application which needs to retrieve only the messages not older than a week might wish to structure the data as follows:

  • key: {timestamp}-{message UUID}
  • value: {message content}
Caveat: as the range queries can currently only yield results in the increasing order, in order to retrieve the newest messages, the timestamp needs to be constructed as (BIG NUMBER minus the current time).

Then, a range query can be used to retrieve the messages after the given timestamp.

Ledger does not support custom indexes and queries by value. Please do let us know about your use cases if this limitation prevents you from adopting Ledger.

Avoiding collisions

UUIDs should be used as part of the entry key in order to avoid unintended collisions, where needed. For example, a messaging application might want to include an UUID of a message stored in a Ledger entry as part of the entry key, in order to avoid collisions between two messages of the same timestamp:

  • key: {timestamp}-{message UUID}
  • value: {message content}

Granularity

Deciding how to split page data into entries has implications on querying ability and conflict resolution.

If two pieces of data can change independently (for example: a name and an email of a contact in a contacts app), it is beneficial to put these pieces of data in separate entries. Because conflicts are resolved entry-by-entry, concurrent modifications of different entries can be automatically merged by the default conflict resolution policy.

On the other hand, if two pieces of data are related (when one changes, the other piece of data needs to change accordingly), we need to either:

  • put them in the same entry (and serialize them in the application)
  • put them in separate entries, but use a custom conflict resolution policy to preserve the invariants
  • put them in separate entries, but consistently always modify both of them together within a transaction