 :orphan:

 =================================
 Swift Binary Serialization Format
 =================================

 The fundamental unit of distribution for Swift code is a *module.* A module
 contains declarations as an interface for clients to write code against. It may
 also contain implementation information for any of these declarations that can
 be used to optimize client code. Conceptually, the file containing the
 interface for a module serves much the same purpose as the collection of C
 header files for a particular library.

 Swift's binary serialization format is currently used for several purposes:

 - The public interface for a module ("swiftmodule files").

 - A representation of captured compiler state after semantic analysis and SIL
   generation, but before LLVM IR generation ("SIB", for "Swift Intermediate
   Binary").

 - Debug information about types, for proper high-level introspection without
   running code.

 - Debug information about non-public APIs, for interactive debugging.

 The first two uses require a module to serve as a container of both AST nodes
 and SIL entities. As a unit of distribution, it should also be
 forward-compatible: module files installed on a developer's system in 201X
 should be usable without updates for years to come, even as the Swift compiler
 continues to be improved and enhanced. However, they are currently too closely
 tied to the compiler internals to be useful for this purpose, and it is likely
 we'll invent a new format instead.


 Why LLVM bitcode?
 =================

 The `LLVM bitstream <http://llvm.org/docs/BitCodeFormat.html>`_ format was
 invented as a container format for LLVM IR. It is a binary format supporting
 two basic structures: *blocks,* which define regions of the file, and
 *records,* which contain data fields that can be up to 64 bits. It has a few
 nice properties that make it a useful container format for Swift modules as
 well:

 - It is easy to skip over an entire block, because the block's length is
   recorded at its start.

 - It is possible to jump to specific offsets *within* a block without having to
   reparse from the start of the block.

 - A format change doesn't immediately invalidate existing bitstream files,
   because the stream includes layout information for each record.

 - It's a binary format, so it's at least *somewhat* compact. [I haven't done a
   size comparison against other formats.]

 If we were to switch to another container format, we would likely want it to
 have most of these properties as well. But we're already linking against
 LLVM...might as well use it!
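
 The first property, skipping a block by its recorded length, can be
 illustrated with a toy container. This is only a sketch: the real LLVM
 bitstream is bit-oriented and uses a different header layout, and the names
 ``write_block`` and ``find_block`` and the byte-aligned header of two 32-bit
 integers are assumptions made for illustration.

 .. code-block:: python

     import struct

     # Hypothetical header layout: block ID and payload length in bytes,
     # each stored as a little-endian unsigned 32-bit integer.
     HEADER = struct.Struct("<II")

     def write_block(block_id: int, payload: bytes) -> bytes:
         """Prefix the payload with its ID and length."""
         return HEADER.pack(block_id, len(payload)) + payload

     def find_block(data: bytes, wanted_id: int):
         """Return the payload of the first block with wanted_id, or None."""
         offset = 0
         while offset < len(data):
             block_id, length = HEADER.unpack_from(data, offset)
             offset += HEADER.size
             if block_id == wanted_id:
                 return data[offset:offset + length]
             # The length is known up front, so the payload is skipped
             # without being parsed.
             offset += length
         return None

     stream = (write_block(1, b"control")
               + write_block(2, b"x" * 1000)
               + write_block(3, b"index"))
     payload = find_block(stream, 3)

 Here the reader reaches block 3 without inspecting any of the 1000 payload
 bytes of block 2.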


 Versioning
 ==========

 .. warning::

   This section is relevant to any forward-compatible format used for a
   library's public interface. However, as mentioned above, this may not be
   the current binary serialization format.

   Today's Swift uses a "major" version number of 0 and an always-incrementing
   "minor" version number. Every change is treated as compatibility-breaking;
   the minor version must match exactly for the compiler to load the module.

 Persistent serialized Swift files use the following versioning scheme:

 - Serialized modules are given a major and minor version number.

 - When making a backwards-compatible change, neither the major nor the minor
   version number may be incremented (both MUST NOT change).

 - When making a change such that new modules cannot be safely loaded by older
   compilers, the minor version number MUST be incremented.

 - When making a change such that *old* modules cannot be safely loaded by
   *newer* compilers, the major version number MUST be incremented. The minor
   version number MUST then be reset to zero.

 - Ideally, the major version number is never incremented.

 A serialized file's version number is checked against the client's supported
 version before it is loaded. If it is too old or too new, the file cannot be
 loaded.
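
 The load-time check implied by these rules can be sketched as follows. This
 is an illustration of the scheme as described here, not the compiler's
 actual code; ``Version``, ``can_load``, and ``can_load_exact`` are
 hypothetical names.

 .. code-block:: python

     from typing import NamedTuple

     class Version(NamedTuple):
         major: int
         minor: int

     def can_load(file_version: Version, supported: Version) -> bool:
         """Check a file against the scheme described in this section.

         A different major version means the file is too old or too new.
         A larger minor version means the file was written by a newer
         compiler and may use features this client does not understand.
         """
         return (file_version.major == supported.major
                 and file_version.minor <= supported.minor)

     def can_load_exact(file_version: Version, supported: Version) -> bool:
         """The stricter rule from the warning above: exact match only."""
         return file_version == supported

 With a client that supports version 1.9, ``can_load`` accepts a 1.7 file but
 rejects both a 1.10 file (too new) and a 0.5 file (too old).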

 Note that the version number describes the contents of the file. Thus, if a
 compiler supports features introduced in file version 1.9, but a particular
 module only uses features introduced in or before version 1.7, the compiler
 MAY serialize that module with the version number 1.7. However, doing so
 requires extra work on the compiler's part to detect which features are in use;
 a simpler implementation would just use the latest version number supported:
 1.9.
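
 The two strategies can be compared in a short sketch. The feature names and
 the table mapping each feature to the version that introduced it are
 invented for illustration, as is the assumption that a module using no
 tracked features is written as version 1.0.

 .. code-block:: python

     # Hypothetical table: feature name -> (major, minor) that introduced it.
     FEATURE_INTRODUCED_IN = {
         "feature_a": (1, 0),
         "feature_b": (1, 7),
         "feature_c": (1, 9),
     }
     LATEST_SUPPORTED = (1, 9)
     BASELINE = (1, 0)  # assumed version for a module using no tracked features

     def version_to_write(features_used, minimize=True):
         """Pick the version number to record in a serialized module."""
         if not minimize:
             # Simple strategy: always stamp the latest supported version.
             return LATEST_SUPPORTED
         # Extra work: find the newest feature the module actually uses.
         # Tuples compare lexicographically, so max() orders by major,
         # then by minor.
         return max((FEATURE_INTRODUCED_IN[f] for f in features_used),
                    default=BASELINE)

 A module that uses only ``feature_a`` and ``feature_b`` would be written as
 1.7 under the first strategy and as 1.9 under the simpler one.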

 *This versioning scheme was inspired by* `Semantic Versioning
 <http://semver.org>`_. *However, it is not compatible with Semantic Versioning
 because it promises* forward-compatibility *rather than* backward-compatibility.


 A High-Level Tour of the Current Module Format
 ==============================================

 Every serialized module is represented as a single block called the "module
 block". The module block is made up of several other block kinds, largely for
 organizational purposes.

 - The **block info block** is a standard LLVM bitcode block that contains
   metadata about the bitcode stream. It is the only block that appears outside
   the module block; we always put it at the very start of the file. Though it
   can contain actual semantic information, our use of it is only for debugging
   purposes.

 - The **control block** is always the first block in the module block. It can
   be processed without loading the rest of the module, and indeed is intended
   to allow clients to decide whether or not the module is compatible with the
   current AST context. The major and minor version numbers of the format are
   stored here.

 - The **input block** contains information about how to import the module once
   the client has decided to load it. This includes the list of other modules
   that this module depends on.

 - The **SIL block** contains SIL-level implementations that can be imported
   into a client's SILModule context. In most cases this is just a performance
   concern, but sometimes it affects language semantics as well, as in the case
   of ``@_transparent``. The SIL block precedes the AST block because it affects
   which AST nodes get serialized.

 - The **SIL index block** contains tables for accessing various SIL entities by
   their names, along with a mapping of unique IDs for these to the appropriate
   bit offsets into the SIL block.

 - The **AST block** contains the serialized forms of Decl, DeclContext, and
   Type AST nodes. Decl nodes may be cross-references to other modules, while
   types are always serialized with enough info to regenerate them at load time.
   Nodes are accessed by file-unique "DeclIDs" (also covering DeclContexts)
   and "TypeIDs"; the two sets of IDs use separate numbering schemes.

 .. note::

   The AST block is currently referred to as the "decls block" in the source.

 - The **identifier block** contains a single blob of strings. This is intended
   for Identifiers---strings uniqued by the ASTContext---but can in theory
   support any string data. The strings are accessed by a file-unique
   "IdentifierID".

 - The **index block** contains mappings from the AST node and identifier IDs to
   their offsets in the AST block or identifier block (as appropriate). It also
   contains various top-level AST information about the module, such as its
   top-level declarations.
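
 The relationship between the identifier block and the index block can be
 modeled with a toy example: one blob of strings plus a table mapping each ID
 to an offset into that blob. The on-disk encoding is not described here, so
 the NUL terminators, the zero-based IDs, the byte offsets, and the function
 names below are all assumptions.

 .. code-block:: python

     def build_identifier_block(names):
         """Return (blob, offsets): one blob of strings and an offset table."""
         blob = bytearray()
         offsets = []
         for name in names:
             # The index entry for this ID is the string's start in the blob.
             offsets.append(len(blob))
             blob += name.encode("utf-8") + b"\0"
         return bytes(blob), offsets

     def lookup_identifier(blob, offsets, identifier_id):
         """Resolve an ID to its string by jumping straight to its offset."""
         start = offsets[identifier_id]
         end = blob.index(b"\0", start)
         return blob[start:end].decode("utf-8")

     blob, offsets = build_identifier_block(["foo", "bar", "baz"])
     name = lookup_identifier(blob, offsets, 1)

 Looking up ID 1 reads only the bytes of ``bar``; the strings before and
 after it are never scanned. AST nodes would be resolved the same way, with
 the index mapping a DeclID or TypeID to an offset in the AST block.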


 SIL
 ===

 [to be written]


 Cross-reference resilience
 ==========================

 [to be written]