| .\" Copyright (c) 2003-2007 Tim Kientzle |
| .\" All rights reserved. |
| .\" |
| .\" Redistribution and use in source and binary forms, with or without |
| .\" modification, are permitted provided that the following conditions |
| .\" are met: |
| .\" 1. Redistributions of source code must retain the above copyright |
| .\" notice, this list of conditions and the following disclaimer. |
| .\" 2. Redistributions in binary form must reproduce the above copyright |
| .\" notice, this list of conditions and the following disclaimer in the |
| .\" documentation and/or other materials provided with the distribution. |
| .\" |
| .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND |
| .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE |
| .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE |
| .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE |
| .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL |
| .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS |
| .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) |
| .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT |
| .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY |
| .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF |
| .\" SUCH DAMAGE. |
| .\" |
| .\" $FreeBSD: src/lib/libarchive/libarchive_internals.3,v 1.2 2007/12/30 04:58:22 kientzle Exp $ |
| .\" |
| .Dd April 16, 2007 |
| .Dt LIBARCHIVE 3 |
| .Os |
| .Sh NAME |
| .Nm libarchive_internals |
| .Nd description of libarchive internal interfaces |
| .Sh OVERVIEW |
| The |
| .Nm libarchive |
| library provides a flexible interface for reading and writing |
| streaming archive files such as tar and cpio. |
| Internally, it follows a modular layered design that should |
| make it easy to add new archive and compression formats. |
| .Sh GENERAL ARCHITECTURE |
| Externally, libarchive exposes most operations through an |
| opaque, object-style interface. |
| The |
| .Xr archive_entry 1 |
| objects store information about a single filesystem object. |
| The rest of the library provides facilities to write |
| .Xr archive_entry 1 |
| objects to archive files, |
| read them from archive files, |
| and write them to disk. |
| (There are plans to add a facility to read |
| .Xr archive_entry 1 |
| objects from disk as well.) |
| .Pp |
| The read and write APIs each have four layers: a public API |
| layer, a format layer that understands the archive file format, |
| a compression layer, and an I/O layer. |
| The I/O layer is completely exposed to clients who can replace |
| it entirely with their own functions. |
| .Pp |
| In order to provide as much consistency as possible for clients, |
| some public functions are virtualized. |
| Eventually, it should be possible for clients to open |
| an archive or disk writer, and then use a single set of |
| code to select and write entries, regardless of the target. |
| .Sh READ ARCHITECTURE |
| From the outside, clients use the |
| .Xr archive_read 3 |
| API to manipulate an |
| .Nm archive |
| object to read entries and bodies from an archive stream. |
| Internally, the |
| .Nm archive |
| object is cast to an |
| .Nm archive_read |
| object, which holds all read-specific data. |
| The API has four layers: |
| The lowest layer is the I/O layer. |
| This layer can be overridden by clients, but most clients use |
| the packaged I/O callbacks provided, for example, by |
| .Xr archive_read_open_memory 3 , |
| and |
| .Xr archive_read_open_fd 3 . |
| The compression layer calls the I/O layer to |
| read bytes and decompresses them for the format layer. |
| The format layer unpacks a stream of uncompressed bytes and |
| creates |
| .Nm archive_entry |
| objects from the incoming data. |
| The API layer tracks overall state |
| (for example, it prevents clients from reading data before reading a header) |
| and invokes the format and compression layer operations |
| through registered function pointers. |
| In particular, the API layer drives the format-detection process: |
| When opening the archive, it reads an initial block of data |
| and offers it to each registered compression handler. |
| The one with the highest bid is initialized with the first block. |
| Similarly, the format handlers are polled to see which handler |
| is the best for each archive. |
| (Prior to 2.4.0, the format bidders were invoked for each |
| entry, but this design hindered error recovery.) |
| .Ss I/O Layer and Client Callbacks |
| The read API goes to some lengths to be nice to clients. |
| As a result, there are few restrictions on the behavior of |
| the client callbacks. |
| .Pp |
| The client read callback is expected to provide a block |
| of data on each call. |
| A zero-length return does indicate end of file, but otherwise |
| blocks may be as small as one byte or as large as the entire file. |
| In particular, blocks may be of different sizes. |
| .Pp |
| The client skip callback returns the number of bytes actually |
| skipped, which may be much smaller than the skip requested. |
| The only requirement is that the skip not be larger. |
| In particular, clients are allowed to return zero for any |
| skip that they don't want to handle. |
| The skip callback must never be invoked with a negative value. |
| .Pp |
| Keep in mind that not all clients are reading from disk: |
| clients reading from networks may provide different-sized |
| blocks on every request and cannot skip at all; |
| advanced clients may use |
| .Xr mmap 2 |
| to read the entire file into memory at once and return the |
| entire file to libarchive as a single block; |
| other clients may begin asynchronous I/O operations for the |
| next block on each request. |
| .Ss Decompresssion Layer |
| The decompression layer not only handles decompression, |
| it also buffers data so that the format handlers see a |
| much nicer I/O model. |
| The decompression API is a two stage peek/consume model. |
| A read_ahead request specifies a minimum read amount; |
| the decompression layer must provide a pointer to at least |
| that much data. |
| If more data is immediately available, it should return more: |
| the format layer handles bulk data reads by asking for a minimum |
| of one byte and then copying as much data as is available. |
| .Pp |
| A subsequent call to the |
| .Fn consume |
| function advances the read pointer. |
| Note that data returned from a |
| .Fn read_ahead |
| call is guaranteed to remain in place until |
| the next call to |
| .Fn read_ahead . |
| Intervening calls to |
| .Fn consume |
| should not cause the data to move. |
| .Pp |
| Skip requests must always be handled exactly. |
| Decompression handlers that cannot seek forward should |
| not register a skip handler; |
| the API layer fills in a generic skip handler that reads and discards data. |
| .Pp |
| A decompression handler has a specific lifecycle: |
| .Bl -tag -compact -width indent |
| .It Registration/Configuration |
| When the client invokes the public support function, |
| the decompression handler invokes the internal |
| .Fn __archive_read_register_compression |
| function to provide bid and initialization functions. |
| This function returns |
| .Cm NULL |
| on error or else a pointer to a |
| .Cm struct decompressor_t . |
| This structure contains a |
| .Va void * config |
| slot that can be used for storing any customization information. |
| .It Bid |
| The bid function is invoked with a pointer and size of a block of data. |
| The decompressor can access its config data |
| through the |
| .Va decompressor |
| element of the |
| .Cm archive_read |
| object. |
| The bid function is otherwise stateless. |
| In particular, it must not perform any I/O operations. |
| .Pp |
| The value returned by the bid function indicates its suitability |
| for handling this data stream. |
| A bid of zero will ensure that this decompressor is never invoked. |
| Return zero if magic number checks fail. |
| Otherwise, your initial implementation should return the number of bits |
| actually checked. |
| For example, if you verify two full bytes and three bits of another |
| byte, bid 19. |
| Note that the initial block may be very short; |
| be careful to only inspect the data you are given. |
| (The current decompressors require two bytes for correct bidding.) |
| .It Initialize |
| The winning bidder will have its init function called. |
| This function should initialize the remaining slots of the |
| .Va struct decompressor_t |
| object pointed to by the |
| .Va decompressor |
| element of the |
| .Va archive_read |
| object. |
| In particular, it should allocate any working data it needs |
| in the |
| .Va data |
| slot of that structure. |
| The init function is called with the block of data that |
| was used for tasting. |
| At this point, the decompressor is responsible for all I/O |
| requests to the client callbacks. |
| The decompressor is free to read more data as and when |
| necessary. |
| .It Satisfy I/O requests |
| The format handler will invoke the |
| .Va read_ahead , |
| .Va consume , |
| and |
| .Va skip |
| functions as needed. |
| .It Finish |
| The finish method is called only once when the archive is closed. |
| It should release anything stored in the |
| .Va data |
| and |
| .Va config |
| slots of the |
| .Va decompressor |
| object. |
| It should not invoke the client close callback. |
| .El |
| .Ss Format Layer |
| The read formats have a similar lifecycle to the decompression handlers: |
| .Bl -tag -compact -width indent |
| .It Registration |
| Allocate your private data and initialize your pointers. |
| .It Bid |
| Formats bid by invoking the |
| .Fn read_ahead |
| decompression method but not calling the |
| .Fn consume |
| method. |
| This allows each bidder to look ahead in the input stream. |
| Bidders should not look further ahead than necessary, as long |
| look aheads put pressure on the decompression layer to buffer |
| lots of data. |
| Most formats only require a few hundred bytes of look ahead; |
| look aheads of a few kilobytes are reasonable. |
| (The ISO9660 reader sometimes looks ahead by 48k, which |
| should be considered an upper limit.) |
| .It Read header |
| The header read is usually the most complex part of any format. |
| There are a few strategies worth mentioning: |
| For formats such as tar or cpio, reading and parsing the header is |
| straightforward since headers alternate with data. |
| For formats that store all header data at the beginning of the file, |
| the first header read request may have to read all headers into |
| memory and store that data, sorted by the location of the file |
| data. |
| Subsequent header read requests will skip forward to the |
| beginning of the file data and return the corresponding header. |
| .It Read Data |
| The read data interface supports sparse files; this requires that |
| each call return a block of data specifying the file offset and |
| size. |
| This may require you to carefully track the location so that you |
| can return accurate file offsets for each read. |
| Remember that the decompressor will return as much data as it has. |
| Generally, you will want to request one byte, |
| examine the return value to see how much data is available, and |
| possibly trim that to the amount you can use. |
| You should invoke consume for each block just before you return it. |
| .It Skip All Data |
| The skip data call should skip over all file data and trailing padding. |
| This is called automatically by the API layer just before each |
| header read. |
| It is also called in response to the client calling the public |
| .Fn data_skip |
| function. |
| .It Cleanup |
| On cleanup, the format should release all of its allocated memory. |
| .El |
| .Ss API Layer |
| XXX to do XXX |
| .Sh WRITE ARCHITECTURE |
| The write API has a similar set of four layers: |
| an API layer, a format layer, a compression layer, and an I/O layer. |
| The registration here is much simpler because only |
| one format and one compression can be registered at a time. |
| .Ss I/O Layer and Client Callbacks |
| XXX To be written XXX |
| .Ss Compression Layer |
| XXX To be written XXX |
| .Ss Format Layer |
| XXX To be written XXX |
| .Ss API Layer |
| XXX To be written XXX |
| .Sh WRITE_DISK ARCHITECTURE |
| The write_disk API is intended to look just like the write API |
| to clients. |
| Since it does not handle multiple formats or compression, it |
| is not layered internally. |
| .Sh GENERAL SERVICES |
| The |
| .Nm archive_read , |
| .Nm archive_write , |
| and |
| .Nm archive_write_disk |
| objects all contain an initial |
| .Nm archive |
| object which provides common support for a set of standard services. |
| (Recall that ANSI/ISO C90 guarantees that you can cast freely between |
| a pointer to a structure and a pointer to the first element of that |
| structure.) |
| The |
| .Nm archive |
| object has a magic value that indicates which API this object |
| is associated with, |
| slots for storing error information, |
| and function pointers for virtualized API functions. |
| .Sh MISCELLANEOUS NOTES |
| Connecting existing archiving libraries into libarchive is generally |
| quite difficult. |
| In particular, many existing libraries strongly assume that you |
| are reading from a file; they seek forwards and backwards as necessary |
| to locate various pieces of information. |
| In contrast, libarchive never seeks backwards in its input, which |
| sometimes requires very different approaches. |
| .Pp |
| For example, libarchive's ISO9660 support operates very differently |
| from most ISO9660 readers. |
| The libarchive support utilizes a work-queue design that |
| keeps a list of known entries sorted by their location in the input. |
| Whenever libarchive's ISO9660 implementation is asked for the next |
| header, checks this list to find the next item on the disk. |
| Directories are parsed when they are encountered and new |
| items are added to the list. |
| This design relies heavily on the ISO9660 image being optimized so that |
| directories always occur earlier on the disk than the files they |
| describe. |
| .Pp |
| Depending on the specific format, such approaches may not be possible. |
| The ZIP format specification, for example, allows archivers to store |
| key information only at the end of the file. |
| In theory, it is possible to create ZIP archives that cannot |
| be read without seeking. |
| Fortunately, such archives are very rare, and libarchive can read |
| most ZIP archives, though it cannot always extract as much information |
| as a dedicated ZIP program. |
| .Sh SEE ALSO |
| .Xr archive 3 , |
| .Xr archive_entry 3 , |
| .Xr archive_read 3 , |
| .Xr archive_write 3 , |
| .Xr archive_write_disk 3 |
| .Sh HISTORY |
| The |
| .Nm libarchive |
| library first appeared in |
| .Fx 5.3 . |
| .Sh AUTHORS |
| .An -nosplit |
| The |
| .Nm libarchive |
| library was written by |
| .An Tim Kientzle Aq kientzle@acm.org . |
| .Sh BUGS |