| \input texinfo @c -*- Texinfo -*- |
| @setfilename bzip2.info |
| |
| @ignore |
| This file documents bzip2 version 0.9.5, and associated library |
| libbzip2, written by Julian Seward (jseward@acm.org). |
| |
| Copyright (C) 1996-1999 Julian R Seward |
| |
| Permission is granted to make and distribute verbatim copies of |
| this manual provided the copyright notice and this permission notice |
| are preserved on all copies. |
| |
| Permission is granted to copy and distribute translations of this manual |
| into another language, under the above conditions for verbatim copies. |
| @end ignore |
| |
| @ifinfo |
| @format |
| START-INFO-DIR-ENTRY |
| * Bzip2: (bzip2). A program and library for data compression. |
| END-INFO-DIR-ENTRY |
| @end format |
| |
| @end ifinfo |
| |
| @iftex |
| @c @finalout |
| @settitle bzip2 and libbzip2 |
| @titlepage |
| @title bzip2 and libbzip2 |
| @subtitle a program and library for data compression |
| @subtitle copyright (C) 1996-1999 Julian Seward |
| @subtitle version 0.9.5d of 4 September 1999 |
| @author Julian Seward |
| |
| @end titlepage |
| |
| @parindent 0mm |
| @parskip 2mm |
| |
| @end iftex |
| @node Top, Overview, (dir), (dir) |
| |
| This program, @code{bzip2}, |
| and associated library @code{libbzip2}, are |
| Copyright (C) 1996-1999 Julian R Seward. All rights reserved. |
| |
| Redistribution and use in source and binary forms, with or without |
| modification, are permitted provided that the following conditions |
| are met: |
| @itemize @bullet |
| @item |
| Redistributions of source code must retain the above copyright |
| notice, this list of conditions and the following disclaimer. |
| @item |
| The origin of this software must not be misrepresented; you must |
| not claim that you wrote the original software. If you use this |
| software in a product, an acknowledgment in the product |
| documentation would be appreciated but is not required. |
| @item |
| Altered source versions must be plainly marked as such, and must |
| not be misrepresented as being the original software. |
| @item |
| The name of the author may not be used to endorse or promote |
| products derived from this software without specific prior written |
| permission. |
| @end itemize |
| THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS |
| OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED |
| WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE |
| ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY |
| DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL |
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE |
| GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS |
| INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, |
| WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING |
| NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS |
| SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
| |
| Julian Seward, Cambridge, UK. |
| |
| @code{jseward@@acm.org} |
| |
| @code{http://www.muraroa.demon.co.uk} |
| |
| @code{bzip2}/@code{libbzip2} version 0.9.5 of 24 May 1999. |
| |
| PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented |
| algorithms. However, I do not have the resources available to carry out |
| a full patent search. Therefore I cannot give any guarantee of the |
| above statement. |
| |
| |
| |
| |
| |
| |
| |
| @node Overview, Implementation, Top, Top |
| @chapter Introduction |
| |
| @code{bzip2} compresses files using the Burrows-Wheeler |
| block-sorting text compression algorithm, and Huffman coding. |
| Compression is generally considerably better than that |
| achieved by more conventional LZ77/LZ78-based compressors, |
| and approaches the performance of the PPM family of statistical compressors. |
| |
| @code{bzip2} is built on top of @code{libbzip2}, a flexible library |
| for handling compressed data in the @code{bzip2} format. This manual |
| describes both how to use the program and |
| how to work with the library interface. Most of the |
| manual is devoted to this library, not the program, |
| which is good news if your interest is only in the program. |
| |
| Chapter 2 describes how to use @code{bzip2}; this is the only part |
| you need to read if you just want to know how to operate the program. |
| Chapter 3 describes the programming interfaces in detail, and |
| Chapter 4 records some miscellaneous notes which I thought |
| ought to be recorded somewhere. |
| |
| |
| @chapter How to use @code{bzip2} |
| |
| This chapter contains a copy of the @code{bzip2} man page, |
| and nothing else. |
| |
| @quotation |
| |
| @unnumberedsubsubsec NAME |
| @itemize |
| @item @code{bzip2}, @code{bunzip2} |
| - a block-sorting file compressor, v0.9.5 |
| @item @code{bzcat} |
| - decompresses files to stdout |
| @item @code{bzip2recover} |
| - recovers data from damaged bzip2 files |
| @end itemize |
| |
| @unnumberedsubsubsec SYNOPSIS |
| @itemize |
| @item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ... ] |
| @item @code{bunzip2} [ -fkvsVL ] [ filenames ... ] |
| @item @code{bzcat} [ -s ] [ filenames ... ] |
| @item @code{bzip2recover} filename |
| @end itemize |
| |
| @unnumberedsubsubsec DESCRIPTION |
| |
| @code{bzip2} compresses files using the Burrows-Wheeler block sorting |
| text compression algorithm, and Huffman coding. Compression is |
| generally considerably better than that achieved by more conventional |
| LZ77/LZ78-based compressors, and approaches the performance of the PPM |
| family of statistical compressors. |
| |
| The command-line options are deliberately very similar to those of GNU |
| @code{gzip}, but they are not identical. |
| |
| @code{bzip2} expects a list of file names to accompany the command-line |
| flags. Each file is replaced by a compressed version of itself, with |
| the name @code{original_name.bz2}. Each compressed file has the same |
| modification date, permissions, and, when possible, ownership as the |
| corresponding original, so that these properties can be correctly |
| restored at decompression time. File name handling is naive in the |
| sense that there is no mechanism for preserving original file names, |
| permissions, ownerships or dates in filesystems which lack these |
| concepts, or have serious file name length restrictions, such as MS-DOS. |
| |
| @code{bzip2} and @code{bunzip2} will by default not overwrite existing |
| files. If you want this to happen, specify the @code{-f} flag. |
| |
| If no file names are specified, @code{bzip2} compresses from standard |
| input to standard output. In this case, @code{bzip2} will decline to |
| write compressed output to a terminal, as this would be entirely |
| incomprehensible and therefore pointless. |
| |
| @code{bunzip2} (or @code{bzip2 -d}) decompresses all |
| specified files. Files which were not created by @code{bzip2} |
| will be detected and ignored, and a warning issued. |
| @code{bzip2} attempts to guess the filename for the decompressed file |
| from that of the compressed file as follows: |
| @itemize |
| @item @code{filename.bz2 } becomes @code{filename} |
| @item @code{filename.bz } becomes @code{filename} |
| @item @code{filename.tbz2} becomes @code{filename.tar} |
| @item @code{filename.tbz } becomes @code{filename.tar} |
| @item @code{anyothername } becomes @code{anyothername.out} |
| @end itemize |
| If the file does not end in one of the recognised endings, |
| @code{.bz2}, @code{.bz}, |
| @code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot |
| guess the name of the original file, and uses the original name |
| with @code{.out} appended. |
| |
| As with compression, supplying no |
| filenames causes decompression from standard input to standard output. |
| |
| @code{bunzip2} will correctly decompress a file which is the |
| concatenation of two or more compressed files. The result is the |
| concatenation of the corresponding uncompressed files. Integrity |
| testing (@code{-t}) of concatenated compressed files is also supported. |
| |
| You can also compress or decompress files to the standard output by |
| giving the @code{-c} flag. Multiple files may be compressed and |
| decompressed like this. The resulting outputs are fed sequentially to |
| stdout. Compression of multiple files in this manner generates a stream |
| containing multiple compressed file representations. Such a stream |
| can be decompressed correctly only by @code{bzip2} version 0.9.0 or |
| later. Earlier versions of @code{bzip2} will stop after decompressing |
| the first file in the stream. |
| |
| @code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to |
| the standard output. |
| |
| @code{bzip2} will read arguments from the environment variables |
| @code{BZIP2} and @code{BZIP}, in that order, and will process them |
| before any arguments read from the command line. This gives a |
| convenient way to supply default arguments. |
| |
| Compression is always performed, even if the compressed file is slightly |
| larger than the original. Files of less than about one hundred bytes |
| tend to get larger, since the compression mechanism has a constant |
| overhead in the region of 50 bytes. Random data (including the output |
| of most file compressors) is coded at about 8.05 bits per byte, giving |
| an expansion of around 0.5%. |
| |
| As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to |
| make sure that the decompressed version of a file is identical to the |
| original. This guards against corruption of the compressed data, and |
| against undetected bugs in @code{bzip2} (hopefully very unlikely). The |
| chances of data corruption going undetected is microscopic, about one |
| chance in four billion for each file processed. Be aware, though, that |
| the check occurs upon decompression, so it can only tell you that |
| something is wrong. It can't help you recover the original uncompressed |
| data. You can use @code{bzip2recover} to try to recover data from |
| damaged files. |
| |
| Return values: 0 for a normal exit, 1 for environmental problems (file |
| not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt |
| compressed file, 3 for an internal consistency error (eg, bug) which |
| caused @code{bzip2} to panic. |
| |
| |
| @unnumberedsubsubsec OPTIONS |
| @table @code |
| @item -c --stdout |
| Compress or decompress to standard output. |
| @item -d --decompress |
| Force decompression. @code{bzip2}, @code{bunzip2} and @code{bzcat} are |
| really the same program, and the decision about what actions to take is |
| done on the basis of which name is used. This flag overrides that |
| mechanism, and forces bzip2 to decompress. |
| @item -z --compress |
| The complement to @code{-d}: forces compression, regardless of the |
| invokation name. |
| @item -t --test |
| Check integrity of the specified file(s), but don't decompress them. |
| This really performs a trial decompression and throws away the result. |
| @item -f --force |
| Force overwrite of output files. Normally, @code{bzip2} will not overwrite |
| existing output files. Also forces @code{bzip2} to break hard links |
| to files, which it otherwise wouldn't do. |
| @item -k --keep |
| Keep (don't delete) input files during compression |
| or decompression. |
| @item -s --small |
| Reduce memory usage, for compression, decompression and testing. Files |
| are decompressed and tested using a modified algorithm which only |
| requires 2.5 bytes per block byte. This means any file can be |
| decompressed in 2300k of memory, albeit at about half the normal speed. |
| |
| During compression, @code{-s} selects a block size of 200k, which limits |
| memory use to around the same figure, at the expense of your compression |
| ratio. In short, if your machine is low on memory (8 megabytes or |
| less), use -s for everything. See MEMORY MANAGEMENT below. |
| @item -q --quiet |
| Suppress non-essential warning messages. Messages pertaining to |
| I/O errors and other critical events will not be suppressed. |
| @item -v --verbose |
| Verbose mode -- show the compression ratio for each file processed. |
| Further @code{-v}'s increase the verbosity level, spewing out lots of |
| information which is primarily of interest for diagnostic purposes. |
| @item -L --license -V --version |
| Display the software version, license terms and conditions. |
| @item -1 to -9 |
| Set the block size to 100 k, 200 k .. 900 k when compressing. Has no |
| effect when decompressing. See MEMORY MANAGEMENT below. |
| @item -- |
| Treats all subsequent arguments as file names, even if they start |
| with a dash. This is so you can handle files with names beginning |
| with a dash, for example: @code{bzip2 -- -myfilename}. |
| @item --repetitive-fast |
| @item --repetitive-best |
| These flags are redundant in versions 0.9.5 and above. They provided |
| some coarse control over the behaviour of the sorting algorithm in |
| earlier versions, which was sometimes useful. 0.9.5 and above have an |
| improved algorithm which renders these flags irrelevant. |
| @end table |
| |
| |
| @unnumberedsubsubsec MEMORY MANAGEMENT |
| |
| @code{bzip2} compresses large files in blocks. The block size affects |
| both the compression ratio achieved, and the amount of memory needed for |
| compression and decompression. The flags @code{-1} through @code{-9} |
| specify the block size to be 100,000 bytes through 900,000 bytes (the |
| default) respectively. At decompression time, the block size used for |
| compression is read from the header of the compressed file, and |
| @code{bunzip2} then allocates itself just enough memory to decompress |
| the file. Since block sizes are stored in compressed files, it follows |
| that the flags @code{-1} to @code{-9} are irrelevant to and so ignored |
| during decompression. |
| |
| Compression and decompression requirements, in bytes, can be estimated |
| as: |
| @example |
| Compression: 400k + ( 8 x block size ) |
| |
| Decompression: 100k + ( 4 x block size ), or |
| 100k + ( 2.5 x block size ) |
| @end example |
| Larger block sizes give rapidly diminishing marginal returns. Most of |
| the compression comes from the first two or three hundred k of block |
| size, a fact worth bearing in mind when using @code{bzip2} on small machines. |
| It is also important to appreciate that the decompression memory |
| requirement is set at compression time by the choice of block size. |
| |
| For files compressed with the default 900k block size, @code{bunzip2} |
| will require about 3700 kbytes to decompress. To support decompression |
| of any file on a 4 megabyte machine, @code{bunzip2} has an option to |
| decompress using approximately half this amount of memory, about 2300 |
| kbytes. Decompression speed is also halved, so you should use this |
| option only where necessary. The relevant flag is @code{-s}. |
| |
| In general, try and use the largest block size memory constraints allow, |
| since that maximises the compression achieved. Compression and |
| decompression speed are virtually unaffected by block size. |
| |
| Another significant point applies to files which fit in a single block |
| -- that means most files you'd encounter using a large block size. The |
| amount of real memory touched is proportional to the size of the file, |
| since the file is smaller than a block. For example, compressing a file |
| 20,000 bytes long with the flag @code{-9} will cause the compressor to |
| allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 |
| kbytes of it. Similarly, the decompressor will allocate 3700k but only |
| touch 100k + 20000 * 4 = 180 kbytes. |
| |
| Here is a table which summarises the maximum memory usage for different |
| block sizes. Also recorded is the total compressed size for 14 files of |
| the Calgary Text Compression Corpus totalling 3,141,622 bytes. This |
| column gives some feel for how compression varies with block size. |
| These figures tend to understate the advantage of larger block sizes for |
| larger files, since the Corpus is dominated by smaller files. |
| @example |
| Compress Decompress Decompress Corpus |
| Flag usage usage -s usage Size |
| |
| -1 1200k 500k 350k 914704 |
| -2 2000k 900k 600k 877703 |
| -3 2800k 1300k 850k 860338 |
| -4 3600k 1700k 1100k 846899 |
| -5 4400k 2100k 1350k 845160 |
| -6 5200k 2500k 1600k 838626 |
| -7 6100k 2900k 1850k 834096 |
| -8 6800k 3300k 2100k 828642 |
| -9 7600k 3700k 2350k 828642 |
| @end example |
| |
| @unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES |
| |
| @code{bzip2} compresses files in blocks, usually 900kbytes long. Each |
| block is handled independently. If a media or transmission error causes |
| a multi-block @code{.bz2} file to become damaged, it may be possible to |
| recover data from the undamaged blocks in the file. |
| |
| The compressed representation of each block is delimited by a 48-bit |
| pattern, which makes it possible to find the block boundaries with |
| reasonable certainty. Each block also carries its own 32-bit CRC, so |
| damaged blocks can be distinguished from undamaged ones. |
| |
| @code{bzip2recover} is a simple program whose purpose is to search for |
| blocks in @code{.bz2} files, and write each block out into its own |
| @code{.bz2} file. You can then use @code{bzip2 -t} to test the |
| integrity of the resulting files, and decompress those which are |
| undamaged. |
| |
| @code{bzip2recover} |
| takes a single argument, the name of the damaged file, |
| and writes a number of files @code{rec0001file.bz2}, |
| @code{rec0002file.bz2}, etc, containing the extracted blocks. |
| The output filenames are designed so that the use of |
| wildcards in subsequent processing -- for example, |
| @code{bzip2 -dc rec*file.bz2 > recovered_data} -- lists the files in |
| the correct order. |
| |
| @code{bzip2recover} should be of most use dealing with large @code{.bz2} |
| files, as these will contain many blocks. It is clearly |
| futile to use it on damaged single-block files, since a |
| damaged block cannot be recovered. If you wish to minimise |
| any potential data loss through media or transmission errors, |
| you might consider compressing with a smaller |
| block size. |
| |
| |
| @unnumberedsubsubsec PERFORMANCE NOTES |
| |
| The sorting phase of compression gathers together similar strings in the |
| file. Because of this, files containing very long runs of repeated |
| symbols, like "aabaabaabaab ..." (repeated several hundred times) may |
| compress more slowly than normal. Versions 0.9.5 and above fare much |
| better than previous versions in this respect. The ratio between |
| worst-case and average-case compression time is in the region of 10:1. |
| For previous versions, this figure was more like 100:1. You can use the |
| @code{-vvvv} option to monitor progress in great detail, if you want. |
| |
| Decompression speed is unaffected by these phenomena. |
| |
| @code{bzip2} usually allocates several megabytes of memory to operate |
| in, and then charges all over it in a fairly random fashion. This means |
| that performance, both for compressing and decompressing, is largely |
| determined by the speed at which your machine can service cache misses. |
| Because of this, small changes to the code to reduce the miss rate have |
| been observed to give disproportionately large performance improvements. |
| I imagine @code{bzip2} will perform best on machines with very large |
| caches. |
| |
| |
| @unnumberedsubsubsec CAVEATS |
| |
| I/O error messages are not as helpful as they could be. @code{bzip2} |
| tries hard to detect I/O errors and exit cleanly, but the details of |
| what the problem is sometimes seem rather misleading. |
| |
| This manual page pertains to version 0.9.5 of @code{bzip2}. Compressed |
| data created by this version is entirely forwards and backwards |
| compatible with the previous public releases, versions 0.1pl2 and 0.9.0, |
| but with the following exception: 0.9.0 and above can correctly |
| decompress multiple concatenated compressed files. 0.1pl2 cannot do |
| this; it will stop after decompressing just the first file in the |
| stream. |
| |
| @code{bzip2recover} uses 32-bit integers to represent bit positions in |
| compressed files, so it cannot handle compressed files more than 512 |
| megabytes long. This could easily be fixed. |
| |
| |
| @unnumberedsubsubsec AUTHOR |
| Julian Seward, @code{jseward@@acm.org}. |
| |
| The ideas embodied in @code{bzip2} are due to (at least) the following |
| people: Michael Burrows and David Wheeler (for the block sorting |
| transformation), David Wheeler (again, for the Huffman coder), Peter |
| Fenwick (for the structured coding model in the original @code{bzip}, |
| and many refinements), and Alistair Moffat, Radford Neal and Ian Witten |
| (for the arithmetic coder in the original @code{bzip}). I am much |
| indebted for their help, support and advice. See the manual in the |
| source distribution for pointers to sources of documentation. Christian |
| von Roques encouraged me to look for faster sorting algorithms, so as to |
| speed up compression. Bela Lubkin encouraged me to improve the |
| worst-case compression performance. Many people sent patches, helped |
| with portability problems, lent machines, gave advice and were generally |
| helpful. |
| |
| @end quotation |
| |
| |
| |
| |
| @chapter Programming with @code{libbzip2} |
| |
| This chapter describes the programming interface to @code{libbzip2}. |
| |
| For general background information, particularly about memory |
| use and performance aspects, you'd be well advised to read Chapter 2 |
| as well. |
| |
| @section Top-level structure |
| |
| @code{libbzip2} is a flexible library for compressing and decompressing |
| data in the @code{bzip2} data format. Although packaged as a single |
| entity, it helps to regard the library as three separate parts: the low |
| level interface, and the high level interface, and some utility |
| functions. |
| |
| The structure of @code{libbzip2}'s interfaces is similar to |
| that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib} |
| library. |
| |
| @subsection Low-level summary |
| |
| This interface provides services for compressing and decompressing |
| data in memory. There's no provision for dealing with files, streams |
| or any other I/O mechanisms, just straight memory-to-memory work. |
| In fact, this part of the library can be compiled without inclusion |
| of @code{stdio.h}, which may be helpful for embedded applications. |
| |
| The low-level part of the library has no global variables and |
| is therefore thread-safe. |
| |
| Six routines make up the low level interface: |
| @code{bzCompressInit}, @code{bzCompress}, and @* @code{bzCompressEnd} |
| for compression, |
| and a corresponding trio @code{bzDecompressInit}, @* @code{bzDecompress} |
| and @code{bzDecompressEnd} for decompression. |
| The @code{*Init} functions allocate |
| memory for compression/decompression and do other |
| initialisations, whilst the @code{*End} functions close down operations |
| and release memory. |
| |
| The real work is done by @code{bzCompress} and @code{bzDecompress}. |
| These compress/decompress data from a user-supplied input buffer |
| to a user-supplied output buffer. These buffers can be any size; |
| arbitrary quantities of data are handled by making repeated calls |
| to these functions. This is a flexible mechanism allowing a |
| consumer-pull style of activity, or producer-push, or a mixture of |
| both. |
| |
| |
| |
| @subsection High-level summary |
| |
| This interface provides some handy wrappers around the low-level |
| interface to facilitate reading and writing @code{bzip2} format |
| files (@code{.bz2} files). The routines provide hooks to facilitate |
| reading files in which the @code{bzip2} data stream is embedded |
| within some larger-scale file structure, or where there are |
| multiple @code{bzip2} data streams concatenated end-to-end. |
| |
| For reading files, @code{bzReadOpen}, @code{bzRead}, @code{bzReadClose} |
| and @code{bzReadGetUnused} are supplied. For writing files, |
| @code{bzWriteOpen}, @code{bzWrite} and @code{bzWriteFinish} are |
| available. |
| |
| As with the low-level library, no global variables are used |
| so the library is per se thread-safe. However, if I/O errors |
| occur whilst reading or writing the underlying compressed files, |
| you may have to consult @code{errno} to determine the cause of |
| the error. In that case, you'd need a C library which correctly |
| supports @code{errno} in a multithreaded environment. |
| |
| To make the library a little simpler and more portable, |
| @code{bzReadOpen} and @code{bzWriteOpen} require you to pass them file |
| handles (@code{FILE*}s) which have previously been opened for reading or |
| writing respectively. That avoids portability problems associated with |
| file operations and file attributes, whilst not being much of an |
| imposition on the programmer. |
| |
| |
| |
| @subsection Utility functions summary |
| For very simple needs, @code{bzBuffToBuffCompress} and |
| @code{bzBuffToBuffDecompress} are provided. These compress |
| data in memory from one buffer to another buffer in a single |
| function call. You should assess whether these functions |
| fulfill your memory-to-memory compression/decompression |
| requirements before investing effort in understanding the more |
| general but more complex low-level interface. |
| |
| Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} / |
| @code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to |
| give better @code{zlib} compatibility. These functions are |
| @code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush}, |
| @code{bzclose}, |
| @code{bzerror} and @code{bzlibVersion}. You may find these functions |
| more convenient for simple file reading and writing, than those in the |
| high-level interface. These functions are not (yet) officially part of |
| the library, and are minimally documented here. If they break, you |
| get to keep all the pieces. I hope to document them properly when time |
| permits. |
| |
| Yoshioka also contributed modifications to allow the library to be |
| built as a Windows DLL. |
| |
| |
| @section Error handling |
| |
| The library is designed to recover cleanly in all situations, including |
| the worst-case situation of decompressing random data. I'm not |
| 100% sure that it can always do this, so you might want to add |
| a signal handler to catch segmentation violations during decompression |
| if you are feeling especially paranoid. I would be interested in |
| hearing more about the robustness of the library to corrupted |
| compressed data. |
| |
| The file @code{bzlib.h} contains all definitions needed to use |
| the library. In particular, you should definitely not include |
| @code{bzlib_private.h}. |
| |
| In @code{bzlib.h}, the various return values are defined. The following |
| list is not intended as an exhaustive description of the circumstances |
| in which a given value may be returned -- those descriptions are given |
| later. Rather, it is intended to convey the rough meaning of each |
| return value. The first five actions are normal and not intended to |
| denote an error situation. |
| @table @code |
| @item BZ_OK |
| The requested action was completed successfully. |
| @item BZ_RUN_OK |
| @itemx BZ_FLUSH_OK |
| @itemx BZ_FINISH_OK |
| In @code{bzCompress}, the requested flush/finish/nothing-special action |
| was completed successfully. |
| @item BZ_STREAM_END |
| Compression of data was completed, or the logical stream end was |
| detected during decompression. |
| @end table |
| |
| The following return values indicate an error of some kind. |
| @table @code |
| @item BZ_SEQUENCE_ERROR |
| When using the library, it is important to call the functions in the |
| correct sequence and with data structures (buffers etc) in the correct |
| states. @code{libbzip2} checks as much as it can to ensure this is |
| happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which |
| complies precisely with the function semantics, as detailed below, |
| should never receive this value; such an event denotes buggy code |
| which you should investigate. |
| @item BZ_PARAM_ERROR |
| Returned when a parameter to a function call is out of range |
| or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR}, |
| this denotes a bug in the client code. The distinction between |
| @code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth |
| making. |
| @item BZ_MEM_ERROR |
| Returned when a request to allocate memory failed. Note that the |
| quantity of memory needed to decompress a stream cannot be determined |
| until the stream's header has been read. So @code{bzDecompress} and |
| @code{bzRead} may return @code{BZ_MEM_ERROR} even though some of |
| the compressed data has been read. The same is not true for |
| compression; once @code{bzCompressInit} or @code{bzWriteOpen} have |
| successfully completed, @code{BZ_MEM_ERROR} cannot occur. |
| @item BZ_DATA_ERROR |
| Returned when a data integrity error is detected during decompression. |
| Most importantly, this means when stored and computed CRCs for the |
| data do not match. This value is also returned upon detection of any |
| other anomaly in the compressed data. |
| @item BZ_DATA_ERROR_MAGIC |
| As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to |
| know when the compressed stream does not start with the correct |
| magic bytes (@code{'B' 'Z' 'h'}). |
| @item BZ_IO_ERROR |
| Returned by @code{bzRead} and @code{bzRead} when there is an error |
| reading or writing in the compressed file, and by @code{bzReadOpen} |
| and @code{bzWriteOpen} for attempts to use a file for which the |
| error indicator (viz, @code{ferror(f)}) is set. |
| On receipt of @code{BZ_IO_ERROR}, the caller should consult |
| @code{errno} and/or @code{perror} to acquire operating-system |
| specific information about the problem. |
| @item BZ_UNEXPECTED_EOF |
| Returned by @code{bzRead} when the compressed file finishes |
| before the logical end of stream is detected. |
| @item BZ_OUTBUFF_FULL |
| Returned by @code{bzBuffToBuffCompress} and |
| @code{bzBuffToBuffDecompress} to indicate that the output data |
| will not fit into the output buffer provided. |
| @end table |
| |
| |
| |
| @section Low-level interface |
| |
| @subsection @code{bzCompressInit} |
| @example |
| typedef |
| struct @{ |
| char *next_in; |
| unsigned int avail_in; |
| unsigned int total_in; |
| |
| char *next_out; |
| unsigned int avail_out; |
| unsigned int total_out; |
| |
| void *state; |
| |
| void *(*bzalloc)(void *,int,int); |
| void (*bzfree)(void *,void *); |
| void *opaque; |
| @} |
| bz_stream; |
| |
| int bzCompressInit ( bz_stream *strm, |
| int blockSize100k, |
| int verbosity, |
| int workFactor ); |
| |
| @end example |
| |
| Prepares for compression. The @code{bz_stream} structure |
| holds all data pertaining to the compression activity. |
| A @code{bz_stream} structure should be allocated and initialised |
| prior to the call. |
| The fields of @code{bz_stream} |
| comprise the entirety of the user-visible data. @code{state} |
| is a pointer to the private data structures required for compression. |
| |
| Custom memory allocators are supported, via fields @code{bzalloc}, |
| @code{bzfree}, |
| and @code{opaque}. The value |
| @code{opaque} is passed to as the first argument to |
| all calls to @code{bzalloc} and @code{bzfree}, but is |
| otherwise ignored by the library. |
| The call @code{bzalloc ( opaque, n, m )} is expected to return a |
| pointer @code{p} to |
| @code{n * m} bytes of memory, and @code{bzfree ( opaque, p )} |
| should free |
| that memory. |
| |
| If you don't want to use a custom memory allocator, set @code{bzalloc}, |
| @code{bzfree} and |
| @code{opaque} to @code{NULL}, |
| and the library will then use the standard @code{malloc}/@code{free} |
| routines. |
| |
| Before calling @code{bzCompressInit}, fields @code{bzalloc}, |
| @code{bzfree} and @code{opaque} should |
| be filled appropriately, as just described. Upon return, the internal |
| state will have been allocated and initialised, and @code{total_in} and |
| @code{total_out} will have been set to zero. |
| These last two fields are used by the library |
| to inform the caller of the total amount of data passed into and out of |
| the library, respectively. You should not try to change them. |
| |
| Parameter @code{blockSize100k} specifies the block size to be used for |
| compression. It should be a value between 1 and 9 inclusive, and the |
| actual block size used is 100000 x this figure. 9 gives the best |
| compression but takes most memory. |
| |
| Parameter @code{verbosity} should be set to a number between 0 and 4 |
| inclusive. 0 is silent, and greater numbers give increasingly verbose |
| monitoring/debugging output. If the library has been compiled with |
| @code{-DBZ_NO_STDIO}, no such output will appear for any verbosity |
| setting. |
| |
| Parameter @code{workFactor} controls how the compression phase behaves |
| when presented with worst case, highly repetitive, input data. If |
| compression runs into difficulties caused by repetitive data, the |
| library switches from the standard sorting algorithm to a fallback |
| algorithm. The fallback is slower than the standard algorithm by |
| perhaps a factor of three, but always behaves reasonably, no matter how |
| bad the input. |
| |
| Lower values of @code{workFactor} reduce the amount of effort the |
| standard algorithm will expend before resorting to the fallback. You |
| should set this parameter carefully; too low, and many inputs will be |
| handled by the fallback algorithm and so compress rather slowly, too |
| high, and your average-to-worst case compression times can become very |
| large. The default value of 30 gives reasonable behaviour over a wide |
| range of circumstances. |
| |
| Allowable values range from 0 to 250 inclusive. 0 is a special case, |
| equivalent to using the default value of 30. |
| |
| Note that the compressed output generated is the same regardless of |
| whether or not the fallback algorithm is used. |
| |
| Be aware also that this parameter may disappear entirely in future |
| versions of the library. In principle it should be possible to devise a |
| good way to automatically choose which algorithm to use. Such a |
| mechanism would render the parameter obsolete. |
| |
| Possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{strm} is @code{NULL} |
| or @code{blockSize} < 1 or @code{blockSize} > 9 |
| or @code{verbosity} < 0 or @code{verbosity} > 4 |
| or @code{workFactor} < 0 or @code{workFactor} > 250 |
| @code{BZ_MEM_ERROR} |
| if not enough memory is available |
| @code{BZ_OK} |
| otherwise |
| @end display |
| Allowable next actions: |
| @display |
| @code{bzCompress} |
| if @code{BZ_OK} is returned |
| no specific action needed in case of error |
| @end display |
| |
| @subsection @code{bzCompress} |
| @example |
| int bzCompress ( bz_stream *strm, int action ); |
| @end example |
| Provides more input and/or output buffer space for the library. The |
| caller maintains input and output buffers, and calls @code{bzCompress} to |
| transfer data between them. |
| |
| Before each call to @code{bzCompress}, @code{next_in} should point at |
| the data to be compressed, and @code{avail_in} should indicate how many |
| bytes the library may read. @code{bzCompress} updates @code{next_in}, |
| @code{avail_in} and @code{total_in} to reflect the number of bytes it |
| has read. |
| |
| Similarly, @code{next_out} should point to a buffer in which the |
| compressed data is to be placed, with @code{avail_out} indicating how |
| much output space is available. @code{bzCompress} updates |
| @code{next_out}, @code{avail_out} and @code{total_out} to reflect the |
| number of bytes output. |
| |
| You may provide and remove as little or as much data as you like on each |
| call of @code{bzCompress}. In the limit, it is acceptable to supply and |
| remove data one byte at a time, although this would be terribly |
| inefficient. You should always ensure that at least one byte of output |
| space is available at each call. |
| |
| A second purpose of @code{bzCompress} is to request a change of mode of the |
| compressed stream. |
| |
| Conceptually, a compressed stream can be in one of four states: IDLE, |
| RUNNING, FLUSHING and FINISHING. Before initialisation |
| (@code{bzCompressInit}) and after termination (@code{bzCompressEnd}), a |
| stream is regarded as IDLE. |
| |
| Upon initialisation (@code{bzCompressInit}), the stream is placed in the |
| RUNNING state. Subsequent calls to @code{bzCompress} should pass |
| @code{BZ_RUN} as the requested action; other actions are illegal and |
| will result in @code{BZ_SEQUENCE_ERROR}. |
| |
| At some point, the calling program will have provided all the input data |
| it wants to. It will then want to finish up -- in effect, asking the |
| library to process any data it might have buffered internally. In this |
| state, @code{bzCompress} will no longer attempt to read data from |
| @code{next_in}, but it will want to write data to @code{next_out}. |
| Because the output buffer supplied by the user can be arbitrarily small, |
| the finishing-up operation cannot necessarily be done with a single call |
| of @code{bzCompress}. |
| |
| Instead, the calling program passes @code{BZ_FINISH} as an action to |
| @code{bzCompress}. This changes the stream's state to FINISHING. Any |
| remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and |
| transferred to the output buffer. To do this, @code{bzCompress} must be |
| called repeatedly until all the output has been consumed. At that |
| point, @code{bzCompress} returns @code{BZ_STREAM_END}, and the stream's |
| state is set back to IDLE. @code{bzCompressEnd} should then be |
| called. |
| |
| Just to make sure the calling program does not cheat, the library makes |
| a note of @code{avail_in} at the time of the first call to |
| @code{bzCompress} which has @code{BZ_FINISH} as an action (ie, at the |
| time the program has announced its intention to not supply any more |
| input). By comparing this value with that of @code{avail_in} over |
| subsequent calls to @code{bzCompress}, the library can detect any |
| attempts to slip in more data to compress. Any calls for which this is |
| detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a |
| programming mistake which should be corrected. |
| |
| Instead of asking to finish, the calling program may ask |
| @code{bzCompress} to take all the remaining input, compress it and |
| terminate the current (Burrows-Wheeler) compression block. This could |
| be useful for error control purposes. The mechanism is analogous to |
| that for finishing: call @code{bzCompress} with an action of |
| @code{BZ_FLUSH}, remove output data, and persist with the |
| @code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned. As |
| with finishing, @code{bzCompress} detects any attempt to provide more |
| input data once the flush has begun. |
| |
| Once the flush is complete, the stream returns to the normal RUNNING |
| state. |
| |
| This all sounds pretty complex, but isn't really. Here's a table |
| which shows which actions are allowable in each state, what action |
| will be taken, what the next state is, and what the non-error return |
| values are. Note that you can't explicitly ask what state the |
| stream is in, but nor do you need to -- it can be inferred from the |
| values returned by @code{bzCompress}. |
| @display |
| IDLE/@code{any} |
| Illegal. IDLE state only exists after @code{bzCompressEnd} or |
| before @code{bzCompressInit}. |
| Return value = @code{BZ_SEQUENCE_ERROR} |
| |
| RUNNING/@code{BZ_RUN} |
| Compress from @code{next_in} to @code{next_out} as much as possible. |
| Next state = RUNNING |
| Return value = @code{BZ_RUN_OK} |
| |
| RUNNING/@code{BZ_FLUSH} |
| Remember current value of @code{next_in}. Compress from @code{next_in} |
| to @code{next_out} as much as possible, but do not accept any more input. |
| Next state = FLUSHING |
| Return value = @code{BZ_FLUSH_OK} |
| |
| RUNNING/@code{BZ_FINISH} |
| Remember current value of @code{next_in}. Compress from @code{next_in} |
| to @code{next_out} as much as possible, but do not accept any more input. |
| Next state = FINISHING |
| Return value = @code{BZ_FINISH_OK} |
| |
| FLUSHING/@code{BZ_FLUSH} |
| Compress from @code{next_in} to @code{next_out} as much as possible, |
| but do not accept any more input. |
| If all the existing input has been used up and all compressed |
| output has been removed |
| Next state = RUNNING; Return value = @code{BZ_RUN_OK} |
| else |
| Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK} |
| |
| FLUSHING/other |
| Illegal. |
| Return value = @code{BZ_SEQUENCE_ERROR} |
| |
| FINISHING/@code{BZ_FINISH} |
| Compress from @code{next_in} to @code{next_out} as much as possible, |
| but to not accept any more input. |
| If all the existing input has been used up and all compressed |
| output has been removed |
| Next state = IDLE; Return value = @code{BZ_STREAM_END} |
| else |
| Next state = FINISHING; Return value = @code{BZ_FINISHING} |
| |
| FINISHING/other |
| Illegal. |
| Return value = @code{BZ_SEQUENCE_ERROR} |
| @end display |
| |
| That still looks complicated? Well, fair enough. The usual sequence |
| of calls for compressing a load of data is: |
| @itemize @bullet |
| @item Get started with @code{bzCompressInit}. |
| @item Shovel data in and shlurp out its compressed form using zero or more |
| calls of @code{bzCompress} with action = @code{BZ_RUN}. |
| @item Finish up. |
| Repeatedly call @code{bzCompress} with action = @code{BZ_FINISH}, |
| copying out the compressed output, until @code{BZ_STREAM_END} is returned. |
| @item Close up and go home. Call @code{bzCompressEnd}. |
| @end itemize |
| If the data you want to compress fits into your input buffer all |
| at once, you can skip the calls of @code{bzCompress ( ..., BZ_RUN )} and |
| just do the @code{bzCompress ( ..., BZ_FINISH )} calls. |
| |
| All required memory is allocated by @code{bzCompressInit}. The |
| compression library can accept any data at all (obviously). So you |
| shouldn't get any error return values from the @code{bzCompress} calls. |
| If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in |
| your programming. |
| |
| Trivial other possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL} |
| @end display |
| |
| @subsection @code{bzCompressEnd} |
| @example |
| int bzCompressEnd ( bz_stream *strm ); |
| @end example |
| Releases all memory associated with a compression stream. |
| |
| Possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} |
| @code{BZ_OK} otherwise |
| @end display |
| |
| |
| @subsection @code{bzDecompressInit} |
| @example |
| int bzDecompressInit ( bz_stream *strm, int verbosity, int small ); |
| @end example |
| Prepares for decompression. As with @code{bzCompressInit}, a |
| @code{bz_stream} record should be allocated and initialised before the |
| call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be |
| set if a custom memory allocator is required, or made @code{NULL} for |
| the normal @code{malloc}/@code{free} routines. Upon return, the internal |
| state will have been initialised, and @code{total_in} and |
| @code{total_out} will be zero. |
| |
| For the meaning of parameter @code{verbosity}, see @code{bzCompressInit}. |
| |
| If @code{small} is nonzero, the library will use an alternative |
| decompression algorithm which uses less memory but at the cost of |
| decompressing more slowly (roughly speaking, half the speed, but the |
| maximum memory requirement drops to around 2300k). See Chapter 2 for |
| more information on memory management. |
| |
| Note that the amount of memory needed to decompress |
| a stream cannot be determined until the stream's header has been read, |
| so even if @code{bzDecompressInit} succeeds, a subsequent |
| @code{bzDecompress} could fail with @code{BZ_MEM_ERROR}. |
| |
| Possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{(small != 0 && small != 1)} |
| or @code{(verbosity < 0 || verbosity > 4)} |
| @code{BZ_MEM_ERROR} |
| if insufficient memory is available |
| @end display |
| |
| Allowable next actions: |
| @display |
| @code{bzDecompress} |
| if @code{BZ_OK} was returned |
| no specific action required in case of error |
| @end display |
| |
| |
| |
| @subsection @code{bzDecompress} |
| @example |
| int bzDecompress ( bz_stream *strm ); |
| @end example |
| Provides more input and/out output buffer space for the library. The |
| caller maintains input and output buffers, and uses @code{bzDecompress} |
| to transfer data between them. |
| |
| Before each call to @code{bzDecompress}, @code{next_in} |
| should point at the compressed data, |
| and @code{avail_in} should indicate how many bytes the library |
| may read. @code{bzDecompress} updates @code{next_in}, @code{avail_in} |
| and @code{total_in} |
| to reflect the number of bytes it has read. |
| |
| Similarly, @code{next_out} should point to a buffer in which the uncompressed |
| output is to be placed, with @code{avail_out} indicating how much output space |
| is available. @code{bzCompress} updates @code{next_out}, |
| @code{avail_out} and @code{total_out} to reflect |
| the number of bytes output. |
| |
| You may provide and remove as little or as much data as you like on |
| each call of @code{bzDecompress}. |
| In the limit, it is acceptable to |
| supply and remove data one byte at a time, although this would be |
| terribly inefficient. You should always ensure that at least one |
| byte of output space is available at each call. |
| |
| Use of @code{bzDecompress} is simpler than @code{bzCompress}. |
| |
| You should provide input and remove output as described above, and |
| repeatedly call @code{bzDecompress} until @code{BZ_STREAM_END} is |
| returned. Appearance of @code{BZ_STREAM_END} denotes that |
| @code{bzDecompress} has detected the logical end of the compressed |
| stream. @code{bzDecompress} will not produce @code{BZ_STREAM_END} until |
| all output data has been placed into the output buffer, so once |
| @code{BZ_STREAM_END} appears, you are guaranteed to have available all |
| the decompressed output, and @code{bzDecompressEnd} can safely be |
| called. |
| |
| If case of an error return value, you should call @code{bzDecompressEnd} |
| to clean up and release memory. |
| |
| Possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} |
| or @code{strm->avail_out < 1} |
| @code{BZ_DATA_ERROR} |
| if a data integrity error is detected in the compressed stream |
| @code{BZ_DATA_ERROR_MAGIC} |
| if the compressed stream doesn't begin with the right magic bytes |
| @code{BZ_MEM_ERROR} |
| if there wasn't enough memory available |
| @code{BZ_STREAM_END} |
| if the logical end of the data stream was detected and all |
| output in has been consumed, eg @code{s->avail_out > 0} |
| @code{BZ_OK} |
| otherwise |
| @end display |
| Allowable next actions: |
| @display |
| @code{bzDecompress} |
| if @code{BZ_OK} was returned |
| @code{bzDecompressEnd} |
| otherwise |
| @end display |
| |
| |
| @subsection @code{bzDecompressEnd} |
| @example |
| int bzDecompressEnd ( bz_stream *strm ); |
| @end example |
| Releases all memory associated with a decompression stream. |
| |
| Possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| Allowable next actions: |
| @display |
| None. |
| @end display |
| |
| |
| @section High-level interface |
| |
| This interface provides functions for reading and writing |
| @code{bzip2} format files. First, some general points. |
| |
| @itemize @bullet |
| @item All of the functions take an @code{int*} first argument, |
| @code{bzerror}. |
| After each call, @code{bzerror} should be consulted first to determine |
| the outcome of the call. If @code{bzerror} is @code{BZ_OK}, |
| the call completed |
| successfully, and only then should the return value of the function |
| (if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR}, |
| there was an error |
| reading/writing the underlying compressed file, and you should |
| then consult @code{errno}/@code{perror} to determine the |
| cause of the difficulty. |
| @code{bzerror} may also be set to various other values; precise details are |
| given on a per-function basis below. |
| @item If @code{bzerror} indicates an error |
| (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}), |
| you should immediately call @code{bzReadClose} (or @code{bzWriteClose}, |
| depending on whether you are attempting to read or to write) |
| to free up all resources associated |
| with the stream. Once an error has been indicated, behaviour of all calls |
| except @code{bzReadClose} (@code{bzWriteClose}) is undefined. |
| The implication is that (1) @code{bzerror} should |
| be checked after each call, and (2) if @code{bzerror} indicates an error, |
| @code{bzReadClose} (@code{bzWriteClose}) should then be called to clean up. |
| @item The @code{FILE*} arguments passed to |
| @code{bzReadOpen}/@code{bzWriteOpen} |
| should be set to binary mode. |
| Most Unix systems will do this by default, but other platforms, |
| including Windows and Mac, will not. If you omit this, you may |
| encounter problems when moving code to new platforms. |
| @item Memory allocation requests are handled by |
| @code{malloc}/@code{free}. |
| At present |
| there is no facility for user-defined memory allocators in the file I/O |
| functions (could easily be added, though). |
| @end itemize |
| |
| |
| |
| @subsection @code{bzReadOpen} |
| @example |
| typedef void BZFILE; |
| |
| BZFILE *bzReadOpen ( int *bzerror, FILE *f, |
| int small, int verbosity, |
| void *unused, int nUnused ); |
| @end example |
| Prepare to read compressed data from file handle @code{f}. @code{f} |
| should refer to a file which has been opened for reading, and for which |
| the error indicator (@code{ferror(f)})is not set. If @code{small} is 1, |
| the library will try to decompress using less memory, at the expense of |
| speed. |
| |
| For reasons explained below, @code{bzRead} will decompress the |
| @code{nUnused} bytes starting at @code{unused}, before starting to read |
| from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be |
| supplied like this. If this facility is not required, you should pass |
| @code{NULL} and @code{0} for @code{unused} and n@code{Unused} |
| respectively. |
| |
| For the meaning of parameters @code{small} and @code{verbosity}, |
| see @code{bzDecompressInit}. |
| |
| The amount of memory needed to decompress a file cannot be determined |
| until the file's header has been read. So it is possible that |
| @code{bzReadOpen} returns @code{BZ_OK} but a subsequent call of |
| @code{bzRead} will return @code{BZ_MEM_ERROR}. |
| |
| Possible assignments to @code{bzerror}: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{f} is @code{NULL} |
| or @code{small} is neither @code{0} nor @code{1} |
| or @code{(unused == NULL && nUnused != 0)} |
| or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))} |
| @code{BZ_IO_ERROR} |
| if @code{ferror(f)} is nonzero |
| @code{BZ_MEM_ERROR} |
| if insufficient memory is available |
| @code{BZ_OK} |
| otherwise. |
| @end display |
| |
| Possible return values: |
| @display |
| Pointer to an abstract @code{BZFILE} |
| if @code{bzerror} is @code{BZ_OK} |
| @code{NULL} |
| otherwise |
| @end display |
| |
| Allowable next actions: |
| @display |
| @code{bzRead} |
| if @code{bzerror} is @code{BZ_OK} |
| @code{bzClose} |
| otherwise |
| @end display |
| |
| |
| @subsection @code{bzRead} |
| @example |
| int bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); |
| @end example |
| Reads up to @code{len} (uncompressed) bytes from the compressed file |
| @code{b} into |
| the buffer @code{buf}. If the read was successful, |
| @code{bzerror} is set to @code{BZ_OK} |
| and the number of bytes read is returned. If the logical end-of-stream |
| was detected, @code{bzerror} will be set to @code{BZ_STREAM_END}, |
| and the number |
| of bytes read is returned. All other @code{bzerror} values denote an error. |
| |
| @code{bzRead} will supply @code{len} bytes, |
| unless the logical stream end is detected |
| or an error occurs. Because of this, it is possible to detect the |
| stream end by observing when the number of bytes returned is |
| less than the number |
| requested. Nevertheless, this is regarded as inadvisable; you should |
| instead check @code{bzerror} after every call and watch out for |
| @code{BZ_STREAM_END}. |
| |
| Internally, @code{bzRead} copies data from the compressed file in chunks |
| of size @code{BZ_MAX_UNUSED} bytes |
| before decompressing it. If the file contains more bytes than strictly |
| needed to reach the logical end-of-stream, @code{bzRead} will almost certainly |
| read some of the trailing data before signalling @code{BZ_SEQUENCE_END}. |
| To collect the read but unused data once @code{BZ_SEQUENCE_END} has |
| appeared, call @code{bzReadGetUnused} immediately before @code{bzReadClose}. |
| |
| Possible assignments to @code{bzerror}: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} |
| @code{BZ_SEQUENCE_ERROR} |
| if @code{b} was opened with @code{bzWriteOpen} |
| @code{BZ_IO_ERROR} |
| if there is an error reading from the compressed file |
| @code{BZ_UNEXPECTED_EOF} |
| if the compressed file ended before the logical end-of-stream was detected |
| @code{BZ_DATA_ERROR} |
| if a data integrity error was detected in the compressed stream |
| @code{BZ_DATA_ERROR_MAGIC} |
| if the stream does not begin with the requisite header bytes (ie, is not |
| a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}. |
| @code{BZ_MEM_ERROR} |
| if insufficient memory was available |
| @code{BZ_STREAM_END} |
| if the logical end of stream was detected. |
| @code{BZ_OK} |
| otherwise. |
| @end display |
| |
| Possible return values: |
| @display |
| number of bytes read |
| if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END} |
| undefined |
| otherwise |
| @end display |
| |
| Allowable next actions: |
| @display |
| collect data from @code{buf}, then @code{bzRead} or @code{bzReadClose} |
| if @code{bzerror} is @code{BZ_OK} |
| collect data from @code{buf}, then @code{bzReadClose} or @code{bzReadGetUnused} |
| if @code{bzerror} is @code{BZ_SEQUENCE_END} |
| @code{bzReadClose} |
| otherwise |
| @end display |
| |
| |
| |
| @subsection @code{bzReadGetUnused} |
| @example |
| void bzReadGetUnused ( int* bzerror, BZFILE *b, |
| void** unused, int* nUnused ); |
| @end example |
| Returns data which was read from the compressed file but was not needed |
| to get to the logical end-of-stream. @code{*unused} is set to the address |
| of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will |
| be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive. |
| |
| This function may only be called once @code{bzRead} has signalled |
| @code{BZ_STREAM_END} but before @code{bzReadClose}. |
| |
| Possible assignments to @code{bzerror}: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{b} is @code{NULL} |
| or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL} |
| @code{BZ_SEQUENCE_ERROR} |
| if @code{BZ_STREAM_END} has not been signalled |
| or if @code{b} was opened with @code{bzWriteOpen} |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| Allowable next actions: |
| @display |
| @code{bzReadClose} |
| @end display |
| |
| |
| @subsection @code{bzReadClose} |
| @example |
| void bzReadClose ( int *bzerror, BZFILE *b ); |
| @end example |
| Releases all memory pertaining to the compressed file @code{b}. |
| @code{bzReadClose} does not call @code{fclose} on the underlying file |
| handle, so you should do that yourself if appropriate. |
| @code{bzReadClose} should be called to clean up after all error |
| situations. |
| |
| Possible assignments to @code{bzerror}: |
| @display |
| @code{BZ_SEQUENCE_ERROR} |
| if @code{b} was opened with @code{bzOpenWrite} |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| Allowable next actions: |
| @display |
| none |
| @end display |
| |
| |
| |
| @subsection @code{bzWriteOpen} |
| @example |
| BZFILE *bzWriteOpen ( int *bzerror, FILE *f, |
| int blockSize100k, int verbosity, |
| int workFactor ); |
| @end example |
| Prepare to write compressed data to file handle @code{f}. |
| @code{f} should refer to |
| a file which has been opened for writing, and for which the error |
| indicator (@code{ferror(f)})is not set. |
| |
| For the meaning of parameters @code{blockSize100k}, |
| @code{verbosity} and @code{workFactor}, see |
| @* @code{bzCompressInit}. |
| |
| All required memory is allocated at this stage, so if the call |
| completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a |
| subsequent call to @code{bzWrite}. |
| |
| Possible assignments to @code{bzerror}: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{f} is @code{NULL} |
| or @code{blockSize100k < 1} or @code{blockSize100k > 9} |
| @code{BZ_IO_ERROR} |
| if @code{ferror(f)} is nonzero |
| @code{BZ_MEM_ERROR} |
| if insufficient memory is available |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| Possible return values: |
| @display |
| Pointer to an abstract @code{BZFILE} |
| if @code{bzerror} is @code{BZ_OK} |
| @code{NULL} |
| otherwise |
| @end display |
| |
| Allowable next actions: |
| @display |
| @code{bzWrite} |
| if @code{bzerror} is @code{BZ_OK} |
| (you could go directly to @code{bzWriteClose}, but this would be pretty pointless) |
| @code{bzWriteClose} |
| otherwise |
| @end display |
| |
| |
| |
| @subsection @code{bzWrite} |
| @example |
| void bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); |
| @end example |
| Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be |
| compressed and written to the file. |
| |
| Possible assignments to @code{bzerror}: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} |
| @code{BZ_SEQUENCE_ERROR} |
| if b was opened with @code{bzReadOpen} |
| @code{BZ_IO_ERROR} |
| if there is an error writing the compressed file. |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| |
| |
| |
| @subsection @code{bzWriteClose} |
| @example |
| int bzWriteClose ( int *bzerror, BZFILE* f, |
| int abandon, |
| unsigned int* nbytes_in, |
| unsigned int* nbytes_out ); |
| @end example |
| |
| Compresses and flushes to the compressed file all data so far supplied |
| by @code{bzWrite}. The logical end-of-stream markers are also written, so |
| subsequent calls to @code{bzWrite} are illegal. All memory associated |
| with the compressed file @code{b} is released. |
| @code{fflush} is called on the |
| compressed file, but it is not @code{fclose}'d. |
| |
| If @code{bzWriteClose} is called to clean up after an error, the only |
| action is to release the memory. The library records the error codes |
| issued by previous calls, so this situation will be detected |
| automatically. There is no attempt to complete the compression |
| operation, nor to @code{fflush} the compressed file. You can force this |
| behaviour to happen even in the case of no error, by passing a nonzero |
| value to @code{abandon}. |
| |
| If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the |
| total volume of uncompressed data handled. Similarly, @code{nbytes_out} |
| will be set to the total volume of compressed data written. |
| |
| Possible assignments to @code{bzerror}: |
| @display |
| @code{BZ_SEQUENCE_ERROR} |
| if @code{b} was opened with @code{bzReadOpen} |
| @code{BZ_IO_ERROR} |
| if there is an error writing the compressed file |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| @subsection Handling embedded compressed data streams |
| |
| The high-level library facilitates use of |
| @code{bzip2} data streams which form some part of a surrounding, larger |
| data stream. |
| @itemize @bullet |
| @item For writing, the library takes an open file handle, writes |
| compressed data to it, @code{fflush}es it but does not @code{fclose} it. |
| The calling application can write its own data before and after the |
| compressed data stream, using that same file handle. |
| @item Reading is more complex, and the facilities are not as general |
| as they could be since generality is hard to reconcile with efficiency. |
| @code{bzRead} reads from the compressed file in blocks of size |
| @code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot |
| the logical end of compressed stream. |
| To recover this data once decompression has |
| ended, call @code{bzReadGetUnused} after the last call of @code{bzRead} |
| (the one returning @code{BZ_STREAM_END}) but before calling |
| @code{bzReadClose}. |
| @end itemize |
| |
| This mechanism makes it easy to decompress multiple @code{bzip2} |
| streams placed end-to-end. As the end of one stream, when @code{bzRead} |
| returns @code{BZ_STREAM_END}, call @code{bzReadGetUnused} to collect the |
| unused data (copy it into your own buffer somewhere). |
| That data forms the start of the next compressed stream. |
| To start uncompressing that next stream, call @code{bzReadOpen} again, |
| feeding in the unused data via the @code{unused}/@code{nUnused} |
| parameters. |
| Keep doing this until @code{BZ_STREAM_END} return coincides with the |
| physical end of file (@code{feof(f)}). In this situation |
| @code{bzReadGetUnused} |
| will of course return no data. |
| |
| This should give some feel for how the high-level interface can be used. |
| If you require extra flexibility, you'll have to bite the bullet and get |
| to grips with the low-level interface. |
| |
| @subsection Standard file-reading/writing code |
| Here's how you'd write data to a compressed file: |
| @example @code |
| FILE* f; |
| BZFILE* b; |
| int nBuf; |
| char buf[ /* whatever size you like */ ]; |
| int bzerror; |
| int nWritten; |
| |
| f = fopen ( "myfile.bz2", "w" ); |
| if (!f) @{ |
| /* handle error */ |
| @} |
| b = bzWriteOpen ( &bzerror, f, 9 ); |
| if (bzerror != BZ_OK) @{ |
| bzWriteClose ( b ); |
| /* handle error */ |
| @} |
| |
| while ( /* condition */ ) @{ |
| /* get data to write into buf, and set nBuf appropriately */ |
| nWritten = bzWrite ( &bzerror, b, buf, nBuf ); |
| if (bzerror == BZ_IO_ERROR) @{ |
| bzWriteClose ( &bzerror, b ); |
| /* handle error */ |
| @} |
| @} |
| |
| bzWriteClose ( &bzerror, b ); |
| if (bzerror == BZ_IO_ERROR) @{ |
| /* handle error */ |
| @} |
| @end example |
| And to read from a compressed file: |
| @example |
| FILE* f; |
| BZFILE* b; |
| int nBuf; |
| char buf[ /* whatever size you like */ ]; |
| int bzerror; |
| int nWritten; |
| |
| f = fopen ( "myfile.bz2", "r" ); |
| if (!f) @{ |
| /* handle error */ |
| @} |
| b = bzReadOpen ( &bzerror, f, 0, NULL, 0 ); |
| if (bzerror != BZ_OK) @{ |
| bzReadClose ( &bzerror, b ); |
| /* handle error */ |
| @} |
| |
| bzerror = BZ_OK; |
| while (bzerror == BZ_OK && /* arbitrary other conditions */) @{ |
| nBuf = bzRead ( &bzerror, b, buf, /* size of buf */ ); |
| if (bzerror == BZ_OK) @{ |
| /* do something with buf[0 .. nBuf-1] */ |
| @} |
| @} |
| if (bzerror != BZ_STREAM_END) @{ |
| bzReadClose ( &bzerror, b ); |
| /* handle error */ |
| @} else @{ |
| bzReadClose ( &bzerror ); |
| @} |
| @end example |
| |
| |
| |
| @section Utility functions |
| @subsection @code{bzBuffToBuffCompress} |
| @example |
| int bzBuffToBuffCompress( char* dest, |
| unsigned int* destLen, |
| char* source, |
| unsigned int sourceLen, |
| int blockSize100k, |
| int verbosity, |
| int workFactor ); |
| @end example |
| Attempts to compress the data in @code{source[0 .. sourceLen-1]} |
| into the destination buffer, @code{dest[0 .. *destLen-1]}. |
| If the destination buffer is big enough, @code{*destLen} is |
| set to the size of the compressed data, and @code{BZ_OK} is |
| returned. If the compressed data won't fit, @code{*destLen} |
| is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. |
| |
| Compression in this manner is a one-shot event, done with a single call |
| to this function. The resulting compressed data is a complete |
| @code{bzip2} format data stream. There is no mechanism for making |
| additional calls to provide extra input data. If you want that kind of |
| mechanism, use the low-level interface. |
| |
| For the meaning of parameters @code{blockSize100k}, @code{verbosity} |
| and @code{workFactor}, @* see @code{bzCompressInit}. |
| |
| To guarantee that the compressed data will fit in its buffer, allocate |
| an output buffer of size 1% larger than the uncompressed data, plus |
| six hundred extra bytes. |
| |
| @code{bzBuffToBuffDecompress} will not write data at or |
| beyond @code{dest[*destLen]}, even in case of buffer overflow. |
| |
| Possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} |
| or @code{blockSize100k < 1} or @code{blockSize100k > 9} |
| or @code{verbosity < 0} or @code{verbosity > 4} |
| or @code{workFactor < 0} or @code{workFactor > 250} |
| @code{BZ_MEM_ERROR} |
| if insufficient memory is available |
| @code{BZ_OUTBUFF_FULL} |
| if the size of the compressed data exceeds @code{*destLen} |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| |
| |
| @subsection @code{bzBuffToBuffDecompress} |
| @example |
| int bzBuffToBuffDecompress ( char* dest, |
| unsigned int* destLen, |
| char* source, |
| unsigned int sourceLen, |
| int small, |
| int verbosity ); |
| @end example |
| Attempts to decompress the data in @code{source[0 .. sourceLen-1]} |
| into the destination buffer, @code{dest[0 .. *destLen-1]}. |
| If the destination buffer is big enough, @code{*destLen} is |
| set to the size of the uncompressed data, and @code{BZ_OK} is |
| returned. If the compressed data won't fit, @code{*destLen} |
| is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. |
| |
| @code{source} is assumed to hold a complete @code{bzip2} format |
| data stream. @code{bzBuffToBuffDecompress} tries to decompress |
| the entirety of the stream into the output buffer. |
| |
| For the meaning of parameters @code{small} and @code{verbosity}, |
| see @code{bzDecompressInit}. |
| |
| Because the compression ratio of the compressed data cannot be known in |
| advance, there is no easy way to guarantee that the output buffer will |
| be big enough. You may of course make arrangements in your code to |
| record the size of the uncompressed data, but such a mechanism is beyond |
| the scope of this library. |
| |
| @code{bzBuffToBuffDecompress} will not write data at or |
| beyond @code{dest[*destLen]}, even in case of buffer overflow. |
| |
| Possible return values: |
| @display |
| @code{BZ_PARAM_ERROR} |
| if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} |
| or @code{small != 0 && small != 1} |
| or @code{verbosity < 0} or @code{verbosity > 4} |
| @code{BZ_MEM_ERROR} |
| if insufficient memory is available |
| @code{BZ_OUTBUFF_FULL} |
| if the size of the compressed data exceeds @code{*destLen} |
| @code{BZ_DATA_ERROR} |
| if a data integrity error was detected in the compressed data |
| @code{BZ_DATA_ERROR_MAGIC} |
| if the compressed data doesn't begin with the right magic bytes |
| @code{BZ_UNEXPECTED_EOF} |
| if the compressed data ends unexpectedly |
| @code{BZ_OK} |
| otherwise |
| @end display |
| |
| |
| |
| @section @code{zlib} compatibility functions |
| Yoshioka Tsuneo has contributed some functions to |
| give better @code{zlib} compatibility. These functions are |
| @code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush}, |
| @code{bzclose}, |
| @code{bzerror} and @code{bzlibVersion}. |
| These functions are not (yet) officially part of |
| the library. If they break, you get to keep all the pieces. |
| Nevertheless, I think they work ok. |
| @example |
| typedef void BZFILE; |
| |
| const char * bzlibVersion ( void ); |
| @end example |
| Returns a string indicating the library version. |
| @example |
| BZFILE * bzopen ( const char *path, const char *mode ); |
| BZFILE * bzdopen ( int fd, const char *mode ); |
| @end example |
| Opens a @code{.bz2} file for reading or writing, using either its name |
| or a pre-existing file descriptor. |
| Analogous to @code{fopen} and @code{fdopen}. |
| @example |
| int bzread ( BZFILE* b, void* buf, int len ); |
| int bzwrite ( BZFILE* b, void* buf, int len ); |
| @end example |
| Reads/writes data from/to a previously opened @code{BZFILE}. |
| Analogous to @code{fread} and @code{fwrite}. |
| @example |
| int bzflush ( BZFILE* b ); |
| void bzclose ( BZFILE* b ); |
| @end example |
| Flushes/closes a @code{BZFILE}. @code{bzflush} doesn't actually do |
| anything. Analogous to @code{fflush} and @code{fclose}. |
| |
| @example |
| const char * bzerror ( BZFILE *b, int *errnum ) |
| @end example |
| Returns a string describing the more recent error status of |
| @code{b}, and also sets @code{*errnum} to its numerical value. |
| |
| |
| @section Using the library in a @code{stdio}-free environment |
| |
| @subsection Getting rid of @code{stdio} |
| |
| In a deeply embedded application, you might want to use just |
| the memory-to-memory functions. You can do this conveniently |
| by compiling the library with preprocessor symbol @code{BZ_NO_STDIO} |
| defined. Doing this gives you a library containing only the following |
| eight functions: |
| |
| @code{bzCompressInit}, @code{bzCompress}, @code{bzCompressEnd} @* |
| @code{bzDecompressInit}, @code{bzDecompress}, @code{bzDecompressEnd} @* |
| @code{bzBuffToBuffCompress}, @code{bzBuffToBuffDecompress} |
| |
| When compiled like this, all functions will ignore @code{verbosity} |
| settings. |
| |
| @subsection Critical error handling |
| @code{libbzip2} contains a number of internal assertion checks which |
| should, needless to say, never be activated. Nevertheless, if an |
| assertion should fail, behaviour depends on whether or not the library |
| was compiled with @code{BZ_NO_STDIO} set. |
| |
| For a normal compile, an assertion failure yields the message |
| @example |
| bzip2/libbzip2, v0.9.5: internal error number N. |
| This is a bug in bzip2/libbzip2, v0.9.5. Please report |
| it to me at: jseward@@acm.org. If this happened when |
| you were using some program which uses libbzip2 as a |
| component, you should also report this bug to the author(s) |
| of that program. Please make an effort to report this bug; |
| timely and accurate bug reports eventually lead to higher |
| quality software. Thanks. Julian Seward, 24 May 1999. |
| @end example |
| where @code{N} is some error code number. @code{exit(3)} |
| is then called. |
| |
| For a @code{stdio}-free library, assertion failures result |
| in a call to a function declared as: |
| @example |
| extern void bz_internal_error ( int errcode ); |
| @end example |
| The relevant code is passed as a parameter. You should supply |
| such a function. |
| |
| In either case, once an assertion failure has occurred, any |
| @code{bz_stream} records involved can be regarded as invalid. |
| You should not attempt to resume normal operation with them. |
| |
| You may, of course, change critical error handling to suit |
| your needs. As I said above, critical errors indicate bugs |
| in the library and should not occur. All "normal" error |
| situations are indicated via error return codes from functions, |
| and can be recovered from. |
| |
| |
| @section Making a Windows DLL |
| Everything related to Windows has been contributed by Yoshioka Tsuneo |
| @* (@code{QWF00133@@niftyserve.or.jp} / |
| @code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to |
| him (but perhaps Cc: me, @code{jseward@@acm.org}). |
| |
| My vague understanding of what to do is: using Visual C++ 5.0, |
| open the project file @code{libbz2.dsp}, and build. That's all. |
| |
| If you can't |
| open the project file for some reason, make a new one, naming these files: |
| @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, |
| @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* |
| @code{randtable.c} and @code{libbz2.def}. You will also need |
| to name the header files @code{bzlib.h} and @code{bzlib_private.h}. |
| |
| If you don't use VC++, you may need to define the proprocessor symbol |
| @code{_WIN32}. |
| |
| Finally, @code{dlltest.c} is a sample program using the DLL. It has a |
| project file, @code{dlltest.dsp}. |
| |
| If you just want a makefile for Visual C, have a look at |
| @code{makefile.msc}. |
| |
| Be aware that if you compile @code{bzip2} itself on Win32, you must set |
| @code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file |
| @code{bzip2.c}, before compiling. Otherwise the resulting binary won't |
| work correctly. |
| |
| I haven't tried any of this stuff myself, but it all looks plausible. |
| |
| |
| |
| @chapter Miscellanea |
| |
| These are just some random thoughts of mine. Your mileage may |
| vary. |
| |
| @section Limitations of the compressed file format |
| @code{bzip2-0.9.5} and @code{0.9.0} |
| use exactly the same file format as the previous |
| version, @code{bzip2-0.1}. This decision was made in the interests of |
| stability. Creating yet another incompatible compressed file format |
| would create further confusion and disruption for users. |
| |
| Nevertheless, this is not a painless decision. Development |
| work since the release of @code{bzip2-0.1} in August 1997 |
| has shown complexities in the file format which slow down |
| decompression and, in retrospect, are unnecessary. These are: |
| @itemize @bullet |
| @item The run-length encoder, which is the first of the |
| compression transformations, is entirely irrelevant. |
| The original purpose was to protect the sorting algorithm |
| from the very worst case input: a string of repeated |
| symbols. But algorithm steps Q6a and Q6b in the original |
| Burrows-Wheeler technical report (SRC-124) show how |
| repeats can be handled without difficulty in block |
| sorting. |
| @item The randomisation mechanism doesn't really need to be |
| there. Udi Manber and Gene Myers published a suffix |
| array construction algorithm a few years back, which |
| can be employed to sort any block, no matter how |
| repetitive, in O(N log N) time. Subsequent work by |
| Kunihiko Sadakane has produced a derivative O(N (log N)^2) |
| algorithm which usually outperforms the Manber-Myers |
| algorithm. |
| |
| I could have changed to Sadakane's algorithm, but I find |
| it to be slower than @code{bzip2}'s existing algorithm for |
| most inputs, and the randomisation mechanism protects |
| adequately against bad cases. I didn't think it was |
| a good tradeoff to make. Partly this is due to the fact |
| that I was not flooded with email complaints about |
| @code{bzip2-0.1}'s performance on repetitive data, so |
| perhaps it isn't a problem for real inputs. |
| |
| Probably the best long-term solution, |
| and the one I have incorporated into 0.9.5 and above, |
| is to use the existing sorting |
| algorithm initially, and fall back to a O(N (log N)^2) |
| algorithm if the standard algorithm gets into difficulties. |
| @item The compressed file format was never designed to be |
| handled by a library, and I have had to jump though |
| some hoops to produce an efficient implementation of |
| decompression. It's a bit hairy. Try passing |
| @code{decompress.c} through the C preprocessor |
| and you'll see what I mean. Much of this complexity |
| could have been avoided if the compressed size of |
| each block of data was recorded in the data stream. |
| @item An Adler-32 checksum, rather than a CRC32 checksum, |
| would be faster to compute. |
| @end itemize |
| It would be fair to say that the @code{bzip2} format was frozen |
| before I properly and fully understood the performance |
| consequences of doing so. |
| |
| Improvements which I was able to incorporate into |
| 0.9.0, despite using the same file format, are: |
| @itemize @bullet |
| @item Single array implementation of the inverse BWT. This |
| significantly speeds up decompression, presumably |
| because it reduces the number of cache misses. |
| @item Faster inverse MTF transform for large MTF values. The |
| new implementation is based on the notion of sliding blocks |
| of values. |
| @item @code{bzip2-0.9.0} now reads and writes files with @code{fread} |
| and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. |
| Duh! Well, you live and learn. |
| |
| @end itemize |
| Further ahead, it would be nice |
| to be able to do random access into files. This will |
| require some careful design of compressed file formats. |
| |
| |
| |
| @section Portability issues |
| After some consideration, I have decided not to use |
| GNU @code{autoconf} to configure 0.9.5. |
| |
| @code{autoconf}, admirable and wonderful though it is, |
| mainly assists with portability problems between Unix-like |
| platforms. But @code{bzip2} doesn't have much in the way |
| of portability problems on Unix; most of the difficulties appear |
| when porting to the Mac, or to Microsoft's operating systems. |
| @code{autoconf} doesn't help in those cases, and brings in a |
| whole load of new complexity. |
| |
| Most people should be able to compile the library and program |
| under Unix straight out-of-the-box, so to speak, especially |
| if you have a version of GNU C available. |
| |
| There are a couple of @code{__inline__} directives in the code. GNU C |
| (@code{gcc}) should be able to handle them. If you're not using |
| GNU C, your C compiler shouldn't see them at all. |
| If your compiler does, for some reason, see them and doesn't |
| like them, just @code{#define} @code{__inline__} to be @code{/* */}. One |
| easy way to do this is to compile with the flag @code{-D__inline__=}, |
| which should be understood by most Unix compilers. |
| |
| If you still have difficulties, try compiling with the macro |
| @code{BZ_STRICT_ANSI} defined. This should enable you to build the |
| library in a strictly ANSI compliant environment. Building the program |
| itself like this is dangerous and not supported, since you remove |
| @code{bzip2}'s checks against compressing directories, symbolic links, |
| devices, and other not-really-a-file entities. This could cause |
| filesystem corruption! |
| |
| One other thing: if you create a @code{bzip2} binary for public |
| distribution, please try and link it statically (@code{gcc -s}). This |
| avoids all sorts of library-version issues that others may encounter |
| later on. |
| |
| If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and |
| @code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling. |
| Otherwise the resulting binary won't work correctly. |
| |
| |
| |
| @section Reporting bugs |
| I tried pretty hard to make sure @code{bzip2} is |
| bug free, both by design and by testing. Hopefully |
| you'll never need to read this section for real. |
| |
| Nevertheless, if @code{bzip2} dies with a segmentation |
| fault, a bus error or an internal assertion failure, it |
| will ask you to email me a bug report. Experience with |
| version 0.1 shows that almost all these problems can |
| be traced to either compiler bugs or hardware problems. |
| @itemize @bullet |
| @item |
| Recompile the program with no optimisation, and see if it |
| works. And/or try a different compiler. |
| I heard all sorts of stories about various flavours |
| of GNU C (and other compilers) generating bad code for |
| @code{bzip2}, and I've run across two such examples myself. |
| |
| 2.7.X versions of GNU C are known to generate bad code from |
| time to time, at high optimisation levels. |
| If you get problems, try using the flags |
| @code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}. |
| You should specifically @emph{not} use @code{-funroll-loops}. |
| |
| You may notice that the Makefile runs four tests as part of |
| the build process. If the program passes all of these, it's |
| a pretty good (but not 100%) indication that the compiler has |
| done its job correctly. |
| @item |
| If @code{bzip2} crashes randomly, and the crashes are not |
| repeatable, you may have a flaky memory subsystem. @code{bzip2} |
| really hammers your memory hierarchy, and if it's a bit marginal, |
| you may get these problems. Ditto if your disk or I/O subsystem |
| is slowly failing. Yup, this really does happen. |
| |
| Try using a different machine of the same type, and see if |
| you can repeat the problem. |
| @item This isn't really a bug, but ... If @code{bzip2} tells |
| you your file is corrupted on decompression, and you |
| obtained the file via FTP, there is a possibility that you |
| forgot to tell FTP to do a binary mode transfer. That absolutely |
| will cause the file to be non-decompressible. You'll have to transfer |
| it again. |
| @end itemize |
| |
| If you've incorporated @code{libbzip2} into your own program |
| and are getting problems, please, please, please, check that the |
| parameters you are passing in calls to the library, are |
| correct, and in accordance with what the documentation says |
| is allowable. I have tried to make the library robust against |
| such problems, but I'm sure I haven't succeeded. |
| |
| Finally, if the above comments don't help, you'll have to send |
| me a bug report. Now, it's just amazing how many people will |
| send me a bug report saying something like |
| @display |
| bzip2 crashed with segmentation fault on my machine |
| @end display |
| and absolutely nothing else. Needless to say, a such a report |
| is @emph{totally, utterly, completely and comprehensively 100% useless; |
| a waste of your time, my time, and net bandwidth}. |
| With no details at all, there's no way I can possibly begin |
| to figure out what the problem is. |
| |
| The rules of the game are: facts, facts, facts. Don't omit |
| them because "oh, they won't be relevant". At the bare |
| minimum: |
| @display |
| Machine type. Operating system version. |
| Exact version of @code{bzip2} (do @code{bzip2 -V}). |
| Exact version of the compiler used. |
| Flags passed to the compiler. |
| @end display |
| However, the most important single thing that will help me is |
| the file that you were trying to compress or decompress at the |
| time the problem happened. Without that, my ability to do anything |
| more than speculate about the cause, is limited. |
| |
| Please remember that I connect to the Internet with a modem, so |
| you should contact me before mailing me huge files. |
| |
| |
| @section Did you get the right package? |
| |
| @code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles |
| and memory. Also, it gives very large latencies. In the worst case, you |
| can feed many megabytes of uncompressed data into the library before |
| getting any compressed output, so this probably rules out applications |
| requiring interactive behaviour. |
| |
| These aren't faults of my implementation, I hope, but more |
| an intrinsic property of the Burrows-Wheeler transform (unfortunately). |
| Maybe this isn't what you want. |
| |
| If you want a compressor and/or library which is faster, uses less |
| memory but gets pretty good compression, and has minimal latency, |
| consider Jean-loup |
| Gailly's and Mark Adler's work, @code{zlib-1.1.2} and |
| @code{gzip-1.2.4}. Look for them at |
| @code{http://www.cdrom.com/pub/infozip/zlib} and |
| @code{http://www.gzip.org} respectively. |
| |
| For something faster and lighter still, you might try Markus F X J |
| Oberhumer's @code{LZO} real-time compression/decompression library, at |
| @* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}. |
| |
| If you want to use the @code{bzip2} algorithms to compress small blocks |
| of data, 64k bytes or smaller, for example on an on-the-fly disk |
| compressor, you'd be well advised not to use this library. Instead, |
| I've made a special library tuned for that kind of use. It's part of |
| @code{e2compr-0.40}, an on-the-fly disk compressor for the Linux |
| @code{ext2} filesystem. Look at |
| @code{http://www.netspace.net.au/~reiter/e2compr}. |
| |
| |
| |
| @section Testing |
| |
| A record of the tests I've done. |
| |
| First, some data sets: |
| @itemize @bullet |
| @item B: a directory containing 6001 files, one for every length in the |
| range 0 to 6000 bytes. The files contain random lowercase |
| letters. 18.7 megabytes. |
| @item H: my home directory tree. Documents, source code, mail files, |
| compressed data. H contains B, and also a directory of |
| files designed as boundary cases for the sorting; mostly very |
| repetitive, nasty files. 565 megabytes. |
| @item A: directory tree holding various applications built from source: |
| @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc. |
| 2200 megabytes. |
| @end itemize |
| The tests conducted are as follows. Each test means compressing |
| (a copy of) each file in the data set, decompressing it and |
| comparing it against the original. |
| |
| First, a bunch of tests with block sizes and internal buffer |
| sizes set very small, |
| to detect any problems with the |
| blocking and buffering mechanisms. |
| This required modifying the source code so as to try to |
| break it. |
| @enumerate |
| @item Data set H, with |
| buffer size of 1 byte, and block size of 23 bytes. |
| @item Data set B, buffer sizes 1 byte, block size 1 byte. |
| @item As (2) but small-mode decompression. |
| @item As (2) with block size 2 bytes. |
| @item As (2) with block size 3 bytes. |
| @item As (2) with block size 4 bytes. |
| @item As (2) with block size 5 bytes. |
| @item As (2) with block size 6 bytes and small-mode decompression. |
| @item H with buffer size of 1 byte, but normal block |
| size (up to 900000 bytes). |
| @end enumerate |
| Then some tests with unmodified source code. |
| @enumerate |
| @item H, all settings normal. |
| @item As (1), with small-mode decompress. |
| @item H, compress with flag @code{-1}. |
| @item H, compress with flag @code{-s}, decompress with flag @code{-s}. |
| @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, |
| @code{bzip2-0.9.5} decompressing, all settings normal. |
| @item Backwards compatibility: H, @code{bzip2-0.9.5} compressing, |
| @code{bzip2-0.1pl2} decompressing, all settings normal. |
| @item Bigger tests: A, all settings normal. |
| @item As (7), using the fallback (Sadakane-like) sorting algorithm. |
| @item As (8), compress with flag @code{-1}, decompress with flag |
| @code{-s}. |
| @item H, using the fallback sorting algorithm. |
| @item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing, |
| @code{bzip2-0.9.5} decompressing, all settings normal. |
| @item Backwards compatibility: A, @code{bzip2-0.9.5} compressing, |
| @code{bzip2-0.1pl2} decompressing, all settings normal. |
| @item Misc test: about 400 megabytes of @code{.tar} files with |
| @code{bzip2} compiled with Checker (a memory access error |
| detector, like Purify). |
| @item Misc tests to make sure it builds and runs ok on non-Linux/x86 |
| platforms. |
| @end enumerate |
| These tests were conducted on a 225 MHz IDT WinChip machine, running |
| Linux 2.0.36. They represent nearly a week of continuous computation. |
| All tests completed successfully. |
| |
| |
| @section Further reading |
| @code{bzip2} is not research work, in the sense that it doesn't present |
| any new ideas. Rather, it's an engineering exercise based on existing |
| ideas. |
| |
| Four documents describe essentially all the ideas behind @code{bzip2}: |
| @example |
| Michael Burrows and D. J. Wheeler: |
| "A block-sorting lossless data compression algorithm" |
| 10th May 1994. |
| Digital SRC Research Report 124. |
| ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz |
| If you have trouble finding it, try searching at the |
| New Zealand Digital Library, http://www.nzdl.org. |
| |
| Daniel S. Hirschberg and Debra A. LeLewer |
| "Efficient Decoding of Prefix Codes" |
| Communications of the ACM, April 1990, Vol 33, Number 4. |
| You might be able to get an electronic copy of this |
| from the ACM Digital Library. |
| |
| David J. Wheeler |
| Program bred3.c and accompanying document bred3.ps. |
| This contains the idea behind the multi-table Huffman |
| coding scheme. |
| ftp://ftp.cl.cam.ac.uk/users/djw3/ |
| |
| Jon L. Bentley and Robert Sedgewick |
| "Fast Algorithms for Sorting and Searching Strings" |
| Available from Sedgewick's web page, |
| www.cs.princeton.edu/~rs |
| @end example |
| The following paper gives valuable additional insights into the |
| algorithm, but is not immediately the basis of any code |
| used in bzip2. |
| @example |
| Peter Fenwick: |
| Block Sorting Text Compression |
| Proceedings of the 19th Australasian Computer Science Conference, |
| Melbourne, Australia. Jan 31 - Feb 2, 1996. |
| ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps |
| @end example |
| Kunihiko Sadakane's sorting algorithm, mentioned above, |
| is available from: |
| @example |
| http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz |
| @end example |
| The Manber-Myers suffix array construction |
| algorithm is described in a paper |
| available from: |
| @example |
| http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps |
| @end example |
| |
| |
| |
| @contents |
| |
| @bye |
| |