| .PU |
| .TH bzip2 1 |
| .SH NAME |
| bzip2, bunzip2 \- a block-sorting file compressor, v0.9.0 |
| .br |
| bzcat \- decompresses files to stdout |
| .br |
| bzip2recover \- recovers data from damaged bzip2 files |
| |
| .SH SYNOPSIS |
| .ll +8 |
| .B bzip2 |
| .RB [ " \-cdfkstvzVL123456789 " ] |
| [ |
| .I "filenames \&..." |
| ] |
| .ll -8 |
| .br |
| .B bunzip2 |
| .RB [ " \-fkvsVL " ] |
| [ |
| .I "filenames \&..." |
| ] |
| .br |
| .B bzcat |
| .RB [ " \-s " ] |
| [ |
| .I "filenames \&..." |
| ] |
| .br |
| .B bzip2recover |
| .I "filename" |
| |
| .SH DESCRIPTION |
| .I bzip2 |
| compresses files using the Burrows-Wheeler block-sorting |
| text compression algorithm, and Huffman coding. |
| Compression is generally considerably |
| better than that |
| achieved by more conventional LZ77/LZ78-based compressors, |
| and approaches the performance of the PPM family of statistical |
| compressors. |
| |
| The command-line options are deliberately very similar to |
| those of |
| .I GNU Gzip, |
| but they are not identical. |
| |
| .I bzip2 |
| expects a list of file names to accompany the command-line flags. |
| Each file is replaced by a compressed version of itself, |
| with the name "original_name.bz2". |
| Each compressed file has the same modification date and permissions |
| as the corresponding original, so that these properties can be |
| correctly restored at decompression time. File name handling is |
| naive in the sense that there is no mechanism for preserving |
| original file names, permissions and dates in filesystems |
| which lack these concepts, or have serious file name length |
| restrictions, such as MS-DOS. |
| |
| .I bzip2 |
| and |
| .I bunzip2 |
| will by default not overwrite existing files; |
| if you want this to happen, specify the \-f flag. |
| |
| If no file names are specified, |
| .I bzip2 |
| compresses from standard input to standard output. |
| In this case, |
| .I bzip2 |
| will decline to write compressed output to a terminal, as |
| this would be entirely incomprehensible and therefore pointless. |
| |
| .I bunzip2 |
| (or |
| .I bzip2 \-d |
| ) decompresses and restores all specified files whose names |
| end in ".bz2". |
| Files without this suffix are ignored. |
| Again, supplying no filenames |
| causes decompression from standard input to standard output. |
| |
| .I bunzip2 |
| will correctly decompress a file which is the concatenation |
| of two or more compressed files. The result is the concatenation |
| of the corresponding uncompressed files. Integrity testing |
| (\-t) of concatenated compressed files is also supported. |
| |
| You can also compress or decompress files to |
| the standard output by giving the \-c flag. |
| Multiple files may be compressed and decompressed like this. |
| The resulting outputs are fed sequentially to stdout. |
| Compression of multiple files in this manner generates |
| a stream containing multiple compressed file representations. |
| Such a stream can be decompressed correctly only by |
| .I bzip2 |
| version 0.9.0 or later. Earlier versions of |
| .I bzip2 |
| will stop after decompressing the first file in the stream. |
| |
| .I bzcat |
| (or |
| .I bzip2 \-dc |
| ) decompresses all specified files to the standard output. |
| |
| Compression is always performed, even if the compressed file is |
| slightly larger than the original. Files of less than about |
| one hundred bytes tend to get larger, since the compression |
| mechanism has a constant overhead in the region of 50 bytes. |
| Random data (including the output of most file compressors) |
| is coded at about 8.05 bits per byte, giving an expansion of |
| around 0.5%. |
| |
| As a self-check for your protection, |
| .I bzip2 |
| uses 32-bit CRCs to make sure that the decompressed |
| version of a file is identical to the original. |
| This guards against corruption of the compressed data, |
| and against undetected bugs in |
| .I bzip2 |
| (hopefully very unlikely). |
| The chances of data corruption going undetected is |
| microscopic, about one chance in four billion |
| for each file processed. Be aware, though, that the check |
| occurs upon decompression, so it can only tell you that |
| that something is wrong. It can't help you recover the |
| original uncompressed data. |
| You can use |
| .I bzip2recover |
| to try to recover data from damaged files. |
| |
| Return values: |
| 0 for a normal exit, |
| 1 for environmental |
| problems (file not found, invalid flags, I/O errors, &c), |
| 2 to indicate a corrupt compressed file, |
| 3 for an internal consistency error (eg, bug) which caused |
| .I bzip2 |
| to panic. |
| |
| .SH MEMORY MANAGEMENT |
| .I Bzip2 |
| compresses large files in blocks. The block size affects both the |
| compression ratio achieved, and the amount of memory needed both for |
| compression and decompression. The flags \-1 through \-9 |
| specify the block size to be 100,000 bytes through 900,000 bytes |
| (the default) respectively. At decompression-time, the block size used for |
| compression is read from the header of the compressed file, and |
| .I bunzip2 |
| then allocates itself just enough memory to decompress the file. |
| Since block sizes are stored in compressed files, it follows that the flags |
| \-1 to \-9 |
| are irrelevant to and so ignored during decompression. |
| Compression and decompression requirements, in bytes, can be estimated as: |
| |
| Compression: 400k + ( 7 x block size ) |
| |
| Decompression: 100k + ( 4 x block size ), or |
| .br |
| 100k + ( 2.5 x block size ) |
| |
| Larger block sizes give rapidly diminishing marginal returns; most |
| of the |
| compression comes from the first two or three hundred k of block size, |
| a fact worth bearing in mind when using |
| .I bzip2 |
| on small machines. It is also important to appreciate that the |
| decompression memory requirement is set at compression-time by the |
| choice of block size. |
| |
| For files compressed with the default 900k block size, |
| .I bunzip2 |
| will require about 3700 kbytes to decompress. |
| To support decompression of any file on a 4 megabyte machine, |
| .I bunzip2 |
| has an option to decompress using approximately half this |
| amount of memory, about 2300 kbytes. Decompression speed is |
| also halved, so you should use this option only where necessary. |
| The relevant flag is \-s. |
| |
| In general, try and use the largest block size |
| memory constraints allow, since that maximises the compression |
| achieved. Compression and decompression |
| speed are virtually unaffected by block size. |
| |
| Another significant point applies to files which fit in a single |
| block -- that means most files you'd encounter using a large |
| block size. The amount of real memory touched is proportional |
| to the size of the file, since the file is smaller than a block. |
| For example, compressing a file 20,000 bytes long with the flag |
| \-9 |
| will cause the compressor to allocate around |
| 6700k of memory, but only touch 400k + 20000 * 7 = 540 |
| kbytes of it. Similarly, the decompressor will allocate 3700k but |
| only touch 100k + 20000 * 4 = 180 kbytes. |
| |
| Here is a table which summarises the maximum memory usage for |
| different block sizes. Also recorded is the total compressed |
| size for 14 files of the Calgary Text Compression Corpus |
| totalling 3,141,622 bytes. This column gives some feel for how |
| compression varies with block size. These figures tend to understate |
| the advantage of larger block sizes for larger files, since the |
| Corpus is dominated by smaller files. |
| |
| Compress Decompress Decompress Corpus |
| Flag usage usage -s usage Size |
| |
| -1 1100k 500k 350k 914704 |
| -2 1800k 900k 600k 877703 |
| -3 2500k 1300k 850k 860338 |
| -4 3200k 1700k 1100k 846899 |
| -5 3900k 2100k 1350k 845160 |
| -6 4600k 2500k 1600k 838626 |
| -7 5400k 2900k 1850k 834096 |
| -8 6000k 3300k 2100k 828642 |
| -9 6700k 3700k 2350k 828642 |
| |
| .SH OPTIONS |
| .TP |
| .B \-c --stdout |
| Compress or decompress to standard output. \-c will decompress |
| multiple files to stdout, but will only compress a single file to |
| stdout. |
| .TP |
| .B \-d --decompress |
| Force decompression. |
| .I bzip2, |
| .I bunzip2 |
| and |
| .I bzcat |
| are really the same program, and the decision about what actions |
| to take is done on the basis of which name is |
| used. This flag overrides that mechanism, and forces |
| .I bzip2 |
| to decompress. |
| .TP |
| .B \-z --compress |
| The complement to \-d: forces compression, regardless of the invokation |
| name. |
| .TP |
| .B \-t --test |
| Check integrity of the specified file(s), but don't decompress them. |
| This really performs a trial decompression and throws away the result. |
| .TP |
| .B \-f --force |
| Force overwrite of output files. Normally, |
| .I bzip2 |
| will not overwrite existing output files. |
| .TP |
| .B \-k --keep |
| Keep (don't delete) input files during compression or decompression. |
| .TP |
| .B \-s --small |
| Reduce memory usage, for compression, decompression and |
| testing. |
| Files are decompressed and tested using a modified algorithm which only |
| requires 2.5 bytes per block byte. This means any file can be |
| decompressed in 2300k of memory, albeit at about half the normal |
| speed. |
| |
| During compression, -s selects a block size of 200k, which limits |
| memory use to around the same figure, at the expense of your |
| compression ratio. In short, if your machine is low on memory |
| (8 megabytes or less), use -s for everything. See |
| MEMORY MANAGEMENT above. |
| .TP |
| .B \-v --verbose |
| Verbose mode -- show the compression ratio for each file processed. |
| Further \-v's increase the verbosity level, spewing out lots of |
| information which is primarily of interest for diagnostic purposes. |
| .TP |
| .B \-L --license -V --version |
| Display the software version, license terms and conditions. |
| .TP |
| .B \-1 to \-9 |
| Set the block size to 100 k, 200 k .. 900 k when |
| compressing. Has no effect when decompressing. |
| See MEMORY MANAGEMENT above. |
| .TP |
| .B \--repetitive-fast |
| .I bzip2 |
| injects some small pseudo-random variations |
| into very repetitive blocks to limit |
| worst-case performance during compression. |
| If sorting runs into difficulties, the block |
| is randomised, and sorting is restarted. |
| Very roughly, |
| .I bzip2 |
| persists for three times as long as a well-behaved input |
| would take before resorting to randomisation. |
| This flag makes it give up much sooner. |
| |
| .TP |
| .B \--repetitive-best |
| Opposite of \--repetitive-fast; try a lot harder before |
| resorting to randomisation. |
| |
| .SH RECOVERING DATA FROM DAMAGED FILES |
| .I bzip2 |
| compresses files in blocks, usually 900kbytes long. |
| Each block is handled independently. If a media or |
| transmission error causes a multi-block .bz2 |
| file to become damaged, |
| it may be possible to recover data from the undamaged blocks |
| in the file. |
| |
| The compressed representation of each block is delimited by |
| a 48-bit pattern, which makes it possible to find the block |
| boundaries with reasonable certainty. Each block also carries |
| its own 32-bit CRC, so damaged blocks can be |
| distinguished from undamaged ones. |
| |
| .I bzip2recover |
| is a simple program whose purpose is to search for |
| blocks in .bz2 files, and write each block out into |
| its own .bz2 file. You can then use |
| .I bzip2 -t |
| to test the integrity of the resulting files, |
| and decompress those which are undamaged. |
| |
| .I bzip2recover |
| takes a single argument, the name of the damaged file, |
| and writes a number of files "rec0001file.bz2", "rec0002file.bz2", |
| etc, containing the extracted blocks. The output filenames |
| are designed so that the use of wildcards in subsequent processing |
| -- for example, "bzip2 -dc rec*file.bz2 > recovered_data" -- |
| lists the files in the "right" order. |
| |
| .I bzip2recover |
| should be of most use dealing with large .bz2 files, as |
| these will contain many blocks. It is clearly futile to |
| use it on damaged single-block files, since a damaged |
| block cannot be recovered. If you wish to minimise |
| any potential data loss through media or transmission |
| errors, you might consider compressing with a smaller |
| block size. |
| |
| .SH PERFORMANCE NOTES |
| The sorting phase of compression gathers together similar strings |
| in the file. Because of this, files containing very long |
| runs of repeated symbols, like "aabaabaabaab ..." (repeated |
| several hundred times) may compress extraordinarily slowly. |
| You can use the |
| \-vvvvv |
| option to monitor progress in great detail, if you want. |
| Decompression speed is unaffected. |
| |
| Such pathological cases |
| seem rare in practice, appearing mostly in artificially-constructed |
| test files, and in low-level disk images. It may be inadvisable to |
| use |
| .I bzip2 |
| to compress the latter. |
| If you do get a file which causes severe slowness in compression, |
| try making the block size as small as possible, with flag \-1. |
| |
| .I bzip2 |
| usually allocates several megabytes of memory to operate in, |
| and then charges all over it in a fairly random fashion. This |
| means that performance, both for compressing and decompressing, |
| is largely determined by the speed |
| at which your machine can service cache misses. |
| Because of this, small changes |
| to the code to reduce the miss rate have been observed to give |
| disproportionately large performance improvements. |
| I imagine |
| .I bzip2 |
| will perform best on machines with very large caches. |
| |
| .SH CAVEATS |
| I/O error messages are not as helpful as they could be. |
| .I Bzip2 |
| tries hard to detect I/O errors and exit cleanly, but the |
| details of what the problem is sometimes seem rather misleading. |
| |
| This manual page pertains to version 0.9.0 of |
| .I bzip2. |
| Compressed data created by this version is entirely forwards and |
| backwards compatible with the previous public release, version 0.1pl2, |
| but with the following exception: 0.9.0 can correctly decompress |
| multiple concatenated compressed files. 0.1pl2 cannot do this; it |
| will stop after decompressing just the first file in the stream. |
| |
| Wildcard expansion for Windows 95 and NT |
| is flaky. |
| |
| .I bzip2recover |
| uses 32-bit integers to represent bit positions in |
| compressed files, so it cannot handle compressed files |
| more than 512 megabytes long. This could easily be fixed. |
| |
| .SH AUTHOR |
| Julian Seward, jseward@acm.org. |
| |
| http://www.muraroa.demon.co.uk |
| |
| The ideas embodied in |
| .I bzip2 |
| are due to (at least) the following people: |
| Michael Burrows and David Wheeler (for the block sorting |
| transformation), David Wheeler (again, for the Huffman coder), |
| Peter Fenwick (for the structured coding model in the original |
| .I bzip, |
| and many refinements), |
| and |
| Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic |
| coder in the original |
| .I bzip). |
| I am much indebted for their help, support and advice. |
| See the manual in the source distribution for pointers to |
| sources of documentation. |
| Christian von Roques encouraged me to look for faster |
| sorting algorithms, so as to speed up compression. |
| Bela Lubkin encouraged me to improve the worst-case |
| compression performance. |
| Many people sent patches, helped with portability problems, |
| lent machines, gave advice and were generally helpful. |
| |