bzip2(1) bzip2(1)
NAME
bzip2, bunzip2 - a block-sorting file compressor, v0.1
bzip2recover - recovers data from damaged bzip2 files
SYNOPSIS
bzip2 [ -cdfkstvVL123456789 ] [ filenames ... ]
bunzip2 [ -kvsVL ] [ filenames ... ]
bzip2recover filename
DESCRIPTION
Bzip2 compresses files using the Burrows-Wheeler block-
sorting text compression algorithm, and Huffman coding.
Compression is generally considerably better than that
achieved by more conventional LZ77/LZ78-based compressors,
and approaches the performance of the PPM family of sta-
tistical compressors.
The command-line options are deliberately very similar to
those of GNU Gzip, but they are not identical.
Bzip2 expects a list of file names to accompany the com-
mand-line flags. Each file is replaced by a compressed
version of itself, with the name "original_name.bz2".
Each compressed file has the same modification date and
permissions as the corresponding original, so that these
properties can be correctly restored at decompression
time. File name handling is naive in the sense that there
is no mechanism for preserving original file names, per-
missions and dates in filesystems which lack these con-
cepts, or have serious file name length restrictions, such
as MS-DOS.
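For example, compressing a file and later restoring it
(the file name is illustrative):
       bzip2 sample.txt          # replaces sample.txt with sample.txt.bz2
       bunzip2 sample.txt.bz2    # restores sample.txt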
Bzip2 and bunzip2 will not overwrite existing files; if
you want this to happen, you should delete them first.
If no file names are specified, bzip2 compresses from
standard input to standard output. In this case, bzip2
will decline to write compressed output to a terminal, as
this would be entirely incomprehensible and therefore
pointless.
Bunzip2 (or bzip2 -d ) decompresses and restores all spec-
ified files whose names end in ".bz2". Files without this
suffix are ignored. Again, supplying no filenames causes
decompression from standard input to standard output.
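For example, to compress and decompress through pipes
(file names are illustrative; the tar pairing is just one
common use):
       bzip2 < notes.txt > notes.txt.bz2
       bunzip2 < notes.txt.bz2 > notes.txt
       tar cf - somedir | bzip2 > somedir.tar.bz2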
You can also compress or decompress files to the standard
output by giving the -c flag. You can decompress multiple
files like this, but you may only compress a single file
this way, since it would otherwise be difficult to sepa-
rate out the compressed representations of the original
files.
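For example (file names are illustrative):
       bzip2 -c one.txt > one.txt.bz2      # compress one file to stdout
       bzip2 -dc a.bz2 b.bz2 > combined    # decompress several files to stdout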
Compression is always performed, even if the compressed
file is slightly larger than the original. Files of less
than about one hundred bytes tend to get larger, since the
compression mechanism has a constant overhead in the
region of 50 bytes. Random data (including the output of
most file compressors) is coded at about 8.05 bits per
byte, giving an expansion of around 0.5%.
As a self-check for your protection, bzip2 uses 32-bit
CRCs to make sure that the decompressed version of a file
is identical to the original. This guards against corrup-
tion of the compressed data, and against undetected bugs
in bzip2 (hopefully very unlikely). The chance of data
corruption going undetected is microscopic, about one
chance in four billion for each file processed. Be aware,
though, that the check occurs upon decompression, so it
can only tell you that something is wrong. It can't
help you recover the original uncompressed data. You can
use bzip2recover to try to recover data from damaged
files.
Return values: 0 for a normal exit, 1 for environmental
problems (file not found, invalid flags, I/O errors, &c),
2 to indicate a corrupt compressed file, 3 for an internal
consistency error (e.g., a bug) which caused bzip2 to panic.
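A shell script can branch on these return values; a mini-
mal sketch (the file name is illustrative):
       bzip2 -t backup.bz2
       case $? in
           0) echo "backup.bz2 is intact" ;;
           2) echo "backup.bz2 is corrupt" ;;
           *) echo "some other problem occurred" ;;
       esac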
MEMORY MANAGEMENT
Bzip2 compresses large files in blocks. The block size
affects both the compression ratio achieved, and the
amount of memory needed both for compression and decom-
pression. The flags -1 through -9 specify the block size
to be 100,000 bytes through 900,000 bytes (the default)
respectively. At decompression-time, the block size used
for compression is read from the header of the compressed
file, and bunzip2 then allocates itself just enough memory
to decompress the file. Since block sizes are stored in
compressed files, it follows that the flags -1 to -9 are
irrelevant to, and ignored during, decompression. Com-
pression and decompression requirements, in bytes, can be
estimated as:
Compression: 400k + ( 7 x block size )
Decompression: 100k + ( 5 x block size ), or
100k + ( 2.5 x block size )
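For example, with the default 900k block size these for-
mulas give:
       Compression:        400k + ( 7   x 900k ) = 6700 kbytes
       Decompression:      100k + ( 5   x 900k ) = 4600 kbytes
       Decompression (-s): 100k + ( 2.5 x 900k ) = 2350 kbytes
These figures match the -9 row of the table below.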
Larger block sizes give rapidly diminishing marginal
returns; most of the compression comes from the first two
or three hundred k of block size, a fact worth bearing in
mind when using bzip2 on small machines. It is also
important to appreciate that the decompression memory
requirement is set at compression-time by the choice of
block size.
For files compressed with the default 900k block size,
bunzip2 will require about 4600 kbytes to decompress. To
support decompression of any file on a 4 megabyte machine,
bunzip2 has an option to decompress using approximately
half this amount of memory, about 2300 kbytes. Decompres-
sion speed is also halved, so you should use this option
only where necessary. The relevant flag is -s.
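For example (the file name is illustrative):
       bunzip2 -s large.bz2    # uses about 2300k rather than 4600k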
In general, try to use the largest block size memory con-
straints allow, since that maximises the compression
achieved. Compression and decompression speed are virtu-
ally unaffected by block size.
Another significant point applies to files which fit in a
single block -- that means most files you'd encounter
using a large block size. The amount of real memory
touched is proportional to the size of the file, since the
file is smaller than a block. For example, compressing a
file 20,000 bytes long with the flag -9 will cause the
compressor to allocate around 6700k of memory, but only
touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the
decompressor will allocate 4600k but only touch 100k +
20000 * 5 = 200 kbytes.
Here is a table which summarises the maximum memory usage
for different block sizes. Also recorded is the total
compressed size for 14 files of the Calgary Text Compres-
sion Corpus totalling 3,141,622 bytes. This column gives
some feel for how compression varies with block size.
These figures tend to understate the advantage of larger
block sizes for larger files, since the Corpus is domi-
nated by smaller files.
        Flag   Compress   Decompress   Decompress   Corpus size
                 usage      usage       -s usage      (bytes)

         -1      1100k       600k         350k        914704
         -2      1800k      1100k         600k        877703
         -3      2500k      1600k         850k        860338
         -4      3200k      2100k        1100k        846899
         -5      3900k      2600k        1350k        845160
         -6      4600k      3100k        1600k        838626
         -7      5400k      3600k        1850k        834096
         -8      6000k      4100k        2100k        828642
         -9      6700k      4600k        2350k        828642
OPTIONS
-c --stdout
Compress or decompress to standard output. -c will
decompress multiple files to stdout, but will only
compress a single file to stdout.
-d --decompress
Force decompression. Bzip2 and bunzip2 are really
the same program, and the decision about whether to
compress or decompress is made on the basis of
which name is used. This flag overrides that mech-
anism, and forces bzip2 to decompress.
-f --compress
The complement to -d: forces compression, regard-
less of the invocation name.
-t --test
Check integrity of the specified file(s), but don't
decompress them. This really performs a trial
decompression and throws away the result, using the
low-memory decompression algorithm (see -s).
-k --keep
Keep (don't delete) input files during compression
or decompression.
-s --small
Reduce memory usage, both for compression and
decompression. Files are decompressed using a mod-
ified algorithm which only requires 2.5 bytes per
block byte. This means any file can be decom-
pressed in 2300k of memory, albeit somewhat more
slowly than usual.
During compression, -s selects a block size of
200k, which limits memory use to around the same
figure, at the expense of your compression ratio.
In short, if your machine is low on memory (8
megabytes or less), use -s for everything. See
MEMORY MANAGEMENT above.
-v --verbose
Verbose mode -- show the compression ratio for each
file processed. Further -v's increase the ver-
bosity level, spewing out lots of information which
is primarily of interest for diagnostic purposes.
-L --license
Display the software version, license terms and
conditions.
-V --version
Same as -L.
-1 to -9
Set the block size to 100 k, 200 k .. 900 k when
compressing. Has no effect when decompressing.
See MEMORY MANAGEMENT above.
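For example, to compress with the smallest block size,
capping decompression memory at about 600k (the file name
is illustrative):
       bzip2 -1 somefile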
--repetitive-fast
bzip2 injects some small pseudo-random variations
into very repetitive blocks to limit worst-case
performance during compression. If sorting runs
into difficulties, the block is randomised, and
sorting is restarted. Very roughly, bzip2 persists
for three times as long as a well-behaved input
would take before resorting to randomisation. This
flag makes it give up much sooner.
--repetitive-best
Opposite of --repetitive-fast; try a lot harder
before resorting to randomisation.
RECOVERING DATA FROM DAMAGED FILES
bzip2 compresses files in blocks, usually 900 kbytes long.
Each block is handled independently. If a media or trans-
mission error causes a multi-block .bz2 file to become
damaged, it may be possible to recover data from the
undamaged blocks in the file.
The compressed representation of each block is delimited
by a 48-bit pattern, which makes it possible to find the
block boundaries with reasonable certainty. Each block
also carries its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.
bzip2recover is a simple program whose purpose is to
search for blocks in .bz2 files, and write each block out
into its own .bz2 file. You can then use bzip2 -t to test
the integrity of the resulting files, and decompress those
which are undamaged.
bzip2recover takes a single argument, the name of the dam-
aged file, and writes a number of files "rec0001file.bz2",
"rec0002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of wild-
cards in subsequent processing -- for example, "bzip2 -dc
rec*file.bz2 > recovered_data" -- lists the files in the
"right" order.
bzip2recover should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to min-
imise any potential data loss through media or transmis-
sion errors, you might consider compressing with a smaller
block size.
PERFORMANCE NOTES
The sorting phase of compression gathers together similar
strings in the file. Because of this, files containing
very long runs of repeated symbols, like "aabaabaabaab
..." (repeated several hundred times) may compress
extraordinarily slowly. You can use the -vvvvv option to
monitor progress in great detail, if you want. Decompres-
sion speed is unaffected.
Such pathological cases seem rare in practice, appearing
mostly in artificially-constructed test files, and in low-
level disk images. It may be inadvisable to use bzip2 to
compress the latter. If you do get a file which causes
severe slowness in compression, try making the block size
as small as possible, with flag -1.
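For example (the file name is illustrative):
       bzip2 -1 -vvvvv disk.img   # smallest block, detailed progress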
Incompressible or virtually-incompressible data may decom-
press rather more slowly than one would hope. This is due
to a naive implementation of the move-to-front coder.
bzip2 usually allocates several megabytes of memory to
operate in, and then charges all over it in a fairly ran-
dom fashion. This means that performance, both for com-
pressing and decompressing, is largely determined by the
speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the
miss rate have been observed to give disproportionately
large performance improvements. I imagine bzip2 will per-
form best on machines with very large caches.
Test mode (-t) uses the low-memory decompression algorithm
(-s). This means test mode is slower than it needs to be;
there is no reason it could not run as fast as the normal
decompression machinery. This could easily be fixed at the
cost of some code bloat.
CAVEATS
I/O error messages are not as helpful as they could be.
Bzip2 tries hard to detect I/O errors and exit cleanly,
but the details of what the problem is sometimes seem
rather misleading.
This manual page pertains to version 0.1 of bzip2. It may
well happen that some future version will use a different
compressed file format. If you try to decompress, using
0.1, a .bz2 file created with some future version which
uses a different compressed file format, 0.1 will complain
that your file "is not a bzip2 file". If that happens,
you should obtain a more recent version of bzip2 and use
that to decompress the file.
Wildcard expansion for Windows 95 and NT is flaky.
bzip2recover uses 32-bit integers to represent bit posi-
tions in compressed files, so it cannot handle compressed
files more than 512 megabytes (2^32 bits) long. This
could easily be fixed.
bzip2recover sometimes reports a very small, incomplete
final block. This is spurious and can be safely ignored.
RELATIONSHIP TO bzip-0.21
This program is a descendant of the bzip program, version
0.21, which I released in August 1996. The primary dif-
ference of bzip2 is its avoidance of the possibly patented
algorithms which were used in 0.21. bzip2 also brings
various useful refinements (-s, -t), uses less memory,
decompresses significantly faster, and has support for
recovering data from damaged files.
Because bzip2 uses Huffman coding to construct the com-
pressed bitstream, rather than the arithmetic coding used
in 0.21, the compressed representations generated by the
two programs are incompatible, and they will not interop-
erate. The change in suffix from .bz to .bz2 reflects
this. It would have been helpful to at least allow bzip2
to decompress files created by 0.21, but this would defeat
the primary aim of having a patent-free compressor.
For a more precise statement about patent issues in bzip2,
please see the README file in the distribution.
Huffman coding necessarily involves some coding ineffi-
ciency compared to arithmetic coding. This means that
bzip2 compresses about 1% worse than 0.21, an unfortunate
but unavoidable fact-of-life. On the other hand, decom-
pression is approximately 50% faster for the same reason,
and the change in file format gave an opportunity to add
data-recovery features. So it is not all bad.
AUTHOR
Julian Seward, jseward@acm.org.
The ideas embodied in bzip and bzip2 are due to (at least)
the following people: Michael Burrows and David Wheeler
(for the block sorting transformation), David Wheeler
(again, for the Huffman coder), Peter Fenwick (for the
structured coding model in 0.21, and many refinements),
and Alistair Moffat, Radford Neal and Ian Witten (for the
arithmetic coder in 0.21). I am much indebted for their
help, support and advice. See the file ALGORITHMS in the
source distribution for pointers to sources of documenta-
tion. Christian von Roques encouraged me to look for
faster sorting algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case com-
pression performance. Many people sent patches, helped
with portability problems, lent machines, gave advice and
were generally helpful.