bzip2.1 - third_party/bzip2 - Git at Google

 .PU
 .TH bzip2 1
 .SH NAME
 bzip2, bunzip2 \- a block-sorting file compressor, v0.1
 .br
 bzip2recover \- recovers data from damaged bzip2 files

 .SH SYNOPSIS
 .ll +8
 .B bzip2
 .RB [ " \-cdfkstvVL123456789 " ]
 [
 .I "filenames \&..."
 ]
 .ll -8
 .br
 .B bunzip2
 .RB [ " \-kvsVL " ]
 [
 .I "filenames \&..."
 ]
 .br
 .B bzip2recover
 .I "filename"

 .SH DESCRIPTION
 .I Bzip2
 compresses files using the Burrows-Wheeler block-sorting
 text compression algorithm, and Huffman coding.
 Compression is generally considerably
 better than that
 achieved by more conventional LZ77/LZ78-based compressors,
 and approaches the performance of the PPM family of statistical
 compressors.

 The command-line options are deliberately very similar to
 those of
 .I GNU Gzip,
 but they are not identical.

 .I Bzip2
 expects a list of file names to accompany the command-line flags.
 Each file is replaced by a compressed version of itself,
 with the name "original_name.bz2".
 Each compressed file has the same modification date and permissions
 as the corresponding original, so that these properties can be
 correctly restored at decompression time.  File name handling is
 naive in the sense that there is no mechanism for preserving
 original file names, permissions and dates in filesystems
 which lack these concepts, or have serious file name length
 restrictions, such as MS-DOS.

 .I Bzip2
 and
 .I bunzip2
 will not overwrite existing files; if you want this to happen,
 you should delete them first.

 If no file names are specified,
 .I bzip2
 compresses from standard input to standard output.
 In this case,
 .I bzip2
 will decline to write compressed output to a terminal, as
 this would be entirely incomprehensible and therefore pointless.

 .I Bunzip2
 (or
 .I bzip2 \-d
 ) decompresses and restores all specified files whose names
 end in ".bz2".
 Files without this suffix are ignored.
 Again, supplying no filenames
 causes decompression from standard input to standard output.

 You can also compress or decompress files to
 the standard output by giving the \-c flag.
 You can decompress multiple files like this, but you may
 only compress a single file this way, since it would otherwise
 be difficult to separate out the compressed representations of
 the original files.

 Compression is always performed, even if the compressed file is
 slightly larger than the original.  Files of less than about
 one hundred bytes tend to get larger, since the compression
 mechanism has a constant overhead in the region of 50 bytes.
 Random data (including the output of most file compressors)
 is coded at about 8.05 bits per byte, giving an expansion of
 around 0.5%.

 As a self-check for your protection,
 .I bzip2
 uses 32-bit CRCs to make sure that the decompressed
 version of a file is identical to the original.
 This guards against corruption of the compressed data,
 and against undetected bugs in
 .I bzip2
 (hopefully very unlikely).
 The chances of data corruption going undetected is
 microscopic, about one chance in four billion
 for each file processed.  Be aware, though, that the check
 occurs upon decompression, so it can only tell you that
 that something is wrong.  It can't help you recover the
 original uncompressed data.
 You can use
 .I bzip2recover
 to try to recover data from damaged files.

 Return values:
 0 for a normal exit,
 1 for environmental
 problems (file not found, invalid flags, I/O errors, &c),
 2 to indicate a corrupt compressed file,
 3 for an internal consistency error (eg, bug) which caused
 .I bzip2
 to panic.

 .SH MEMORY MANAGEMENT
 .I Bzip2
 compresses large files in blocks.  The block size affects both the
 compression ratio achieved, and the amount of memory needed both for
 compression and decompression.  The flags \-1 through \-9
 specify the block size to be 100,000 bytes through 900,000 bytes
 (the default) respectively.  At decompression-time, the block size used for
 compression is read from the header of the compressed file, and
 .I bunzip2
 then allocates itself just enough memory to decompress the file.
 Since block sizes are stored in compressed files, it follows that the flags
 \-1 to \-9
 are irrelevant to and so ignored during decompression.
 Compression and decompression requirements, in bytes, can be estimated as:

       Compression:   400k + ( 7 x block size )

       Decompression: 100k + ( 5 x block size ), or
 .br
                      100k + ( 2.5 x block size )

 Larger block sizes give rapidly diminishing marginal returns; most
 of the
 compression comes from the first two or three hundred k of block size,
 a fact worth bearing in mind when using
 .I bzip2
 on small machines.  It is also important to appreciate that the
 decompression memory requirement is set at compression-time by the
 choice of block size.

 For files compressed with the default 900k block size,
 .I bunzip2
 will require about 4600 kbytes to decompress.
 To support decompression of any file on a 4 megabyte machine,
 .I bunzip2
 has an option to decompress using approximately half this
 amount of memory, about 2300 kbytes.  Decompression speed is
 also halved, so you should use this option only where necessary.
 The relevant flag is \-s.

 In general, try and use the largest block size
 memory constraints allow, since that maximises the compression
 achieved.  Compression and decompression
 speed are virtually unaffected by block size.

 Another significant point applies to files which fit in a single
 block -- that means most files you'd encounter using a large
 block size.  The amount of real memory touched is proportional
 to the size of the file, since the file is smaller than a block.
 For example, compressing a file 20,000 bytes long with the flag
 \-9
 will cause the compressor to allocate around
 6700k of memory, but only touch 400k + 20000 * 7 = 540
 kbytes of it.  Similarly, the decompressor will allocate 4600k but
 only touch 100k + 20000 * 5 = 200 kbytes.

 Here is a table which summarises the maximum memory usage for
 different block sizes.  Also recorded is the total compressed
 size for 14 files of the Calgary Text Compression Corpus
 totalling 3,141,622 bytes.  This column gives some feel for how
 compression varies with block size.  These figures tend to understate
 the advantage of larger block sizes for larger files, since the
 Corpus is dominated by smaller files.

            Compress   Decompress   Decompress   Corpus
     Flag     usage      usage       -s usage     Size

      -1      1100k       600k         350k      914704
      -2      1800k      1100k         600k      877703
      -3      2500k      1600k         850k      860338
      -4      3200k      2100k        1100k      846899
      -5      3900k      2600k        1350k      845160
      -6      4600k      3100k        1600k      838626
      -7      5400k      3600k        1850k      834096
      -8      6000k      4100k        2100k      828642
      -9      6700k      4600k        2350k      828642

 .SH OPTIONS
 .TP
 .B \-c  --stdout
 Compress or decompress to standard output.  \-c will decompress
 multiple files to stdout, but will only compress a single file to
 stdout.
 .TP
 .B \-d --decompress
 Force decompression.
 .I Bzip2
 and
 .I bunzip2
 are really the same program, and the decision about whether to
 compress or decompress is done on the basis of which name is
 used.  This flag overrides that mechanism, and forces
 .I bzip2
 to decompress.
 .TP
 .B \-f --compress
 The complement to \-d: forces compression, regardless of the invokation
 name.
 .TP
 .B \-t --test
 Check integrity of the specified file(s), but don't decompress them.
 This really performs a trial decompression and throws away the result,
 using the low-memory decompression algorithm (see \-s).
 .TP
 .B \-k --keep
 Keep (don't delete) input files during compression or decompression.
 .TP
 .B \-s --small
 Reduce memory usage, both for compression and decompression.
 Files are decompressed using a modified algorithm which only
 requires 2.5 bytes per block byte.  This means any file can be
 decompressed in 2300k of memory, albeit somewhat more slowly than
 usual.

 During compression, -s selects a block size of 200k, which limits
 memory use to around the same figure, at the expense of your
 compression ratio.  In short, if your machine is low on memory
 (8 megabytes or less), use -s for everything.  See
 MEMORY MANAGEMENT above.

 .TP
 .B \-v --verbose
 Verbose mode -- show the compression ratio for each file processed.
 Further \-v's increase the verbosity level, spewing out lots of
 information which is primarily of interest for diagnostic purposes.
 .TP
 .B \-L --license
 Display the software version, license terms and conditions.
 .TP
 .B \-V --version
 Same as \-L.
 .TP
 .B \-1 to \-9
 Set the block size to 100 k, 200 k .. 900 k when
 compressing.  Has no effect when decompressing.
 See MEMORY MANAGEMENT above.
 .TP
 .B \--repetitive-fast
 .I bzip2
 injects some small pseudo-random variations
 into very repetitive blocks to limit
 worst-case performance during compression.
 If sorting runs into difficulties, the block
 is randomised, and sorting is restarted.
 Very roughly,
 .I bzip2
 persists for three times as long as a well-behaved input
 would take before resorting to randomisation.
 This flag makes it give up much sooner.

 .TP
 .B \--repetitive-best
 Opposite of \--repetitive-fast; try a lot harder before
 resorting to randomisation.

 .SH RECOVERING DATA FROM DAMAGED FILES
 .I bzip2
 compresses files in blocks, usually 900kbytes long.
 Each block is handled independently.  If a media or
 transmission error causes a multi-block .bz2
 file to become damaged,
 it may be possible to recover data from the undamaged blocks
 in the file.

 The compressed representation of each block is delimited by
 a 48-bit pattern, which makes it possible to find the block
 boundaries with reasonable certainty.  Each block also carries
 its own 32-bit CRC, so damaged blocks can be
 distinguished from undamaged ones.

 .I bzip2recover
 is a simple program whose purpose is to search for
 blocks in .bz2 files, and write each block out into
 its own .bz2 file.  You can then use
 .I bzip2 -t
 to test the integrity of the resulting files,
 and decompress those which are undamaged.

 .I bzip2recover
 takes a single argument, the name of the damaged file,
 and writes a number of files "rec0001file.bz2", "rec0002file.bz2",
 etc, containing the extracted blocks.  The output filenames
 are designed so that the use of wildcards in subsequent processing
 -- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
 lists the files in the "right" order.

 .I bzip2recover
 should be of most use dealing with large .bz2 files, as
 these will contain many blocks.  It is clearly futile to
 use it on damaged single-block files, since a damaged
 block cannot be recovered.  If you wish to minimise
 any potential data loss through media or transmission
 errors, you might consider compressing with a smaller
 block size.

 .SH PERFORMANCE NOTES
 The sorting phase of compression gathers together similar strings
 in the file.  Because of this, files containing very long
 runs of repeated symbols, like "aabaabaabaab ..." (repeated
 several hundred times) may compress extraordinarily slowly.
 You can use the
 \-vvvvv
 option to monitor progress in great detail, if you want.
 Decompression speed is unaffected.

 Such pathological cases
 seem rare in practice, appearing mostly in artificially-constructed
 test files, and in low-level disk images.  It may be inadvisable to
 use
 .I bzip2
 to compress the latter.
 If you do get a file which causes severe slowness in compression,
 try making the block size as small as possible, with flag \-1.

 Incompressible or virtually-incompressible data may decompress
 rather more slowly than one would hope.  This is due to
 a naive implementation of the move-to-front coder.

 .I bzip2
 usually allocates several megabytes of memory to operate in,
 and then charges all over it in a fairly random fashion.  This
 means that performance, both for compressing and decompressing,
 is largely determined by the speed
 at which your machine can service cache misses.
 Because of this, small changes
 to the code to reduce the miss rate have been observed to give
 disproportionately large performance improvements.
 I imagine
 .I bzip2
 will perform best on machines with very large caches.

 Test mode (\-t) uses the low-memory decompression algorithm
 (\-s).  This means test mode does not run as fast as it could;
 it could run as fast as the normal decompression machinery.
 This could easily be fixed at the cost of some code bloat.

 .SH CAVEATS
 I/O error messages are not as helpful as they could be.
 .I Bzip2
 tries hard to detect I/O errors and exit cleanly, but the
 details of what the problem is sometimes seem rather misleading.

 This manual page pertains to version 0.1 of
 .I bzip2.
 It may well happen that some future version will
 use a different compressed file format.  If you try to
 decompress, using 0.1, a .bz2 file created with some
 future version which uses a different compressed file format,
 0.1 will complain that your file "is not a bzip2 file".
 If that happens, you should obtain a more recent version
 of
 .I bzip2
 and use that to decompress the file.

 Wildcard expansion for Windows 95 and NT
 is flaky.

 .I bzip2recover
 uses 32-bit integers to represent bit positions in
 compressed files, so it cannot handle compressed files
 more than 512 megabytes long.  This could easily be fixed.

 .I bzip2recover
 sometimes reports a very small, incomplete final block.
 This is spurious and can be safely ignored.

 .SH RELATIONSHIP TO bzip-0.21
 This program is a descendant of the
 .I bzip
 program, version 0.21, which I released in August 1996.
 The primary difference of
 .I bzip2
 is its avoidance of the possibly patented algorithms
 which were used in 0.21.
 .I bzip2
 also brings various useful refinements (\-s, \-t),
 uses less memory, decompresses significantly faster, and
 has support for recovering data from damaged files.

 Because
 .I bzip2
 uses Huffman coding to construct the compressed bitstream,
 rather than the arithmetic coding used in 0.21,
 the compressed representations generated by the two programs
 are incompatible, and they will not interoperate.  The change
 in suffix from .bz to .bz2 reflects this.  It would have been
 helpful to at least allow
 .I bzip2
 to decompress files created by 0.21, but this would
 defeat the primary aim of having a patent-free compressor.

 Huffman coding necessarily involves some coding inefficiency
 compared to arithmetic coding.  This means that
 .I bzip2
 compresses about 1% worse than 0.21, an unfortunate but
 unavoidable fact-of-life.  On the other hand, decompression
 is approximately 50% faster for the same reason, and the
 change in file format gave an opportunity to add data-recovery
 features.  So it is not all bad.

 .SH AUTHOR
 Julian Seward, jseward@acm.org.

 The ideas embodied in
 .I bzip
 and
 .I bzip2
 are due to (at least) the following people:
 Michael Burrows and David Wheeler (for the block sorting
 transformation), David Wheeler (again, for the Huffman coder),
 Peter Fenwick (for the structured coding model in 0.21,
 and many refinements),
 and
 Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
 coder in 0.21).  I am much indebted for their help, support and advice.
 See the file ALGORITHMS in the source distribution for pointers to
 sources of documentation.
 Christian von Roques encouraged me to look for faster
 sorting algorithms, so as to speed up compression.
 Bela Lubkin encouraged me to improve the worst-case
 compression performance.
 Many people sent patches, helped with portability problems,
 lent machines, gave advice and were generally helpful.
	.PU
	.TH bzip2 1
	.SH NAME
	bzip2, bunzip2 \- a block-sorting file compressor, v0.1
	.br
	bzip2recover \- recovers data from damaged bzip2 files

	.SH SYNOPSIS
	.ll +8
	.B bzip2
	.RB [ " \-cdfkstvVL123456789 " ]
	[
	.I "filenames \&..."
	]
	.ll -8
	.br
	.B bunzip2
	.RB [ " \-kvsVL " ]
	[
	.I "filenames \&..."
	]
	.br
	.B bzip2recover
	.I "filename"

	.SH DESCRIPTION
	.I Bzip2
	compresses files using the Burrows-Wheeler block-sorting
	text compression algorithm, and Huffman coding.
	Compression is generally considerably
	better than that
	achieved by more conventional LZ77/LZ78-based compressors,
	and approaches the performance of the PPM family of statistical
	compressors.

	The command-line options are deliberately very similar to
	those of
	.I GNU Gzip,
	but they are not identical.

	.I Bzip2
	expects a list of file names to accompany the command-line flags.
	Each file is replaced by a compressed version of itself,
	with the name "original_name.bz2".
	Each compressed file has the same modification date and permissions
	as the corresponding original, so that these properties can be
	correctly restored at decompression time. File name handling is
	naive in the sense that there is no mechanism for preserving
	original file names, permissions and dates in filesystems
	which lack these concepts, or have serious file name length
	restrictions, such as MS-DOS.

	.I Bzip2
	and
	.I bunzip2
	will not overwrite existing files; if you want this to happen,
	you should delete them first.

	If no file names are specified,
	.I bzip2
	compresses from standard input to standard output.
	In this case,
	.I bzip2
	will decline to write compressed output to a terminal, as
	this would be entirely incomprehensible and therefore pointless.

	.I Bunzip2
	(or
	.I bzip2 \-d
	) decompresses and restores all specified files whose names
	end in ".bz2".
	Files without this suffix are ignored.
	Again, supplying no filenames
	causes decompression from standard input to standard output.

	You can also compress or decompress files to
	the standard output by giving the \-c flag.
	You can decompress multiple files like this, but you may
	only compress a single file this way, since it would otherwise
	be difficult to separate out the compressed representations of
	the original files.

	Compression is always performed, even if the compressed file is
	slightly larger than the original. Files of less than about
	one hundred bytes tend to get larger, since the compression
	mechanism has a constant overhead in the region of 50 bytes.
	Random data (including the output of most file compressors)
	is coded at about 8.05 bits per byte, giving an expansion of
	around 0.5%.

	As a self-check for your protection,
	.I bzip2
	uses 32-bit CRCs to make sure that the decompressed
	version of a file is identical to the original.
	This guards against corruption of the compressed data,
	and against undetected bugs in
	.I bzip2
	(hopefully very unlikely).
	The chances of data corruption going undetected is
	microscopic, about one chance in four billion
	for each file processed. Be aware, though, that the check
	occurs upon decompression, so it can only tell you that
	that something is wrong. It can't help you recover the
	original uncompressed data.
	You can use
	.I bzip2recover
	to try to recover data from damaged files.

	Return values:
	0 for a normal exit,
	1 for environmental
	problems (file not found, invalid flags, I/O errors, &c),
	2 to indicate a corrupt compressed file,
	3 for an internal consistency error (eg, bug) which caused
	.I bzip2
	to panic.

	.SH MEMORY MANAGEMENT
	.I Bzip2
	compresses large files in blocks. The block size affects both the
	compression ratio achieved, and the amount of memory needed both for
	compression and decompression. The flags \-1 through \-9
	specify the block size to be 100,000 bytes through 900,000 bytes
	(the default) respectively. At decompression-time, the block size used for
	compression is read from the header of the compressed file, and
	.I bunzip2
	then allocates itself just enough memory to decompress the file.
	Since block sizes are stored in compressed files, it follows that the flags
	\-1 to \-9
	are irrelevant to and so ignored during decompression.
	Compression and decompression requirements, in bytes, can be estimated as:

	Compression: 400k + ( 7 x block size )

	Decompression: 100k + ( 5 x block size ), or
	.br
	100k + ( 2.5 x block size )

	Larger block sizes give rapidly diminishing marginal returns; most
	of the
	compression comes from the first two or three hundred k of block size,
	a fact worth bearing in mind when using
	.I bzip2
	on small machines. It is also important to appreciate that the
	decompression memory requirement is set at compression-time by the
	choice of block size.

	For files compressed with the default 900k block size,
	.I bunzip2
	will require about 4600 kbytes to decompress.
	To support decompression of any file on a 4 megabyte machine,
	.I bunzip2
	has an option to decompress using approximately half this
	amount of memory, about 2300 kbytes. Decompression speed is
	also halved, so you should use this option only where necessary.
	The relevant flag is \-s.

	In general, try and use the largest block size
	memory constraints allow, since that maximises the compression
	achieved. Compression and decompression
	speed are virtually unaffected by block size.

	Another significant point applies to files which fit in a single
	block -- that means most files you'd encounter using a large
	block size. The amount of real memory touched is proportional
	to the size of the file, since the file is smaller than a block.
	For example, compressing a file 20,000 bytes long with the flag
	\-9
	will cause the compressor to allocate around
	6700k of memory, but only touch 400k + 20000 * 7 = 540
	kbytes of it. Similarly, the decompressor will allocate 4600k but
	only touch 100k + 20000 * 5 = 200 kbytes.

	Here is a table which summarises the maximum memory usage for
	different block sizes. Also recorded is the total compressed
	size for 14 files of the Calgary Text Compression Corpus
	totalling 3,141,622 bytes. This column gives some feel for how
	compression varies with block size. These figures tend to understate
	the advantage of larger block sizes for larger files, since the
	Corpus is dominated by smaller files.

	Compress Decompress Decompress Corpus
	Flag usage usage -s usage Size

	-1 1100k 600k 350k 914704
	-2 1800k 1100k 600k 877703
	-3 2500k 1600k 850k 860338
	-4 3200k 2100k 1100k 846899
	-5 3900k 2600k 1350k 845160
	-6 4600k 3100k 1600k 838626
	-7 5400k 3600k 1850k 834096
	-8 6000k 4100k 2100k 828642
	-9 6700k 4600k 2350k 828642

	.SH OPTIONS
	.TP
	.B \-c --stdout
	Compress or decompress to standard output. \-c will decompress
	multiple files to stdout, but will only compress a single file to
	stdout.
	.TP
	.B \-d --decompress
	Force decompression.
	.I Bzip2
	and
	.I bunzip2
	are really the same program, and the decision about whether to
	compress or decompress is done on the basis of which name is
	used. This flag overrides that mechanism, and forces
	.I bzip2
	to decompress.
	.TP
	.B \-f --compress
	The complement to \-d: forces compression, regardless of the invokation
	name.
	.TP
	.B \-t --test
	Check integrity of the specified file(s), but don't decompress them.
	This really performs a trial decompression and throws away the result,
	using the low-memory decompression algorithm (see \-s).
	.TP
	.B \-k --keep
	Keep (don't delete) input files during compression or decompression.
	.TP
	.B \-s --small
	Reduce memory usage, both for compression and decompression.
	Files are decompressed using a modified algorithm which only
	requires 2.5 bytes per block byte. This means any file can be
	decompressed in 2300k of memory, albeit somewhat more slowly than
	usual.

	During compression, -s selects a block size of 200k, which limits
	memory use to around the same figure, at the expense of your
	compression ratio. In short, if your machine is low on memory
	(8 megabytes or less), use -s for everything. See
	MEMORY MANAGEMENT above.

	.TP
	.B \-v --verbose
	Verbose mode -- show the compression ratio for each file processed.
	Further \-v's increase the verbosity level, spewing out lots of
	information which is primarily of interest for diagnostic purposes.
	.TP
	.B \-L --license
	Display the software version, license terms and conditions.
	.TP
	.B \-V --version
	Same as \-L.
	.TP
	.B \-1 to \-9
	Set the block size to 100 k, 200 k .. 900 k when
	compressing. Has no effect when decompressing.
	See MEMORY MANAGEMENT above.
	.TP
	.B \--repetitive-fast
	.I bzip2
	injects some small pseudo-random variations
	into very repetitive blocks to limit
	worst-case performance during compression.
	If sorting runs into difficulties, the block
	is randomised, and sorting is restarted.
	Very roughly,
	.I bzip2
	persists for three times as long as a well-behaved input
	would take before resorting to randomisation.
	This flag makes it give up much sooner.

	.TP
	.B \--repetitive-best
	Opposite of \--repetitive-fast; try a lot harder before
	resorting to randomisation.

	.SH RECOVERING DATA FROM DAMAGED FILES
	.I bzip2
	compresses files in blocks, usually 900kbytes long.
	Each block is handled independently. If a media or
	transmission error causes a multi-block .bz2
	file to become damaged,
	it may be possible to recover data from the undamaged blocks
	in the file.

	The compressed representation of each block is delimited by
	a 48-bit pattern, which makes it possible to find the block
	boundaries with reasonable certainty. Each block also carries
	its own 32-bit CRC, so damaged blocks can be
	distinguished from undamaged ones.

	.I bzip2recover
	is a simple program whose purpose is to search for
	blocks in .bz2 files, and write each block out into
	its own .bz2 file. You can then use
	.I bzip2 -t
	to test the integrity of the resulting files,
	and decompress those which are undamaged.

	.I bzip2recover
	takes a single argument, the name of the damaged file,
	and writes a number of files "rec0001file.bz2", "rec0002file.bz2",
	etc, containing the extracted blocks. The output filenames
	are designed so that the use of wildcards in subsequent processing
	-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
	lists the files in the "right" order.

	.I bzip2recover
	should be of most use dealing with large .bz2 files, as
	these will contain many blocks. It is clearly futile to
	use it on damaged single-block files, since a damaged
	block cannot be recovered. If you wish to minimise
	any potential data loss through media or transmission
	errors, you might consider compressing with a smaller
	block size.

	.SH PERFORMANCE NOTES
	The sorting phase of compression gathers together similar strings
	in the file. Because of this, files containing very long
	runs of repeated symbols, like "aabaabaabaab ..." (repeated
	several hundred times) may compress extraordinarily slowly.
	You can use the
	\-vvvvv
	option to monitor progress in great detail, if you want.
	Decompression speed is unaffected.

	Such pathological cases
	seem rare in practice, appearing mostly in artificially-constructed
	test files, and in low-level disk images. It may be inadvisable to
	use
	.I bzip2
	to compress the latter.
	If you do get a file which causes severe slowness in compression,
	try making the block size as small as possible, with flag \-1.

	Incompressible or virtually-incompressible data may decompress
	rather more slowly than one would hope. This is due to
	a naive implementation of the move-to-front coder.

	.I bzip2
	usually allocates several megabytes of memory to operate in,
	and then charges all over it in a fairly random fashion. This
	means that performance, both for compressing and decompressing,
	is largely determined by the speed
	at which your machine can service cache misses.
	Because of this, small changes
	to the code to reduce the miss rate have been observed to give
	disproportionately large performance improvements.
	I imagine
	.I bzip2
	will perform best on machines with very large caches.

	Test mode (\-t) uses the low-memory decompression algorithm
	(\-s). This means test mode does not run as fast as it could;
	it could run as fast as the normal decompression machinery.
	This could easily be fixed at the cost of some code bloat.

	.SH CAVEATS
	I/O error messages are not as helpful as they could be.
	.I Bzip2
	tries hard to detect I/O errors and exit cleanly, but the
	details of what the problem is sometimes seem rather misleading.

	This manual page pertains to version 0.1 of
	.I bzip2.
	It may well happen that some future version will
	use a different compressed file format. If you try to
	decompress, using 0.1, a .bz2 file created with some
	future version which uses a different compressed file format,
	0.1 will complain that your file "is not a bzip2 file".
	If that happens, you should obtain a more recent version
	of
	.I bzip2
	and use that to decompress the file.

	Wildcard expansion for Windows 95 and NT
	is flaky.

	.I bzip2recover
	uses 32-bit integers to represent bit positions in
	compressed files, so it cannot handle compressed files
	more than 512 megabytes long. This could easily be fixed.

	.I bzip2recover
	sometimes reports a very small, incomplete final block.
	This is spurious and can be safely ignored.

	.SH RELATIONSHIP TO bzip-0.21
	This program is a descendant of the
	.I bzip
	program, version 0.21, which I released in August 1996.
	The primary difference of
	.I bzip2
	is its avoidance of the possibly patented algorithms
	which were used in 0.21.
	.I bzip2
	also brings various useful refinements (\-s, \-t),
	uses less memory, decompresses significantly faster, and
	has support for recovering data from damaged files.

	Because
	.I bzip2
	uses Huffman coding to construct the compressed bitstream,
	rather than the arithmetic coding used in 0.21,
	the compressed representations generated by the two programs
	are incompatible, and they will not interoperate. The change
	in suffix from .bz to .bz2 reflects this. It would have been
	helpful to at least allow
	.I bzip2
	to decompress files created by 0.21, but this would
	defeat the primary aim of having a patent-free compressor.

	Huffman coding necessarily involves some coding inefficiency
	compared to arithmetic coding. This means that
	.I bzip2
	compresses about 1% worse than 0.21, an unfortunate but
	unavoidable fact-of-life. On the other hand, decompression
	is approximately 50% faster for the same reason, and the
	change in file format gave an opportunity to add data-recovery
	features. So it is not all bad.

	.SH AUTHOR
	Julian Seward, jseward@acm.org.

	The ideas embodied in
	.I bzip
	and
	.I bzip2
	are due to (at least) the following people:
	Michael Burrows and David Wheeler (for the block sorting
	transformation), David Wheeler (again, for the Huffman coder),
	Peter Fenwick (for the structured coding model in 0.21,
	and many refinements),
	and
	Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
	coder in 0.21). I am much indebted for their help, support and advice.
	See the file ALGORITHMS in the source distribution for pointers to
	sources of documentation.
	Christian von Roques encouraged me to look for faster
	sorting algorithms, so as to speed up compression.
	Bela Lubkin encouraged me to improve the worst-case
	compression performance.
	Many people sent patches, helped with portability problems,
	lent machines, gave advice and were generally helpful.