bzip2-0.1
diff --git a/ALGORITHMS b/ALGORITHMS
new file mode 100644
index 0000000..7c7d2ca
--- /dev/null
+++ b/ALGORITHMS
@@ -0,0 +1,47 @@
+
+Bzip2 is not research work, in the sense that it doesn't present any
+new ideas. Rather, it's an engineering exercise based on existing
+ideas.
+
+Four documents describe essentially all the ideas behind bzip2:
+
+ Michael Burrows and D. J. Wheeler:
+ "A block-sorting lossless data compression algorithm"
+ 10th May 1994.
+ Digital SRC Research Report 124.
+ ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
+
+ Daniel S. Hirschberg and Debra A. LeLewer
+ "Efficient Decoding of Prefix Codes"
+ Communications of the ACM, April 1990, Vol 33, Number 4.
+ You might be able to get an electronic copy of this
+ from the ACM Digital Library.
+
+ David J. Wheeler
+ Program bred3.c and accompanying document bred3.ps.
+ This contains the idea behind the multi-table Huffman
+ coding scheme.
+ ftp://ftp.cl.cam.ac.uk/pub/user/djw3/
+
+ Jon L. Bentley and Robert Sedgewick
+ "Fast Algorithms for Sorting and Searching Strings"
+ Available from Sedgewick's web page,
+ www.cs.princeton.edu/~rs
+
+The following paper gives valuable additional insights into the
+algorithm, but is not immediately the basis of any code
+used in bzip2.
+
+ Peter Fenwick:
+ Block Sorting Text Compression
+ Proceedings of the 19th Australasian Computer Science Conference,
+ Melbourne, Australia. Jan 31 - Feb 2, 1996.
+ ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
+
+All three are well written, and make fascinating reading. If you want
+to modify bzip2 in any non-trivial way, I strongly suggest you obtain,
+read and understand these papers.
+
+I am much indebted to the various authors for their help, support and
+advice.
+
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..a43ea21
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,339 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 675 Mass Ave, Cambridge, MA 02139, USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ Appendix: How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) 19yy <name of author>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) 19yy name of author
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Library General
+Public License instead of this License.
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..d124743
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,30 @@
+
+CC = gcc
+SH = /bin/sh
+
+CFLAGS = -O3 -fomit-frame-pointer -funroll-loops -Wall -Winline -W
+
+
+
+all:
+ cat words0
+ $(CC) $(CFLAGS) -o bzip2 bzip2.c
+ $(CC) $(CFLAGS) -o bzip2recover bzip2recover.c
+ rm -f bunzip2
+ ln -s ./bzip2 ./bunzip2
+ cat words1
+ ./bzip2 -1 < sample1.ref > sample1.rb2
+ ./bzip2 -2 < sample2.ref > sample2.rb2
+ ./bunzip2 < sample1.bz2 > sample1.tst
+ ./bunzip2 < sample2.bz2 > sample2.tst
+ cat words2
+ cmp sample1.bz2 sample1.rb2
+ cmp sample2.bz2 sample2.rb2
+ cmp sample1.tst sample1.ref
+ cmp sample2.tst sample2.ref
+ cat words3
+
+
+clean:
+ rm -f bzip2 bunzip2 bzip2recover sample*.tst sample*.rb2
+
diff --git a/README b/README
new file mode 100644
index 0000000..d77830f
--- /dev/null
+++ b/README
@@ -0,0 +1,243 @@
+
+GREETINGS!
+
+ This is the README for bzip2, my block-sorting file compressor,
+ version 0.1.
+
+ bzip2 is distributed under the GNU General Public License version 2;
+ for details, see the file LICENSE. Pointers to the algorithms used
+ are in ALGORITHMS. Instructions for use are in bzip2.1.preformatted.
+
+ Please read this file carefully.
+
+
+
+HOW TO BUILD
+
+ -- for UNIX:
+
+ Type `make'. (tough, huh? :-)
+
+ This creates binaries "bzip2", and "bunzip2",
+ which is a symbolic link to "bzip2".
+
+ It also runs four compress-decompress tests to make sure
+ things are working properly. If all goes well, you should be up &
+ running. Please be sure to read the output from `make'
+ just to be sure that the tests went ok.
+
+ To install bzip2 properly:
+
+ -- Copy the binary "bzip2" to a publically visible place,
+ possibly /usr/bin, /usr/common/bin or /usr/local/bin.
+
+ -- In that directory, make "bunzip2" be a symbolic link
+ to "bzip2".
+
+ -- Copy the manual page, bzip2.1, to the relevant place.
+ Probably the right place is /usr/man/man1/.
+
+ -- for Windows 95 and NT:
+
+ For a start, do you *really* want to recompile bzip2?
+ The standard distribution includes a pre-compiled version
+ for Windows 95 and NT, `bzip2.exe'.
+
+ This executable was created with Jacob Navia's excellent
+ port to Win32 of Chris Fraser & David Hanson's excellent
+ ANSI C compiler, "lcc". You can get to it at the pages
+ of the CS department of Princeton University,
+ www.cs.princeton.edu.
+ I have not tried to compile this version of bzip2 with
+ a commercial C compiler such as MS Visual C, as I don't
+ have one available.
+
+ Note that lcc is designed primarily to be portable and
+ fast. Code quality is a secondary aim, so bzip2.exe
+ runs perhaps 40% slower than it could if compiled with
+ a good optimising compiler.
+
+ I compiled a previous version of bzip (0.21) with Borland
+ C 5.0, which worked fine, and with MS VC++ 2.0, which
+ didn't. Here is an comment from the README for bzip-0.21.
+
+ MS VC++ 2.0's optimising compiler has a bug which, at
+ maximum optimisation, gives an executable which produces
+ garbage compressed files. Proceed with caution.
+ I do not know whether or not this happens with later
+ versions of VC++.
+
+ Edit the defines starting at line 86 of bzip.c to
+ select your platform/compiler combination, and then compile.
+ Then check that the resulting executable (assumed to be
+ called bzip.exe) works correctly, using the SELFTEST.BAT file.
+ Bearing in mind the previous paragraph, the self-test is
+ important.
+
+ Note that the defines which bzip-0.21 had, to support
+ compilation with VC 2.0 and BC 5.0, are gone. Windows
+ is not my preferred operating system, and I am, for the
+ moment, content with the modestly fast executable created
+ by lcc-win32.
+
+ A manual page is supplied, unformatted (bzip2.1),
+ preformatted (bzip2.1.preformatted), and preformatted
+ and sanitised for MS-DOS (bzip2.txt).
+
+
+
+COMPILATION NOTES
+
+ bzip2 should work on any 32 or 64-bit machine. It is known to work
+ [meaning: it has compiled and passed self-tests] on the
+ following platform-os combinations:
+
+ Intel i386/i486 running Linux 2.0.21
+ Sun Sparcs (various) running SunOS 4.1.4 and Solaris 2.5
+ Intel i386/i486 running Windows 95 and NT
+ DEC Alpha running Digital Unix 4.0
+
+ Following the release of bzip-0.21, many people mailed me
+ from around the world to say they had made it work on all sorts
+ of weird and wonderful machines. Chances are, if you have
+ a reasonable ANSI C compiler and a 32-bit machine, you can
+ get it to work.
+
+ The #defines starting at around line 82 of bzip2.c supply some
+ degree of platform-independance. If you configure bzip2 for some
+ new far-out platform which is not covered by the existing definitions,
+ please send me the relevant definitions.
+
+ I recommend GNU C for compilation. The code is standard ANSI C,
+ except for the Unix-specific file handling, so any ANSI C compiler
+ should work. Note however that the many routines marked INLINE
+ should be inlined by your compiler, else performance will be very
+ poor. Asking your compiler to unroll loops gives some
+ small improvement too; for gcc, the relevant flag is
+ -funroll-loops.
+
+ On a 386/486 machines, I'd recommend giving gcc the
+ -fomit-frame-pointer flag; this liberates another register for
+ allocation, which measurably improves performance.
+
+ I used the abovementioned lcc compiler to develop bzip2.
+ I would highly recommend this compiler for day-to-day development;
+ it is fast, reliable, lightweight, has an excellent profiler,
+ and is generally excellent. And it's fun to retarget, if you're
+ into that kind of thing.
+
+ If you compile bzip2 on a new platform or with a new compiler,
+ please be sure to run the four compress-decompress tests, either
+ using the Makefile, or with the test.bat (MSDOS) or test.cmd (OS/2)
+ files. Some compilers have been seen to introduce subtle bugs
+ when optimising, so this check is important. Ideally you should
+ then go on to test bzip2 on a file several megabytes or even
+ tens of megabytes long, just to be 110% sure. ``Professional
+ programmers are paranoid programmers.'' (anon).
+
+
+
+VALIDATION
+
+ Correct operation, in the sense that a compressed file can always be
+ decompressed to reproduce the original, is obviously of paramount
+ importance. To validate bzip2, I used a modified version of
+ Mark Nelson's churn program. Churn is an automated test driver
+ which recursively traverses a directory structure, using bzip2 to
+ compress and then decompress each file it encounters, and checking
+ that the decompressed data is the same as the original. As test
+ material, I used several runs over several filesystems of differing
+ sizes.
+
+ One set of tests was done on my base Linux filesystem,
+ 410 megabytes in 23,000 files. There were several runs over
+ this filesystem, in various configurations designed to break bzip2.
+ That filesystem also contained some specially constructed test
+ files designed to exercise boundary cases in the code.
+ This included files of zero length, various long, highly repetitive
+ files, and some files which generate blocks with all values the same.
+
+ The other set of tests was done just with the "normal" configuration,
+ but on a much larger quantity of data.
+
+ Tests are:
+
+ Linux FS, 410M, 23000 files
+
+ As above, with --repetitive-fast
+
+ As above, with -1
+
+ Low level disk image of a disk containing
+ Windows NT4.0; 420M in a single huge file
+
+ Linux distribution, incl Slackware,
+ all GNU sources. 1900M in 2300 files.
+
+ Approx ~100M compiler sources and related
+ programming tools, running under Purify.
+
+ About 500M of data in 120 files of around
+ 4 M each. This is raw data from a
+ biomagnetometer (SQUID-based thing).
+
+ Overall, total volume of test data is about
+ 3300 megabytes in 25000 files.
+
+ The distribution does four tests after building bzip. These tests
+ include test decompressions of pre-supplied compressed files, so
+ they not only test that bzip works correctly on the machine it was
+ built on, but can also decompress files compressed on a different
+ machine. This guards against unforseen interoperability problems.
+
+
+Please read and be aware of the following:
+
+WARNING:
+
+ This program (attempts to) compress data by performing several
+ non-trivial transformations on it. Unless you are 100% familiar
+ with *all* the algorithms contained herein, and with the
+ consequences of modifying them, you should NOT meddle with the
+ compression or decompression machinery. Incorrect changes can and
+ very likely *will* lead to disastrous loss of data.
+
+
+DISCLAIMER:
+
+ I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE
+ USE OF THIS PROGRAM, HOWSOEVER CAUSED.
+
+ Every compression of a file implies an assumption that the
+ compressed file can be decompressed to reproduce the original.
+ Great efforts in design, coding and testing have been made to
+ ensure that this program works correctly. However, the complexity
+ of the algorithms, and, in particular, the presence of various
+ special cases in the code which occur with very low but non-zero
+ probability make it impossible to rule out the possibility of bugs
+ remaining in the program. DO NOT COMPRESS ANY DATA WITH THIS
+ PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER
+ SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.
+
+ That is not to say this program is inherently unreliable. Indeed,
+ I very much hope the opposite is true. bzip2 has been carefully
+ constructed and extensively tested.
+
+End of nasty legalities.
+
+
+I hope you find bzip2 useful. Feel free to contact me at
+ jseward@acm.org
+if you have any suggestions or queries. Many people mailed me with
+comments, suggestions and patches after the releases of 0.15 and 0.21,
+and the changes in bzip2 are largely a result of this feedback.
+I thank you for your comments.
+
+Julian Seward
+
+Manchester, UK
+18 July 1996 (version 0.15)
+25 August 1996 (version 0.21)
+
+Guildford, Surrey, UK
+7 August 1997 (bzip2, version 0.0)
\ No newline at end of file
diff --git a/README.DOS b/README.DOS
new file mode 100644
index 0000000..d522b81
--- /dev/null
+++ b/README.DOS
@@ -0,0 +1,20 @@
+
+Windows 95 & Windows NT users:
+
+1. There's a pre-built executable, bzip2.exe, which
+ should work. You don't need to compile anything.
+ You can run the `test.bat' batch file to check
+ the executable is working ok, if you want.
+
+2. The control-C signal catcher seems pretty dodgy
+ under Windows, at least for the executable supplied.
+ When it catches a control-C, bzip2 tries to delete
+ its output file, so you don't get left with a half-
+ baked file. But this sometimes seems to fail
+ under Windows. Caveat Emptor! I think I am doing
+ something not-quite-right in the signal catching.
+ Windows-&-C gurus got any suggestions?
+
+ Control-C handling all seems to work fine under Unix.
+
+7 Aug 97
diff --git a/bzip2.1 b/bzip2.1
new file mode 100644
index 0000000..9094c7c
--- /dev/null
+++ b/bzip2.1
@@ -0,0 +1,441 @@
+.PU
+.TH bzip2 1
+.SH NAME
+bzip2, bunzip2 \- a block-sorting file compressor, v0.1
+.br
+bzip2recover \- recovers data from damaged bzip2 files
+
+.SH SYNOPSIS
+.ll +8
+.B bzip2
+.RB [ " \-cdfkstvVL123456789 " ]
+[
+.I "filenames \&..."
+]
+.ll -8
+.br
+.B bunzip2
+.RB [ " \-kvsVL " ]
+[
+.I "filenames \&..."
+]
+.br
+.B bzip2recover
+.I "filename"
+
+.SH DESCRIPTION
+.I Bzip2
+compresses files using the Burrows-Wheeler block-sorting
+text compression algorithm, and Huffman coding.
+Compression is generally considerably
+better than that
+achieved by more conventional LZ77/LZ78-based compressors,
+and approaches the performance of the PPM family of statistical
+compressors.
+
+The command-line options are deliberately very similar to
+those of
+.I GNU Gzip,
+but they are not identical.
+
+.I Bzip2
+expects a list of file names to accompany the command-line flags.
+Each file is replaced by a compressed version of itself,
+with the name "original_name.bz2".
+Each compressed file has the same modification date and permissions
+as the corresponding original, so that these properties can be
+correctly restored at decompression time. File name handling is
+naive in the sense that there is no mechanism for preserving
+original file names, permissions and dates in filesystems
+which lack these concepts, or have serious file name length
+restrictions, such as MS-DOS.
+
+.I Bzip2
+and
+.I bunzip2
+will not overwrite existing files; if you want this to happen,
+you should delete them first.
+
+If no file names are specified,
+.I bzip2
+compresses from standard input to standard output.
+In this case,
+.I bzip2
+will decline to write compressed output to a terminal, as
+this would be entirely incomprehensible and therefore pointless.
+
+.I Bunzip2
+(or
+.I bzip2 \-d
+) decompresses and restores all specified files whose names
+end in ".bz2".
+Files without this suffix are ignored.
+Again, supplying no filenames
+causes decompression from standard input to standard output.
+
+You can also compress or decompress files to
+the standard output by giving the \-c flag.
+You can decompress multiple files like this, but you may
+only compress a single file this way, since it would otherwise
+be difficult to separate out the compressed representations of
+the original files.
+
+Compression is always performed, even if the compressed file is
+slightly larger than the original. Files of less than about
+one hundred bytes tend to get larger, since the compression
+mechanism has a constant overhead in the region of 50 bytes.
+Random data (including the output of most file compressors)
+is coded at about 8.05 bits per byte, giving an expansion of
+around 0.5%.
+
+As a self-check for your protection,
+.I bzip2
+uses 32-bit CRCs to make sure that the decompressed
+version of a file is identical to the original.
+This guards against corruption of the compressed data,
+and against undetected bugs in
+.I bzip2
+(hopefully very unlikely).
+The chances of data corruption going undetected is
+microscopic, about one chance in four billion
+for each file processed. Be aware, though, that the check
+occurs upon decompression, so it can only tell you that
+that something is wrong. It can't help you recover the
+original uncompressed data.
+You can use
+.I bzip2recover
+to try to recover data from damaged files.
+
+Return values:
+0 for a normal exit,
+1 for environmental
+problems (file not found, invalid flags, I/O errors, &c),
+2 to indicate a corrupt compressed file,
+3 for an internal consistency error (eg, bug) which caused
+.I bzip2
+to panic.
+
+.SH MEMORY MANAGEMENT
+.I Bzip2
+compresses large files in blocks. The block size affects both the
+compression ratio achieved, and the amount of memory needed both for
+compression and decompression. The flags \-1 through \-9
+specify the block size to be 100,000 bytes through 900,000 bytes
+(the default) respectively. At decompression-time, the block size used for
+compression is read from the header of the compressed file, and
+.I bunzip2
+then allocates itself just enough memory to decompress the file.
+Since block sizes are stored in compressed files, it follows that the flags
+\-1 to \-9
+are irrelevant to and so ignored during decompression.
+Compression and decompression requirements, in bytes, can be estimated as:
+
+ Compression: 400k + ( 7 x block size )
+
+ Decompression: 100k + ( 5 x block size ), or
+.br
+ 100k + ( 2.5 x block size )
+
+Larger block sizes give rapidly diminishing marginal returns; most
+of the
+compression comes from the first two or three hundred k of block size,
+a fact worth bearing in mind when using
+.I bzip2
+on small machines. It is also important to appreciate that the
+decompression memory requirement is set at compression-time by the
+choice of block size.
+
+For files compressed with the default 900k block size,
+.I bunzip2
+will require about 4600 kbytes to decompress.
+To support decompression of any file on a 4 megabyte machine,
+.I bunzip2
+has an option to decompress using approximately half this
+amount of memory, about 2300 kbytes. Decompression speed is
+also halved, so you should use this option only where necessary.
+The relevant flag is \-s.
+
+In general, try and use the largest block size
+memory constraints allow, since that maximises the compression
+achieved. Compression and decompression
+speed are virtually unaffected by block size.
+
+Another significant point applies to files which fit in a single
+block -- that means most files you'd encounter using a large
+block size. The amount of real memory touched is proportional
+to the size of the file, since the file is smaller than a block.
+For example, compressing a file 20,000 bytes long with the flag
+\-9
+will cause the compressor to allocate around
+6700k of memory, but only touch 400k + 20000 * 7 = 540
+kbytes of it. Similarly, the decompressor will allocate 4600k but
+only touch 100k + 20000 * 5 = 200 kbytes.
+
+Here is a table which summarises the maximum memory usage for
+different block sizes. Also recorded is the total compressed
+size for 14 files of the Calgary Text Compression Corpus
+totalling 3,141,622 bytes. This column gives some feel for how
+compression varies with block size. These figures tend to understate
+the advantage of larger block sizes for larger files, since the
+Corpus is dominated by smaller files.
+
+ Compress Decompress Decompress Corpus
+ Flag usage usage -s usage Size
+
+ -1 1100k 600k 350k 914704
+ -2 1800k 1100k 600k 877703
+ -3 2500k 1600k 850k 860338
+ -4 3200k 2100k 1100k 846899
+ -5 3900k 2600k 1350k 845160
+ -6 4600k 3100k 1600k 838626
+ -7 5400k 3600k 1850k 834096
+ -8 6000k 4100k 2100k 828642
+ -9 6700k 4600k 2350k 828642
+
+.SH OPTIONS
+.TP
+.B \-c --stdout
+Compress or decompress to standard output. \-c will decompress
+multiple files to stdout, but will only compress a single file to
+stdout.
+.TP
+.B \-d --decompress
+Force decompression.
+.I Bzip2
+and
+.I bunzip2
+are really the same program, and the decision about whether to
+compress or decompress is done on the basis of which name is
+used. This flag overrides that mechanism, and forces
+.I bzip2
+to decompress.
+.TP
+.B \-f --compress
+The complement to \-d: forces compression, regardless of the invokation
+name.
+.TP
+.B \-t --test
+Check integrity of the specified file(s), but don't decompress them.
+This really performs a trial decompression and throws away the result,
+using the low-memory decompression algorithm (see \-s).
+.TP
+.B \-k --keep
+Keep (don't delete) input files during compression or decompression.
+.TP
+.B \-s --small
+Reduce memory usage, both for compression and decompression.
+Files are decompressed using a modified algorithm which only
+requires 2.5 bytes per block byte. This means any file can be
+decompressed in 2300k of memory, albeit somewhat more slowly than
+usual.
+
+During compression, -s selects a block size of 200k, which limits
+memory use to around the same figure, at the expense of your
+compression ratio. In short, if your machine is low on memory
+(8 megabytes or less), use -s for everything. See
+MEMORY MANAGEMENT above.
+
+.TP
+.B \-v --verbose
+Verbose mode -- show the compression ratio for each file processed.
+Further \-v's increase the verbosity level, spewing out lots of
+information which is primarily of interest for diagnostic purposes.
+.TP
+.B \-L --license
+Display the software version, license terms and conditions.
+.TP
+.B \-V --version
+Same as \-L.
+.TP
+.B \-1 to \-9
+Set the block size to 100 k, 200 k .. 900 k when
+compressing. Has no effect when decompressing.
+See MEMORY MANAGEMENT above.
+.TP
+.B \--repetitive-fast
+.I bzip2
+injects some small pseudo-random variations
+into very repetitive blocks to limit
+worst-case performance during compression.
+If sorting runs into difficulties, the block
+is randomised, and sorting is restarted.
+Very roughly,
+.I bzip2
+persists for three times as long as a well-behaved input
+would take before resorting to randomisation.
+This flag makes it give up much sooner.
+
+.TP
+.B \--repetitive-best
+Opposite of \--repetitive-fast; try a lot harder before
+resorting to randomisation.
+
+.SH RECOVERING DATA FROM DAMAGED FILES
+.I bzip2
+compresses files in blocks, usually 900kbytes long.
+Each block is handled independently. If a media or
+transmission error causes a multi-block .bz2
+file to become damaged,
+it may be possible to recover data from the undamaged blocks
+in the file.
+
+The compressed representation of each block is delimited by
+a 48-bit pattern, which makes it possible to find the block
+boundaries with reasonable certainty. Each block also carries
+its own 32-bit CRC, so damaged blocks can be
+distinguished from undamaged ones.
+
+.I bzip2recover
+is a simple program whose purpose is to search for
+blocks in .bz2 files, and write each block out into
+its own .bz2 file. You can then use
+.I bzip2 -t
+to test the integrity of the resulting files,
+and decompress those which are undamaged.
+
+.I bzip2recover
+takes a single argument, the name of the damaged file,
+and writes a number of files "rec0001file.bz2", "rec0002file.bz2",
+etc, containing the extracted blocks. The output filenames
+are designed so that the use of wildcards in subsequent processing
+-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
+lists the files in the "right" order.
+
+.I bzip2recover
+should be of most use dealing with large .bz2 files, as
+these will contain many blocks. It is clearly futile to
+use it on damaged single-block files, since a damaged
+block cannot be recovered. If you wish to minimise
+any potential data loss through media or transmission
+errors, you might consider compressing with a smaller
+block size.
+
+.SH PERFORMANCE NOTES
+The sorting phase of compression gathers together similar strings
+in the file. Because of this, files containing very long
+runs of repeated symbols, like "aabaabaabaab ..." (repeated
+several hundred times) may compress extraordinarily slowly.
+You can use the
+\-vvvvv
+option to monitor progress in great detail, if you want.
+Decompression speed is unaffected.
+
+Such pathological cases
+seem rare in practice, appearing mostly in artificially-constructed
+test files, and in low-level disk images. It may be inadvisable to
+use
+.I bzip2
+to compress the latter.
+If you do get a file which causes severe slowness in compression,
+try making the block size as small as possible, with flag \-1.
+
+Incompressible or virtually-incompressible data may decompress
+rather more slowly than one would hope. This is due to
+a naive implementation of the move-to-front coder.
+
+.I bzip2
+usually allocates several megabytes of memory to operate in,
+and then charges all over it in a fairly random fashion. This
+means that performance, both for compressing and decompressing,
+is largely determined by the speed
+at which your machine can service cache misses.
+Because of this, small changes
+to the code to reduce the miss rate have been observed to give
+disproportionately large performance improvements.
+I imagine
+.I bzip2
+will perform best on machines with very large caches.
+
+Test mode (\-t) uses the low-memory decompression algorithm
+(\-s). This means test mode does not run as fast as it could;
+it could run as fast as the normal decompression machinery.
+This could easily be fixed at the cost of some code bloat.
+
+.SH CAVEATS
+I/O error messages are not as helpful as they could be.
+.I Bzip2
+tries hard to detect I/O errors and exit cleanly, but the
+details of what the problem is sometimes seem rather misleading.
+
+This manual page pertains to version 0.1 of
+.I bzip2.
+It may well happen that some future version will
+use a different compressed file format. If you try to
+decompress, using 0.1, a .bz2 file created with some
+future version which uses a different compressed file format,
+0.1 will complain that your file "is not a bzip2 file".
+If that happens, you should obtain a more recent version
+of
+.I bzip2
+and use that to decompress the file.
+
+Wildcard expansion for Windows 95 and NT
+is flaky.
+
+.I bzip2recover
+uses 32-bit integers to represent bit positions in
+compressed files, so it cannot handle compressed files
+more than 512 megabytes long. This could easily be fixed.
+
+.I bzip2recover
+sometimes reports a very small, incomplete final block.
+This is spurious and can be safely ignored.
+
+.SH RELATIONSHIP TO bzip-0.21
+This program is a descendant of the
+.I bzip
+program, version 0.21, which I released in August 1996.
+The primary difference of
+.I bzip2
+is its avoidance of the possibly patented algorithms
+which were used in 0.21.
+.I bzip2
+also brings various useful refinements (\-s, \-t),
+uses less memory, decompresses significantly faster, and
+has support for recovering data from damaged files.
+
+Because
+.I bzip2
+uses Huffman coding to construct the compressed bitstream,
+rather than the arithmetic coding used in 0.21,
+the compressed representations generated by the two programs
+are incompatible, and they will not interoperate. The change
+in suffix from .bz to .bz2 reflects this. It would have been
+helpful to at least allow
+.I bzip2
+to decompress files created by 0.21, but this would
+defeat the primary aim of having a patent-free compressor.
+
+Huffman coding necessarily involves some coding inefficiency
+compared to arithmetic coding. This means that
+.I bzip2
+compresses about 1% worse than 0.21, an unfortunate but
+unavoidable fact-of-life. On the other hand, decompression
+is approximately 50% faster for the same reason, and the
+change in file format gave an opportunity to add data-recovery
+features. So it is not all bad.
+
+.SH AUTHOR
+Julian Seward, jseward@acm.org.
+
+The ideas embodied in
+.I bzip
+and
+.I bzip2
+are due to (at least) the following people:
+Michael Burrows and David Wheeler (for the block sorting
+transformation), David Wheeler (again, for the Huffman coder),
+Peter Fenwick (for the structured coding model in 0.21,
+and many refinements),
+and
+Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
+coder in 0.21). I am much indebted for their help, support and advice.
+See the file ALGORITHMS in the source distribution for pointers to
+sources of documentation.
+Christian von Roques encouraged me to look for faster
+sorting algorithms, so as to speed up compression.
+Bela Lubkin encouraged me to improve the worst-case
+compression performance.
+Many people sent patches, helped with portability problems,
+lent machines, gave advice and were generally helpful.
+
diff --git a/bzip2.1.preformatted b/bzip2.1.preformatted
new file mode 100644
index 0000000..947dc97
--- /dev/null
+++ b/bzip2.1.preformatted
@@ -0,0 +1,462 @@
+
+
+
+bzip2(1) bzip2(1)
+
+
+NNAAMMEE
+ bzip2, bunzip2 - a block-sorting file compressor, v0.1
+ bzip2recover - recovers data from damaged bzip2 files
+
+
+SSYYNNOOPPSSIISS
+ bbzziipp22 [ --ccddffkkssttvvVVLL112233445566778899 ] [ _f_i_l_e_n_a_m_e_s _._._. ]
+ bbuunnzziipp22 [ --kkvvssVVLL ] [ _f_i_l_e_n_a_m_e_s _._._. ]
+ bbzziipp22rreeccoovveerr _f_i_l_e_n_a_m_e
+
+
+DDEESSCCRRIIPPTTIIOONN
+ _B_z_i_p_2 compresses files using the Burrows-Wheeler block-
+ sorting text compression algorithm, and Huffman coding.
+ Compression is generally considerably better than that
+ achieved by more conventional LZ77/LZ78-based compressors,
+ and approaches the performance of the PPM family of sta-
+ tistical compressors.
+
+ The command-line options are deliberately very similar to
+ those of _G_N_U _G_z_i_p_, but they are not identical.
+
+ _B_z_i_p_2 expects a list of file names to accompany the com-
+ mand-line flags. Each file is replaced by a compressed
+ version of itself, with the name "original_name.bz2".
+ Each compressed file has the same modification date and
+ permissions as the corresponding original, so that these
+ properties can be correctly restored at decompression
+ time. File name handling is naive in the sense that there
+ is no mechanism for preserving original file names, per-
+ missions and dates in filesystems which lack these con-
+ cepts, or have serious file name length restrictions, such
+ as MS-DOS.
+
+ _B_z_i_p_2 and _b_u_n_z_i_p_2 will not overwrite existing files; if
+ you want this to happen, you should delete them first.
+
+ If no file names are specified, _b_z_i_p_2 compresses from
+ standard input to standard output. In this case, _b_z_i_p_2
+ will decline to write compressed output to a terminal, as
+ this would be entirely incomprehensible and therefore
+ pointless.
+
+ _B_u_n_z_i_p_2 (or _b_z_i_p_2 _-_d ) decompresses and restores all spec-
+ ified files whose names end in ".bz2". Files without this
+ suffix are ignored. Again, supplying no filenames causes
+ decompression from standard input to standard output.
+
+ You can also compress or decompress files to the standard
+ output by giving the -c flag. You can decompress multiple
+ files like this, but you may only compress a single file
+ this way, since it would otherwise be difficult to sepa-
+ rate out the compressed representations of the original
+ files.
+
+
+
+ 1
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ Compression is always performed, even if the compressed
+ file is slightly larger than the original. Files of less
+ than about one hundred bytes tend to get larger, since the
+ compression mechanism has a constant overhead in the
+ region of 50 bytes. Random data (including the output of
+ most file compressors) is coded at about 8.05 bits per
+ byte, giving an expansion of around 0.5%.
+
+ As a self-check for your protection, _b_z_i_p_2 uses 32-bit
+ CRCs to make sure that the decompressed version of a file
+ is identical to the original. This guards against corrup-
+ tion of the compressed data, and against undetected bugs
+ in _b_z_i_p_2 (hopefully very unlikely). The chances of data
+ corruption going undetected is microscopic, about one
+ chance in four billion for each file processed. Be aware,
+ though, that the check occurs upon decompression, so it
+ can only tell you that that something is wrong. It can't
+ help you recover the original uncompressed data. You can
+ use _b_z_i_p_2_r_e_c_o_v_e_r to try to recover data from damaged
+ files.
+
+ Return values: 0 for a normal exit, 1 for environmental
+ problems (file not found, invalid flags, I/O errors, &c),
+ 2 to indicate a corrupt compressed file, 3 for an internal
+ consistency error (eg, bug) which caused _b_z_i_p_2 to panic.
+
+
+MMEEMMOORRYY MMAANNAAGGEEMMEENNTT
+ _B_z_i_p_2 compresses large files in blocks. The block size
+ affects both the compression ratio achieved, and the
+ amount of memory needed both for compression and decom-
+ pression. The flags -1 through -9 specify the block size
+ to be 100,000 bytes through 900,000 bytes (the default)
+ respectively. At decompression-time, the block size used
+ for compression is read from the header of the compressed
+ file, and _b_u_n_z_i_p_2 then allocates itself just enough memory
+ to decompress the file. Since block sizes are stored in
+ compressed files, it follows that the flags -1 to -9 are
+ irrelevant to and so ignored during decompression. Com-
+ pression and decompression requirements, in bytes, can be
+ estimated as:
+
+ Compression: 400k + ( 7 x block size )
+
+ Decompression: 100k + ( 5 x block size ), or
+ 100k + ( 2.5 x block size )
+
+ Larger block sizes give rapidly diminishing marginal
+ returns; most of the compression comes from the first two
+ or three hundred k of block size, a fact worth bearing in
+ mind when using _b_z_i_p_2 on small machines. It is also
+ important to appreciate that the decompression memory
+ requirement is set at compression-time by the choice of
+ block size.
+
+
+
+ 2
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ For files compressed with the default 900k block size,
+ _b_u_n_z_i_p_2 will require about 4600 kbytes to decompress. To
+ support decompression of any file on a 4 megabyte machine,
+ _b_u_n_z_i_p_2 has an option to decompress using approximately
+ half this amount of memory, about 2300 kbytes. Decompres-
+ sion speed is also halved, so you should use this option
+ only where necessary. The relevant flag is -s.
+
+ In general, try and use the largest block size memory con-
+ straints allow, since that maximises the compression
+ achieved. Compression and decompression speed are virtu-
+ ally unaffected by block size.
+
+ Another significant point applies to files which fit in a
+ single block -- that means most files you'd encounter
+ using a large block size. The amount of real memory
+ touched is proportional to the size of the file, since the
+ file is smaller than a block. For example, compressing a
+ file 20,000 bytes long with the flag -9 will cause the
+ compressor to allocate around 6700k of memory, but only
+ touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the
+ decompressor will allocate 4600k but only touch 100k +
+ 20000 * 5 = 200 kbytes.
+
+ Here is a table which summarises the maximum memory usage
+ for different block sizes. Also recorded is the total
+ compressed size for 14 files of the Calgary Text Compres-
+ sion Corpus totalling 3,141,622 bytes. This column gives
+ some feel for how compression varies with block size.
+ These figures tend to understate the advantage of larger
+ block sizes for larger files, since the Corpus is domi-
+ nated by smaller files.
+
+ Compress Decompress Decompress Corpus
+ Flag usage usage -s usage Size
+
+ -1 1100k 600k 350k 914704
+ -2 1800k 1100k 600k 877703
+ -3 2500k 1600k 850k 860338
+ -4 3200k 2100k 1100k 846899
+ -5 3900k 2600k 1350k 845160
+ -6 4600k 3100k 1600k 838626
+ -7 5400k 3600k 1850k 834096
+ -8 6000k 4100k 2100k 828642
+ -9 6700k 4600k 2350k 828642
+
+
+OOPPTTIIOONNSS
+ --cc ----ssttddoouutt
+ Compress or decompress to standard output. -c will
+ decompress multiple files to stdout, but will only
+ compress a single file to stdout.
+
+
+
+
+
+ 3
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ --dd ----ddeeccoommpprreessss
+ Force decompression. _B_z_i_p_2 and _b_u_n_z_i_p_2 are really
+ the same program, and the decision about whether to
+ compress or decompress is done on the basis of
+ which name is used. This flag overrides that mech-
+ anism, and forces _b_z_i_p_2 to decompress.
+
+ --ff ----ccoommpprreessss
+ The complement to -d: forces compression, regard-
+ less of the invokation name.
+
+ --tt ----tteesstt
+ Check integrity of the specified file(s), but don't
+ decompress them. This really performs a trial
+ decompression and throws away the result, using the
+ low-memory decompression algorithm (see -s).
+
+ --kk ----kkeeeepp
+ Keep (don't delete) input files during compression
+ or decompression.
+
+ --ss ----ssmmaallll
+ Reduce memory usage, both for compression and
+ decompression. Files are decompressed using a mod-
+ ified algorithm which only requires 2.5 bytes per
+ block byte. This means any file can be decom-
+ pressed in 2300k of memory, albeit somewhat more
+ slowly than usual.
+
+ During compression, -s selects a block size of
+ 200k, which limits memory use to around the same
+ figure, at the expense of your compression ratio.
+ In short, if your machine is low on memory (8
+ megabytes or less), use -s for everything. See
+ MEMORY MANAGEMENT above.
+
+
+ --vv ----vveerrbboossee
+ Verbose mode -- show the compression ratio for each
+ file processed. Further -v's increase the ver-
+ bosity level, spewing out lots of information which
+ is primarily of interest for diagnostic purposes.
+
+ --LL ----lliicceennssee
+ Display the software version, license terms and
+ conditions.
+
+ --VV ----vveerrssiioonn
+ Same as -L.
+
+ --11 ttoo --99
+ Set the block size to 100 k, 200 k .. 900 k when
+ compressing. Has no effect when decompressing.
+ See MEMORY MANAGEMENT above.
+
+
+
+ 4
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ ----rreeppeettiittiivvee--ffaasstt
+ _b_z_i_p_2 injects some small pseudo-random variations
+ into very repetitive blocks to limit worst-case
+ performance during compression. If sorting runs
+ into difficulties, the block is randomised, and
+ sorting is restarted. Very roughly, _b_z_i_p_2 persists
+ for three times as long as a well-behaved input
+ would take before resorting to randomisation. This
+ flag makes it give up much sooner.
+
+
+ ----rreeppeettiittiivvee--bbeesstt
+ Opposite of --repetitive-fast; try a lot harder
+ before resorting to randomisation.
+
+
+RREECCOOVVEERRIINNGG DDAATTAA FFRROOMM DDAAMMAAGGEEDD FFIILLEESS
+ _b_z_i_p_2 compresses files in blocks, usually 900kbytes long.
+ Each block is handled independently. If a media or trans-
+ mission error causes a multi-block .bz2 file to become
+ damaged, it may be possible to recover data from the
+ undamaged blocks in the file.
+
+ The compressed representation of each block is delimited
+ by a 48-bit pattern, which makes it possible to find the
+ block boundaries with reasonable certainty. Each block
+ also carries its own 32-bit CRC, so damaged blocks can be
+ distinguished from undamaged ones.
+
+ _b_z_i_p_2_r_e_c_o_v_e_r is a simple program whose purpose is to
+ search for blocks in .bz2 files, and write each block out
+ into its own .bz2 file. You can then use _b_z_i_p_2 _-_t to test
+ the integrity of the resulting files, and decompress those
+ which are undamaged.
+
+ _b_z_i_p_2_r_e_c_o_v_e_r takes a single argument, the name of the dam-
+ aged file, and writes a number of files "rec0001file.bz2",
+ "rec0002file.bz2", etc, containing the extracted blocks.
+ The output filenames are designed so that the use of wild-
+ cards in subsequent processing -- for example, "bzip2 -dc
+ rec*file.bz2 > recovered_data" -- lists the files in the
+ "right" order.
+
+ _b_z_i_p_2_r_e_c_o_v_e_r should be of most use dealing with large .bz2
+ files, as these will contain many blocks. It is clearly
+ futile to use it on damaged single-block files, since a
+ damaged block cannot be recovered. If you wish to min-
+ imise any potential data loss through media or transmis-
+ sion errors, you might consider compressing with a smaller
+ block size.
+
+
+PPEERRFFOORRMMAANNCCEE NNOOTTEESS
+ The sorting phase of compression gathers together similar
+
+
+
+ 5
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ strings in the file. Because of this, files containing
+ very long runs of repeated symbols, like "aabaabaabaab
+ ..." (repeated several hundred times) may compress
+ extraordinarily slowly. You can use the -vvvvv option to
+ monitor progress in great detail, if you want. Decompres-
+ sion speed is unaffected.
+
+ Such pathological cases seem rare in practice, appearing
+ mostly in artificially-constructed test files, and in low-
+ level disk images. It may be inadvisable to use _b_z_i_p_2 to
+ compress the latter. If you do get a file which causes
+ severe slowness in compression, try making the block size
+ as small as possible, with flag -1.
+
+ Incompressible or virtually-incompressible data may decom-
+ press rather more slowly than one would hope. This is due
+ to a naive implementation of the move-to-front coder.
+
+ _b_z_i_p_2 usually allocates several megabytes of memory to
+ operate in, and then charges all over it in a fairly ran-
+ dom fashion. This means that performance, both for com-
+ pressing and decompressing, is largely determined by the
+ speed at which your machine can service cache misses.
+ Because of this, small changes to the code to reduce the
+ miss rate have been observed to give disproportionately
+ large performance improvements. I imagine _b_z_i_p_2 will per-
+ form best on machines with very large caches.
+
+ Test mode (-t) uses the low-memory decompression algorithm
+ (-s). This means test mode does not run as fast as it
+ could; it could run as fast as the normal decompression
+ machinery. This could easily be fixed at the cost of some
+ code bloat.
+
+
+CCAAVVEEAATTSS
+ I/O error messages are not as helpful as they could be.
+ _B_z_i_p_2 tries hard to detect I/O errors and exit cleanly,
+ but the details of what the problem is sometimes seem
+ rather misleading.
+
+ This manual page pertains to version 0.1 of _b_z_i_p_2_. It may
+ well happen that some future version will use a different
+ compressed file format. If you try to decompress, using
+ 0.1, a .bz2 file created with some future version which
+ uses a different compressed file format, 0.1 will complain
+ that your file "is not a bzip2 file". If that happens,
+ you should obtain a more recent version of _b_z_i_p_2 and use
+ that to decompress the file.
+
+ Wildcard expansion for Windows 95 and NT is flaky.
+
+ _b_z_i_p_2_r_e_c_o_v_e_r uses 32-bit integers to represent bit posi-
+ tions in compressed files, so it cannot handle compressed
+
+
+
+ 6
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ files more than 512 megabytes long. This could easily be
+ fixed.
+
+ _b_z_i_p_2_r_e_c_o_v_e_r sometimes reports a very small, incomplete
+ final block. This is spurious and can be safely ignored.
+
+
+RREELLAATTIIOONNSSHHIIPP TTOO bbzziipp--00..2211
+ This program is a descendant of the _b_z_i_p program, version
+ 0.21, which I released in August 1996. The primary dif-
+ ference of _b_z_i_p_2 is its avoidance of the possibly patented
+ algorithms which were used in 0.21. _b_z_i_p_2 also brings
+ various useful refinements (-s, -t), uses less memory,
+ decompresses significantly faster, and has support for
+ recovering data from damaged files.
+
+ Because _b_z_i_p_2 uses Huffman coding to construct the com-
+ pressed bitstream, rather than the arithmetic coding used
+ in 0.21, the compressed representations generated by the
+ two programs are incompatible, and they will not interop-
+ erate. The change in suffix from .bz to .bz2 reflects
+ this. It would have been helpful to at least allow _b_z_i_p_2
+ to decompress files created by 0.21, but this would defeat
+ the primary aim of having a patent-free compressor.
+
+ Huffman coding necessarily involves some coding ineffi-
+ ciency compared to arithmetic coding. This means that
+ _b_z_i_p_2 compresses about 1% worse than 0.21, an unfortunate
+ but unavoidable fact-of-life. On the other hand, decom-
+ pression is approximately 50% faster for the same reason,
+ and the change in file format gave an opportunity to add
+ data-recovery features. So it is not all bad.
+
+
+AAUUTTHHOORR
+ Julian Seward, jseward@acm.org.
+
+ The ideas embodied in _b_z_i_p and _b_z_i_p_2 are due to (at least)
+ the following people: Michael Burrows and David Wheeler
+ (for the block sorting transformation), David Wheeler
+ (again, for the Huffman coder), Peter Fenwick (for the
+ structured coding model in 0.21, and many refinements),
+ and Alistair Moffat, Radford Neal and Ian Witten (for the
+ arithmetic coder in 0.21). I am much indebted for their
+ help, support and advice. See the file ALGORITHMS in the
+ source distribution for pointers to sources of documenta-
+ tion. Christian von Roques encouraged me to look for
+ faster sorting algorithms, so as to speed up compression.
+ Bela Lubkin encouraged me to improve the worst-case com-
+ pression performance. Many people sent patches, helped
+ with portability problems, lent machines, gave advice and
+ were generally helpful.
+
+
+
+
+
+ 7
+
+
diff --git a/bzip2.c b/bzip2.c
new file mode 100644
index 0000000..0fb45fb
--- /dev/null
+++ b/bzip2.c
@@ -0,0 +1,4036 @@
+
+/*-----------------------------------------------------------*/
+/*--- A block-sorting, lossless compressor bzip2.c ---*/
+/*-----------------------------------------------------------*/
+
+/*--
+ This program is bzip2, a lossless, block-sorting data compressor,
+ version 0.1pl0, dated 17-Aug-1997.
+
+ Copyright (C) 1996, 1997 by Julian Seward.
+ Guildford, Surrey, UK
+ email: jseward@acm.org
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ The GNU General Public License is contained in the file LICENSE.
+
+ This program is based on (at least) the work of:
+ Mike Burrows
+ David Wheeler
+ Peter Fenwick
+ Alistair Moffat
+ Radford Neal
+ Ian H. Witten
+ Robert Sedgewick
+ Jon L. Bentley
+
+ For more information on these sources, see the file ALGORITHMS.
+--*/
+
+/*----------------------------------------------------*/
+/*--- IMPORTANT ---*/
+/*----------------------------------------------------*/
+
+/*--
+ WARNING:
+ This program (attempts to) compress data by performing several
+ non-trivial transformations on it. Unless you are 100% familiar
+ with *all* the algorithms contained herein, and with the
+ consequences of modifying them, you should NOT meddle with the
+ compression or decompression machinery. Incorrect changes can
+ and very likely *will* lead to disasterous loss of data.
+
+ DISCLAIMER:
+ I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE
+ USE OF THIS PROGRAM, HOWSOEVER CAUSED.
+
+ Every compression of a file implies an assumption that the
+ compressed file can be decompressed to reproduce the original.
+ Great efforts in design, coding and testing have been made to
+ ensure that this program works correctly. However, the
+ complexity of the algorithms, and, in particular, the presence
+ of various special cases in the code which occur with very low
+ but non-zero probability make it impossible to rule out the
+ possibility of bugs remaining in the program. DO NOT COMPRESS
+ ANY DATA WITH THIS PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE
+ POSSIBILITY, HOWEVER SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.
+
+ That is not to say this program is inherently unreliable.
+ Indeed, I very much hope the opposite is true. bzip2 has been
+ carefully constructed and extensively tested.
+--*/
+
+
+
+/*----------------------------------------------------*/
+/*--- and now for something much more pleasant :-) ---*/
+/*----------------------------------------------------*/
+
+/*---------------------------------------------*/
+/*--
+ Place a 1 beside your platform, and 0 elsewhere.
+--*/
+
+/*--
+ Generic 32-bit Unix.
+ Also works on 64-bit Unix boxes.
+--*/
+#define BZ_UNIX 1
+
+/*--
+ Win32, as seen by Jacob Navia's excellent
+ port of (Chris Fraser & David Hanson)'s excellent
+ lcc compiler.
+--*/
+#define BZ_LCCWIN32 0
+
+
+
+/*---------------------------------------------*/
+/*--
+ Some stuff for all platforms.
+--*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#if DEBUG
+ #include <assert.h>
+#endif
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <math.h>
+
+#define ERROR_IF_EOF(i) { if ((i) == EOF) ioError(); }
+#define ERROR_IF_NOT_ZERO(i) { if ((i) != 0) ioError(); }
+#define ERROR_IF_MINUS_ONE(i) { if ((i) == (-1)) ioError(); }
+
+
+/*---------------------------------------------*/
+/*--
+ Platform-specific stuff.
+--*/
+
+#if BZ_UNIX
+ #include <utime.h>
+ #include <unistd.h>
+ #include <malloc.h>
+ #include <sys/stat.h>
+ #include <sys/times.h>
+
+ #define Int32 int
+ #define UInt32 unsigned int
+ #define Char char
+ #define UChar unsigned char
+ #define Int16 short
+ #define UInt16 unsigned short
+
+ #define PATH_SEP '/'
+ #define MY_LSTAT lstat
+ #define MY_S_IFREG S_ISREG
+ #define MY_STAT stat
+
+ #define APPEND_FILESPEC(root, name) \
+ root=snocString((root), (name))
+
+ #define SET_BINARY_MODE(fd) /**/
+
+ /*--
+ You should try very hard to persuade your C compiler
+ to inline the bits marked INLINE. Otherwise bzip2 will
+ run rather slowly. gcc version 2.x is recommended.
+ --*/
+ #ifdef __GNUC__
+ #define INLINE inline
+ #define NORETURN __attribute__ ((noreturn))
+ #else
+ #define INLINE /**/
+ #define NORETURN /**/
+ #endif
+#endif
+
+
+
+#if BZ_LCCWIN32
+ #include <io.h>
+ #include <fcntl.h>
+ #include <sys\stat.h>
+
+ #define Int32 int
+ #define UInt32 unsigned int
+ #define Int16 short
+ #define UInt16 unsigned short
+ #define Char char
+ #define UChar unsigned char
+
+ #define INLINE /**/
+ #define NORETURN /**/
+ #define PATH_SEP '\\'
+ #define MY_LSTAT _stat
+ #define MY_STAT _stat
+ #define MY_S_IFREG(x) ((x) & _S_IFREG)
+
+ #if 0
+ /*-- lcc-win32 seems to expand wildcards itself --*/
+ #define APPEND_FILESPEC(root, spec) \
+ do { \
+ if ((spec)[0] == '-') { \
+ root = snocString((root), (spec)); \
+ } else { \
+ struct _finddata_t c_file; \
+ long hFile; \
+ hFile = _findfirst((spec), &c_file); \
+ if ( hFile == -1L ) { \
+ root = snocString ((root), (spec)); \
+ } else { \
+ int anInt = 0; \
+ while ( anInt == 0 ) { \
+ root = snocString((root), \
+ &c_file.name[0]); \
+ anInt = _findnext(hFile, &c_file); \
+ } \
+ } \
+ } \
+ } while ( 0 )
+ #else
+ #define APPEND_FILESPEC(root, name) \
+ root = snocString ((root), (name))
+ #endif
+
+ #define SET_BINARY_MODE(fd) \
+ do { \
+ int retVal = setmode ( fileno ( fd ), \
+ O_BINARY ); \
+ ERROR_IF_MINUS_ONE ( retVal ); \
+ } while ( 0 )
+
+#endif
+
+
+/*---------------------------------------------*/
+/*--
+ Some more stuff for all platforms :-)
+--*/
+
+#define Bool unsigned char
+#define True 1
+#define False 0
+
+/*--
+ IntNative is your platform's `native' int size.
+ Only here to avoid probs with 64-bit platforms.
+--*/
+#define IntNative int
+
+
+/*--
+ change to 1, or compile with -DDEBUG=1 to debug
+--*/
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+
+/*---------------------------------------------------*/
+/*--- ---*/
+/*---------------------------------------------------*/
+
+/*--
+ Implementation notes, July 1997
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Memory allocation
+ ~~~~~~~~~~~~~~~~~
+ All large data structures are allocated on the C heap,
+ for better or for worse. That includes the various
+ arrays of pointers, striped words, bytes, frequency
+ tables and buffers for compression and decompression.
+
+ bzip2 can operate at various block-sizes, ranging from
+ 100k to 900k in 100k steps, and it allocates only as
+ much as it needs to. When compressing, we know from the
+ command-line options what the block-size is going to be,
+ so all allocation can be done at start-up; if that
+ succeeds, there can be no further allocation problems.
+
+ Decompression is more complicated. Each compressed file
+ contains, in its header, a byte indicating the block
+ size used for compression. This means bzip2 potentially
+ needs to reallocate memory for each file it deals with,
+ which in turn opens the possibility for a memory allocation
+ failure part way through a run of files, by encountering
+ a file requiring a much larger block size than all the
+ ones preceding it.
+
+ The policy is to simply give up if a memory allocation
+ failure occurs. During decompression, it would be
+ possible to move on to subsequent files in the hope that
+ some might ask for a smaller block size, but the
+ complications for doing this seem more trouble than they
+ are worth.
+
+
+ Compressed file formats
+ ~~~~~~~~~~~~~~~~~~~~~~~
+ [This is now entirely different from both 0.21, and from
+ any previous Huffman-coded variant of bzip.
+ See the associated file bzip2.txt for details.]
+
+
+ Error conditions
+ ~~~~~~~~~~~~~~~~
+ Dealing with error conditions is the least satisfactory
+ aspect of bzip2. The policy is to try and leave the
+ filesystem in a consistent state, then quit, even if it
+ means not processing some of the files mentioned in the
+ command line. `A consistent state' means that a file
+ exists either in its compressed or uncompressed form,
+ but not both. This boils down to the rule `delete the
+ output file if an error condition occurs, leaving the
+ input intact'. Input files are only deleted when we can
+ be pretty sure the output file has been written and
+ closed successfully.
+
+ Errors are a dog because there's so many things to
+ deal with. The following can happen mid-file, and
+ require cleaning up.
+
+ internal `panics' -- indicating a bug
+ corrupted or inconsistent compressed file
+ can't allocate enough memory to decompress this file
+ I/O error reading/writing/opening/closing
+ signal catches -- Control-C, SIGTERM, SIGHUP.
+
+ Other conditions, primarily pertaining to file names,
+ can be checked in-between files, which makes dealing
+ with them easier.
+--*/
+
+
+
+/*---------------------------------------------------*/
+/*--- Misc (file handling) data decls ---*/
+/*---------------------------------------------------*/
+
+UInt32 bytesIn, bytesOut;
+Int32 verbosity;
+Bool keepInputFiles, smallMode, testFailsExist;
+UInt32 globalCrc;
+Int32 numFileNames, numFilesProcessed;
+
+
+/*-- source modes; F==file, I==stdin, O==stdout --*/
+#define SM_I2O 1
+#define SM_F2O 2
+#define SM_F2F 3
+
+/*-- operation modes --*/
+#define OM_Z 1
+#define OM_UNZ 2
+#define OM_TEST 3
+
+Int32 opMode;
+Int32 srcMode;
+
+
+Int32 longestFileName;
+Char inName[1024];
+Char outName[1024];
+Char *progName;
+Char progNameReally[1024];
+FILE *outputHandleJustInCase;
+
+void panic ( Char* ) NORETURN;
+void ioError ( void ) NORETURN;
+void compressOutOfMemory ( Int32, Int32 ) NORETURN;
+void uncompressOutOfMemory ( Int32, Int32 ) NORETURN;
+void blockOverrun ( void ) NORETURN;
+void badBlockHeader ( void ) NORETURN;
+void badBGLengths ( void ) NORETURN;
+void crcError ( UInt32, UInt32 ) NORETURN;
+void bitStreamEOF ( void ) NORETURN;
+void cleanUpAndFail ( Int32 ) NORETURN;
+void compressedStreamEOF ( void ) NORETURN;
+
+void* myMalloc ( Int32 );
+
+
+
+/*---------------------------------------------------*/
+/*--- Data decls for the front end ---*/
+/*---------------------------------------------------*/
+
+/*--
+ The overshoot bytes allow us to avoid most of
+ the cost of pointer renormalisation during
+ comparison of rotations in sorting.
+ The figure of 20 is derived as follows:
+ qSort3 allows an overshoot of up to 10.
+ It then calls simpleSort, which calls
+ fullGtU, also with max overshoot 10.
+ fullGtU does up to 10 comparisons without
+ renormalising, giving 10+10 == 20.
+--*/
+#define NUM_OVERSHOOT_BYTES 20
+
+/*--
+ These are the main data structures for
+ the Burrows-Wheeler transform.
+--*/
+
+/*--
+ Pointers to compression and decompression
+ structures. Set by
+ allocateCompressStructures and
+ setDecompressStructureSizes
+
+ The structures are always set to be suitable
+ for a block of size 100000 * blockSize100k.
+--*/
+UChar *block; /*-- compress --*/
+UInt16 *quadrant; /*-- compress --*/
+Int32 *zptr; /*-- compress --*/
+UInt16 *szptr; /*-- overlays zptr ---*/
+Int32 *ftab; /*-- compress --*/
+
+UInt16 *ll16; /*-- small decompress --*/
+UChar *ll4; /*-- small decompress --*/
+
+Int32 *tt; /*-- fast decompress --*/
+UChar *ll8; /*-- fast decompress --*/
+
+
+/*--
+ freq table collected to save a pass over the data
+ during decompression.
+--*/
+Int32 unzftab[256];
+
+
+/*--
+ index of the last char in the block, so
+ the block size == last + 1.
+--*/
+Int32 last;
+
+
+/*--
+ index in zptr[] of original string after sorting.
+--*/
+Int32 origPtr;
+
+
+/*--
+ always: in the range 0 .. 9.
+ The current block size is 100000 * this number.
+--*/
+Int32 blockSize100k;
+
+
+/*--
+ Used when sorting. If too many long comparisons
+ happen, we stop sorting, randomise the block
+ slightly, and try again.
+--*/
+
+Int32 workFactor;
+Int32 workDone;
+Int32 workLimit;
+Bool blockRandomised;
+Bool firstAttempt;
+Int32 nBlocksRandomised;
+
+
+
+/*---------------------------------------------------*/
+/*--- Data decls for the back end ---*/
+/*---------------------------------------------------*/
+
+#define MAX_ALPHA_SIZE 258
+#define MAX_CODE_LEN 23
+
+#define RUNA 0
+#define RUNB 1
+
+#define N_GROUPS 6
+#define G_SIZE 50
+#define N_ITERS 4
+
+#define MAX_SELECTORS (2 + (900000 / G_SIZE))
+
+Bool inUse[256];
+Int32 nInUse;
+
+UChar seqToUnseq[256];
+UChar unseqToSeq[256];
+
+UChar selector [MAX_SELECTORS];
+UChar selectorMtf[MAX_SELECTORS];
+
+Int32 nMTF;
+
+Int32 mtfFreq[MAX_ALPHA_SIZE];
+
+UChar len [N_GROUPS][MAX_ALPHA_SIZE];
+
+/*-- decompress only --*/
+Int32 limit [N_GROUPS][MAX_ALPHA_SIZE];
+Int32 base [N_GROUPS][MAX_ALPHA_SIZE];
+Int32 perm [N_GROUPS][MAX_ALPHA_SIZE];
+Int32 minLens[N_GROUPS];
+
+/*-- compress only --*/
+Int32 code [N_GROUPS][MAX_ALPHA_SIZE];
+Int32 rfreq[N_GROUPS][MAX_ALPHA_SIZE];
+
+
+/*---------------------------------------------------*/
+/*--- 32-bit CRC grunge ---*/
+/*---------------------------------------------------*/
+
+/*--
+ I think this is an implementation of the AUTODIN-II,
+ Ethernet & FDDI 32-bit CRC standard. Vaguely derived
+ from code by Rob Warnock, in Section 51 of the
+ comp.compression FAQ.
+--*/
+
+UInt32 crc32Table[256] = {
+
+ /*-- Ugly, innit? --*/
+
+ 0x00000000L, 0x04c11db7L, 0x09823b6eL, 0x0d4326d9L,
+ 0x130476dcL, 0x17c56b6bL, 0x1a864db2L, 0x1e475005L,
+ 0x2608edb8L, 0x22c9f00fL, 0x2f8ad6d6L, 0x2b4bcb61L,
+ 0x350c9b64L, 0x31cd86d3L, 0x3c8ea00aL, 0x384fbdbdL,
+ 0x4c11db70L, 0x48d0c6c7L, 0x4593e01eL, 0x4152fda9L,
+ 0x5f15adacL, 0x5bd4b01bL, 0x569796c2L, 0x52568b75L,
+ 0x6a1936c8L, 0x6ed82b7fL, 0x639b0da6L, 0x675a1011L,
+ 0x791d4014L, 0x7ddc5da3L, 0x709f7b7aL, 0x745e66cdL,
+ 0x9823b6e0L, 0x9ce2ab57L, 0x91a18d8eL, 0x95609039L,
+ 0x8b27c03cL, 0x8fe6dd8bL, 0x82a5fb52L, 0x8664e6e5L,
+ 0xbe2b5b58L, 0xbaea46efL, 0xb7a96036L, 0xb3687d81L,
+ 0xad2f2d84L, 0xa9ee3033L, 0xa4ad16eaL, 0xa06c0b5dL,
+ 0xd4326d90L, 0xd0f37027L, 0xddb056feL, 0xd9714b49L,
+ 0xc7361b4cL, 0xc3f706fbL, 0xceb42022L, 0xca753d95L,
+ 0xf23a8028L, 0xf6fb9d9fL, 0xfbb8bb46L, 0xff79a6f1L,
+ 0xe13ef6f4L, 0xe5ffeb43L, 0xe8bccd9aL, 0xec7dd02dL,
+ 0x34867077L, 0x30476dc0L, 0x3d044b19L, 0x39c556aeL,
+ 0x278206abL, 0x23431b1cL, 0x2e003dc5L, 0x2ac12072L,
+ 0x128e9dcfL, 0x164f8078L, 0x1b0ca6a1L, 0x1fcdbb16L,
+ 0x018aeb13L, 0x054bf6a4L, 0x0808d07dL, 0x0cc9cdcaL,
+ 0x7897ab07L, 0x7c56b6b0L, 0x71159069L, 0x75d48ddeL,
+ 0x6b93dddbL, 0x6f52c06cL, 0x6211e6b5L, 0x66d0fb02L,
+ 0x5e9f46bfL, 0x5a5e5b08L, 0x571d7dd1L, 0x53dc6066L,
+ 0x4d9b3063L, 0x495a2dd4L, 0x44190b0dL, 0x40d816baL,
+ 0xaca5c697L, 0xa864db20L, 0xa527fdf9L, 0xa1e6e04eL,
+ 0xbfa1b04bL, 0xbb60adfcL, 0xb6238b25L, 0xb2e29692L,
+ 0x8aad2b2fL, 0x8e6c3698L, 0x832f1041L, 0x87ee0df6L,
+ 0x99a95df3L, 0x9d684044L, 0x902b669dL, 0x94ea7b2aL,
+ 0xe0b41de7L, 0xe4750050L, 0xe9362689L, 0xedf73b3eL,
+ 0xf3b06b3bL, 0xf771768cL, 0xfa325055L, 0xfef34de2L,
+ 0xc6bcf05fL, 0xc27dede8L, 0xcf3ecb31L, 0xcbffd686L,
+ 0xd5b88683L, 0xd1799b34L, 0xdc3abdedL, 0xd8fba05aL,
+ 0x690ce0eeL, 0x6dcdfd59L, 0x608edb80L, 0x644fc637L,
+ 0x7a089632L, 0x7ec98b85L, 0x738aad5cL, 0x774bb0ebL,
+ 0x4f040d56L, 0x4bc510e1L, 0x46863638L, 0x42472b8fL,
+ 0x5c007b8aL, 0x58c1663dL, 0x558240e4L, 0x51435d53L,
+ 0x251d3b9eL, 0x21dc2629L, 0x2c9f00f0L, 0x285e1d47L,
+ 0x36194d42L, 0x32d850f5L, 0x3f9b762cL, 0x3b5a6b9bL,
+ 0x0315d626L, 0x07d4cb91L, 0x0a97ed48L, 0x0e56f0ffL,
+ 0x1011a0faL, 0x14d0bd4dL, 0x19939b94L, 0x1d528623L,
+ 0xf12f560eL, 0xf5ee4bb9L, 0xf8ad6d60L, 0xfc6c70d7L,
+ 0xe22b20d2L, 0xe6ea3d65L, 0xeba91bbcL, 0xef68060bL,
+ 0xd727bbb6L, 0xd3e6a601L, 0xdea580d8L, 0xda649d6fL,
+ 0xc423cd6aL, 0xc0e2d0ddL, 0xcda1f604L, 0xc960ebb3L,
+ 0xbd3e8d7eL, 0xb9ff90c9L, 0xb4bcb610L, 0xb07daba7L,
+ 0xae3afba2L, 0xaafbe615L, 0xa7b8c0ccL, 0xa379dd7bL,
+ 0x9b3660c6L, 0x9ff77d71L, 0x92b45ba8L, 0x9675461fL,
+ 0x8832161aL, 0x8cf30badL, 0x81b02d74L, 0x857130c3L,
+ 0x5d8a9099L, 0x594b8d2eL, 0x5408abf7L, 0x50c9b640L,
+ 0x4e8ee645L, 0x4a4ffbf2L, 0x470cdd2bL, 0x43cdc09cL,
+ 0x7b827d21L, 0x7f436096L, 0x7200464fL, 0x76c15bf8L,
+ 0x68860bfdL, 0x6c47164aL, 0x61043093L, 0x65c52d24L,
+ 0x119b4be9L, 0x155a565eL, 0x18197087L, 0x1cd86d30L,
+ 0x029f3d35L, 0x065e2082L, 0x0b1d065bL, 0x0fdc1becL,
+ 0x3793a651L, 0x3352bbe6L, 0x3e119d3fL, 0x3ad08088L,
+ 0x2497d08dL, 0x2056cd3aL, 0x2d15ebe3L, 0x29d4f654L,
+ 0xc5a92679L, 0xc1683bceL, 0xcc2b1d17L, 0xc8ea00a0L,
+ 0xd6ad50a5L, 0xd26c4d12L, 0xdf2f6bcbL, 0xdbee767cL,
+ 0xe3a1cbc1L, 0xe760d676L, 0xea23f0afL, 0xeee2ed18L,
+ 0xf0a5bd1dL, 0xf464a0aaL, 0xf9278673L, 0xfde69bc4L,
+ 0x89b8fd09L, 0x8d79e0beL, 0x803ac667L, 0x84fbdbd0L,
+ 0x9abc8bd5L, 0x9e7d9662L, 0x933eb0bbL, 0x97ffad0cL,
+ 0xafb010b1L, 0xab710d06L, 0xa6322bdfL, 0xa2f33668L,
+ 0xbcb4666dL, 0xb8757bdaL, 0xb5365d03L, 0xb1f740b4L
+};
+
+
+/*---------------------------------------------*/
+void initialiseCRC ( void )
+{
+ globalCrc = 0xffffffffL;
+}
+
+
+/*---------------------------------------------*/
+UInt32 getFinalCRC ( void )
+{
+ return ~globalCrc;
+}
+
+
+/*---------------------------------------------*/
+UInt32 getGlobalCRC ( void )
+{
+ return globalCrc;
+}
+
+
+/*---------------------------------------------*/
+void setGlobalCRC ( UInt32 newCrc )
+{
+ globalCrc = newCrc;
+}
+
+
+/*---------------------------------------------*/
+#define UPDATE_CRC(crcVar,cha) \
+{ \
+ crcVar = (crcVar << 8) ^ \
+ crc32Table[(crcVar >> 24) ^ \
+ ((UChar)cha)]; \
+}
+
+
+/*---------------------------------------------------*/
+/*--- Bit stream I/O ---*/
+/*---------------------------------------------------*/
+
+
+UInt32 bsBuff;
+Int32 bsLive;
+FILE* bsStream;
+Bool bsWriting;
+
+
+/*---------------------------------------------*/
+void bsSetStream ( FILE* f, Bool wr )
+{
+ if (bsStream != NULL) panic ( "bsSetStream" );
+ bsStream = f;
+ bsLive = 0;
+ bsBuff = 0;
+ bytesOut = 0;
+ bytesIn = 0;
+ bsWriting = wr;
+}
+
+
+/*---------------------------------------------*/
+void bsFinishedWithStream ( void )
+{
+ if (bsWriting)
+ while (bsLive > 0) {
+ fputc ( (UChar)(bsBuff >> 24), bsStream );
+ bsBuff <<= 8;
+ bsLive -= 8;
+ bytesOut++;
+ }
+ bsStream = NULL;
+}
+
+
+/*---------------------------------------------*/
+#define bsNEEDR(nz) \
+{ \
+ while (bsLive < nz) { \
+ Int32 zzi = fgetc ( bsStream ); \
+ if (zzi == EOF) compressedStreamEOF(); \
+ bsBuff = (bsBuff << 8) | (zzi & 0xffL); \
+ bsLive += 8; \
+ } \
+}
+
+
+/*---------------------------------------------*/
+#define bsNEEDW(nz) \
+{ \
+ while (bsLive >= 8) { \
+ fputc ( (UChar)(bsBuff >> 24), \
+ bsStream ); \
+ bsBuff <<= 8; \
+ bsLive -= 8; \
+ bytesOut++; \
+ } \
+}
+
+
+/*---------------------------------------------*/
+#define bsR1(vz) \
+{ \
+ bsNEEDR(1); \
+ vz = (bsBuff >> (bsLive-1)) & 1; \
+ bsLive--; \
+}
+
+
+/*---------------------------------------------*/
+INLINE UInt32 bsR ( Int32 n )
+{
+ UInt32 v;
+ bsNEEDR ( n );
+ v = (bsBuff >> (bsLive-n)) & ((1 << n)-1);
+ bsLive -= n;
+ return v;
+}
+
+
+/*---------------------------------------------*/
+INLINE void bsW ( Int32 n, UInt32 v )
+{
+ bsNEEDW ( n );
+ bsBuff |= (v << (32 - bsLive - n));
+ bsLive += n;
+}
+
+
+/*---------------------------------------------*/
+UChar bsGetUChar ( void )
+{
+ return (UChar)bsR(8);
+}
+
+
+/*---------------------------------------------*/
+void bsPutUChar ( UChar c )
+{
+ bsW(8, (UInt32)c );
+}
+
+
+/*---------------------------------------------*/
+Int32 bsGetUInt32 ( void )
+{
+ UInt32 u;
+ u = 0;
+ u = (u << 8) | bsR(8);
+ u = (u << 8) | bsR(8);
+ u = (u << 8) | bsR(8);
+ u = (u << 8) | bsR(8);
+ return u;
+}
+
+
+/*---------------------------------------------*/
+UInt32 bsGetIntVS ( UInt32 numBits )
+{
+ return (UInt32)bsR(numBits);
+}
+
+
+/*---------------------------------------------*/
+UInt32 bsGetInt32 ( void )
+{
+ return (Int32)bsGetUInt32();
+}
+
+
+/*---------------------------------------------*/
+void bsPutUInt32 ( UInt32 u )
+{
+ bsW ( 8, (u >> 24) & 0xffL );
+ bsW ( 8, (u >> 16) & 0xffL );
+ bsW ( 8, (u >> 8) & 0xffL );
+ bsW ( 8, u & 0xffL );
+}
+
+
+/*---------------------------------------------*/
+void bsPutInt32 ( Int32 c )
+{
+ bsPutUInt32 ( (UInt32)c );
+}
+
+
+/*---------------------------------------------*/
+void bsPutIntVS ( Int32 numBits, UInt32 c )
+{
+ bsW ( numBits, c );
+}
+
+
+/*---------------------------------------------------*/
+/*--- Huffman coding low-level stuff ---*/
+/*---------------------------------------------------*/
+
+#define WEIGHTOF(zz0) ((zz0) & 0xffffff00)
+#define DEPTHOF(zz1) ((zz1) & 0x000000ff)
+#define MYMAX(zz2,zz3) ((zz2) > (zz3) ? (zz2) : (zz3))
+
+#define ADDWEIGHTS(zw1,zw2) \
+ (WEIGHTOF(zw1)+WEIGHTOF(zw2)) | \
+ (1 + MYMAX(DEPTHOF(zw1),DEPTHOF(zw2)))
+
+#define UPHEAP(z) \
+{ \
+ Int32 zz, tmp; \
+ zz = z; tmp = heap[zz]; \
+ while (weight[tmp] < weight[heap[zz >> 1]]) { \
+ heap[zz] = heap[zz >> 1]; \
+ zz >>= 1; \
+ } \
+ heap[zz] = tmp; \
+}
+
+#define DOWNHEAP(z) \
+{ \
+ Int32 zz, yy, tmp; \
+ zz = z; tmp = heap[zz]; \
+ while (True) { \
+ yy = zz << 1; \
+ if (yy > nHeap) break; \
+ if (yy < nHeap && \
+ weight[heap[yy+1]] < weight[heap[yy]]) \
+ yy++; \
+ if (weight[tmp] < weight[heap[yy]]) break; \
+ heap[zz] = heap[yy]; \
+ zz = yy; \
+ } \
+ heap[zz] = tmp; \
+}
+
+
+/*---------------------------------------------*/
+void hbMakeCodeLengths ( UChar *len,
+ Int32 *freq,
+ Int32 alphaSize,
+ Int32 maxLen )
+{
+ /*--
+ Nodes and heap entries run from 1. Entry 0
+ for both the heap and nodes is a sentinel.
+ --*/
+ Int32 nNodes, nHeap, n1, n2, i, j, k;
+ Bool tooLong;
+
+ Int32 heap [ MAX_ALPHA_SIZE + 2 ];
+ Int32 weight [ MAX_ALPHA_SIZE * 2 ];
+ Int32 parent [ MAX_ALPHA_SIZE * 2 ];
+
+ for (i = 0; i < alphaSize; i++)
+ weight[i+1] = (freq[i] == 0 ? 1 : freq[i]) << 8;
+
+ while (True) {
+
+ nNodes = alphaSize;
+ nHeap = 0;
+
+ heap[0] = 0;
+ weight[0] = 0;
+ parent[0] = -2;
+
+ for (i = 1; i <= alphaSize; i++) {
+ parent[i] = -1;
+ nHeap++;
+ heap[nHeap] = i;
+ UPHEAP(nHeap);
+ }
+ if (!(nHeap < (MAX_ALPHA_SIZE+2)))
+ panic ( "hbMakeCodeLengths(1)" );
+
+ while (nHeap > 1) {
+ n1 = heap[1]; heap[1] = heap[nHeap]; nHeap--; DOWNHEAP(1);
+ n2 = heap[1]; heap[1] = heap[nHeap]; nHeap--; DOWNHEAP(1);
+ nNodes++;
+ parent[n1] = parent[n2] = nNodes;
+ weight[nNodes] = ADDWEIGHTS(weight[n1], weight[n2]);
+ parent[nNodes] = -1;
+ nHeap++;
+ heap[nHeap] = nNodes;
+ UPHEAP(nHeap);
+ }
+ if (!(nNodes < (MAX_ALPHA_SIZE * 2)))
+ panic ( "hbMakeCodeLengths(2)" );
+
+ tooLong = False;
+ for (i = 1; i <= alphaSize; i++) {
+ j = 0;
+ k = i;
+ while (parent[k] >= 0) { k = parent[k]; j++; }
+ len[i-1] = j;
+ if (j > maxLen) tooLong = True;
+ }
+
+ if (! tooLong) break;
+
+ for (i = 1; i < alphaSize; i++) {
+ j = weight[i] >> 8;
+ j = 1 + (j / 2);
+ weight[i] = j << 8;
+ }
+ }
+}
+
+
+/*---------------------------------------------*/
+void hbAssignCodes ( Int32 *code,
+ UChar *length,
+ Int32 minLen,
+ Int32 maxLen,
+ Int32 alphaSize )
+{
+ Int32 n, vec, i;
+
+ vec = 0;
+ for (n = minLen; n <= maxLen; n++) {
+ for (i = 0; i < alphaSize; i++)
+ if (length[i] == n) { code[i] = vec; vec++; };
+ vec <<= 1;
+ }
+}
+
+
+/*---------------------------------------------*/
+void hbCreateDecodeTables ( Int32 *limit,
+ Int32 *base,
+ Int32 *perm,
+ UChar *length,
+ Int32 minLen,
+ Int32 maxLen,
+ Int32 alphaSize )
+{
+ Int32 pp, i, j, vec;
+
+ pp = 0;
+ for (i = minLen; i <= maxLen; i++)
+ for (j = 0; j < alphaSize; j++)
+ if (length[j] == i) { perm[pp] = j; pp++; };
+
+ for (i = 0; i < MAX_CODE_LEN; i++) base[i] = 0;
+ for (i = 0; i < alphaSize; i++) base[length[i]+1]++;
+
+ for (i = 1; i < MAX_CODE_LEN; i++) base[i] += base[i-1];
+
+ for (i = 0; i < MAX_CODE_LEN; i++) limit[i] = 0;
+ vec = 0;
+
+ for (i = minLen; i <= maxLen; i++) {
+ vec += (base[i+1] - base[i]);
+ limit[i] = vec-1;
+ vec <<= 1;
+ }
+ for (i = minLen + 1; i <= maxLen; i++)
+ base[i] = ((limit[i-1] + 1) << 1) - base[i];
+}
+
+
+
+/*---------------------------------------------------*/
+/*--- Undoing the reversible transformation ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+#define SET_LL4(i,n) \
+ { if (((i) & 0x1) == 0) \
+ ll4[(i) >> 1] = (ll4[(i) >> 1] & 0xf0) | (n); else \
+ ll4[(i) >> 1] = (ll4[(i) >> 1] & 0x0f) | ((n) << 4); \
+ }
+
+#define GET_LL4(i) \
+ (((UInt32)(ll4[(i) >> 1])) >> (((i) << 2) & 0x4) & 0xF)
+
+#define SET_LL(i,n) \
+ { ll16[i] = (UInt16)(n & 0x0000ffff); \
+ SET_LL4(i, n >> 16); \
+ }
+
+#define GET_LL(i) \
+ (((UInt32)ll16[i]) | (GET_LL4(i) << 16))
+
+
+/*---------------------------------------------*/
+/*--
+ Manage memory for compression/decompression.
+ When compressing, a single block size applies to
+ all files processed, and that's set when the
+ program starts. But when decompressing, each file
+ processed could have been compressed with a
+ different block size, so we may have to free
+ and reallocate on a per-file basis.
+
+ A call with argument of zero means
+ `free up everything.' And a value of zero for
+ blockSize100k means no memory is currently allocated.
+--*/
+
+
+/*---------------------------------------------*/
+void allocateCompressStructures ( void )
+{
+ Int32 n = 100000 * blockSize100k;
+ block = malloc ( (n + 1 + NUM_OVERSHOOT_BYTES) * sizeof(UChar) );
+ quadrant = malloc ( (n + NUM_OVERSHOOT_BYTES) * sizeof(Int16) );
+ zptr = malloc ( n * sizeof(Int32) );
+ ftab = malloc ( 65537 * sizeof(Int32) );
+
+ if (block == NULL || quadrant == NULL ||
+ zptr == NULL || ftab == NULL) {
+ Int32 totalDraw
+ = (n + 1 + NUM_OVERSHOOT_BYTES) * sizeof(UChar) +
+ (n + NUM_OVERSHOOT_BYTES) * sizeof(Int16) +
+ n * sizeof(Int32) +
+ 65537 * sizeof(Int32);
+
+ compressOutOfMemory ( totalDraw, n );
+ }
+
+ /*--
+ Since we want valid indexes for block of
+ -1 to n + NUM_OVERSHOOT_BYTES - 1
+ inclusive.
+ --*/
+ block++;
+
+ /*--
+ The back end needs a place to store the MTF values
+ whilst it calculates the coding tables. We could
+ put them in the zptr array. However, these values
+ will fit in a short, so we overlay szptr at the
+ start of zptr, in the hope of reducing the number
+ of cache misses induced by the multiple traversals
+ of the MTF values when calculating coding tables.
+ Seems to improve compression speed by about 1%.
+ --*/
+ szptr = (UInt16*)zptr;
+}
+
+
+/*---------------------------------------------*/
+void setDecompressStructureSizes ( Int32 newSize100k )
+{
+ if (! (0 <= newSize100k && newSize100k <= 9 &&
+ 0 <= blockSize100k && blockSize100k <= 9))
+ panic ( "setDecompressStructureSizes" );
+
+ if (newSize100k == blockSize100k) return;
+
+ blockSize100k = newSize100k;
+
+ if (ll16 != NULL) free ( ll16 );
+ if (ll4 != NULL) free ( ll4 );
+ if (ll8 != NULL) free ( ll8 );
+ if (tt != NULL) free ( tt );
+
+ if (newSize100k == 0) return;
+
+ if (smallMode) {
+
+ Int32 n = 100000 * newSize100k;
+ ll16 = malloc ( n * sizeof(UInt16) );
+ ll4 = malloc ( ((n+1) >> 1) * sizeof(UChar) );
+
+ if (ll4 == NULL || ll16 == NULL) {
+ Int32 totalDraw
+ = n * sizeof(Int16) + ((n+1) >> 1) * sizeof(UChar);
+ uncompressOutOfMemory ( totalDraw, n );
+ }
+
+ } else {
+
+ Int32 n = 100000 * newSize100k;
+ ll8 = malloc ( n * sizeof(UChar) );
+ tt = malloc ( n * sizeof(Int32) );
+
+ if (ll8 == NULL || tt == NULL) {
+ Int32 totalDraw
+ = n * sizeof(UChar) + n * sizeof(UInt32);
+ uncompressOutOfMemory ( totalDraw, n );
+ }
+
+ }
+}
+
+
+
+/*---------------------------------------------------*/
+/*--- The new back end ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+void makeMaps ( void )
+{
+ Int32 i;
+ nInUse = 0;
+ for (i = 0; i < 256; i++)
+ if (inUse[i]) {
+ seqToUnseq[nInUse] = i;
+ unseqToSeq[i] = nInUse;
+ nInUse++;
+ }
+}
+
+
+/*---------------------------------------------*/
+void generateMTFValues ( void )
+{
+ UChar yy[256];
+ Int32 i, j;
+ UChar tmp;
+ UChar tmp2;
+ Int32 zPend;
+ Int32 wr;
+ Int32 EOB;
+
+ makeMaps();
+ EOB = nInUse+1;
+
+ for (i = 0; i <= EOB; i++) mtfFreq[i] = 0;
+
+ wr = 0;
+ zPend = 0;
+ for (i = 0; i < nInUse; i++) yy[i] = (UChar) i;
+
+
+ for (i = 0; i <= last; i++) {
+ UChar ll_i;
+
+ #if DEBUG
+ assert (wr <= i);
+ #endif
+
+ ll_i = unseqToSeq[block[zptr[i] - 1]];
+ #if DEBUG
+ assert (ll_i < nInUse);
+ #endif
+
+ j = 0;
+ tmp = yy[j];
+ while ( ll_i != tmp ) {
+ j++;
+ tmp2 = tmp;
+ tmp = yy[j];
+ yy[j] = tmp2;
+ };
+ yy[0] = tmp;
+
+ if (j == 0) {
+ zPend++;
+ } else {
+ if (zPend > 0) {
+ zPend--;
+ while (True) {
+ switch (zPend % 2) {
+ case 0: szptr[wr] = RUNA; wr++; mtfFreq[RUNA]++; break;
+ case 1: szptr[wr] = RUNB; wr++; mtfFreq[RUNB]++; break;
+ };
+ if (zPend < 2) break;
+ zPend = (zPend - 2) / 2;
+ };
+ zPend = 0;
+ }
+ szptr[wr] = j+1; wr++; mtfFreq[j+1]++;
+ }
+ }
+
+ if (zPend > 0) {
+ zPend--;
+ while (True) {
+ switch (zPend % 2) {
+ case 0: szptr[wr] = RUNA; wr++; mtfFreq[RUNA]++; break;
+ case 1: szptr[wr] = RUNB; wr++; mtfFreq[RUNB]++; break;
+ };
+ if (zPend < 2) break;
+ zPend = (zPend - 2) / 2;
+ };
+ }
+
+ szptr[wr] = EOB; wr++; mtfFreq[EOB]++;
+
+ nMTF = wr;
+}
+
+
+/*---------------------------------------------*/
+#define LESSER_ICOST 0
+#define GREATER_ICOST 15
+
+void sendMTFValues ( void )
+{
+ Int32 v, t, i, j, gs, ge, totc, bt, bc, iter;
+ Int32 nSelectors, alphaSize, minLen, maxLen, selCtr;
+ Int32 nGroups, nBytes;
+
+ /*--
+ UChar len [N_GROUPS][MAX_ALPHA_SIZE];
+ is a global since the decoder also needs it.
+
+ Int32 code[N_GROUPS][MAX_ALPHA_SIZE];
+ Int32 rfreq[N_GROUPS][MAX_ALPHA_SIZE];
+ are also globals only used in this proc.
+ Made global to keep stack frame size small.
+ --*/
+
+
+ UInt16 cost[N_GROUPS];
+ Int32 fave[N_GROUPS];
+
+ if (verbosity >= 3)
+ fprintf ( stderr,
+ " %d in block, %d after MTF & 1-2 coding, %d+2 syms in use\n",
+ last+1, nMTF, nInUse );
+
+ alphaSize = nInUse+2;
+ for (t = 0; t < N_GROUPS; t++)
+ for (v = 0; v < alphaSize; v++)
+ len[t][v] = GREATER_ICOST;
+
+ /*--- Decide how many coding tables to use ---*/
+ if (nMTF <= 0) panic ( "sendMTFValues(0)" );
+ if (nMTF < 200) nGroups = 2; else
+ if (nMTF < 800) nGroups = 4; else
+ nGroups = 6;
+
+ /*--- Generate an initial set of coding tables ---*/
+ {
+ Int32 nPart, remF, tFreq, aFreq;
+
+ nPart = nGroups;
+ remF = nMTF;
+ gs = 0;
+ while (nPart > 0) {
+ tFreq = remF / nPart;
+ ge = gs-1;
+ aFreq = 0;
+ while (aFreq < tFreq && ge < alphaSize-1) {
+ ge++;
+ aFreq += mtfFreq[ge];
+ }
+
+ if (ge > gs
+ && nPart != nGroups && nPart != 1
+ && ((nGroups-nPart) % 2 == 1)) {
+ aFreq -= mtfFreq[ge];
+ ge--;
+ }
+
+ if (verbosity >= 3)
+ fprintf ( stderr,
+ " initial group %d, [%d .. %d], has %d syms (%4.1f%%)\n",
+ nPart, gs, ge, aFreq,
+ (100.0 * (float)aFreq) / (float)nMTF );
+
+ for (v = 0; v < alphaSize; v++)
+ if (v >= gs && v <= ge)
+ len[nPart-1][v] = LESSER_ICOST; else
+ len[nPart-1][v] = GREATER_ICOST;
+
+ nPart--;
+ gs = ge+1;
+ remF -= aFreq;
+ }
+ }
+
+ /*---
+ Iterate up to N_ITERS times to improve the tables.
+ ---*/
+ for (iter = 0; iter < N_ITERS; iter++) {
+
+ for (t = 0; t < nGroups; t++) fave[t] = 0;
+
+ for (t = 0; t < nGroups; t++)
+ for (v = 0; v < alphaSize; v++)
+ rfreq[t][v] = 0;
+
+ nSelectors = 0;
+ totc = 0;
+ gs = 0;
+ while (True) {
+
+ /*--- Set group start & end marks. --*/
+ if (gs >= nMTF) break;
+ ge = gs + G_SIZE - 1;
+ if (ge >= nMTF) ge = nMTF-1;
+
+ /*--
+ Calculate the cost of this group as coded
+ by each of the coding tables.
+ --*/
+ for (t = 0; t < nGroups; t++) cost[t] = 0;
+
+ if (nGroups == 6) {
+ register UInt16 cost0, cost1, cost2, cost3, cost4, cost5;
+ cost0 = cost1 = cost2 = cost3 = cost4 = cost5 = 0;
+ for (i = gs; i <= ge; i++) {
+ UInt16 icv = szptr[i];
+ cost0 += len[0][icv];
+ cost1 += len[1][icv];
+ cost2 += len[2][icv];
+ cost3 += len[3][icv];
+ cost4 += len[4][icv];
+ cost5 += len[5][icv];
+ }
+ cost[0] = cost0; cost[1] = cost1; cost[2] = cost2;
+ cost[3] = cost3; cost[4] = cost4; cost[5] = cost5;
+ } else {
+ for (i = gs; i <= ge; i++) {
+ UInt16 icv = szptr[i];
+ for (t = 0; t < nGroups; t++) cost[t] += len[t][icv];
+ }
+ }
+
+ /*--
+ Find the coding table which is best for this group,
+ and record its identity in the selector table.
+ --*/
+ bc = 999999999; bt = -1;
+ for (t = 0; t < nGroups; t++)
+ if (cost[t] < bc) { bc = cost[t]; bt = t; };
+ totc += bc;
+ fave[bt]++;
+ selector[nSelectors] = bt;
+ nSelectors++;
+
+ /*--
+ Increment the symbol frequencies for the selected table.
+ --*/
+ for (i = gs; i <= ge; i++)
+ rfreq[bt][ szptr[i] ]++;
+
+ gs = ge+1;
+ }
+ if (verbosity >= 3) {
+ fprintf ( stderr,
+ " pass %d: size is %d, grp uses are ",
+ iter+1, totc/8 );
+ for (t = 0; t < nGroups; t++)
+ fprintf ( stderr, "%d ", fave[t] );
+ fprintf ( stderr, "\n" );
+ }
+
+ /*--
+ Recompute the tables based on the accumulated frequencies.
+ --*/
+ for (t = 0; t < nGroups; t++)
+ hbMakeCodeLengths ( &len[t][0], &rfreq[t][0], alphaSize, 20 );
+ }
+
+
+ if (!(nGroups < 8)) panic ( "sendMTFValues(1)" );
+ if (!(nSelectors < 32768 &&
+ nSelectors <= (2 + (900000 / G_SIZE))))
+ panic ( "sendMTFValues(2)" );
+
+
+ /*--- Compute MTF values for the selectors. ---*/
+ {
+ UChar pos[N_GROUPS], ll_i, tmp2, tmp;
+ for (i = 0; i < nGroups; i++) pos[i] = i;
+ for (i = 0; i < nSelectors; i++) {
+ ll_i = selector[i];
+ j = 0;
+ tmp = pos[j];
+ while ( ll_i != tmp ) {
+ j++;
+ tmp2 = tmp;
+ tmp = pos[j];
+ pos[j] = tmp2;
+ };
+ pos[0] = tmp;
+ selectorMtf[i] = j;
+ }
+ };
+
+ /*--- Assign actual codes for the tables. --*/
+ for (t = 0; t < nGroups; t++) {
+ minLen = 32;
+ maxLen = 0;
+ for (i = 0; i < alphaSize; i++) {
+ if (len[t][i] > maxLen) maxLen = len[t][i];
+ if (len[t][i] < minLen) minLen = len[t][i];
+ }
+ if (maxLen > 20) panic ( "sendMTFValues(3)" );
+ if (minLen < 1) panic ( "sendMTFValues(4)" );
+ hbAssignCodes ( &code[t][0], &len[t][0],
+ minLen, maxLen, alphaSize );
+ }
+
+ /*--- Transmit the mapping table. ---*/
+ {
+ Bool inUse16[16];
+ for (i = 0; i < 16; i++) {
+ inUse16[i] = False;
+ for (j = 0; j < 16; j++)
+ if (inUse[i * 16 + j]) inUse16[i] = True;
+ }
+
+ nBytes = bytesOut;
+ for (i = 0; i < 16; i++)
+ if (inUse16[i]) bsW(1,1); else bsW(1,0);
+
+ for (i = 0; i < 16; i++)
+ if (inUse16[i])
+ for (j = 0; j < 16; j++)
+ if (inUse[i * 16 + j]) bsW(1,1); else bsW(1,0);
+
+ if (verbosity >= 3)
+ fprintf ( stderr, " bytes: mapping %d, ", bytesOut-nBytes );
+ }
+
+ /*--- Now the selectors. ---*/
+ nBytes = bytesOut;
+ bsW ( 3, nGroups );
+ bsW ( 15, nSelectors );
+ for (i = 0; i < nSelectors; i++) {
+ for (j = 0; j < selectorMtf[i]; j++) bsW(1,1);
+ bsW(1,0);
+ }
+ if (verbosity >= 3)
+ fprintf ( stderr, "selectors %d, ", bytesOut-nBytes );
+
+ /*--- Now the coding tables. ---*/
+ nBytes = bytesOut;
+
+ for (t = 0; t < nGroups; t++) {
+ Int32 curr = len[t][0];
+ bsW ( 5, curr );
+ for (i = 0; i < alphaSize; i++) {
+ while (curr < len[t][i]) { bsW(2,2); curr++; /* 10 */ };
+ while (curr > len[t][i]) { bsW(2,3); curr--; /* 11 */ };
+ bsW ( 1, 0 );
+ }
+ }
+
+ if (verbosity >= 3)
+ fprintf ( stderr, "code lengths %d, ", bytesOut-nBytes );
+
+ /*--- And finally, the block data proper ---*/
+ nBytes = bytesOut;
+ selCtr = 0;
+ gs = 0;
+ while (True) {
+ if (gs >= nMTF) break;
+ ge = gs + G_SIZE - 1;
+ if (ge >= nMTF) ge = nMTF-1;
+ for (i = gs; i <= ge; i++) {
+ #if DEBUG
+ assert (selector[selCtr] < nGroups);
+ #endif
+ bsW ( len [selector[selCtr]] [szptr[i]],
+ code [selector[selCtr]] [szptr[i]] );
+ }
+
+ gs = ge+1;
+ selCtr++;
+ }
+ if (!(selCtr == nSelectors)) panic ( "sendMTFValues(5)" );
+
+ if (verbosity >= 3)
+ fprintf ( stderr, "codes %d\n", bytesOut-nBytes );
+}
+
+
+/*---------------------------------------------*/
+void moveToFrontCodeAndSend ( void )
+{
+ bsPutIntVS ( 24, origPtr );
+ generateMTFValues();
+ sendMTFValues();
+}
+
+
+/*---------------------------------------------*/
+void recvDecodingTables ( void )
+{
+ Int32 i, j, t, nGroups, nSelectors, alphaSize;
+ Int32 minLen, maxLen;
+ Bool inUse16[16];
+
+ /*--- Receive the mapping table ---*/
+ for (i = 0; i < 16; i++)
+ if (bsR(1) == 1)
+ inUse16[i] = True; else
+ inUse16[i] = False;
+
+ for (i = 0; i < 256; i++) inUse[i] = False;
+
+ for (i = 0; i < 16; i++)
+ if (inUse16[i])
+ for (j = 0; j < 16; j++)
+ if (bsR(1) == 1) inUse[i * 16 + j] = True;
+
+ makeMaps();
+ alphaSize = nInUse+2;
+
+ /*--- Now the selectors ---*/
+ nGroups = bsR ( 3 );
+ nSelectors = bsR ( 15 );
+ for (i = 0; i < nSelectors; i++) {
+ j = 0;
+ while (bsR(1) == 1) j++;
+ selectorMtf[i] = j;
+ }
+
+ /*--- Undo the MTF values for the selectors. ---*/
+ {
+ UChar pos[N_GROUPS], tmp, v;
+ for (v = 0; v < nGroups; v++) pos[v] = v;
+
+ for (i = 0; i < nSelectors; i++) {
+ v = selectorMtf[i];
+ tmp = pos[v];
+ while (v > 0) { pos[v] = pos[v-1]; v--; }
+ pos[0] = tmp;
+ selector[i] = tmp;
+ }
+ }
+
+ /*--- Now the coding tables ---*/
+ for (t = 0; t < nGroups; t++) {
+ Int32 curr = bsR ( 5 );
+ for (i = 0; i < alphaSize; i++) {
+ while (bsR(1) == 1) {
+ if (bsR(1) == 0) curr++; else curr--;
+ }
+ len[t][i] = curr;
+ }
+ }
+
+ /*--- Create the Huffman decoding tables ---*/
+ for (t = 0; t < nGroups; t++) {
+ minLen = 32;
+ maxLen = 0;
+ for (i = 0; i < alphaSize; i++) {
+ if (len[t][i] > maxLen) maxLen = len[t][i];
+ if (len[t][i] < minLen) minLen = len[t][i];
+ }
+ hbCreateDecodeTables (
+ &limit[t][0], &base[t][0], &perm[t][0], &len[t][0],
+ minLen, maxLen, alphaSize
+ );
+ minLens[t] = minLen;
+ }
+}
+
+
+/*---------------------------------------------*/
+#define GET_MTF_VAL(lval) \
+{ \
+ Int32 zt, zn, zvec, zj; \
+ if (groupPos == 0) { \
+ groupNo++; \
+ groupPos = G_SIZE; \
+ } \
+ groupPos--; \
+ zt = selector[groupNo]; \
+ zn = minLens[zt]; \
+ zvec = bsR ( zn ); \
+ while (zvec > limit[zt][zn]) { \
+ zn++; bsR1(zj); \
+ zvec = (zvec << 1) | zj; \
+ }; \
+ lval = perm[zt][zvec - base[zt][zn]]; \
+}
+
+
+/*---------------------------------------------*/
+void getAndMoveToFrontDecode ( void )
+{
+ UChar yy[256];
+ Int32 i, j, nextSym, limitLast;
+ Int32 EOB, groupNo, groupPos;
+
+ limitLast = 100000 * blockSize100k;
+ origPtr = bsGetIntVS ( 24 );
+
+ recvDecodingTables();
+ EOB = nInUse+1;
+ groupNo = -1;
+ groupPos = 0;
+
+ /*--
+ Setting up the unzftab entries here is not strictly
+ necessary, but it does save having to do it later
+ in a separate pass, and so saves a block's worth of
+ cache misses.
+ --*/
+ for (i = 0; i <= 255; i++) unzftab[i] = 0;
+
+ for (i = 0; i <= 255; i++) yy[i] = (UChar) i;
+
+ last = -1;
+
+ GET_MTF_VAL(nextSym);
+
+ while (True) {
+
+ if (nextSym == EOB) break;
+
+ if (nextSym == RUNA || nextSym == RUNB) {
+ UChar ch;
+ Int32 s = -1;
+ Int32 N = 1;
+ do {
+ if (nextSym == RUNA) s = s + (0+1) * N; else
+ if (nextSym == RUNB) s = s + (1+1) * N;
+ N = N * 2;
+ GET_MTF_VAL(nextSym);
+ }
+ while (nextSym == RUNA || nextSym == RUNB);
+
+ s++;
+ ch = seqToUnseq[yy[0]];
+ unzftab[ch] += s;
+
+ if (smallMode)
+ while (s > 0) {
+ last++;
+ ll16[last] = ch;
+ s--;
+ }
+ else
+ while (s > 0) {
+ last++;
+ ll8[last] = ch;
+ s--;
+ };
+
+ if (last >= limitLast) blockOverrun();
+ continue;
+
+ } else {
+
+ UChar tmp;
+ last++; if (last >= limitLast) blockOverrun();
+
+ tmp = yy[nextSym-1];
+ unzftab[seqToUnseq[tmp]]++;
+ if (smallMode)
+ ll16[last] = seqToUnseq[tmp]; else
+ ll8[last] = seqToUnseq[tmp];
+
+ /*--
+ This loop is hammered during decompression,
+ hence the unrolling.
+
+ for (j = nextSym-1; j > 0; j--) yy[j] = yy[j-1];
+ --*/
+
+ j = nextSym-1;
+ for (; j > 3; j -= 4) {
+ yy[j] = yy[j-1];
+ yy[j-1] = yy[j-2];
+ yy[j-2] = yy[j-3];
+ yy[j-3] = yy[j-4];
+ }
+ for (; j > 0; j--) yy[j] = yy[j-1];
+
+ yy[0] = tmp;
+ GET_MTF_VAL(nextSym);
+ continue;
+ }
+ }
+}
+
+
+/*---------------------------------------------------*/
+/*--- Block-sorting machinery ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+/*--
+ Compare two strings in block. We assume (see
+ discussion above) that i1 and i2 have a max
+ offset of 10 on entry, and that the first
+ bytes of both block and quadrant have been
+ copied into the "overshoot area", ie
+ into the subscript range
+ [last+1 .. last+NUM_OVERSHOOT_BYTES].
+--*/
+INLINE Bool fullGtU ( Int32 i1, Int32 i2 )
+{
+ Int32 k;
+ UChar c1, c2;
+ UInt16 s1, s2;
+
+ #if DEBUG
+ /*--
+ shellsort shouldn't ask to compare
+ something with itself.
+ --*/
+ assert (i1 != i2);
+ #endif
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ i1++; i2++;
+
+ k = last + 1;
+
+ do {
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ s1 = quadrant[i1];
+ s2 = quadrant[i2];
+ if (s1 != s2) return (s1 > s2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ s1 = quadrant[i1];
+ s2 = quadrant[i2];
+ if (s1 != s2) return (s1 > s2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ s1 = quadrant[i1];
+ s2 = quadrant[i2];
+ if (s1 != s2) return (s1 > s2);
+ i1++; i2++;
+
+ c1 = block[i1];
+ c2 = block[i2];
+ if (c1 != c2) return (c1 > c2);
+ s1 = quadrant[i1];
+ s2 = quadrant[i2];
+ if (s1 != s2) return (s1 > s2);
+ i1++; i2++;
+
+ if (i1 > last) { i1 -= last; i1--; };
+ if (i2 > last) { i2 -= last; i2--; };
+
+ k -= 4;
+ workDone++;
+ }
+ while (k >= 0);
+
+ return False;
+}
+
+/*---------------------------------------------*/
+/*--
+ Knuth's increments seem to work better
+ than Incerpi-Sedgewick here. Possibly
+ because the number of elems to sort is
+ usually small, typically <= 20.
+--*/
+Int32 incs[14] = { 1, 4, 13, 40, 121, 364, 1093, 3280,
+ 9841, 29524, 88573, 265720,
+ 797161, 2391484 };
+
+void simpleSort ( Int32 lo, Int32 hi, Int32 d )
+{
+ Int32 i, j, h, bigN, hp;
+ Int32 v;
+
+ bigN = hi - lo + 1;
+ if (bigN < 2) return;
+
+ hp = 0;
+ while (incs[hp] < bigN) hp++;
+ hp--;
+
+ for (; hp >= 0; hp--) {
+ h = incs[hp];
+ if (verbosity >= 5)
+ fprintf ( stderr, " shell increment %d\n", h );
+
+ i = lo + h;
+ while (True) {
+
+ /*-- copy 1 --*/
+ if (i > hi) break;
+ v = zptr[i];
+ j = i;
+ while ( fullGtU ( zptr[j-h]+d, v+d ) ) {
+ zptr[j] = zptr[j-h];
+ j = j - h;
+ if (j <= (lo + h - 1)) break;
+ }
+ zptr[j] = v;
+ i++;
+
+ /*-- copy 2 --*/
+ if (i > hi) break;
+ v = zptr[i];
+ j = i;
+ while ( fullGtU ( zptr[j-h]+d, v+d ) ) {
+ zptr[j] = zptr[j-h];
+ j = j - h;
+ if (j <= (lo + h - 1)) break;
+ }
+ zptr[j] = v;
+ i++;
+
+ /*-- copy 3 --*/
+ if (i > hi) break;
+ v = zptr[i];
+ j = i;
+ while ( fullGtU ( zptr[j-h]+d, v+d ) ) {
+ zptr[j] = zptr[j-h];
+ j = j - h;
+ if (j <= (lo + h - 1)) break;
+ }
+ zptr[j] = v;
+ i++;
+
+ if (workDone > workLimit && firstAttempt) return;
+ }
+ }
+}
+
+
+/*---------------------------------------------*/
+/*--
+ The following is an implementation of
+ an elegant 3-way quicksort for strings,
+ described in a paper "Fast Algorithms for
+ Sorting and Searching Strings", by Robert
+ Sedgewick and Jon L. Bentley.
+--*/
+
+#define swap(lv1, lv2) \
+ { Int32 tmp = lv1; lv1 = lv2; lv2 = tmp; }
+
+INLINE void vswap ( Int32 p1, Int32 p2, Int32 n )
+{
+ while (n > 0) {
+ swap(zptr[p1], zptr[p2]);
+ p1++; p2++; n--;
+ }
+}
+
+INLINE UChar med3 ( UChar a, UChar b, UChar c )
+{
+ UChar t;
+ if (a > b) { t = a; a = b; b = t; };
+ if (b > c) { t = b; b = c; c = t; };
+ if (a > b) b = a;
+ return b;
+}
+
+
+#define min(a,b) ((a) < (b)) ? (a) : (b)
+
+typedef
+ struct { Int32 ll; Int32 hh; Int32 dd; }
+ StackElem;
+
+#define push(lz,hz,dz) { stack[sp].ll = lz; \
+ stack[sp].hh = hz; \
+ stack[sp].dd = dz; \
+ sp++; }
+
+#define pop(lz,hz,dz) { sp--; \
+ lz = stack[sp].ll; \
+ hz = stack[sp].hh; \
+ dz = stack[sp].dd; }
+
+#define SMALL_THRESH 20
+#define DEPTH_THRESH 10
+
+/*--
+ If you are ever unlucky/improbable enough
+ to get a stack overflow whilst sorting,
+ increase the following constant and try
+ again. In practice I have never seen the
+ stack go above 27 elems, so the following
+ limit seems very generous.
+--*/
+#define QSORT_STACK_SIZE 1000
+
+
+void qSort3 ( Int32 loSt, Int32 hiSt, Int32 dSt )
+{
+ Int32 unLo, unHi, ltLo, gtHi, med, n, m;
+ Int32 sp, lo, hi, d;
+ StackElem stack[QSORT_STACK_SIZE];
+
+ sp = 0;
+ push ( loSt, hiSt, dSt );
+
+ while (sp > 0) {
+
+ if (sp >= QSORT_STACK_SIZE) panic ( "stack overflow in qSort3" );
+
+ pop ( lo, hi, d );
+
+ if (hi - lo < SMALL_THRESH || d > DEPTH_THRESH) {
+ simpleSort ( lo, hi, d );
+ if (workDone > workLimit && firstAttempt) return;
+ continue;
+ }
+
+ med = med3 ( block[zptr[ lo ]+d],
+ block[zptr[ hi ]+d],
+ block[zptr[ (lo+hi)>>1 ]+d] );
+
+ unLo = ltLo = lo;
+ unHi = gtHi = hi;
+
+ while (True) {
+ while (True) {
+ if (unLo > unHi) break;
+ n = ((Int32)block[zptr[unLo]+d]) - med;
+ if (n == 0) { swap(zptr[unLo], zptr[ltLo]); ltLo++; unLo++; continue; };
+ if (n > 0) break;
+ unLo++;
+ }
+ while (True) {
+ if (unLo > unHi) break;
+ n = ((Int32)block[zptr[unHi]+d]) - med;
+ if (n == 0) { swap(zptr[unHi], zptr[gtHi]); gtHi--; unHi--; continue; };
+ if (n < 0) break;
+ unHi--;
+ }
+ if (unLo > unHi) break;
+ swap(zptr[unLo], zptr[unHi]); unLo++; unHi--;
+ }
+ #if DEBUG
+ assert (unHi == unLo-1);
+ #endif
+
+ if (gtHi < ltLo) {
+ push(lo, hi, d+1 );
+ continue;
+ }
+
+ n = min(ltLo-lo, unLo-ltLo); vswap(lo, unLo-n, n);
+ m = min(hi-gtHi, gtHi-unHi); vswap(unLo, hi-m+1, m);
+
+ n = lo + unLo - ltLo - 1;
+ m = hi - (gtHi - unHi) + 1;
+
+ push ( lo, n, d );
+ push ( n+1, m-1, d+1 );
+ push ( m, hi, d );
+ }
+}
+
+
+/*---------------------------------------------*/
+
+#define BIGFREQ(b) (ftab[((b)+1) << 8] - ftab[(b) << 8])
+
+#define SETMASK (1 << 21)
+#define CLEARMASK (~(SETMASK))
+
+void sortIt ( void )
+{
+ Int32 i, j, ss, sb;
+ Int32 runningOrder[256];
+ Int32 copy[256];
+ Bool bigDone[256];
+ UChar c1, c2;
+ Int32 numQSorted;
+
+ /*--
+ In the various block-sized structures, live data runs
+ from 0 to last+NUM_OVERSHOOT_BYTES inclusive. First,
+ set up the overshoot area for block.
+ --*/
+
+ if (verbosity >= 4) fprintf ( stderr, " sort initialise ...\n" );
+ for (i = 0; i < NUM_OVERSHOOT_BYTES; i++)
+ block[last+i+1] = block[i % (last+1)];
+ for (i = 0; i <= last+NUM_OVERSHOOT_BYTES; i++)
+ quadrant[i] = 0;
+
+ block[-1] = block[last];
+
+ if (last < 4000) {
+
+ /*--
+ Use simpleSort(), since the full sorting mechanism
+ has quite a large constant overhead.
+ --*/
+ if (verbosity >= 4) fprintf ( stderr, " simpleSort ...\n" );
+ for (i = 0; i <= last; i++) zptr[i] = i;
+ firstAttempt = False;
+ workDone = workLimit = 0;
+ simpleSort ( 0, last, 0 );
+ if (verbosity >= 4) fprintf ( stderr, " simpleSort done.\n" );
+
+ } else {
+
+ numQSorted = 0;
+ for (i = 0; i <= 255; i++) bigDone[i] = False;
+
+ if (verbosity >= 4) fprintf ( stderr, " bucket sorting ...\n" );
+
+ for (i = 0; i <= 65536; i++) ftab[i] = 0;
+
+ c1 = block[-1];
+ for (i = 0; i <= last; i++) {
+ c2 = block[i];
+ ftab[(c1 << 8) + c2]++;
+ c1 = c2;
+ }
+
+ for (i = 1; i <= 65536; i++) ftab[i] += ftab[i-1];
+
+ c1 = block[0];
+ for (i = 0; i < last; i++) {
+ c2 = block[i+1];
+ j = (c1 << 8) + c2;
+ c1 = c2;
+ ftab[j]--;
+ zptr[ftab[j]] = i;
+ }
+ j = (block[last] << 8) + block[0];
+ ftab[j]--;
+ zptr[ftab[j]] = last;
+
+ /*--
+ Now ftab contains the first loc of every small bucket.
+ Calculate the running order, from smallest to largest
+ big bucket.
+ --*/
+
+ for (i = 0; i <= 255; i++) runningOrder[i] = i;
+
+ {
+ Int32 vv;
+ Int32 h = 1;
+ do h = 3 * h + 1; while (h <= 256);
+ do {
+ h = h / 3;
+ for (i = h; i <= 255; i++) {
+ vv = runningOrder[i];
+ j = i;
+ while ( BIGFREQ(runningOrder[j-h]) > BIGFREQ(vv) ) {
+ runningOrder[j] = runningOrder[j-h];
+ j = j - h;
+ if (j <= (h - 1)) goto zero;
+ }
+ zero:
+ runningOrder[j] = vv;
+ }
+ } while (h != 1);
+ }
+
+ /*--
+ The main sorting loop.
+ --*/
+
+ for (i = 0; i <= 255; i++) {
+
+ /*--
+ Process big buckets, starting with the least full.
+ --*/
+ ss = runningOrder[i];
+
+ /*--
+ Complete the big bucket [ss] by quicksorting
+ any unsorted small buckets [ss, j]. Hopefully
+ previous pointer-scanning phases have already
+ completed many of the small buckets [ss, j], so
+ we don't have to sort them at all.
+ --*/
+ for (j = 0; j <= 255; j++) {
+ sb = (ss << 8) + j;
+ if ( ! (ftab[sb] & SETMASK) ) {
+ Int32 lo = ftab[sb] & CLEARMASK;
+ Int32 hi = (ftab[sb+1] & CLEARMASK) - 1;
+ if (hi > lo) {
+ if (verbosity >= 4)
+ fprintf ( stderr,
+ " qsort [0x%x, 0x%x] done %d this %d\n",
+ ss, j, numQSorted, hi - lo + 1 );
+ qSort3 ( lo, hi, 2 );
+ numQSorted += ( hi - lo + 1 );
+ if (workDone > workLimit && firstAttempt) return;
+ }
+ ftab[sb] |= SETMASK;
+ }
+ }
+
+ /*--
+ The ss big bucket is now done. Record this fact,
+ and update the quadrant descriptors. Remember to
+ update quadrants in the overshoot area too, if
+ necessary. The "if (i < 255)" test merely skips
+ this updating for the last bucket processed, since
+ updating for the last bucket is pointless.
+ --*/
+ bigDone[ss] = True;
+
+ if (i < 255) {
+ Int32 bbStart = ftab[ss << 8] & CLEARMASK;
+ Int32 bbSize = (ftab[(ss+1) << 8] & CLEARMASK) - bbStart;
+ Int32 shifts = 0;
+
+ while ((bbSize >> shifts) > 65534) shifts++;
+
+ for (j = 0; j < bbSize; j++) {
+ Int32 a2update = zptr[bbStart + j];
+ UInt16 qVal = (UInt16)(j >> shifts);
+ quadrant[a2update] = qVal;
+ if (a2update < NUM_OVERSHOOT_BYTES)
+ quadrant[a2update + last + 1] = qVal;
+ }
+
+ if (! ( ((bbSize-1) >> shifts) <= 65535 )) panic ( "sortIt" );
+ }
+
+ /*--
+ Now scan this big bucket so as to synthesise the
+ sorted order for small buckets [t, ss] for all t != ss.
+ --*/
+ for (j = 0; j <= 255; j++)
+ copy[j] = ftab[(j << 8) + ss] & CLEARMASK;
+
+ for (j = ftab[ss << 8] & CLEARMASK;
+ j < (ftab[(ss+1) << 8] & CLEARMASK);
+ j++) {
+ c1 = block[zptr[j]-1];
+ if ( ! bigDone[c1] ) {
+ zptr[copy[c1]] = zptr[j] == 0 ? last : zptr[j] - 1;
+ copy[c1] ++;
+ }
+ }
+
+ for (j = 0; j <= 255; j++) ftab[(j << 8) + ss] |= SETMASK;
+ }
+ if (verbosity >= 4)
+ fprintf ( stderr, " %d pointers, %d sorted, %d scanned\n",
+ last+1, numQSorted, (last+1) - numQSorted );
+ }
+}
+
+
+/*---------------------------------------------------*/
+/*--- Stuff for randomising repetitive blocks ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+Int32 rNums[512] = {
+ 619, 720, 127, 481, 931, 816, 813, 233, 566, 247,
+ 985, 724, 205, 454, 863, 491, 741, 242, 949, 214,
+ 733, 859, 335, 708, 621, 574, 73, 654, 730, 472,
+ 419, 436, 278, 496, 867, 210, 399, 680, 480, 51,
+ 878, 465, 811, 169, 869, 675, 611, 697, 867, 561,
+ 862, 687, 507, 283, 482, 129, 807, 591, 733, 623,
+ 150, 238, 59, 379, 684, 877, 625, 169, 643, 105,
+ 170, 607, 520, 932, 727, 476, 693, 425, 174, 647,
+ 73, 122, 335, 530, 442, 853, 695, 249, 445, 515,
+ 909, 545, 703, 919, 874, 474, 882, 500, 594, 612,
+ 641, 801, 220, 162, 819, 984, 589, 513, 495, 799,
+ 161, 604, 958, 533, 221, 400, 386, 867, 600, 782,
+ 382, 596, 414, 171, 516, 375, 682, 485, 911, 276,
+ 98, 553, 163, 354, 666, 933, 424, 341, 533, 870,
+ 227, 730, 475, 186, 263, 647, 537, 686, 600, 224,
+ 469, 68, 770, 919, 190, 373, 294, 822, 808, 206,
+ 184, 943, 795, 384, 383, 461, 404, 758, 839, 887,
+ 715, 67, 618, 276, 204, 918, 873, 777, 604, 560,
+ 951, 160, 578, 722, 79, 804, 96, 409, 713, 940,
+ 652, 934, 970, 447, 318, 353, 859, 672, 112, 785,
+ 645, 863, 803, 350, 139, 93, 354, 99, 820, 908,
+ 609, 772, 154, 274, 580, 184, 79, 626, 630, 742,
+ 653, 282, 762, 623, 680, 81, 927, 626, 789, 125,
+ 411, 521, 938, 300, 821, 78, 343, 175, 128, 250,
+ 170, 774, 972, 275, 999, 639, 495, 78, 352, 126,
+ 857, 956, 358, 619, 580, 124, 737, 594, 701, 612,
+ 669, 112, 134, 694, 363, 992, 809, 743, 168, 974,
+ 944, 375, 748, 52, 600, 747, 642, 182, 862, 81,
+ 344, 805, 988, 739, 511, 655, 814, 334, 249, 515,
+ 897, 955, 664, 981, 649, 113, 974, 459, 893, 228,
+ 433, 837, 553, 268, 926, 240, 102, 654, 459, 51,
+ 686, 754, 806, 760, 493, 403, 415, 394, 687, 700,
+ 946, 670, 656, 610, 738, 392, 760, 799, 887, 653,
+ 978, 321, 576, 617, 626, 502, 894, 679, 243, 440,
+ 680, 879, 194, 572, 640, 724, 926, 56, 204, 700,
+ 707, 151, 457, 449, 797, 195, 791, 558, 945, 679,
+ 297, 59, 87, 824, 713, 663, 412, 693, 342, 606,
+ 134, 108, 571, 364, 631, 212, 174, 643, 304, 329,
+ 343, 97, 430, 751, 497, 314, 983, 374, 822, 928,
+ 140, 206, 73, 263, 980, 736, 876, 478, 430, 305,
+ 170, 514, 364, 692, 829, 82, 855, 953, 676, 246,
+ 369, 970, 294, 750, 807, 827, 150, 790, 288, 923,
+ 804, 378, 215, 828, 592, 281, 565, 555, 710, 82,
+ 896, 831, 547, 261, 524, 462, 293, 465, 502, 56,
+ 661, 821, 976, 991, 658, 869, 905, 758, 745, 193,
+ 768, 550, 608, 933, 378, 286, 215, 979, 792, 961,
+ 61, 688, 793, 644, 986, 403, 106, 366, 905, 644,
+ 372, 567, 466, 434, 645, 210, 389, 550, 919, 135,
+ 780, 773, 635, 389, 707, 100, 626, 958, 165, 504,
+ 920, 176, 193, 713, 857, 265, 203, 50, 668, 108,
+ 645, 990, 626, 197, 510, 357, 358, 850, 858, 364,
+ 936, 638
+};
+
+
+#define RAND_DECLS \
+ Int32 rNToGo = 0; \
+ Int32 rTPos = 0; \
+
+#define RAND_MASK ((rNToGo == 1) ? 1 : 0)
+
+#define RAND_UPD_MASK \
+ if (rNToGo == 0) { \
+ rNToGo = rNums[rTPos]; \
+ rTPos++; if (rTPos == 512) rTPos = 0; \
+ } \
+ rNToGo--;
+
+
+
+/*---------------------------------------------------*/
+/*--- The Reversible Transformation (tm) ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+void randomiseBlock ( void )
+{
+ Int32 i;
+ RAND_DECLS;
+ for (i = 0; i < 256; i++) inUse[i] = False;
+
+ for (i = 0; i <= last; i++) {
+ RAND_UPD_MASK;
+ block[i] ^= RAND_MASK;
+ inUse[block[i]] = True;
+ }
+}
+
+
+/*---------------------------------------------*/
+void doReversibleTransformation ( void )
+{
+ Int32 i;
+
+ if (verbosity >= 2) fprintf ( stderr, "\n" );
+
+ workLimit = workFactor * last;
+ workDone = 0;
+ blockRandomised = False;
+ firstAttempt = True;
+
+ sortIt ();
+
+ if (verbosity >= 3)
+ fprintf ( stderr, " %d work, %d block, ratio %5.2f\n",
+ workDone, last, (float)workDone / (float)(last) );
+
+ if (workDone > workLimit && firstAttempt) {
+ if (verbosity >= 2)
+ fprintf ( stderr, " sorting aborted; randomising block\n" );
+ randomiseBlock ();
+ workLimit = workDone = 0;
+ blockRandomised = True;
+ firstAttempt = False;
+ sortIt();
+ if (verbosity >= 3)
+ fprintf ( stderr, " %d work, %d block, ratio %f\n",
+ workDone, last, (float)workDone / (float)(last) );
+ }
+
+ origPtr = -1;
+ for (i = 0; i <= last; i++)
+ if (zptr[i] == 0)
+ { origPtr = i; break; };
+
+ if (origPtr == -1) panic ( "doReversibleTransformation" );
+}
+
+
+/*---------------------------------------------*/
+
+INLINE Int32 indexIntoF ( Int32 indx, Int32 *cftab )
+{
+ Int32 nb, na, mid;
+ nb = 0;
+ na = 256;
+ do {
+ mid = (nb + na) >> 1;
+ if (indx >= cftab[mid]) nb = mid; else na = mid;
+ }
+ while (na - nb != 1);
+ return nb;
+}
+
+
+#define GET_SMALL(cccc) \
+ \
+ cccc = indexIntoF ( tPos, cftab ); \
+ tPos = GET_LL(tPos);
+
+
+void undoReversibleTransformation_small ( FILE* dst )
+{
+ Int32 cftab[257], cftabAlso[257];
+ Int32 i, j, tmp, tPos;
+ UChar ch;
+
+ /*--
+ We assume here that the global array unzftab will
+ already be holding the frequency counts for
+ ll8[0 .. last].
+ --*/
+
+ /*-- Set up cftab to facilitate generation of indexIntoF --*/
+ cftab[0] = 0;
+ for (i = 1; i <= 256; i++) cftab[i] = unzftab[i-1];
+ for (i = 1; i <= 256; i++) cftab[i] += cftab[i-1];
+
+ /*-- Make a copy of it, used in generation of T --*/
+ for (i = 0; i <= 256; i++) cftabAlso[i] = cftab[i];
+
+ /*-- compute the T vector --*/
+ for (i = 0; i <= last; i++) {
+ ch = (UChar)ll16[i];
+ SET_LL(i, cftabAlso[ch]);
+ cftabAlso[ch]++;
+ }
+
+ /*--
+ Compute T^(-1) by pointer reversal on T. This is rather
+ subtle, in that, if the original block was two or more
+ (in general, N) concatenated copies of the same thing,
+ the T vector will consist of N cycles, each of length
+ blocksize / N, and decoding will involve traversing one
+ of these cycles N times. Which particular cycle doesn't
+ matter -- they are all equivalent. The tricky part is to
+ make sure that the pointer reversal creates a correct
+ reversed cycle for us to traverse. So, the code below
+ simply reverses whatever cycle origPtr happens to fall into,
+ without regard to the cycle length. That gives one reversed
+ cycle, which for normal blocks, is the entire block-size long.
+ For repeated blocks, it will be interspersed with the other
+ N-1 non-reversed cycles. Providing that the F-subscripting
+ phase which follows starts at origPtr, all then works ok.
+ --*/
+ i = origPtr;
+ j = GET_LL(i);
+ do {
+ tmp = GET_LL(j);
+ SET_LL(j, i);
+ i = j;
+ j = tmp;
+ }
+ while (i != origPtr);
+
+ /*--
+ We recreate the original by subscripting F through T^(-1).
+ The run-length-decoder below requires characters incrementally,
+ so tPos is set to a starting value, and is updated by
+ the GET_SMALL macro.
+ --*/
+ tPos = origPtr;
+
+ /*-------------------------------------------------*/
+ /*--
+ This is pretty much a verbatim copy of the
+ run-length decoder present in the distribution
+ bzip-0.21; it has to be here to avoid creating
+ block[] as an intermediary structure. As in 0.21,
+ this code derives from some sent to me by
+ Christian von Roques.
+
+ It allows dst==NULL, so as to support the test (-t)
+ option without slowing down the fast decompression
+ code.
+ --*/
+ {
+ IntNative retVal;
+ Int32 i2, count, chPrev, ch2;
+ UInt32 localCrc;
+
+ count = 0;
+ i2 = 0;
+ ch2 = 256; /*-- not a char and not EOF --*/
+ localCrc = getGlobalCRC();
+
+ {
+ RAND_DECLS;
+ while ( i2 <= last ) {
+ chPrev = ch2;
+ GET_SMALL(ch2);
+ if (blockRandomised) {
+ RAND_UPD_MASK;
+ ch2 ^= (UInt32)RAND_MASK;
+ }
+ i2++;
+
+ if (dst)
+ retVal = putc ( ch2, dst );
+
+ UPDATE_CRC ( localCrc, (UChar)ch2 );
+
+ if (ch2 != chPrev) {
+ count = 1;
+ } else {
+ count++;
+ if (count >= 4) {
+ Int32 j2;
+ UChar z;
+ GET_SMALL(z);
+ if (blockRandomised) {
+ RAND_UPD_MASK;
+ z ^= RAND_MASK;
+ }
+ for (j2 = 0; j2 < (Int32)z; j2++) {
+ if (dst) retVal = putc (ch2, dst);
+ UPDATE_CRC ( localCrc, (UChar)ch2 );
+ }
+ i2++;
+ count = 0;
+ }
+ }
+ }
+ }
+
+ setGlobalCRC ( localCrc );
+ }
+ /*-- end of the in-line run-length-decoder. --*/
+}
+#undef GET_SMALL
+
+
+/*---------------------------------------------*/
+
+#define GET_FAST(cccc) \
+ \
+ cccc = ll8[tPos]; \
+ tPos = tt[tPos];
+
+
+void undoReversibleTransformation_fast ( FILE* dst )
+{
+ Int32 cftab[257];
+ Int32 i, tPos;
+ UChar ch;
+
+ /*--
+ We assume here that the global array unzftab will
+ already be holding the frequency counts for
+ ll8[0 .. last].
+ --*/
+
+ /*-- Set up cftab to facilitate generation of T^(-1) --*/
+ cftab[0] = 0;
+ for (i = 1; i <= 256; i++) cftab[i] = unzftab[i-1];
+ for (i = 1; i <= 256; i++) cftab[i] += cftab[i-1];
+
+ /*-- compute the T^(-1) vector --*/
+ for (i = 0; i <= last; i++) {
+ ch = (UChar)ll8[i];
+ tt[cftab[ch]] = i;
+ cftab[ch]++;
+ }
+
+ /*--
+ We recreate the original by subscripting L through T^(-1).
+ The run-length-decoder below requires characters incrementally,
+ so tPos is set to a starting value, and is updated by
+ the GET_FAST macro.
+ --*/
+ tPos = tt[origPtr];
+
+ /*-------------------------------------------------*/
+ /*--
+ This is pretty much a verbatim copy of the
+ run-length decoder present in the distribution
+ bzip-0.21; it has to be here to avoid creating
+ block[] as an intermediary structure. As in 0.21,
+ this code derives from some sent to me by
+ Christian von Roques.
+ --*/
+ {
+ IntNative retVal;
+ Int32 i2, count, chPrev, ch2;
+ UInt32 localCrc;
+
+ count = 0;
+ i2 = 0;
+ ch2 = 256; /*-- not a char and not EOF --*/
+ localCrc = getGlobalCRC();
+
+ if (blockRandomised) {
+ RAND_DECLS;
+ while ( i2 <= last ) {
+ chPrev = ch2;
+ GET_FAST(ch2);
+ RAND_UPD_MASK;
+ ch2 ^= (UInt32)RAND_MASK;
+ i2++;
+
+ retVal = putc ( ch2, dst );
+ UPDATE_CRC ( localCrc, (UChar)ch2 );
+
+ if (ch2 != chPrev) {
+ count = 1;
+ } else {
+ count++;
+ if (count >= 4) {
+ Int32 j2;
+ UChar z;
+ GET_FAST(z);
+ RAND_UPD_MASK;
+ z ^= RAND_MASK;
+ for (j2 = 0; j2 < (Int32)z; j2++) {
+ retVal = putc (ch2, dst);
+ UPDATE_CRC ( localCrc, (UChar)ch2 );
+ }
+ i2++;
+ count = 0;
+ }
+ }
+ }
+
+ } else {
+
+ while ( i2 <= last ) {
+ chPrev = ch2;
+ GET_FAST(ch2);
+ i2++;
+
+ retVal = putc ( ch2, dst );
+ UPDATE_CRC ( localCrc, (UChar)ch2 );
+
+ if (ch2 != chPrev) {
+ count = 1;
+ } else {
+ count++;
+ if (count >= 4) {
+ Int32 j2;
+ UChar z;
+ GET_FAST(z);
+ for (j2 = 0; j2 < (Int32)z; j2++) {
+ retVal = putc (ch2, dst);
+ UPDATE_CRC ( localCrc, (UChar)ch2 );
+ }
+ i2++;
+ count = 0;
+ }
+ }
+ }
+
+ } /*-- if (blockRandomised) --*/
+
+ setGlobalCRC ( localCrc );
+ }
+ /*-- end of the in-line run-length-decoder. --*/
+}
+#undef GET_FAST
+
+
+/*---------------------------------------------------*/
+/*--- The block loader and RLEr ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+/* Top 16: run length, 1 to 255.
+* Lower 16: the char, or MY_EOF for EOF.
+*/
+
+#define MY_EOF 257
+
+INLINE Int32 getRLEpair ( FILE* src )
+{
+ Int32 runLength;
+ IntNative ch, chLatest;
+
+ ch = getc ( src );
+
+ /*--- Because I have no idea what kind of a value EOF is. ---*/
+ if (ch == EOF) {
+ ERROR_IF_NOT_ZERO ( errno );
+ return (1 << 16) | MY_EOF;
+ }
+
+ runLength = 0;
+ do {
+ chLatest = getc ( src );
+ runLength++;
+ bytesIn++;
+ }
+ while (ch == chLatest && runLength < 255);
+
+ if ( chLatest != EOF ) {
+ if ( ungetc ( chLatest, src ) == EOF )
+ panic ( "getRLEpair: ungetc failed" );
+ } else {
+ ERROR_IF_NOT_ZERO ( errno );
+ }
+
+ /*--- Conditional is just a speedup hack. ---*/
+ if (runLength == 1) {
+ UPDATE_CRC ( globalCrc, (UChar)ch );
+ return (1 << 16) | ch;
+ } else {
+ Int32 i;
+ for (i = 1; i <= runLength; i++)
+ UPDATE_CRC ( globalCrc, (UChar)ch );
+ return (runLength << 16) | ch;
+ }
+}
+
+
+/*---------------------------------------------*/
+void loadAndRLEsource ( FILE* src )
+{
+ Int32 ch, allowableBlockSize, i;
+
+ last = -1;
+ ch = 0;
+
+ for (i = 0; i < 256; i++) inUse[i] = False;
+
+ /*--- 20 is just a paranoia constant ---*/
+ allowableBlockSize = 100000 * blockSize100k - 20;
+
+ while (last < allowableBlockSize && ch != MY_EOF) {
+ Int32 rlePair, runLen;
+ rlePair = getRLEpair ( src );
+ ch = rlePair & 0xFFFF;
+ runLen = (UInt32)rlePair >> 16;
+
+ #if DEBUG
+ assert (runLen >= 1 && runLen <= 255);
+ #endif
+
+ if (ch != MY_EOF) {
+ inUse[ch] = True;
+ switch (runLen) {
+ case 1:
+ last++; block[last] = (UChar)ch; break;
+ case 2:
+ last++; block[last] = (UChar)ch;
+ last++; block[last] = (UChar)ch; break;
+ case 3:
+ last++; block[last] = (UChar)ch;
+ last++; block[last] = (UChar)ch;
+ last++; block[last] = (UChar)ch; break;
+ default:
+ inUse[runLen-4] = True;
+ last++; block[last] = (UChar)ch;
+ last++; block[last] = (UChar)ch;
+ last++; block[last] = (UChar)ch;
+ last++; block[last] = (UChar)ch;
+ last++; block[last] = (UChar)(runLen-4); break;
+ }
+ }
+ }
+}
+
+
+/*---------------------------------------------------*/
+/*--- Processing of complete files and streams ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+void compressStream ( FILE *stream, FILE *zStream )
+{
+ IntNative retVal;
+ UInt32 blockCRC, combinedCRC;
+ Int32 blockNo;
+
+ blockNo = 0;
+ bytesIn = 0;
+ bytesOut = 0;
+ nBlocksRandomised = 0;
+
+ SET_BINARY_MODE(stream);
+ SET_BINARY_MODE(zStream);
+
+ ERROR_IF_NOT_ZERO ( ferror(stream) );
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+
+ bsSetStream ( zStream, True );
+
+ /*--- Write `magic' bytes B and Z,
+ then h indicating file-format == huffmanised,
+ followed by a digit indicating blockSize100k.
+ ---*/
+ bsPutUChar ( 'B' );
+ bsPutUChar ( 'Z' );
+ bsPutUChar ( 'h' );
+ bsPutUChar ( '0' + blockSize100k );
+
+ combinedCRC = 0;
+
+ if (verbosity >= 2) fprintf ( stderr, "\n" );
+
+ while (True) {
+
+ blockNo++;
+ initialiseCRC ();
+ loadAndRLEsource ( stream );
+ ERROR_IF_NOT_ZERO ( ferror(stream) );
+ if (last == -1) break;
+
+ blockCRC = getFinalCRC ();
+ combinedCRC = (combinedCRC << 1) | (combinedCRC >> 31);
+ combinedCRC ^= blockCRC;
+
+ if (verbosity >= 2)
+ fprintf ( stderr, " block %d: crc = 0x%8x, combined CRC = 0x%8x, size = %d",
+ blockNo, blockCRC, combinedCRC, last+1 );
+
+ /*-- sort the block and establish posn of original string --*/
+ doReversibleTransformation ();
+
+ /*--
+ A 6-byte block header, the value chosen arbitrarily
+ as 0x314159265359 :-). A 32 bit value does not really
+ give a strong enough guarantee that the value will not
+ appear by chance in the compressed datastream. Worst-case
+ probability of this event, for a 900k block, is about
+ 2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits.
+ For a compressed file of size 100Gb -- about 100000 blocks --
+ only a 48-bit marker will do. NB: normal compression/
+ decompression do *not* rely on these statistical properties.
+ They are only important when trying to recover blocks from
+ damaged files.
+ --*/
+ bsPutUChar ( 0x31 ); bsPutUChar ( 0x41 );
+ bsPutUChar ( 0x59 ); bsPutUChar ( 0x26 );
+ bsPutUChar ( 0x53 ); bsPutUChar ( 0x59 );
+
+ /*-- Now the block's CRC, so it is in a known place. --*/
+ bsPutUInt32 ( blockCRC );
+
+ /*-- Now a single bit indicating randomisation. --*/
+ if (blockRandomised) {
+ bsW(1,1); nBlocksRandomised++;
+ } else
+ bsW(1,0);
+
+ /*-- Finally, block's contents proper. --*/
+ moveToFrontCodeAndSend ();
+
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+ }
+
+ if (verbosity >= 2 && nBlocksRandomised > 0)
+ fprintf ( stderr, " %d block%s needed randomisation\n",
+ nBlocksRandomised,
+ nBlocksRandomised == 1 ? "" : "s" );
+
+ /*--
+ Now another magic 48-bit number, 0x177245385090, to
+ indicate the end of the last block. (sqrt(pi), if
+ you want to know. I did want to use e, but it contains
+ too much repetition -- 27 18 28 18 28 46 -- for me
+ to feel statistically comfortable. Call me paranoid.)
+ --*/
+
+ bsPutUChar ( 0x17 ); bsPutUChar ( 0x72 );
+ bsPutUChar ( 0x45 ); bsPutUChar ( 0x38 );
+ bsPutUChar ( 0x50 ); bsPutUChar ( 0x90 );
+
+ bsPutUInt32 ( combinedCRC );
+ if (verbosity >= 2)
+ fprintf ( stderr, " final combined CRC = 0x%x\n ", combinedCRC );
+
+ /*-- Close the files in an utterly paranoid way. --*/
+ bsFinishedWithStream ();
+
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+ retVal = fflush ( zStream );
+ ERROR_IF_EOF ( retVal );
+ retVal = fclose ( zStream );
+ ERROR_IF_EOF ( retVal );
+
+ ERROR_IF_NOT_ZERO ( ferror(stream) );
+ retVal = fclose ( stream );
+ ERROR_IF_EOF ( retVal );
+
+ if (bytesIn == 0) bytesIn = 1;
+ if (bytesOut == 0) bytesOut = 1;
+
+ if (verbosity >= 1)
+ fprintf ( stderr, "%6.3f:1, %6.3f bits/byte, "
+ "%5.2f%% saved, %d in, %d out.\n",
+ (float)bytesIn / (float)bytesOut,
+ (8.0 * (float)bytesOut) / (float)bytesIn,
+ 100.0 * (1.0 - (float)bytesOut / (float)bytesIn),
+ bytesIn,
+ bytesOut
+ );
+}
+
+
+/*---------------------------------------------*/
+Bool uncompressStream ( FILE *zStream, FILE *stream )
+{
+ UChar magic1, magic2, magic3, magic4;
+ UChar magic5, magic6;
+ UInt32 storedBlockCRC, storedCombinedCRC;
+ UInt32 computedBlockCRC, computedCombinedCRC;
+ Int32 currBlockNo;
+ IntNative retVal;
+
+ SET_BINARY_MODE(stream);
+ SET_BINARY_MODE(zStream);
+
+ ERROR_IF_NOT_ZERO ( ferror(stream) );
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+
+ bsSetStream ( zStream, False );
+
+ /*--
+ A bad magic number is `recoverable from';
+ return with False so the caller skips the file.
+ --*/
+ magic1 = bsGetUChar ();
+ magic2 = bsGetUChar ();
+ magic3 = bsGetUChar ();
+ magic4 = bsGetUChar ();
+ if (magic1 != 'B' ||
+ magic2 != 'Z' ||
+ magic3 != 'h' ||
+ magic4 < '1' ||
+ magic4 > '9') {
+ bsFinishedWithStream();
+ retVal = fclose ( stream );
+ ERROR_IF_EOF ( retVal );
+ return False;
+ }
+
+ setDecompressStructureSizes ( magic4 - '0' );
+ computedCombinedCRC = 0;
+
+ if (verbosity >= 2) fprintf ( stderr, "\n " );
+ currBlockNo = 0;
+
+ while (True) {
+ magic1 = bsGetUChar ();
+ magic2 = bsGetUChar ();
+ magic3 = bsGetUChar ();
+ magic4 = bsGetUChar ();
+ magic5 = bsGetUChar ();
+ magic6 = bsGetUChar ();
+ if (magic1 == 0x17 && magic2 == 0x72 &&
+ magic3 == 0x45 && magic4 == 0x38 &&
+ magic5 == 0x50 && magic6 == 0x90) break;
+
+ if (magic1 != 0x31 || magic2 != 0x41 ||
+ magic3 != 0x59 || magic4 != 0x26 ||
+ magic5 != 0x53 || magic6 != 0x59) badBlockHeader();
+
+ storedBlockCRC = bsGetUInt32 ();
+
+ if (bsR(1) == 1)
+ blockRandomised = True; else
+ blockRandomised = False;
+
+ currBlockNo++;
+ if (verbosity >= 2)
+ fprintf ( stderr, "[%d: huff+mtf ", currBlockNo );
+ getAndMoveToFrontDecode ();
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+
+ initialiseCRC();
+ if (verbosity >= 2) fprintf ( stderr, "rt+rld" );
+ if (smallMode)
+ undoReversibleTransformation_small ( stream );
+ else
+ undoReversibleTransformation_fast ( stream );
+
+ ERROR_IF_NOT_ZERO ( ferror(stream) );
+
+ computedBlockCRC = getFinalCRC();
+ if (verbosity >= 3)
+ fprintf ( stderr, " {0x%x, 0x%x}", storedBlockCRC, computedBlockCRC );
+ if (verbosity >= 2) fprintf ( stderr, "] " );
+
+ /*-- A bad CRC is considered a fatal error. --*/
+ if (storedBlockCRC != computedBlockCRC)
+ crcError ( storedBlockCRC, computedBlockCRC );
+
+ computedCombinedCRC = (computedCombinedCRC << 1) | (computedCombinedCRC >> 31);
+ computedCombinedCRC ^= computedBlockCRC;
+ };
+
+ if (verbosity >= 2) fprintf ( stderr, "\n " );
+
+ storedCombinedCRC = bsGetUInt32 ();
+ if (verbosity >= 2)
+ fprintf ( stderr,
+ "combined CRCs: stored = 0x%x, computed = 0x%x\n ",
+ storedCombinedCRC, computedCombinedCRC );
+ if (storedCombinedCRC != computedCombinedCRC)
+ crcError ( storedCombinedCRC, computedCombinedCRC );
+
+
+ bsFinishedWithStream ();
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+ retVal = fclose ( zStream );
+ ERROR_IF_EOF ( retVal );
+
+ ERROR_IF_NOT_ZERO ( ferror(stream) );
+ retVal = fflush ( stream );
+ ERROR_IF_NOT_ZERO ( retVal );
+ if (stream != stdout) {
+ retVal = fclose ( stream );
+ ERROR_IF_EOF ( retVal );
+ }
+ return True;
+}
+
+
+/*---------------------------------------------*/
+Bool testStream ( FILE *zStream )
+{
+ UChar magic1, magic2, magic3, magic4;
+ UChar magic5, magic6;
+ UInt32 storedBlockCRC, storedCombinedCRC;
+ UInt32 computedBlockCRC, computedCombinedCRC;
+ Int32 currBlockNo;
+ IntNative retVal;
+
+ SET_BINARY_MODE(zStream);
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+
+ bsSetStream ( zStream, False );
+
+ magic1 = bsGetUChar ();
+ magic2 = bsGetUChar ();
+ magic3 = bsGetUChar ();
+ magic4 = bsGetUChar ();
+ if (magic1 != 'B' ||
+ magic2 != 'Z' ||
+ magic3 != 'h' ||
+ magic4 < '1' ||
+ magic4 > '9') {
+ bsFinishedWithStream();
+ fclose ( zStream );
+ fprintf ( stderr, "\n%s: bad magic number (ie, not created by bzip2)\n",
+ inName );
+ return False;
+ }
+
+ smallMode = True;
+ setDecompressStructureSizes ( magic4 - '0' );
+ computedCombinedCRC = 0;
+
+ if (verbosity >= 2) fprintf ( stderr, "\n" );
+ currBlockNo = 0;
+
+ while (True) {
+ magic1 = bsGetUChar ();
+ magic2 = bsGetUChar ();
+ magic3 = bsGetUChar ();
+ magic4 = bsGetUChar ();
+ magic5 = bsGetUChar ();
+ magic6 = bsGetUChar ();
+ if (magic1 == 0x17 && magic2 == 0x72 &&
+ magic3 == 0x45 && magic4 == 0x38 &&
+ magic5 == 0x50 && magic6 == 0x90) break;
+
+ currBlockNo++;
+ if (magic1 != 0x31 || magic2 != 0x41 ||
+ magic3 != 0x59 || magic4 != 0x26 ||
+ magic5 != 0x53 || magic6 != 0x59) {
+ bsFinishedWithStream();
+ fclose ( zStream );
+ fprintf ( stderr,
+ "\n%s, block %d: bad header (not == 0x314159265359)\n",
+ inName, currBlockNo );
+ return False;
+ }
+ storedBlockCRC = bsGetUInt32 ();
+
+ if (bsR(1) == 1)
+ blockRandomised = True; else
+ blockRandomised = False;
+
+ if (verbosity >= 2)
+ fprintf ( stderr, " block [%d: huff+mtf ", currBlockNo );
+ getAndMoveToFrontDecode ();
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+
+ initialiseCRC();
+ if (verbosity >= 2) fprintf ( stderr, "rt+rld" );
+ undoReversibleTransformation_small ( NULL );
+
+ computedBlockCRC = getFinalCRC();
+ if (verbosity >= 3)
+ fprintf ( stderr, " {0x%x, 0x%x}", storedBlockCRC, computedBlockCRC );
+ if (verbosity >= 2) fprintf ( stderr, "] " );
+
+ if (storedBlockCRC != computedBlockCRC) {
+ bsFinishedWithStream();
+ fclose ( zStream );
+ fprintf ( stderr, "\n%s, block %d: computed CRC does not match stored one\n",
+ inName, currBlockNo );
+ return False;
+ }
+
+ if (verbosity >= 2) fprintf ( stderr, "ok\n" );
+ computedCombinedCRC = (computedCombinedCRC << 1) | (computedCombinedCRC >> 31);
+ computedCombinedCRC ^= computedBlockCRC;
+ };
+
+ storedCombinedCRC = bsGetUInt32 ();
+ if (verbosity >= 2)
+ fprintf ( stderr,
+ " combined CRCs: stored = 0x%x, computed = 0x%x\n ",
+ storedCombinedCRC, computedCombinedCRC );
+ if (storedCombinedCRC != computedCombinedCRC) {
+ bsFinishedWithStream();
+ fclose ( zStream );
+ fprintf ( stderr, "\n%s: computed CRC does not match stored one\n",
+ inName );
+ return False;
+ }
+
+ bsFinishedWithStream ();
+ ERROR_IF_NOT_ZERO ( ferror(zStream) );
+ retVal = fclose ( zStream );
+ ERROR_IF_EOF ( retVal );
+ return True;
+}
+
+
+
+/*---------------------------------------------------*/
+/*--- Error [non-] handling grunge ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+void cadvise ( void )
+{
+ fprintf (
+ stderr,
+ "\nIt is possible that the compressed file(s) have become corrupted.\n"
+ "You can use the -tvv option to test integrity of such files.\n\n"
+ "You can use the `bzip2recover' program to *attempt* to recover\n"
+ "data from undamaged sections of corrupted files.\n\n"
+ );
+}
+
+
+/*---------------------------------------------*/
+void showFileNames ( void )
+{
+ fprintf (
+ stderr,
+ "\tInput file = %s, output file = %s\n",
+ inName==NULL ? "(null)" : inName,
+ outName==NULL ? "(null)" : outName
+ );
+}
+
+
+/*---------------------------------------------*/
+void cleanUpAndFail ( Int32 ec )
+{
+ IntNative retVal;
+
+ if ( srcMode == SM_F2F && opMode != OM_TEST ) {
+ fprintf ( stderr, "%s: Deleting output file %s, if it exists.\n",
+ progName,
+ outName==NULL ? "(null)" : outName );
+ if (outputHandleJustInCase != NULL)
+ fclose ( outputHandleJustInCase );
+ retVal = remove ( outName );
+ if (retVal != 0)
+ fprintf ( stderr,
+ "%s: WARNING: deletion of output file (apparently) failed.\n",
+ progName );
+ }
+ if (numFileNames > 0 && numFilesProcessed < numFileNames) {
+ fprintf ( stderr,
+ "%s: WARNING: some files have not been processed:\n"
+ "\t%d specified on command line, %d not processed yet.\n\n",
+ progName, numFileNames,
+ numFileNames - numFilesProcessed );
+ }
+ exit ( ec );
+}
+
+
+/*---------------------------------------------*/
+void panic ( Char* s )
+{
+ fprintf ( stderr,
+ "\n%s: PANIC -- internal consistency error:\n"
+ "\t%s\n"
+ "\tThis is a BUG. Please report it to me at:\n"
+ "\tjseward@acm.org\n",
+ progName, s );
+ showFileNames();
+ cleanUpAndFail( 3 );
+}
+
+
+/*---------------------------------------------*/
+void badBGLengths ( void )
+{
+ fprintf ( stderr,
+ "\n%s: error when reading background model code lengths,\n"
+ "\twhich probably means the compressed file is corrupted.\n",
+ progName );
+ showFileNames();
+ cadvise();
+ cleanUpAndFail( 2 );
+}
+
+
+/*---------------------------------------------*/
+void crcError ( UInt32 crcStored, UInt32 crcComputed )
+{
+ fprintf ( stderr,
+ "\n%s: Data integrity error when decompressing.\n"
+ "\tStored CRC = 0x%x, computed CRC = 0x%x\n",
+ progName, crcStored, crcComputed );
+ showFileNames();
+ cadvise();
+ cleanUpAndFail( 2 );
+}
+
+
+/*---------------------------------------------*/
+void compressedStreamEOF ( void )
+{
+ fprintf ( stderr,
+ "\n%s: Compressed file ends unexpectedly;\n\t"
+ "perhaps it is corrupted? *Possible* reason follows.\n",
+ progName );
+ perror ( progName );
+ showFileNames();
+ cadvise();
+ cleanUpAndFail( 2 );
+}
+
+
+/*---------------------------------------------*/
+void ioError ( )
+{
+ fprintf ( stderr,
+ "\n%s: I/O or other error, bailing out. Possible reason follows.\n",
+ progName );
+ perror ( progName );
+ showFileNames();
+ cleanUpAndFail( 1 );
+}
+
+
+/*---------------------------------------------*/
+void blockOverrun ()
+{
+ fprintf ( stderr,
+ "\n%s: block overrun during decompression,\n"
+ "\twhich probably means the compressed file\n"
+ "\tis corrupted.\n",
+ progName );
+ showFileNames();
+ cadvise();
+ cleanUpAndFail( 2 );
+}
+
+
+/*---------------------------------------------*/
+void badBlockHeader ()
+{
+ fprintf ( stderr,
+ "\n%s: bad block header in the compressed file,\n"
+ "\twhich probably means it is corrupted.\n",
+ progName );
+ showFileNames();
+ cadvise();
+ cleanUpAndFail( 2 );
+}
+
+
+/*---------------------------------------------*/
+void bitStreamEOF ()
+{
+ fprintf ( stderr,
+ "\n%s: read past the end of compressed data,\n"
+ "\twhich probably means it is corrupted.\n",
+ progName );
+ showFileNames();
+ cadvise();
+ cleanUpAndFail( 2 );
+}
+
+
+/*---------------------------------------------*/
+void mySignalCatcher ( IntNative n )
+{
+ fprintf ( stderr,
+ "\n%s: Control-C (or similar) caught, quitting.\n",
+ progName );
+ cleanUpAndFail(1);
+}
+
+
+/*---------------------------------------------*/
+void mySIGSEGVorSIGBUScatcher ( IntNative n )
+{
+ if (opMode == OM_Z)
+ fprintf ( stderr,
+ "\n%s: Caught a SIGSEGV or SIGBUS whilst compressing,\n"
+ "\twhich probably indicates a bug in bzip2. Please\n"
+ "\treport it to me at: jseward@acm.org\n",
+ progName );
+ else
+ fprintf ( stderr,
+ "\n%s: Caught a SIGSEGV or SIGBUS whilst decompressing,\n"
+ "\twhich probably indicates that the compressed data\n"
+ "\tis corrupted.\n",
+ progName );
+
+ showFileNames();
+ if (opMode == OM_Z)
+ cleanUpAndFail( 3 ); else
+ { cadvise(); cleanUpAndFail( 2 ); }
+}
+
+
+/*---------------------------------------------*/
+void uncompressOutOfMemory ( Int32 draw, Int32 blockSize )
+{
+ fprintf ( stderr,
+ "\n%s: Can't allocate enough memory for decompression.\n"
+ "\tRequested %d bytes for a block size of %d.\n"
+ "\tTry selecting space-economic decompress (with flag -s)\n"
+ "\tand failing that, find a machine with more memory.\n",
+ progName, draw, blockSize );
+ showFileNames();
+ cleanUpAndFail(1);
+}
+
+
+/*---------------------------------------------*/
+void compressOutOfMemory ( Int32 draw, Int32 blockSize )
+{
+ fprintf ( stderr,
+ "\n%s: Can't allocate enough memory for compression.\n"
+ "\tRequested %d bytes for a block size of %d.\n"
+ "\tTry selecting a small block size (with flag -s).\n",
+ progName, draw, blockSize );
+ showFileNames();
+ cleanUpAndFail(1);
+}
+
+
+/*---------------------------------------------------*/
+/*--- The main driver machinery ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+void pad ( Char *s )
+{
+ Int32 i;
+ if ( (Int32)strlen(s) >= longestFileName ) return;
+ for (i = 1; i <= longestFileName - (Int32)strlen(s); i++)
+ fprintf ( stderr, " " );
+}
+
+
+/*---------------------------------------------*/
+Bool fileExists ( Char* name )
+{
+ FILE *tmp = fopen ( name, "rb" );
+ Bool exists = (tmp != NULL);
+ if (tmp != NULL) fclose ( tmp );
+ return exists;
+}
+
+
+/*---------------------------------------------*/
+/*--
+ if in doubt, return True
+--*/
+Bool notABogStandardFile ( Char* name )
+{
+ IntNative i;
+ struct MY_STAT statBuf;
+
+ i = MY_LSTAT ( name, &statBuf );
+ if (i != 0) return True;
+ if (MY_S_IFREG(statBuf.st_mode)) return False;
+ return True;
+}
+
+
+/*---------------------------------------------*/
+void copyDateAndPermissions ( Char *srcName, Char *dstName )
+{
+ #if BZ_UNIX
+ IntNative retVal;
+ struct MY_STAT statBuf;
+ struct utimbuf uTimBuf;
+
+ retVal = MY_LSTAT ( srcName, &statBuf );
+ ERROR_IF_NOT_ZERO ( retVal );
+ uTimBuf.actime = statBuf.st_atime;
+ uTimBuf.modtime = statBuf.st_mtime;
+
+ retVal = chmod ( dstName, statBuf.st_mode );
+ ERROR_IF_NOT_ZERO ( retVal );
+ retVal = utime ( dstName, &uTimBuf );
+ ERROR_IF_NOT_ZERO ( retVal );
+ #endif
+}
+
+
+/*---------------------------------------------*/
+Bool endsInBz2 ( Char* name )
+{
+ Int32 n = strlen ( name );
+ if (n <= 4) return False;
+ return
+ (name[n-4] == '.' &&
+ name[n-3] == 'b' &&
+ name[n-2] == 'z' &&
+ name[n-1] == '2');
+}
+
+
+/*---------------------------------------------*/
+Bool containsDubiousChars ( Char* name )
+{
+ Bool cdc = False;
+ for (; *name != '\0'; name++)
+ if (*name == '?' || *name == '*') cdc = True;
+ return cdc;
+}
+
+
+/*---------------------------------------------*/
+void compress ( Char *name )
+{
+ FILE *inStr;
+ FILE *outStr;
+
+ if (name == NULL && srcMode != SM_I2O)
+ panic ( "compress: bad modes\n" );
+
+ switch (srcMode) {
+ case SM_I2O: strcpy ( inName, "(stdin)" );
+ strcpy ( outName, "(stdout)" ); break;
+ case SM_F2F: strcpy ( inName, name );
+ strcpy ( outName, name );
+ strcat ( outName, ".bz2" ); break;
+ case SM_F2O: strcpy ( inName, name );
+ strcpy ( outName, "(stdout)" ); break;
+ }
+
+ if ( srcMode != SM_I2O && containsDubiousChars ( inName ) ) {
+ fprintf ( stderr, "%s: There are no files matching `%s'.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && !fileExists ( inName ) ) {
+ fprintf ( stderr, "%s: Input file %s doesn't exist, skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && endsInBz2 ( inName )) {
+ fprintf ( stderr, "%s: Input file name %s ends in `.bz2', skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && notABogStandardFile ( inName )) {
+ fprintf ( stderr, "%s: Input file %s is not a normal file, skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode == SM_F2F && fileExists ( outName ) ) {
+ fprintf ( stderr, "%s: Output file %s already exists, skipping.\n",
+ progName, outName );
+ return;
+ }
+
+ switch ( srcMode ) {
+
+ case SM_I2O:
+ inStr = stdin;
+ outStr = stdout;
+ if ( isatty ( fileno ( stdout ) ) ) {
+ fprintf ( stderr,
+ "%s: I won't write compressed data to a terminal.\n",
+ progName );
+ fprintf ( stderr, "%s: For help, type: `%s --help'.\n",
+ progName, progName );
+ return;
+ };
+ break;
+
+ case SM_F2O:
+ inStr = fopen ( inName, "rb" );
+ outStr = stdout;
+ if ( isatty ( fileno ( stdout ) ) ) {
+ fprintf ( stderr,
+ "%s: I won't write compressed data to a terminal.\n",
+ progName );
+ fprintf ( stderr, "%s: For help, type: `%s --help'.\n",
+ progName, progName );
+ return;
+ };
+ if ( inStr == NULL ) {
+ fprintf ( stderr, "%s: Can't open input file %s, skipping.\n",
+ progName, inName );
+ return;
+ };
+ break;
+
+ case SM_F2F:
+ inStr = fopen ( inName, "rb" );
+ outStr = fopen ( outName, "wb" );
+ if ( outStr == NULL) {
+ fprintf ( stderr, "%s: Can't create output file %s, skipping.\n",
+ progName, outName );
+ return;
+ }
+ if ( inStr == NULL ) {
+ fprintf ( stderr, "%s: Can't open input file %s, skipping.\n",
+ progName, inName );
+ return;
+ };
+ break;
+
+ default:
+ panic ( "compress: bad srcMode" );
+ break;
+ }
+
+ if (verbosity >= 1) {
+ fprintf ( stderr, " %s: ", inName );
+ pad ( inName );
+ fflush ( stderr );
+ }
+
+ /*--- Now the input and output handles are sane. Do the Biz. ---*/
+ errno = 0;
+ outputHandleJustInCase = outStr;
+ compressStream ( inStr, outStr );
+ outputHandleJustInCase = NULL;
+
+ /*--- If there was an I/O error, we won't get here. ---*/
+ if ( srcMode == SM_F2F ) {
+ copyDateAndPermissions ( inName, outName );
+ if ( !keepInputFiles ) {
+ IntNative retVal = remove ( inName );
+ ERROR_IF_NOT_ZERO ( retVal );
+ }
+ }
+}
+
+
+/*---------------------------------------------*/
+void uncompress ( Char *name )
+{
+ FILE *inStr;
+ FILE *outStr;
+ Bool magicNumberOK;
+
+ if (name == NULL && srcMode != SM_I2O)
+ panic ( "uncompress: bad modes\n" );
+
+ switch (srcMode) {
+ case SM_I2O: strcpy ( inName, "(stdin)" );
+ strcpy ( outName, "(stdout)" ); break;
+ case SM_F2F: strcpy ( inName, name );
+ strcpy ( outName, name );
+ if (endsInBz2 ( outName ))
+ outName [ strlen ( outName ) - 4 ] = '\0';
+ break;
+ case SM_F2O: strcpy ( inName, name );
+ strcpy ( outName, "(stdout)" ); break;
+ }
+
+ if ( srcMode != SM_I2O && containsDubiousChars ( inName ) ) {
+ fprintf ( stderr, "%s: There are no files matching `%s'.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && !fileExists ( inName ) ) {
+ fprintf ( stderr, "%s: Input file %s doesn't exist, skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && !endsInBz2 ( inName )) {
+ fprintf ( stderr,
+ "%s: Input file name %s doesn't end in `.bz2', skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && notABogStandardFile ( inName )) {
+ fprintf ( stderr, "%s: Input file %s is not a normal file, skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode == SM_F2F && fileExists ( outName ) ) {
+ fprintf ( stderr, "%s: Output file %s already exists, skipping.\n",
+ progName, outName );
+ return;
+ }
+
+ switch ( srcMode ) {
+
+ case SM_I2O:
+ inStr = stdin;
+ outStr = stdout;
+ if ( isatty ( fileno ( stdin ) ) ) {
+ fprintf ( stderr,
+ "%s: I won't read compressed data from a terminal.\n",
+ progName );
+ fprintf ( stderr, "%s: For help, type: `%s --help'.\n",
+ progName, progName );
+ return;
+ };
+ break;
+
+ case SM_F2O:
+ inStr = fopen ( inName, "rb" );
+ outStr = stdout;
+ if ( inStr == NULL ) {
+ fprintf ( stderr, "%s: Can't open input file %s, skipping.\n",
+ progName, inName );
+ return;
+ };
+ break;
+
+ case SM_F2F:
+ inStr = fopen ( inName, "rb" );
+ outStr = fopen ( outName, "wb" );
+ if ( outStr == NULL) {
+ fprintf ( stderr, "%s: Can't create output file %s, skipping.\n",
+ progName, outName );
+ return;
+ }
+ if ( inStr == NULL ) {
+ fprintf ( stderr, "%s: Can't open input file %s, skipping.\n",
+ progName, inName );
+ return;
+ };
+ break;
+
+ default:
+ panic ( "uncompress: bad srcMode" );
+ break;
+ }
+
+ if (verbosity >= 1) {
+ fprintf ( stderr, " %s: ", inName );
+ pad ( inName );
+ fflush ( stderr );
+ }
+
+ /*--- Now the input and output handles are sane. Do the Biz. ---*/
+ errno = 0;
+ outputHandleJustInCase = outStr;
+ magicNumberOK = uncompressStream ( inStr, outStr );
+ outputHandleJustInCase = NULL;
+
+ /*--- If there was an I/O error, we won't get here. ---*/
+ if ( magicNumberOK ) {
+ if ( srcMode == SM_F2F ) {
+ copyDateAndPermissions ( inName, outName );
+ if ( !keepInputFiles ) {
+ IntNative retVal = remove ( inName );
+ ERROR_IF_NOT_ZERO ( retVal );
+ }
+ }
+ } else {
+ if ( srcMode == SM_F2F ) {
+ IntNative retVal = remove ( outName );
+ ERROR_IF_NOT_ZERO ( retVal );
+ }
+ }
+
+ if ( magicNumberOK ) {
+ if (verbosity >= 1)
+ fprintf ( stderr, "done\n" );
+ } else {
+ if (verbosity >= 1)
+ fprintf ( stderr, "not a bzip2 file, skipping.\n" ); else
+ fprintf ( stderr,
+ "%s: %s is not a bzip2 file, skipping.\n",
+ progName, inName );
+ }
+
+}
+
+
+/*---------------------------------------------*/
+void testf ( Char *name )
+{
+ FILE *inStr;
+ Bool allOK;
+
+ if (name == NULL && srcMode != SM_I2O)
+ panic ( "testf: bad modes\n" );
+
+ strcpy ( outName, "(none)" );
+ switch (srcMode) {
+ case SM_I2O: strcpy ( inName, "(stdin)" ); break;
+ case SM_F2F: strcpy ( inName, name ); break;
+ case SM_F2O: strcpy ( inName, name ); break;
+ }
+
+ if ( srcMode != SM_I2O && containsDubiousChars ( inName ) ) {
+ fprintf ( stderr, "%s: There are no files matching `%s'.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && !fileExists ( inName ) ) {
+ fprintf ( stderr, "%s: Input file %s doesn't exist, skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && !endsInBz2 ( inName )) {
+ fprintf ( stderr,
+ "%s: Input file name %s doesn't end in `.bz2', skipping.\n",
+ progName, inName );
+ return;
+ }
+ if ( srcMode != SM_I2O && notABogStandardFile ( inName )) {
+ fprintf ( stderr, "%s: Input file %s is not a normal file, skipping.\n",
+ progName, inName );
+ return;
+ }
+
+ switch ( srcMode ) {
+
+ case SM_I2O:
+ if ( isatty ( fileno ( stdin ) ) ) {
+ fprintf ( stderr,
+ "%s: I won't read compressed data from a terminal.\n",
+ progName );
+ fprintf ( stderr, "%s: For help, type: `%s --help'.\n",
+ progName, progName );
+ return;
+ };
+ inStr = stdin;
+ break;
+
+ case SM_F2O: case SM_F2F:
+ inStr = fopen ( inName, "rb" );
+ if ( inStr == NULL ) {
+ fprintf ( stderr, "%s: Can't open input file %s, skipping.\n",
+ progName, inName );
+ return;
+ };
+ break;
+
+ default:
+ panic ( "testf: bad srcMode" );
+ break;
+ }
+
+ if (verbosity >= 1) {
+ fprintf ( stderr, " %s: ", inName );
+ pad ( inName );
+ fflush ( stderr );
+ }
+
+ /*--- Now the input handle is sane. Do the Biz. ---*/
+ errno = 0;
+ allOK = testStream ( inStr );
+
+ if (allOK && verbosity >= 1) fprintf ( stderr, "ok\n" );
+ if (!allOK) testFailsExist = True;
+}
+
+
+/*---------------------------------------------*/
+void license ( void )
+{
+ fprintf ( stderr,
+
+ "bzip2, a block-sorting file compressor. "
+ "Version 0.1pl0, 17-Aug-97.\n"
+ " \n"
+ " Copyright (C) 1996, 1997 by Julian Seward.\n"
+ " \n"
+ " This program is free software; you can redistribute it and/or modify\n"
+ " it under the terms of the GNU General Public License as published by\n"
+ " the Free Software Foundation; either version 2 of the License, or\n"
+ " (at your option) any later version.\n"
+ " \n"
+ " This program is distributed in the hope that it will be useful,\n"
+ " but WITHOUT ANY WARRANTY; without even the implied warranty of\n"
+ " MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n"
+ " GNU General Public License for more details.\n"
+ " \n"
+ " You should have received a copy of the GNU General Public License\n"
+ " along with this program; if not, write to the Free Software\n"
+ " Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.\n"
+ " \n"
+ " The GNU General Public License is contained in the file LICENSE.\n"
+ " \n"
+ );
+}
+
+
+/*---------------------------------------------*/
+void usage ( Char *fullProgName )
+{
+ fprintf (
+ stderr,
+ "bzip2, a block-sorting file compressor. "
+ "Version 0.1pl0, 17-Aug-97.\n"
+ "\n usage: %s [flags and input files in any order]\n"
+ "\n"
+ " -h --help print this message\n"
+ " -d --decompress force decompression\n"
+ " -f --compress force compression\n"
+ " -t --test test compressed file integrity\n"
+ " -c --stdout output to standard out\n"
+ " -v --verbose be verbose (a 2nd -v gives more)\n"
+ " -k --keep keep (don't delete) input files\n"
+ " -L --license display software version & license\n"
+ " -V --version display software version & license\n"
+ " -s --small use less memory (at most 2500k)\n"
+ " -1 .. -9 set block size to 100k .. 900k\n"
+ " --repetitive-fast compress repetitive blocks faster\n"
+ " --repetitive-best compress repetitive blocks better\n"
+ "\n"
+ " If invoked as `bzip2', the default action is to compress.\n"
+ " as `bunzip2', the default action is to decompress.\n"
+ "\n"
+ " If no file names are given, bzip2 compresses or decompresses\n"
+ " from standard input to standard output. You can combine\n"
+ " flags, so `-v -4' means the same as -v4 or -4v, &c.\n"
+ #if BZ_UNIX
+ "\n"
+ #endif
+ ,
+
+ fullProgName
+ );
+}
+
+
+/*---------------------------------------------*/
+/*--
+ All the garbage from here to main() is purely to
+ implement a linked list of command-line arguments,
+ into which main() copies argv[1 .. argc-1].
+
+ The purpose of this ridiculous exercise is to
+ facilitate the expansion of wildcard characters
+ * and ? in filenames for halfwitted OSs like
+ MSDOS, Windows 95 and NT.
+
+ The actual Dirty Work is done by the platform-specific
+ macro APPEND_FILESPEC.
+--*/
+
+typedef
+ struct zzzz {
+ Char *name;
+ struct zzzz *link;
+ }
+ Cell;
+
+
+/*---------------------------------------------*/
+void *myMalloc ( Int32 n )
+{
+ void* p;
+
+ p = malloc ( (size_t)n );
+ if (p == NULL) {
+ fprintf (
+ stderr,
+ "%s: `malloc' failed on request for %d bytes.\n",
+ progName, n
+ );
+ exit ( 1 );
+ }
+ return p;
+}
+
+
+/*---------------------------------------------*/
+Cell *mkCell ( void )
+{
+ Cell *c;
+
+ c = (Cell*) myMalloc ( sizeof ( Cell ) );
+ c->name = NULL;
+ c->link = NULL;
+ return c;
+}
+
+
+/*---------------------------------------------*/
+Cell *snocString ( Cell *root, Char *name )
+{
+ if (root == NULL) {
+ Cell *tmp = mkCell();
+ tmp->name = (Char*) myMalloc ( 5 + strlen(name) );
+ strcpy ( tmp->name, name );
+ return tmp;
+ } else {
+ Cell *tmp = root;
+ while (tmp->link != NULL) tmp = tmp->link;
+ tmp->link = snocString ( tmp->link, name );
+ return root;
+ }
+}
+
+
+
+/*---------------------------------------------*/
+#define ISFLAG(s) (strcmp(aa->name, (s))==0)
+
+
+IntNative main ( IntNative argc, Char *argv[] )
+{
+ Int32 i, j;
+ Char *tmp;
+ Cell *argList;
+ Cell *aa;
+
+
+ #if DEBUG
+ fprintf ( stderr, "bzip2: *** compiled with debugging ON ***\n" );
+ #endif
+
+ /*-- Be really really really paranoid :-) --*/
+ if (sizeof(Int32) != 4 || sizeof(UInt32) != 4 ||
+ sizeof(Int16) != 2 || sizeof(UInt16) != 2 ||
+ sizeof(Char) != 1 || sizeof(UChar) != 1) {
+ fprintf ( stderr,
+ "bzip2: I'm not configured correctly for this platform!\n"
+ "\tI require Int32, Int16 and Char to have sizes\n"
+ "\tof 4, 2 and 1 bytes to run properly, and they don't.\n"
+ "\tProbably you can fix this by defining them correctly,\n"
+ "\tand recompiling. Bye!\n" );
+ exit(1);
+ }
+
+
+ /*-- Set up signal handlers --*/
+ signal (SIGINT, mySignalCatcher);
+ signal (SIGTERM, mySignalCatcher);
+ signal (SIGSEGV, mySIGSEGVorSIGBUScatcher);
+ #if BZ_UNIX
+ signal (SIGHUP, mySignalCatcher);
+ signal (SIGBUS, mySIGSEGVorSIGBUScatcher);
+ #endif
+
+
+ /*-- Initialise --*/
+ outputHandleJustInCase = NULL;
+ ftab = NULL;
+ ll4 = NULL;
+ ll16 = NULL;
+ ll8 = NULL;
+ tt = NULL;
+ block = NULL;
+ zptr = NULL;
+ errno = 0;
+ smallMode = False;
+ keepInputFiles = False;
+ verbosity = 0;
+ blockSize100k = 9;
+ testFailsExist = False;
+ bsStream = NULL;
+ numFileNames = 0;
+ numFilesProcessed = 0;
+ workFactor = 30;
+
+ strcpy ( inName, "(none)" );
+ strcpy ( outName, "(none)" );
+
+ strcpy ( progNameReally, argv[0] );
+ progName = &progNameReally[0];
+ for (tmp = &progNameReally[0]; *tmp != '\0'; tmp++)
+ if (*tmp == PATH_SEP) progName = tmp + 1;
+
+
+ /*-- Expand filename wildcards in arg list --*/
+ argList = NULL;
+ for (i = 1; i <= argc-1; i++)
+ APPEND_FILESPEC(argList, argv[i]);
+
+
+ /*-- Find the length of the longest filename --*/
+ longestFileName = 7;
+ numFileNames = 0;
+ for (aa = argList; aa != NULL; aa = aa->link)
+ if (aa->name[0] != '-') {
+ numFileNames++;
+ if (longestFileName < (Int32)strlen(aa->name) )
+ longestFileName = (Int32)strlen(aa->name);
+ }
+
+
+ /*-- Determine what to do (compress/uncompress/test). --*/
+ /*-- Note that subsequent flag handling may change this. --*/
+ opMode = OM_Z;
+
+ if ( (strcmp ( "bunzip2", progName ) == 0) ||
+ (strcmp ( "BUNZIP2", progName ) == 0) ||
+ (strcmp ( "bunzip2.exe", progName ) == 0) ||
+ (strcmp ( "BUNZIP2.EXE", progName ) == 0) )
+ opMode = OM_UNZ;
+
+
+ /*-- Determine source modes; flag handling may change this too. --*/
+ if (numFileNames == 0)
+ srcMode = SM_I2O; else srcMode = SM_F2F;
+
+
+ /*-- Look at the flags. --*/
+ for (aa = argList; aa != NULL; aa = aa->link)
+ if (aa->name[0] == '-' && aa->name[1] != '-')
+ for (j = 1; aa->name[j] != '\0'; j++)
+ switch (aa->name[j]) {
+ case 'c': srcMode = SM_F2O; break;
+ case 'd': opMode = OM_UNZ; break;
+ case 'f': opMode = OM_Z; break;
+ case 't': opMode = OM_TEST; break;
+ case 'k': keepInputFiles = True; break;
+ case 's': smallMode = True; break;
+ case '1': blockSize100k = 1; break;
+ case '2': blockSize100k = 2; break;
+ case '3': blockSize100k = 3; break;
+ case '4': blockSize100k = 4; break;
+ case '5': blockSize100k = 5; break;
+ case '6': blockSize100k = 6; break;
+ case '7': blockSize100k = 7; break;
+ case '8': blockSize100k = 8; break;
+ case '9': blockSize100k = 9; break;
+ case 'V':
+ case 'L': license(); break;
+ case 'v': verbosity++; break;
+ case 'h': usage ( progName );
+ exit ( 1 );
+ break;
+ default: fprintf ( stderr, "%s: Bad flag `%s'\n",
+ progName, aa->name );
+ usage ( progName );
+ exit ( 1 );
+ break;
+ }
+
+ /*-- And again ... --*/
+ for (aa = argList; aa != NULL; aa = aa->link) {
+ if (ISFLAG("--stdout")) srcMode = SM_F2O; else
+ if (ISFLAG("--decompress")) opMode = OM_UNZ; else
+ if (ISFLAG("--compress")) opMode = OM_Z; else
+ if (ISFLAG("--test")) opMode = OM_TEST; else
+ if (ISFLAG("--keep")) keepInputFiles = True; else
+ if (ISFLAG("--small")) smallMode = True; else
+ if (ISFLAG("--version")) license(); else
+ if (ISFLAG("--license")) license(); else
+ if (ISFLAG("--repetitive-fast")) workFactor = 5; else
+ if (ISFLAG("--repetitive-best")) workFactor = 150; else
+ if (ISFLAG("--verbose")) verbosity++; else
+ if (ISFLAG("--help")) { usage ( progName ); exit ( 1 ); }
+ else
+ if (strncmp ( aa->name, "--", 2) == 0) {
+ fprintf ( stderr, "%s: Bad flag `%s'\n", progName, aa->name );
+ usage ( progName );
+ exit ( 1 );
+ }
+ }
+
+ if (opMode == OM_Z && smallMode) blockSize100k = 2;
+
+ if (opMode == OM_Z && srcMode == SM_F2O && numFileNames > 1) {
+ fprintf ( stderr, "%s: I won't compress multiple files to stdout.\n",
+ progName );
+ exit ( 1 );
+ }
+
+ if (opMode == OM_TEST && srcMode == SM_F2O) {
+ fprintf ( stderr, "%s: -c and -t cannot be used together.\n",
+ progName );
+ exit ( 1 );
+ }
+
+ if (opMode != OM_Z) blockSize100k = 0;
+
+ if (opMode == OM_Z) {
+ allocateCompressStructures();
+ if (srcMode == SM_I2O)
+ compress ( NULL );
+ else
+ for (aa = argList; aa != NULL; aa = aa->link)
+ if (aa->name[0] != '-') {
+ numFilesProcessed++;
+ compress ( aa->name );
+ }
+ } else
+ if (opMode == OM_UNZ) {
+ if (srcMode == SM_I2O)
+ uncompress ( NULL );
+ else
+ for (aa = argList; aa != NULL; aa = aa->link)
+ if (aa->name[0] != '-') {
+ numFilesProcessed++;
+ uncompress ( aa->name );
+ }
+ } else {
+ testFailsExist = False;
+ if (srcMode == SM_I2O)
+ testf ( NULL );
+ else
+ for (aa = argList; aa != NULL; aa = aa->link)
+ if (aa->name[0] != '-') {
+ numFilesProcessed++;
+ testf ( aa->name );
+ }
+ if (testFailsExist) {
+ fprintf ( stderr,
+ "\n"
+ "You can use the `bzip2recover' program to *attempt* to recover\n"
+ "data from undamaged sections of corrupted files.\n\n"
+ );
+ exit(2);
+ }
+ }
+ return 0;
+}
+
+
+/*-----------------------------------------------------------*/
+/*--- end bzip2.c ---*/
+/*-----------------------------------------------------------*/
diff --git a/bzip2.exe b/bzip2.exe
new file mode 100644
index 0000000..4b3c4c1
--- /dev/null
+++ b/bzip2.exe
Binary files differ
diff --git a/bzip2.txt b/bzip2.txt
new file mode 100644
index 0000000..83366bc
--- /dev/null
+++ b/bzip2.txt
@@ -0,0 +1,462 @@
+
+
+
+bzip2(1) bzip2(1)
+
+
+NAME
+ bzip2, bunzip2 - a block-sorting file compressor, v0.1
+ bzip2recover - recovers data from damaged bzip2 files
+
+
+SYNOPSIS
+ bzip2 [ -cdfkstvVL123456789 ] [ filenames ... ]
+ bunzip2 [ -kvsVL ] [ filenames ... ]
+ bzip2recover filename
+
+
+DESCRIPTION
+ Bzip2 compresses files using the Burrows-Wheeler block-
+ sorting text compression algorithm, and Huffman coding.
+ Compression is generally considerably better than that
+ achieved by more conventional LZ77/LZ78-based compressors,
+ and approaches the performance of the PPM family of sta-
+ tistical compressors.
+
+ The command-line options are deliberately very similar to
+ those of GNU Gzip, but they are not identical.
+
+ Bzip2 expects a list of file names to accompany the com-
+ mand-line flags. Each file is replaced by a compressed
+ version of itself, with the name "original_name.bz2".
+ Each compressed file has the same modification date and
+ permissions as the corresponding original, so that these
+ properties can be correctly restored at decompression
+ time. File name handling is naive in the sense that there
+ is no mechanism for preserving original file names, per-
+ missions and dates in filesystems which lack these con-
+ cepts, or have serious file name length restrictions, such
+ as MS-DOS.
+
+ Bzip2 and bunzip2 will not overwrite existing files; if
+ you want this to happen, you should delete them first.
+
+ If no file names are specified, bzip2 compresses from
+ standard input to standard output. In this case, bzip2
+ will decline to write compressed output to a terminal, as
+ this would be entirely incomprehensible and therefore
+ pointless.
+
+ Bunzip2 (or bzip2 -d ) decompresses and restores all spec-
+ ified files whose names end in ".bz2". Files without this
+ suffix are ignored. Again, supplying no filenames causes
+ decompression from standard input to standard output.
+
+ You can also compress or decompress files to the standard
+ output by giving the -c flag. You can decompress multiple
+ files like this, but you may only compress a single file
+ this way, since it would otherwise be difficult to sepa-
+ rate out the compressed representations of the original
+ files.
+
+
+
+ 1
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ Compression is always performed, even if the compressed
+ file is slightly larger than the original. Files of less
+ than about one hundred bytes tend to get larger, since the
+ compression mechanism has a constant overhead in the
+ region of 50 bytes. Random data (including the output of
+ most file compressors) is coded at about 8.05 bits per
+ byte, giving an expansion of around 0.5%.
+
+ As a self-check for your protection, bzip2 uses 32-bit
+ CRCs to make sure that the decompressed version of a file
+ is identical to the original. This guards against corrup-
+ tion of the compressed data, and against undetected bugs
+ in bzip2 (hopefully very unlikely). The chances of data
+ corruption going undetected is microscopic, about one
+ chance in four billion for each file processed. Be aware,
+ though, that the check occurs upon decompression, so it
+ can only tell you that that something is wrong. It can't
+ help you recover the original uncompressed data. You can
+ use bzip2recover to try to recover data from damaged
+ files.
+
+ Return values: 0 for a normal exit, 1 for environmental
+ problems (file not found, invalid flags, I/O errors, &c),
+ 2 to indicate a corrupt compressed file, 3 for an internal
+ consistency error (eg, bug) which caused bzip2 to panic.
+
+
+MEMORY MANAGEMENT
+ Bzip2 compresses large files in blocks. The block size
+ affects both the compression ratio achieved, and the
+ amount of memory needed both for compression and decom-
+ pression. The flags -1 through -9 specify the block size
+ to be 100,000 bytes through 900,000 bytes (the default)
+ respectively. At decompression-time, the block size used
+ for compression is read from the header of the compressed
+ file, and bunzip2 then allocates itself just enough memory
+ to decompress the file. Since block sizes are stored in
+ compressed files, it follows that the flags -1 to -9 are
+ irrelevant to and so ignored during decompression. Com-
+ pression and decompression requirements, in bytes, can be
+ estimated as:
+
+ Compression: 400k + ( 7 x block size )
+
+ Decompression: 100k + ( 5 x block size ), or
+ 100k + ( 2.5 x block size )
+
+ Larger block sizes give rapidly diminishing marginal
+ returns; most of the compression comes from the first two
+ or three hundred k of block size, a fact worth bearing in
+ mind when using bzip2 on small machines. It is also
+ important to appreciate that the decompression memory
+ requirement is set at compression-time by the choice of
+ block size.
+
+
+
+ 2
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ For files compressed with the default 900k block size,
+ bunzip2 will require about 4600 kbytes to decompress. To
+ support decompression of any file on a 4 megabyte machine,
+ bunzip2 has an option to decompress using approximately
+ half this amount of memory, about 2300 kbytes. Decompres-
+ sion speed is also halved, so you should use this option
+ only where necessary. The relevant flag is -s.
+
+ In general, try and use the largest block size memory con-
+ straints allow, since that maximises the compression
+ achieved. Compression and decompression speed are virtu-
+ ally unaffected by block size.
+
+ Another significant point applies to files which fit in a
+ single block -- that means most files you'd encounter
+ using a large block size. The amount of real memory
+ touched is proportional to the size of the file, since the
+ file is smaller than a block. For example, compressing a
+ file 20,000 bytes long with the flag -9 will cause the
+ compressor to allocate around 6700k of memory, but only
+ touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the
+ decompressor will allocate 4600k but only touch 100k +
+ 20000 * 5 = 200 kbytes.
+
+ Here is a table which summarises the maximum memory usage
+ for different block sizes. Also recorded is the total
+ compressed size for 14 files of the Calgary Text Compres-
+ sion Corpus totalling 3,141,622 bytes. This column gives
+ some feel for how compression varies with block size.
+ These figures tend to understate the advantage of larger
+ block sizes for larger files, since the Corpus is domi-
+ nated by smaller files.
+
+ Compress Decompress Decompress Corpus
+ Flag usage usage -s usage Size
+
+ -1 1100k 600k 350k 914704
+ -2 1800k 1100k 600k 877703
+ -3 2500k 1600k 850k 860338
+ -4 3200k 2100k 1100k 846899
+ -5 3900k 2600k 1350k 845160
+ -6 4600k 3100k 1600k 838626
+ -7 5400k 3600k 1850k 834096
+ -8 6000k 4100k 2100k 828642
+ -9 6700k 4600k 2350k 828642
+
+
+OPTIONS
+ -c --stdout
+ Compress or decompress to standard output. -c will
+ decompress multiple files to stdout, but will only
+ compress a single file to stdout.
+
+
+
+
+
+ 3
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ -d --decompress
+ Force decompression. Bzip2 and bunzip2 are really
+ the same program, and the decision about whether to
+ compress or decompress is done on the basis of
+ which name is used. This flag overrides that mech-
+ anism, and forces bzip2 to decompress.
+
+ -f --compress
+ The complement to -d: forces compression, regard-
+ less of the invokation name.
+
+ -t --test
+ Check integrity of the specified file(s), but don't
+ decompress them. This really performs a trial
+ decompression and throws away the result, using the
+ low-memory decompression algorithm (see -s).
+
+ -k --keep
+ Keep (don't delete) input files during compression
+ or decompression.
+
+ -s --small
+ Reduce memory usage, both for compression and
+ decompression. Files are decompressed using a mod-
+ ified algorithm which only requires 2.5 bytes per
+ block byte. This means any file can be decom-
+ pressed in 2300k of memory, albeit somewhat more
+ slowly than usual.
+
+ During compression, -s selects a block size of
+ 200k, which limits memory use to around the same
+ figure, at the expense of your compression ratio.
+ In short, if your machine is low on memory (8
+ megabytes or less), use -s for everything. See
+ MEMORY MANAGEMENT above.
+
+
+ -v --verbose
+ Verbose mode -- show the compression ratio for each
+ file processed. Further -v's increase the ver-
+ bosity level, spewing out lots of information which
+ is primarily of interest for diagnostic purposes.
+
+ -L --license
+ Display the software version, license terms and
+ conditions.
+
+ -V --version
+ Same as -L.
+
+ -1 to -9
+ Set the block size to 100 k, 200 k .. 900 k when
+ compressing. Has no effect when decompressing.
+ See MEMORY MANAGEMENT above.
+
+
+
+ 4
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ --repetitive-fast
+ bzip2 injects some small pseudo-random variations
+ into very repetitive blocks to limit worst-case
+ performance during compression. If sorting runs
+ into difficulties, the block is randomised, and
+ sorting is restarted. Very roughly, bzip2 persists
+ for three times as long as a well-behaved input
+ would take before resorting to randomisation. This
+ flag makes it give up much sooner.
+
+
+ --repetitive-best
+ Opposite of --repetitive-fast; try a lot harder
+ before resorting to randomisation.
+
+
+RECOVERING DATA FROM DAMAGED FILES
+ bzip2 compresses files in blocks, usually 900kbytes long.
+ Each block is handled independently. If a media or trans-
+ mission error causes a multi-block .bz2 file to become
+ damaged, it may be possible to recover data from the
+ undamaged blocks in the file.
+
+ The compressed representation of each block is delimited
+ by a 48-bit pattern, which makes it possible to find the
+ block boundaries with reasonable certainty. Each block
+ also carries its own 32-bit CRC, so damaged blocks can be
+ distinguished from undamaged ones.
+
+ bzip2recover is a simple program whose purpose is to
+ search for blocks in .bz2 files, and write each block out
+ into its own .bz2 file. You can then use bzip2 -t to test
+ the integrity of the resulting files, and decompress those
+ which are undamaged.
+
+ bzip2recover takes a single argument, the name of the dam-
+ aged file, and writes a number of files "rec0001file.bz2",
+ "rec0002file.bz2", etc, containing the extracted blocks.
+ The output filenames are designed so that the use of wild-
+ cards in subsequent processing -- for example, "bzip2 -dc
+ rec*file.bz2 > recovered_data" -- lists the files in the
+ "right" order.
+
+ bzip2recover should be of most use dealing with large .bz2
+ files, as these will contain many blocks. It is clearly
+ futile to use it on damaged single-block files, since a
+ damaged block cannot be recovered. If you wish to min-
+ imise any potential data loss through media or transmis-
+ sion errors, you might consider compressing with a smaller
+ block size.
+
+
+PERFORMANCE NOTES
+ The sorting phase of compression gathers together similar
+
+
+
+ 5
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ strings in the file. Because of this, files containing
+ very long runs of repeated symbols, like "aabaabaabaab
+ ..." (repeated several hundred times) may compress
+ extraordinarily slowly. You can use the -vvvvv option to
+ monitor progress in great detail, if you want. Decompres-
+ sion speed is unaffected.
+
+ Such pathological cases seem rare in practice, appearing
+ mostly in artificially-constructed test files, and in low-
+ level disk images. It may be inadvisable to use bzip2 to
+ compress the latter. If you do get a file which causes
+ severe slowness in compression, try making the block size
+ as small as possible, with flag -1.
+
+ Incompressible or virtually-incompressible data may decom-
+ press rather more slowly than one would hope. This is due
+ to a naive implementation of the move-to-front coder.
+
+ bzip2 usually allocates several megabytes of memory to
+ operate in, and then charges all over it in a fairly ran-
+ dom fashion. This means that performance, both for com-
+ pressing and decompressing, is largely determined by the
+ speed at which your machine can service cache misses.
+ Because of this, small changes to the code to reduce the
+ miss rate have been observed to give disproportionately
+ large performance improvements. I imagine bzip2 will per-
+ form best on machines with very large caches.
+
+ Test mode (-t) uses the low-memory decompression algorithm
+ (-s). This means test mode does not run as fast as it
+ could; it could run as fast as the normal decompression
+ machinery. This could easily be fixed at the cost of some
+ code bloat.
+
+
+CAVEATS
+ I/O error messages are not as helpful as they could be.
+ Bzip2 tries hard to detect I/O errors and exit cleanly,
+ but the details of what the problem is sometimes seem
+ rather misleading.
+
+ This manual page pertains to version 0.1 of bzip2. It may
+ well happen that some future version will use a different
+ compressed file format. If you try to decompress, using
+ 0.1, a .bz2 file created with some future version which
+ uses a different compressed file format, 0.1 will complain
+ that your file "is not a bzip2 file". If that happens,
+ you should obtain a more recent version of bzip2 and use
+ that to decompress the file.
+
+ Wildcard expansion for Windows 95 and NT is flaky.
+
+ bzip2recover uses 32-bit integers to represent bit posi-
+ tions in compressed files, so it cannot handle compressed
+
+
+
+ 6
+
+
+
+
+
+bzip2(1) bzip2(1)
+
+
+ files more than 512 megabytes long. This could easily be
+ fixed.
+
+ bzip2recover sometimes reports a very small, incomplete
+ final block. This is spurious and can be safely ignored.
+
+
+RELATIONSHIP TO bzip-0.21
+ This program is a descendant of the bzip program, version
+ 0.21, which I released in August 1996. The primary dif-
+ ference of bzip2 is its avoidance of the possibly patented
+ algorithms which were used in 0.21. bzip2 also brings
+ various useful refinements (-s, -t), uses less memory,
+ decompresses significantly faster, and has support for
+ recovering data from damaged files.
+
+ Because bzip2 uses Huffman coding to construct the com-
+ pressed bitstream, rather than the arithmetic coding used
+ in 0.21, the compressed representations generated by the
+ two programs are incompatible, and they will not interop-
+ erate. The change in suffix from .bz to .bz2 reflects
+ this. It would have been helpful to at least allow bzip2
+ to decompress files created by 0.21, but this would defeat
+ the primary aim of having a patent-free compressor.
+
+ Huffman coding necessarily involves some coding ineffi-
+ ciency compared to arithmetic coding. This means that
+ bzip2 compresses about 1% worse than 0.21, an unfortunate
+ but unavoidable fact-of-life. On the other hand, decom-
+ pression is approximately 50% faster for the same reason,
+ and the change in file format gave an opportunity to add
+ data-recovery features. So it is not all bad.
+
+
+AUTHOR
+ Julian Seward, jseward@acm.org.
+
+ The ideas embodied in bzip and bzip2 are due to (at least)
+ the following people: Michael Burrows and David Wheeler
+ (for the block sorting transformation), David Wheeler
+ (again, for the Huffman coder), Peter Fenwick (for the
+ structured coding model in 0.21, and many refinements),
+ and Alistair Moffat, Radford Neal and Ian Witten (for the
+ arithmetic coder in 0.21). I am much indebted for their
+ help, support and advice. See the file ALGORITHMS in the
+ source distribution for pointers to sources of documenta-
+ tion. Christian von Roques encouraged me to look for
+ faster sorting algorithms, so as to speed up compression.
+ Bela Lubkin encouraged me to improve the worst-case com-
+ pression performance. Many people sent patches, helped
+ with portability problems, lent machines, gave advice and
+ were generally helpful.
+
+
+
+
+
+ 7
+
+
diff --git a/bzip2recover.c b/bzip2recover.c
new file mode 100644
index 0000000..efdfb3c
--- /dev/null
+++ b/bzip2recover.c
@@ -0,0 +1,399 @@
+
+/*-----------------------------------------------------------*/
+/*--- Block recoverer program for bzip2 ---*/
+/*--- bzip2recover.c ---*/
+/*-----------------------------------------------------------*/
+
+/*--
+ This program is bzip2recover, a program to attempt data
+ salvage from damaged files created by the accompanying
+ bzip2 program.
+
+ Copyright (C) 1996, 1997 by Julian Seward.
+ Guildford, Surrey, UK
+ email: jseward@acm.org
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ The GNU General Public License is contained in the file LICENSE.
+--*/
+
+
+#include <stdio.h>
+#include <errno.h>
+#include <malloc.h>
+#include <stdlib.h>
+#include <strings.h> /*-- or try string.h --*/
+
+#define UInt32 unsigned int
+#define Int32 int
+#define UChar unsigned char
+#define Char char
+#define Bool unsigned char
+#define True 1
+#define False 0
+
+
+Char inFileName[2000];
+Char outFileName[2000];
+Char progName[2000];
+
+UInt32 bytesOut = 0;
+UInt32 bytesIn = 0;
+
+
+/*---------------------------------------------------*/
+/*--- I/O errors ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+void readError ( void )
+{
+ fprintf ( stderr,
+ "%s: I/O error reading `%s', possible reason follows.\n",
+ progName, inFileName );
+ perror ( progName );
+ fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n",
+ progName );
+ exit ( 1 );
+}
+
+
+/*---------------------------------------------*/
+void writeError ( void )
+{
+ fprintf ( stderr,
+ "%s: I/O error reading `%s', possible reason follows.\n",
+ progName, inFileName );
+ perror ( progName );
+ fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n",
+ progName );
+ exit ( 1 );
+}
+
+
+/*---------------------------------------------*/
+void mallocFail ( Int32 n )
+{
+ fprintf ( stderr,
+ "%s: malloc failed on request for %d bytes.\n",
+ progName, n );
+ fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n",
+ progName );
+ exit ( 1 );
+}
+
+
+/*---------------------------------------------------*/
+/*--- Bit stream I/O ---*/
+/*---------------------------------------------------*/
+
+typedef
+ struct {
+ FILE* handle;
+ Int32 buffer;
+ Int32 buffLive;
+ Char mode;
+ }
+ BitStream;
+
+
+/*---------------------------------------------*/
+BitStream* bsOpenReadStream ( FILE* stream )
+{
+ BitStream *bs = malloc ( sizeof(BitStream) );
+ if (bs == NULL) mallocFail ( sizeof(BitStream) );
+ bs->handle = stream;
+ bs->buffer = 0;
+ bs->buffLive = 0;
+ bs->mode = 'r';
+ return bs;
+}
+
+
+/*---------------------------------------------*/
+BitStream* bsOpenWriteStream ( FILE* stream )
+{
+ BitStream *bs = malloc ( sizeof(BitStream) );
+ if (bs == NULL) mallocFail ( sizeof(BitStream) );
+ bs->handle = stream;
+ bs->buffer = 0;
+ bs->buffLive = 0;
+ bs->mode = 'w';
+ return bs;
+}
+
+
+/*---------------------------------------------*/
+void bsPutBit ( BitStream* bs, Int32 bit )
+{
+ if (bs->buffLive == 8) {
+ Int32 retVal = putc ( (UChar) bs->buffer, bs->handle );
+ if (retVal == EOF) writeError();
+ bytesOut++;
+ bs->buffLive = 1;
+ bs->buffer = bit & 0x1;
+ } else {
+ bs->buffer = ( (bs->buffer << 1) | (bit & 0x1) );
+ bs->buffLive++;
+ };
+}
+
+
+/*---------------------------------------------*/
+/*--
+ Returns 0 or 1, or 2 to indicate EOF.
+--*/
+Int32 bsGetBit ( BitStream* bs )
+{
+ if (bs->buffLive > 0) {
+ bs->buffLive --;
+ return ( ((bs->buffer) >> (bs->buffLive)) & 0x1 );
+ } else {
+ Int32 retVal = getc ( bs->handle );
+ if ( retVal == EOF ) {
+ if (errno != 0) readError();
+ return 2;
+ }
+ bs->buffLive = 7;
+ bs->buffer = retVal;
+ return ( ((bs->buffer) >> 7) & 0x1 );
+ }
+}
+
+
+/*---------------------------------------------*/
+void bsClose ( BitStream* bs )
+{
+ Int32 retVal;
+
+ if ( bs->mode == 'w' ) {
+ while ( bs->buffLive < 8 ) {
+ bs->buffLive++;
+ bs->buffer <<= 1;
+ };
+ retVal = putc ( (UChar) (bs->buffer), bs->handle );
+ if (retVal == EOF) writeError();
+ bytesOut++;
+ retVal = fflush ( bs->handle );
+ if (retVal == EOF) writeError();
+ }
+ retVal = fclose ( bs->handle );
+ if (retVal == EOF)
+ if (bs->mode == 'w') writeError(); else readError();
+ free ( bs );
+}
+
+
+/*---------------------------------------------*/
+void bsPutUChar ( BitStream* bs, UChar c )
+{
+ Int32 i;
+ for (i = 7; i >= 0; i--)
+ bsPutBit ( bs, (((UInt32) c) >> i) & 0x1 );
+}
+
+
+/*---------------------------------------------*/
+void bsPutUInt32 ( BitStream* bs, UInt32 c )
+{
+ Int32 i;
+
+ for (i = 31; i >= 0; i--)
+ bsPutBit ( bs, (c >> i) & 0x1 );
+}
+
+
+/*---------------------------------------------*/
+Bool endsInBz2 ( Char* name )
+{
+ Int32 n = strlen ( name );
+ if (n <= 4) return False;
+ return
+ (name[n-4] == '.' &&
+ name[n-3] == 'b' &&
+ name[n-2] == 'z' &&
+ name[n-1] == '2');
+}
+
+
+/*---------------------------------------------------*/
+/*--- ---*/
+/*---------------------------------------------------*/
+
+#define BLOCK_HEADER_HI 0x00003141UL
+#define BLOCK_HEADER_LO 0x59265359UL
+
+#define BLOCK_ENDMARK_HI 0x00001772UL
+#define BLOCK_ENDMARK_LO 0x45385090UL
+
+Int32 main ( Int32 argc, Char** argv )
+{
+ FILE* inFile;
+ FILE* outFile;
+ BitStream* bsIn, *bsWr;
+ Int32 currBlock, b, wrBlock;
+ UInt32 bitsRead;
+ UInt32 bStart[20000];
+ UInt32 bEnd[20000];
+ UInt32 buffHi, buffLo, blockCRC;
+ Char* p;
+
+ strcpy ( progName, argv[0] );
+ inFileName[0] = outFileName[0] = 0;
+
+ fprintf ( stderr, "bzip2recover: extracts blocks from damaged .bz2 files.\n" );
+
+ if (argc != 2) {
+ fprintf ( stderr, "%s: usage is `%s damaged_file_name'.\n",
+ progName, progName );
+ exit(1);
+ }
+
+ strcpy ( inFileName, argv[1] );
+
+ inFile = fopen ( inFileName, "rb" );
+ if (inFile == NULL) {
+ fprintf ( stderr, "%s: can't read `%s'\n", progName, inFileName );
+ exit(1);
+ }
+
+ bsIn = bsOpenReadStream ( inFile );
+ fprintf ( stderr, "%s: searching for block boundaries ...\n", progName );
+
+ bitsRead = 0;
+ buffHi = buffLo = 0;
+ currBlock = 0;
+ bStart[currBlock] = 0;
+
+ while (True) {
+ b = bsGetBit ( bsIn );
+ bitsRead++;
+ if (b == 2) {
+ if (bitsRead >= bStart[currBlock] &&
+ (bitsRead - bStart[currBlock]) >= 40) {
+ bEnd[currBlock] = bitsRead-1;
+ if (currBlock > 0)
+ fprintf ( stderr, " block %d runs from %d to %d (incomplete)\n",
+ currBlock, bStart[currBlock], bEnd[currBlock] );
+ } else
+ currBlock--;
+ break;
+ }
+ buffHi = (buffHi << 1) | (buffLo >> 31);
+ buffLo = (buffLo << 1) | (b & 1);
+ if ( ( (buffHi & 0x0000ffff) == BLOCK_HEADER_HI
+ && buffLo == BLOCK_HEADER_LO)
+ ||
+ ( (buffHi & 0x0000ffff) == BLOCK_ENDMARK_HI
+ && buffLo == BLOCK_ENDMARK_LO)
+ ) {
+ if (bitsRead > 49)
+ bEnd[currBlock] = bitsRead-49; else
+ bEnd[currBlock] = 0;
+ if (currBlock > 0)
+ fprintf ( stderr, " block %d runs from %d to %d\n",
+ currBlock, bStart[currBlock], bEnd[currBlock] );
+ currBlock++;
+ bStart[currBlock] = bitsRead;
+ }
+ }
+
+ bsClose ( bsIn );
+
+ /*-- identified blocks run from 1 to currBlock inclusive. --*/
+
+ if (currBlock < 1) {
+ fprintf ( stderr,
+ "%s: sorry, I couldn't find any block boundaries.\n",
+ progName );
+ exit(1);
+ };
+
+ fprintf ( stderr, "%s: splitting into blocks\n", progName );
+
+ inFile = fopen ( inFileName, "rb" );
+ if (inFile == NULL) {
+ fprintf ( stderr, "%s: can't open `%s'\n", progName, inFileName );
+ exit(1);
+ }
+ bsIn = bsOpenReadStream ( inFile );
+
+ /*-- placate gcc's dataflow analyser --*/
+ blockCRC = 0; bsWr = 0;
+
+ bitsRead = 0;
+ outFile = NULL;
+ wrBlock = 1;
+ while (True) {
+ b = bsGetBit(bsIn);
+ if (b == 2) break;
+ buffHi = (buffHi << 1) | (buffLo >> 31);
+ buffLo = (buffLo << 1) | (b & 1);
+ if (bitsRead == 47+bStart[wrBlock])
+ blockCRC = (buffHi << 16) | (buffLo >> 16);
+
+ if (outFile != NULL && bitsRead >= bStart[wrBlock]
+ && bitsRead <= bEnd[wrBlock]) {
+ bsPutBit ( bsWr, b );
+ }
+
+ bitsRead++;
+
+ if (bitsRead == bEnd[wrBlock]+1) {
+ if (outFile != NULL) {
+ bsPutUChar ( bsWr, 0x17 ); bsPutUChar ( bsWr, 0x72 );
+ bsPutUChar ( bsWr, 0x45 ); bsPutUChar ( bsWr, 0x38 );
+ bsPutUChar ( bsWr, 0x50 ); bsPutUChar ( bsWr, 0x90 );
+ bsPutUInt32 ( bsWr, blockCRC );
+ bsClose ( bsWr );
+ }
+ if (wrBlock >= currBlock) break;
+ wrBlock++;
+ } else
+ if (bitsRead == bStart[wrBlock]) {
+ outFileName[0] = 0;
+ sprintf ( outFileName, "rec%4d", wrBlock );
+ for (p = outFileName; *p != 0; p++) if (*p == ' ') *p = '0';
+ strcat ( outFileName, inFileName );
+ if ( !endsInBz2(outFileName)) strcat ( outFileName, ".bz2" );
+
+ fprintf ( stderr, " writing block %d to `%s' ...\n",
+ wrBlock, outFileName );
+
+ outFile = fopen ( outFileName, "wb" );
+ if (outFile == NULL) {
+ fprintf ( stderr, "%s: can't write `%s'\n",
+ progName, outFileName );
+ exit(1);
+ }
+ bsWr = bsOpenWriteStream ( outFile );
+ bsPutUChar ( bsWr, 'B' ); bsPutUChar ( bsWr, 'Z' );
+ bsPutUChar ( bsWr, 'h' ); bsPutUChar ( bsWr, '9' );
+ bsPutUChar ( bsWr, 0x31 ); bsPutUChar ( bsWr, 0x41 );
+ bsPutUChar ( bsWr, 0x59 ); bsPutUChar ( bsWr, 0x26 );
+ bsPutUChar ( bsWr, 0x53 ); bsPutUChar ( bsWr, 0x59 );
+ }
+ }
+
+ fprintf ( stderr, "%s: finished\n", progName );
+ return 0;
+}
+
+
+
+/*-----------------------------------------------------------*/
+/*--- end bzip2recover.c ---*/
+/*-----------------------------------------------------------*/
diff --git a/sample1.bz2 b/sample1.bz2
new file mode 100644
index 0000000..18dea60
--- /dev/null
+++ b/sample1.bz2
Binary files differ
diff --git a/sample1.ref b/sample1.ref
new file mode 100644
index 0000000..a56e52b
--- /dev/null
+++ b/sample1.ref
Binary files differ
diff --git a/sample2.bz2 b/sample2.bz2
new file mode 100644
index 0000000..d5a6160
--- /dev/null
+++ b/sample2.bz2
Binary files differ
diff --git a/sample2.ref b/sample2.ref
new file mode 100644
index 0000000..34af958
--- /dev/null
+++ b/sample2.ref
Binary files differ
diff --git a/test.bat b/test.bat
new file mode 100644
index 0000000..30b747d
--- /dev/null
+++ b/test.bat
@@ -0,0 +1,9 @@
+@rem
+@rem MSDOS test driver for bzip2
+@rem
+type words1
+.\bzip2 -1 < sample1.ref > sample1.rbz
+.\bzip2 -2 < sample2.ref > sample2.rbz
+.\bzip2 -dvv < sample1.bz2 > sample1.tst
+.\bzip2 -dvv < sample2.bz2 > sample2.tst
+type words3sh
\ No newline at end of file
diff --git a/test.cmd b/test.cmd
new file mode 100644
index 0000000..f7bc866
--- /dev/null
+++ b/test.cmd
@@ -0,0 +1,9 @@
+@rem
+@rem OS/2 test driver for bzip2
+@rem
+type words1
+.\bzip2 -1 < sample1.ref > sample1.rbz
+.\bzip2 -2 < sample2.ref > sample2.rbz
+.\bzip2 -dvv < sample1.bz2 > sample1.tst
+.\bzip2 -dvv < sample2.bz2 > sample2.tst
+type words3sh
\ No newline at end of file
diff --git a/words0 b/words0
new file mode 100644
index 0000000..527fb43
--- /dev/null
+++ b/words0
@@ -0,0 +1,7 @@
+***-------------------------------------------------***
+***--------- IMPORTANT: READ WHAT FOLLOWS! ---------***
+***--------- viz: pay attention :-) ---------***
+***-------------------------------------------------***
+
+Compiling bzip2 ...
+
diff --git a/words1 b/words1
new file mode 100644
index 0000000..c75293b
--- /dev/null
+++ b/words1
@@ -0,0 +1,5 @@
+
+
+Doing 4 tests (2 compress, 2 uncompress) ...
+If there's a problem, things might stop at this point.
+
diff --git a/words2 b/words2
new file mode 100644
index 0000000..d3cafb9
--- /dev/null
+++ b/words2
@@ -0,0 +1,6 @@
+
+
+Checking test results. If any of the four "cmp"s which follow
+report any differences, something is wrong. If you can't easily
+figure out what, please let me know (jseward@acm.org).
+
diff --git a/words3 b/words3
new file mode 100644
index 0000000..5739d18
--- /dev/null
+++ b/words3
@@ -0,0 +1,23 @@
+
+
+If you got this far and the "cmp"s didn't find anything amiss, looks
+like you're in business. You should install bzip2 and bunzip2:
+
+ copy bzip2 to a public place, maybe /usr/bin.
+ In that public place, make bunzip2 a symbolic link
+ to the bzip2 you just copied there.
+ Put the manual page, bzip2.1, somewhere appropriate;
+ perhaps in /usr/man/man1.
+
+Complete instructions for use are in the preformatted
+manual page, in the file bzip2.1.preformatted.
+
+You can also do "bzip2 --help" to see some helpful information.
+
+"bzip2 -L" displays the software license.
+
+Please read the README file carefully.
+Finally, note that bzip2 comes with ABSOLUTELY NO WARRANTY.
+
+Happy compressing!
+
diff --git a/words3sh b/words3sh
new file mode 100644
index 0000000..1139177
--- /dev/null
+++ b/words3sh
@@ -0,0 +1,12 @@
+If you got this far and the "bzip2 -dvv"s give identical
+stored vs computed CRCs, you're probably in business.
+Complete instructions for use are in the preformatted manual page,
+in the file bzip2.txt.
+
+You can also do "bzip2 --help" to see some helpful information.
+"bzip2 -L" displays the software license.
+
+Please read the README file carefully.
+Finally, note that bzip2 comes with ABSOLUTELY NO WARRANTY.
+
+Happy compressing!
\ No newline at end of file