doc/liblzma-security.txt - third_party/xz - Git at Google


 Using liblzma securely
 ----------------------

 0. Introduction

     This document discusses how to use liblzma securely. There are issues
     that don't apply to zlib or libbzip2, so reading this document is
     strongly recommended even for those who are very familiar with zlib
     or libbzip2.

     While making liblzma itself as secure as possible is essential, it's
     out of scope of this document.


 1. Memory usage

     The memory usage of liblzma varies a lot.


 1.1. Problem sources

 1.1.1. Block coder

     The memory requirements of Block encoder depend on the used filters
     and their settings. The memory requirements of the Block decoder
     depend on the which filters and with which filter settings the Block
     was encoded. Usually the memory requirements of a decoder are equal
     or less than the requirements of the encoder with the same settings.

     While the typical memory requirements to decode a Block is from a few
     hundred kilobytes to tens of megabytes, a maliciously constructed
     files can require a lot more RAM to decode. With the current filters,
     the maximum amount is about 7 GiB. If you use multi-threaded decoding,
     every Block can require this amount of RAM, thus a four-threaded
     decoder could suddenly try to allocate 28 GiB of RAM.

     If you don't limit the maximum memory usage in any way, and there are
     no resource limits set on the operating system side, one malicious
     input file can run the system out of memory, or at least make it swap
     badly for a long time. This is exceptionally bad on servers e.g.
     email server doing virus scanning on incoming messages.


 1.1.2. Metadata decoder

     Multi-Block .lzma files contain at least one Metadata Block.
     Externally the Metadata Blocks are similar to Data Blocks, so all
     the issues mentioned about memory usage of Data Blocks applies to
     Metadata Blocks too.

     The uncompressed content of Metadata Blocks contain information about
     the Stream as a whole, and optionally some Extra Records. The
     information about the Stream is kept in liblzma's internal data
     structures in RAM. Extra Records can contain arbitrary data. They are
     not interpreted by liblzma, but liblzma will provide them to the
     application in uninterpreted form if the application wishes so.

     Usually the Uncompressed Size of a Metadata Block is small. Even on
     extreme cases, it shouldn't be much bigger than a few megabytes. Once
     the Metadata has been parsed into native data structures in liblzma,
     it usually takes a little more memory than in the encoded form. For
     all normal files, this is no problem, since the resulting memory usage
     won't be too much.

     The problem is that a maliciously constructed Metadata Block can
     contain huge amount of "information", which liblzma will try to store
     in its internal data structures. This may cause liblzma to allocate
     all the available RAM unless some kind of resource usage limits are
     applied.

     Note that the Extra Records in Metadata are always parsed but, but
     memory is allocated for them only if the application has requested
     liblzma to provide the Extra Records to the application.


 1.2. Solutions

     If you need to decode files from untrusted sources (most people do),
     you must limit the memory usage to avoid denial of service (DoS)
     conditions caused by malicious input files.

     The first step is to find out how much memory you are allowed consume
     at maximum. This may be a hardcoded constant or derived from the
     available RAM; whatever is appropriate in the application.

     The simplest solution is to use setrlimit() if the kernel supports
     RLIMIT_AS, which limits the memory usage of the whole process.
     For more portable and fine-grained limiting, you can use
     memory limiter functions found from <lzma/memlimit.h>.


 1.2.1. Encoder

     lzma_memory_usage() will give you a rough estimate about the memory
     usage of the given filter chain. To dramatically simplify the internal
     implementation, this function doesn't take into account all the small
     helper data structures needed in various places; only the structures
     with significant memory usage are taken into account. Still, the
     accuracy of this function should be well within a mebibyte.

     The Subblock filter is a special case. If a Subfilter has been
     specified, it isn't taken into account when lzma_memory_usage()
     calculates the memory usage. You need to calculate the memory usage
     of the Subfilter separately.

     Keeping track of Blocks in a Multi-Block Stream takes a few dozen
     bytes of RAM per Block (size of the lzma_index structure plus overhead
     of malloc()). It isn't a good idea to put tens of thousands of Blocks
     into a Stream unless you have a very good reason to do so (compressed
     dictionary could be an example of such situation).

     Also keep the number and sizes of Extra Records sane. If you produce
     the list of Extra Records automatically from some untrusted source,
     you should not only validate the content of these Records, but also
     their memory usage.


 1.2.2. Decoder

     A single-threaded decoder should simply use a memory limiter and
     indicate an error if it runs out of memory.

     Memory-limiting with multi-threaded decoding is tricky. The simple
     solution is to divide the maximum allowed memory usage with the
     maximum allowed threads, and give each Block decoder their own
     independent lzma_memory_limiter. The drawback is that if one Block
     needs notably more RAM than any other Block, the decoder will run out
     of memory when in reality there would be plenty of free RAM.

     An attractive alternative would be using shared lzma_memory_limiter.
     Depending on the application and the expected type of input, this may
     either be the best solution or a source of hard-to-repeat problems.
     Consider the following requirements:
       - You use a maximum of n threads.
       - x(i) is the decoder memory requirements of the Block number i
         in an expected input Stream.
       - The memory limiter is set to higher value than the sum of n
         highest values x(i).

     (If you are better at explaining the above conditions, please
     contribute your improved version.)

     If the above conditions aren't met, it is possible that the decoding
     will fail unpredictably. That is, on the same machine using the same
     settings, the decoding may sometimes succeed and sometimes fail. This
     is because sometimes threads may run so that the Blocks with highest
     memory usage are tried to be decoded at the same time.

     Most .lzma files have all the Blocks encoded with identical settings,
     or at least the memory usage won't vary dramatically. That's why most
     multi-threaded decoders probably want to use the simple "separate
     lzma_memory_limiter for each thread" solution, possibly falling back
     to single-threaded mode in case the per-thread memory limits aren't
     enough in multi-threaded mode.

 FIXME: Memory usage of Stream info.

 [

 ]


 2. Huge uncompressed output

 2.1. Data Blocks

     Decoding a tiny .lzma file can produce huge amount of uncompressed
     output. There is an example file of 45 bytes, which decodes to 64 PiB
     (that's 2^56 bytes). Uncompressing such a file to disk is likely to
     fill even a bigger disk array. If the data is written to a pipe, it
     may not fill the disk, but would still take very long time to finish.

     To avoid denial of service conditions caused by huge amount of
     uncompressed output, applications using liblzma should use some method
     to limit the amount of output produced. The exact method depends on
     the application.

     All valid .lzma Streams make it possible to find out the uncompressed
     size of the Stream without actually uncompressing the data. This
     information is available in at least one of the Metadata Blocks.
     Once the uncompressed size is parsed, the decoder can verify that
     it doesn't exceed certain limits (e.g. available disk space).

     When the uncompressed size is known, the decoder can actively keep
     track of the amount of output produced so far, and that it doesn't
     exceed the known uncompressed size. If it does exceed, the file is
     known to be corrupt and an error should be indicated without
     continuing to decode the rest of the file.

     Unfortunately, finding the uncompressed size beforehand is often
     possible only in non-streamed mode, because the needed information
     could be in the Footer Metdata Block, which (obviously) is at the
     end of the Stream. In purely streamed mode decoding, one may need to
     use some rough arbitrary limits to prevent the problems described in
     the beginning of this section.


 2.2. Metadata

     Metadata is stored in Metadata Blocks, which are very similar to
     Data Blocks. Thus, the uncompressed size can be huge just like with
     Data Blocks. The difference is, that the contents of Metadata Blocks
     aren't given to the application as is, but parsed by liblzma. Still,
     reading through a huge Metadata can take very long time, effectively
     creating a denial of service like piping decoded a Data Block to
     another process would do.

     At first it would seem that using a memory limiter would prevent
     this issue as a side effect. But it does so only if the application
     requests liblzma to allocate the Extra Records and provide them to
     the application. If Extra Records aren't requested, they aren't
     allocated either. Still, the Extra Records are being read through
     to validate that the Metadata is in proper format.

     The solution is to limit the Uncompressed Size of a Metadata Block
     to some relatively large value. This will make liblzma to give an
     error when the given limit is reached.

	Using liblzma securely
	----------------------

	0. Introduction

	This document discusses how to use liblzma securely. There are issues
	that don't apply to zlib or libbzip2, so reading this document is
	strongly recommended even for those who are very familiar with zlib
	or libbzip2.

	While making liblzma itself as secure as possible is essential, it's
	out of scope of this document.


	1. Memory usage

	The memory usage of liblzma varies a lot.


	1.1. Problem sources

	1.1.1. Block coder

	The memory requirements of Block encoder depend on the used filters
	and their settings. The memory requirements of the Block decoder
	depend on the which filters and with which filter settings the Block
	was encoded. Usually the memory requirements of a decoder are equal
	or less than the requirements of the encoder with the same settings.

	While the typical memory requirements to decode a Block is from a few
	hundred kilobytes to tens of megabytes, a maliciously constructed
	files can require a lot more RAM to decode. With the current filters,
	the maximum amount is about 7 GiB. If you use multi-threaded decoding,
	every Block can require this amount of RAM, thus a four-threaded
	decoder could suddenly try to allocate 28 GiB of RAM.

	If you don't limit the maximum memory usage in any way, and there are
	no resource limits set on the operating system side, one malicious
	input file can run the system out of memory, or at least make it swap
	badly for a long time. This is exceptionally bad on servers e.g.
	email server doing virus scanning on incoming messages.


	1.1.2. Metadata decoder

	Multi-Block .lzma files contain at least one Metadata Block.
	Externally the Metadata Blocks are similar to Data Blocks, so all
	the issues mentioned about memory usage of Data Blocks applies to
	Metadata Blocks too.

	The uncompressed content of Metadata Blocks contain information about
	the Stream as a whole, and optionally some Extra Records. The
	information about the Stream is kept in liblzma's internal data
	structures in RAM. Extra Records can contain arbitrary data. They are
	not interpreted by liblzma, but liblzma will provide them to the
	application in uninterpreted form if the application wishes so.

	Usually the Uncompressed Size of a Metadata Block is small. Even on
	extreme cases, it shouldn't be much bigger than a few megabytes. Once
	the Metadata has been parsed into native data structures in liblzma,
	it usually takes a little more memory than in the encoded form. For
	all normal files, this is no problem, since the resulting memory usage
	won't be too much.

	The problem is that a maliciously constructed Metadata Block can
	contain huge amount of "information", which liblzma will try to store
	in its internal data structures. This may cause liblzma to allocate
	all the available RAM unless some kind of resource usage limits are
	applied.

	Note that the Extra Records in Metadata are always parsed but, but
	memory is allocated for them only if the application has requested
	liblzma to provide the Extra Records to the application.


	1.2. Solutions

	If you need to decode files from untrusted sources (most people do),
	you must limit the memory usage to avoid denial of service (DoS)
	conditions caused by malicious input files.

	The first step is to find out how much memory you are allowed consume
	at maximum. This may be a hardcoded constant or derived from the
	available RAM; whatever is appropriate in the application.

	The simplest solution is to use setrlimit() if the kernel supports
	RLIMIT_AS, which limits the memory usage of the whole process.
	For more portable and fine-grained limiting, you can use
	memory limiter functions found from <lzma/memlimit.h>.


	1.2.1. Encoder

	lzma_memory_usage() will give you a rough estimate about the memory
	usage of the given filter chain. To dramatically simplify the internal
	implementation, this function doesn't take into account all the small
	helper data structures needed in various places; only the structures
	with significant memory usage are taken into account. Still, the
	accuracy of this function should be well within a mebibyte.

	The Subblock filter is a special case. If a Subfilter has been
	specified, it isn't taken into account when lzma_memory_usage()
	calculates the memory usage. You need to calculate the memory usage
	of the Subfilter separately.

	Keeping track of Blocks in a Multi-Block Stream takes a few dozen
	bytes of RAM per Block (size of the lzma_index structure plus overhead
	of malloc()). It isn't a good idea to put tens of thousands of Blocks
	into a Stream unless you have a very good reason to do so (compressed
	dictionary could be an example of such situation).

	Also keep the number and sizes of Extra Records sane. If you produce
	the list of Extra Records automatically from some untrusted source,
	you should not only validate the content of these Records, but also
	their memory usage.


	1.2.2. Decoder

	A single-threaded decoder should simply use a memory limiter and
	indicate an error if it runs out of memory.

	Memory-limiting with multi-threaded decoding is tricky. The simple
	solution is to divide the maximum allowed memory usage with the
	maximum allowed threads, and give each Block decoder their own
	independent lzma_memory_limiter. The drawback is that if one Block
	needs notably more RAM than any other Block, the decoder will run out
	of memory when in reality there would be plenty of free RAM.

	An attractive alternative would be using shared lzma_memory_limiter.
	Depending on the application and the expected type of input, this may
	either be the best solution or a source of hard-to-repeat problems.
	Consider the following requirements:
	- You use a maximum of n threads.
	- x(i) is the decoder memory requirements of the Block number i
	in an expected input Stream.
	- The memory limiter is set to higher value than the sum of n
	highest values x(i).

	(If you are better at explaining the above conditions, please
	contribute your improved version.)

	If the above conditions aren't met, it is possible that the decoding
	will fail unpredictably. That is, on the same machine using the same
	settings, the decoding may sometimes succeed and sometimes fail. This
	is because sometimes threads may run so that the Blocks with highest
	memory usage are tried to be decoded at the same time.

	Most .lzma files have all the Blocks encoded with identical settings,
	or at least the memory usage won't vary dramatically. That's why most
	multi-threaded decoders probably want to use the simple "separate
	lzma_memory_limiter for each thread" solution, possibly falling back
	to single-threaded mode in case the per-thread memory limits aren't
	enough in multi-threaded mode.

	FIXME: Memory usage of Stream info.

	[

	]


	2. Huge uncompressed output

	2.1. Data Blocks

	Decoding a tiny .lzma file can produce huge amount of uncompressed
	output. There is an example file of 45 bytes, which decodes to 64 PiB
	(that's 2^56 bytes). Uncompressing such a file to disk is likely to
	fill even a bigger disk array. If the data is written to a pipe, it
	may not fill the disk, but would still take very long time to finish.

	To avoid denial of service conditions caused by huge amount of
	uncompressed output, applications using liblzma should use some method
	to limit the amount of output produced. The exact method depends on
	the application.

	All valid .lzma Streams make it possible to find out the uncompressed
	size of the Stream without actually uncompressing the data. This
	information is available in at least one of the Metadata Blocks.
	Once the uncompressed size is parsed, the decoder can verify that
	it doesn't exceed certain limits (e.g. available disk space).

	When the uncompressed size is known, the decoder can actively keep
	track of the amount of output produced so far, and that it doesn't
	exceed the known uncompressed size. If it does exceed, the file is
	known to be corrupt and an error should be indicated without
	continuing to decode the rest of the file.

	Unfortunately, finding the uncompressed size beforehand is often
	possible only in non-streamed mode, because the needed information
	could be in the Footer Metdata Block, which (obviously) is at the
	end of the Stream. In purely streamed mode decoding, one may need to
	use some rough arbitrary limits to prevent the problems described in
	the beginning of this section.


	2.2. Metadata

	Metadata is stored in Metadata Blocks, which are very similar to
	Data Blocks. Thus, the uncompressed size can be huge just like with
	Data Blocks. The difference is, that the contents of Metadata Blocks
	aren't given to the application as is, but parsed by liblzma. Still,
	reading through a huge Metadata can take very long time, effectively
	creating a denial of service like piping decoded a Data Block to
	another process would do.

	At first it would seem that using a memory limiter would prevent
	this issue as a side effect. But it does so only if the application
	requests liblzma to allocate the Extra Records and provide them to
	the application. If Extra Records aren't requested, they aren't
	allocated either. Still, the Extra Records are being read through
	to validate that the Metadata is in proper format.

	The solution is to limit the Uncompressed Size of a Metadata Block
	to some relatively large value. This will make liblzma to give an
	error when the given limit is reached.