You are here: The Oldskool PC/Oldskool PC Software/LZ4 Decompression for the 8088

LZ4_8088

Extremely fast LZ4 decompression for the 8088/8086 CPU

Introduction
Downloads
License
Questions and Answers
Roadmap
Known Bugs
Development History

Introduction

LZ4_8088 is a chunk of assembly code that implements incredibly fast LZ4 decompression for 8088 and 8086 CPUs. It is specifically optimized for the 8088 and is intended for use in hobby or retrocomputing projects. Code is provided for most generic x86 assemblers, as well as an example of how to use it in Turbo Pascal.

LZ4 itself is a compresson format that is implemented as an open-source C library, with ports to other platforms. The goal of the library is speed, and it currently holds many top speed rankings in compression benchmarks. One variant of LZ4 called LZ4_HC implements optimal parsing, which creates the very best possible set of literal and match runs for a given input and coding set. This leads to compression ratios that are competitive with PKZIP/zlib, but unlike PKZIP, LZ4 doesn't implement order-0 coding (ie. bit twiddling which 8088/8086 is very bad at) so the end result decompresses at nearly memcpy() speeds.

Here are some statistics comparing LZ4 with PKWare's Data Compression Library (DCL). The DCL was chosen as a comparison because it uses an algorithm very similar to deflate (which was considered state of the art speed-wise for many years on DOS platforms), and also because the DCL could be measured with microsecond accuracy. To eliminate implementation bias, PKWare's retail DCL compiled assembly code was used.

Data Type	Filename	Original size	LZ4_HC compressed size	LZ4_HC ratio	DCL Implode size	DCL Ratio	memcpy() time in µs (REP MOVSW)	LZ4 Decompression Speed (x slower than memcpy)	DCL Decompression speed (x slower than memcpy)
Sparse bitmap	shuttle.pic	16390	3297	20%	2552	16%	44989	1.91	20.50
RLE bitmap	rle.dmp	16384	436	3%	387	2%	44977	0.76	5.79
Small text file	text.txt	4988	3037	61%	2323	47%	13716	3.17	61.21
Large text file	largetxt.txt	56899	26890	47%	25642	45%	156457	3.17	54.43
Sparse compiled binary	robotron.com	40704	21048	52%	18178	45%	112036	2.75	53.51
Dense compiled binary	linewars.exe	61744	41500	67%	36641	59%	169924	2.16	70.08
Uncompressable data	random.bin	64000	64260	100%	71822	112%	176113	1.39	116.14

Key takeaways from the above table:

For most sources, decompression is never slower than 3.2x memcpy.
For source material that contains long runs of sequences (RLE), decompression is faster than memcpy.
Compression ratios are competitive; they're mostly within a few percentage points of PKWare ratios.

For full documentation, please see the included LZ4_8088.TXT in the download distribution. It is highly recommended you read the documentation to avoid any pitfalls using the code.

Downloads

lz4_8088.zip contains the assembler routine, a Turbo Pascal test harness, documentation, compression samples, and compiled binaries for Win32 and DOS 16-bit.

License

The LZ4 library is provided under the BSD License. However, my code was not derived from any of the original LZ4 code so I can provide it via any license I choose. So, I am providing my code under what I am calling the Demoscene License. The Demoscene License grants you the following rights:

You are free to use this code in any production, commercial or otherwise, without providing remuneration to the author.
If you use this code, you must greet "Trixter/Hornet" if used in a demoscene production, or "Jim Leonard" if used in a normal program. Also, you must send email to trixter@oldskool.org telling him you used the code so he can marvel at your result.

Questions and Answers

Q: Is this code faster than the LZ4 C source code? For 8088-80286 CPUs, yes. For 386+, no, because the 386 has 32-bit registers, additional segment registers, and additional instructions. Compiling the LZ4 C code with a suitable 32-bit compiler will produce code that definitely outperforms this 8088 assembly code.

Q: How is it possible to exceed the speed of a REP MOVSW? Because 1-byte and 2-byte runs in the compressed data are handled specially with REP STOSW. STOSW can set a fixed value in memory in half the time it takes to move a value from one memory location to another.

Q: Why did you bother doing this? Because I like impressing the teenage version of myself.

Roadmap

Currently this distribution only includes decompression code. It is possible to implement LZ4 "c0" compression on 8088 in as little as 16K memory and with reasonable speed, but I don't plan on developing it.

Development History

20130107: Initial release.

20130209: Now includes a size-optimized version of the code as well. The speed routine compiles to 446 bytes including the shift table; the size-optimized code is 30% slower but compiles to 79 bytes. Also, speed-optimized version of the code sped up an additional 1% by contributions from both Peter Ferrie and Terje Mathisen (thanks guys!)

20130210: Forgot an optimization; the size version now optimizes to 78 bytes.

20130213: Fixed a bug that could corrupt output if the matches overlapped.

20190615: Size version optimizes down to 79 bytes, not 78. Also unified the .PAS and .ASM files.

20190617: Added size fixups by Axel Kern. Refactored code into a single .ASM file; the .PAS is the test harness only.

20190630: Fixed a bug in _small; assembles to 78 bytes.

20200314: Added speed improvements by Pavel Zagrebin. These speed optimizations make Peter Ferrie's "reversed match offsets" optimization obsolete, so that version was removed.

20241105: Added speed updates from Peter Jauernig

[Oldskool Home] [Copyright and Usage] [Disclaimer] [Contact Me]

This page's content was last modified on Nov 5, 2024 7:54 pm.
If anything here has helped you, please consider donating to help keep this site alive.