Posts Tagged GZip

Example Huffman Compression Routine in C#

This last week I decided to sit down and hash out a simple Huffman compression routine using C#. I’d never created a compression routine before from scratch (my past implementations were static for the sake of time savings), so I fleshed one out. I know that many examples exist elsewhere on the net…. but they all seemed overly complicated and up their own ass :P

I had a couple goals in mind while creating my routine:

1. KEEP IT SIMPLE — A lot of routines out there WORK, but their code is too overly complicated for their own good. This over complication leads to slowness which brings me to my next goal ;) It should be a simple class that accepts input data, with simple public accessors that are easy to understand even for the novice developer (sorry folks, no asynchronous delegates). :P

2. MAKE IT FAST — When dealing with large amounts of data in C#, especially when running it through an algorithm, it’s all too easy to use all the handy built in virtual methods or using other build in tools which make coding easier with speed being the sacrifice. Die hard C++ developers will point to these routines as C#’s downfall as a legitimate language when it comes to data intensive tasks.

Read the rest of this entry »

, , , , ,

11 Comments

Why does the Microsoft .NET implementation of GZip compression suck?

I’ve been working on a self contained patch generator using C#. For the patch data payload, I wanted to use GZip (or deflate) to compress the payload, this way if you generated a patch for a 1MB file, the patch file wouldn’t be +1MB (5 bytes per change).

I did some quick tests to compare compression ratios on a simulated data payload:

Original Uncompressed Payload: 288k
GZipped using System.IO.Compression:
171k
Zipped using WinZip with “Super Fast” Compression: 105k
7-Zipped using 7-Zip with “Maximum” Compression: 61k

I had to do some research and dig into the MSDN libraries before the answer was finally revealed from the horses mouth:

“The compression functionality in DeflateStream and GZipStream is exposed as a stream. Data is read in on a byte-by-byte basis, so it is not possible to perform multiple passes to determine the best method for compressing entire files or large blocks of data. The DeflateStream and GZipStream classes are best used on uncompressed sources of data. If the source data is already compressed, using these classes may actually increase the size of the stream.”

It’s a bit disappointing that Microsoft itself would not recommend it’s own built in compression routine, and even go so far to say that it may “actually increase the size of the stream”. I hope in the upcoming version of .NET 3.0, they address some shortcomings of this area of the framework. Perhaps an exposed method where you can pass a byte[] which can be compressed using multiple passes and Adaptive Huffman Encoding.

I know there are several alternatives out there, such as assemblies which support the ZIP compression format, but wasn’t the goal of .NET to move away from DLL hell? ;-)

In the meantime, I’ll probably be sticking with the Microsoft implementation of GZip until I take some time to create my own Huffman encoding routine. I think a good place to start would be this book titled Data Compression: The Complete Reference. I remember thumbing through it once at a Borders book store and thinking it was a really great reference. At the time though, I had no use for such a resource.

Quick note to my readers: I understand that there are plenty of online resources for Adaptive Huffman Encoding and other compression methods using C# that have already been created. Personally, I learn by doing. So when I do my own research and create something from the ground up it really helps me understand how it all works.

, , , , ,

1 Comment