This last week I decided to sit down and hash out a simple Huffman compression routine using C#. I’d never created a compression routine before from scratch (my past implementations were static for the sake of time savings), so I fleshed one out. I know that many examples exist elsewhere on the net…. but they all seemed overly complicated and up their own ass
I had a couple goals in mind while creating my routine:
1. KEEP IT SIMPLE — A lot of routines out there WORK, but their code is too overly complicated for their own good. This over complication leads to slowness which brings me to my next goal
It should be a simple class that accepts input data, with simple public accessors that are easy to understand even for the novice developer (sorry folks, no asynchronous delegates).
2. MAKE IT FAST — When dealing with large amounts of data in C#, especially when running it through an algorithm, it’s all too easy to use all the handy built in virtual methods or using other build in tools which make coding easier with speed being the sacrifice. Die hard C++ developers will point to these routines as C#’s downfall as a legitimate language when it comes to data intensive tasks.
The class I came up with is pretty simple. I use a Generic List to store a collection of “Leaf” objects, which have several basic attributes that help not only identify its value but also its place in the tree. Using this method, I was able to utilize the built in methods of the List object (I know, for shame…. but it’s easier in this instance) by firing off anonymous delegates for searches and comparators. I’m not terribly worried about using these virtual methods here only because the creation and encoding of the tree is usually the smallest task in the process
The encoding and decoding is where I decided to focus on optimizations since this is where the BULK of the work is done. The .NET Framework has several methods that make working with binary data easy. You can use the Convert.ToString() method which allows you to pass in a BASE option, thus allowing you to convert any character to it’s binary representation. My original implementation used that method and the end result as embarrassingly slow
I went back to the drawing board and thought to myself, “If I had to re-write this in C++, how would I handle the encoding?” Duh, I’d be using bitwise operators up the wazoo!
After some recoding and pulling my hair out for a couple of hours, I was able to re-write the routine using bit operations and it works! On top of all that, it’s fast as all get out! My current benchmarks had it encoding a 1MB data set in under 1 second with ~50% compression. Not too shabby! Of course, compression ratios will vary depending on how normalized the input data set is.
Now I’m sure there’s some question you have on your mind and I’ll try to address them now:
Q: Does it use a lot of memory?
A: You bet your sweet ass it does!
Seriously though, it’s only the method in which I setup the class that requires the memory. I establish an input buffer within the class that you can write the ‘raw’ data to, which is then read from during encoding. In addition, during encoding I create an output buffer in memory where the ‘encoded’ data is written. So it stands to reason that if you’re encoding 100MB of data, this routine can easily gobble up 200MB of RAM or more. The rule of thumb I found was File Size * 4 would be the memory requirement. There’s optimizations you can make that would lower the memory footprint (like, build the frequency table without buffering then read the input 1 byte at a time, say from a file), but I felt that would over complicate the solution and make it too focused for one specific instance. The current implementation is kept general for a reason
Q: Could this be done faster in C++?
A: Yes, probably…. but not much faster. Although the code is written in C#, at run time the IL is compiled to x86. The bit operations we’re using would compile the exact same as a C++ routine (XOR is XOR, I don’t care what language you’re using). In addition, the encoding of the data itself is only using primitive native types which limits any cross language differences. In fact, you can paste the encode and decode routines into C++ and they work!
Your only speed improvement might be in the generation of the tree itself… but even then, that’s super small overhead when compared with the amount of data you’re probably compressing.
Of course, all that applied to the Encoding (Compression) side of the house, the decompression routine is pretty slow (about 3 seconds per 1MB of decompressed data) and could probably use some more optimizations.
Q: Is this any better than using the built in GZip or Deflate classes available in System.IO.Compression?
A: It’s not even close to being in competition with those routines
This is really just a proof of concept for BASIC Huffman Coding, which doesn’t take into account advanced features of modern compression routines such as pattern or content mapping. This routine is slower (especially in decompression), so I wouldn’t go making anything like this your #1 choice for a compression routine if others are available
So use this for educational purposes only.
Q: What version of the .NET Framework will this work with?
A: The code here was written in Visual Studio 2008 targeting .NET 3.5. I use some Framework 3.5 specific things (such as object initializers), but nothing that would make conversion difficult. I avoided LINQ only because I’m not entirely sold on the idea and I still like using anonymous delegates.
This code could be converted to Framework 2.0 with minor changes and possibly Framework 1.1, but that might require a little more effort.
Q: What is the format used to store the compressed data?
A: I encode the decompression information within the final output stream. The output format is like this:
Bytes 0 – 8: Final Output Size (Not used, but there as a checksum if needed in the future)
Byte 9: Number of Bytes in the Decode Dictionary
Bytes 10 – n: Decode Dictionary
Between the Decode Dictionary and the actual data I add the characters “BCD” (which stands for Binary Coded Data). This lets me know where the dictionary ends and the actual coded data begins. It helped during debugging and I figure it’d help anyone else out there as well while working with this routine
Q: What’s with the essay? Just give me the code!
A: Fine!
Seriously though, the only reason I’m doing such a long write-up on the code is to help people who are perhaps beginning to look into this sort of code for the first time and might have questions on why I did things a certain way. Understanding WHY the code was written helps understand how it operates.
So that’s the high and low of it!
Hope this helps someone out and if you have any questions, please feel free to leave a comment!
Cheers!
Huffman.zip – Download (3k)



#1 by Chris on July 2, 2009 - 1:09 PM
Do you have any interest in modifying this such that I can use it to generate a dictionary (which can be persisted to disk) and use that dictionary for all future compression/decompression. It guess for Huffman, it’s a tree and not a dictionary. If not, could you give me some tips on what changes would be necessary? Thanks.
#2 by Chris on July 2, 2009 - 1:12 PM
Also, since the tree would be stored separately, I would not want it to be a part of the output stream.
#3 by Chris on July 3, 2009 - 8:54 AM
Why are you calling BuildTree from so many places? Looks redundant.
#4 by Chris on July 3, 2009 - 9:15 AM
Decode() should be static.
#5 by Chris on July 3, 2009 - 11:46 AM
I made the changes that I needed and emailed them to you. I’m wondering if the algorithm could be adapted to use a 16-bit tree instead of an 8-bit tree. Since I’m storing the tree separately now, I can afford a much larger tree if it means better compression.
#6 by Chris on July 5, 2009 - 9:00 AM
I was able to get the 16-bit version working. I had to do some optimizing in BuildTree, EncodeTree, and Decode. It does in fact yield better compression.
#7 by Chris on July 5, 2009 - 9:01 AM
Decode is around 100x faster now.
#8 by Chris on July 6, 2009 - 7:57 PM
I’ve made lots more changes since emailing you. 7.4 seconds for an encode/decode cycle on 100MB of data that is already in memory. Strangely, the 16-bit version runs even faster and gives better compression. But it’s still nothing compared to the big guys in terms of speed of compression.
#9 by Chris on July 6, 2009 - 8:01 PM
I meant “speed OR compression”
#10 by chris on February 6, 2011 - 12:07 PM
I am trying to get back into programming. Could you share your faster version of the C# Huffman class. I am a novice programmer and am learning how to optimize my code. Would you mind sending me your project? This will be solely for educational purposes for me–and it will help me learn how to code at a higher plane.
Thank you kindly in advance.
Chris
#11 by Danny on May 30, 2010 - 11:08 PM
Hey man, where do I have to write the path of the file I want to be compressed? Where can I see these files?
#12 by Danny on May 30, 2010 - 11:09 PM
By the way, Thanks for the source code!
#13 by Jame on September 24, 2010 - 8:39 PM
Chris could you share your work like Eric has? Sounds like it could save me time.
Thanks Eric!
#14 by myb on November 6, 2010 - 4:21 AM
nice blog it does help me a lot.. i may need your help for my project but no one as alive to do a retort for my question.
#15 by hangman2525 on November 28, 2010 - 8:45 AM
thk very mush
very useful source code
#16 by hardik shah on February 24, 2011 - 4:03 AM
Can you send me example using this Huffman.cs class??
Thanks
#17 by person on May 29, 2011 - 11:55 AM
Super lame question! I have encoded my string, how do I decode?? If I convert to bytes it doesnt work??
#18 by Yosy on July 8, 2011 - 5:35 PM
Hi,
Why you are building the tree and encoding it twice?
first time in -
public byte[] Encode(out string OutputLog)
you build and encode and the you call to Encode()
and the second time is in -
public byte[] Encode()
BTW,I really like your code – he is well documented,apply the KISS and MIF(Make it fast) rules and it well readable-code.
Thanks,
really helped to understand huffman coding.
#19 by Bongo on October 14, 2011 - 12:00 AM
Is this freeware/public domain?
#20 by eric on October 14, 2011 - 6:04 AM
Yes. This code is free to use/improve upon.
Enjoy!
#21 by George on October 23, 2011 - 11:51 AM
Hey, can you post or point to Chris’s 16 bit version?
#22 by Michael Brown on October 23, 2011 - 2:34 PM
There is a bug in the Encode(out string OutputLog) function.
The calls to BuildTree(); and EncodeTree(); should be removed as it’s being called twice. This results in garbage data that cant be decoded.
#23 by Michael Brown on October 23, 2011 - 2:39 PM
In addition, the Decode function should return a string if the input is a string. Simply add System.Text.ASCIIEncoding.UTF8.GetString(bDecodedOutput) as the return from Decode.
Good code overall, still trying to understand all of it. There are a few optimizations that can be made, would be nice if we could post the changes from the commenter above.
#24 by Chris on December 9, 2011 - 8:23 PM
I just noticed that my comments were being replied-to. I guess you don’t get an email when that happens. Anyway, I posted my code at http://stackoverflow.com/questions/1072242/lossless-compression-in-small-blocks-with-precomputed-dictionary