<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>All Things IT Blog &#187; Data Compression</title>
	<atom:link href="http://www.enusbaum.com/blog/tag/data-compression/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.enusbaum.com/blog</link>
	<description>My little nerded out corner of the Internets!</description>
	<lastBuildDate>Tue, 18 Oct 2011 20:22:58 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Example Huffman Compression Routine in C#</title>
		<link>http://www.enusbaum.com/blog/2009/05/example-huffman-compression-routine-in-c/</link>
		<comments>http://www.enusbaum.com/blog/2009/05/example-huffman-compression-routine-in-c/#comments</comments>
		<pubDate>Fri, 22 May 2009 21:17:04 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[C# Programming]]></category>
		<category><![CDATA[General Programming]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Deflate]]></category>
		<category><![CDATA[GZip]]></category>
		<category><![CDATA[Huffman Coding]]></category>
		<category><![CDATA[Huffman Compression]]></category>

		<guid isPermaLink="false">http://www.enusbaum.com/blog/?p=285</guid>
		<description><![CDATA[This last week I decided to sit down and hash out a simple Huffman compression routine using C#. I&#8217;d never created a compression routine before from scratch (my past implementations were static for the sake of time savings), so I fleshed one out. I know that many examples exist elsewhere on the net&#8230;. but they [...]]]></description>
			<content:encoded><![CDATA[<p>This last week I decided to sit down and hash out a simple Huffman compression routine using C#. I&#8217;d never created a compression routine before from scratch (my past implementations were static for the sake of time savings), so I fleshed one out. I know that many examples exist elsewhere on the net&#8230;. but they all seemed overly complicated and up their own ass <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
<p>I had a couple of goals in mind while creating my routine:</p>
<p><strong>1. KEEP IT SIMPLE</strong> &#8211; A lot of routines out there WORK, but their code is more complicated than it needs to be. Mine should be a simple class that accepts input data, with simple public accessors that are easy to understand even for the novice developer (sorry folks, no asynchronous delegates). <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' />  Overcomplication also leads to slowness, which brings me to my next goal <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p><strong>2. MAKE IT FAST</strong> &#8211; When dealing with large amounts of data in C#, especially when running it through an algorithm, it&#8217;s all too easy to lean on the handy built-in virtual methods and other built-in tools that make coding easier at the cost of speed. Die-hard C++ developers will point to these routines as C#&#8217;s downfall as a legitimate language when it comes to data-intensive tasks.</p>
<p><span id="more-285"></span></p>
<p>The class I came up with is pretty simple. I use a generic List to store a collection of &#8220;Leaf&#8221; objects, each of which has several basic attributes that identify not only its value but also its place in the tree. This let me use the built-in methods of the List object (I know, for shame&#8230;. but it&#8217;s easier in this instance) by passing in anonymous delegates for searches and comparators. I&#8217;m not terribly worried about using these virtual methods here, only because creating and encoding the tree is usually the smallest task in the process <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
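<p>To make that a bit more concrete, here is a minimal sketch of the idea: a List of Leaf objects that gets sorted with an anonymous delegate and merged two at a time until only the root is left. The names (Leaf, HuffmanTreeSketch, Build) and the fields are placeholders picked for illustration, so don&#8217;t expect them to match the class in the download line for line.</p>
<pre><code>using System;
using System.Collections.Generic;

// Illustrative only: field and class names are assumptions.
class Leaf
{
    public byte Value;      // byte this leaf stands for (unused on internal nodes)
    public int Frequency;   // occurrences, or the sum of both children
    public Leaf Left;       // null on a real leaf
    public Leaf Right;      // null on a real leaf
}

static class HuffmanTreeSketch
{
    // Builds a Huffman tree for the data and returns its root (null for empty input).
    public static Leaf Build(byte[] data)
    {
        int[] counts = new int[256];
        foreach (byte b in data) counts[b]++;

        List&lt;Leaf&gt; leaves = new List&lt;Leaf&gt;();
        for (int i = 0; i &lt; 256; i++)
        {
            if (counts[i] &gt; 0)
                leaves.Add(new Leaf { Value = (byte)i, Frequency = counts[i] });
        }

        while (leaves.Count &gt; 1)
        {
            // Anonymous delegate as the comparator: lowest frequency first.
            leaves.Sort(delegate(Leaf a, Leaf b) { return a.Frequency.CompareTo(b.Frequency); });

            // Merge the two rarest nodes under a new parent.
            Leaf parent = new Leaf
            {
                Left = leaves[0],
                Right = leaves[1],
                Frequency = leaves[0].Frequency + leaves[1].Frequency
            };
            leaves.RemoveRange(0, 2);
            leaves.Add(parent);
        }

        return leaves.Count == 1 ? leaves[0] : null;
    }
}</code></pre>
<p>Re-sorting the whole list on every pass is not the fastest way to do it, but with at most 256 distinct byte values the tree build stays tiny next to the encoding work.</p>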
<p>The encoding and decoding is where I decided to focus on optimizations, since this is where the BULK of the work is done. The .NET Framework has several methods that make working with binary data easy. You can use the Convert.ToString() method, which accepts a base argument and so lets you convert any character to its binary representation. My original implementation used that method and the end result was embarrassingly slow <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
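<p>For reference, the string-based approach described above looks roughly like this. It is a sketch of the general technique rather than my original code, and the SlowBitsSketch and ToBits names are made up for the example.</p>
<pre><code>using System;

// Illustrative only: shows the string-based conversion, not the final routine.
static class SlowBitsSketch
{
    // Returns the 8-character binary text for one byte.
    public static string ToBits(byte value)
    {
        // Convert.ToString(value, 2) drops leading zeros, so pad back out to 8 digits.
        return Convert.ToString(value, 2).PadLeft(8, '0');
    }
}</code></pre>
<p>Calling ToBits((byte)'A') gives the string 01000001. Every call builds at least one new string, and doing that once per input byte is a likely reason this style of code crawls on large inputs.</p>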
<p>I went back to the drawing board and thought to myself, &#8220;If I had to re-write this in C++, how would I handle the encoding?&#8221; Duh, I&#8217;d be using bitwise operators up the wazoo! <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>After some recoding and pulling my hair out for a couple of hours, I was able to re-write the routine using bit operations and it works! On top of all that, it&#8217;s fast as all get out! My current benchmarks had it encoding a 1MB data set in under 1 second with ~50% compression. Not too shabby! Of course, compression ratios will vary with the input: the more unevenly the byte values are distributed, the better Huffman coding does.</p>
<p>Now I&#8217;m sure you have some questions on your mind, and I&#8217;ll try to address them now:</p>
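<p>Here is a minimal sketch of what packing codes with bit operations can look like. BitWriterSketch and its members are hypothetical names for this example, and the routine in the download is organized differently, but the shifting and OR-ing is the same basic trick.</p>
<pre><code>using System;
using System.Collections.Generic;

// Illustrative only: a tiny bit packer, not the class from the download.
class BitWriterSketch
{
    private readonly List&lt;byte&gt; output = new List&lt;byte&gt;();
    private int current;   // bits collected so far for the pending byte
    private int used;      // how many bits of 'current' are filled (0 to 7)

    // Appends the lowest 'length' bits of 'code', most significant bit first.
    public void Write(uint code, int length)
    {
        for (int i = length - 1; i &gt;= 0; i--)
        {
            int bit = (int)((code &gt;&gt; i) &amp; 1);
            current = (current &lt;&lt; 1) | bit;
            used++;
            if (used == 8)
            {
                output.Add((byte)current);
                current = 0;
                used = 0;
            }
        }
    }

    // Returns all complete bytes plus the last partial byte padded with zero bits.
    public byte[] ToArray()
    {
        List&lt;byte&gt; result = new List&lt;byte&gt;(output);
        if (used &gt; 0)
            result.Add((byte)(current &lt;&lt; (8 - used)));
        return result.ToArray();
    }
}</code></pre>
<p>Writing the 3-bit code 101 followed by the 2-bit code 11 (Write(5, 3) and then Write(3, 2)) yields a single byte, 10111000, which is 0xB8. No strings are created along the way, only integer shifts and ORs.</p>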
<p><strong>Q: Does it use a lot of memory?</strong></p>
<p><strong>A:</strong> You bet your sweet ass it does! <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  Seriously though, it&#8217;s only the way I set up the class that requires the memory. I establish an input buffer within the class that you can write the &#8216;raw&#8217; data to, which is then read from during encoding. In addition, during encoding I create an output buffer in memory where the &#8216;encoded&#8217; data is written. So it stands to reason that if you&#8217;re encoding 100MB of data, this routine can easily gobble up 200MB of RAM or more. The rule of thumb I found was that File Size * 4 would be the memory requirement. There are optimizations you can make that would lower the memory footprint (for example, building the frequency table without buffering and then reading the input 1 byte at a time, say from a file), but I felt that would overcomplicate the solution and make it too focused on one specific instance. The current implementation is kept general for a reason <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
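<p>The lower-memory idea mentioned above could start out something like this: count byte frequencies straight from a stream instead of from a big in-memory buffer. This is only a sketch under that assumption, and FrequencySketch and Count are names invented for it.</p>
<pre><code>using System;
using System.IO;

// Illustrative only: builds the frequency table without holding the input in memory.
static class FrequencySketch
{
    // Reads the stream one byte at a time and counts each byte value.
    public static long[] Count(Stream input)
    {
        long[] counts = new long[256];
        int b;
        while ((b = input.ReadByte()) != -1)   // ReadByte returns -1 at end of stream
        {
            counts[b]++;
        }
        return counts;
    }
}</code></pre>
<p>With a file you would open it with File.OpenRead, call Count, then seek back to the start and read it a second time for the actual encoding, so the trade is extra I/O for less RAM.</p>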
<p><strong>Q: Could this be done faster in C++?</strong></p>
<p><strong>A:</strong> Yes, probably&#8230;. but not much faster. Although the code is written in C#, at run time the IL is JIT-compiled to native x86 code. The bit operations we&#8217;re using should compile to the same instructions as they would in a C++ routine (XOR is XOR, I don&#8217;t care what language you&#8217;re using). In addition, the encoding of the data itself only uses primitive native types, which limits any cross-language differences. In fact, you can paste the encode and decode routines into C++ and they work! <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Your only speed improvement might be in the generation of the tree itself&#8230; but even then, that&#8217;s super small overhead when compared with the amount of data you&#8217;re probably compressing.</p>
<p>Of course, all of that applies to the encoding (compression) side of the house. The decompression routine is pretty slow (about 3 seconds per 1MB of decompressed data) and could probably use some more optimizations.</p>
<p><strong>Q: Is this any better than using the built in GZip or Deflate classes available in System.IO.Compression?</strong></p>
<p><strong>A:</strong> It&#8217;s not even close to being in competition with those routines <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' />  This is really just a proof of concept for BASIC Huffman Coding, which doesn&#8217;t take into account the more advanced features of modern compression routines, such as pattern matching (Deflate, for example, pairs LZ77-style matching of repeated sequences with Huffman coding). This routine is also slower (especially in decompression), so I wouldn&#8217;t make anything like this your #1 choice for a compression routine if others are available <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  So use this for educational purposes only.</p>
<p><strong>Q: What version of the .NET Framework will this work with?</strong></p>
<p><strong>A:</strong> The code here was written in Visual Studio 2008 targeting .NET 3.5. I use some C# 3.0 language features (such as object initializers), but nothing that would make conversion difficult. I avoided LINQ only because I&#8217;m not entirely sold on the idea and I still like using anonymous delegates. <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  This code could be converted to Framework 2.0 with minor changes and possibly Framework 1.1, but that would require more effort since 1.1 has neither generics nor anonymous delegates.</p>
<p><strong>Q: What is the format used to store the compressed data?</strong></p>
<p><strong>A:</strong> I encode the decompression information within the final output stream. The output format is like this:</p>
<p>Bytes 0 &#8211; 8: Final Output Size (not used yet, but there as a sanity check if needed in the future)</p>
<p>Byte 9: Number of Bytes in the Decode Dictionary</p>
<p>Bytes 10 &#8211; n: Decode Dictionary</p>
<p>Between the Decode Dictionary and the actual data I add the characters &#8220;BCD&#8221; (which stands for <strong>B</strong>inary <strong>C</strong>oded <strong>D</strong>ata). This lets me know where the dictionary ends and the actual coded data begins. It helped during debugging and I figure it&#8217;d help anyone else out there as well while working with this routine <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
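<p>Going only by the layout above, a reader could locate the start of the coded data like this. Treat it as a sketch: it assumes byte 9 is literally the dictionary&#8217;s length in bytes and that the dictionary starts at byte 10, it does not try to parse the dictionary itself, and HeaderSketch and FindDataStart are made-up names.</p>
<pre><code>using System;

// Illustrative only: based on the layout described in this post, not on the download's code.
static class HeaderSketch
{
    // Returns the offset of the first coded byte, or -1 if "BCD" is not where the layout predicts.
    public static int FindDataStart(byte[] buffer)
    {
        const int DictionaryLengthOffset = 9;   // assumption: byte 9 holds the dictionary byte count
        const int DictionaryOffset = 10;        // assumption: dictionary begins at byte 10

        if (buffer == null || buffer.Length &lt;= DictionaryLengthOffset)
            return -1;

        int marker = DictionaryOffset + buffer[DictionaryLengthOffset];
        if (marker + 3 &gt; buffer.Length)
            return -1;

        bool found = buffer[marker] == (byte)'B'
                  &amp;&amp; buffer[marker + 1] == (byte)'C'
                  &amp;&amp; buffer[marker + 2] == (byte)'D';

        return found ? marker + 3 : -1;
    }
}</code></pre>
<p>Checking for the marker at the expected offset, rather than scanning for the first B, C, D run, avoids being fooled by those three bytes turning up inside the dictionary.</p>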
<p><strong>Q: What&#8217;s with the essay? Just give me the code!</strong></p>
<p><strong>A:</strong> Fine! <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  Seriously though, the only reason I&#8217;m doing such a long write-up on the code is to help people who are perhaps looking into this sort of code for the first time and might have questions about why I did things a certain way. Understanding WHY the code was written the way it was helps you understand how it operates.</p>
<p>So that&#8217;s the high and low of it! <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Hope this helps someone out and if you have any questions, please feel free to leave a comment!</p>
<p>Cheers! <img src='http://www.enusbaum.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><strong>Huffman.zip</strong> &#8211; <a title="Huffman Coding in C#" href="http://www.enusbaum.com/blog/wp-content/uploads/Huffman.zip">Download</a> (3k)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.enusbaum.com/blog/2009/05/example-huffman-compression-routine-in-c/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
	</channel>
</rss>

