
Huffman Code

Dan Le

COMP 363 001 - Fall 2024 - Irakliotis

Assignment

Create a demo application that compresses and decompresses a message using Huffman encoding. Compute the approximate efficiency of the compression method.

Huffman Code

A Huffman code is a method of constructing optimal prefix codes, commonly used for lossless compression. While it is rarely used alone in modern compression methods, it forms the basis for many of them. It works by producing a variable-length code table for the source symbols from the "weights" (frequencies) of the symbols present in the source data, which allows common symbols to be represented with fewer bits than uncommon symbols.
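As a concrete illustration of the construction described above, the sketch below builds such a code table in Python (the repository's actual language and structure may differ): repeatedly merge the two lightest subtrees, prefixing one side's codes with `0` and the other's with `1`.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-code table from the symbol weights (frequencies) in text."""
    freq = Counter(text)
    # Heap entries are (weight, tiebreaker, {symbol: code_so_far});
    # the tiebreaker keeps dicts from ever being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol input: give it a 1-bit code
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # lightest subtree
        w2, _, t2 = heapq.heappop(heap)  # second lightest
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
```

For this input the most common symbol `a` receives a 1-bit code and the rarer `b` and `c` receive 2-bit codes, as expected.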

This algorithm is "greedy": with a priority queue it runs in $O(n \log n)$ time, or $O(n)$ if the symbols are already sorted by weight. While construction is fast, the resulting compression is often not worthwhile on its own. Larger documents tend to have more consistent outcomes, but because of the entropy of the data, smaller inputs show a wide range of outcomes, with Huffman coding being negative (the output plus the code table exceeds the input) or far from optimal in many cases. This is one of the reasons Huffman coding is combined with other algorithms today.

Analysis

Huffman coding minimizes the expected (weighted) code length

$$L(C(W)) = \sum_{i=1}^{n} w_{i}\,\operatorname{length}(c_{i})$$

where $w_i$ is the weight of the $i$-th source symbol and $c_i$ is its codeword.
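A small worked instance of this formula, in Python (the weights and codewords below are illustrative, not taken from the repository):

```python
# Expected code length L(C(W)) = sum_i w_i * length(c_i), with normalized weights.
weights = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
codes   = {"a": "0", "b": "10", "c": "110", "d": "111"}  # a valid Huffman code here

L = sum(w * len(codes[s]) for s, w in weights.items())
# 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits per symbol
```

Because these weights are exact powers of two, $L$ equals the entropy of the distribution, i.e. the code is as good as any lossless code can be for this source.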

Implementation

The implementation in this project is a demo version: it writes ASCII `'0'` and `'1'` characters instead of packing real bits, to give an idea of how the algorithm functions and of its efficiency.
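A minimal round-trip sketch of this ASCII-character approach, assuming a fixed example code table (the repository's function names and table format may differ):

```python
def encode(text, codes):
    """Concatenate codewords as ASCII '0'/'1' characters (demo, not real bits)."""
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    """Walk the '0'/'1' string, emitting a symbol whenever a codeword matches."""
    inverse = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # the code is prefix-free, so the first match is correct
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

codes = {"a": "0", "b": "10", "c": "11"}  # example prefix code
msg = "abacab"
assert decode(encode(msg, codes), codes) == msg
```

Because every codeword is a string of `'0'`/`'1'` characters, each "bit" actually occupies a full byte; this is what makes the project a demonstration of the algorithm rather than a real compressor.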

Compression & Efficiency

Using several different types of text, ranging from short poems to collections of works, as well as texts in another language, a general graph of the implementation's efficiency could be made by assuming each plain-text character has a length of 8 bits and each dictionary entry a length of 16 bits.
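The accounting just described can be sketched as a small Python helper (the function name and the example code table are hypothetical, but the 8-bit/16-bit assumptions match the text):

```python
def compression_ratio(text, codes):
    """Compressed/plain size ratio under the stated accounting:
    8 bits per plain-text character, 16 bits per dictionary entry."""
    plain_bits = 8 * len(text)                      # uncompressed size
    body_bits = sum(len(codes[ch]) for ch in text)  # encoded message
    table_bits = 16 * len(codes)                    # dictionary overhead
    return (body_bits + table_bits) / plain_bits

codes = {"a": "0", "b": "10", "c": "11"}  # example prefix code
ratio = compression_ratio("aaaabbc" * 100, codes)
```

Note that on a long input the ratio falls well below 1, while on a very short input (try `"abc"`) the fixed table cost pushes it above 1, which is exactly the small-input failure discussed below.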

We can see that the compression ratio stayed in the range of $40\%$ to $60\%$, with a slow rise as the number of characters rose. This was predicted, as longer texts have more opportunities to contain less-common characters.

We can also see that the compression efficiency of the implementation started off poorly, but became positive somewhere between 500 and 4,000 characters. Text with a natural variety of characters, such as stories, but with heavily skewed frequencies (frequent vowels, for instance) will often see a positive efficiency. Text lacking this skew, such as an input of hundreds of distinct symbols appearing about equally often, will have extremely poor efficiency.

We are asked the question

"Also, can you imagine/propose a way to ensure $r_\text{a} < 1 $ even for very small values of $m$?"

Huffman coding fails to perform on small data because of the size of the code table relative to the input. The simplest idea is to map repeated groupings of characters in the text to symbols of their own, so that variable-length sequences, rather than single characters, become the units being encoded. This idea is similar to how the LZ77 compression method works, which slides a window over the input to create back-references to these groupings of characters. This may improve performance, as it reduces the size of the code table in many situations.
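The grouping idea can be illustrated with a single byte-pair-style merge step in Python; this is a deliberate simplification of LZ77's sliding window, included only to show how repeated pairs collapse into composite symbols before Huffman coding is applied:

```python
from collections import Counter

def merge_most_common_pair(symbols):
    """One merge step: replace the most frequent adjacent pair with a
    single composite symbol, shrinking the input fed to Huffman coding."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)  # composite symbol replaces the pair
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

merge_most_common_pair(list("ababab"))  # → ['ab', 'ab', 'ab']
```

Repeating this step turns common groupings into single high-weight symbols, which shortens the message Huffman has to encode, at the cost of growing the alphabet; a real scheme (like LZ77's references) avoids that cost by not storing the groupings in the table at all.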

LZ77, and compression methods built on it such as DEFLATE and LZ4, are known to perform consistently well even on small inputs.
