Step-by-Step Instructions
Huffman coding is applied in the transmission of text files and fax messages to encode and compress data [1]. It is a statistical coding method that prioritizes symbols according to how often they occur. In the DMS above, the output is generated according to the probabilities assigned to the letters of the alphabet. In conventional computing, every character is allocated the same amount of space regardless of how often it appears, whereas in modern data transmission the codeword lengths vary, unlike in ASCII [2].
Characters that occur more frequently receive shorter codewords, while less frequent characters receive longer ones; in this problem, the character frequency is defined by the probability of each event [3]. David Huffman devised an efficient algorithm for constructing such a coding tree, ensuring better compression of data for transmission purposes.
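As a quick illustration of the gain, the following minimal sketch (variable names are illustrative, not part of the assignment code) compares the fixed-length cost for eight symbols with the entropy lower bound implied by the probabilities used in this problem:
% Fixed-length coding of 8 symbols needs ceil(log2(8)) = 3 bits per symbol,
% while the source entropy bounds what a variable-length code can achieve.
p = [0.25 0.1 0.01 0.05 0.25 0.04 0.25 0.05];   % probabilities of a..h
fixed_len = ceil(log2(length(p)));              % bits per symbol, fixed-length
entropy_bits = -sum(p .* log2(p));              % lower bound on the average length
fprintf('Fixed length: %d bits/symbol, entropy bound: %.2f bits/symbol\n', ...
    fixed_len, entropy_bits);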
Step 1: Review the code or DMS output to determine the probabilities assigned to each character.
Event | Probability
a | 0.25
b | 0.10
c | 0.01
d | 0.05
e | 0.25
f | 0.04
g | 0.25
h | 0.05
clc;
clear all;
close all;
% Define the character string
my_str = 'abcdefgh';
% Manually define the probability distribution
pbs_dist = [0.25 0.1 0.01 0.05 0.25 0.04 0.25 0.05];
% Number of bits needed by a fixed-length code for this alphabet
num_bits = ceil(log2(length(pbs_dist)))
disp('Character Probability:');
for i = 1:length(pbs_dist)
    display(strcat(my_str(i), ' -> ', num2str(pbs_dist(i))));
end
total = sum(pbs_dist)
% Store each character in its own cell for later manipulation
for i = 1:length(my_str)
    Pngd_str{i} = my_str(i);
end
The output is obtained as,
Step 2: The characters are sorted, or prioritized, on the basis of their occurrence in the text. The occurrence in this case is determined by the probability of the event. Therefore,
Event | c | f | d | h | b | a | e | g
Probability | 0.01 | 0.04 | 0.05 | 0.05 | 0.10 | 0.25 | 0.25 | 0.25
% Save initial set of symbols and probabilities for later use
init_str = Pngd_str;
init_pbs = pbs_dist;
Pngd_pbs = pbs_dist;
rear = 1;
while (length(Pngd_pbs) > 1)
    % Sort probabilities in ascending order
    [Pngd_pbs, ndx] = sort(Pngd_pbs, 'ascend');
    % Sort symbols based on the same index
    Pngd_str = Pngd_str(ndx);
    % Create a new compound symbol from the two least probable symbols
    new_node = strcat(Pngd_str(2), Pngd_str(1));
    new_pbs = sum(Pngd_pbs(1:2));
    % Dequeue the used symbols from the "old" queue
    Pngd_str = Pngd_str(3:length(Pngd_str));
    Pngd_pbs = Pngd_pbs(3:length(Pngd_pbs));
    % Add the new symbol back to the "old" queue
    Pngd_str = [Pngd_str, new_node];
    Pngd_pbs = [Pngd_pbs, new_pbs];
    % Add the new symbol to the "new" queue
    newq_str(rear) = new_node;
    newq_pbs(rear) = new_pbs;
    rear = rear + 1;
end
Step 3: Now one can build the Huffman coding tree from the prioritized list in Step 2. At each step a new node is created by merging the two least probable nodes [4].
[Figure: Huffman coding tree construction. Starting from the sorted list c (0.01), f (0.04), d (0.05), h (0.05), b (0.10), a (0.25), e (0.25), g (0.25), the two least probable nodes are merged repeatedly until a single root node of probability 1 remains.]
tree = [newq_str,init_str];
tree_pbs = [newq_pbs, init_pbs];
Step 4: Reading the tree from left to right, the codewords are assigned. The implementation in MATLAB is as follows:
% Sort all tree elements in descending order of probability
[Pngd_tree_pbs, ndx] = sort(tree_pbs, 'descend');
Pngd_tree = tree(ndx);
parent(1) = 0;
num_children = 2;
for i = 2:length(Pngd_tree)
    % Extract my symbol
    me = Pngd_tree{i};
    % Find my parent's symbol (search until the shortest match is found)
    count = 1;
    parent_maybe = Pngd_tree{i-count};
    diff = strfind(parent_maybe, me);
    while (isempty(diff))
        count = count + 1;
        parent_maybe = Pngd_tree{i-count};
        diff = strfind(parent_maybe, me);
    end
    parent(i) = i - count;
end
treeplot(parent);
title(strcat('Huffman Coding Tree - "', my_str, '"'));
display(Pngd_tree)
display(Pngd_tree_pbs)
[xs, ys, h, s] = treelayout(parent);
text(xs, ys, Pngd_tree);
for i = 2:length(Pngd_tree)
    % Get my coordinate
    my_x = xs(i);
    my_y = ys(i);
    % Get parent coordinate
    parent_x = xs(parent(i));
    parent_y = ys(parent(i));
    % Calculate weight coordinate (midpoint of the edge)
    mid_x = (my_x + parent_x)/2;
    mid_y = (my_y + parent_y)/2;
    % Calculate weight (positive slope = 1, negative = 0)
    slope = (parent_y - my_y)/(parent_x - my_x);
    if (slope > 0)
        weight(i) = 1;
    else
        weight(i) = 0;
    end
    text(mid_x, mid_y, num2str(weight(i)));
end
for i = 1:length(Pngd_tree)
    % Initialize code
    code{i} = '';
    % Loop until the root is found
    index = i;
    p = parent(index);
    while (p ~= 0)
        % Turn weight into code symbol
        w = num2str(weight(index));
        % Concatenate code symbol
        code{i} = strcat(w, code{i});
        % Continue towards the root
        index = parent(index);
        p = parent(index);
    end
end
codeBook = [Pngd_tree', code']
grid on
To obtain the average codeword length of the Huffman code [5]:
Solution
The average codeword length is given as $\bar{L} = \sum_i p_i l_i$, where $p_i$ is the probability of symbol $i$ and $l_i$ is the length of its codeword.
From the codeword set obtained in the solution, the average codeword length follows from the probabilities and codeword lengths tabulated below; it works out to approximately 2.55 bits per symbol.
Therefore,
Event | Probability | Code | Length
a | 0.25 | 11 | 2
b | 0.10 | 0111 | 4
c | 0.01 | 01100 | 5
d | 0.05 | 0101 | 4
e | 0.25 | 10 | 2
f | 0.04 | 01101 | 5
g | 0.25 | 00 | 2
h | 0.05 | 0100 | 4
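As a short check of the arithmetic, the following sketch takes the probabilities and codeword lengths from the table above (variable names are illustrative):
p   = [0.25 0.10 0.01 0.05 0.25 0.04 0.25 0.05];   % probabilities of a..h
len = [2 4 5 4 2 5 2 4];                            % Huffman codeword lengths from the table
avg_len = sum(p .* len)                             % average codeword length, about 2.55 bits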
The entropy of the source is obtained by modelling the source as a random process with the associated probabilities [6]. The self-information of a source output $x_i$ is its information content, $I(x_i) = -\log_2 p(x_i)$.
Consider a source alphabet $\{x_1, x_2, \ldots, x_K\}$ with respective probabilities $p_1, p_2, \ldots, p_K$.
Letting $K$ be the number of source outputs provided by the DMS, the average self-information is given as $H(X) = \sum_{i=1}^{K} p_i I(x_i) = -\sum_{i=1}^{K} p_i \log_2 p_i$.
This average information per source output is the entropy of the source, which measures the uncertainty of the source. The relationship between uncertainty and entropy can be illustrated using two probabilities [7].
pbs_dist = [0.25 0.1 0.01 0.05 0.25 0.04 0.25 0.05];
%% To compute the source entropy for the content given
clc
xv = 0;
for f = 1:length(pbs_dist)
    % Contribution of each symbol, in bits (log base 2)
    Hz = -pbs_dist(f)*log2(pbs_dist(f));
    disp(Hz)
    xv = Hz + xv;
end
disp('The source entropy of the codeword is given as')
disp(xv) % source entropy in bits per symbol
The solution is given as $H(X) \approx 2.52$ bits per symbol.
Comparing the average codeword length with the source entropy gives the coding efficiency, $\eta = H(X)/\bar{L} = 2.52/2.55 \approx 0.99$.
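The same comparison in MATLAB, as a minimal self-contained sketch reusing the values above:
p   = [0.25 0.10 0.01 0.05 0.25 0.04 0.25 0.05];   % probabilities of a..h
len = [2 4 5 4 2 5 2 4];                            % Huffman codeword lengths
H    = -sum(p .* log2(p));                          % source entropy, about 2.52 bits
Lbar = sum(p .* len);                               % average codeword length, about 2.55 bits
eta  = H / Lbar                                     % coding efficiency of the Huffman code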
QUESTION 2 (50 points)
For the DMS in question 1, we now consider Shannon encoding.
- Design a Shannon code and sketch the corresponding coding tree. Specify the codewords for the letters in the alphabet.
- Determine the average codeword length of the Shannon code.
- Calculate the entropy of the source and compare it with the average codeword length as calculated in (ii)
Shannon coding is intended for noiseless channels: the code is lossless and the channel is assumed to be noiseless. According to Shannon's source coding theorem, the average codeword length of a source of symbols can at best equal the source entropy and can never be less than that [8]. Under Shannon's noiseless coding theorem, Huffman coding is optimal for a fixed alphabet size, subject to the constraint that the source symbols are coded one at a time [9]. The coding algorithm is implemented as follows.
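For reference, a minimal sketch of the classic Shannon construction (distinct from the tree-based code used in the steps below; variable names are illustrative): each symbol receives a codeword of length ceil(-log2 p) taken from the binary expansion of the cumulative probability.
% Classic Shannon coding: sort probabilities in decreasing order, set
% l_i = ceil(-log2 p_i), and take the first l_i bits of the binary expansion
% of the cumulative probability before symbol i.
p = [0.25 0.1 0.01 0.05 0.25 0.04 0.25 0.05];   % probabilities of a..h
chars = 'abcdefgh';
[p_sorted, idx] = sort(p, 'descend');
F = [0 cumsum(p_sorted(1:end-1))];              % cumulative probability before each symbol
for i = 1:length(p_sorted)
    li = ceil(-log2(p_sorted(i)));              % Shannon codeword length
    cw = '';
    frac = F(i);
    for b = 1:li                                % take li bits of the binary expansion of F(i)
        frac = 2*frac;
        bit = floor(frac);
        cw = [cw, num2str(bit)];
        frac = frac - bit;
    end
    fprintf('%c (p = %.2f): %s\n', chars(idx(i)), p_sorted(i), cw);
end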
Step 1: The events are arranged in ascending order of their probabilities,
Event | Probability
c | 0.01
f | 0.04
d | 0.05
h | 0.05
b | 0.10
a | 0.25
e | 0.25
g | 0.25
%% Define the inception parameters
my_str = 'abcdefgh'; % characters
% Probability distribution over the characters
h = [0.25 0.1 0.01 0.05 0.25 0.04 0.25 0.05];
m = length(h)
% Note: cell_set, cell_sub and plot_hufftree below are assumed to be helper
% functions from an external toolbox; they are not built-in MATLAB functions.
T = cell(0); % defines an empty cell array
for i = 1:m
    T = cell_set(T, i, i);
end
Step 2: The two lowest-probability symbols are merged into a single compound symbol that replaces them in the next source reduction. The algorithm performs an iterative grouping of events based on their probabilities.
%% The algorithm in this case is slightly different: the tree is formed from
% the lowest probabilities, which are merged together progressively in an
% iterative manner.
% Step 1: probability at the inception
p = h;
% Merge the leading probabilities in order of decay
while length(p) > 1
    % Perform sorting (reverse is assumed to be a toolbox helper; fliplr
    % would serve the same purpose for row vectors)
    [v, I] = sort(p);
    if v(1) > v(length(v))
        v = reverse(v);
        I = reverse(I);
    end
    q = sum(v(1:2));
    t = cell_sub(T, I(1:2));
    % Simplify the tree slightly by removing the two merged entries
    T = cell_sub(T, I(3:length(I)));
    p = v(3:length(v));
    % Add the new compound event alongside its probability
    p(length(p)+1) = q;
    T = cell_set(T, length(p), t);
end
%% graphical illustration of the binary coded tree
clf;
plot_hufftree(T)
Based on the output,
Event | Probability | Assigned Code | Code Length
a | 0.25 | 0 | 1
f | 0.04 | 10 | 2
c | 0.01 | 11 | 2
h | 0.05 | 110 | 3
d | 0.05 | 1110 | 4
b | 0.10 | 1111 | 4
g | 0.25 | 11110 | 5
e | 0.25 | 111110 | 6
/* Auxiliary C routine: iteratively adjusts p until its binary entropy matches
   the target value. ENTROPY(p) is assumed here to be the binary entropy
   function; the macro below is supplied to make the snippet self-contained. */
#include <math.h>
#define ENTROPY(p) ( -(p)*log2(p) - (1.0-(p))*log2(1.0-(p)) )
double entropy2probability( double entropy )
{
    if (entropy > 1.0) return 0.0;
    double d, p = 0.5;
    double f = entropy * 0.75;
    do {
        d = ENTROPY(p) - entropy;   /* residual between current and target entropy */
        p += d * f;                 /* damped update towards the target */
    } while (fabs(d) > 0.000000001);
    return p;
}
Let $l_i$ be the codeword length of symbol $x_i$. The average codeword length of the Shannon code is $\bar{L} = \sum_i p_i l_i$.
It is given as 3.49 bits per event.
The Shannon entropy [10] is given as $H(X) = -\sum_i p_i \log_2 p_i$.
It is given as 2.52 bits.
The code efficiency is given as $\eta = H(X)/\bar{L} = 2.52/3.49 \approx 0.72$.
The Shannon code therefore achieves a coding efficiency of about 72 percent [11]. The aim of a digital communication system is a coding efficiency as close to unity as possible; a value of unity denotes transmission without any losses or noise in the channel. The efficiency should, however, not fall below 0.5, as that would indicate a poor coding technique; in such a case the data transmission is ineffective and the quality of service (QoS) of the communication system is rated as very poor [12].
References
[1] T. Lee and H. Park, "Design and Implementation of a Static Huffman Encoding Hardware Using a Parallel Shifting Algorithm," IEEE, vol. 51, 2004.
[2] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 2009.
[3] U. Abrahams, Code and Parse Trees for Lossless Source Encoding, 2008.
[4] M. B. Baer, "Redundancy-Related Bounds for Generalized Huffman Codes," IEEE, 2011.
[5] C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, pp. 34-89, 2011.
[6] L. Y. Liu, W. L. G, R. Wang and L. Y, "Data Structure of Huffman Codes and its Application to Efficient Encoding and Decoding," IEEE, 2005.
[7] T. M. Cover, "On the Competitive Optimality of Huffman Codes," IEEE Transactions on Information Theory, 2001.
[8] J. Ji-Han, C. Chin-Chen and C. Tung-Shou, An Efficient Huffman Decoding Method Based on Pattern Partition and Look-up Table, 2009.
[9] R. Gailly, "Performance Analysis of Huffman Coding Algorithm," IJARCSSE, vol. 3, no. 10, pp. 25-90, 2013.
[10] A. Rabia, S. Adeel and K. Danista, "Performance Comparison of Huffman Coding and Double Huffman Coding," Innovative Computing Technology (InTech), vol. 6, pp. 361-364, 2016.
[11] R. Capocelli, R. Giancarlo and I. Taneja, "Bounds on the Redundancy of Huffman Codes," IEEE Transactions on Information Theory, vol. 32, no. 6, pp. 854-957, 2016.
[12] P. M. Fenwick, "Huffman Code Efficiencies for Extensions of Sources," IEEE Transactions on Communications, vol. 43, no. 2/3/4, pp. 34-76, 2005.