
iamiqbal / hybrid-text-compression


Hybrid text compression pipeline using the Burrows-Wheeler Transform, Huffman encoding, run-length encoding, Lempel-Ziv-Welch compression and delta encoding

License: Apache License 2.0



Hybrid Text Compression


Introduction

Hybrid Text Compression (HTC) combines several lossless techniques to compress text as much as possible. It uses LZW, Huffman coding, run-length encoding and the Burrows-Wheeler Transform to compress different types of files.


Compression Algorithms


LZW

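LZW is a dictionary-based data compression technique created by Abraham Lempel, Jacob Ziv, and Terry Welch. It is an improved version of LZ78: the encoder builds a dictionary of previously seen substrings as it reads the input and emits dictionary codes instead of the substrings themselves.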
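The dictionary-building idea can be illustrated with a minimal sketch. The function names `lzwEncode` and `lzwDecode` are hypothetical and are not taken from this repository; the sketch assumes a 256-entry initial dictionary of single bytes and unbounded 32-bit codes, whereas a real implementation would normally pack codes into a fixed or growing bit width.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: encode a byte string into a sequence of LZW codes.
std::vector<std::uint32_t> lzwEncode(const std::string& input) {
    // Start with one dictionary entry per possible byte value.
    std::unordered_map<std::string, std::uint32_t> dict;
    for (int i = 0; i < 256; ++i) {
        dict[std::string(1, static_cast<char>(i))] = static_cast<std::uint32_t>(i);
    }
    std::uint32_t nextCode = 256;

    std::vector<std::uint32_t> codes;
    std::string current;
    for (char c : input) {
        std::string candidate = current + c;
        if (dict.count(candidate)) {
            current = candidate;              // keep extending the match
        } else {
            codes.push_back(dict[current]);   // emit the longest known match
            dict[candidate] = nextCode++;     // learn the new substring
            current = std::string(1, c);
        }
    }
    if (!current.empty()) codes.push_back(dict[current]);
    return codes;
}

// Hypothetical helper: rebuild the text, reconstructing the same dictionary.
std::string lzwDecode(const std::vector<std::uint32_t>& codes) {
    if (codes.empty()) return "";
    std::vector<std::string> dict(256);
    for (int i = 0; i < 256; ++i) dict[i] = std::string(1, static_cast<char>(i));

    std::string previous = dict[codes[0]];   // assumes a valid code stream
    std::string out = previous;
    for (std::size_t i = 1; i < codes.size(); ++i) {
        std::string entry;
        if (codes[i] < dict.size()) {
            entry = dict[codes[i]];
        } else {
            // The code refers to the entry being defined right now.
            entry = previous + previous[0];
        }
        out += entry;
        dict.push_back(previous + entry[0]);
        previous = entry;
    }
    return out;
}
```

With this sketch, `lzwEncode("TOBEORNOTTOBEORTOBEORNOT")` produces 16 codes for 24 input characters, and `lzwDecode` restores the original string.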


HUFFMAN Code

A Huffman code is a type of prefix code used for lossless data compression. It was developed by David A. Huffman and published in 1952. It is an entropy coding method: symbols that occur more frequently are assigned shorter codewords.
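As an illustration of how frequencies turn into a prefix code, here is a minimal sketch. The names `HuffNode` and `huffmanCodes` are hypothetical and not taken from this repository; codewords are returned as strings of '0' and '1' for readability, whereas an actual compressor would write raw bits.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node stored in a flat pool; left/right are pool indices.
struct HuffNode {
    std::size_t freq;
    char symbol;
    int left;   // -1 for a leaf
    int right;  // -1 for a leaf
};

// Hypothetical helper: build a Huffman code table for the given text.
std::map<char, std::string> huffmanCodes(const std::string& text) {
    std::map<char, std::size_t> freq;
    for (char c : text) ++freq[c];

    std::vector<HuffNode> pool;
    using Item = std::pair<std::size_t, int>;  // (frequency, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (const auto& kv : freq) {
        pool.push_back({kv.second, kv.first, -1, -1});
        heap.push({kv.second, static_cast<int>(pool.size()) - 1});
    }

    std::map<char, std::string> codes;
    if (pool.empty()) return codes;
    if (pool.size() == 1) {          // a single distinct symbol still needs one bit
        codes[pool[0].symbol] = "0";
        return codes;
    }

    // Repeatedly merge the two least frequent nodes.
    while (heap.size() > 1) {
        Item a = heap.top(); heap.pop();
        Item b = heap.top(); heap.pop();
        pool.push_back({a.first + b.first, '\0', a.second, b.second});
        heap.push({a.first + b.first, static_cast<int>(pool.size()) - 1});
    }

    // Walk the tree: left edges append '0', right edges append '1'.
    std::vector<std::pair<int, std::string>> stack;
    stack.emplace_back(heap.top().second, std::string());
    while (!stack.empty()) {
        std::pair<int, std::string> top = stack.back();
        stack.pop_back();
        const HuffNode& node = pool[top.first];
        if (node.left < 0) {
            codes[node.symbol] = top.second;
            continue;
        }
        stack.emplace_back(node.left, top.second + "0");
        stack.emplace_back(node.right, top.second + "1");
    }
    return codes;
}
```

For the input "aaaabbc", the most frequent symbol 'a' gets a 1-bit codeword and 'b' and 'c' get 2-bit codewords, so the text needs 10 bits instead of 56.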


Burrows Wheeler Transform

The BWT is a reversible block-sorting transform used as a preprocessing step for compression. It rearranges a string so that identical characters tend to group into runs, which later stages such as run-length encoding can exploit. It was invented by Michael Burrows and David Wheeler in 1994. It is also used in bioinformatics: in next-generation sequencing, DNA is fragmented into small pieces whose first few bases are sequenced, yielding several million reads, each 30 to 500 base pairs ("DNA characters") long.
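The transform and its inverse can be sketched as follows. The names `BwtResult`, `bwtEncode` and `bwtDecode` are hypothetical and not taken from this repository. This sketch sorts all rotations naively, which is fine for illustration but too slow for large blocks (production code typically uses a suffix array), and it stores the row index of the original string instead of appending an end-of-text marker.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical result type: last column of the sorted rotation matrix,
// plus the row at which the original string appears.
struct BwtResult {
    std::string lastColumn;
    std::size_t primaryIndex;
};

BwtResult bwtEncode(const std::string& s) {
    const std::size_t n = s.size();
    std::vector<std::size_t> rot(n);          // rot[i] = start offset of rotation i
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) {
            unsigned char ca = static_cast<unsigned char>(s[(a + k) % n]);
            unsigned char cb = static_cast<unsigned char>(s[(b + k) % n]);
            if (ca != cb) return ca < cb;
        }
        return false;                          // identical rotations
    });

    BwtResult result{std::string(n, '\0'), 0};
    for (std::size_t i = 0; i < n; ++i) {
        result.lastColumn[i] = s[(rot[i] + n - 1) % n];
        if (rot[i] == 0) result.primaryIndex = i;
    }
    return result;
}

std::string bwtDecode(const BwtResult& r) {
    const std::string& last = r.lastColumn;
    const std::size_t n = last.size();

    // start[c] = first row of the sorted first column that holds byte c.
    std::vector<std::size_t> count(256, 0), start(256, 0);
    for (unsigned char c : last) ++count[c];
    for (int c = 1; c < 256; ++c) start[c] = start[c - 1] + count[c - 1];

    // lf[i] = row whose rotation begins with the character last[i].
    std::vector<std::size_t> lf(n);
    for (std::size_t i = 0; i < n; ++i) {
        lf[i] = start[static_cast<unsigned char>(last[i])]++;
    }

    // Recover the text from its last character back to its first.
    std::string out(n, '\0');
    std::size_t row = r.primaryIndex;
    for (std::size_t k = n; k-- > 0;) {
        out[k] = last[row];
        row = lf[row];
    }
    return out;
}
```

For example, `bwtEncode("banana")` yields the last column "nnbaaa" with primary index 3; the three 'a' characters end up adjacent, which is the kind of run that a following run-length stage can shorten.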


Run Length Encoding

Run-length encoding is a form of lossless data compression in which runs of data (consecutive repeats of the same value) are stored as a single data value and a count, rather than as the original run.
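A minimal sketch of the value-and-count idea is shown below. The names `Run`, `rleEncode` and `rleDecode` are hypothetical and not taken from this repository, and the runs are kept as in-memory pairs rather than serialized to a file format.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: one (value, count) pair per run.
using Run = std::pair<char, std::size_t>;

std::vector<Run> rleEncode(const std::string& input) {
    std::vector<Run> runs;
    for (char c : input) {
        if (!runs.empty() && runs.back().first == c) {
            ++runs.back().second;     // extend the current run
        } else {
            runs.emplace_back(c, 1);  // start a new run
        }
    }
    return runs;
}

std::string rleDecode(const std::vector<Run>& runs) {
    std::string out;
    for (const Run& run : runs) {
        out.append(run.second, run.first);  // repeat the value `count` times
    }
    return out;
}
```

Encoding "aaabccdddd" gives four runs: ('a', 3), ('b', 1), ('c', 2), ('d', 4). Text with few repeats gains nothing from this step, which is one reason to place a transform that creates runs in front of it.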


Tools and Dependencies

  1. C++
  2. Python 2.x/3.x
  3. Matplotlib C++ API
  4. Boost for reading/writing binary data (in actual/raw bits)

Compiling

git clone https://github.com/IAMIQBAL/Hybrid-Text-Compression
cd Hybrid-Text-Compression
pacman -S boost
g++ -o main HybridCompressor.cpp
./main

Tests

We have written a test class (tests.cpp) that shows the compression ratio and the time taken on a scatter plot. The class uses Matplotlib's C++ library to draw the plot. The tests are as follows:

Note: Red = .json | Green = .txt | Blue = .xml, .html

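1. Test for LZW

[Scatter plot: LZW]

2. Test for Huffman

[Scatter plot: Huffman]

3. Test for RLE + LZW

[Scatter plot: RLE + LZW]

4. Test for BWT + RLE + LZW

[Scatter plot: BWT + RLE + LZW]

5. All Tests

[Scatter plot: all tests]
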
Note:

Text, JSON, HTML and other data formats were used for testing.
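The two quantities plotted by the tests can be measured along the following lines. This is only a sketch and not the code in tests.cpp: the names `Measurement` and `measure` are hypothetical, the compressor is passed in as a plain function from string to string, and the compression ratio is assumed to mean original size divided by compressed size.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical record of one data point for the scatter plot.
struct Measurement {
    double ratio;         // assumed definition: original size / compressed size
    double milliseconds;  // wall-clock time spent in the compressor
};

// Hypothetical helper: run one compressor on one input and time it.
Measurement measure(const std::string& input,
                    const std::function<std::string(const std::string&)>& compress) {
    const auto begin = std::chrono::steady_clock::now();
    const std::string packed = compress(input);
    const auto end = std::chrono::steady_clock::now();

    Measurement m;
    m.ratio = packed.empty()
                  ? 0.0
                  : static_cast<double>(input.size()) / static_cast<double>(packed.size());
    m.milliseconds = std::chrono::duration<double, std::milli>(end - begin).count();
    return m;
}
```

Under this definition a ratio above 1.0 means the output is smaller than the input. Calling `measure` once per input file and per pipeline would give the points for a plot of ratio against time.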
