- Saagar Deshpande (sdeshpande@college, [email protected])
- Irene Chen (iychen@college)
We implemented serial and parallel versions of the Sequitur compression algorithm. The Sequitur algorithm uses hierarchical structure and sequences of discrete symbols to compress files by exploiting repetative structures found in strings.
Our final project is based on the algorithm as described in the following paper: "Identifying Hierarchical Structure in Sequences: A linear-time algorithm", Craig G. Nevill-Manning, Ian H. Witten, Journal of Artificial Intelligence Research 7 (1997) 67.82
- MPI (mpi4py)
- Python (sys, csv)
The above packages should already be installed on the resonance.seas cluster for CS 205. On the resonance node, module load courses/cs205/2012
will load these modules into the user environment automatically.
serial.py
: serial implementation of the Sequitur algorithmconcat_parallel.py
: simple parallel implementation of the Sequitur algorithm. Text is split among workers which run a version of the serial algorithm before the main process merges all results in order.wordcount.py
: parallel implementation based on the Sequitur concept. This method uses frequency of words in the provided text to determine rules to create before compressing.
The repository contains several test cases as well as runscripts to test the code.
Run serial.py
to see the results of the serial implementation.
Run qsub runscript
to run the concat_parallel.py
implementation.
Run qsub wcscript
to run the wordcount.py
implementation.
- Original Paper: Identifying Hierarchical Structure in Sequences: A linear-time algorithm
- Sequitur website
- Bjoern Andres - project mentor
- Cris Cecka, Hanspeter Pfister - for instructional material on MPI
- Michael Mitzenmacher - for introducing us to Sequitur and a general love of algorithms
- Craig G. Nevill-Manning, Ian H. Witten - for creating Sequitur