Single-Pass In-Memory Indexing

This project was completed for the Fall 2018 offering of COMP 479 (Information Retrieval) at Concordia University. The goal was to analyze Reuters documents spread across several files: the program tokenizes each document, then builds an index that maps every term to its postings list.
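The tokenize-then-index flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`tokenize`, `spimi_invert`, `merge_blocks`, `build_index`) are hypothetical, the tokenizer is a simple regular expression, and blocks are kept in memory rather than written to disk as a full SPIMI implementation would do.

```python
import re
from collections import defaultdict
from heapq import merge
from itertools import groupby, islice


def tokenize(text):
    """Split text into alphanumeric tokens (a deliberate simplification)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def spimi_invert(block):
    """Build a term-sorted index for one block of (doc_id, text) pairs."""
    index = defaultdict(list)
    for doc_id, text in block:
        for term in tokenize(text):
            postings = index[term]
            # Documents arrive in doc_id order, so only the tail needs checking.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return sorted(index.items())


def merge_blocks(blocks):
    """Merge term-sorted block indexes into one term -> postings dictionary."""
    merged = {}
    by_term = lambda item: item[0]
    for term, group in groupby(merge(*blocks, key=by_term), key=by_term):
        postings = set()
        for _, block_postings in group:
            postings.update(block_postings)
        merged[term] = sorted(postings)
    return merged


def build_index(documents, docs_per_block=500):
    """Split documents into fixed-size blocks, invert each one, then merge."""
    iterator = iter(documents)
    blocks = []
    while True:
        block = list(islice(iterator, docs_per_block))
        if not block:
            break
        blocks.append(spimi_invert(block))
    return merge_blocks(blocks)


docs = [(1, "oil prices rise"), (2, "oil exports fall"), (3, "prices fall")]
index = build_index(docs, docs_per_block=2)
print(index["oil"])   # [1, 2]
print(index["fall"])  # [2, 3]
```

Each block is inverted independently, so memory use is bounded by the block size; the final merge only needs the blocks to be sorted by term.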

The Reuters files can be downloaded here. If they are not already in the root directory of the project, the program downloads them automatically when it runs.

Getting Started

Prerequisites

The following Python packages are required to run the program:

Click here for the specific versions of the packages used for this project.

Or just run it with Docker.

Docker

I also included a Dockerfile to make it easier to run on any machine. First, make sure you cd into this repository.

To build the image and start up a container:

docker image build -t spimi .
docker container run -it --name spimi-demo spimi bash

This opens an interactive Bash terminal inside the container, from which you can run the script. Add the --rm option to the run command to remove the container automatically when you exit.

Running

The file to run, main.py, is in the src/ directory.

python3 main.py [-d DOCS_PER_BLOCK]
                [-r {1, 2, 3, ..., 22}]
                [-rs] [-s] [-c] [-rn]
                [-a]

optional arguments:
    -d, --docs                      number of documents per block (default 500)
    -r, --reuters                   number of Reuters files to parse (1-22) (default 22)
    -rs, --remove-stopwords         remove stopwords from the index
    -s, --stem                      stem terms in the index
    -c, --case-folding              reduce terms in the index to lowercase
    -rn, --remove-numbers           remove numbers from the index
    -a, --all                       use options -rs, -s, -c, and -rn
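To make the effect of these flags concrete, here is a rough sketch of how they might be parsed and applied to a list of tokens. It is an assumption-laden illustration, not the code in main.py: `build_parser`, `parse_options`, `normalize`, and the tiny `STOPWORDS` set are all hypothetical, and stemming is left out because it needs a third-party stemmer.

```python
import argparse

# Tiny illustrative stopword list (hypothetical); a real run would use a larger one.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def build_parser():
    """Build a parser that mirrors the options listed above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-d", "--docs", type=int, default=500,
                        metavar="DOCS_PER_BLOCK",
                        help="number of documents per block")
    parser.add_argument("-r", "--reuters", type=int, choices=range(1, 23),
                        default=22, metavar="N",
                        help="number of Reuters files to parse (1-22)")
    parser.add_argument("-rs", "--remove-stopwords", action="store_true")
    parser.add_argument("-s", "--stem", action="store_true")
    parser.add_argument("-c", "--case-folding", action="store_true")
    parser.add_argument("-rn", "--remove-numbers", action="store_true")
    parser.add_argument("-a", "--all", action="store_true")
    return parser


def parse_options(argv=None):
    """Parse arguments and expand -a into the four individual options."""
    args = build_parser().parse_args(argv)
    if args.all:
        args.remove_stopwords = True
        args.stem = True
        args.case_folding = True
        args.remove_numbers = True
    return args


def normalize(tokens, options):
    """Apply the selected filters to a list of tokens (stemming omitted)."""
    terms = []
    for token in tokens:
        # Treat tokens made of digits, commas, and periods as numbers.
        digits = token.replace(",", "").replace(".", "")
        if options.remove_numbers and digits.isdigit():
            continue
        if options.case_folding:
            token = token.lower()
        if options.remove_stopwords and token.lower() in STOPWORDS:
            continue
        terms.append(token)
    return terms


options = parse_options(["-c", "-rn", "-rs"])
print(normalize(["The", "Bank", "of", "Canada", "1987"], options))
# ['bank', 'canada']
```

Each filter shrinks the dictionary in a different way, so running the program with different flag combinations is a simple way to compare index sizes.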

Generated files will appear in the root directory of the repository.

Author

  • Vartan Benohanian - ID: 27492049

Report

A project report describing the SPIMI implementation in more detail is available here.

A second report, covering the Okapi BM ranking function, can be viewed here.

The Expectations of Originality form is available here.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
