This repository contains our group's submission for the first assignment of the Introduction to Big Data course.
The master branch contains the default model implementation; the bm25 branch contains an implementation based on the BM25 coefficient.
Building the project produces two jars: Indexer, which indexes a text corpus, and Query, which queries the search engine. Both are based on Hadoop's MapReduce and therefore require Hadoop to run.
hadoop jar BM-Indexer-0.2.jar /EnWikiSmall bm-word-enumerator bm-doc-counter bm-indexer-out
The arguments to provide after the jar name are:
- Path to the directory containing the text corpus
- Directory for Word Enumerator output
- Directory for Document Count output
- Directory for AverageDocLength output
- Directory for Indexer output
hadoop jar BM-Query-0.1.jar word-enumerator-out bm-indexer-out bm-av-doc-count bm-query-out "cats top" 5
The arguments to provide after the jar name are:
- Directory for WordEnumerator output
- Directory for Indexer output
- Directory for AverageDocLength output
- Directory for Query output
- Query text
- Number of most relevant documents to find
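
For orientation, the ranking step that the Query job performs on the bm25 branch can be sketched outside Hadoop. The snippet below is a minimal, single-machine illustration of the standard Okapi BM25 formula, not the code in this repository. The function names `bm25_scores` and `top_n`, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are all assumptions made for the example; the actual MapReduce implementation may tokenize, smooth, and parameterize differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return a BM25 score per document.

    `docs` maps a document name to its text. `k1` and `b` are common
    default values, assumed here; they are not taken from the repository.
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Average document length, the role played by the AverageDocLength output.
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # a term absent from the document contributes nothing
            df = doc_freq[term]
            # Smoothed inverse document frequency, always non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturated by k1 and normalized by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[name] = score
    return scores


def top_n(query, docs, n):
    """Return the names of the n highest-scoring documents, best first."""
    scores = bm25_scores(query, docs)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Calling `top_n("cats top", docs, 5)` on a small dictionary of documents mirrors the last two arguments of the Query command above: the query text and the number of most relevant documents to return. The sketch assumes `docs` is non-empty.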