ibd_assigment1's Introduction

Introduction to Big Data - Assignment 1

This repository contains the submission files of our group for the 1st Assignment of the Introduction to Big Data course.

The master branch contains the default model implementation; the bm25 branch contains an implementation based on the BM25 coefficient.

The build produces two jars, Indexer and Query, for indexing a text corpus and querying the search engine, respectively. Both are based on Hadoop's MapReduce and therefore require Hadoop to run.

Running Indexer Jar:

hadoop jar BM-Indexer-0.2.jar /EnWikiSmall bm-word-enumerator bm-doc-counter bm-indexer-out

Arguments to provide after the jar name, in order:

Path to directory containing text corpus
Directory for Word Enumerator output
Directory for Document Count output
Directory for AverageDocLength output
Directory for Indexer output
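
The document count and average document length outputs listed above correspond to the quantities that the standard BM25 formula needs. As a rough illustration only, here is a minimal Python sketch of the textbook Okapi BM25 score. It is not taken from this repository: the function names, the smoothed IDF variant, and the defaults k1 = 1.2 and b = 0.75 are all assumptions, and the actual jars may compute the coefficient differently.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Contribution of one query term to one document (textbook Okapi BM25).

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in words
    avg_doc_len -- average document length over the corpus
    n_docs      -- number of documents in the corpus
    doc_freq    -- number of documents that contain the term
    k1, b       -- assumed common defaults, not values from this repository
    """
    # Smoothed IDF: the "+ 1.0" inside the log keeps the value non-negative.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Length normalisation: documents longer than average are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, n_docs, doc_freqs):
    """Sum the per-term scores of all query terms for a single document."""
    return sum(
        bm25_term_score(
            doc_tf.get(term, 0), doc_len, avg_doc_len, n_docs, doc_freqs.get(term, 0)
        )
        for term in query_terms
    )
```

A term that does not occur in a document contributes zero, so only documents sharing at least one word with the query receive a positive score.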

Running Query Jar:

hadoop jar BM-Query-0.1.jar word-enumerator-out bm-indexer-out bm-av-doc-count bm-query-out "cats top" 5

Arguments to provide after the jar name, in order:

Directory for WordEnumerator output
Directory for Indexer output
Directory for AverageDocLength output
Directory for Query output
Query text
Number of most relevant documents to find
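
The last argument limits the result to the most relevant documents. One common way to make that selection, sketched here in Python under the assumption that relevance scores are already available as a mapping from document ID to score, is a heap-based top-k. The helper name `top_k` and the sample data are hypothetical, and the Query jar may select its results in a different way.

```python
import heapq


def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical relevance scores for three documents.
example_scores = {"doc_a": 0.2, "doc_b": 1.5, "doc_c": 0.9}
print(top_k(example_scores, 2))  # [('doc_b', 1.5), ('doc_c', 0.9)]
```

If fewer than k documents have a score, all of them are returned in descending order.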
