asnjudy / information-retrieval-system

This project forked from kuberkaul/information-retrieval-system

This project is an information retrieval system: an ad-hoc search engine for the Cranfield collection (1,400 documents; see http://www.iva.dk/bh/core%20concepts%20in%20lis/articles%20a-z/test_collections.htm). The search engine ranks documents, retrieves a document's number, summary, and title, finds common words between documents, and supports other search functions.

information-retrieval-system's Introduction

--------------------------------README----------------------------------

Course :  CS 6998, Search Engine Technology
Project : HW 1 Building a basic information retrieval system

Name:
1. Kuber Kaul (UNI: kk2872)

List Of Files Submitted:

SearchEngine.py  -------------------- The Python program
External Libraries  --------------------------- NLTK for Python
The install, index, and query programs.
Results.py ---------------------------- Code for similar words.

How To Run The Program :	I have created an install file in the directory.
To run, type "make"; it should run the entire project.

Internal Design of the Project	:

1) I have divided the code into index.py and query.py, the two files essential for doing the work.

2) I used NLTK (Natural Language Toolkit) for Python to:
a. find similar words.
b. remove stop words.
c. stem words to their root form.

3) I used Python's pickle module to serialize data to a file in one program and load it back in another.
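
As an illustration of item 3, here is a minimal sketch of how an index could be pickled by one program and read back by another. The function names, the file name index.pkl, and the layout of the index (term -> {doc number: count}) are assumptions made for this example, not taken from the project's code.

```python
import os
import pickle
import tempfile


def save_index(index, path):
    """Serialize an in-memory index to a file (what an indexing step might do)."""
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Read a previously pickled index back (what a query step might do)."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Hypothetical index layout: term -> {document number: term count}.
    index = {"boundary": {1: 2, 7: 1}, "layer": {1: 1}}
    path = os.path.join(tempfile.mkdtemp(), "index.pkl")
    save_index(index, path)
    print(load_index(path) == index)
```

Only unpickle files that your own programs wrote, since pickle can execute arbitrary code while loading untrusted data.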

Query-Modification Method :	

a) Description of Parsing Algorithm
1) Approach : I used my own algorithm for fast parsing of the Cranfield data set and stored the DOC NO, TITLE, and CONTENT of each document with it.
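
The parsing algorithm itself is not described here, so the following is only a sketch of one way to do it. It assumes the usual Cranfield file layout, in which each document starts with a ".I <number>" line followed by ".T" (title), ".A" (author), ".B" (bibliography), and ".W" (text) markers on lines of their own. The function name, the dictionary keys, and the sample text are made up for illustration.

```python
import re

# Assumed field markers of the Cranfield file layout.
FIELD_MARKERS = {".T": "title", ".A": "author", ".B": "biblio", ".W": "content"}
DOC_START = re.compile(r"^\.I\s+(\d+)\s*$")


def parse_cranfield(text):
    """Split raw collection text into a list of per-document dictionaries."""
    docs = []
    current = None   # document being filled in
    field = None     # field that the following lines belong to
    for line in text.splitlines():
        start = DOC_START.match(line)
        if start:
            current = {"doc_no": int(start.group(1)), "title": "",
                       "author": "", "biblio": "", "content": ""}
            docs.append(current)
            field = None
        elif current is not None and line.strip() in FIELD_MARKERS:
            field = FIELD_MARKERS[line.strip()]
        elif current is not None and field is not None:
            # Join continuation lines of a field with single spaces.
            current[field] = (current[field] + " " + line.strip()).strip()
    return docs


# Made-up sample in the assumed layout (not real collection text).
SAMPLE = """.I 1
.T
first sample
title line two
.A
some author
.W
body text of the first document .
.I 2
.T
second sample title
.W
body text of the second document .
"""

if __name__ == "__main__":
    for doc in parse_cranfield(SAMPLE):
        print(doc["doc_no"], "|", doc["title"], "|", doc["content"])
```

A single pass over the lines keeps the parser linear in the size of the file, which fits the goal of fast parsing.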

Additional Information	:	


1) I used regex_tokenization to split the queries and the documents into lists of word tokens for easy parsing.
2) I decided not to stem the words. The collection is modest in size, and stemming would be counterproductive when searching for a specific term, because the term would be reduced to its root.
3) I did, however, remove stop words, which reduced the index to a manageable size for this document set. The stripped-down data is also faster to parse.
4) I have included the similarity feature in my code, so the search can also cover related words.
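
The points above (regex tokenization, no stemming, stop-word removal) can be sketched roughly as follows. This is not the project's code: it uses the standard-library re module and a tiny hard-coded stop list instead of NLTK so that it runs without any downloads, and the names tokenize, build_index, and common_words are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption); NLTK ships a much fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"})
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words (no stemming)."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: term -> {doc number: term count}."""
    index = defaultdict(dict)
    for doc_no, text in docs.items():
        for token in tokenize(text):
            index[token][doc_no] = index[token].get(doc_no, 0) + 1
    return dict(index)


def common_words(index, doc_a, doc_b):
    """Return the sorted terms that occur in both documents."""
    return sorted(term for term, postings in index.items()
                  if doc_a in postings and doc_b in postings)


if __name__ == "__main__":
    # Made-up documents for illustration.
    docs = {
        1: "The flow of air in the boundary layer",
        2: "Boundary layer flow on a flat plate",
    }
    index = build_index(docs)
    print(tokenize("the boundary layer of a flat plate"))
    print(common_words(index, 1, 2))
```

Because stop words never enter the index, it stays smaller, and because terms are not stemmed, a query for a specific word matches only that exact word.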
  



	
						
