Giter Site home page Giter Site logo

search-engine's Introduction

Information Retrieval Project- Authors_Search_Engine

Authors Search Engine using ElasticSearch and Python

Getting Start

Environment settting up Instructions

  • Download and Install the ElasticSearch
  • Install the ICU_Tokenizer plugin on the ElasticSearch
  • Install the python3 with pip3
  • Install the python packages in the requirements.txt by pip install -r requirements.txt

Search Engine running Instructions

  1. First start the ElasticSearch locally on port 9200.
  2. Then run index_creation.py file to create the index and insert data.
  3. Next run the main.py to start the search engine
  4. Then visit http://localhost:5000/ for see the user interface.
  5. Finally add your search query in the search box for searching.

Project Structure

  • data - this folder contain authors datasheet and its json file, and python files of data cleaning
  • templates - which includes HTML file of user interface
  • documents - Folder contains project proposal & project report
  • images - Folder contains diagrams used in README.md
  • indexGeneration.py - Python file for index creating and data inserting
  • main.py - main python file which handles API request of queries
  • queryRequestJsonCreation.py - Elastic Search queries
  • requirements.txt - txt file which has python dependancies
  • searchCreation.py - Python file which process search request data

Authors Data

cleaned_authors.json file contains 103 Authors details with the following data.

  1. name - Author name in Sinhala
  2. name_english - Author name in English
  3. date_of_birth - date of birth of Author
  4. birth_place - Birth place of Author in Sinhala
  5. birth_place_english - Birth place of Author in English
  6. education - Educated places of Author
  7. languages - Lanugages is used to write
  8. categories - areas covered by writing
  9. list_of_books - books of Author
  10. description - Brief description of Author

Basic Functionalities

  • search query can be includes in any fields out of author_name, author_name_english, date_of_birth, birth_place, birth_place_english, languages, education, categories, list_of_books

eg: මහාචාර්ය එදිරිවීර සරච්චන්ද්‍ර, Arthur C. Clark, සිංහල, විද්‍යාප්‍රබන්ධ, Koggala

  • there are synonym keyword lists to be identified special word tokens from search query. search will be customized according to priority of each field. whereas high weights will be given for identified fields by perticular keywords

eg: "මාර්ටින් වික්‍රමසිංහ රචකයා ගේ විස්තර" - word 'රචකයා' relates with AuthorName field. then authorName will be given high weight

  • Search Engine supports bilingusl queries.

eg: එදිරිවීර සරච්චන්ද්‍ර රචකයා, author sarachchandra, author එදිරිවීර සරච්චන්ද්‍

  • Search engine supports multiple field searches in search string

eg: "මළගිය ඇත්තෝ පොත රචනා කළ කතුවරයා" - 'පොත' relates with listOfBooks field, 'කතුවරයා' relates with AuthorName

User Interface

  • Following figure shows the example search result of the UI. Search Example of UI

Project Architecture

Following figure shows how the search engine works through the flask server.

Search Engine Architecture

Indexing & Query techniques

Indexing

  • 'ICU_Tokenizer’ which is a standard tokenizer and which has better support for sinhala language to tokenize text into the words.
  • Elastic search ‘edge_ngram’ filter was used to generate n-grams.

MultiSearch with Rule Base classification

  • Rule-based text mining is used to understand and extract data from the user entered query string.

  • A basic set of rules are applied to each search phrase to identify the keywords, classify them into relevant search types. According to the classification user query was classified to one of the following query type.

    1. Faceted query with Cross Fields
    2. Faceted query with phrase-prefix

Follwing diagram further shows the use of Rule Based Classification and Multisearch queries.

MultiSearch with Rule Base classification

search-engine's People

Contributors

rashmikalakshan avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.