Giter Site home page Giter Site logo

m-taghizadeh / ltclnc-algorithm Goto Github PK

View Code? Open in Web Editor NEW
1.0 3.0 0.0 932 KB

Implementation Search Engine with LTC-LNC algorithm developed with Python and Flask framework

Home Page: http://m-taghizadeh.ir

Python 100.00%
search-engine flask search-engine-algorithm

ltclnc-algorithm's Introduction

Search Engine (LTC LNC Algorithm)

Implementation Search Engine with LTC-LNC algorithm and Cosine Similarity developed with Python and Flask framework

Author: M-Taghizadeh
Website: m-taghizadeh.ir


Quick Start

LTC-LNC-Algorithm-Steps

  • Step1: Tokenize query terms and remove extera space and strip query
query_terms = tokenizer(query)

def tokenizer(input_str:str):
    input_str.strip()
    split_list = input_str.split(" ")
    
    tokens_list = []
    for token in split_list:
        if len(list(token)) > 1:
            tokens_list.append(token)

    return tokens_list
  • Step2: Read all docs and save all doc title in documents variable
documents = [f for f in os.listdir(docs_path) if f.endswith('.txt')]
  • Step3: Load Postings List from database
POSTINGS_LIST = {}
try:
    with open(f'{postings_path}/POSTINGS_LIST.dat', "rb") as f:
        POSTINGS_LIST = pickle.load(f)
except:
    print("we have an error for loading POSTINGS_LIST!!!")
  • Step4: Create Postings list for query terms and calculate N: number of documents and create set of query terms

    Index_Table = {}
    N = len(documents)
    Index_Table["N"] = N
    set_of_query_term = []
    • Step4-A: Calculate [tf] for query term in [Query]
    for term in query_terms:
    
        if Index_Table.get(term):
            Index_Table[term]["query"]["tf"] += 1
        else:
            Index_Table[term] = {}
            Index_Table[term]["query"] = {}
            Index_Table[term]["query"]["tf"] = 1
            set_of_query_term.append(term)
    • Step4-B: Calculate [tf] for query term in [all Documents] from POSTINGS_LIST
    for term in query_terms:
        Index_Table[term]["document"] = {}
        for doc_title in documents:
            Index_Table[term]["document"][doc_title] = {}
            try:
                Index_Table[term]["document"][doc_title]["tf"] = POSTINGS_LIST[term][doc_title]
            except:
                Index_Table[term]["document"][doc_title]["tf"] = 0
    • Step4-C: Calculate [tf_wt] for query term in [Query and all Documents]
    for term in query_terms:
        tf_query = Index_Table[term]["query"]["tf"]
        Index_Table[term]["query"]["tf_wt"] = 1 + math.log10(tf_query)
    
        try:
            for doc_title in Index_Table.get(term).get("document"):
                tf_document = Index_Table[term]["document"][doc_title]["tf"]
                if tf_document == 0:
                    Index_Table[term]["document"][doc_title]["tf_wt"] = 0
                else:
                    Index_Table[term]["document"][doc_title]["tf_wt"] = 1 + math.log10(tf_document)
        except: print("error")
    • Step4-D: Calculate [df, idf, wt] for query term in [Query] => idf = log10(N/df), wt = tf_wt * idf
    for term in query_terms:
        if POSTINGS_LIST.get(term):
            df = len(POSTINGS_LIST[term])
        else:
            df = 0
    
        Index_Table[term]["query"]["df"] = df
        if df == 0:
            Index_Table[term]["query"]["idf"] = 0
        else:
            Index_Table[term]["query"]["idf"] = math.log10(N/df)
    
        Index_Table[term]["query"]["wt"] = Index_Table[term]["query"]["tf_wt"] * Index_Table[term]["query"]["idf"]
    • Step4-E: Calculate [wt] for query term in [all Documents] => idf = 1, wt = tf_wt * idf(==1)
    for term in query_terms:
        for doc_title in documents:
            Index_Table[term]["document"][doc_title]["wt"] = Index_Table[term]["document"][doc_title]["tf_wt"]
    • End of Calculate Index_Table
    print("------------------------------")    
    print("Index_Table: ", Index_Table)
    print("------------------------------")
  • Step5: Normalization of Query and goto vector space for [wt]:

    • Step5-A: calculate sum of wt for Query
    query_vector_normal = []
    sum = 0
    for term in set_of_query_term:
        sum += math.pow(Index_Table[term]["query"]["wt"], 2)
    sum = math.sqrt(sum)
    print("sum of wt for Query: ", sum)
    • Step5-B: calculate query_vector_normal with sum of wt
    for term in set_of_query_term:
        wt = Index_Table[term]["query"]["wt"]
        if sum != 0:
            query_vector_normal.append(wt / sum)
        else:
            query_vector_normal.append(0)
    print("\n\nquery_vector_normal: ", query_vector_normal, "\n\n")
  • Step6: Normalization of all Documents and goto vector space for [wt] and save that in docs_vector_normal(as a dict)

docs_vector_normal = {}
for doc_title in documents:
    docs_vector_normal[doc_title] = []

    # Step6-A: calculate sum of wt for any Document
    sum = 0
    for term in set_of_query_term:
        sum += math.pow(Index_Table[term]["document"][doc_title]["wt"], 2)
    sum = math.sqrt(sum)
    print("sum_wt: ", sum, " doc title: ", doc_title)
    
    # Step6-B: calculate docs_vector_normal[doc_title] with sum of wt for any document
    for term in set_of_query_term:
        wt = Index_Table[term]["document"][doc_title]["wt"]
        if sum != 0:
            docs_vector_normal[doc_title].append(wt / sum)
        else:
            docs_vector_normal[doc_title].append(0)
print("\n\ndocs_vector_normal: ", docs_vector_normal, "\n\n")
  • Step7: Cosine Similarity => Score list

    • Spte7-A: Dot Product between Query vector normal and any Documents vector normal and save that in Score list[]
    Scores_list = []
    for doc_title in docs_vector_normal:
        doc_vector = docs_vector_normal[doc_title]
        score = 0
        for i in range(0, len(doc_vector)):
            score += doc_vector[i] * query_vector_normal[i]
        Scores_list.append([doc_title, score])
    print("Scores_list: ", Scores_list)
  • Step8: Sorting Scores_list and get top k doc (get rankings)

min = 0
for i in range(0, len(Scores_list)):
    for j in range(0, len(Scores_list)-i-1):
        if Scores_list[j][1] < Scores_list[j+1][1]:
            tmp_Scores_list = Scores_list[j]
            Scores_list[j] = Scores_list[j+1]
            Scores_list[j+1] = tmp_Scores_list
print("\n\nSorted Scores_list: ", Scores_list, "\n\n")
  • Step9: get top k record from Sorted Score List (At the moment, we only return files that have a high score of zero)
k = N # we suppose that k == number of docs
TOP_k_docs = []
for doc in Scores_list:
    if doc[1] != 0: # doc[1] is score of this doc
        TOP_k_docs.append(doc)
    
    if(len(TOP_k_docs) == k):
        break
    
return TOP_k_docs

Screen-Shots

  1. Search Engine Index Page Screen

    index

  2. Add or Edit Documents

    add_doc

  3. Search Results Sorted with Score

    results

  4. Creating Posting list according to user query in each time

    results

  5. Posting list structure

    results

  6. Creating normal vector for query and all of documents for calculating Cosine Similarity

    results

Trace-Algorithm

trace1 trace2

Routes

Endpoint Methods Rule
add_doc GET, POST /add_document/
delete_doc GET /delete_document/string:doc_title/
edit_doc GET /edit_document/string:doc_title/
index GET /
static GET /static/path:filename

Dependencies

  • flask => webframework
  • python-dotenv => for reading .env files

ltclnc-algorithm's People

Contributors

m-taghizadeh avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.