Search Engine (LTC LNC Algorithm)

Implementation Search Engine with LTC-LNC algorithm and Cosine Similarity developed with Python and Flask framework

Author: M-Taghizadeh
Website: m-taghizadeh.ir

Quick Start

LTC LNC ALgorithm Steps
Screen Shots
Trace Algorithm
Dependencies
Routes

LTC-LNC-Algorithm-Steps

Step1: Tokenize query terms and remove extera space and strip query

query_terms = tokenizer(query)

def tokenizer(input_str:str):
    input_str.strip()
    split_list = input_str.split(" ")
    
    tokens_list = []
    for token in split_list:
        if len(list(token)) > 1:
            tokens_list.append(token)

    return tokens_list

Step2: Read all docs and save all doc title in documents variable

documents = [f for f in os.listdir(docs_path) if f.endswith('.txt')]

Step3: Load Postings List from database

POSTINGS_LIST = {}
try:
    with open(f'{postings_path}/POSTINGS_LIST.dat', "rb") as f:
        POSTINGS_LIST = pickle.load(f)
except:
    print("we have an error for loading POSTINGS_LIST!!!")

Step4: Create Postings list for query terms and calculate N: number of documents and create set of query terms

Index_Table = {}
N = len(documents)
Index_Table["N"] = N
set_of_query_term = []

Step4-A: Calculate [tf] for query term in [Query]

for term in query_terms:

    if Index_Table.get(term):
        Index_Table[term]["query"]["tf"] += 1
    else:
        Index_Table[term] = {}
        Index_Table[term]["query"] = {}
        Index_Table[term]["query"]["tf"] = 1
        set_of_query_term.append(term)

Step4-B: Calculate [tf] for query term in [all Documents] from POSTINGS_LIST

for term in query_terms:
    Index_Table[term]["document"] = {}
    for doc_title in documents:
        Index_Table[term]["document"][doc_title] = {}
        try:
            Index_Table[term]["document"][doc_title]["tf"] = POSTINGS_LIST[term][doc_title]
        except:
            Index_Table[term]["document"][doc_title]["tf"] = 0

Step4-C: Calculate [tf_wt] for query term in [Query and all Documents]

for term in query_terms:
    tf_query = Index_Table[term]["query"]["tf"]
    Index_Table[term]["query"]["tf_wt"] = 1 + math.log10(tf_query)

    try:
        for doc_title in Index_Table.get(term).get("document"):
            tf_document = Index_Table[term]["document"][doc_title]["tf"]
            if tf_document == 0:
                Index_Table[term]["document"][doc_title]["tf_wt"] = 0
            else:
                Index_Table[term]["document"][doc_title]["tf_wt"] = 1 + math.log10(tf_document)
    except: print("error")

Step4-D: Calculate [df, idf, wt] for query term in [Query] => idf = log10(N/df), wt = tf_wt * idf

for term in query_terms:
    if POSTINGS_LIST.get(term):
        df = len(POSTINGS_LIST[term])
    else:
        df = 0

    Index_Table[term]["query"]["df"] = df
    if df == 0:
        Index_Table[term]["query"]["idf"] = 0
    else:
        Index_Table[term]["query"]["idf"] = math.log10(N/df)

    Index_Table[term]["query"]["wt"] = Index_Table[term]["query"]["tf_wt"] * Index_Table[term]["query"]["idf"]

Step4-E: Calculate [wt] for query term in [all Documents] => idf = 1, wt = tf_wt * idf(==1)

for term in query_terms:
    for doc_title in documents:
        Index_Table[term]["document"][doc_title]["wt"] = Index_Table[term]["document"][doc_title]["tf_wt"]

End of Calculate Index_Table

print("------------------------------")    
print("Index_Table: ", Index_Table)
print("------------------------------")

Step5: Normalization of Query and goto vector space for [wt]:

Step5-A: calculate sum of wt for Query

query_vector_normal = []
sum = 0
for term in set_of_query_term:
    sum += math.pow(Index_Table[term]["query"]["wt"], 2)
sum = math.sqrt(sum)
print("sum of wt for Query: ", sum)

Step5-B: calculate query_vector_normal with sum of wt

for term in set_of_query_term:
    wt = Index_Table[term]["query"]["wt"]
    if sum != 0:
        query_vector_normal.append(wt / sum)
    else:
        query_vector_normal.append(0)
print("\n\nquery_vector_normal: ", query_vector_normal, "\n\n")

Step6: Normalization of all Documents and goto vector space for [wt] and save that in docs_vector_normal(as a dict)

docs_vector_normal = {}
for doc_title in documents:
    docs_vector_normal[doc_title] = []

    # Step6-A: calculate sum of wt for any Document
    sum = 0
    for term in set_of_query_term:
        sum += math.pow(Index_Table[term]["document"][doc_title]["wt"], 2)
    sum = math.sqrt(sum)
    print("sum_wt: ", sum, " doc title: ", doc_title)
    
    # Step6-B: calculate docs_vector_normal[doc_title] with sum of wt for any document
    for term in set_of_query_term:
        wt = Index_Table[term]["document"][doc_title]["wt"]
        if sum != 0:
            docs_vector_normal[doc_title].append(wt / sum)
        else:
            docs_vector_normal[doc_title].append(0)
print("\n\ndocs_vector_normal: ", docs_vector_normal, "\n\n")

Step7: Cosine Similarity => Score list

Spte7-A: Dot Product between Query vector normal and any Documents vector normal and save that in Score list[]

Scores_list = []
for doc_title in docs_vector_normal:
    doc_vector = docs_vector_normal[doc_title]
    score = 0
    for i in range(0, len(doc_vector)):
        score += doc_vector[i] * query_vector_normal[i]
    Scores_list.append([doc_title, score])
print("Scores_list: ", Scores_list)

Step8: Sorting Scores_list and get top k doc (get rankings)

min = 0
for i in range(0, len(Scores_list)):
    for j in range(0, len(Scores_list)-i-1):
        if Scores_list[j][1] < Scores_list[j+1][1]:
            tmp_Scores_list = Scores_list[j]
            Scores_list[j] = Scores_list[j+1]
            Scores_list[j+1] = tmp_Scores_list
print("\n\nSorted Scores_list: ", Scores_list, "\n\n")

Step9: get top k record from Sorted Score List (At the moment, we only return files that have a high score of zero)

k = N # we suppose that k == number of docs
TOP_k_docs = []
for doc in Scores_list:
    if doc[1] != 0: # doc[1] is score of this doc
        TOP_k_docs.append(doc)
    
    if(len(TOP_k_docs) == k):
        break
    
return TOP_k_docs

Screen-Shots

Search Engine Index Page Screen
Add or Edit Documents
Search Results Sorted with Score
Creating Posting list according to user query in each time
Posting list structure
Creating normal vector for query and all of documents for calculating Cosine Similarity

Trace-Algorithm

Routes

Endpoint	Methods	Rule
add_doc	GET, POST	/add_document/
delete_doc	GET	/delete_document/string:doc_title/
edit_doc	GET	/edit_document/string:doc_title/
index	GET	/
static	GET	/static/path:filename

Dependencies

flask => webframework
python-dotenv => for reading .env files

m-taghizadeh / ltclnc-algorithm Goto Github PK

ltclnc-algorithm's Introduction

Search Engine (LTC LNC Algorithm)

Quick Start

LTC-LNC-Algorithm-Steps

Screen-Shots

Trace-Algorithm

Routes

Dependencies

ltclnc-algorithm's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent