The objective of this repository is to create an autoComplete API that given a user input query, provides suggested queries, based on the search history stored in a csv file and the number of times the queries were typed.
The project was implemented using two different methods. The programming language used is Java with the framework SpringBoot to build the api.
The first one consists in storing all the queries in a Map (text : count). Then, for each input query, we loop over the Map and determine all the queries that starts with the input. The result Map will be sorted according to the values of the count which allows extract the most typed queries. ( branch main of the code)
The second method consists in using a prefix tree (trie). After storing the different queries in a Map (text : count), we loop over the Map to insert each query sentence in the tree. The root node is empty and each node is defined by a hashMap children and an integer endOfSentence to tell whether we have already a complete sentence at this node or not. The insertion is done character by character. Once it is inserted, we set the endOfSentence attribute of the last node to the number of times the query was typed (count Value). Given an input query (prefix), the search is done letter by letter accross the tree which speeds up the search and allows you to eliminate certain branches each time you move forward. ( branch tree-search of the code)
N : max query length M : size of query list dataTime | Memory | |
---|---|---|
naive | O(M) | O(MN) |
tree (insert) | O(MN) | O(MN) |
tree (search) | O(N) | cte |
To get the request input in the url, i use the @GetMapping annotation.
Then, I compared the two methods performances by sending a certain number of requests and calculating each time the time to execute all of them. The simulation is done with python.
The experience is repeated using 5, 10, 100, 200, 500, 1000, 2000, 3000, 5000 Samples.
First method --> naive Second method --> Tree
The time is displayed by seconds.
Example with 5000 input query: time to execute all the queries:
-
first method (naive) : ~184s
-
Second method (tree) : ~20s
The second method may be improved by storing the different sentences in the tree decreasingly according to the count value which allows get the 10 most typed queries without having to loop over all the queries starting with the prefix input.
We can also define a new rule to prioritize and manage the case where our result list of suggested queries contains queries with the same number of time typed.
To implement an approximate suggestion logic by measuring the similarity between the input query and the queries stored and defining a threshold to determine whether the words can be considered as similar or not.
To start the api, clone the project and change the fileName in src/main/resources/application.properties with your csv file path then run the project.