
Lizmotors Mobility Assignment_2

This repository contains the material required for the 2nd assignment of Lizmotors Mobility for the role of AI/ML Engineering Intern.


Initial Understanding of the Assignment

Building a basic RAG (Retrieval-Augmented Generation) or vector search system for an EV company called Canoo. RAG in simple terms: a language model (LLM) is connected to a datastore by adding extra data to a vector database (DB). The prompt is then crafted so that the LLM considers not only the original input but also the most relevant content retrieved from the vector DB.

Basic RAG / Vector Search Architecture

Data Extraction -> Chunks -> Vector Embeddings -> Vector Database 

Retrieval: User Query -> Vector Embedding -> Similarity search against the Vector Database -> Result (summarized text, keywords, etc.)

The result from the retrieval stage is then synthesized with the LLM.

[Figure: Basic RAG / vector search architecture diagram]
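
To make the architecture concrete, here is a minimal, self-contained sketch of the indexing and retrieval steps. It is an illustration only, not code from this repository: the hashed bag-of-words embed() is a toy stand-in for a real embedding model, and an in-memory list plays the role of the vector database.

```python
# Minimal vector-search sketch: chunks -> embeddings -> "vector DB" -> similarity search.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashed bag-of-words embedding, just to keep the sketch runnable.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_index(chunks: list[str]) -> list[tuple[str, np.ndarray]]:
    # Data Extraction -> Chunks -> Vector Embeddings -> Vector Database (in-memory here).
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, index: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    # User Query -> Vector Embedding -> similarity search -> top-k chunks.
    q = embed(query)
    scored = sorted(index, key=lambda item: float(np.dot(q, item[1])), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

if __name__ == "__main__":
    index = build_index([
        "Canoo makes electric vehicles.",
        "Canoo competitors include other EV makers.",
    ])
    context = retrieve("Who are Canoo's competitors?", index)
    # The retrieved chunks would be pasted into the LLM prompt together with the
    # original question so the model can ground its answer in them.
    print(context)
```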

Task Given: Extraction Part of RAG Architecture

1) Based on 4 queries, find relevant web links for each query using internet search APIs
2) Scrape relevant data from those web links and store it as CSV files

Based on the similarity or dissimilarity between the data of the 4 queries, the decision to create a single CSV or 4 separate CSVs will be made.

My Approach

  1. Place the 4 queries in a list.

  2. For each query in the list, extract 10 web links using the DuckDuckGo API and store them in a text file containing the links and their respective query.

  3. Read the text file and extract the non-link and link parts into separate lists. Topic = the non-link part, i.e. the queries.

  4. Scrape each link using a combination of Selenium and BeautifulSoup, extracting the text of the p and span tags.

  5. Use the Gemini API to extract the information relevant to the respective topic from the scraped text, in a clean and clear format. This reduces the data-cleaning effort.

  6. Store the data in a CSV with the following structure; the Information column is in JSON format. A hedged sketch of the full pipeline follows the note below.

    Query / Topic | url | Information

Note: The CSV files contain the information extracted for the respective query from each respective URL, so some NaN results are expected.
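
The following is a hedged, end-to-end sketch of the extraction pipeline described above (search -> scrape -> Gemini -> CSV). It mirrors the listed steps but is not the repository's exact code: the query strings, the Gemini model name, the output file name, and the API key placeholder are assumptions, and the intermediate text file from steps 2-3 is skipped for brevity. It assumes the usual PyPI packages duckduckgo_search, selenium, beautifulsoup4 and google-generativeai.

```python
import csv
import json
from duckduckgo_search import DDGS
from selenium import webdriver
from bs4 import BeautifulSoup
import google.generativeai as genai

# Placeholder queries (the assignment's actual 4 queries are not reproduced here).
QUERIES = [
    "Canoo industry overview",
    "Canoo main competitors",
    "Canoo market size and growth trends",
    "Canoo financial performance",
]

def collect_links(query: str, n: int = 10) -> list[str]:
    # Step 2: up to n result links per query via the DuckDuckGo search API.
    with DDGS() as ddgs:
        return [r["href"] for r in ddgs.text(query, max_results=n)]

def scrape_text(url: str, driver: webdriver.Chrome) -> str:
    # Step 4: render the page with Selenium, then pull p/span text with BeautifulSoup.
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return " ".join(tag.get_text(" ", strip=True) for tag in soup.find_all(["p", "span"]))

def extract_with_gemini(topic: str, text: str) -> str:
    # Step 5: ask Gemini to keep only the information relevant to the topic, as JSON.
    model = genai.GenerativeModel("gemini-pro")  # model name is an assumption
    prompt = (
        f"From the text below, extract information relevant to '{topic}' as a JSON object. "
        f"Return an empty JSON object if nothing is relevant.\n\n{text[:15000]}"
    )
    return model.generate_content(prompt).text

def main() -> None:
    genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder, not a real key
    driver = webdriver.Chrome()
    with open("extracted_information.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Query / Topic", "url", "Information"])  # Step 6
        for query in QUERIES:
            for url in collect_links(query):
                try:
                    info = extract_with_gemini(query, scrape_text(url, driver))
                except Exception:
                    info = json.dumps({})  # failed pages become empty/NaN rows
                writer.writerow([query, url, info])
    driver.quit()

if __name__ == "__main__":
    main()
```

Pages that fail to load or yield nothing relevant simply produce empty Information cells, which is where the NaN values mentioned in the note come from.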


Dependencies

  • duckduckgo API
  • csv
  • google.api_core.exceptions
  • google.generativeai (imported as genai)
  • BeautifulSoup
  • Selenium
