lryanle / smare

Senior Design Repository for the Statefarm Automotive Fraud Project

Home Page: https://smare.lryanle.com

License: MIT License

Python 3.54% Dockerfile 0.08% JavaScript 0.78% TypeScript 15.97% CSS 0.35% HTML 0.06% Jupyter Notebook 79.21%
beautifulsoup css framer-motion git html javascript json mdx mongodb next-auth nextjs nodejs npm python react selenium shadcn-ui swr tailwindcss typescript

smare's Introduction

Statefarm SMARE



The Social Marketplace Automotive Risk Engine (SMARE) project aims to detect suspicious listings and potential instances of insurance fraud posted on popular social marketplace sites such as Craigslist and Facebook Marketplace. It is a partnership between State Farm and UTA CSE Senior Design. View the live deployment at smare.lryanle.com.

๐Ÿ” Table of Contents

Stack

  • frontend
    • Next.js: A React framework for building web applications with server-side rendering.
    • TypeScript: Typed superset of JavaScript that compiles to plain JavaScript.
    • shadcn/ui: A UI library for React, built using Tailwind CSS.
    • Tailwind CSS: Utility-first CSS framework for rapidly building custom designs.
    • Prisma: Next-generation ORM for Node.js and TypeScript.
    • NextAuth.js: Authentication for Next.js.
    • Framer Motion: A React library to power animations.
    • Lucide: Open-source icon library.
    • Recharts: A composable charting library built on React components.
    • Remark: A Markdown processor powered by plugins, part of the unified collective.
    • SWR: React Hooks library for data fetching.
    • zod: TypeScript-first schema validation with static type inference.
  • backend
    • Selenium: A suite of tools for automating web browsers.
    • BS4 (Beautiful Soup): A Python library for pulling data out of HTML and XML files.
    • Pymongo: Python driver for MongoDB.
    • Pandas: A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool.
    • Imblearn: Python library to tackle the problem of imbalanced datasets.
    • Loguru: A Python logging library that aims to make life easier for developers.
    • OpenAI API: A generative AI API.

๐Ÿ“ Project Summary

  • frontend: Contains the frontend application with various components and settings.
    • frontend/app: The main page for the client- and customer-facing portion of our SaaS. Its main feature is a dashboard that presents information retrieved from our pipeline.
    • frontend/app/api: Our middleware to retrieve, process, and present information from our DB to the web application.
  • backend: Houses backend logic including cleaners, models, scrapers, and utilities.
    • backend/src/cleaners: Contains all functionality related to cleaning data in the backend pipeline.
    • backend/src/models: Contains our 6 models to score and help flag social marketplace listings that are suspicious.
    • backend/src/scrapers: Retrieves data from various social media marketplaces and places it in our data pipeline flow.
    • backend/src/utilities: Utilities for all backend-related processes such as logging and our custom DB adapter.
  • documentation: Contains project documentation including sprint reports and charters.

โš™๏ธ Setting Up

Database Access

Make a copy of the .env.example file and make the following changes.

  1. Remove .example from the file name.

  2. Paste the username and password provided in MongoDB Atlas (if you should have access but do not, please contact @waseem-polus).

  3. Paste the connection URL provided in MongoDB Atlas. Include the username and password fields using ${VARIABLE} syntax to embed the value of each variable.
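
To illustrate what the ${VARIABLE} syntax in step 3 resolves to, here is a minimal Python sketch that expands placeholders in a connection URL. The variable names (MONGO_USERNAME, MONGO_PASSWORD) and the cluster host are placeholders invented for this example, not necessarily the names used in .env.example, and the project's own tooling may perform this expansion differently.

```python
from string import Template
from urllib.parse import quote_plus


def build_connection_url(template: str, username: str, password: str) -> str:
    """Expand ${VARIABLE} placeholders in a MongoDB connection URL template."""
    # Percent-encode credentials so characters such as "@" or ":" do not
    # break the structure of the URL.
    values = {
        "MONGO_USERNAME": quote_plus(username),  # hypothetical variable name
        "MONGO_PASSWORD": quote_plus(password),  # hypothetical variable name
    }
    return Template(template).substitute(values)


# Hypothetical host and credentials, for illustration only.
url = build_connection_url(
    "mongodb+srv://${MONGO_USERNAME}:${MONGO_PASSWORD}@cluster.example.mongodb.net",
    "smare_user",
    "p@ss:word",
)
print(url)
```

If a placeholder in the template has no matching value, `Template.substitute` raises a `KeyError`, which makes a missing variable easy to spot.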

Run Scrapers locally

Prerequisites

  • python3
  • pipenv

Installing dependencies
Navigate to scrapers/ and open the virtual environment using

pipenv shell

Then install dependencies using

pipenv install

Scraper Usage
To build a production-ready Docker image, use

pipenv run build

To build a development Docker image, use

pipenv run dev

If there is an existing smarecontainer, run the following:

pipenv run stop

To run a Docker container named "smarecontainer", use (Note: delete any containers with the same name before running)

pipenv run cont

then

# Scrape Craigslist homepage
pipenv run craigslist

# Scrape Facebook Marketplace homepage
pipenv run facebook

Contributors

waseem-polus
Waseem Polus
lryanle
Ryan Lahlou
temitayoaderounmu
Temitayo Aderounmu
athiya26
Athiya Manoj
Yeabgezz
Yeabsra Gebremeskel

License

This project is licensed under the MIT License - see the MIT License file for details.

smare's Issues

Model 2 - Image vs Description Inconsistencies

As a...

user of SMARE

Wants to...

view the risk associated with inaccuracies in images vs description provided

So that...

have a better understanding of the overall score given to a post

Acceptance Criteria

Use a GPT Vision model to

  • Detect any body damage not mentioned in the description
  • Detect color, model, and other visual inaccuracies between images and written descriptions

Resources

No response

Model 4 - Vehicle Frequency

As a...

user of SMARE

Wants to...

view the risk associated with vehicles with significant frequency in marketplaces

So that...

have a better understanding of the overall score given to a post

Acceptance Criteria

  • Find frequencies of found models
  • Assess the freq. of a given scraped vehicle
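
The two criteria above could be sketched roughly as follows. This is only an illustration: the field names `make` and `model` are assumed, and how a frequency should translate into a risk score is left open.

```python
from collections import Counter


def model_frequencies(listings):
    """Count how often each (make, model) pair appears in the scraped listings."""
    return Counter(
        (item["make"].lower(), item["model"].lower()) for item in listings
    )


def relative_frequency(freqs, make, model):
    """Share of all listings that match the given make and model."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    return freqs[(make.lower(), model.lower())] / total


# Hypothetical sample data.
listings = [
    {"make": "Ford", "model": "Edge"},
    {"make": "ford", "model": "edge"},
    {"make": "Honda", "model": "Civic"},
    {"make": "Toyota", "model": "Camry"},
]
freqs = model_frequencies(listings)
share = relative_frequency(freqs, "Ford", "Edge")
```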

Resources

No response

Figure out data dimensions

As a...

No response

Wants to...

No response

So that...

No response

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Spike - Set up web scraping environment

As a...

member of the development team

Wants to...

set up environment

So that...

write and run web scraping scripts

Acceptance Criteria

  • create a virtual environment
  • install selenium
  • install pandas
  • install Beautiful Soup

Resources

No response

Setup Crons for scrapers

As a...

user of SMARE

Wants to...

See new car posts regularly as they are posted to Craigslist and Facebook Marketplace

So that...

Have risk information on up-to-date car posts

Acceptance Criteria

  • Create AWS Lambda functions for Craigslist
  • Create AWS Lambda functions for Facebook Marketplace
  • Set up Cron scheduled triggers using EventBridge

Resources

No response

Clean data

As a...

No response

Wants to...

No response

So that...

No response

Acceptance Criteria

  • Uses data from mongodb's raw data collection
  • reads straight from mongo; i.e. not hardcoded
  • be able to work on at least 99% of the dataset in the above collection
  • this should then be used with data normalization...

Resources

No response

Scaffold Next + SWR project

As a...

member of the development team

Wants to...

have a scaffold react project

So that...

have a shared starting point for frontend development

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Update mongodb record IDs

As a...

member of the development team

Wants to...

be able to reliably fetch data from the database using an id

So that...

display specific posting notifications and information

Acceptance Criteria

  • New ID must fit within the 1024 bytes limit
  • ID must be unique per post, even if the post details are the same (re-posting the same car)
  • ID fields between facebook and craigslist must not clash
  • Must update all fields currently in the DB
  • Must update all scripts that pull from the DB and rely on the ID

Resources

Instead of using the entire link, we can extract the unique ID craigslist and facebook assign their posts and use those instead. The 2 IDs seem to meet the acceptance criteria

| source | full link | id | regex | digit count | notes |
| --- | --- | --- | --- | --- | --- |
| facebook | https://www.facebook.com/marketplace/item/209748332123684/ | 209748332123684 | facebook\.com/marketplace/item/(\d+)/ | 15-16 | can reconstruct link using only id |
| craigslist | https://abilene.craigslist.org/cto/d/abilene-2012-ford-edge-suv/7686412825.html | 7686412825 | /(\d+)\.html$ | 10 | cannot reconstruct link using only id |
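
A minimal Python sketch of this idea, using the two regular expressions listed for Facebook and Craigslist. The `source:id` prefix format is an assumption added here to guarantee that IDs from the two sites cannot clash; the actual scheme chosen by the team may differ.

```python
import re

# Patterns taken from the notes above.
FACEBOOK_ID = re.compile(r"facebook\.com/marketplace/item/(\d+)/")
CRAIGSLIST_ID = re.compile(r"/(\d+)\.html$")


def extract_post_id(link: str) -> str:
    """Return a source-prefixed post ID extracted from a listing link."""
    match = FACEBOOK_ID.search(link)
    if match:
        return f"facebook:{match.group(1)}"  # prefix format is an assumption
    if "craigslist.org" in link:
        match = CRAIGSLIST_ID.search(link)
        if match:
            return f"craigslist:{match.group(1)}"
    raise ValueError(f"Unrecognized listing link: {link}")


fb_id = extract_post_id("https://www.facebook.com/marketplace/item/209748332123684/")
cl_id = extract_post_id(
    "https://abilene.craigslist.org/cto/d/abilene-2012-ford-edge-suv/7686412825.html"
)
```

Both resulting strings are far below the 1024-byte limit named in the acceptance criteria.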

Push Docker Image to AWS ECR

As a...

member of the development team

Wants to...

Containerize scraper scripts and host them on AWS ECR

So that...

I can have an easy way to deploy code to AWS Lambda

Acceptance Criteria

  • Create Docker image with all scripts and necessary dependencies (selenium, bs4, pymongo, chromedrivers, chrome, etc...)
  • Test scrapers work locally inside a docker container
  • Host images on a private repo in AWS ECR (can't create Lambda functions from a public repository)
  • Update README with documentation on how to test scrapers locally

Resources

v4 Scraper Scripts

As a...

owner of SMARE

Wants to...

improve scraper efficiency and safety

So that...

minimize future maintenance and recovery costs

Acceptance Criteria

  • Escape/encode special characters before pushing to DB
  • Stop scrapers after finding 5 duplicates in a row
  • Finish Facebook specific link scraper (description, title status, longitude, latitude)
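
The duplicate-streak rule could look roughly like the sketch below. The function name and the `is_duplicate` callback are hypothetical; in the real scrapers the duplicate check would presumably query the database.

```python
def collect_new_links(links, is_duplicate, max_streak=5):
    """Collect unseen links, stopping after max_streak duplicates in a row."""
    new_links = []
    streak = 0
    for link in links:
        if is_duplicate(link):
            streak += 1
            if streak >= max_streak:
                break  # we have likely reached listings scraped in an earlier run
        else:
            streak = 0  # a new listing resets the streak
            new_links.append(link)
    return new_links


# Hypothetical data: "a" to "e" were already stored by a previous run.
seen = {"a", "b", "c", "d", "e"}
found = collect_new_links(
    ["x", "a", "y", "a", "b", "c", "d", "e", "z"],
    is_duplicate=lambda link: link in seen,
)
```

In this example the scraper stops before reaching "z", because the five links before it are consecutive duplicates.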

Resources

No response

CI/CD to push Scrapers Docker Image to AWS

As a...

member of the development team

Wants to...

automatically push changes to the Scrapers Image to AWS

So that...

save time and focus on other aspects of development

Acceptance Criteria

  • Create an IAM user and use its credentials for this action
  • Set environment variables for AWS login
  • Create a GitHub action that automatically pushes the Docker Image to AWS ECR

Resources

No response

Reorganize python backend modules

As a...

member of the development team

Wants to...

organize python code into a clean file structure

So that...

more easily dockerize, and host code on AWS lambda

Acceptance Criteria

  • separate python code into 3 independent modules (that do not call each other): Scrapers, Cleaners, Models
  • These modules are controlled by a main script (app.py) that decides when each module is called
  • Each module should have an entry point function
  • Each module has the dataflow:
    1. pull from database
    2. perform tasks
    3. push to database
    4. report failure to the main script (app.py)
  • Update entry points in AWS Lambda
  • Restructure repo to the following file structure
root
│
├── backend (currently called scrapers)
│   ├── src
│   │   ├── database.py
│   │   ├── scrapers
│   │   │   ├── __init__.py
│   │   │   └── # scraper files here
│   │   ├── cleaners
│   │   │   ├── __init__.py
│   │   │   └── # cleaner files here
│   │   └── models
│   │       ├── __init__.py
│   │       └── # model files here
│   ├── app.py
│   ├── Dockerfile
│   └── # lambda function entry points
│
└── frontend
    # frontend code here

Resources

Scrape from a craigslist page

As a...

user

Wants to...

see car posts from Craigslist

So that...

assess associated risks effectively

Acceptance Criteria

  • Access all posts of a given city
  • Get a page's HTML

Resources

No response

Create Orchestrator script

As a...

maintainer of SMARE

Wants to...

understand the health of the backend modules

So that...

quickly improve performance and patch any issues as they arise

Acceptance Criteria

  • Create a new collection in MongoDB
  • Orchestrator calls each module in order
  • Orchestrator posts diagnostic data to DB (how long a job ran and performance metrics of each module)
  • Orchestrator sends run report
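
One possible shape for such an orchestrator is sketched below. It calls each module's entry point in order, times it, and records failures. The report fields are assumptions for this example, and writing the report to MongoDB or sending it anywhere is deliberately left out.

```python
import time


def run_pipeline(modules):
    """Call each (name, entry_point) pair in order and collect diagnostics."""
    report = []
    for name, entry_point in modules:
        start = time.perf_counter()
        try:
            entry_point()
            status, error = "ok", None
        except Exception as exc:
            # Record the failure and keep going so the report covers every module.
            status, error = "failed", str(exc)
        report.append(
            {
                "module": name,
                "status": status,
                "error": error,
                "duration_s": time.perf_counter() - start,
            }
        )
    return report


# Hypothetical stand-ins for the real module entry points.
calls = []


def failing_cleaner():
    raise RuntimeError("db down")


report = run_pipeline(
    [
        ("scrapers", lambda: calls.append("scrapers")),
        ("cleaners", failing_cleaner),
        ("models", lambda: calls.append("models")),
    ]
)
```

Continuing after a failure is a design choice made for this sketch; stopping at the first failed module would be an equally reasonable policy if later modules depend on fresh data.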

Resources

No response

Spike - Learn AWS Lambda + ECR + EventBridge integration

As a...

member of the development team

Wants to...

understand my way around AWS tools such as AWS Lambda and AWS CodePipeline

So that...

Efficiently setup Cron jobs on AWS Lambda and avoid costly misconfiguration mistakes

Acceptance Criteria

  • Become familiar with AWS Lambda (for deploying serverless functions)
  • Become familiar with AWS CodePipeline (CI/CD pipeline with GitHub)
  • Become familiar with AWS ECR (for hosting Docker Images)
  • Become familiar with EventBridge and Cron syntax
  • Become familiar with pricing models of all above services

Resources

Model 3 - Listed vs market price comparison

As a...

user of SMARE

Wants to...

view the risk associated with significant differences between listed price vs market price of a given vehicle

So that...

have a better understanding of the overall score given to a post

Acceptance Criteria

  • Find market price using a KBB (Kelley Blue Book) API
  • Compare listed price to KBB price
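
The comparison step might be sketched as below. The market price is passed in as a plain number, since how it is fetched is outside this example, and the 30% threshold is an arbitrary placeholder rather than a value from the project.

```python
def price_deviation(listed_price: float, market_price: float) -> float:
    """Relative difference between listed and market price (negative = below market)."""
    if market_price <= 0:
        raise ValueError("market_price must be positive")
    return (listed_price - market_price) / market_price


def is_price_outlier(listed_price, market_price, threshold=0.3):
    """Flag listings whose price deviates from market by more than the threshold."""
    # threshold=0.3 is an assumed placeholder, not a tuned value.
    return abs(price_deviation(listed_price, market_price)) > threshold


deviation = price_deviation(8000, 10000)   # listed 20% below market
flagged = is_price_outlier(4000, 10000)    # listed 60% below market
```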

Resources

No response

Spike - Learn Docker

As a...

member of the development team

Wants to...

understand how docker is used

So that...

have a consistent environment for scripts to run on when deployed on AWS

Acceptance Criteria

  • Learn Dockerfile syntax
  • Learn options for building a Docker Image
  • Learn how to test scripts inside a container locally
  • Learn how to push Docker Images to AWS

Resources

No response

Normalize data

As a...

No response

Wants to...

No response

So that...

No response

Acceptance Criteria

  • this should follow from the cleaning data portion of the python extraction script
  • be able to work on at least 99% of the dataset in the above collection
  • All necessary features (make, model, year, etc.) are normalized and in a standard, uniform, understood format
  • should pipe directly into the extraction phase of this extraction script

Resources

No response

Data visualization

As a...

user

Wants to...

see visual representation of the data

So that...

get an intuitive understanding of the information quickly

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Spike - Learn Beautiful Soup

As a...

member of the development team

Wants to...

learn the basics of Beautiful Soup

So that...

be able to parse and extract information from html pages

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Set up DB to save scraped information

As a...

member of the development team

Wants to...

save scraped car posts

So that...

perform data analysis and risk assessment on them

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Extract meaningful information from scrape DB

As a...

No response

Wants to...

No response

So that...

No response

Acceptance Criteria

  • Uses data from mongodb's raw data collection
  • reads straight from mongo; i.e. not hardcoded
  • be able to work on at least 99% of the dataset in the scraped_raw collection
  • All necessary features (make, model, year, etc.) are normalized and in a standard, uniform, understood format
  • push extracted data from scraped_raw into scraped_extract (this is where the data analysis models will read from)
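
As a rough illustration of the extraction and the 99% criterion, the sketch below pulls year, make, and model out of a listing title and measures how many records were fully extracted. The record fields (`_id`, `title`) and the rule that make and model are the two words following the year are simplifying assumptions; real titles are messier and would need a more robust approach.

```python
import re

YEAR = re.compile(r"\b(19\d{2}|20\d{2})\b")


def extract_features(raw: dict) -> dict:
    """Extract normalized year, make, and model from a raw listing title."""
    title = raw.get("title", "")
    match = YEAR.search(title)
    year = int(match.group(1)) if match else None
    # Assumption: make and model are the two words right after the year.
    rest = title[match.end():].split() if match else []
    make = rest[0].lower() if len(rest) > 0 else None
    model = rest[1].lower() if len(rest) > 1 else None
    return {"_id": raw.get("_id"), "year": year, "make": make, "model": model}


def coverage(raw_records) -> float:
    """Fraction of records for which year, make, and model were all extracted."""
    extracted = [extract_features(record) for record in raw_records]
    if not extracted:
        return 0.0
    complete = sum(
        1 for item in extracted
        if None not in (item["year"], item["make"], item["model"])
    )
    return complete / len(extracted)


# Hypothetical raw records.
sample = [
    {"_id": "craigslist:7686412825", "title": "2012 Ford Edge SUV"},
    {"_id": "facebook:209748332123684", "title": "Clean truck, runs great"},
]
features = extract_features(sample[0])
ratio = coverage(sample)
```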

Resources

No response

Export to CSV

As a...

user

Wants to...

export data to csv

So that...

generate reports and spreadsheets

Acceptance Criteria

  • UI button to export
  • Export table as CSV

Resources

No response

Figma webapp prototype

As a...

No response

Wants to...

No response

So that...

No response

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Model 1 - Posting Sentiment

As a...

user of SMARE

Wants to...

view the risk associated with a post's general sentiment

So that...

have a better understanding of the overall score given to a post

Acceptance Criteria

  • Use a GPT model to analyze the sentiment of a post based on fields such as the title and post body

Resources

No response

Implement 1-stage architecture

As a...

User of SMARE

Wants to...

Have all listing info including description and images

So that...

Get most accurate risk score using all analysis models

Acceptance Criteria

  • Scrape specific listings after/while homepage is scraped
  • Stage 1 scrapers shouldn't scrape images; only stage 2 should (this will save a lot of time on the Craigslist side)
  • Add stage 1 and 2 results to DB at once

Resources

No response

Spike - Learn Python data science libraries

As a...

member of the development team

Wants to...

learn the basics of Python data science libraries

So that...

perform data analysis on collected information

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Send Log Reports via Email

As a...

member of the development team

Wants to...

be alerted of scraper errors

So that...

quickly take action to prevent and contain damage

Acceptance Criteria

  • Set a timeout in the script (has to be shorter than the AWS Lambda timeout)
  • Alert the dev team using Discord webhooks
  • Alert the dev team when scrapers are triggered
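
A small sketch of the first two criteria: a time budget that the scraper loop can check, and a helper that prepares a Discord webhook message. The class and function names are hypothetical, the webhook URL is a placeholder, and the request is only built here, not sent.

```python
import json
import time
import urllib.request


class Deadline:
    """Time budget that should be set shorter than the AWS Lambda timeout."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + budget_s

    def expired(self):
        return self._clock() >= self._expires


def build_discord_alert(webhook_url, message):
    """Prepare a POST request carrying the message as a Discord webhook payload."""
    payload = json.dumps({"content": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real webhook URL comes from the Discord channel settings.
request = build_discord_alert(
    "https://example.invalid/webhook", "Scraper stopped: time budget exceeded"
)
# Sending would be: urllib.request.urlopen(request)
```

Inside the scraping loop, checking `deadline.expired()` before each listing lets the script stop cleanly and send an alert before Lambda terminates it.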

Resources

No response

Integrate react-table table

As a...

user

Wants to...

see car postings in a tabular format

So that...

filter, sort, and search for posts easily

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Scrape from a Facebook Marketplace page

As a...

user

Wants to...

see car posts from Facebook Marketplace

So that...

assess associated risks effectively

Acceptance Criteria

  • Access all posts of a given city
  • Get a page's HTML

Resources

No response

Spike - Learn Selenium

As a...

member of the development team

Wants to...

learn the basics of Selenium

So that...

be able to navigate web pages using a headless browser

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Determine data and endpoints needed

As a...

member of the development team

Wants to...

understand data and endpoints needed

So that...

create an API that serves that information to the frontend

Acceptance Criteria

  • [ ]
  • [ ]
  • [ ]

Resources

No response

Model 5 - Theft Likelihood

As a...

user of SMARE

Wants to...

view the risk associated with vehicles with higher theft occurrences

So that...

have a better understanding of the overall score given to a post

Acceptance Criteria

  • Theft based on car make, model, year
  • Theft based on location (e.g. rural vs urban)

Resources

Vehicle Theft Rates Search
