
hodel33 / ml-dupdetect-inzpo


Showcases Machine Learning logic and custom Django Admin view for sophisticated potential duplicate detection and management in a larger Django project -> https://inzpo.me 🌟

License: MIT License

Python 100.00%

ml-dupdetect-inzpo's Introduction

ML Duplicate Detection - inzpo.me

📋 Overview

This repository showcases a snippet from a larger Django project (inzpo.me) that uses Machine Learning algorithms for the sophisticated analysis and management of potential duplicate content.

Using scikit-learn's TF-IDF Vectorization and Cosine Similarity features, this logic efficiently identifies potential duplicate entries in large datasets. The detection process offers two modes for added flexibility: it can either focus on newly scraped episodes, comparing them against both the existing database and among themselves, or it can analyze the entire dataset.
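As a rough illustration of the two modes described above, here is a minimal sketch built on scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The function name `find_potential_duplicates`, the `new_only` flag, and the 0.8 threshold are assumptions made for this example and are not taken from the inzpo.me codebase.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_potential_duplicates(existing, new, threshold=0.8, new_only=True):
    """Return (i, j, score) tuples for text pairs scoring at or above threshold.

    Indices refer to the combined list existing + new.
    All names and the default threshold are hypothetical.
    """
    texts = list(existing) + list(new)
    if len(texts) < 2:
        return []

    # Fit one vocabulary over all texts so every vector shares the same features.
    matrix = TfidfVectorizer().fit_transform(texts)

    first_new = len(existing)
    # New-only mode: only rows for new items are compared (against everything
    # that precedes them). Full mode: every row is compared.
    rows = range(first_new, len(texts)) if new_only else range(len(texts))

    pairs = []
    for i in rows:
        scores = cosine_similarity(matrix[i], matrix).ravel()
        for j, score in enumerate(scores):
            # j < i reports each unordered pair once and skips self-comparison.
            if j < i and score >= threshold:
                pairs.append((j, i, float(score)))
    return pairs
```

In new-only mode each new entry is checked against the existing entries and against the new entries before it, so pairs that consist solely of existing entries are never scored. Passing `new_only=False` scans every pair in the combined list.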

To enhance manageability, a custom Django Admin view has also been implemented. This allows for easy identification and exclusion of duplicates.

inzpo.me, a passion project of mine & Dyaland, is a first-of-its-kind platform that uses Python and Django to seamlessly connect people with inspiring personalities by notifying users of guest appearances on podcasts. The platform emphasizes scalability, performance optimization, and user engagement through various integrations like Spotify/ChatGPT APIs, Django Q2 for async task management, trie search for efficient data retrieval, custom caching mechanisms and other innovative functionalities.


🌟 Features

  • ML-Driven Analysis: Utilizes Machine Learning algorithms for feature extraction and similarity computation.
  • TF-IDF Vectorization: Transforms textual data into numerical vectors for advanced analysis.
  • Cosine Similarity: Computes similarity scores to accurately identify potential duplicates.
  • Custom Django Admin View: Facilitates the management of potential duplicates, allowing for quick decision-making on whether an entry is a duplicate or not.
  • Flexible and Optimized Runs: The potential duplicate detection process is designed to run in two modes. It can focus only on newly scraped episodes for daily runs or analyze the entire dataset, which keeps routine runs small and makes the process adaptable to different use cases.
  • Threshold Tuning: The similarity thresholds for names, descriptions, and durations are customizable, allowing for fine-tuning based on specific needs.
  • Resource Monitoring: Includes built-in RAM usage tracking, which helps keep memory consumption in check when the system is hosted online.
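
The threshold tuning and resource monitoring features above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Thresholds` fields and their default values, the rule that all three signals must pass, and the use of the standard library's `tracemalloc` as a stand-in for whatever RAM tracking the project actually uses.

```python
import tracemalloc
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All three defaults are illustrative assumptions, not the project's values.
    name: float = 0.85          # minimum similarity between names
    description: float = 0.75   # minimum similarity between descriptions
    duration_seconds: int = 60  # maximum allowed difference in duration


def is_potential_duplicate(name_sim, desc_sim, duration_a, duration_b,
                           t=Thresholds()):
    """Flag a pair only when every signal passes its threshold (assumed rule)."""
    return (
        name_sim >= t.name
        and desc_sim >= t.description
        and abs(duration_a - duration_b) <= t.duration_seconds
    )


def run_with_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)
```

Note that `tracemalloc` only reports memory allocated through Python's allocator, so it undercounts compared with the process-level RAM figure a hosting dashboard would show.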

Admin View Screenshot

Terminal Output Screenshot


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

Summary:

  1. Permission: The software and associated documentation files can be used, copied, modified, merged, published, distributed, sublicensed, and/or sold.
  2. Condition: The original copyright notice and the MIT license text must be included in all copies or substantial portions of the software.
  3. No Warranty: The software is provided "as is", without any warranty.

For the full license, please refer to the LICENSE file in the repository.


💬 Feedback & Contact

I'd love to network, discuss tech, or swap music recommendations. Feel free to connect with me on:

🌐 LinkedIn: Björn Hödel
🐦 Twitter: @hodel33
📸 Instagram: @hodel33
📧 Email: [email protected]
