philipp-sc / llm-fraud-detection
Robust semi-supervised spam detection using Rust native NLP pipelines.
License: Apache License 2.0
Replacing the Random Forest Regressor with a better model (e.g. a neural network) should improve performance.
(Random Forests cannot extrapolate: their predictions are bounded by the target values seen during training, so they have difficulty generalizing to unseen data.)
governance_proposal_spam_ham.csv
---------------
count spam: 172
count ham: 2551
Note: This should help reduce false positives, since the model has not yet seen many ham (or spam) examples of governance proposals.
Note: Consider reducing the ham dataset by filtering out some of the rejected proposals with a high share of votes against, to avoid training likely spam as ham.
Re-train using the improved crypto governance proposal dataset.
to reduce the KNN model size.
Instead of testing the model's performance on the same data it was trained on, split the data into a training set and a test set.
90% training data
10% test data
Make sure to sample from both spam and ham.
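The split described above can be sketched as a stratified hold-out: each class is shuffled and divided separately, so both spam and ham appear in the training and test sets in roughly the original proportion. This is a minimal sketch, not the repository's code. The `Sample` struct, the `stratified_split` function, and the small xorshift generator (used only to avoid an external crate) are all assumptions made for illustration.

```rust
/// Hypothetical labeled document; `is_spam` is the class label.
#[derive(Debug, Clone)]
struct Sample {
    text: String,
    is_spam: bool,
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Not suitable for anything beyond reproducible shuffling.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle driven by the generator above.
fn shuffle<T>(items: &mut [T], rng: &mut XorShift64) {
    for i in (1..items.len()).rev() {
        let j = (rng.next() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Splits `samples` into (train, test), holding out `test_fraction`
/// of the spam samples and `test_fraction` of the ham samples.
fn stratified_split(
    samples: Vec<Sample>,
    test_fraction: f64,
    seed: u64,
) -> (Vec<Sample>, Vec<Sample>) {
    let (mut spam, mut ham): (Vec<Sample>, Vec<Sample>) =
        samples.into_iter().partition(|s| s.is_spam);
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut train = Vec::new();
    let mut test = Vec::new();
    for class in [&mut spam, &mut ham] {
        shuffle(class.as_mut_slice(), &mut rng);
        let n_test = (class.len() as f64 * test_fraction).round() as usize;
        test.extend(class.drain(..n_test)); // first n_test shuffled items
        train.append(class); // everything that remains
    }
    (train, test)
}
```

Calling `stratified_split(samples, 0.1, 42)` gives the 90/10 split. With the counts listed for `governance_proposal_spam_ham.csv` (172 spam, 2551 ham), rounding per class holds out 17 spam and 255 ham samples for testing.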
inspiration / dataset: https://github.com/ebubekirbbr/phishing_url_detection
add feature url-fraud-likelihood
worth a try
Right now each text document is embedded as a whole; experiment with partitioning the text first (by sentences, paragraphs, etc.).
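One way to run that experiment is to split a document into chunks, embed each chunk, and combine the chunk vectors. The sketch below shows a naive paragraph splitter, a naive sentence splitter, and mean pooling as one possible way to combine chunk embeddings. All function names are hypothetical, the splitting rules are deliberately simple (abbreviations such as "e.g." will be split incorrectly), and the embedding step itself is left out because it depends on the model in use.

```rust
/// Splits text into paragraphs on blank lines (assumes "\n" line endings).
fn split_paragraphs(text: &str) -> Vec<&str> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect()
}

/// Naive sentence splitter: cuts after '.', '!' or '?' when the mark is
/// followed by whitespace or the end of the text.
fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if matches!(c, '.' | '!' | '?') {
            let at_boundary = chars
                .peek()
                .map_or(true, |&(_, next)| next.is_whitespace());
            if at_boundary {
                let end = i + c.len_utf8();
                let sentence = text[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }
    // Keep trailing text that has no closing punctuation.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}

/// Averages per-chunk embedding vectors into one document vector.
/// Assumes all vectors have the same length.
fn mean_pool(vectors: &[Vec<f32>]) -> Vec<f32> {
    if vectors.is_empty() {
        return Vec::new();
    }
    let mut mean = vec![0.0f32; vectors[0].len()];
    for v in vectors {
        for (m, x) in mean.iter_mut().zip(v) {
            *m += x;
        }
    }
    let n = vectors.len() as f32;
    mean.iter_mut().for_each(|m| *m /= n);
    mean
}
```

Mean pooling keeps the feature vector the same size as with whole-document embedding, so the downstream model would not need to change; other aggregations (max, or keeping the most spam-like chunk) could be compared in the same way.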
The current hard-coded engineered features are very basic. They provide useful information, but there is room for improvement.
src/build/feature_engineering/mod.rs
Instead of hard-coded conditions, create (or augment them with) a bag-of-words vector derived from the training dataset.
E.g. use a frequency encoding of common words that often occur in spam but not in ham, and vice versa.
This results in two vectors that together contain the most important/common words for and against a spam classification.
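The bag-of-words idea above could look roughly like the following. This is a sketch under stated assumptions, not the code in `src/build/feature_engineering/mod.rs`: the function names are hypothetical, the tokenizer is a plain alphanumeric split, and words are ranked by the difference in document frequency between the two classes, which is only one of several reasonable scoring choices.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercases and splits on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Counts, for each word, how many documents contain it at least once.
fn document_frequency(docs: &[&str]) -> HashMap<String, usize> {
    let mut df = HashMap::new();
    for doc in docs {
        let unique: HashSet<String> = tokenize(doc).into_iter().collect();
        for word in unique {
            *df.entry(word).or_insert(0) += 1;
        }
    }
    df
}

/// Returns up to `k` words that appear in a larger share of `target`
/// documents than of `other` documents, best first.
fn top_indicative_words(target: &[&str], other: &[&str], k: usize) -> Vec<String> {
    let target_df = document_frequency(target);
    let other_df = document_frequency(other);
    let mut scored: Vec<(String, f64)> = target_df
        .iter()
        .map(|(word, &n)| {
            let p_target = n as f64 / target.len() as f64;
            let p_other =
                *other_df.get(word).unwrap_or(&0) as f64 / other.len().max(1) as f64;
            (word.clone(), p_target - p_other)
        })
        .filter(|(_, score)| *score > 0.0)
        .collect();
    // Highest score first; ties are broken alphabetically for determinism.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    scored.into_iter().take(k).map(|(word, _)| word).collect()
}

/// Frequency-encodes `text` against a fixed vocabulary:
/// each entry is (occurrences of the word) / (total tokens in the text).
fn encode(text: &str, vocab: &[String]) -> Vec<f32> {
    let tokens = tokenize(text);
    let total = tokens.len().max(1) as f32;
    vocab
        .iter()
        .map(|w| tokens.iter().filter(|t| *t == w).count() as f32 / total)
        .collect()
}
```

Calling `top_indicative_words(&spam, &ham, k)` and `top_indicative_words(&ham, &spam, k)` on the training texts yields the two vocabularies, and `encode` turns any document into the two corresponding feature vectors. Without stop-word filtering, very common words can end up in either vocabulary, so a filter or a stronger score would likely be needed in practice.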
lots of testing resulted in ugly code