Comments (5)
Hi @oxlsf, we plan to open-source the emails we used to create the dataset, but it will take some work and time: unlike the emails from the public Enron dataset, not all of them were originally publicly available, so we'll need to remove all sensitive information first.
Right now we provide just the processed dataset here: https://github.com/mailgun/talon/blob/master/talon/signature/data/train.data. Each line of the file represents a line from an email. Each element in a line is either 0 or 1, except for the last one: an element is 0 if the corresponding feature from the feature set is false for the email line and 1 otherwise. The last element is 1 if the email line belongs to a signature and -1 otherwise. Here's the feature set that we used: https://github.com/mailgun/talon/blob/master/talon/signature/learning/featurespace.py#L15
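For anyone reading the file by hand, here is a minimal sketch of the row format described above. It assumes the elements are comma-separated (the delimiter isn't stated in this comment), and `parse_train_line` is a hypothetical helper, not part of talon:

```python
from typing import List, Tuple


def parse_train_line(line: str, delimiter: str = ",") -> Tuple[List[int], int]:
    """Split one row of train.data into (features, label).

    Assumption: elements are separated by `delimiter` (comma by default).
    All elements but the last are 0/1 feature flags; the last element is
    1 for a signature line and -1 otherwise.
    """
    values = [int(float(v)) for v in line.strip().split(delimiter)]
    *features, label = values
    if any(f not in (0, 1) for f in features):
        raise ValueError("feature values must be 0 or 1")
    if label not in (1, -1):
        raise ValueError("label must be 1 or -1")
    return features, label


# Example row with three features, labelled as a signature line.
features, label = parse_train_line("0,1,0,1\n")
```

The position of each feature flag corresponds to the order of the features in the feature set linked above.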
Hi @obukhov-sergey, how is your progress on open-sourcing your training data? I have the same need as @oxlsf.
By the way, could you kindly provide the processed data from Enron (i.e. with a line marked with #sig# if it belongs to a signature)?
@oxlsf @dichen001 we recently open-sourced an annotated email dataset. It's not the one used by talon, but we plan to switch to it once it has over 600 emails (it currently has over 190).
The idea is to use cleansed data that doesn't have private or personal information and encourage people to contribute their own emails (with cleansed phone numbers, URLs, etc) to keep the dataset up to date.
Feel free to contribute :) I'll be adding more emails shortly. I also plan to refactor the code that prepares the train data so that it's easy to add more emails to the dataset and test the library.
Hi @obukhov-sergey, I have the same need to train on the data. Is there a way I can do it now? Can you please provide the steps to train the classifier?
@itsvivekshetty @dichen001 @oxlsf here's the open-sourced dataset: https://github.com/mailgun/forge. It's not the one used for training, but once it has more data we'll use it instead. PRs are welcome. I've also added a section to the Readme with more info on how to retrain the classifier with your own raw emails.