Comments (2)
What is also unclear to me is this test in https://github.com/afedosenko/talon/blob/master/tests/signature/learning/featurespace_test.py:
s = '''John Doe
VP Research and Development, Xxxx Xxxx Xxxxx
555-226-2345
[email protected]'''
sender = 'John <[email protected]>'
features = fs.features(sender)
result = fs.apply_features(s, features)
# note that we don't consider the first line because signatures don't
# usually take all the text, empty lines are not considered
eq_(result, [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
The last line contains 'john', which means the last '0' in that row should be '1'.
Hi @justafucker. Sorry for the confusion, and thanks for your interest and questions. I will try to answer them.
The 1st test checks that ['Sergey', 'Obukhov'] will be among the extracted names, not that they are the only ones extracted. E.g. if you modify the test and add serobnic to the list, the test will pass as well.
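To make the "among the extracted names" point concrete, here is a minimal sketch of an inclusion-style check. The helper `names_included` and the `extracted` list are hypothetical and only illustrate the idea; they are not taken from the talon test suite, and the real extractor's output may differ.

```python
def names_included(expected, extracted):
    """Return True when every expected name appears in the extracted names."""
    extracted_set = set(extracted)
    return all(name in extracted_set for name in expected)


# Hypothetical extractor output, for illustration only.
extracted = ["Sergey", "Obukhov", "Serobnic"]

# An inclusion check passes for a subset and for the full list alike.
subset_ok = names_included(["Sergey", "Obukhov"], extracted)
full_ok = names_included(["Sergey", "Obukhov", "Serobnic"], extracted)

# A strict equality check would reject the shorter list.
strict_ok = ["Sergey", "Obukhov"] == extracted

print(subset_ok, full_ok, strict_ok)
```

With a check of this shape, adding one more expected name only fails if that name is missing from the extracted ones.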
There is a test that specifically checks that, given [email protected], we'll extract sergey: https://github.com/mailgun/talon/blob/master/tests/signature/learning/helpers_test.py#L103
But we definitely encourage you to submit a PR if you find tests / code confusing and wish to contribute / improve them.
Regarding your 2nd question: the algorithm looks for lines like "John Doe", "John", or "Doe", i.e. a line should end with an extracted name, or the extracted name should be a detached word. This requirement might seem strange with respect to "[email protected]", but in general it helps to avoid false positives when an extracted name happens to be a general sequence of characters that might occur in a line.
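The rule above can be sketched roughly as follows. This is an illustration under assumptions, not talon's actual implementation: `make_name_matcher` is a hypothetical helper, "detached word" is interpreted here as "followed by whitespace or the end of the line", and `john@example.com` is a made-up address.

```python
import re


def make_name_matcher(names):
    """Build a predicate returning 1 when a line contains one of `names`
    followed by whitespace or the end of the line, and 0 otherwise."""
    variants = []
    for name in names:
        # Accept both the lowercase and the capitalized spelling.
        variants.extend([name, name.capitalize()])
    if not variants:
        return lambda line: 0
    alternatives = "|".join(re.escape(v) for v in variants)
    pattern = re.compile("(?:" + alternatives + r")(?:\s|$)")
    return lambda line: 1 if pattern.search(line) else 0


matcher = make_name_matcher(["john"])

print(matcher("John Doe"))          # name followed by a space
print(matcher("Regards, John"))     # name at the end of the line
print(matcher("john@example.com"))  # name followed by "@"
print(matcher("Johnson Ltd"))       # name is only a prefix of a longer word
```

Under this interpretation the first two lines match and the last two do not, which is why an address containing the name does not set the feature.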