spamexperts / pyzor
Pyzor is a Python implementation of a spam-blocking networked system that uses spam signatures to identify spam.
License: GNU General Public License v2.0
Since the public server doesn't permit unauthenticated whitelisting, it would be useful for there to be some way for people to request whitelisting other than emailing the mailing list.
This could ask for the exact message and the digest (and generate an error if they do not match) so that it's clear that the message is legitimately ham (and we can provide assurances around privacy).
I'm thinking of just a simple page asking for the relevant information (with the digest check being the only really dynamic bit) and then emailing the developers with the appropriate information. Since we're spread around the world, this ought to get reasonably quick action.
We need to ensure that we close all file pointers explicitly rather than relying on auto-closing, which only works reliably under CPython.
For example, this currently causes an issue when running under PyPy: the PID file is not closed, so the data is not written to disk.
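A minimal sketch of explicit closing, assuming the PID file is written by a small helper. The name `write_pid_file` and the file name are hypothetical and not taken from the pyzord source.

```python
import os
import tempfile


def write_pid_file(path):
    """Write the current process ID to `path` and close the file explicitly."""
    # The `with` block closes (and therefore flushes) the file when it exits,
    # on every Python implementation, not only under CPython's refcounting.
    with open(path, "w") as pid_file:
        pid_file.write("%d\n" % os.getpid())


if __name__ == "__main__":
    pid_path = os.path.join(tempfile.mkdtemp(), "pyzord.pid")
    write_pid_file(pid_path)
    with open(pid_path) as pid_file:
        print(pid_file.read().strip())
```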
I recently received a spam containing a non-breaking space (encoded as =C2=A0 in quoted-printable UTF-8 if that is relevant). When running pyzor predigest, the non-breaking space is kept in the predigest output. I have no idea if spammers do this but they could randomly replace spaces with non-breaking spaces before sending mail to generate a different fingerprint each time and evade detection.
I believe that simply changing
ws_ptrn = re.compile(r'\s')
to
ws_ptrn = re.compile(r'\s', flags=re.UNICODE)
would address this (including all the other unicode space characters), but at the cost of breaking compatibility with signatures from older versions of pyzor.
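The difference can be illustrated with a short sketch. It runs under Python 3, where `str` patterns are already Unicode-aware, so `re.ASCII` is used here to stand in for the narrower matching described in the report. The sample string is made up.

```python
import re

# re.ASCII restricts \s to ASCII whitespace; re.UNICODE (the default for
# str patterns in Python 3) also matches characters such as U+00A0.
ascii_ws = re.compile(r"\s", flags=re.ASCII)
unicode_ws = re.compile(r"\s", flags=re.UNICODE)

sample = "buy\u00a0now today"

ascii_result = ascii_ws.sub("", sample)      # the non-breaking space survives
unicode_result = unicode_ws.sub("", sample)  # the non-breaking space is removed

print(repr(ascii_result))
print(repr(unicode_result))
```

Because the two patterns normalize the same message differently, the resulting digests differ as well, which is the compatibility cost mentioned above.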
The pyzor client supports addresses whose port is given as a string, for example:
>>> str(client.ping(address=("public.pyzor.org", "24441")))
'Code: 200\nDiag: OK\nPV: 2.1\nThread: 65482\n\n'
But the address must match the one used in the accounts: if the accounts use integer ports instead, for example, the correct account is not used. The client API should be more flexible about this.
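One possible approach, sketched under the assumption that accounts are keyed by `(host, port)` tuples, is to normalize every address before it is used as a key. `normalize_address` and the `accounts` mapping are hypothetical names used only for illustration.

```python
def normalize_address(address):
    """Return (host, port) with the port coerced to an integer."""
    host, port = address
    return (str(host), int(port))


# Hypothetical accounts mapping keyed by (host, integer port).
accounts = {("public.pyzor.org", 24441): "alice"}

# A string port now resolves to the same account as an integer port.
lookup = accounts.get(normalize_address(("public.pyzor.org", "24441")))
print(lookup)
```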
That is, add the same support that #23 added for the MySQL engine, but for the Redis engine.
With the way records are currently encoded in the Redis engine, one-step increments are not possible. We'll need to change the records from strings to hashes, and also provide a way to migrate the database.
The existing migration script can be used. A simple check should be performed when starting the pyzord server, looking for a "version" entry in the database and checking whether it matches the current implementation. If not, an error message should be displayed with instructions on migrating the database.
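The start-up check could look roughly like the sketch below. It assumes the database can be read through a mapping-like object with a `get` method; `EXPECTED_VERSION`, `DatabaseVersionError` and `check_database_version` are hypothetical names, and a plain dict stands in for Redis.

```python
class DatabaseVersionError(Exception):
    """Raised when the stored schema version does not match the server."""


# Hypothetical schema version understood by this server build.
EXPECTED_VERSION = "2"


def check_database_version(store, expected=EXPECTED_VERSION):
    """Raise DatabaseVersionError unless store['version'] equals `expected`."""
    found = store.get("version")
    if found != expected:
        raise DatabaseVersionError(
            "Database version %r does not match expected version %r; "
            "please run the migration script before starting pyzord."
            % (found, expected)
        )


if __name__ == "__main__":
    check_database_version({"version": "2"})  # passes silently
    try:
        check_database_version({})  # no "version" entry: old layout
    except DatabaseVersionError as exc:
        print(exc)
```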
We should improve the pyzor client script so that it returns appropriate error codes, for example when the server is unreachable, so that the result is easily parsed.
For example, in case of a timeout the pyzor script does not output anything:
$ pyzor -d ping
2014-07-21 15:23:14,018 WARNING No accounts are setup. All commands will be executed by the anonymous user.
2014-07-21 15:23:14,018 DEBUG sending: 'Op: ping\nThread: 63596\nPV: 2.1\nUser: anonymous\nTime: 1405945394\nSig: b9e0f69deb13376b4517911e09f01abd526f39e7\n\n'
2014-07-21 15:23:19,024 ERROR ('127.0.0.1', 24441) TimeoutError: Reading response timed-out.
Or in case the command is not valid:
$ pyzor -d pign
2014-07-21 15:26:53,056 WARNING No accounts are setup. All commands will be executed by the anonymous user.
2014-07-21 15:26:53,056 ERROR Unknown command: pign
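A sketch of how distinct exit codes might be returned. The code values, the `run` function and the handler table are assumptions for illustration, and the built-in `TimeoutError` stands in for whatever exception the client raises.

```python
import sys

# Hypothetical exit codes; the values are illustrative only.
EXIT_OK = 0
EXIT_USAGE = 2
EXIT_UNREACHABLE = 3


def run(command, handlers):
    """Dispatch `command` and translate failures into exit codes."""
    handler = handlers.get(command)
    if handler is None:
        sys.stderr.write("Unknown command: %s\n" % command)
        return EXIT_USAGE
    try:
        handler()
    except TimeoutError as exc:
        sys.stderr.write("Server unreachable: %s\n" % exc)
        return EXIT_UNREACHABLE
    return EXIT_OK


if __name__ == "__main__":
    def ping():
        raise TimeoutError("Reading response timed-out.")

    # Use the first command-line argument, defaulting to "ping".
    sys.exit(run(sys.argv[1] if len(sys.argv) > 1 else "ping", {"ping": ping}))
```

A calling script can then branch on `$?` instead of parsing log output.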
On http://pyzor.readthedocs.org/en/release-1-0-0/pyzor.client.html the documentation says:
>>> digest = pyzor.digest.get_digest(msg)
I get
>>> import pyzor.digest
>>> pyzor.digest.get_digest
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'get_digest'
The WL-Entered value should reflect the date on which the first whitelist occurred, not the date on which the message was first reported as spam.
For example:
$ pyzor info < testausta.eml
public.pyzor.org:24441 (200, 'OK')
Count: 6
Entered: Thu Sep 25 02:39:13 2014
Updated: Sat Jan 17 16:52:32 2015
WL-Count: 12
WL-Entered: Thu Sep 25 02:39:13 2014
WL-Updated: Thu Jan 29 16:15:45 2015
The server engines in the pyzor.engines package all follow a certain pattern and must implement certain methods.
To make the package more consistent, and to make adding more engines easier, we should create an abstract base class from which all engines inherit.
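Such a base class could be sketched with the standard `abc` module. The class names and the chosen set of abstract methods are assumptions for illustration; the issue does not list the methods an engine must implement.

```python
import abc


class BaseEngine(abc.ABC):
    """Hypothetical abstract base class for storage engines."""

    @abc.abstractmethod
    def __getitem__(self, key):
        """Return the record stored under `key`."""

    @abc.abstractmethod
    def __setitem__(self, key, value):
        """Store `value` under `key`."""

    @abc.abstractmethod
    def __delitem__(self, key):
        """Remove the record stored under `key`."""

    @abc.abstractmethod
    def iteritems(self):
        """Iterate over (key, record) pairs."""


class MemoryEngine(BaseEngine):
    """Toy in-memory engine showing what a subclass must provide."""

    def __init__(self):
        self._records = {}

    def __getitem__(self, key):
        return self._records[key]

    def __setitem__(self, key, value):
        self._records[key] = value

    def __delitem__(self, key):
        del self._records[key]

    def iteritems(self):
        return iter(self._records.items())
```

An engine that forgets one of the abstract methods then fails at instantiation time with a `TypeError` rather than at the first request.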
The gdbm backend assumes that dates are stored in the format "%Y-%m-%d %H:%M:%S.%f" which is usually true:
str(datetime.datetime(2014, 8, 14, 11, 42, 17, 1337))
'2014-08-14 11:42:17.001337'
However, if a record is inserted exactly "on the second", the microseconds are removed:
str(datetime.datetime(2014, 8, 14, 11, 42, 17, 0))
'2014-08-14 11:42:17'
On busy servers this happens occasionally, and after a restart the database cannot be loaded:
Traceback (most recent call last):
File "/usr/bin/pyzord", line 4, in <module>
__import__('pkg_resources').run_script('pyzor==0.8.0', 'pyzord')
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 534, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1445, in run_script
exec(script_code, namespace, namespace)
File "/usr/lib/python2.7/site-packages/pyzor-0.8.0-py2.7.egg/EGG-INFO/scripts/pyzord", line 389, in <module>
File "/usr/lib/python2.7/site-packages/pyzor-0.8.0-py2.7.egg/EGG-INFO/scripts/pyzord", line 363, in main
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 40, in __init__
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 100, in start_reorganizing
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 65, in apply_method
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 111, in _really_reorganize
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 71, in _really_getitem
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 143, in decode_record
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 162, in decode_record_1
File "build/bdist.linux-x86_64/egg/pyzor/engines/gdbm_.py", line 23, in <lambda>
File "/usr/lib/python2.7/_strptime.py", line 325, in _strptime
(data_string, format))
ValueError: time data '2014-07-03 14:59:31' does not match format '%Y-%m-%d %H:%M:%S.%f'
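One way to tolerate both forms when decoding is to try the format with microseconds first and fall back to the one without. This is a sketch only; `parse_record_date` is a hypothetical helper and not the function in gdbm_.py.

```python
import datetime

# Formats produced by str(datetime): with and without microseconds.
FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S")


def parse_record_date(text):
    """Parse a stored date, accepting values with or without microseconds."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised date: %r" % (text,))


if __name__ == "__main__":
    print(parse_record_date("2014-08-14 11:42:17.001337"))
    print(parse_record_date("2014-07-03 14:59:31"))
```

An alternative is to always write dates with an explicit `strftime("%Y-%m-%d %H:%M:%S.%f")` instead of `str()`, although already-stored records would still need the tolerant reader.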
I installed this on Ubuntu, and am just trying to get this to run.
I tried installing via pip, and it said it worked, but it didn't actually:
$ pip install pyzor
Downloading/unpacking pyzor
Downloading pyzor-1.0.0.tar.gz
Running setup.py (path:/tmp/pip-build-L9bD5P/pyzor/setup.py) egg_info for package pyzor
Installing collected packages: pyzor
Running setup.py install for pyzor
changing mode of build/scripts-2.7/pyzor from 664 to 775
changing mode of build/scripts-2.7/pyzord from 664 to 775
changing mode of build/scripts-2.7/pyzor-migrate from 664 to 775
changing mode of /home/ace/.local/bin/pyzor-migrate to 775
changing mode of /home/ace/.local/bin/pyzor to 775
changing mode of /home/ace/.local/bin/pyzord to 775
Successfully installed pyzor
Cleaning up...
Then got this:
$ pyzor
The program 'pyzor' is currently not installed. You can install it by typing:
sudo apt-get install pyzor
So I did what it suggested:
$ sudo apt-get install pyzor
And it seemed to install:
$ sudo apt-get install pyzor
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
python-gdbm python-support
Suggested packages:
python-gdbm-dbg
The following NEW packages will be installed:
python-gdbm python-support pyzor
0 upgraded, 3 newly installed, 0 to remove and 3 not upgraded.
Need to get 67.6 kB of archives.
After this operation, 428 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://us.archive.ubuntu.com/ubuntu/ wily/main python-gdbm amd64 2.7.9-1 [11.9 kB]
Get:2 http://us.archive.ubuntu.com/ubuntu/ wily/universe python-support all 1.0.15 [26.7 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu/ wily/universe pyzor all 1:0.5.0-2fakesync1 [29.1 kB]
Fetched 67.6 kB in 0s (466 kB/s)
Selecting previously unselected package python-gdbm.
(Reading database ... 211423 files and directories currently installed.)
Preparing to unpack .../python-gdbm_2.7.9-1_amd64.deb ...
Unpacking python-gdbm (2.7.9-1) ...
Selecting previously unselected package python-support.
Preparing to unpack .../python-support_1.0.15_all.deb ...
Unpacking python-support (1.0.15) ...
Selecting previously unselected package pyzor.
Preparing to unpack .../pyzor_1%3a0.5.0-2fakesync1_all.deb ...
Unpacking pyzor (1:0.5.0-2fakesync1) ...
Processing triggers for man-db (2.7.4-1) ...
Setting up python-gdbm (2.7.9-1) ...
Setting up python-support (1.0.15) ...
Setting up pyzor (1:0.5.0-2fakesync1) ...
Processing triggers for python-support (1.0.15) ...
But when I tried to run anything, I get this:
$ pyzor ping
Traceback (most recent call last):
File "/usr/bin/pyzor", line 8, in <module>
pyzor.client.run()
AttributeError: 'module' object has no attribute 'run'
Am I doing something wrong or is there an error? I'm running Ubuntu 15.10.
It would be useful to have a script for migrating data between the various engine types (e.g. migrating MySQL data to Redis).
This doesn't necessarily have to be deployed.
Thank you for moving here.
It would be nice if you could set a DSN for a Sentry server in the configuration and, if Raven was installed, this would add logging to that server (perhaps also a log level in the configuration).
Now that the code can work with Python 3.3, provided that 2to3 has been run first, we can consider switching the code to be native Python 3.3 and supporting Python 2.7 with the 3to2 tool.
This would make supporting both versions considerably easier, because there are more (good) restrictions in Python 3 (especially the distinction between unicode and byte arrays).
Hello.
Please add new release to https://github.com/SpamExperts/pyzor/releases.
It seems that handling requests in separate threads/processes doesn't work well under real traffic conditions (although benchmarking PyPy + multi-threading did show some promising results).
We should try to add support for pre-forking and see how that performs. This does mean we will likely need to hack around SocketServer.UDPServer, or even stop using it completely and implement our own version.
We currently increment the report and whitelist counts in two steps: first we get the record from the database engine, and then we reinsert it. Some back-end engines (such as MySQL, and Redis if we use hashes instead of strings) support doing this in a single step.
We should change the way the commands are dispatched and have the MySQL engine use this enhancement.
We'll do MySQL first because it's easiest, hopefully in time for 0.9 release. With Redis we'll need to change the data structure, and we'll also have to provide a migration tool. (we'll do that as well in a future version).
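The idea of a single-step increment can be sketched with the standard library's `sqlite3` module standing in for MySQL. The table and column names are hypothetical and do not reflect the actual Pyzor schema.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE digests ("
    "digest TEXT PRIMARY KEY, r_count INTEGER NOT NULL DEFAULT 0)"
)


def report(conn, digest):
    """Increment the report count inside the database, without reading it first."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE digests SET r_count = r_count + 1 WHERE digest = ?",
            (digest,),
        )
        if cur.rowcount == 0:
            # First report for this digest: create the row.
            conn.execute(
                "INSERT INTO digests (digest, r_count) VALUES (?, 1)",
                (digest,),
            )


report(conn, "abc")
report(conn, "abc")
count = conn.execute(
    "SELECT r_count FROM digests WHERE digest = ?", ("abc",)
).fetchone()[0]
print(count)
```

The increment itself never leaves the database, so two concurrent reports cannot overwrite each other's count. In MySQL the update-or-insert pair could likely be collapsed further with `INSERT ... ON DUPLICATE KEY UPDATE`.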
In order to evaluate various improvements and new technologies we are considering adding to the pyzor server we would require a suite of tests.
This build broke, not because of the change, but because of a timing issue.
The tests don’t mock out the send() method, so there’s an assumption that send() will take place in the same second in which the expected request is constructed. That’s fragile, so it will break sometimes, as it did here.
We probably don’t want to mock send(), because that will change what the tests are testing too much. It might be simplest to mock time.time() itself.
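Freezing the clock could be sketched with `unittest.mock`. `build_request` is a hypothetical stand-in for the code that stamps the request with the current time; it is not the actual client code.

```python
import time
from unittest import mock


def build_request(user="anonymous"):
    """Hypothetical request builder that embeds the current Unix time."""
    return "User: %s\nTime: %d\n" % (user, int(time.time()))


def test_request_time_is_stable():
    # While patched, every call to time.time() returns the same value, so
    # the expected and the sent request cannot straddle a second boundary.
    with mock.patch("time.time", return_value=1405945394.0):
        expected = build_request()
        sent = build_request()
    assert sent == expected
    assert "Time: 1405945394" in sent


if __name__ == "__main__":
    test_request_time_is_stable()
    print("ok")
```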
Sorry if this is the wrong place to ask.
I got a report saying that I am listed in Pyzor:
-1.985 PYZOR_CHECK Listed in Pyzor (http://pyzor.sf.net/)
How can I request delisting?
I am lost in this technology.
Hello, I have an issue
Checking mail-tester.com shows me that I'm listed in Pyzor.
"PYZOR_CHECK Listed in Pyzor (http://pyzor.sf.net/)"
But I don't have the required information to complete the white list form. (http://public.pyzor.org/whitelist/)
I need the Pyzor digest, and Raw message. I don't have those, and I don't know where to find that information.
Is there anyone I could contact for help on this, or any way to get the required information?
My server is listed as spam on Pyzor. When I try to whitelist it using your form, it responds: "Message not reported as spam". I don't know how to proceed. What's wrong?
Thanks
There are various situations where after normalization the message ends up empty. For example this happens when the message is short and/or only contains links.
In this case we would still want to attempt to create a unique signature for the messages. This is, however, difficult because we don't have too much to go on.
from datetime import datetime

from rest_framework import serializers
from rest_framework.serializers import ValidationError

from sponsorapp.models import Sponsormodel


class SponsorSerializer(serializers.ModelSerializer):
    offer_date = serializers.DateTimeField(default=None)
    sponsorship_date = serializers.DateTimeField(default=None)

    class Meta:
        model = Sponsormodel
        fields = ('sponsorship_date', 'offername', 'offer_date')

    def validate_offername(self, value):
        # Reject offer names longer than 9 characters.
        if len(value) > 9:
            raise ValidationError("Please enter 9 characters or fewer")
        return value

    def validate_sponsorship_date(self, value):
        # The raw offer_date arrives as a string in '%Y-%m-%d %H:%M' form.
        offer_date = self.get_initial().get('offer_date')
        offer_date = datetime.strptime(offer_date, '%Y-%m-%d %H:%M')
        # Compare datetimes rather than a datetime with a formatted string.
        # tzinfo is dropped on the assumption that both values share a time zone.
        if value.replace(tzinfo=None) > offer_date:
            raise ValidationError("Check the dates")
        return value


# Model definition, imported above from sponsorapp.models.
from django.db import models


class Sponsormodel(models.Model):
    sponsorship_date = models.DateTimeField(default=None)
    offername = models.CharField(max_length=100)
    offer_date = models.DateTimeField(default=None)

    def __str__(self):
        return self.offername
Hi,
I got the following traceback while trying to check a message with pyzor.
$ pyzor -s mbox check < spam.mbox
Traceback (most recent call last):
File "/usr/bin/pyzor", line 408, in <module>
main()
File "/usr/bin/pyzor", line 152, in main
if not dispatch(client, servers, config):
File "/usr/bin/pyzor", line 237, in check
for digested in get_input_handler(style):
File "/usr/bin/pyzor", line 181, in _get_input_mbox
tfile.write(sys.stdin.read().encode("utf8"))
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 2649: invalid start byte
I'm using pyzor 1.0.0 with python 3.4.3. I can provide the spam message that caused the crash if needed.
BR
Atanas
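The traceback above shows `sys.stdin.read()` decoding the input as UTF-8 before it is re-encoded. A possible way around this, sketched here as an assumption and not as the project's actual fix, is to copy the raw bytes without decoding them. `copy_raw_input` is a hypothetical helper.

```python
import io
import tempfile


def copy_raw_input(stream):
    """Copy a binary stream into a temporary file without decoding it."""
    tfile = tempfile.NamedTemporaryFile()
    # Bytes are written as-is, so bytes that are invalid UTF-8 (such as
    # 0xa3) cannot raise UnicodeDecodeError.
    tfile.write(stream.read())
    tfile.seek(0)
    return tfile


if __name__ == "__main__":
    raw = io.BytesIO(b"From sender\n\n\xa3100 prize\n")
    with copy_raw_input(raw) as tfile:
        print(tfile.read())
```

Under Python 3 the function would be called with `sys.stdin.buffer`, which exposes standard input as bytes.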
Fedora really cares about getting the licensing absolutely correct, and we'd like to know if the "any later version" clause in v2 of the GPL applies. In an old version (0.5.0) usage.html seemed to indicate that this was the case, and Fedora had the license as "GPLv2 or later". But usage.html is gone now, and the license section of the docs at http://pyzor.readthedocs.org/en/release-1-0-0/introduction.html#license seems to indicate "GPLv2 only". I know it's minor, but could you clarify?
Rather than listing every username other than anonymous, it would be convenient to have a "non-anonymous" or "authenticated" (or whatever name) string that could be used in the access file.
For example, this could be used to provide whitelist access to any user other than anonymous users.
Right now the Pyzor code is compatible with both Python 2 and Python 3, but only if the code is converted with 2to3 first. When installing with pip this is done automatically; however, it would be better to use python-future so that no conversion is required.
When pyzord daemonizes itself with the -detach option, the Timer thread that handles expiry is killed.
We need to start expiring after detaching.
For long-running daemons/operations (such as the new forwarder system), the Pyzor client could be improved to batch the reports (or whitelists) to the server. The server will also need to adjust its protocol to accept multiple digests.
ISTM that the simplest way to do this is by appending more Op-Digest headers to the request.
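A sketch of what a batched request with repeated Op-Digest headers might look like, using the standard library's `email.message.Message` as a generic header container. `build_batch_request` is a hypothetical helper, and a real request would carry further headers (user, time, signature) that are omitted here.

```python
from email.message import Message


def build_batch_request(op, digests):
    """Build a request carrying one Op-Digest header per digest."""
    request = Message()
    request["Op"] = op
    for digest in digests:
        # Assigning the same header name again appends another header;
        # it does not overwrite the previous one.
        request["Op-Digest"] = digest
    return request


if __name__ == "__main__":
    req = build_batch_request("report", ["aaa111", "bbb222"])
    print(req.get_all("Op-Digest"))
```

On the server side, `get_all("Op-Digest")` returns every digest in the order in which it was added.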
It could be interesting to add similar concepts used by "Distributed Checksum Clearinghouses" (http://www.rhyolite.com/dcc/) to the Pyzor filtering system.
~ # pyzor check < 1ZKiRG-0007ja-DV--7132194856063680649
Traceback (most recent call last):
File "/usr/local/bin/pyzor", line 408, in <module>
main()
File "/usr/local/bin/pyzor", line 152, in main
if not dispatch(client, servers, config):
File "/usr/local/bin/pyzor", line 237, in check
for digested in get_input_handler(style):
File "/usr/local/bin/pyzor", line 175, in _get_input_msg
digested = digester(msg).value
File "/usr/local/lib/python2.7/site-packages/pyzor/digest.py", line 82, in __init__
for payload in self.digest_payloads(msg):
File "/usr/local/lib/python2.7/site-packages/pyzor/digest.py", line 160, in digest_payloads
payload = payload.decode(charset, errors)
TypeError: decode() argument 1 must be string without null bytes, not str
Probably something to do with this:
Content-Type: text/plain;
charset="iso-8859-1^@^@^@Content-Transfer-Encoding: quoted-printable
We need to remove any '\x00' (NULL characters) when processing messages.
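A sketch of one way to handle this, assuming the charset is available as a string and the payload as bytes. For the charset it truncates at the first NUL instead of simply deleting NULs, because deleting them from the header shown above would leave an unknown codec name. `clean_charset` and `decode_payload` are hypothetical helpers.

```python
import codecs


def clean_charset(charset, default="ascii"):
    """Return a usable codec name, ignoring anything after a NUL byte."""
    candidate = (charset or "").split("\x00", 1)[0].strip()
    try:
        codecs.lookup(candidate)
    except LookupError:
        # Empty or unknown codec name: fall back to the default.
        return default
    return candidate


def decode_payload(payload, charset, errors="ignore"):
    """Decode `payload` (bytes) after dropping NUL bytes from it."""
    return payload.replace(b"\x00", b"").decode(clean_charset(charset), errors)


if __name__ == "__main__":
    broken = "iso-8859-1\x00\x00\x00Content-Transfer-Encoding: quoted-printable"
    print(clean_charset(broken))
    print(decode_payload(b"caf\xe9\x00", broken))
```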
A user reported the following backtrace in https://bugzilla.redhat.com/show_bug.cgi?id=1288853
(Please ignore the fact that the reporter has no interpersonal skills whatsoever.)
Traceback (most recent call last):
File "/usr/bin/pyzor", line 408, in <module>
main()
File "/usr/bin/pyzor", line 152, in main
if not dispatch(client, servers, config):
File "/usr/bin/pyzor", line 239, in check
send_digest(digested, mock_runner, servers)
File "/usr/bin/pyzor", line 262, in send_digest
_send_digest(runner, servers[0], digested)
File "/usr/bin/pyzor", line 253, in _send_digest
runner.run(server, (digested, server))
File "/usr/lib/python3.4/site-packages/pyzor/client.py", line 258, in run
response = self.routine(*args, **kwargs)
File "/usr/lib/python3.4/site-packages/pyzor/client.py", line 122, in _mock_check
pyzor.proto_version))
TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'
I'm pretty sure this is simply Python 3-incompatible code. What I don't yet understand is why I can't reproduce it. I will have to dig into this further.
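One detail that may be relevant, although it is not confirmed in the report: `%` formatting on `bytes` was only added in Python 3.5 (PEP 461), so the same line raises `TypeError` on 3.4 and works on later versions. A version-independent sketch is to format as text and encode afterwards. The message layout below is hypothetical and modelled on the response format shown earlier.

```python
def build_mock_response(code, diag, proto_version):
    """Format the response as text first, then encode it to bytes."""
    # str % tuple works on every Python 3 release, unlike bytes % tuple,
    # which is only available from Python 3.5 onwards.
    text = "Code: %s\nDiag: %s\nPV: %s\n\n" % (code, diag, proto_version)
    return text.encode("utf8")


if __name__ == "__main__":
    print(build_mock_response(200, "OK", "2.1"))
```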
It would be nice if we could support listening/connecting on Unix sockets as well.
We need to check if Pyzor is compatible with PyPy3 and automatically run tests on Travis-CI.
Hello,
Every single message coming from my server is being marked as "listed" on Pyzor.
Please find attached one of those emails that was sent using the https://www.mail-tester.com website.
false-positive.txt
You can see the report here: https://www.mail-tester.com/web-1rRmDh
I installed Pyzor using Debian and followed your documentation to generate the digest of this email file.
pyzor digest < false-positive.txt
a3917acbee3f33d744611512992355f721fdfdb7
I then went to the form to whitelist it, and the website reports "Digest does not match message."
Am I missing something obvious here?
Thank you in advance,
The module docstring for pyzor.client refers on line 22 to a function pyzor.digest.get_digest that does not exist.
To get a digest (of an email.message.Message object, or similar):
>>> digest = pyzor.digest.get_digest(msg)
I had to read the pyzor script to figure out how to get a digest. It would be nice to have up-to-date documentation.
Also, I noticed there's a typo in the docstring for ClientRunner.handle_response on line 266 in client.py.
Thanks
Rather than sending requests to multiple servers sequentially, we could use threads and send them all at once, then collate the final result.
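A rough sketch of the idea using `concurrent.futures`. `query_all` is a hypothetical helper, and `query` stands for any callable that sends a request to a single server and returns its response.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(servers, query):
    """Call query(server) for every server concurrently and collate results."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        futures = {server: pool.submit(query, server) for server in servers}
    # Leaving the `with` block waits for all submitted calls to finish.
    results = {}
    for server, future in futures.items():
        try:
            results[server] = future.result()
        except Exception as exc:
            # Keep the failure alongside the successful responses.
            results[server] = exc
    return results


if __name__ == "__main__":
    def fake_query(server):
        # Stand-in for a network call to one server.
        return "200 OK from %s:%d" % server

    print(query_all([("127.0.0.1", 24441), ("127.0.0.1", 24442)], fake_query))
```

Storing exceptions in the result mapping lets one unreachable server be reported without discarding the responses from the others.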
This is not really an issue.
But this is the tenth time a customer has told me "the mail server is not properly configured" because they sent a message to mail-tester.com with just "test" as the content.
Maybe it would be nice to whitelist this content.
Please consider adding a release feed.
Pyzor release notifications could then be received through an RSS reader.
We should start using Travis-CI to run the tests. To run the full suite of tests we need the following libraries:
As a pre-execution step we (still) require:
Running the tests on Python 3 also requires refactoring the source code with 2to3.
We should also take this opportunity to test the compatibility with PyPy.
Yet another unicode decode problem that needs to be solved.
File "pyzor/digest.py", line 59, in __init__
lines.append(norm.encode("utf8"))
UnicodeDecodeError 'ascii' codec can't decode byte 0xed in position 7: ordinal not in range(128)
Hi there!
I don't know why I'm listed on Pyzor and I don't know how to get removed. Could you please help me?
My domain is: upnetworks.com.br
I'm following this link: http://www.mail-tester.com/web-wb0iPJ
Thanks
I'm running a Fedora 21 system and have noticed that SELinux complains with this error message quite frequently. Sometimes it is preceded by the following error but not every time:
python[22066]: detected unhandled Python exception in '/usr/bin/pyzor'
Below are some more specifics regarding the system:
kernel-4.1.5-100.fc21.x86_64
pyzor-0.5.0-10.fc21.noarch
Python 2.7.8 (default, Apr 15 2015, 09:26:43)
[GCC 4.9.2 20150212 (Red Hat 4.9.2-6)] on linux2
If it should have access then I want to file a bug report with the SELinux folks but I figured it made sense to start here. Can someone tell me why pyzor is attempting to access this file?
Thanks!
A legit email from Twitter ("you have new followers") got falsely marked as spam by SpamAssassin, due in part to Pyzor for some odd reason (see below). How do I report this or figure out which link triggered it?
Content analysis details: (5.3 points, 5.0 required)
 pts rule name              description
---- ---------------------- --------------------------------------------------
-1.0 ALL_TRUSTED            Passed through trusted hosts only via SMTP
 3.5 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                            [score: 1.0000]
 1.1 URI_HEX                URI: URI hostname has long hexadecimal sequence
 0.2 BAYES_999              BODY: Bayes spam probability is 99.9 to 100%
                            [score: 1.0000]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 DKIM_SIGNED            Message has a DKIM or DK signature, not necessarily valid
 1.4 PYZOR_CHECK            Listed in Pyzor (http://pyzor.sf.net/)
 0.0 T_DKIM_INVALID         DKIM-Signature header exists but is not valid
Originally reported by gryphius.
We're seeing a lot of pyzor "false positives" from messages with attachments but little or no body text. These messages are all different but generate the same digest, da39a3ee5e6b4b0d3255bfef95601890afd80709, which is the SHA-1 sum of the empty string. It looks like this is the digest produced if all content is stripped out by the pyzor normalizer.
Current public.pyzor.org result for this hash:
public.pyzor.org:24441 (200, 'OK') 159015 5706
pyzord could maybe treat this special hash as statically whitelisted (without the need to have clients submit this hash into the whitelist first) and always return a zero hit count.
This would be especially helpful in SpamAssassin setups, where only the hit count is checked ( https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6108 )
If hardcoding this hash is not an option, you could maybe add a config option to read a static whitelist from a file.
Attached is a quick & dirty patch we're using to skip this hash.
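The digest in question can be checked directly, and the proposed behaviour sketched in a few lines. `STATIC_WHITELIST` and `hit_count` are hypothetical names and are not the attached patch.

```python
import hashlib

# SHA-1 of the empty string: what remains when normalization strips everything.
EMPTY_DIGEST = hashlib.sha1(b"").hexdigest()

# Hypothetical static whitelist; it could also be loaded from a file.
STATIC_WHITELIST = frozenset([EMPTY_DIGEST])


def hit_count(digest, stored_count):
    """Return zero for statically whitelisted digests, else the stored count."""
    if digest in STATIC_WHITELIST:
        return 0
    return stored_count


if __name__ == "__main__":
    print(EMPTY_DIGEST)
    print(hit_count(EMPTY_DIGEST, 159015))
```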
When a message has a lot of <style> content, often the pre-digest is filled with this instead of something that actually identifies the content. It would be better to remove this in the same way that the tags themselves are removed. (<script> is uncommon since it's generally ignored, but we might as well remove that as well).
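A minimal sketch of stripping these blocks with a regular expression before the remaining tags are removed. `strip_style_and_script` is a hypothetical helper; a real HTML parser may cope better with malformed markup.

```python
import re

# Matches <style>...</style> and <script>...</script>, including attributes,
# across line breaks and regardless of case.
STYLE_SCRIPT = re.compile(
    r"<(style|script)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def strip_style_and_script(html):
    """Remove style and script elements together with their content."""
    return STYLE_SCRIPT.sub("", html)


if __name__ == "__main__":
    sample = (
        "<p>Hi</p><style type='text/css'>p {color: red}</style>"
        "<SCRIPT>x()</SCRIPT>bye"
    )
    print(strip_style_and_script(sample))
```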
We should add a check to the whitelist request web service (http://public.pyzor.org/whitelist/) that the message has actually been reported as spam and isn't already whitelisted.
If either of these conditions is not met, we should show an appropriate error message.
When running tests in a non-UTC timezone, the MySQL server tests fail:
self.fail("Delta %s is too big: %s, %s" % (delta , date1, date2))
E AssertionError: Delta 18000.877231 is too big: 2016-01-15 09:41:12, 2016-01-15 14:41:12.877231
We need to fix the tests or the code to handle this as well.