mnemosyne's People

Contributors

jatrost, johnnykv

mnemosyne's Issues

Error when normalizing long content to the dork collection.

I found some errors in mnemosyne.err, shown below.

OperationFailure: Btree::insert: key too large to index, failing mnemosyne.dork.$content_1 1233 { : "/999999.9+/%2A%2A/uNiOn/%2A%2A/aLl+/%2A%2A/sElEcT+0x393133353134353632312e39,0x393133353134353632322e39,0x393133353134353632332e39,0x39313335313435363..." }

It could be that the content is too long to be indexed.
I am now using a hashed index on the content instead of a plain text index:

https://github.com/johnnykv/mnemosyne/blob/master/persistance/mnemodb.py#L48

from pymongo import MongoClient, HASHED


self.db.dork.ensure_index([('content', HASHED)], unique=False, background=True)

Now it seems to work fine.
If you have any suggestions, please let me know.
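
For reference, a minimal sketch of the same fix against newer pymongo releases, where ensure_index() is deprecated in favour of create_index(); the host, port and database name are placeholders:

from pymongo import MongoClient, HASHED

# Sketch only: same idea as above, but with create_index(). Host/port and
# database name are placeholders, not taken from the project's config.
db = MongoClient('localhost', 27017).mnemosyne
# A hashed index stores a hash of the value, so long dork content no longer
# exceeds MongoDB's index key size limit. Hashed indexes cannot be unique.
db.dork.create_index([('content', HASHED)], background=True)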

Improve dork filter

The dork filter needs to be improved. For starters, the following is required:

  • Filter invalid and strange paths, for example:
    • /wp-content/themes/sportpress/scripts/wp-content/themes/sportpress/scripts/timthumb.php
    • /shop.pl/wp-content/themes/eStore/wp-content/themes/eStore/framework/thumb/thumb.php
    • /axis-cgi/mjpg/dork.php//plugin/replace/plugin.php
  • Filter non-relevant paths, for example:
    • /w00tw00t.at.ISC.SANS.DFind

https://github.com/johnnykv/mnemosyne/blob/master/normalizer/modules/glastopf_events.py#L30-L48
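
A rough sketch of the kind of filter described above; the function name and the individual checks are illustrative and not part of the current glastopf_events.py normalizer:

# Hypothetical helper sketching the checks described above.
SCANNER_NOISE = ('w00tw00t.at.ISC.SANS.DFind',)

def is_valid_dork_path(path):
    # Drop known scanner probes that carry no dork value.
    if any(noise in path for noise in SCANNER_NOISE):
        return False
    # Drop malformed paths containing empty segments ("//").
    if '//' in path:
        return False
    # Drop paths where a directory segment repeats, e.g.
    # /wp-content/themes/x/wp-content/themes/x/timthumb.php
    segments = [s for s in path.split('/') if s]
    if len(segments) != len(set(segments)):
        return False
    return True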

Generate statistics at regular intervals.

The current way of serving stats (/aux/get_hpfeed_stats and /aux/get_hpfeeds_channels) is too inefficient: the statistics are recomputed on every request. They need to be generated at regular intervals and cached with Beaker, or stored as a document in Mongo.
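
A minimal sketch of generating the stats on a timer and caching them as a Mongo document; the collection and field names here are assumptions, not the project's actual schema:

import gevent
from pymongo import MongoClient

def generate_stats_periodically(interval=3600):
    # Host/port and database name are placeholders; count()/update() follow
    # the pymongo 2.x API used elsewhere in the project.
    db = MongoClient('localhost', 27017).mnemosyne
    while True:
        stats = {
            'hpfeeds_total': db.hpfeed.count(),
            'channels': db.hpfeed.distinct('channel'),
        }
        # Overwrite a single cached document that /aux/get_hpfeed_stats could
        # serve directly instead of recomputing on every request.
        db.counts.update({'_id': 'hpfeed_stats'}, {'$set': stats}, upsert=True)
        gevent.sleep(interval)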

Feedpuller crashes at rare intervals

There seems to be a bug in the feedpuller. If a communication error occurs at the "right" moment, it kills the feedpuller greenlet completely.

2013-02-21 12:47:56,303 (root) Mongo collection count: url: 192001,  hpfeed: 7552164 (635 in error state),  session: 7535125,  dork: 868,  file: 11616,
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 390, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/johnny/mnemosyne/hpfeeds/feedpuller.py", line 58, in start_listening
    self.hpc.run(on_message, on_error)
  File "/home/johnny/mnemosyne/hpfeeds/hpfeeds.py", line 134, in run
    d = self.s.recv(BUFSIZ)
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 423, in recv
    return sock.recv(*args)
error: [Errno 104] Connection reset by peer
<Greenlet at 0x1498910: <bound method FeedPuller.start_listening of <hpfeeds.feedpuller.FeedPuller instance at 0x14f02d8>>> failed with error

2013-02-21 13:18:46,056 (root) Mongo collection count: url: 192006,  hpfeed: 7552696 (635 in error state),  session: 7535656,  dork: 868,  file: 11616,
2013-02-21 13:49:36,255 (root) Mongo collection count: url: 192006,  hpfeed: 7552696 (635 in error state),  session: 7535656,  dork: 868,  file: 11616,
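
A minimal sketch of one way to keep the greenlet alive, using the method name from the traceback above; the retry and backoff logic is illustrative, not the project's actual fix:

import logging
import gevent

def start_listening_with_retry(feedpuller, backoff=10):
    # Wrap FeedPuller.start_listening() so a connection reset does not kill
    # the greenlet; instead log the error, wait, and reconnect.
    while True:
        try:
            feedpuller.start_listening()
        except Exception as err:
            logging.warning('FeedPuller died with %r, retrying in %s seconds.',
                            err, backoff)
            gevent.sleep(backoff)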

Filter out private networks

Filter out honeypot sessions whose source_ip comes from a private network - most of this data comes from researchers testing their own honeypots.
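
A minimal sketch of such a check, assuming Python 3's standard ipaddress module (the project itself runs on Python 2.7, where the backport would need unicode input):

import ipaddress

def is_private_source(source_ip):
    # True for RFC 1918, loopback and link-local sources, which could be
    # skipped (or flagged) during normalization.
    try:
        return ipaddress.ip_address(source_ip).is_private
    except ValueError:
        # Unparsable addresses are left for the normalizer to reject.
        return False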

Search urls by extractions

Allow the urls resource to be queried by hash, so that malware distributed from several sites can be discovered.
Example:

GET /api/d/urls?hash=52e714e5a070a7d66a8b75a3cbafedc1

This work item should only involve modifying urls.py
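
A sketch of how the hash parameter could translate into a Mongo query; the extractions.hashes field layout is an assumption about the url document schema:

def find_urls_by_hash(db, digest):
    # Match the digest against any of the hash types stored for a url's
    # extracted files; works for md5, sha1 or sha512 values.
    query = {'$or': [
        {'extractions.hashes.md5': digest},
        {'extractions.hashes.sha1': digest},
        {'extractions.hashes.sha512': digest},
    ]}
    return list(db.url.find(query))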

Search files without returning payloads

Allow files to be searched without returning the payload.

As with the urls resource, allow the files resource to be queried by hash so that malware distributed from several sites can be discovered.
Example:

GET /api/d/files?hash=52e714e5a070a7d66a8b75a3cbafedc1&no_data
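
A sketch of the corresponding lookup: when no_data is present, exclude the payload from the projection. The hashes and data field names are inferred from the error output elsewhere in this tracker, and the parameter handling is illustrative:

def find_files(db, digest, no_data=False):
    query = {'$or': [
        {'hashes.md5': digest},
        {'hashes.sha1': digest},
        {'hashes.sha512': digest},
    ]}
    # Excluding 'data' keeps potentially large payloads out of the response.
    projection = {'data': False} if no_data else None
    return list(db.file.find(query, projection))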

mnemosyne normalization issue with glastopf.events time/timezone

I use MHN to deploy some Glastopf honeypots, and when I checked the logs in MongoDB (in the mnemosyne.session and mnemosyne.hpfeed collections), I found a strange situation.

In the session collection, I find this document:
{ "_id" : ObjectId("5a6f472d663a5c0b58caccbd"), "protocol" : "http", "hpfeed_id" : ObjectId("5a6f4728663a5c0b58cacbdb"), "timestamp" : ISODate("2018-01-30T00:09:09Z"), "source_ip" : "66.249.69.56", "session_http" : { "request" : { "body" : "", "header" : [ [ "from", "googlebot(at)googlebot.com" ], [ "accept-encoding", "gzip,deflate,br" ], [ "connection", "keep-alive" ], [ "accept", "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8" ], [ "user-agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ], [ "host", "123.123.192.150" ] ], "host" : "123.123.192.150", "verb" : "GET", "path" : "/192.163/base.php?sivu" } }, "source_port" : 62005, "destination_port" : 80, "identifier" : "2dce214a-52b9-11e5-9583-b82a72dbb96d", "honeypot" : "glastopf" }

In this document the timestamp is ISODate("2018-01-30T00:09:09Z").

but when I use "hpfeed_id" : ObjectId("5a6f4728663a5c0b58cacbdb") to find the document before normailze, I find this document as follow :

LocalDB:PRIMARY> db.hpfeed.find({_id:ObjectId("5a6f4728663a5c0b58cacbdb")})
{ "_id" : ObjectId("5a6f4728663a5c0b58cacbdb"), "ident" : "2dce214a-52b9-11e5-9583-b82a72dbb96d", "timestamp" : ISODate("2018-01-29T16:09:12.517Z"), "normalized" : true, "payload" : { "pattern" : "unknown", "time" : "2018-01-30 00:09:09", "filename" : null, "source" : [ "66.249.69.56", 62005 ], "request_raw" : "GET /192.163/base.php?sivu HTTP/1.1\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8\r\nAccept-Encoding: gzip,deflate,br\r\nConnection: keep-alive\r\nFrom: googlebot(at)googlebot.com\r\nHost: 123.123.192.150\r\nUser-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "request_url" : "/192.163/base.php?sivu" }, "channel" : "glastopf.events" }

In this document from the hpfeed collection, the timestamp is ISODate("2018-01-29T16:09:12.517Z").

It looks like this event has different timestamps in the mnemosyne.hpfeed and mnemosyne.session collections.
I checked glastopf_events.py and found that the make_session() function uses datetime.strptime(data['time'], '%Y-%m-%d %H:%M:%S') as the session's timestamp.
But the "time" field in the hpfeed payload is in our local timezone, so I guess the "time" field shouldn't be used as the "timestamp".
Is that right?
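
For illustration, the eight-hour gap in the example above can be reproduced like this; the offset is specific to this reporter's honeypot, and a real fix would more likely reuse the hpfeed document's UTC timestamp or carry an explicit timezone in the payload:

from datetime import datetime, timedelta

# Parse the local 'time' string exactly as make_session() does today ...
local_time = datetime.strptime('2018-01-30 00:09:09', '%Y-%m-%d %H:%M:%S')
# ... then shift it by the honeypot's UTC offset (+8 hours in this example)
# to land on the UTC time the hpfeed document already records.
utc_time = local_time - timedelta(hours=8)
print(utc_time)  # 2018-01-29 16:09:09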

Pre-Aggregated Reports

Pre-aggregated reports for the following observables must be created regularly. The intervals could be daily, weekly and monthly.

  • Attacking IPs
  • Passwords used
  • Usernames used
  • Combinations of usernames and passwords used

Could be served at /aux/ips, /aux/passwords, /aux/usernames, etc.

Also, the current way of serving hpfeed stats is broken; it needs to be reworked into something like the above.

Inspiration: http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/ and http://docs.mongodb.org/manual/use-cases/hierarchical-aggregation/
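
A minimal sketch of a daily pre-aggregated counter, following the MongoDB pattern linked above; the daily_ips collection and its field names are assumptions:

def count_attacking_ip(db, day, source_ip):
    # One document per (day, ip) pair, incremented on every session, so reads
    # become a cheap lookup instead of a full aggregation. Uses the pymongo
    # 2.x-style update() call seen elsewhere in the project.
    db.daily_ips.update(
        {'_id': '%s:%s' % (day, source_ip)},
        {'$inc': {'count': 1}, '$set': {'date': day, 'ip': source_ip}},
        upsert=True)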

Concurrency issue while upserting

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 390, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/johnny/mnemosyne/normalizer/normalizer.py", line 102, in inserter
    self.database.insert_normalized(norm, id)
  File "/home/johnny/mnemosyne/persistance/mnemodb.py", line 71, in insert_normalized
    upsert=True)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 481, in update
    check_keys, self.__uuid_subtype), safe)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 844, in _send_message
    rv = self.__check_response_to_last_error(response)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 785, in __check_response_to_last_error
    raise DuplicateKeyError(details["err"])
DuplicateKeyError: E11000 duplicate key error index: mnemosyne.file.$hashes_1  dup key: { : { sha1: "6ed8493fa285bef8f5e8cb62727e54cbb2593c87", sha512: "383ad9a6c6aae35889cb698a5ec80c56fd93daf04438633bbfbce78ae0269c8b15920fa84523db06bf05b63ddce74e5b363e95a67a91173ab7757daf1b9a8ad0", md5: "7208134c90c1b6eb7b6843a783253ba8" } }
<Greenlet at 0x20b6f50: <bound method Normalizer.inserter of <normalizer.normalizer.Normalizer object at 0x1fc49d0>>([([{'file': {'data': '0a090909090977773d77696e646f)> failed with DuplicateKeyError

2013-03-15 22:27:02,775 (root) Mongo collection count: url: 58699,  hpfeed: 11982646 (184 in error state),  session: 2149846,  file: 3539,  dork: 776, 
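
A minimal sketch of handling the race: two greenlets can both decide a file is new and race the upsert, and the loser hits the unique hashes index. Retrying turns the loser's insert into a plain update. This is illustrative, not the actual insert_normalized() code:

from pymongo.errors import DuplicateKeyError

def safe_upsert(collection, query, update):
    try:
        collection.update(query, update, upsert=True)
    except DuplicateKeyError:
        # Another greenlet inserted the same document first; the retry now
        # matches the existing document and becomes an ordinary update.
        collection.update(query, update, upsert=True)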

Investigate faster ways to renormalize.

In the current state (15,000,000 hpfeed entries), a full database renormalization (--reset) takes around 7 hours. A database renormalization consists of:

  • Dropping all collections except for the hpfeed collection.
  • Recreating indexes.
  • Processing all hpfeed entries to populate the dork, file, url and session collections.
  • Incrementing a lot of counters in the daily_stats collection.

Could this be done faster than 7 hours on the current hardware? (That is around 2.1 million hpfeed entries per hour.)
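
One direction worth measuring, sketched below: stream the hpfeed entries with a large cursor batch size and bulk-insert the normalized documents instead of upserting them one at a time. normalize_one() is a placeholder, not the project's actual normalizer API, and the host, port and target collection are assumptions:

from pymongo import MongoClient

def renormalize_in_batches(normalize_one, batch_size=5000):
    db = MongoClient('localhost', 27017).mnemosyne
    buffered = []
    for entry in db.hpfeed.find().batch_size(batch_size):
        # normalize_one() stands in for turning a raw hpfeed entry into
        # session/dork/url/file documents.
        buffered.extend(normalize_one(entry))
        if len(buffered) >= batch_size:
            db.session.insert(buffered)  # bulk insert (pymongo 2.x API)
            buffered = []
    if buffered:
        db.session.insert(buffered)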
