mnemosyne's People

Contributors

jatrost, johnnykv

mnemosyne's Issues

Error when normalizing long content to the dork collection.

I found some errors in mnemosyne.err, shown below.

OperationFailure: Btree::insert: key too large to index, failing mnemosyne.dork.$content_1 1233 { : "/999999.9+/%2A%2A/uNiOn/%2A%2A/aLl+/%2A%2A/sElEcT+0x393133353134353632312e39,0x393133353134353632322e39,0x393133353134353632332e39,0x39313335313435363..." }

It could be that the content is too long to be indexed.
I am now using a hashed index on the content instead of a plain text index:

https://github.com/johnnykv/mnemosyne/blob/master/persistance/mnemodb.py#L48

from pymongo import MongoClient, HASHED


self.db.dork.ensure_index([('content', HASHED)], unique=False, background=True)

Now it seems to work fine.
If you have any suggestions, please let me know.
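
For reference, a minimal sketch of the same fix against newer pymongo releases, where ensure_index() is deprecated in favour of create_index(); the host, port and database name are placeholders:

from pymongo import MongoClient, HASHED

# Sketch only: same idea as above, but with create_index(). Host/port and
# database name are placeholders, not taken from the project's config.
db = MongoClient('localhost', 27017).mnemosyne
# A hashed index stores a hash of the value, so long dork content no longer
# exceeds MongoDB's index key size limit. Hashed indexes cannot be unique.
db.dork.create_index([('content', HASHED)], background=True)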

Improve dork filter

The dork filter needs to be improved. For starters, the following is required:

  • Filter invalid and strange paths, for example:
    • /wp-content/themes/sportpress/scripts/wp-content/themes/sportpress/scripts/timthumb.php
    • /shop.pl/wp-content/themes/eStore/wp-content/themes/eStore/framework/thumb/thumb.php
    • /axis-cgi/mjpg/dork.php//plugin/replace/plugin.php
  • Filter non-relevant paths, for example:
    • /w00tw00t.at.ISC.SANS.DFind

https://github.com/johnnykv/mnemosyne/blob/master/normalizer/modules/glastopf_events.py#L30-L48
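
A rough sketch of the kind of filter described above; the function name and the individual checks are illustrative and not part of the current glastopf_events.py normalizer:

# Hypothetical helper sketching the checks described above.
SCANNER_NOISE = ('w00tw00t.at.ISC.SANS.DFind',)

def is_valid_dork_path(path):
    # Drop known scanner probes that carry no dork value.
    if any(noise in path for noise in SCANNER_NOISE):
        return False
    # Drop malformed paths containing empty segments ("//").
    if '//' in path:
        return False
    # Drop paths where a directory segment repeats, e.g.
    # /wp-content/themes/x/wp-content/themes/x/timthumb.php
    segments = [s for s in path.split('/') if s]
    if len(segments) != len(set(segments)):
        return False
    return True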

Generate statistics at regular intervals.

The current way of serving stats (/aux/get_hpfeed_stats and /aux/get_hpfeeds_channels) is too inefficient: the statistics are recomputed on every request. They need to be generated at regular intervals and cached with Beaker, or stored as a document in Mongo.
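
A minimal sketch of generating the stats on a timer and caching them as a Mongo document; the collection and field names here are assumptions, not the project's actual schema:

import gevent
from pymongo import MongoClient

def generate_stats_periodically(interval=3600):
    # Host/port and database name are placeholders; count()/update() follow
    # the pymongo 2.x API used elsewhere in the project.
    db = MongoClient('localhost', 27017).mnemosyne
    while True:
        stats = {
            'hpfeeds_total': db.hpfeed.count(),
            'channels': db.hpfeed.distinct('channel'),
        }
        # Overwrite a single cached document that /aux/get_hpfeed_stats could
        # serve directly instead of recomputing on every request.
        db.counts.update({'_id': 'hpfeed_stats'}, {'$set': stats}, upsert=True)
        gevent.sleep(interval)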

Feedpuller crashes at rare intervals

There seems to be a bug in the feedpuller. If a communication error occurs at the "right" moment, it kills the feedpuller greenlet completely.

2013-02-21 12:47:56,303 (root) Mongo collection count: url: 192001,  hpfeed: 7552164 (635 in error state),  session: 7535125,  dork: 868,  file: 11616,
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 390, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/johnny/mnemosyne/hpfeeds/feedpuller.py", line 58, in start_listening
    self.hpc.run(on_message, on_error)
  File "/home/johnny/mnemosyne/hpfeeds/hpfeeds.py", line 134, in run
    d = self.s.recv(BUFSIZ)
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 423, in recv
    return sock.recv(*args)
error: [Errno 104] Connection reset by peer
<Greenlet at 0x1498910: <bound method FeedPuller.start_listening of <hpfeeds.feedpuller.FeedPuller instance at 0x14f02d8>>> failed with error

2013-02-21 13:18:46,056 (root) Mongo collection count: url: 192006,  hpfeed: 7552696 (635 in error state),  session: 7535656,  dork: 868,  file: 11616,
2013-02-21 13:49:36,255 (root) Mongo collection count: url: 192006,  hpfeed: 7552696 (635 in error state),  session: 7535656,  dork: 868,  file: 11616,
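
A minimal sketch of one way to keep the greenlet alive, using the method name from the traceback above; the retry and backoff logic is illustrative, not the project's actual fix:

import logging
import gevent

def start_listening_with_retry(feedpuller, backoff=10):
    # Wrap FeedPuller.start_listening() so a connection reset does not kill
    # the greenlet; instead log the error, wait, and reconnect.
    while True:
        try:
            feedpuller.start_listening()
        except Exception as err:
            logging.warning('FeedPuller died with %r, retrying in %s seconds.',
                            err, backoff)
            gevent.sleep(backoff)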

Filter out private networks

Filter out honeypot sessions whose source_ip comes from a private network - most of this data comes from researchers testing their own honeypots.
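
A minimal sketch of such a check, assuming Python 3's standard ipaddress module (the project itself runs on Python 2.7, where the backport would need unicode input):

import ipaddress

def is_private_source(source_ip):
    # True for RFC 1918, loopback and link-local sources, which could be
    # skipped (or flagged) during normalization.
    try:
        return ipaddress.ip_address(source_ip).is_private
    except ValueError:
        # Unparsable addresses are left for the normalizer to reject.
        return False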

Search urls by extractions

Allow the urls resource to be queried by hash, so that malware distributed from several sites can be discovered.
Example:

GET /api/d/urls?hash=52e714e5a070a7d66a8b75a3cbafedc1

This work item should only involve modifying urls.py
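
A sketch of how the hash parameter could translate into a Mongo query; the extractions.hashes field layout is an assumption about the url document schema:

def find_urls_by_hash(db, digest):
    # Match the digest against any of the hash types stored for a url's
    # extracted files; works for md5, sha1 or sha512 values.
    query = {'$or': [
        {'extractions.hashes.md5': digest},
        {'extractions.hashes.sha1': digest},
        {'extractions.hashes.sha512': digest},
    ]}
    return list(db.url.find(query))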

Search files without returning payloads

Allow files to be searched without returning the payload.

As with the urls resource, allow the files resource to be queried by hash so that malware distributed from several sites can be discovered.
Example:

GET /api/d/files?hash=52e714e5a070a7d66a8b75a3cbafedc1&no_data
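
A sketch of the corresponding lookup: when no_data is present, exclude the payload from the projection. The hashes and data field names are inferred from the error output elsewhere in this tracker, and the parameter handling is illustrative:

def find_files(db, digest, no_data=False):
    query = {'$or': [
        {'hashes.md5': digest},
        {'hashes.sha1': digest},
        {'hashes.sha512': digest},
    ]}
    # Excluding 'data' keeps potentially large payloads out of the response.
    projection = {'data': False} if no_data else None
    return list(db.file.find(query, projection))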

mnemosyne normalization issue with glastopf.events time/timezone

I use MHN to deploy some Glastopf honeypots, and when I checked the logs in MongoDB (in the mnemosyne.session and mnemosyne.hpfeed collections), I found a strange situation.

In the session collection, I find this document:
{ "_id" : ObjectId("5a6f472d663a5c0b58caccbd"), "protocol" : "http", "hpfeed_id" : ObjectId("5a6f4728663a5c0b58cacbdb"), "timestamp" : ISODate("2018-01-30T00:09:09Z"), "source_ip" : "66.249.69.56", "session_http" : { "request" : { "body" : "", "header" : [ [ "from", "googlebot(at)googlebot.com" ], [ "accept-encoding", "gzip,deflate,br" ], [ "connection", "keep-alive" ], [ "accept", "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8" ], [ "user-agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ], [ "host", "123.123.192.150" ] ], "host" : "123.123.192.150", "verb" : "GET", "path" : "/192.163/base.php?sivu" } }, "source_port" : 62005, "destination_port" : 80, "identifier" : "2dce214a-52b9-11e5-9583-b82a72dbb96d", "honeypot" : "glastopf" }

In this document the timestamp is ISODate("2018-01-30T00:09:09Z").

but when I use "hpfeed_id" : ObjectId("5a6f4728663a5c0b58cacbdb") to find the document before normailze, I find this document as follow :

LocalDB:PRIMARY> db.hpfeed.find({_id:ObjectId("5a6f4728663a5c0b58cacbdb")})
{ "_id" : ObjectId("5a6f4728663a5c0b58cacbdb"), "ident" : "2dce214a-52b9-11e5-9583-b82a72dbb96d", "timestamp" : ISODate("2018-01-29T16:09:12.517Z"), "normalized" : true, "payload" : { "pattern" : "unknown", "time" : "2018-01-30 00:09:09", "filename" : null, "source" : [ "66.249.69.56", 62005 ], "request_raw" : "GET /192.163/base.php?sivu HTTP/1.1\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8\r\nAccept-Encoding: gzip,deflate,br\r\nConnection: keep-alive\r\nFrom: googlebot(at)googlebot.com\r\nHost: 123.123.192.150\r\nUser-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "request_url" : "/192.163/base.php?sivu" }, "channel" : "glastopf.events" }

In this document from the hpfeed collection, the timestamp is ISODate("2018-01-29T16:09:12.517Z").

It looks like this event has different timestamps in the mnemosyne.hpfeed and mnemosyne.session collections.
I checked glastopf_events.py and found that the make_session() function uses datetime.strptime(data['time'], '%Y-%m-%d %H:%M:%S') as the session's timestamp.
But the "time" field in the hpfeed payload is in our local timezone, so I guess the "time" field shouldn't be used as the "timestamp".
Is that right?
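
For illustration, the eight-hour gap in the example above can be reproduced like this; the offset is specific to this reporter's honeypot, and a real fix would more likely reuse the hpfeed document's UTC timestamp or carry an explicit timezone in the payload:

from datetime import datetime, timedelta

# Parse the local 'time' string exactly as make_session() does today ...
local_time = datetime.strptime('2018-01-30 00:09:09', '%Y-%m-%d %H:%M:%S')
# ... then shift it by the honeypot's UTC offset (+8 hours in this example)
# to land on the UTC time the hpfeed document already records.
utc_time = local_time - timedelta(hours=8)
print(utc_time)  # 2018-01-29 16:09:09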

Pre-Aggregated Reports

Pre-aggregated reports for the following observables must be created regularly. The intervals could be daily, weekly and monthly.

  • Attacking IPs
  • Passwords used
  • Usernames used
  • Combinations of usernames and passwords used

Could be served at /aux/ips, /aux/passwords, /aux/usernames, etc.

Also, the current way of serving hpfeed stats is broken; it needs to be reworked into something like the above.

Inspiration: http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/ and http://docs.mongodb.org/manual/use-cases/hierarchical-aggregation/
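
A minimal sketch of a daily pre-aggregated counter, following the MongoDB pattern linked above; the daily_ips collection and its field names are assumptions:

def count_attacking_ip(db, day, source_ip):
    # One document per (day, ip) pair, incremented on every session, so reads
    # become a cheap lookup instead of a full aggregation. Uses the pymongo
    # 2.x-style update() call seen elsewhere in the project.
    db.daily_ips.update(
        {'_id': '%s:%s' % (day, source_ip)},
        {'$inc': {'count': 1}, '$set': {'date': day, 'ip': source_ip}},
        upsert=True)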

Concurrency issue while upserting

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 390, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/johnny/mnemosyne/normalizer/normalizer.py", line 102, in inserter
    self.database.insert_normalized(norm, id)
  File "/home/johnny/mnemosyne/persistance/mnemodb.py", line 71, in insert_normalized
    upsert=True)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 481, in update
    check_keys, self.__uuid_subtype), safe)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 844, in _send_message
    rv = self.__check_response_to_last_error(response)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 785, in __check_response_to_last_error
    raise DuplicateKeyError(details["err"])
DuplicateKeyError: E11000 duplicate key error index: mnemosyne.file.$hashes_1  dup key: { : { sha1: "6ed8493fa285bef8f5e8cb62727e54cbb2593c87", sha512: "383ad9a6c6aae35889cb698a5ec80c56fd93daf04438633bbfbce78ae0269c8b15920fa84523db06bf05b63ddce74e5b363e95a67a91173ab7757daf1b9a8ad0", md5: "7208134c90c1b6eb7b6843a783253ba8" } }
<Greenlet at 0x20b6f50: <bound method Normalizer.inserter of <normalizer.normalizer.Normalizer object at 0x1fc49d0>>([([{'file': {'data': '0a090909090977773d77696e646f)> failed with DuplicateKeyError

2013-03-15 22:27:02,775 (root) Mongo collection count: url: 58699,  hpfeed: 11982646 (184 in error state),  session: 2149846,  file: 3539,  dork: 776, 
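
A minimal sketch of handling the race: two greenlets can both decide a file is new and race the upsert, and the loser hits the unique hashes index. Retrying turns the loser's insert into a plain update. This is illustrative, not the actual insert_normalized() code:

from pymongo.errors import DuplicateKeyError

def safe_upsert(collection, query, update):
    try:
        collection.update(query, update, upsert=True)
    except DuplicateKeyError:
        # Another greenlet inserted the same document first; the retry now
        # matches the existing document and becomes an ordinary update.
        collection.update(query, update, upsert=True)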

Investigate faster ways to renormalize.

In the current state (15,000,000 hpfeed entries), a full database renormalization (--reset) takes around 7 hours. A database renormalization consists of:

  • Dropping all collections except for the hpfeed collection.
  • Recreating indexes.
  • Processing all hpfeed entries to populate the dork, file, url and session collections.
  • Incrementing a lot of counters in the daily_stats collection.

Could this be done faster than 7 hours on the current hardware? (That is around 2.1 million hpfeed entries per hour.)
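
One direction worth measuring, sketched below: stream the hpfeed entries with a large cursor batch size and bulk-insert the normalized documents instead of upserting them one at a time. normalize_one() is a placeholder, not the project's actual normalizer API, and the host, port and target collection are assumptions:

from pymongo import MongoClient

def renormalize_in_batches(normalize_one, batch_size=5000):
    db = MongoClient('localhost', 27017).mnemosyne
    buffered = []
    for entry in db.hpfeed.find().batch_size(batch_size):
        # normalize_one() stands in for turning a raw hpfeed entry into
        # session/dork/url/file documents.
        buffered.extend(normalize_one(entry))
        if len(buffered) >= batch_size:
            db.session.insert(buffered)  # bulk insert (pymongo 2.x API)
            buffered = []
    if buffered:
        db.session.insert(buffered)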
