
soprasteria / cybersecurity-dfm


Data Feed Manager (news-watch orchestrator that predicts topics with DeepDetect and stores cleaned text in Elasticsearch)

License: GNU General Public License v3.0


cybersecurity-dfm's Introduction

Data Feeds Manager

Screenshots: ./analysis.png (analysis view) · ./explore.png (explore view)

License

Data Feeds Manager is a service that crawls feeds, extracts core text content, generates text training sets for machine learning, and manages score-based selection using predictions.

Copyright (C) 2016 Alexandre CABROL PERALES from Sopra Steria Group.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Description

Data Feeds Manager aims to manage feeds based on the data received from them.

The service crawls feeds to assess their content and ranks them against topics predicted by DeepDetect.

It lets you generate machine learning models in DeepDetect from the collected news.

It uses Elasticsearch as data storage, DeepDetect for content tagging, and Kibana for data visualization. Tiny Tiny RSS is used as an RSS feed aggregator but could be replaced by any RSS feed manager service that provides an aggregated RSS feed.

Currently RSS feeds and Twitter are supported; Reddit and Dolphin support is planned for the future.

Definition

  • News items are called docs (the doc type in Elasticsearch)
  • Sources of news are also called feeds (currently RSS or Twitter)
  • A Tag is a keyword (in RSS feeds) or a hashtag (on Twitter) that people attach to a news item before posting it
  • A Topic is a group of Tags related to the same subject
  • A Model Config is a group of Topics linked to the same theme
  • A Model is a machine learning model in DeepDetect; when supervised, it uses a Training Set for generation
  • A Training Set is an extraction of all news related to a Model, grouped by Topic, used to train a supervised machine learning algorithm

Requirements

The reference platform is Ubuntu Server 16.04 LTS.

According to ElasticSearch: "Less than 8 GB tends to be counterproductive (you end up needing many, many small machines), and greater than 64 GB has problems."

DeepDetect can use an NVIDIA CUDA GPU or a standard x86_64 CPU. The current DFM install does not enable the GPU feature of DeepDetect. See more here: https://github.com/beniz/deepdetect#caffe-dependencies

DFM will crawl large amounts of data from the web if you have multiple RSS feeds or Twitter searches. Good bandwidth with unlimited traffic is recommended (fiber, etc.).

Minimal hardware might be: 8 GB RAM, 4 CPUs, a 500 GB hard disk, and 24 Mb/s Internet bandwidth.

Recommended hardware might be: 64 GB RAM, 32 CPUs or 2 NVIDIA GPUs (Tesla), a 2 TB SSD, and 10 Gb/s Internet bandwidth.

Install

This installation has been tested on Ubuntu 16.04.1 LTS. The installation folder is /opt/dfm. Git is required (apt-get install git). Run the following commands in a terminal:

cd /opt
git clone https://github.com/soprasteria/cybersecurity-dfm.git dfm
cd dfm
./install_ubuntu.sh

The install_ubuntu.sh script installs all dependencies, builds components where required, and creates the dfm account to run the daemons. Four daemons with web endpoints are set up in supervisor:

  • ES: Elasticsearch, the search engine acting as main storage (port 9200)
  • KB: Kibana, dashboards (port 5601)
  • DEDE: DeepDetect, the machine learning server (port 8080)
  • DFM: Data Feed Manager, the orchestrator of the services above (port 12345)
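
As a quick post-install sanity check, a short script like the following can confirm that the four daemons answer over HTTP. It is only a sketch: it assumes the default ports above and the requests library, and is not part of DFM.

# Sanity-check sketch: assumes the default ports listed above and that
# the requests library is installed (pip install requests).
import requests

SERVICES = {
    "elasticsearch": "http://localhost:9200",
    "kibana": "http://localhost:5601",
    "deepdetect": "http://localhost:8080/info",  # DeepDetect exposes GET /info
    "dfm": "http://localhost:12345",
}

for name, url in SERVICES.items():
    try:
        print("%s: HTTP %s" % (name, requests.get(url, timeout=5).status_code))
    except requests.exceptions.RequestException as exc:
        print("%s: unreachable (%s)" % (name, exc))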

When installation is done:

  • Edit /opt/dfm/dfm/settings.cfg and add your Twitter account credentials.
  • Restart DFM with supervisorctl restart dfm.
  • Open a web browser and connect to http://localhost:12345.
  • Click the Sources button in the menu.
  • Add a source (a Tiny Tiny RSS feed, an RSS feed, or a Twitter search).
  • Refresh the page, then click the crawl link at the right of the sources table to collect the news.
  • Open the Topics page and create a Topic with selected Tags.
  • Open the Models page and group Topics into a Model.
  • Then click generate model at the right of the table.

To set up dashboards:

  • Open a web browser and connect to http://localhost:5601.
  • Click the Settings button.
  • Put "*" in "Index name or pattern".
  • Select updated for "Time-field name".
  • Click "Create".
  • Click "Discover", then change the top-right time range to weeks or months.
  • Explore the Kibana documentation for more details.

Other information

  • You can tune the memory allocated to Elasticsearch in /etc/supervisor/conf.d/es.conf (the default is 8 GB; it should be about half of your memory): https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
  • The maximum number of open files also matters for Elasticsearch (set in /etc/sysctl.conf); read https://www.elastic.co/guide/en/elasticsearch/guide/current/_file_descriptors_and_mmap.html
  • The main text extracted from each news item is stored in the text field, and the full HTML version is stored in the html field as an Elasticsearch attachment.
  • URLs in tweets are followed to fetch the target web page.
  • News items that are too small (below the NEWS_MIN_TEXT_SIZE config variable) are excluded and deleted from the database.
  • For readability, model titles are used as keys between DeepDetect and DFM. Topic titles are also used as keys (labels) between DeepDetect and DFM.
  • The RSS feed on the DFM front page (port 12345) provides the best-predicted news of the week related to the topics in your models. If there are no predictions, this feed will be empty.
  • The prediction threshold is defined in /opt/dfm/dfm/settings.cfg, by default OVERALL_SCORE_THRESHOLD=0.1. If the prediction scores of your news are lower than 0.1, the DFM front-page feed will be empty (see the settings sketch after this list).
  • If you set DEBUG to True in settings.cfg, the process will fork and cannot be stopped by supervisor; you will have to kill it yourself.
  • The link field in the data structure is used to generate the id of all objects, so every object (sources, topics, models) has a link used to generate its UUID.
  • The crontab of the dfm account is used to call scheduled tasks from the API (http://localhost:12345/api/schedule/...). You can also use these URLs for one-time actions, such as: - crawl one source (e.g. http://localhost:12345/api/schedule/cbf1d10571c4da9d101c1b4fab3d3d93) - crawl all sources (http://localhost:12345/api/schedule/sources_crawl) - gather the text body and HTML of docs (news) (http://localhost:12345/api/schedule/contents_crawl) - predict all stored news with a text body (http://localhost:12345/api/schedule/contents_predict) - re-generate all prediction models (http://localhost:12345/api/schedule/generate_models)
  • The Flask logger is used for log messages; most messages are at DEBUG level. For reasons that are not entirely clear, the log file generated by Flask (/opt/dfm/dfm/dfm.log) is less verbose than the supervisor log file (/var/log/supervisor/dfm-stdout*.log).
  • For efficient topic prediction we recommend: - having the same number of news items per topic within a model - having more than 1000 news items per topic - creating topics that do not mostly overlap (avoid multiple topics with synonymous tags)
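
The sketch below shows what /opt/dfm/dfm/settings.cfg can look like, based only on the variables mentioned above. Only OVERALL_SCORE_THRESHOLD=0.1 is a documented default; the NEWS_MIN_TEXT_SIZE value and the Twitter credential variable names are illustrative placeholders, so check the settings.cfg shipped with DFM for the real keys.

# Sketch of /opt/dfm/dfm/settings.cfg (a Flask-style config file, Python syntax).
# Only OVERALL_SCORE_THRESHOLD=0.1 is a documented default; the other values
# and the Twitter variable names are placeholders, not the real DFM keys.
DEBUG = False                   # True makes the process fork; supervisor can no longer stop it
OVERALL_SCORE_THRESHOLD = 0.1   # minimum prediction score for the front-page feed
NEWS_MIN_TEXT_SIZE = 500        # placeholder value: smaller news items are deleted
TWITTER_CONSUMER_KEY = "..."    # hypothetical names: use the keys from the shipped settings.cfg
TWITTER_CONSUMER_SECRET = "..."
TWITTER_ACCESS_TOKEN = "..."
TWITTER_ACCESS_TOKEN_SECRET = "..."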

Todo List

  • [ ] OPML import/export
  • [ ] Integration of other social networks and web services (Reddit, LinkedIn, ...)
  • [X] Extract text from documents (CSV, DOC, DOCX, EML, EPUB, GIF, JPG, JSON, MSG, ODT, PDF, PPTX, PS, RTF, TXT, XLSX, XLS)
  • [ ] Extract text from the audio speech of videos
  • [ ] Search engine crawling
  • [ ] Bypass JavaScript ad redirections
  • [ ] Bypass captcha filters
  • [ ] Bypass cookie-acceptance prompts


cybersecurity-dfm's People

Contributors

acabrol, tmaurelsoprasteria, trillejs


cybersecurity-dfm's Issues

Support latest version of DeepDetect python wrapper

Currently DFM uses an old version of the DeepDetect Python wrapper.

The Python wrapper previously handled errors with a dedicated class and now uses requests exceptions.

The DFM code must be modified to handle these standard exceptions instead of the previous DeepDetect error class.
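
A minimal sketch of the intended direction, assuming the DeepDetect REST API on the default port 8080; the service name and helper below are illustrative, not existing DFM code.

# Sketch: call the DeepDetect REST /predict endpoint with requests and handle
# the standard requests exceptions instead of a wrapper-specific error class.
import requests

def predict(text, service="dfm_model", host="http://localhost:8080"):
    payload = {"service": service, "parameters": {}, "data": [text]}
    try:
        resp = requests.post(host + "/predict", json=payload, timeout=30)
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.RequestException as exc:
        print("DeepDetect call failed: %s" % exc)  # replaces the old error-class handling
        return None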

Wrong Content-Type

DFM doesn't get the correct content type for some documents.

Below is an example:

CURL request:

curl -I -XGET https://arxiv.org/pdf/1801.01681v1.pdf
HTTP/1.1 200 OK
Date: Thu, 11 Jan 2018 08:38:34 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000
Set-Cookie: browser=86.250.248.55.1515659914652413; path=/; max-age=946080000; domain=.arxiv.org
Last-Modified: Mon, 08 Jan 2018 01:42:36 GMT
ETag: "16b79425-213180-56239e9adddf8"
Accept-Ranges: bytes
Content-Length: 2175360
Access-Control-Allow-Origin: *
Content-Type: application/pdf

DFM Log:

DEBUG in feed [cybersecurity-dfm/dfm/feed.py:572]:
Content-Type:text/html; charset=utf-8 url:https://arxiv.org/pdf/1801.01681v1.pdf
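
One possible fix, sketched under the assumption that the requests library is available (the function is illustrative, not the actual code in feed.py): read the Content-Type reported by the server itself before deciding how to parse the document.

# Sketch: ask the server for the Content-Type instead of assuming text/html.
import requests

def get_content_type(url):
    resp = requests.head(url, allow_redirects=True, timeout=10)
    # e.g. "application/pdf" for https://arxiv.org/pdf/1801.01681v1.pdf
    return resp.headers.get("Content-Type", "").split(";")[0].strip()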

Use the tweepy API when a twitter.com URL is detected as source

When a URL from twitter.com is submitted, it is not yet recognized as a tweet, so the tweepy API is not used and the links it contains are not extracted.

During source address crawling we could detect whether the domain is twitter.com and then use tweepy to extract the links in the tweet, processing the linked news instead of only the tweet message.
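
A sketch of that detection, assuming tweepy and an already-authenticated API object; the helper name is illustrative and not an existing DFM function.

# Sketch: detect a twitter.com status URL and use tweepy to pull the links it contains.
from urlparse import urlparse  # Python 2, as used elsewhere in DFM
import tweepy

def extract_links_from_tweet_url(url, api):
    # api is expected to be an authenticated tweepy.API instance
    parsed = urlparse(url)
    if parsed.netloc.endswith("twitter.com") and "/status/" in parsed.path:
        tweet_id = parsed.path.rstrip("/").split("/")[-1]
        status = api.get_status(tweet_id)
        # the tweet's URL entities hold the expanded links to the target pages
        return [u["expanded_url"] for u in status.entities.get("urls", [])]
    return []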

Support file text extraction

Currently DFM only extracts content from web pages.

The to-do list includes extracting text from downloaded files as well, such as PDF, DOC, DOCX, PPT, PPTX, ODT, and ODP.

The link below gives an idea of how to detect the file format automatically:
https://stackoverflow.com/questions/38710238/python-download-file-over-http-and-detect-filetype-automatically

Text extraction from several types of documents:
http://textract.readthedocs.io/en/stable/

textract requires a file on disk, which can be created with:
https://docs.python.org/2/library/tempfile.html
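
Putting the three links above together, a rough sketch could look like the following; it assumes the requests and textract packages, and the mapping of content types to extensions is illustrative.

# Sketch: download the file, save it with a matching extension in a temporary
# file, and let textract extract the text.
import tempfile
import requests
import textract

def extract_text(url):
    resp = requests.get(url, timeout=30)
    content_type = resp.headers.get("Content-Type", "").split(";")[0]
    suffix = {"application/pdf": ".pdf",
              "application/msword": ".doc"}.get(content_type, ".bin")  # extend as needed
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(resp.content)
        path = tmp.name
    return textract.process(path)  # textract picks its parser from the extension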

URI hash collision

There is a collision when generating the MD5 hash of a URI if the website uses the query string for parameters.

Example of a parsed "Cochonnet" YouTube video URL:

>>> urlparse.urlparse(text_to_string('https://www.youtube.com/watch?v=30Nv0WY4Lg8'))
ParseResult(scheme='https', netloc='www.youtube.com', path='/watch', params='', query='v=30Nv0WY4Lg8', fragment='')
>>> 

If the URI contains //, the URI actually used is a reconstruction of obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path.
But YouTube passes the video id in the query string, so the MD5 generated for all YouTube videos is exactly the same, because the query is not taken into account.

Here are all the lines where I found the bug:

to_hash_uri=urllib.quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path)

to_hash_uri=urllib.quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path)

to_hash_uri=urllib.quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path)
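
A possible fix, sketched on top of the lines above (the exact placement in the DFM code is not shown here): append the query string so URLs that differ only in their parameters hash to different ids.

# Sketch of a fix: include the query string in the hashed URI so that, e.g.,
# different YouTube video ids no longer collide on the same MD5.
to_hash_uri=urllib.quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path
                         + ("?" + obj_uri.query if obj_uri.query else ""))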
