alephdata / aleph

Search and browse documents and data; find the people and companies you are looking for.

Home Page: http://docs.aleph.occrp.org

License: MIT License

Makefile 0.30% Python 31.83% Mako 0.02% HTML 0.19% Shell 0.35% JavaScript 39.57% Dockerfile 0.14% SCSS 6.65% TypeScript 20.96%
python data-search graph-database journalism osint investigative-journalism

aleph's Introduction

Truth cannot penetrate a closed mind. If all places in the universe are in the Aleph, then all stars, all lamps, all sources of light are in it, too.

-- The Aleph, Jorge Luis Borges

Aleph is a tool for indexing large amounts of both documents (PDF, Word, HTML) and structured data (CSV, XLS, SQL) for easy browsing and search. It is built with investigative reporting as a primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets.

For further details on the software, how to use it, install it or manage data imports, please check the documentation at http://docs.aleph.occrp.org.

Support

Aleph is used and developed by multiple organisations and interested individuals. If you're interested in participating in this process, please read the support policy (SUPPORT.md), the contribution rules (CONTRIBUTING.md), and the code of conduct (CODE_OF_CONDUCT.md) and then get in touch:

Release process

If you are interested in, or have been tasked with, releasing a new version of Aleph, follow these steps:

Overview

The basic process for releasing Aleph is this:

1. Check internal libraries for updates and merge. Release our libraries in the following order:
  1. servicelayer
  2. followthemoney
  3. ingest-file
  4. react-ftm
2. Ensure that all libraries for a release are up to date in Aleph and merged to the develop branch.
3. Ensure that any features and bugfixes are merged into develop and that all builds are passing.
4. Ensure that the CHANGELOG.md file is up to date on the develop branch. Add information as required.
5. Create an RC release of Aleph.
6. Test and verify the RC. Perform further RC releases as required.
7. Merge all changes to main.
8. Create a final version of Aleph.

As far as possible, apply the rules of semantic versioning when determining the type of release to perform.

Technical process

RC releases

If you need to perform an RC release of Aleph, follow these steps:

  1. Ensure that the CHANGELOG.md is up to date on the develop branch and that all outstanding PRs have been merged.
  2. From the develop branch, run bump2version (major, minor, or patch); this will create an x.x.x-rc1 version of Aleph.
  3. Push the tags to the remote with git push --tags.
  4. Push the version bump with git push.
  5. If there are problems with the RC, you can fix them and use bump2version build to generate a new RC release.
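Assembled into a shell session, the RC flow looks roughly like this (a sketch of the steps above; pick the bump2version part to match the release type):

# On the develop branch, with CHANGELOG.md updated and all PRs merged:
git checkout develop
git pull

# Create the x.x.x-rc1 tag (major, minor or patch, as appropriate):
bump2version minor

# Publish the tag and the version-bump commit:
git push --tags
git push

# If the RC needs fixes, commit them and cut the next candidate:
bump2version build   # x.x.x-rc1 -> x.x.x-rc2
git push --tags && git push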

Major, minor, patch releases

  1. Switch to main and pull from the remote.
  2. If not already done, merge develop into main.
  3. Update translations using make translate.
  4. If you get npm errors, go into the ui folder and run npm install.
  5. Commit the translations to main and push to the remote.
  6. Run bump2version --verbose --sign-tags release. Note that bump2version won't show changes when you make the change, but it will work (see git log to check).
  7. Push the tags to the remote with git push --tags.
  8. Push the version bump to the remote with git push.
  9. Merge main back into develop. This is slightly unrelated to the release process, but it's a good time to do it so that the new version numbers appear in develop as well.
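Put together, the final-release flow is a handful of commands (a sketch assembled from the steps above):

# On main, after merging develop:
git checkout main
git pull
git merge develop

# Update and commit translations:
make translate
git commit -am "Update translations" && git push

# Cut the release tag and publish it:
bump2version --verbose --sign-tags release
git push --tags
git push

# Carry the new version number back into develop:
git checkout develop
git merge main
git push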

aleph's People

Contributors

andkamau, benoccrp, catileptic, davidlemayian, dependabot-preview[bot], dependabot[bot], dkhurshudian, dschulz-pnnl, emmina, felixebert, iaincollins, jcshea, kjacks, lepstep, longhotsummer, micahflee, monneyboi, mynameisfiber, pudo, rhiaro, rinatius, rosencrantz, simonwoerpel, smmbllsm, stas, stchris, sunu, tillprochaska, ueland, wpf500


aleph's Issues

Typo in Aleph repo description

Sift through large sets of structured and unstructured data, and find the people and comapnies you look for.

Should read: comapnies->companies

Re-work collection UI

Collections should have their own UI in which users can browse, edit, add and remove documents and entities associated with that collection.

Document Search Error

Document search fails with error below:
[screenshot: document search error, 2016-06-12]

Excerpts from ElasticSearch logs:

[2016-06-12 18:36:09,389][DEBUG][action.search            ] [Robert Bruce Banner] All shards failed for phase: [query]
RemoteTransportException[[Robert Bruce Banner][172.31.5.13:9300][indices:data/read/search[phase/query]]]; nested: SearchParseException[failed to parse search source [{"sort": [{"doc_count": "desc"}, "_score"], "query": {"filtered": {"filter": {"terms": {"collection_id": []}}, "query": {"bool": {"must": [{"match_phrase_prefix": {"terms": "ghana"}}, {"range": {"doc_count": {"gte": 0}}}]}}}}, "_source": ["name", "$schema", "terms", "doc_count"], "size": 5}]]; nested: SearchParseException[No mapping found for [doc_count] in order to sort on];
Caused by: SearchParseException[failed to parse search source [{"sort": [{"doc_count": "desc"}, "_score"], "query": {"filtered": {"filter": {"terms": {"collection_id": []}}, "query": {"bool": {"must": [{"match_phrase_prefix": {"terms": "ghana"}}, {"range": {"doc_count": {"gte": 0}}}]}}}}, "_source": ["name", "$schema", "terms", "doc_count"], "size": 5}]]; nested: SearchParseException[No mapping found for [doc_count] in order to sort on];

Running version: c5fb231
ES version: 2.3.1

@pudo any ideas?

Unable to Get Web UI of aleph

Hi,

I have installed Aleph successfully, but I am unable to figure out how to run its web UI for search.

Server Installation Details:

  1. Distributor ID: Ubuntu
    Description: Ubuntu 14.04.4 LTS
    Release: 14.04
    Codename: trusty
  2. Aleph installed as per the pudo/aleph installation instructions.
  3. Docker and docker-compose installed as per the URL.

Finally:

root@7748cd87f225:/aleph# aleph runserver
INFO:werkzeug: * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Kindly guide me on what the further dependencies are and how I can get to the web UI.

Regards.

Make OAuth follow-up pluggable

At the moment, the code used to receive an OAuth provider response and to turn it into a set of roles is a hard-coded set of supported services. This should come in via a plugin architecture instead.
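A plugin architecture here could be as small as a registry mapping provider names to handler callables. A minimal sketch (names and payload fields are illustrative, not the actual aleph design):

# Sketch of a provider registry replacing the hard-coded service list.
OAUTH_HANDLERS = {}

def oauth_handler(provider):
    """Register a callable that turns a provider response into a set of roles."""
    def decorator(func):
        OAUTH_HANDLERS[provider] = func
        return func
    return decorator

@oauth_handler('google')
def google_roles(response):
    # Illustrative: derive a user role plus a group role from the payload.
    return {'user', 'group:' + response['email'].split('@')[1]}

def roles_for(provider, response):
    return OAUTH_HANDLERS[provider](response)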

Subscribe to alerts for all entities on a watchlist

Currently, you can set up an alert on a particular query, but the query "Show documents matching entities on this watchlist" is not accessible via the UI. Need to expose it, and assign good names to these alerts.

Store role/role relationships

When a user is signed in, they are assigned a set of roles via OAuth. This includes their user role, but also other roles, such as user groups. These links aren't currently stored in the database, which means offline subsystems (like alerting) can't know which roles a user is permitted to access.

Event triggers | notification alerts

Use Case
As a journalist / data importer, I want to be alerted to mentions of an entity I am interested in, so that I can sift through the imported documents in case the entity of interest is missing from the current batch of documents.

Document how to use rabbitmq inside the docker setup

First off, great work on this project!

I've gotten a bit stuck with getting a version of this running completely locally using docker. I've noticed I can specify the archive type as file and run a RabbitMQ queue instead of using SQS which is really nice. I've tried to do that using the docker set-up but seem to be getting a connection error when I come to use aleph crawldir.

Error:

INFO:aleph.ingest.ingestor:Traceback (most recent call last):
  File "/aleph/aleph/ingest/__init__.py", line 58, in ingest_file
    ingest.delay(collection_id, meta.to_attr_dict())
  File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 453, in delay
    return self.apply_async(args, kwargs)
  File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 565, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/usr/local/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task
    reply_to=reply_to or self.oid, **options
  File "/usr/local/lib/python2.7/site-packages/celery/app/amqp.py", line 310, in publish_task
    **kwargs
  File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 457, in _ensured
    interval_max)
  File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 369, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/usr/local/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 237, in connect
    return self.connection
  File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 742, in connection
    self._connection = self._establish_connection()
  File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 697, in _establish_connection
    conn = self.transport.establish_connection()
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
    raise socket.error(last_err)
error: [Errno 111] Connection refused

I've added RabbitMQ as a separate container and linked it to the other containers. Here is my docker-compose file:

postgres:
  image: postgres:9.4
  volumes:
    - "/opt/aleph/data/postgres:/var/lib/postgresql/data"
    - "/opt/aleph/logs/postgres:/var/log"
  environment:
    POSTGRES_USER:     aleph
    POSTGRES_PASSWORD: aleph
    POSTGRES_DATABASE: aleph
  ports:
   - "127.0.0.1:5439:5432"

elasticsearch:
  image: elasticsearch:2.2.0
  volumes:
    - "/opt/aleph/data/elasticsearch:/usr/share/elasticsearch/data"
    - "/opt/aleph/logs/elasticsearch:/var/log"
  ports:
    - "127.0.0.1:9201:9209"
  # environment:
  # ES_HEAP_SIZE: 4g

worker:
  build: .
  command: celery -A aleph.queue worker -c 10 -l INFO --logfile=/var/log/celery.log
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/:/host"
    - "/opt/aleph/data:/opt/aleph/data"
    - "/opt/aleph/logs/worker:/var/log"
  environment:
    C_FORCE_ROOT: 'true'
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
    POLYGLOT_DATA_PATH: /opt/aleph/data
    TESSDATA_PREFIX: /usr/share/tesseract-ocr
  env_file:
    - aleph.env

beat:
  build: .
  command: celery -A aleph.queue beat -s /var/run/celerybeat-schedule
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/opt/aleph/logs/beat:/var/log"
    - "/opt/aleph/run/beat:/var/run"
  environment:
    C_FORCE_ROOT: 'true'
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
  env_file:
    - aleph.env

web:
  build: .
  command: gunicorn -w 5 -b 0.0.0.0:8000 --log-level info --log-file /var/log/gunicorn.log aleph.manage:app
  ports:
    - "13376:8000"
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/opt/aleph/logs/web:/var/log"
  environment:
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
  env_file:
    - aleph.env

rabbitmq:
  image: rabbitmq
  ports:
    - 5672
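For reference, a "connection refused" from kombu usually means the Celery processes are still trying the default broker on localhost rather than the rabbitmq container. A plausible fix is to point them at the container explicitly; the setting name ALEPH_BROKER_URI below is an assumption, so check the settings module of your Aleph version for the real one:

worker:
  environment:
    # Assumed setting name; amqp://guest:guest@rabbitmq:5672// uses RabbitMQ's
    # default credentials and vhost. Repeat for the beat and web services.
    ALEPH_BROKER_URI: amqp://guest:guest@rabbitmq:5672//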

Graph engine integration

This is a tracking ticket to explain the why & how of integrating a graph engine into aleph.

Why?

  • Recommendation engine (people who researched A also want to look into B)
  • Linking unstructured document info with structured DBs in a graph

How?

  • The following can be modelled as graph nodes: Documents, Entities, Aliases, Phones, Emails, Collections. They are connected via MENTIONS, AKA, CONTAINS.
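For illustration, the proposed model is easy to prototype with networkx before committing to a graph engine (identifiers and attributes below are made up):

import networkx as nx

g = nx.MultiDiGraph()

# Node types: Documents, Entities, Aliases, Phones, Emails, Collections.
g.add_node('doc:1', type='Document', title='Leaked contract')
g.add_node('ent:1', type='Entity', name='Acme Holdings Ltd.')
g.add_node('alias:1', type='Alias', name='ACME')
g.add_node('coll:1', type='Collection', label='In-house leaks')

# Edge types: MENTIONS, AKA, CONTAINS.
g.add_edge('doc:1', 'ent:1', label='MENTIONS')
g.add_edge('ent:1', 'alias:1', label='AKA')
g.add_edge('coll:1', 'doc:1', label='CONTAINS')

# Recommendation-style query: documents mentioning an entity of interest.
docs = [u for u, _v, l in g.in_edges('ent:1', data='label') if l == 'MENTIONS']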

Boost collections

Collections should support a natural boost, on a scale of 1-6, relative to other collections. This can be used to rank exclusive in-house materials above a scrape of a government database, which in turn ranks above press clippings.

/cc @danohuiginn
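At query time, such a boost could plausibly be applied with an Elasticsearch function_score query. A sketch, assuming each indexed document carries a numeric collection_boost field copied from its collection (the field name is an assumption):

# Multiply relevance by the per-collection boost (1-6); documents without the
# field fall back to a neutral factor of 1.
query = {
    "query": {
        "function_score": {
            "query": {"match": {"text": "offshore"}},
            "field_value_factor": {"field": "collection_boost", "missing": 1},
            "boost_mode": "multiply",
        }
    }
}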

Error after clicking on login

Hi,

Thanks once again for all your help so far!

I have finally managed to get the Aleph UI up and running.

Here is the URL: http://54.191.176.203:13376/

Now I want to perform the following tasks:

  1. Create categories like we see on https://data.occrp.org.
  2. How can we add a domain in Docker so that this IP is replaced by an FQDN?
  3. Switch ports from 13376 to 80.
  4. SSL installation.
  5. Do we require any particular installation on Docker?

Our architecture consists of the following items:

a. AWS EC2 c4.4xlarge shared instance
b. OS
Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty

API errors: when we click on Login, it throws an error. How can we fix it?

Regards.

Improvements to tabular data viewer

The following come to mind:

  • Make headers fixed when user scrolls
  • Add a way to view very long cell contents
  • Filter by field
  • Sort by field
  • Show/hide fields
  • Facet by field
  • Format values more nicely

per-site templates and css

aleph.env should contain a list of directories to search for template files, allowing any file to be overridden.

Custom style sheets can be specified from config.
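In Flask/Jinja2 terms, this could be a ChoiceLoader that consults the configured directories before the built-in templates. A sketch, with ALEPH_TEMPLATE_PATHS as an assumed config key:

from flask import Flask
from jinja2 import ChoiceLoader, FileSystemLoader

app = Flask(__name__)

# Directories listed in the config are searched first, so any built-in
# template file can be overridden. ALEPH_TEMPLATE_PATHS is an assumed name.
custom_dirs = app.config.get('ALEPH_TEMPLATE_PATHS', [])
app.jinja_loader = ChoiceLoader(
    [FileSystemLoader(d) for d in custom_dirs] + [app.jinja_loader]
)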

Rename "Watchlist" to "Collection" or "Project"

Starting a discussion here, ping @danohuiginn.

Watchlists are right now meant to be collections of entities that get cross-referenced with the documents in aleph. I'm planning to extend this with additional functionality, such as making the Watchlist/Entity relationships many-to-many and allowing for the de-duplication of entities (currently, our aleph has 5 Bashar Al-Assads).

The next step could be to make Watchlists capable of holding documents as well as entities. This might make sense to allow users to group together documents they're interested in for a particular purpose. However, this is where the name stops making sense. I'd therefore like to propose renaming Watchlists now (before there are too many API dependencies).

What do you think?

"Peek" into hidden search results

When a user's search matches documents that are not visible to them, return the name of the person they need to contact to get access to such documents.

aleph upgrade Error on Installation

I have installed Aleph via docker and docker-compose. Upon running "aleph upgrade", it throws the following error.

[root@localhost aleph]# docker-compose run worker /bin/bash
Starting aleph_elasticsearch_1
Starting aleph_postgres_1
root@58ced8c:/aleph# aleph upgrade
INFO:aleph.model:Beginning database migration...
INFO:alembic.runtime.migration:Context impl PostgresqlImpl.
INFO:alembic.runtime.migration:Will assume transactional DDL.
INFO:aleph.model:Creating system roles...
WARNING:elasticsearch:PUT /aleph/mapping/document [status:404 request:0.835s]
Traceback (most recent call last):
  File "/usr/local/bin/aleph", line 9, in <module>
    load_entry_point('aleph', 'console_scripts', 'aleph')()
  File "/aleph/aleph/manage.py", line 167, in main
    manager.run()
  File "/usr/local/lib/python2.7/site-packages/flask_script/__init__.py", line 412, in run
    result = self.handle(sys.argv[0], sys.argv[1:])
  File "/usr/local/lib/python2.7/site-packages/flask_script/__init__.py", line 383, in handle
    res = handle(*args, **config)
  File "/usr/local/lib/python2.7/site-packages/flask_script/commands.py", line 216, in __call__
    return self.run(*args, **kwargs)
  File "/aleph/aleph/manage.py", line 148, in upgrade
    upgrade_search()
  File "/aleph/aleph/index/admin.py", line 26, in upgrade_search
    doc_type=TYPE_DOCUMENT)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/indices.py", line 291, in put_mapping
    '_mapping', doc_type), params=params, body=body)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, u'index_not_found_exception')

Language whitelisting

An instance of Aleph should have a whitelist of "plausible" languages which can be detected and which documents can be tagged with. All others are removed both at ingest and results presentation time.
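A minimal sketch of the filtering step (the whitelist itself would come from instance configuration):

# Illustrative only; the actual whitelist would be read from instance config.
LANGUAGE_WHITELIST = {'en', 'fr', 'de', 'ru', 'ar'}

def filter_languages(detected):
    """Drop implausible language tags, both at ingest and at presentation time."""
    return [lang for lang in detected if lang in LANGUAGE_WHITELIST]

print(filter_languages(['en', 'zz', 'ru']))  # -> ['en', 'ru']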

Log failed ingests to the database

Currently, if a file cannot be ingested, no record is left in the database (i.e. no Document is created). Instead, a stub record should be created and filled with any exceptions that occur during processing (e.g. unsupported file format, parsing errors, etc.).
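A sketch of the stub-record idea in SQLAlchemy terms (field names and the process() helper are illustrative, not the actual aleph schema):

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class Document(db.Model):  # illustrative fields, not the actual aleph model
    id = db.Column(db.Integer, primary_key=True)
    status = db.Column(db.String(10))       # 'ok' or 'failed'
    error_type = db.Column(db.String(100))  # e.g. 'ParsingError'
    error_message = db.Column(db.Unicode)

def ingest_file(path):
    doc = Document(status='ok')
    try:
        process(path, doc)  # hypothetical ingest step
    except Exception as exc:
        # Keep the stub record and attach the failure details to it.
        doc.status = 'failed'
        doc.error_type = type(exc).__name__
        doc.error_message = unicode(exc)  # Python 2, as in the tracebacks above
    db.session.add(doc)
    db.session.commit()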

Break up Dockerfile

into one (or more) base images, to streamline deployment

  • Webkit install should happen first (since it won't change)
  • then apt-get and node installs, which change occasionally
  • then copy/install of working directory, which changes ALL THE TIME
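A sketch of that layer ordering (package names are illustrative, not taken from the actual Dockerfile):

# Illustrative layer ordering only; not the real aleph Dockerfile.
FROM python:2.7

# 1. Webkit install first, since it won't change -- the layer stays cached.
RUN apt-get update -qq && apt-get install -y wkhtmltopdf

# 2. apt-get and node installs, which change occasionally.
RUN apt-get install -y nodejs

# 3. Copy/install of the working directory, which changes all the time,
#    goes last so only these layers are rebuilt on code changes.
COPY . /aleph
RUN pip install -e /aleph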

Allow users to edit document metadata

This requires:

  • Fix up document write authorisation via a new field on documents.
  • Improve validation of metadata in the application.
  • UI elements for editing metadata.

Crawler ideas [discussion]

(talking with @danohuiginn)

  • Crawlers should give feedback on last run, next run etc.
  • [opt] Crawlers should be scheduled automatically
  • Crawlers need to be fully incremental
  • Crawlers are configured via their crawler class (no in-DB json schmu)

error using S3 archive

[2016-07-26 12:44:49,927: ERROR/MainProcess] Task aleph.ingest.ingest[86d1301b-1d82-40f4-9208-dd16e568fc07] raised unexpected: AttributeError("'s3.ObjectSummary' object has no attribute 'download_file'",)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/newrelic-2.46.0.37/newrelic/hooks/application_celery.py", line 66, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/aleph/aleph/ingest/__init__.py", line 97, in ingest
    Ingestor.dispatch(collection_id, meta)
  File "/aleph/aleph/ingest/ingestor.py", line 102, in dispatch
    local_path = get_archive().load_file(meta)
  File "/aleph/aleph/archive/s3.py", line 82, in load_file
    obj.download_file(path)
AttributeError: 's3.ObjectSummary' object has no attribute 'download_file'

This happens after uploading a file via the web interface.
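For context: in boto3, s3.ObjectSummary is a lightweight listing object without a download_file method, so one plausible fix (not necessarily the one applied upstream) is to resolve it to a full s3.Object first:

# In aleph/archive/s3.py (a sketch): obj is an s3.ObjectSummary here.
obj.Object().download_file(path)

# Or, equivalently, fetch the object from the bucket by key:
# bucket.Object(obj.key).download_file(path)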

Search history and event logging

Yet another round of events stuff: make a database table with all important user interactions, i.e. login, logout, search and document view/fetch. Both for statistics and, in the long run, to show users their own search history.

Separate user and import queues

At the moment, user-triggered background processing (such as entity updates) is handled by the same queue as bulk document imports, which means it can be delayed by hours or days. These should go into different queues and be processed either by different worker daemons or at a different priority.
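With Celery, this separation is a routing concern. A sketch using the old-style settings matching the Celery version in the tracebacks above; the task name aleph.ingest.ingest appears in the logs earlier on this page, while aleph.entities.update is an assumption:

# celeryconfig sketch: route bulk ingest to its own queue so user-triggered
# tasks are not stuck behind hours of imports.
CELERY_ROUTES = {
    'aleph.ingest.ingest': {'queue': 'bulk'},
    'aleph.entities.update': {'queue': 'user'},  # assumed task name
}
CELERY_DEFAULT_QUEUE = 'user'

# Then run a dedicated worker per queue (or give 'user' more priority):
#   celery -A aleph.queue worker -Q user -c 4
#   celery -A aleph.queue worker -Q bulk -c 10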

Support ingestion of Emails

Outlook exports, rfc822 files (.mbox, .msg) and Maildirs :) What should this look like in the UI, a PDF or something more structured?

Handle directory-based imports via bundles

Some file formats, such as ESRI Shapefiles or Cronos databases, are based on the contents of a directory, rather than a file. Since Aleph handles archiving on a per-file basis, these data types cannot be ingested properly. The proposed solution to this issue is to introduce a new mechanism, bundles. A bundle is a generated ZIP file, e.g. folder-name.shapefile or database.cro, that is created when the folder is available and then parsed as a file by a format-specific ingestor. The bundling is done upon crawling from a directory or package.
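A sketch of the bundling step (a plain zip of the directory; the naming follows the folder-name.shapefile example above):

import os
import zipfile

def make_bundle(directory, suffix):
    """Zip a directory-based format (e.g. 'folder-name.shapefile') into a
    single bundle that the per-file archive can store and a format-specific
    ingestor can unpack later."""
    bundle_path = directory.rstrip('/') + '.' + suffix
    with zipfile.ZipFile(bundle_path, 'w') as zf:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the bundled directory.
                zf.write(full, os.path.relpath(full, directory))
    return bundle_path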
