
babbage's Introduction

OpenSpending


OpenSpending is a project to make government finances easier to explore and understand. It started out as "Where does my money go", a platform to visualize the United Kingdom's state finance, but has been renamed and restructured to allow arbitrary financial data to be loaded and displayed.

The main use for the software is the site openspending.org which aims to track government finance around the world.

OpenSpending's code is licensed under the GNU Affero General Public License except where otherwise indicated. A copy of this license is available in the file LICENSE.txt.

OpenSpending is a microservices platform made up of a number of separate apps, each maintained in its own git repository. This repository contains docker-compose files that can be used to run an instance of OpenSpending for development, or as the basis for a production deployment. This repository also acts as a central hub for managing issues for the entire platform.

What are these files?

Most applications that make up the OpenSpending platform are maintained in their own repositories, with their own Dockerfiles, and are built and pushed to the OpenSpending organisation on Docker Hub.

This repository maintains docker-compose files used to help get you started with the platform.

docker-compose.base.yml: This is the main docker-compose file for OpenSpending specific services. All installations will use this as the basis for running the platform.

docker-compose.dev-services.yml: This defines backing services used by the platform, such as Redis, ElasticSearch, and PostgreSQL. This file also includes fake-s3 in place of AWS S3, so you don't have to set up an S3 bucket for development. It is not recommended to use this for production.

docker-compose.data-importers.yml: This defines the services used for the separate os-data-importers application. They depend on services defined in docker-compose.dev-services.yml. Unless you are working on the data-importers or its associated source-spec files, it's not necessary to run this file.

docker-compose.local.yml: Create this file to add additional services, or overrides for the base configuration. It is ignored by git.

Dockerfiles/*: Most services are maintained in their own repositories, but a few small custom services used by the platform are maintained here. os-nginx-frontend is a basic frontend nginx server with configuration files that define resource locations for the platform. It is built and run directly by docker-compose.base.yml.

I'm a developer, how can I start working on OpenSpending?

  1. Define the environment variables that applications in the platform need. The easiest way to do this is to create a .env file (use .env.example as a template).

  2. Use docker-compose up to start the platform from the base, dev-services, and optionally local compose files:

$ docker-compose -f docker-compose.base.yml -f docker-compose.dev-services.yml [-f docker-compose.local.yml] up

  3. Open localhost:8080 in your browser.

I'm a developer, how can I work on a specific OpenSpending application? Show me an example!

You can use volumes to map local files from the host to application files in the Docker containers. For example, if you're working on OS-Conductor, add an override service to docker-compose.local.yml (create this file if necessary).

  1. Check out the os-conductor code from https://github.com/openspending/os-conductor into ~/src/dockerfiles/os-conductor on your local machine.
  2. Add the following to docker-compose.local.yml:
version: "3.4"

services:
  os-conductor:
    environment:
      # Stop Python writing .pyc bytecode files into the mounted volume
      PYTHONDONTWRITEBYTECODE: 1
    # Override CMD and send `--reload` flag for os-conductor's gunicorn server
    command: /startup.sh --reload
    # Map local os-conductor app files to /app in container
    volumes:
      - ~/src/dockerfiles/os-conductor:/app
  3. Start up the platform with base, dev-services, and your local compose file:

$ docker-compose -f docker-compose.base.yml -f docker-compose.dev-services.yml -f docker-compose.local.yml up

Now you can start working on os-conductor application files in ~/src/dockerfiles/os-conductor and changes will reload the server in the Docker container.

I want to work on the data-importers application. Show me how!

In OpenSpending, the os-data-importers application provides a way to import data and create fiscal datapackages from source-spec files. You can either work on the app independently, by following the README in the os-data-importers repository, or within the context of an OpenSpending instance, by using the included docker-compose.data-importers.yml file and starting OpenSpending with:

$ docker-compose -f docker-compose.base.yml -f docker-compose.dev-services.yml -f docker-compose.data-importers.yml up

This will start OpenSpending locally as usual on port 8080, and the pipelines dashboard will be available on port 5000: http://localhost:5000.

I have my own backing service I want to use for development

That's fine: just add the relevant resource locator to the .env file. For example, if you're using a third-party ElasticSearch server:

OS_ELASTICSEARCH_ADDRESS=https://my-elasticsearch-provider.com/my-es-instance:9200

I want to run my own instance of OpenSpending in production

Great! There are many ways to orchestrate Docker containers in a network; for example, openspending.org uses Kubernetes. Use the docker-compose.base.yml file as a guide for networking the applications together with their appropriate environment variables, and add resource locators pointing to your backing services for Postgres, ElasticSearch, Redis, memcached, AWS S3, etc. See the .env.example file for the required environment variables you'll need to set.

You'll also need to set up OAuth credentials for OS-Conductor (see https://github.com/openspending/os-conductor#oauth-credentials), and AWS S3 bucket details.

What happened to the old version of OpenSpending?

You can find the old OpenSpending v2, and the complete history for the codebase to that point, in the openspending-monolith branch.

babbage's People

Contributors

akariv, brew, holgerd77, pudo, pwalsh, vitorbaptista


babbage's Issues

Provide a simple "Get Started" in README.

@pudo are you up for it?

E.g.:

  1. Install babbage in a venv
  2. Steps to get from a single CSV to a model (e.g. from an IPython prompt)
  3. Examples of querying the data just loaded from babbage

This would be really useful :).
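
A rough sketch of what such a quick start might look like. The file names (spending.csv, model.json), table name and database URL are purely illustrative, and the exact Cube API should be checked against the babbage README; since babbage itself does not create tables (see the table-creation issue below), the CSV is loaded with pandas first:

# Illustrative quick start; names and the DB URL are placeholders.
import json
import pandas as pd
from sqlalchemy import create_engine
from babbage.cube import Cube

engine = create_engine('postgresql://localhost/babbage_demo')

# babbage works against an existing table, so load the CSV into one first:
pd.read_csv('spending.csv').to_sql('spending', engine, index=False)

# The model maps the table's columns onto measures and dimensions:
model = json.load(open('model.json'))
cube = Cube(engine, 'spending', model)

print(cube.members('financial_year'))               # distinct values of a dimension
print(cube.aggregate(drilldowns='financial_year'))  # totals broken down by year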

multiple joins to the same table

If you have two joins to the same table, babbage uses the same reference for that table, so the join will only return rows where both fact table columns are equal and match some row in the dimension table.

It'd be nice to support cases where you need to join to the same table twice, but where the join columns are not all equal.

To support that we need to alias the dimension table differently for each join - perhaps according to the column being joined to.

Sorry, I don't have time to make an example model, but this might suffice:

demarcation_changes

  • dimensions
    • old_demarcation
      • attributes
        • demarcation.code
        • demarcation.label
      • join_column: code
    • new_demarcation
      • attributes
        • demarcation.code
        • demarcation.label
      • join_column: code

This generates a query including

FROM demarcation_changes, demarcation
WHERE demarcation_changes.old_code = demarcation.code AND demarcation_changes.new_code = demarcation.code

while we want

FROM demarcation_changes, demarcation as old_demarcation, demarcation as new_demarcation
WHERE demarcation_changes.old_code = old_demarcation.code 
AND demarcation_changes.new_code = new_demarcation.code

Distinct values from a cube

We have a PostgreSQL table with about 28 million facts and a financial_year column. Users can use the babbage API to query the distinct financial_year values, of which there are about 10.

PostgreSQL seems to be quite naive when doing SELECT DISTINCT financial_year FROM table: it runs a full table scan even though financial_year has an index, which takes 60+ seconds. This seems to be a known limitation of PostgreSQL.

How have others solved this problem? Do we split out the financial_year data (and all the other dimensions of a fact) into a separate table?
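
One workaround that has worked elsewhere (not specific to babbage) is to emulate a "loose index scan" with a recursive CTE, so Postgres walks the financial_year index instead of scanning the whole table. A sketch follows; the facts table name and connection URL are placeholders:

# Loose-index-scan emulation for SELECT DISTINCT on an indexed column.
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://localhost/mydb')  # placeholder URL
query = text("""
    WITH RECURSIVE years AS (
        (SELECT financial_year FROM facts ORDER BY financial_year LIMIT 1)
        UNION ALL
        SELECT (SELECT f.financial_year FROM facts f
                WHERE f.financial_year > y.financial_year
                ORDER BY f.financial_year LIMIT 1)
        FROM years y
        WHERE y.financial_year IS NOT NULL
    )
    SELECT financial_year FROM years WHERE financial_year IS NOT NULL
""")
with engine.connect() as conn:
    print([row[0] for row in conn.execute(query)])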

Should babbage be able to create its own tables?

Right now, this functionality is not in the babbage package, but rather in whatever host app uses it (e.g. cubepress, spendb). While I don't think that babbage should only be usable with tables it has created itself, adding the necessary code to quickly generate a table (e.g. from JTS) could make the package more convenient to use.

Support multiple tables / star schema

Right now, only a single fact table is supported, and aggregations of dimensions will assume that all columns for a given dimension are perfectly denormalised. The application should support JOINs for a star schema instead.

Gitter for chat?

Hey @pudo

Would you mind if we add the new OpenSpending chat room on Gitter to the README here? It is a test to see if Gitter works for us.

Gitter

I think sum of NULLs in the aggregate query should be zero

Right now babbage just does what the database does. In Postgres that means the sum of 100 and NULL is 100, while the sum of only NULLs is NULL. I'd expect all values in a sum aggregate to be integers (assuming floats aren't supported).

What do you think?
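
A minimal sketch of the proposed behaviour using SQLAlchemy (the facts table and its columns here are illustrative, not babbage's internals): wrapping the aggregate in COALESCE makes a sum over only-NULL rows come back as 0 instead of NULL.

from sqlalchemy import Column, Integer, MetaData, Numeric, Table, func, select

metadata = MetaData()
facts = Table('facts', metadata,
              Column('id', Integer, primary_key=True),
              Column('amount', Numeric))

# COALESCE turns a NULL sum (all rows NULL, or no rows at all) into 0:
query = select([func.coalesce(func.sum(facts.c.amount), 0).label('amount.sum')])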

negative cuts

It'd be nice to be able to exclude specific values in cuts (blacklist).

For example, there's a special programme in our budget data that should generally be excluded from aggregates. To dynamically exclude it while including all other possible names, I'd need to do a members query, remove the blacklisted item from the list, and then cut with the remaining values as a whitelist.

Cubes does this with invert=True in its Python interface (https://pythonhosted.org/cubes/reference/browser.html#slicing-and-dicing) and it looks like it uses !dimension:value in its HTTP interface (https://github.com/DataBrewery/cubes/blob/master/cubes/query/cells.py#L796).

I suggest something like cut=financial_year.year:2018|!programme.name:"Direct Charges against the National Revenue Fund" would return everything for 2018 excluding Direct Charges.
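
A sketch of the current whitelist workaround described above, against babbage's HTTP API. The host, cube name, dimension and excluded value are illustrative, and both the members response shape and the multi-value cut syntax are assumptions that should be checked against the babbage version in use:

# Fetch all members, drop the blacklisted one, and cut on the rest.
import requests

base = 'http://localhost:5000/api/cubes/budget'
members = requests.get(base + '/members/programme.name').json().get('data', [])
names = [m['programme.name'] for m in members
         if m['programme.name'] != 'Direct Charges against the National Revenue Fund']

# Whitelist the remaining names; the ';'-separated multi-value cut is an
# assumption borrowed from the Cubes-style cut syntax.
cut = 'financial_year.year:2018|programme.name:' + ';'.join('"%s"' % n for n in names)
result = requests.get(base + '/aggregate', params={'cut': cut}).json()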

cut value starting with numerals interpreted as int even for column with type 'text'

I expect the query value 904 to match facts where the referred text column has the value 904. I also expect that the query value "904aaa" would be interpreted as that string and not "904".

e.g.

curl  'localhost:5000/api/cubes/incexp/facts?cut=function.function_code:904.123aaa'

results in

ProgrammingError: (psycopg2.ProgrammingError) operator does not exist: text = integer
LINE 4: WHERE incexp_labeled.function_code = 904) AS anon_1
                                           ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
 [SQL: 'SELECT count(%(param_1)s) AS count_1 \nFROM (SELECT incexp_labeled.function_code AS "function.function_code", incexp_labeled.function_desc AS "function.function_desc", incexp_labeled.incexp_cde AS "incexp_code.incexp_code", incexp_labeled.description AS "incexp_code.incexp_desc", incexp_labeled.period AS "period.period", incexp_labeled.demarcation_code AS "demarcation_code.demarcation_code", incexp_labeled.act_or_bud_amt AS amount \nFROM incexp_labeled \nWHERE incexp_labeled.function_code = %(param_2)s) AS anon_1'] [parameters: {'param_1': True, 'param_2': 904}]
127.0.0.1 - - [04/Apr/2016 13:17:25] "GET /api/cubes/incexp/facts?cut=function.function_code:904.123aaa HTTP/1.1" 500 -

same with cut value of just 900

ProgrammingError: (psycopg2.ProgrammingError) operator does not exist: text = integer
LINE 4: WHERE incexp_labeled.function_code = 904) AS anon_1
                                           ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
 [SQL: 'SELECT count(%(param_1)s) AS count_1 \nFROM (SELECT incexp_labeled.function_code AS "function.function_code", incexp_labeled.function_desc AS "function.function_desc", incexp_labeled.incexp_cde AS "incexp_code.incexp_code", incexp_labeled.description AS "incexp_code.incexp_desc", incexp_labeled.period AS "period.period", incexp_labeled.demarcation_code AS "demarcation_code.demarcation_code", incexp_labeled.act_or_bud_amt AS amount \nFROM incexp_labeled \nWHERE incexp_labeled.function_code = %(param_2)s) AS anon_1'] [parameters: {'param_1': True, 'param_2': 904}]
127.0.0.1 - - [04/Apr/2016 13:19:18] "GET /api/cubes/incexp/facts?cut=function.function_code:904 HTTP/1.1" 500 -

Support for multiple columns in dimension keys

(Not sure if this is a real issue)

In FDP it's possible to define dimensions that span multiple attributes in the source data. These dimensions can also have a primary key that contains multiple attributes (see: http://fiscal.dataprotocols.org/spec/#dimensions).

This makes sense, for example, with hierarchical classification systems (like here: https://github.com/akariv/boost-peru-national/blob/master/datapackage.json#L246), where the dimension spans the 4 attributes that make up the administrative classification.

In these cases, I'd expect the primary key to contain all the attributes in the hierarchy. However, it seems that right now babbage supports only a single key column per dimension, so in the above case it would expect each of these administrative classification levels to be defined as a separate dimension.

So, in order to allow FDP to be imported into babbage we need to either:

  • Adjust the dimensions in the FDP to have only one primary key each. For example, in the above case, we would create 4 separate dimensions with names such as 'administrative_classification.Admin1' etc (see the sketch below).
  • Add internal support in babbage for multi-column primary keys
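
A rough sketch of the first option, using babbage's single-key dimension format (the attribute and column names are illustrative, and the exact model fields should be checked against the babbage model docs):

# One flat, single-key dimension per level of the hierarchy.
dimensions = {
    'administrative_classification_1': {
        'attributes': {'code': {'column': 'admin1_code'},
                       'label': {'column': 'admin1_label'}},
        'key_attribute': 'code',
        'label_attribute': 'label',
    },
    'administrative_classification_2': {
        'attributes': {'code': {'column': 'admin2_code'},
                       'label': {'column': 'admin2_label'}},
        'key_attribute': 'code',
        'label_attribute': 'label',
    },
    # ...and similarly for levels 3 and 4.
}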

wdyt?

API Response "total_cell_count" shows incorrect number of total cells in response

โš ๏ธ Problem:

  • total_cell_count in API response shows the wrong value when the number of cells is 1

Example of problematic response: https://openspending.org/api/3/cubes/b9d2af843f3a7ca223eea07fb608e62a:expenditure-time-series-v2/aggregate/?cut=amount_kind.amount_kind%3A%22Total%22%7Cfinancial_year.financial_year%3A2015&pagesize=10000

In the first (problematic) response, the total_cell_count indicates 66859, which equals the total number of cells across all of the aggregations, but it should be 1 because the length of the cells returned is 1.

In the second (correct) response, the total_cell_count indicates 2941 which is the number of cells returned in cells (which makes sense).

Signs of success

I had a meeting with the South African National Treasury yesterday and almost fell off my chair when a Treasury team member knew about babbage and what it is useful for. It was all thanks to the OpenSpending work with Mexico.

Just wanted to let you guys know that the work you (and folks like @pudo) do has a wide impact. Very gratifying.

Summary and count causes performance issues on large datasets

With very large datasets (e.g. 13m rows), summary and count appear to significantly slow down the response:

babbage/babbage/cube.py

Lines 89 to 96 in 9416105

# Count
count = count_results(self, prep(cuts,
                                 drilldowns=drilldowns,
                                 columns=[1])[0])
# Summary
summary = first_result(self, prep(cuts,
                                  aggregates=aggregates)[0].limit(1))

Without generating summary and count, it's 2-3 times faster to return the response.

It would be useful to make returning these properties optional, e.g. by adding an optional &simple parameter to the request.

CSV return format

All HTTP API queries should be able to return a simple CSV response. In a fancy scenario, this would allow switching between naming the columns after the logical model fields and using the original column names mentioned in the model.

This should use streaming responses, rather than pre-generating the full response.
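
A minimal sketch of a streaming CSV response in Flask, which babbage's HTTP API is built on (the row iterator and field names are illustrative, not babbage's actual internals):

# Stream rows out as they are produced instead of building the whole body in memory.
import csv
import io

from flask import Response

def stream_csv(rows, fieldnames):
    def generate():
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fieldnames)
        writer.writeheader()
        yield buf.getvalue()
        for row in rows:
            buf.seek(0)
            buf.truncate(0)
            writer.writerow(row)
            yield buf.getvalue()
    return Response(generate(), mimetype='text/csv')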

Issue with ordering

While trying to use /aggregate and /facts, this specific line for ordering

ordering, q, bindings = Ordering(self).apply(q, bindings, order)

is generating references to an alias (anon_2) that isn't present in any of the queried tables:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) missing FROM-clause entry for table "anon_2"
LINE 2: FROM "R_kgDOGLhFbw4edfc924721abb77" ORDER BY anon_2.anon_1 A...

and here is the generated sql query

[SQL: SELECT NULL AS anon_1, count("R_kgDOGLhFbw4edfc924721abb77".activity_id) AS _count, sum("R_kgDOGLhFbw4edfc924721abb77"."MONTO_ADEFAS") AS "MONTO_ADEFAS.sum", sum("R_kgDOGLhFbw4edfc924721abb77"."MONTO_APROBADO") AS "MONTO_APROBADO.sum", sum("R_kgDOGLhFbw4edfc924721abb77"."MONTO_DEVENGADO") AS "MONTO_DEVENGADO.sum", sum("R_kgDOGLhFbw4edfc924721abb77"."MONTO_EJERCICIO") AS "MONTO_EJERCICIO.sum", sum("R_kgDOGLhFbw4edfc924721abb77"."MONTO_EJERCIDO") AS "MONTO_EJERCIDO.sum", sum("R_kgDOGLhFbw4edfc924721abb77"."MONTO_MODIFICADO") AS "MONTO_MODIFICADO.sum", sum("R_kgDOGLhFbw4edfc924721abb77"."MONTO_PAGADO") AS "MONTO_PAGADO.sum" 
FROM "R_kgDOGLhFbw4edfc924721abb77" ORDER BY anon_2.anon_1 ASC NULLS LAST, anon_2._count ASC NULLS LAST, anon_2."MONTO_ADEFAS.sum" ASC NULLS LAST, anon_2."MONTO_APROBADO.sum" ASC NULLS LAST, anon_2."MONTO_DEVENGADO.sum" ASC NULLS LAST, anon_2."MONTO_EJERCICIO.sum" ASC NULLS LAST, anon_2."MONTO_EJERCIDO.sum" ASC NULLS LAST, anon_2."MONTO_MODIFICADO.sum" ASC NULLS LAST, anon_2."MONTO_PAGADO.sum" ASC NULLS LAST 
 LIMIT %(param_1)s]
[parameters: {'param_1': 100}]

But if I comment out the ordering line, the API works fine. Is there something I'm doing wrong?
