kuwala-io / kuwala Goto Github PK

Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demographics data b) Point of Interests from Open Street Map c) Google Popular Times

Home Page: https://kuwala.io

License: Apache License 2.0

Python 42.67% Shell 0.03% Jupyter Notebook 3.28% Makefile 0.04% HTML 0.30% JavaScript 51.42% CSS 0.39% R 1.87%

data data-integration data-science open-data spatial-analysis elt kuwala open-source scraping dbt

kuwala's People

Contributors

Stargazers

Watchers

Forkers

sohansputhran floriankuwala piramolk bmacorr alcoholfreebear admariner guutch chrisammon3000 stjordanis abachant avkash mariobehling iam-benyamin nazareno95 obarisk dliofindia adbmd helioxgroup cuulee david30907d arun-kc aditya964 sekmet samduan arifluthfi16 lance0404 visionarylab algonacci eteimz python-repository-hub jeffamaxey bitsnaps argo12 mabubakr007 priya-gittest messidagod teslahenry simitest carlinix mattigrthr jdharminsiddesh cgb-lowcode2022 afiqmuzaffar syedaashirhussain rbt-c syaikhipin bheemaiahnn techxbase bryanchance lcsouzamenezes yunqiqiliang suhib97

kuwala's Issues

CLI: Create and open Jupyter notebook

After the graph database population has been successful, create the Jupyter Notebook environment. When this is created successfully as well, launch it directly from the CLI.

Depends on #42

OSM-POI: Find closest POI for given point (also considering building footprints)

This is a new route that finds the closest POI for a given point either provided as a pair of latitude and longitude or as an H3 index.

For this, the building footprints have to be considered as well. It might be the case, for example, that a queried point is further away from the point location of one POI compared to another one although it is within the building footprint of the first one (e.g., imagine a coffee shop next to a shopping mall when the queried location is still within the shopping mall).

This feature goes hand in hand with #12 and #14.

Can't build docker image due to missing osm-parquetizer

When I try to run step 2 in the readme I get this output:

ERROR: No such service: osm-parquetizer

Running on mac OS Big Sur 11.5.2
docker-compose version 1.29.2, build 5becea4c

OSM-POI: Building footprint for relations

So far we're only generating building footprints for OSM objects of type way.

Relations also need to be supported. Special attention here deserves the case of multi polygons. Some may have holes and others might be separate areas that semantically belong together (e.g., a university complex consisting of several buildings).

The building footprints follow the GeoJSON standard.

The relevant function is in kuwala-pipelines/osm-poi/src/data/processor/pbfParser/geoParser/coordsParser.js called convertMultiPolygon.

The members can be retrieved from the level db which temporarily saves nodes used for building footprints (see convertToGeoJSONCoords).

README: Incomplete project setup/run instructions

There are 2 main points that are missing from the readme file. It is better to properly mention those as they lead to so much frustration during the setup.

There is a dependency on h3-node which requires gyp, make, CMake, and a C compiler (GCC or clang) to be installed on the machine.
There is an installation step missing where everyone should know that first you must run npm ci in the shared folder which is sitting in the root folder in isolation from the other modules yet it is referenced by all other modules. Otheriwse the project won't run.

CLI: Populate graph database

Populate graph database based on user inputs by running and orchestrating all Docker containers.

Depends on #41

Population-Density: Filter results by demographic group

Instead of returning all demographic groups, it should be possible to pass a query parameter to return specific groups.

Population-Density: Wrapper function for R

It would be great if there was a wrapper function to query the data into R easily.
MVP could require (1) coordinates (2) radius (3) granularity in meters.

Notebook: Save results as data set

After running all transformations, the results have to be saved in a dedicated format. In the beginning we should provide CSV, Parquet and JSON.

Depends on #45

Google-POI: Category mapping

The pipeline should also convert the Google category tags into the same high-level categories as it is done in the OSM-POI pipeline. This makes merging the data sources possible also on the category level and enables queries based on categories for Google POIs.

Google-POI: Decode Google's internal placeID encoding to query data using the placeID

Google is using a special encoding in its pb query parameter to get the POI data in a structured matrix format.

We believe that the id that they use in the pb query parameter is the encoded placeID. It might be something else however.

If we can decode that, it would be possible to query the POI data directly with a placeID if that is already known. This would make the search string query obsolete.

Here are some examples that can easily be extended by simply getting the specially encoded id over the /search route and then getting the Google placeID over the /poi-information route:

Google placeID - pb-encoded placeID

ChIJn9YqwQUZ2jERQd1xVpCATwk - 0x31da1905c12ad69f:0x94f80905671dd41
ChIJT4lHCgMZ2jERloDE6WmkdMk - 0x31da19030a47894f:0xc974a469e9c48096
ChIJOeEf9S2vewIRM0B9a06CKwg - 0x27baf2df51fe139:0x82b824e6b7d4033
ChIJxeTGfdpRqEcRvFh20ctiCMM - 0x47a851da7dc6e4c5:0xc30862cbd17658bc
ChIJeXgR8v9RqEcR9aKuzYzqNts - 0x47a851fff2117879:0xdb36ea8ccdaea2f5
ChIJHSGzi_yAhYARjust_4pPIbc - 0x808580fc8bb3211d:0xb7214f8aff2deb8e
ChIJ54twBiJOqEcRFxu9brehHg0 - 0x47a84e2206708be7:0xd1ea1b76ebd1b17
ChIJmc9LEyJOqEcR95Ltsm4XQVU - 0x47a84e22134bcf99:0x5541176eb2ed92f7
ChIJmQJIxlVYwokRLgeuocVOGVU - 0x89c25855c6480299:0x55194ec5a1ae072e
ChIJvVRe4NMEdkgRw3WgQO_J9KM - 0x487604d3e05e54bd:0xa3f4c9ef40a075c3

Population-Density: Show warning and let user retry input when no demographic group has been selected

Traceback (most recent call last):
File "main.py", line 7, in
Processor.start(files, output_dir)
File "/opt/app/src/Processor.py", line 31, in start
map(lambda f: os.path.getsize(f['path'] + os.listdir(f['path'])[0]) / math.pow(1024, 2), files)
TypeError: reduce() of empty sequence with no initial value
ERROR: 1

I am facing the following error while running the docker-compose run population-density command, please throw some light

Core: Use Neo4j Spark connector for importing data

Instead of sending the insertion queries through the Neo4j Python driver, we should use the Spark Neo4j connector to read the preprocessed population and OSM Parquet files and make the inserts. The Spark driver batches the queries automatically, which should speed up the entire population process.

It depends on #47

OSM-POI: Store building footprint as a collection of H3 indexes

Next to the GeoJSON representation, a collection of H3 indexes covering the entire building footprint allows for more performant functionalities such as checking whether a given location is within a building or not.

To have a very accurate representation, the h3.polyfill(coordinates, res, [isGeoJson]) function should be used to get the contained cells at a high resolution (ideally 15 but performance has to be evaluated). Then the h3.compact(h3Set) function clusters cells of higher resolution to a lower resolution that is still within the polygon and keeping high-resolution cells where appropriate. For combining these two steps, a general H3 function, getCompactGeometry(geometry), should be written in shared/js/src/h3Utils/index.js also considering multi polygons.

The compact list of H3 indexes can then be stored as an additional property, buildingFootprintH3.

Population-Density: Only list countries for download where data is available

So far, we have listed all countries using the i18n-iso-countries npm package. However, there is no data available for some countries, e.g., due to bad satellite imagery. Facebook constantly improves its approach and data so that they might add new countries in the future. Therefore, we should use the AWS CLI to only list countries where data is available to avoid an error during the processing.

The command for that is:

aws s3 ls --no-sign-request s3://dataforgood-fb-data/csv/month=2019-06/

This has to adapt to the month partition dynamically.

H3: Check if a cell is contained in the covered area of a collection of H3 indexes

This is a new function to check whether a given cell is contained within the area of a collection of H3 indexes, e.g., an H3 building footprint representation (#12).

The function should be placed under shared/js/src/h3Utils.

OSM-POI: Introduce subcategories

The tags belonging to one of the high-level categories can be further clustered into subcategories. These can then be used as filters in queries (see #16).

The example structure based on kuwala-pipelines/osm-poi/resources/categories.json would look like this (naming for subcategories: highLevelCategory_subcategory):

{
  "sport": {
    "category": "sport",
    "tags": [
      …
    ],
    "subcategories": [
      {
        "category": "sport_fighting",
        "tags": [
           …
        ]
      },
      …
    ]
  },
  …
}

Google-poi project python path not configured and shared folder needs restructuring

Population-Density: Query a hex-grid within defined boundaries and set hex size

That could be potentially more efficient than creating a loop to query each hex in a grid.

Hex size would also help a lot in order to look at bigger regions.

New Data Source: Discover Admin Boundaries from the OSM-Boundaries Project

The OSM-Boundaries Project (https://osm-boundaries.com) did already a great job in making boundary tags of OSM comparable to each other on levels from 2 (country level) - 13 (lowest possible regional level). The docs ( https://osm-boundaries.com/Documentation ) give an idea on how data is extracted.

Goal : The goal is to access the cleaned and verified data from OSM Boundaries and identify blindspots in coverage

Delivery: Statements on blindspot countries and other issues when working with the dataset from OSM boundaries. This will be used on setting next steps to increase data quality and set the plate for an aggregation function that can be run on top of the existing Kuwala data pipelines.

New Pipeline: Instagram post scraper

This is going to be a new pipeline scraping public posts on Instagram with meta information such as location and hashtags.

Implement consistent variable/feature naming

We should ensure consistent naming of variables/features across the pipelines and modules. There are sometimes inconsistencies between camel case and snake case due to Python, JS, and JSON standards. We can achieve this by using a central key-value store for example.

ERROR: The Compose file './docker-compose.yml' is invalid because

when I try to run this command I get following issue.
sudo docker-compose build osm-poi osm-parquetizer


ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for services.admin-boundaries: 'profiles'
Unsupported config option for services.database-importer: 'profiles'
Unsupported config option for services.database-transformer: 'profiles'
Unsupported config option for services.google-poi-api: 'profiles'
Unsupported config option for services.google-poi-pipeline: 'profiles'
Unsupported config option for services.google-trends: 'profiles'
Unsupported config option for services.jupyter: 'profiles'
Unsupported config option for services.osm-parquetizer: 'profiles'
Unsupported config option for services.osm-poi: 'profiles'
Unsupported config option for services.population-density: 'profiles'
Unsupported config option for services.postgres: 'profiles'
Unsupported config option for services.torproxy: 'profiles'

However it runs successfully if I comment out profile and its subitems in .yml file.
Is it right to do so? Would it create a problem ahead?

New Data Source: Include new data source to cover blindspots of current Facebook data sets (population-density)

Facebook doesn't offer its population data for some regions, e.g., due to quality issues in satellite imagery. To cover these current blindspots, we should include another data source.

@tjukanovt proposed a dataset by Kontur which combines the Facebook data with GHSL, OSM, and building footprints data on a 400 m resolution using H3 but truly globally. It can be found on HDX.

If it is a quick win, we could either include that dataset directly or write a pipeline for the GHSL data. It would be a decision based on features and dependency on an intermediary like Kontur. We're open to discussing this with you here, of course!

Population-Density: Process GeoTiffs and save results to Parquet

Since the CSVs are not updated regularly on AWS, we switch to GeoTiffs to always get the latest data. To speed up the entire pipeline, we save the results in Parquet and not in Mongo anymore.

~~Process GeoTiffs using Spark~~

Save transformed results in Parquet
Implement new country selection

OSM-POI: Expose GraphQL endpoint

Next to the REST-API, the data should be exposed over a GraphQL endpoint to make requests more flexible.

There is a great guide on how to get started in Express.js on the official GraphQL website.

Also, see #3.

Include GeoNames to verify admin boundaries

We are currently building the admin boundary hierarchy of a country based on OSM data. One common use case is to group data by city. Since the number of the admin levels on OSM have different meanings in different countries (e.g., admin level 8 might be a city in the US, but in Germany admin level 6 might be used for cities), we should include another data source for determining city names.

GeoNames is an open database of worldwide country and city names. We can use their data dumps to get relevant city names of countries worldwide and match them with our OSM admin boundaries.

OSM-POI: Include brand property

OSM objects may include a brand or operator tag, which you can use to derive the brand of a POI.

The issue that exists is that the values of those tags can be spelled differently across several entities (e.g., "McDonalds", "Mc Donald's", or "McDonald's").

There exists a repo that tries to unify the spelling across OSM: https://github.com/osmlab/name-suggestion-index/.

Otherwise, it is an option to find a clean list of worldwide brand names and use string distance measures to connect a POI to a brand.

Population-Density: Expose GraphQL endpoint

Next to the REST-API, the data should be exposed over a GraphQL endpoint to make requests more flexible.

There is a great guide on how to get started in Express.js on the official GraphQL website.

Also, see #18.

Notebook: Set and run transformation

There should be a set of pre-written UDFs that users can pick from. In the beginning, we will definitely provide an aggregation UDF based on the H3 granularity.

Through custom code additional transformations can be written, and even other data (e.g., internal data from a user's company) can be loaded in as well.

Depends on #44

Population-Density: Getting population density per square mile

There should be a query parameter for each route to return the population density per square kilometer.

The area of the queried region can be calculated using H3's .cellArea(h3Index, unit) method.

The density should be calculated for each group.

Depending on the status of #2, the calculation can be limited to specifically requested demographic groups.

Postgres connector

User story

As a user, I quickly want to connect my Postgres data warehouse with Kuwala to start applying transformations. I only want to put in my credentials and establish the connection. Once connected, I want to see the database schema to see all available tables. For every existing table, I want to see a preview of the data and the column types.

Acceptance criteria

The user can establish a connection to their Postgres data warehouse by providing the necessary credentials.
The user can retrieve the database schema and explore all available tables.
The user can preview the data in the table and its data types.

Non-functional requirements

Connecting to the Postgres takes less than 5 seconds.
Loading the list of all available tables takes less than 5 seconds.
Loading the preview of a table takes less than 5 seconds.
In the preview of a table, up to 200 columns are displayed instantly.
In the preview of a table, up to 300 rows are displayed instantly.

Resources and technical details

Airbyte offers open-source data connectors, and it has a source connector for Postgres, which we could integrate.

Airbyte Postgres connector

The leanest solution will be to use PySpark or pip packages like psycopg2 as we don't actually sync data to a destination.

API routes

Get all data sources
Select data sources
Get all selected data sources
Test connection to data source
Save credentials for data source
Get hierarchy of database (schema, category, tables)
Get preview of data in table

Designs

The icons can be found on Font Awesome.

Empty canvas

Clicking Add data source in the sidebar will lead to the Data catalog page under Data Overview.

Data catalog

Clicking on Canvas in the top navigation will go to the canvas. Until a data source is selected, the next button is disabled.

Data catalog - selected data source

When clicking on a data source, it is selected and highlighted.
When clicking on next, the selected pipelines will be added to the Data Pipeline Management overview.

Data pipeline management

When clicking on Add a new data source, the user gets back to the data catalog. When clicking on Configure, the user gets to the Data Pipeline Configuration screen.

Data pipeline configuration

When clicking on Back, the user aborts the configuration. The Save button is disabled if not all input fields are filled out.

Data pipeline configuration - test connection

When clicking on Test connection is successful, a success message is displayed and the status dot of the logo turns green. If all input fields have been filled out, the configuration can be saved.

Data pipeline management - active pipeline

If the configuration was successful, the status of the pipeline changes to active, and a Preview data button is displayed.

Data pipeline preview - unselected

The left-hand menu shows the nested structure of the connected database.

Data pipeline preview - selected table

Tables have to be selected under schema -> category -> table. If a table has been selected, the preview of the data is shown. The side bar and rows are scrollable. The column headers are fixed at the top and horizontally scrollable.

Could not build CLI Docker Image

name: 🛠 Bug report
title: [Bug Report] Could not build CLI Docker Image
labels: bug

Description

File path is wrong, resulted in docker build failure

Steps to Reproduce

docker buildx build -t xxx --file kuwala/core/cli/dockerfile kuwala

Actual Behavior

https://github.com/david30907d/kuwala/runs/4556295883?check_suite_focus=true

More Information

Fixed in #65

CLI: Read user inputs

Define flow and all input parameters
Implement CLI

OSM-POI: Download fails for countries without subregions

When downloading OSM files for countries without subregions, the downloader throws an error because it creates an invalid URL.

Also, when selecting a country behind the first page of a country list, an incorrect country is selected from the first page.

Example: Europe -> Malta (tries to download Albania with which has no subregions.

[1] Existing download
[2] New download
[0] Cancel

Do you want to make a new download? [1, 2, 0]: 2

[1] africa
[2] antarctica
[3] asia
[4] australia-oceania
[5] central-america
[6] europe
[7] north-america
[8] south-america
[0] Cancel

Which continent are you interested in? [1...8 / 0]: 6

[1] Download all
[2] albania
[3] alps
[4] andorra
[5] austria
[6] azores
[7] belarus
[8] belgium
[9] bosnia-herzegovina
[a] britain-and-ireland
[b] bulgaria
[c] croatia
[d] cyprus
[e] czech-republic
[f] dach
[g] denmark
[h] estonia
[i] faroe-islands
[j] finland
[k] france
[l] georgia
[m] germany
[n] great-britain
[o] greece
[p] guernsey-jersey
[q] hungary
[r] iceland
[s] ireland-and-northern-ireland
[t] isle-of-man
[u] italy
[v] kosovo
[w] latvia
[x] liechtenstein
[y] lithuania
[z] luxembourg
[0] Next page

Which region are you interested in? [1...9, a...z, 0]: 0

[1] macedonia
[2] malta
[3] moldova
[4] monaco
[5] montenegro
[6] netherlands
[7] norway
[8] poland
[9] portugal
[a] romania
[b] serbia
[c] slovakia
[d] slovenia
[e] spain
[f] sweden
[g] switzerland
[h] turkey
[i] ukraine
[0] Cancel

Which region are you interested in? [1...9, a...i, 0]: 2
Error: Request failed with status code 404
    at createError (/opt/osm-poi/node_modules/axios/lib/core/createError.js:16:15)
    at settle (/opt/osm-poi/node_modules/axios/lib/core/settle.js:17:12)
    at IncomingMessage.handleStreamEnd (/opt/osm-poi/node_modules/axios/lib/adapters/http.js:260:11)
    at IncomingMessage.emit (node:events:377:35)
    at endReadableNT (node:internal/streams/readable:1312:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  config: {
    url: 'https://download.geofabrik.de/europe/albania',
    method: 'get',
    headers: {
      Accept: 'application/json, text/plain, */*',
      'User-Agent': 'axios/0.21.1'
    },
    transformRequest: [ [Function: transformRequest] ],
    transformResponse: [ [Function: transformResponse] ],
    timeout: 0,
    adapter: [Function: httpAdapter],
    xsrfCookieName: 'XSRF-TOKEN',
    xsrfHeaderName: 'X-XSRF-TOKEN',
    maxContentLength: -1,
    maxBodyLength: -1,
    validateStatus: [Function: validateStatus],
    data: undefined
  },
  request: <ref *1> ClientRequest {
    _events: [Object: null prototype] {
      abort: [Function (anonymous)],
      aborted: [Function (anonymous)],
      connect: [Function (anonymous)],
      error: [Function (anonymous)],
      socket: [Function (anonymous)],
      timeout: [Function (anonymous)],
      prefinish: [Function: requestOnPrefinish]
    },
    _eventsCount: 7,
    _maxListeners: undefined,
    outputData: [],
    outputSize: 0,
    writable: true,
    destroyed: false,
    _last: true,
    chunkedEncoding: false,
    shouldKeepAlive: false,
    _defaultKeepAlive: true,
    useChunkedEncodingByDefault: false,
    sendDate: false,
    _removedConnection: false,
    _removedContLen: false,
    _removedTE: false,
    _contentLength: 0,
    _hasBody: true,
    _trailer: '',
    finished: true,
    _headerSent: true,
    _closed: false,
    socket: TLSSocket {
      _tlsOptions: [Object],
      _secureEstablished: true,
      _securePending: false,
      _newSessionPending: false,
      _controlReleased: true,
      secureConnecting: false,
      _SNICallback: null,
      servername: 'download.geofabrik.de',
      alpnProtocol: false,
      authorized: true,
      authorizationError: null,
      encrypted: true,
      _events: [Object: null prototype],
      _eventsCount: 10,
      connecting: false,
      _hadError: false,
      _parent: null,
      _host: 'download.geofabrik.de',
      _readableState: [ReadableState],
      _maxListeners: undefined,
      _writableState: [WritableState],
      allowHalfOpen: false,
      _sockname: null,
      _pendingData: null,
      _pendingEncoding: '',
      server: undefined,
      _server: null,
      ssl: [TLSWrap],
      _requestCert: true,
      _rejectUnauthorized: true,
      parser: null,
      _httpMessage: [Circular *1],
      [Symbol(res)]: [TLSWrap],
      [Symbol(verified)]: true,
      [Symbol(pendingSession)]: null,
      [Symbol(async_id_symbol)]: 786,
      [Symbol(kHandle)]: [TLSWrap],
      [Symbol(kSetNoDelay)]: false,
      [Symbol(lastWriteQueueSize)]: 0,
      [Symbol(timeout)]: null,
      [Symbol(kBuffer)]: null,
      [Symbol(kBufferCb)]: null,
      [Symbol(kBufferGen)]: null,
      [Symbol(kCapture)]: false,
      [Symbol(kBytesRead)]: 0,
      [Symbol(kBytesWritten)]: 0,
      [Symbol(connect-options)]: [Object],
      [Symbol(RequestTimeout)]: undefined
    },
    _header: 'GET /europe/albania HTTP/1.1\r\n' +
      'Accept: application/json, text/plain, */*\r\n' +
      'User-Agent: axios/0.21.1\r\n' +
      'Host: download.geofabrik.de\r\n' +
      'Connection: close\r\n' +
      '\r\n',
    _keepAliveTimeout: 0,
    _onPendingData: [Function: nop],
    agent: Agent {
      _events: [Object: null prototype],
      _eventsCount: 2,
      _maxListeners: undefined,
      defaultPort: 443,
      protocol: 'https:',
      options: [Object: null prototype],
      requests: [Object: null prototype] {},
      sockets: [Object: null prototype],
      freeSockets: [Object: null prototype] {},
      keepAliveMsecs: 1000,
      keepAlive: false,
      maxSockets: Infinity,
      maxFreeSockets: 256,
      scheduling: 'lifo',
      maxTotalSockets: Infinity,
      totalSocketCount: 1,
      maxCachedSessions: 100,
      _sessionCache: [Object],
      [Symbol(kCapture)]: false
    },
    socketPath: undefined,
    method: 'GET',
    maxHeaderSize: undefined,
    insecureHTTPParser: undefined,
    path: '/europe/albania',
    _ended: true,
    res: IncomingMessage {
      _readableState: [ReadableState],
      _events: [Object: null prototype],
      _eventsCount: 3,
      _maxListeners: undefined,
      socket: [TLSSocket],
      httpVersionMajor: 1,
      httpVersionMinor: 1,
      httpVersion: '1.1',
      complete: true,
      rawHeaders: [Array],
      rawTrailers: [],
      aborted: false,
      upgrade: false,
      url: '',
      method: null,
      statusCode: 404,
      statusMessage: 'Not Found',
      client: [TLSSocket],
      _consuming: false,
      _dumped: false,
      req: [Circular *1],
      responseUrl: 'https://download.geofabrik.de/europe/albania',
      redirects: [],
      [Symbol(kCapture)]: false,
      [Symbol(kHeaders)]: [Object],
      [Symbol(kHeadersCount)]: 10,
      [Symbol(kTrailers)]: null,
      [Symbol(kTrailersCount)]: 0,
      [Symbol(RequestTimeout)]: undefined
    },
    aborted: false,
    timeoutCb: null,
    upgradeOrConnect: false,
    parser: null,
    maxHeadersCount: null,
    reusedSocket: false,
    host: 'download.geofabrik.de',
    protocol: 'https:',
    _redirectable: Writable {
      _writableState: [WritableState],
      _events: [Object: null prototype],
      _eventsCount: 2,
      _maxListeners: undefined,
      _options: [Object],
      _ended: true,
      _ending: true,
      _redirectCount: 0,
      _redirects: [],
      _requestBodyLength: 0,
      _requestBodyBuffers: [],
      _onNativeResponse: [Function (anonymous)],
      _currentRequest: [Circular *1],
      _currentUrl: 'https://download.geofabrik.de/europe/albania',
      [Symbol(kCapture)]: false
    },
    [Symbol(kCapture)]: false,
    [Symbol(kNeedDrain)]: false,
    [Symbol(corked)]: 0,
    [Symbol(kOutHeaders)]: [Object: null prototype] {
      accept: [Array],
      'user-agent': [Array],
      host: [Array]
    }
  },
  response: {
    status: 404,
    statusText: 'Not Found',
    headers: {
      date: 'Sun, 30 May 2021 13:00:01 GMT',
      server: 'Apache',
      'content-length': '196',
      connection: 'close',
      'content-type': 'text/html; charset=iso-8859-1'
    },
    config: {
      url: 'https://download.geofabrik.de/europe/albania',
      method: 'get',
      headers: [Object],
      transformRequest: [Array],
      transformResponse: [Array],
      timeout: 0,
      adapter: [Function: httpAdapter],
      xsrfCookieName: 'XSRF-TOKEN',
      xsrfHeaderName: 'X-XSRF-TOKEN',
      maxContentLength: -1,
      maxBodyLength: -1,
      validateStatus: [Function: validateStatus],
      data: undefined
    },
    request: <ref *1> ClientRequest {
      _events: [Object: null prototype],
      _eventsCount: 7,
      _maxListeners: undefined,
      outputData: [],
      outputSize: 0,
      writable: true,
      destroyed: false,
      _last: true,
      chunkedEncoding: false,
      shouldKeepAlive: false,
      _defaultKeepAlive: true,
      useChunkedEncodingByDefault: false,
      sendDate: false,
      _removedConnection: false,
      _removedContLen: false,
      _removedTE: false,
      _contentLength: 0,
      _hasBody: true,
      _trailer: '',
      finished: true,
      _headerSent: true,
      _closed: false,
      socket: [TLSSocket],
      _header: 'GET /europe/albania HTTP/1.1\r\n' +
        'Accept: application/json, text/plain, */*\r\n' +
        'User-Agent: axios/0.21.1\r\n' +
        'Host: download.geofabrik.de\r\n' +
        'Connection: close\r\n' +
        '\r\n',
      _keepAliveTimeout: 0,
      _onPendingData: [Function: nop],
      agent: [Agent],
      socketPath: undefined,
      method: 'GET',
      maxHeaderSize: undefined,
      insecureHTTPParser: undefined,
      path: '/europe/albania',
      _ended: true,
      res: [IncomingMessage],
      aborted: false,
      timeoutCb: null,
      upgradeOrConnect: false,
      parser: null,
      maxHeadersCount: null,
      reusedSocket: false,
      host: 'download.geofabrik.de',
      protocol: 'https:',
      _redirectable: [Writable],
      [Symbol(kCapture)]: false,
      [Symbol(kNeedDrain)]: false,
      [Symbol(corked)]: 0,
      [Symbol(kOutHeaders)]: [Object: null prototype]
    },
    data: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n' +
      '<html><head>\n' +
      '<title>404 Not Found</title>\n' +
      '</head><body>\n' +
      '<h1>Not Found</h1>\n' +
      '<p>The requested URL was not found on this server.</p>\n' +
      '</body></html>\n'
  },
  isAxiosError: true,
  toJSON: [Function: toJSON]
}

Linting for `kuwala/pipelines`

@david30907d created a PR #77 for setting up linting and formatting rules for the Python code.

The current GitHub actions check kuwala/common and kuwala/core. We should also fix any linting exceptions in kuwala/pipelines and run the GitHub action for this as well.

OSM-POI: Filter results by category

This is a new query parameter introduced for all routes used to filter the results by category. The high-level and subcategories are valid values for the query parameter (for subcategories see #17).

An additional query parameter must be introduced to determine whether the results must include all or just at least one of the queried categories.

OSM-POI: SIGKILL when processing OSM download

Thanks everybody again for this tool! Out of the box, I couldn't get OSM data processing to work either. The download works fine, but the processing step most likely causes Node to run out of memory or something similar:

[1] Existing download
[2] New download
[0] Cancel

Do you want to make a new download? [1, 2, 0]: 1

[1] europe
[0] Cancel

Which region are you interested in? [1/0]: 1

[1] finland-latest.osm.pbf
[0] Cancel

Which region are you interested in? [1/0]: 1
Running: 0s - 0 objects added
Running: 7m47s - 233807 objects addednpm ERR! path /opt/app
npm ERR! command failed
npm ERR! signal SIGKILL
npm ERR! command sh -c NODE_ENV=local node --max-old-space-size=8192 src/data/main.js

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2021-06-10T11_17_00_519Z-debug.log
ERROR: 1

Solve setup difficulties by providing an easy setup process through docker and docker-compose

Here is the core of what I am planning to do in order to provide an easier way of running the Kuwala's pipeline projects and streamline the setup/run of all potential future pipeline additions.

Restructure the scope/context of docker, docker-compose, and shared folder
Utilize profiles in docker-compose in order to help running essential services and services on demand
Create a dockerfile for each of the projects, and configure them as runnable services in the docker-compose file.
Enhance volume mapping for db and for downloaded/created files.
Update all Readme files to catch up to all the updates above.

Population-Density: Handle re-processing of previously processed countries

When processing the data, currently, it is only checked whether the files including the raw data have previously been downloaded for a country by checking the existence of the related folder under tmp/.

When a country has previously been processed and saved to the database, you should be able to either:

Overwrite the existing, add new, and delete obsolete cells
Abort the processing

Google-POI: Populate Google POI data in the graph database

Extend Graph model
Setup scraping pipeline based on OSM data
Save results in Parquet format
Populate graph
Dockerize scraping process (run API and scraping script)

Population-Density: Unable to download

Hello everybody, and thanks for creating a promising new tool! In the process of benchmarking such tools right now, just a quick tryout out of the box resulted in this issue

[z] IT - Italy
[0] Next page

Please select a country! [1...9, a...z, 0]: 2
Downloading data for Finland
Error: No data available for FIN
    at /opt/app/src/utils/countryPicker/index.js:19:17
    at new Promise (<anonymous>)
    at downloadFiles (/opt/app/src/utils/countryPicker/index.js:7:12)
    at /opt/app/src/utils/countryPicker/index.js:86:23
    at new Promise (<anonymous>)
    at Object.pick (/opt/app/src/utils/countryPicker/index.js:41:12)
    at /opt/app/src/data/processor/index.js:101:49
    at new Promise (<anonymous>)
    at Object.start (/opt/app/src/data/processor/index.js:99:12)
    at /opt/app/src/data/main.js:43:41
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

The problem was mentioned in #23 (comment) , but that PR is already merged, so apparently it didn't resolve the issue for all users.

OSM-POI: Switch to Python and use Parquet + Spark

To use Spark for the transformations, we switch to Python for this pipeline.

We can use this repo to parse the OSM pbf files to Parquet files.

Running demo on Windows

Hello!
I found that if you run the demo using this line

cd kuwala/scripts && sh initialize_core_components.sh && sh run_cli.sh

On docker desktop for Windows, it founds an error:

sh: 0: Can't open build_neo4j.sh
sh: 0: Can't open build_cli.sh
build_jupyter_notebook.sh: 1: cd: can't cd to ..
Traceback (most recent call last):
  File "urllib3/connectionpool.py", line 677, in urlopen
  File "urllib3/connectionpool.py", line 392, in _make_request
  File "http/client.py", line 1277, in request
  File "http/client.py", line 1323, in _send_request
  File "http/client.py", line 1272, in endheaders
  File "http/client.py", line 1032, in _send_output
  File "http/client.py", line 972, in send
  File "docker/transport/unixconn.py", line 43, in connect
PermissionError: [Errno 13] Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "requests/adapters.py", line 449, in send
  File "urllib3/connectionpool.py", line 727, in urlopen
  File "urllib3/util/retry.py", line 410, in increment
  File "urllib3/packages/six.py", line 734, in reraise
  File "urllib3/connectionpool.py", line 677, in urlopen
  File "urllib3/connectionpool.py", line 392, in _make_request
  File "http/client.py", line 1277, in request
  File "http/client.py", line 1323, in _send_request
  File "http/client.py", line 1272, in endheaders
  File "http/client.py", line 1032, in _send_output
  File "http/client.py", line 972, in send
  File "docker/transport/unixconn.py", line 43, in connect
urllib3.exceptions.ProtocolError: ('Connection aborted.', PermissionError(13, 'Permission denied'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "docker/api/client.py", line 214, in _retrieve_server_version
  File "docker/api/daemon.py", line 181, in version
  File "docker/utils/decorators.py", line 46, in inner
  File "docker/api/client.py", line 237, in _get
  File "requests/sessions.py", line 543, in get
  File "requests/sessions.py", line 530, in request
  File "requests/sessions.py", line 643, in send
  File "requests/adapters.py", line 498, in send
requests.exceptions.ConnectionError: ('Connection aborted.', PermissionError(13, 'Permission denied'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 81, in main
  File "compose/cli/main.py", line 200, in perform_command
  File "compose/cli/command.py", line 70, in project_from_options
  File "compose/cli/command.py", line 153, in get_project
  File "compose/cli/docker_client.py", line 43, in get_client
  File "compose/cli/docker_client.py", line 170, in docker_client
  File "docker/api/client.py", line 197, in __init__
  File "docker/api/client.py", line 222, in _retrieve_server_version
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', PermissionError(13, 'Permission denied'))
[7178] Failed to execute script docker-compose

I assume that the bash script is not fully compatible with Windows filesystem... Any idea for this?

Thank you ^^

OSM-POI: Calculate area of building footprint

This is a new property for POIs with a building footprint. Based on the H3 representation of the building footprint, you can calculate the area by summing up the exact area of each cell using h3.cellArea(h3Index, unit).

The area calculation should be done in a general H3 function under shared/js/src/h3Utils.

This issue is dependant on #12.

The resolution of the polyfill is calculated based on the radius
Find a compact H3 representation (mixed resolutions for best approximation) based on the high-resolution polyfill

Notebook: Show data schema

Display the schema of the populated graph to understand the connections between the different data sources in order to write suitable transformations.