arweaveteam / gateway
License: MIT License
Currently, the codebase uses binary trees (BTREE) for indices. This can be optimized by implementing Generalized Search Trees (GIST) instead, which will help scale tag searches and reduce index bloat.

There are a couple of moving parts here. We would need to enable GIST support by running the SQL command:

CREATE EXTENSION btree_gist;

In development this is fairly trivial, as you can just ssh into Postgres and run the command. With docker-compose, however, we would need a boot SQL script to run this command as an administrator. After that, all indices on VARCHAR columns should be upgraded to GIST, as opposed to HASH or BTREE.
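As a sketch of what that migration could look like (assuming knex migrations; the index and column names are illustrative, not the gateway's actual schema):

```ts
// Hypothetical knex migration: enable btree_gist, then swap a VARCHAR
// index over to GIST. btree_gist provides GiST operator classes for
// scalar types, so plain VARCHAR columns become indexable USING gist.
import { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  // Needs to run with sufficient privileges (e.g. from a boot script).
  await knex.raw('CREATE EXTENSION IF NOT EXISTS btree_gist;');
  await knex.raw('DROP INDEX IF EXISTS transactions_target_idx;');
  await knex.raw(
    'CREATE INDEX transactions_target_gist_idx ON transactions USING gist (target);'
  );
}

export async function down(knex: Knex): Promise<void> {
  await knex.raw('DROP INDEX IF EXISTS transactions_target_gist_idx;');
}
```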
To ensure that chunk data is seeded onto the network reliably and not lost in the case of node failure, the gateway should send a single /chunk request to multiple peer nodes. I mentioned this in #8 but wasn't too clear on it.

On a smaller note, sending a POST to https://gateway.amplify.host/chunk yields a Request type not found response, so chunk seeding actually isn't possible atm. :)
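For reference, a minimal sketch of the multi-peer seeding idea (the peer list and acceptance threshold are assumptions for illustration):

```ts
// Fan a received chunk out to several peers so the data survives a
// single node failure. Peers and retry policy are illustrative.
import fetch from 'node-fetch';

const PEERS = ['http://peer1:1984', 'http://peer2:1984', 'http://peer3:1984'];

export async function broadcastChunk(chunk: object): Promise<number> {
  const results = await Promise.allSettled(
    PEERS.map((peer) =>
      fetch(`${peer}/chunk`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(chunk),
      })
    )
  );
  // Number of peers that accepted the chunk; callers could retry or
  // alert when this falls below a safety threshold.
  return results.filter((r) => r.status === 'fulfilled' && r.value.ok).length;
}
```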
I sent the same request to https://arweave.net/graphql and to my own v1.0.0 installation:
query {
transactions(
first: 2
ids: [
"qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg",
"c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s"
]
) {
edges {
node {
block {
id
height
}
id
}
}
}
}
arweave.net replies:
{
"data": {
"transactions": {
"edges": [
{
"node": {
"block": {
"id": "ImTHSwo1yrcUM35-Kd5UiZs5mCXebgHueroxWuXq_yaYLE3CU6_1agcv44aSOhCG",
"height": 132702
},
"id": "c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s"
}
},
{
"node": {
"block": {
"id": "xbBn0jc4zrqAXmDqczhkzypG0G1b_Nj0XqJZ73UETsQNb0lIc3WIXaVvvv0YtOJS",
"height": 97283
},
"id": "qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg"
}
}
]
}
}
}
but my installation replies:
{
"data": {
"transactions": {
"edges": [
{
"node": {
"block": {
"id": "ImTHSwo1yrcUM35-Kd5UiZs5mCXebgHueroxWuXq_yaYLE3CU6_1agcv44aSOhCG",
"height": 132702
},
"id": "c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s"
}
}
]
}
}
}
You can see that the transaction qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg is missing on my server.
Moreover:
select count(*) from transactions where id='c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s';
returns 1.
select count(*) from transactions where id='qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg';
returns 0.
This could be explained by my server being only partially synchronized, but my server does show a transaction at a higher block height. So, there is a bug.
Knex transactions should be phased out in favor of the COPY command.

- Entries should be generated in a /cache folder.
- Entries are flushed and sync progress is updated after each successful COPY.
- Make sure the implementation is redundant (fault-tolerant) as well as performant.

A sketch of the COPY flush follows below.
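Assuming the pg and pg-copy-streams packages and an illustrative transactions table:

```ts
// Stream a cached CSV batch into Postgres with COPY, which is far
// faster than row-by-row INSERTs through knex.
import { createReadStream } from 'fs';
import { pipeline } from 'stream/promises';
import { Client } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';

export async function flushBatch(client: Client, csvPath: string): Promise<void> {
  const copyStream = client.query(
    copyFrom('COPY transactions FROM STDIN WITH (FORMAT csv, HEADER true)')
  );
  await pipeline(createReadStream(csvPath), copyStream);
  // On success the caller should delete the cache entry and persist the
  // new sync height, so a crash never replays or loses a batch.
}
```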
Currently, the gateway parses and indexes ANS-102 bundles, which allows them to be queried through the GraphQL API. However, the gateway doesn't store the data of these data items for serving to clients. The gateway needs to store data item data (possibly compressed in some way), as data items cannot be retrieved from miners efficiently. The gateway should store and serve data item data to clients.
The Arweave HTTP API lets you retrieve the wallet balance of addresses on Arweave, but the gateway doesn't proxy these requests to nodes. It would be cool to be able to query balances using the GraphQL API.

I wonder how a trustless implementation of this would work. Performing an aggregation query over the indexed transactions to compute a wallet balance might not scale.
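A plain proxy would at least cover the HTTP side. A sketch, assuming an Express route and a trusted node URL (both illustrative):

```ts
// Proxy wallet balance lookups to a node. The node's
// GET /wallet/{address}/balance endpoint returns the balance in winston.
import express from 'express';
import fetch from 'node-fetch';

const NODE_URL = 'http://localhost:1984';
const app = express();

app.get('/wallet/:address/balance', async (req, res) => {
  const upstream = await fetch(`${NODE_URL}/wallet/${req.params.address}/balance`);
  res.status(upstream.status).send(await upstream.text());
});
```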
The Gateway does not support "Range" requests. Range requests are required for streaming video and music on iOS Safari.

Use the following video to test. Transaction ID: qgDlmUVFGtzAleAK5QuaNBN4EF8ZUMTAaM0ElluDDr8
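A minimal sketch of the 206/Content-Range handshake Safari expects (buffer-based for brevity; a real implementation would stream, and a full one would also handle suffix ranges):

```ts
// Honor "Range: bytes=start-end" when serving transaction data.
import express from 'express';

export function sendWithRange(
  req: express.Request,
  res: express.Response,
  data: Buffer,
  contentType: string
): void {
  res.setHeader('Accept-Ranges', 'bytes');
  res.setHeader('Content-Type', contentType);
  const match = /^bytes=(\d*)-(\d*)$/.exec(req.headers.range ?? '');
  if (!match) {
    res.send(data);
    return;
  }
  const start = match[1] ? parseInt(match[1], 10) : 0;
  const end = match[2] ? parseInt(match[2], 10) : data.length - 1;
  res.status(206); // Partial Content
  res.setHeader('Content-Range', `bytes ${start}-${end}/${data.length}`);
  res.send(data.subarray(start, end + 1));
}
```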
Create a new command, dev:verify, that:

- Starts at the latest block and works down the chain.
- Verifies that each transaction and tag is indexed in the database; if not, inserts it.
- Shows a progress bar indicating the ETA to completion.
Sometimes the request
query {
transactions(
first: 1
tags: [{ name: "Content-Type", values: ["text/pdf"] }]
) {
edges {
cursor
node {
id
owner {
address
}
tags {
name
value
}
}
}
}
}
is answered with an erroneous empty "edges": [] response:
{
"data": {
"transactions": {
"edges": []
}
}
}
The empty edges responses appear to occur completely at random. docker-compose did not crash while this bug was happening.

Gateway commit: f6d5d370806b8efe808218ccbcf3f4592afc25e9
root@many-files:~# tail -f ~manyfiles/arweave.log
server_1 |
server_1 | info: [database] could not retrieve tx o9a-tUOc0RtS0WEf6jdVrXmnqMDYG9DHWXv0Sx9hOF8 at height 325855 , missing tx stored in .rescan
server_1 |
server_1 | info: [database] could not retrieve tx Q5jlQh4ZEwthUsOod6NMuEitAu1t6x8oQzbd9NkZS_Q at height 325914 , attempting to retrieve again
server_1 |
server_1 | info: [database] could not retrieve tx pISJjCTPD6GjhMxMccjy1TbkvXbb-XSVWbL1drRpicc at height 325914 , attempting to retrieve again
server_1 |
server_1 | info: [database] could not retrieve tx Q5jlQh4ZEwthUsOod6NMuEitAu1t6x8oQzbd9NkZS_Q at height 325914 , missing tx stored in .rescan
server_1 |
server_1 | info: [database] could not retrieve tx pISJjCTPD6GjhMxMccjy1TbkvXbb-XSVWbL1drRpicc at height 325914 , missing tx stored in .rescan
server_1 |
server_1 | info: [database] could not retrieve tx umfo0SXYxufnWuXsv8F57x0zS2VBQ9yYysCjiFtDuoo at height 325971 , attempting to retrieve again
server_1 |
server_1 | info: [database] could not retrieve tx umfo0SXYxufnWuXsv8F57x0zS2VBQ9yYysCjiFtDuoo at height 325971 , missing tx stored in .rescan
Be able to query transactions by status.
Add a new status filter to the GraphQL Schema.
Legacy code from the AWS version.
query {
transactions(status: "any" | "confirmed" | "pending") {
edges {
node {
id
}
}
}
}
In theory someone could have a path that is "drop db". Just putting this up here for reference.
If I add "Content-Type"
to INDICES
, it produces nonsensical [object Object]
at transactions.Content-Type
column in DB what break queries' results:
INDICES=["App-Name", "app", "domain", "namespace"]
Add comments to the user not to do that.
I am syncing the blockchain third time because of bugs.
Existing snapshot functionality is redundant.

- Migrate all snapshot generation guides to the \COPY command.
- Update all documentation to reflect using PG COPY.
- Remove all code related to legacy snapshots.
Currently, it's possible to set the "Content-Type" tag as the transaction data content type. It's used as an HTTP header when the transaction data is rendered in a browser.
It would be nice to enable support for all the other "Content-*" headers (especially "Content-Encoding"). This would be very useful because senders could compress their data with gzip and set "Content-Encoding: gzip"; browsers would then automatically decompress the tx data while rendering it. This approach could save more than 50% on tx rewards.
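A sketch of what forwarding whitelisted Content-* tags could look like (the whitelist is an assumption, not an agreed spec):

```ts
// Map whitelisted Content-* tags onto HTTP response headers when the
// transaction data is served.
const FORWARDED_HEADERS = ['Content-Type', 'Content-Encoding', 'Content-Language'];

export function applyContentHeaders(
  res: { setHeader(name: string, value: string): void },
  tags: { name: string; value: string }[]
): void {
  for (const tag of tags) {
    if (FORWARDED_HEADERS.includes(tag.name)) {
      res.setHeader(tag.name, tag.value);
    }
  }
}
```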
The arweave.net gateway supports creating and uploading transactions. It additionally seeds uploaded chunks to its peers to ensure that data is not lost in the event of a node failure. This gateway currently doesn't support that.
I am trying to send the custom headers request-public-key and x-request-signature to https://<gateway>/<trxId> from Chrome.

I tried to add CORS so that it can accept cross-site requests with the custom headers mentioned above; refer to the pull request. This potentially fixed the problem when running without Docker via yarn start or npm run dev:start, but when the gateway is running inside a Docker container via sudo docker container up --build -d, the browser still reports a CORS error.

To reproduce the problem:
1. sudo docker container up --build -d
2. Send a GET request to https://<gateway>/<trxId> from the browser and append some custom headers like headers: { "key1": "value1" } — the browser reports a CORS error.

Without Docker:
1. yarn start or npm run dev:start
2. Send the same GET request from the browser with the same custom headers — the response is 200.

My main concern is why it works and returns 200 when run without Docker, but fails with a CORS error when the application is containerized.

Docker version: Docker version 20.10.7, build f0df350
docker-compose version: docker-compose version 1.29.2, build 5becea4c
OS: Ubuntu 20.04 LTS
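For reference, a sketch of an explicit CORS policy that admits these headers, using the cors middleware (the configuration values are illustrative). Since the middleware itself is runtime-agnostic, comparing the effective environment/config inside the container would be the next debugging step:

```ts
// Allow cross-site requests carrying the custom headers; the cors
// middleware also answers the OPTIONS preflight automatically.
import express from 'express';
import cors from 'cors';

const app = express();
app.use(
  cors({
    origin: '*',
    allowedHeaders: ['request-public-key', 'x-request-signature', 'Content-Type'],
  })
);
```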
All active queries that require sorting should target tables in sixth normal form. Primary-key-indexed data storing regular table metadata should be in second normal form. Sixth normal form helps ensure optimal performance on these queries.

General guideline for restructuring the tables (subject to change during implementation; see the DDL sketch below):

- The primary key for the blocks table is changed to height, an integer.
- The transactions table is converted into an auto-increment table containing id as an integer and tx_hash as the actual transaction id.
- Key metadata for transactions and tags is refactored to reference this id as an integer instead of a varchar.
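Illustrative DDL for the restructuring, kept as raw SQL strings in a TypeScript module (names and types are assumptions sketching the idea, not a final schema):

```ts
// Hypothetical target schema: blocks keyed by integer height,
// transactions keyed by a surrogate integer id, with the base64url tx
// hash stored alongside it and tags referencing the integer id.
export const restructureSql = `
  CREATE TABLE transactions_v2 (
    id      BIGSERIAL PRIMARY KEY,
    tx_hash VARCHAR(64) NOT NULL UNIQUE
  );

  CREATE TABLE tags_v2 (
    tx_id BIGINT NOT NULL REFERENCES transactions_v2 (id),
    name  TEXT   NOT NULL,
    value TEXT   NOT NULL
  );
`;
```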
Hi,
I found that it's currently not possible to filter for mempool-only txs; they are included in the normal results. I tried block: {max: 1} and block: {max: 0}, but neither works.
This would be useful in some cases. May I suggest adding this capability to the gateway?
Thanks
I'm starting the app in Docker, connected to a local arweave-node. It starts populating the DB, and I get repeating errors:

postgres FATAL: sorry, too many clients already

The reason is that await connection.queryBuilder() creates a new connection each time, and the situation is made worse by a pool with a minimum of 10 connections:
https://github.com/ArweaveTeam/gateway/blob/master/src/database/connection.database.ts#L9

Removing this line and raising the DB connection limit does the trick for me, but that's a half-measure. The proper fix is to create a single connection and reuse it.
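A sketch of the proposed fix, building the knex instance once and reusing it everywhere (pool sizes illustrative):

```ts
// Construct the knex pool once at first use; every caller shares it
// instead of instantiating a connection per queryBuilder() call.
import knex, { Knex } from 'knex';

let cached: Knex | undefined;

export function getConnection(): Knex {
  if (!cached) {
    cached = knex({
      client: 'pg',
      connection: process.env.DATABASE_URL,
      pool: { min: 0, max: 10 },
    });
  }
  return cached;
}
```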
To reduce database storage requirements, I would like to be able to choose which transactions I want to index in the database based on various attributes such as tags, owner, and the recipient (for AR transfers).
Considering how complex the filtering logic could be I wonder what kind of format it should be defined in. Perhaps allow a custom JS function to be defined in some kind of config script? Assuming this will be a common use case for developers, it would be ideal not to need to fork the repo to implement our own filters.
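One possible shape for such a filter: a predicate exported from a config module and applied before a transaction is indexed (the metadata fields shown are assumptions for illustration):

```ts
// User-supplied indexing filter, loaded from a config script.
export interface TxMetadata {
  owner: string;
  target: string;
  tags: { name: string; value: string }[];
}

// Example: only index transactions from one app, plus AR transfers to
// one recipient (the address is a placeholder).
export function shouldIndex(tx: TxMetadata): boolean {
  const isMyApp = tx.tags.some(
    (t) => t.name === 'App-Name' && t.value === 'MyApp'
  );
  return isMyApp || tx.target === 'my-recipient-address';
}
```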
Update the gateway block height to be precached in memory, refreshing every minute. Refactor the utility function for performance reasons.
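A minimal sketch of the idea (the node URL and refresh interval are assumptions; fetchNetworkHeight stands in for the existing utility):

```ts
// Keep the network height in memory so request handlers never block on
// a node round-trip; refresh it once a minute in the background.
import fetch from 'node-fetch';

let cachedHeight = 0;

async function fetchNetworkHeight(): Promise<number> {
  const res = await fetch('http://localhost:1984/info');
  const info = (await res.json()) as { height: number };
  return info.height;
}

export function getCachedHeight(): number {
  return cachedHeight;
}

setInterval(async () => {
  try {
    cachedHeight = await fetchNetworkHeight();
  } catch {
    // Serve the stale height if the node is briefly unreachable.
  }
}, 60_000);
```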
In v1.0.0, when I get a tx_id value from the tags table, I find that the transactions table contains no record with an id equal to it. This looks like a bug (and it is confirmed by the always-empty tags arrays returned). I am checking whether v0.12.1 is also affected.
Sometimes nodes may not have transactions. These transactions are reported as missing and need to be indexed.

- There needs to be a .rescan flat file that stores the missing transactions.
- There needs to be a yarn dev:rescan command to restore those transactions in the database.
The Postgres database should have snapshots.

- You should be able to pull a snapshot via yarn dev:seed to retrieve the .txt.gz file.
- You should be able to generate your own snapshot via yarn dev:snapshot.
- Uploads should use PG COPY with .csv format.
Posting an invalid TX always responds with 200 OK as long as it is structurally valid. The previous behaviour was that if you posted a tx with one of:

And possibly some other conditions, you would get a 400 'Tx failed verification'.

Not sure how exactly, or if, we want to fix this. The old 'Tx failed verification' message wasn't the most helpful anyway, but it was at least some indication of failure. The new behaviour is that you'll get a 410 transaction failed from the status endpoint for some period after posting, before it goes to 404.

Note also: these transactions are served as un-mined transactions. For example,
https://dev.arweave.net/zj3YCDIOHlOXrgMSMruW4w7JfItgMjnErabbnYjwlhQ
was posted with an invalid signature.

The main concern is that this now errors later for developers than it did previously, so if someone accidentally adds a tag or changes the tx after signing, it appears to work.
Currently, the size of the built Docker image is >1GB. It should be possible to reduce this quite significantly, which will make spinning up new gateway instances faster and cheaper.

The size can be reduced in two ways: by using a multi-stage build with a slimmer base image, and by excluding development dependencies from node_modules in the final image. Dropping the dependency on ts-node brought in by knex would be great as well, though I'm not sure how one could do that if migrations need to be run in the production container.

Multi-stage builds: https://blog.logrocket.com/reduce-docker-image-sizes-using-multi-stage-builds/
Alpine and other Docker images: https://medium.com/swlh/alpine-slim-stretch-buster-jessie-bullseye-bookworm-what-are-the-differences-in-docker-62171ed4531d
Top-level transaction queries fail:
query {
transaction(id: "WLblzWaoQIqB_GK-S4cALm2pOPQfWQAnCtUSgzgqsGE") {
id
}
}
Response:
{
"errors": [
{
"message": "Cannot return null for non-nullable field Transaction.id.",
"locations": [
{
"line": 3,
"column": 5
}
],
"path": [
"transaction",
"id"
],
"extensions": {
"code": "INTERNAL_SERVER_ERROR"
}
}
],
"data": {
"transaction": null
}
}
Building on #82, the data provided to /chunk should be validated before it is propagated to nodes. The previous gateway implementation did this, but only validated data_path. The gateway should validate the provided chunk data as well.
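A conservative sketch of the structural checks (the full merkle validation of the chunk against data_root via data_path, as arweave-js performs, is elided here and would be the stronger check to add):

```ts
// Reject obviously malformed /chunk bodies before propagation.
const MAX_CHUNK_SIZE = 256 * 1024; // protocol chunk ceiling

function b64UrlDecode(s: string): Buffer {
  return Buffer.from(s.replace(/-/g, '+').replace(/_/g, '/'), 'base64');
}

export function chunkLooksValid(body: {
  chunk: string;
  data_path: string;
  data_root: string;
}): boolean {
  const chunk = b64UrlDecode(body.chunk);
  const dataPath = b64UrlDecode(body.data_path);
  const dataRoot = b64UrlDecode(body.data_root);
  return (
    chunk.length > 0 &&
    chunk.length <= MAX_CHUNK_SIZE &&
    dataPath.length > 0 &&
    dataRoot.length === 32 // sha-256 merkle root
  );
}
```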
It appears that the gateways, when redirecting to the unique subdomain, are not including the query string part of the URL. Query strings are very useful for front-end applications; the query string is part of the URL RFC and there shouldn't be any reason to drop it during a redirect.

By removing it, the gateway forces people who need it to link to https://txId.arweave.net/txId/?foo=bar directly, which somewhat breaks the idea of decentralization (we should be able to use ar://txId?foo=bar without naming a specific gateway) and also makes storing those URLs more expensive on systems where every byte has a cost.

For example, https://arweave.net/7nhx1ZjpG4xstYIxDQs6TITw30ntwOfcjP43txyxqXc?imageId=0 should redirect to https://5z4hdvmy5enyy3fvqiyq2cz2jscpbx2j5xaopxem7y33ohfrvf3q.arweave.net/7nhx1ZjpG4xstYIxDQs6TITw30ntwOfcjP43txyxqXc/?imageId=0 but only redirects to https://5z4hdvmy5enyy3fvqiyq2cz2jscpbx2j5xaopxem7y33ohfrvf3q.arweave.net/7nhx1ZjpG4xstYIxDQs6TITw30ntwOfcjP43txyxqXc/, which totally breaks the front end, since imageId is a parameter the JavaScript needs to select data.
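A sketch of a redirect that keeps the query string intact, assuming an Express handler (the helper below is hypothetical):

```ts
// req.originalUrl keeps "?imageId=0" intact, whereas building the
// Location header from the path alone drops it.
import express from 'express';

export function sandboxRedirect(
  req: express.Request,
  res: express.Response,
  subdomain: string
): void {
  // e.g. subdomain = "5z4hdvmy...", req.originalUrl = "/7nhx1Zjp...?imageId=0"
  res.redirect(302, `https://${subdomain}.arweave.net${req.originalUrl}`);
}
```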
Same thing as ArweaveTeam/testweave-docker#3
It would be very helpful to have the container image ready to use just by running docker pull. The image can be stored alongside this repo using the GitHub Container Registry, which is free for OSS. A GitHub Action can be set up to build the image every time a GitHub release is created.

Something like the workflow below should be sufficient. I can't set this up myself, as someone from the team will need to provide a credential for pushing the image to the registry. For setting up CR_PAT: https://docs.github.com/en/packages/guides/migrating-to-github-container-registry-for-docker-images#authenticating-with-the-container-registry
name: Build Release
on:
  release:
    types: [created]
jobs:
  build_image:
    name: Build image
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          ref: ${{ github.head_ref }}
      - name: Cache Docker layers
        uses: actions/cache@v2
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-
      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.CR_PAT }} # <- the repo owner needs to generate a credential for pushing to the registry
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      - name: Build and push image
        id: docker_build
        uses: docker/build-push-action@v2
        with:
          context: .
          file: ./docker/gateway.dockerfile
          push: true
          # image names must be lowercase
          tags: |
            ghcr.io/arweaveteam/gateway:latest
            ghcr.io/arweaveteam/gateway:${{ github.event.release.tag_name }}
          cache-from: type=local,src=/tmp/.buildx-cache
          cache-to: type=local,dest=/tmp/.buildx-cache
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
From the implementation here, it looks like transaction data is buffered entirely into memory before being sent to the client. Ideally, it should be streamed straight through to the client to reduce latency and to allow loading transactions larger than the ~2GB buffer limit Node supports.
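A sketch of pass-through streaming, assuming an Express response and an illustrative node URL (error handling reduced to the essentials):

```ts
// Pipe the upstream node response straight into the gateway response;
// memory use stays flat even for multi-gigabyte transactions.
import http from 'http';
import express from 'express';

const NODE_URL = 'http://localhost:1984';

export function streamTxData(txId: string, res: express.Response): void {
  http
    .get(`${NODE_URL}/tx/${txId}/data`, (upstream) => {
      res.status(upstream.statusCode ?? 502);
      upstream.pipe(res); // forward bytes as they arrive
    })
    .on('error', () => res.status(502).end());
}
```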
Be able to review the sync status of the chain with a GET request. It should be exposed at a /status [GET] endpoint. An example of what the output should be:

{
  "synced": 100,
  "height": 100000,
  "eta": "90210s"
}
curl -L -v https://arweave.net/x--IDzM94TZ2ezVz1UrHw1jaQkVpvyBym5h0QvH_Ang/xyz
The TxId here is a path manifest, but xyz does not exist in it. Instead of getting a 404, we get a 504 Gateway Timeout after a long wait.
Currently, CodeCov reports around 32% coverage, as seen here. We should work towards 80%+ code coverage.

When TestWeave is ready, integrate TestWeave into the unit tests for mock testing. This should be bumped to high priority once TestWeave is ready to integrate.
ANS-102 transactions should be parsed and formatted when you go to the /[tx-id] route. All payloads should be formatted based on the Content-Type tag on retrieval. Videos should be formatted with #15, seen here.
Be able to query blocks by Unix timestamp. Add a new timestamp filter to the GraphQL schema:
query {
transactions(timestamp: {min: [unix], max: [unix]}) {
edges {
node {
id
}
}
}
}
When routed to the /[tx-id] route, content should be served from its own sandboxed subdomain. Transaction manifests should be autogenerated and formatted as SHA-256. Legacy source code for path manifests can be referenced here: https://github.com/ArweaveTeam/gateway/tree/feature/v2.1-chunks
In v1.0.0, the CSV files created by docker-compose up look like this (every row is just the header line repeated):
$ docker-compose exec server tail /app/snapshot/transaction.csv
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
- Deprecate COPY for syncing data.
- Use child_process to create passive worker processes that insert data.
- These threaded insertions should also apply to the verify scripts.
- Provide Postgres optimization suggestions.
- Create general benchmarking guidelines and RAM requirements based on DB size.
- Update the guides on dropping and regenerating indices for PG COPY.
Your system of creating indices from the INDICES environment variable is both insufficiently flexible and over-complicated. The reasonable way to store metadata is a composite PostgreSQL index over both name and value, with JOINs used in queries (as sketched below). Then there is no need to create a separate DB table column for each index, and every index can be used for filtering and sorting.
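A sketch of the suggested layout, with the SQL kept in strings (table and column names illustrative):

```ts
// One generic tags table with a composite (name, value) index replaces
// the per-INDICES columns; lookups join back to transactions.
export const tagIndexSql = `
  CREATE INDEX IF NOT EXISTS tags_name_value_idx ON tags (name, value);
`;

export const findByTagSql = `
  SELECT t.*
  FROM transactions t
  JOIN tags g ON g.tx_id = t.id
  WHERE g.name = $1 AND g.value = $2;
`;
```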
For my use case, I want to be able to discover and peer with untrusted Arweave nodes to retrieve data as opposed to defining my own list of trusted nodes.
To do so, the gateway will need to first discover nodes to peer with, and then verify that the data it receives from those nodes is correct.
The generateTransactionChunksAsync() function I wrote here could be used for that verification, though for client streaming, whether or not the data is valid can only be known once the entire stream has completed.