arweaveteam / gateway
License: MIT License
Currently, the codebase uses binary trees (BTREE) for indices. This can be optimized by implementing Generalized Search Trees (GIST) instead, which will help scale tag searches and reduce index bloat.

There are a couple of moving parts here. We would need to enable GIST support by running the SQL command:

CREATE EXTENSION btree_gist;

In development this is fairly trivial, as you can just ssh into Postgres and run the command. With docker-compose, however, we would need a boot SQL script to run this command as an administrator. After that, all indices on VARCHAR columns should be upgraded to GIST, as opposed to HASH or BTREE.
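As a sketch of what that migration could look like (assuming knex migrations; the index and column names are illustrative, not the gateway's actual schema):

```ts
// Hypothetical knex migration: enable btree_gist, then swap a VARCHAR
// index over to GIST. btree_gist provides GiST operator classes for
// scalar types, so plain VARCHAR columns become indexable USING gist.
import { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  // Needs to run with sufficient privileges (e.g. from a boot script).
  await knex.raw('CREATE EXTENSION IF NOT EXISTS btree_gist;');
  await knex.raw('DROP INDEX IF EXISTS transactions_target_idx;');
  await knex.raw(
    'CREATE INDEX transactions_target_gist_idx ON transactions USING gist (target);'
  );
}

export async function down(knex: Knex): Promise<void> {
  await knex.raw('DROP INDEX IF EXISTS transactions_target_gist_idx;');
}
```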
To ensure that chunk data is seeded onto the network reliably and not lost in the case of node failure, the gateway should send a single /chunk request to multiple peer nodes. I mentioned this in #8 but wasn't too clear on it.

On a smaller note, sending a POST to https://gateway.amplify.host/chunk yields a Request type not found response, so chunk seeding actually isn't possible atm. :)
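For reference, a minimal sketch of the multi-peer seeding idea (the peer list and acceptance threshold are assumptions for illustration):

```ts
// Fan a received chunk out to several peers so the data survives a
// single node failure. Peers and retry policy are illustrative.
import fetch from 'node-fetch';

const PEERS = ['http://peer1:1984', 'http://peer2:1984', 'http://peer3:1984'];

export async function broadcastChunk(chunk: object): Promise<number> {
  const results = await Promise.allSettled(
    PEERS.map((peer) =>
      fetch(`${peer}/chunk`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(chunk),
      })
    )
  );
  // Number of peers that accepted the chunk; callers could retry or
  // alert when this falls below a safety threshold.
  return results.filter((r) => r.status === 'fulfilled' && r.value.ok).length;
}
```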
I sent the same request to https://arweave.net/graphql and to my own v1.0.0 installation:
query {
transactions(
first: 2
ids: [
"qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg",
"c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s"
]
) {
edges {
node {
block {
id
height
}
id
}
}
}
}
arweave.net replies:
{
"data": {
"transactions": {
"edges": [
{
"node": {
"block": {
"id": "ImTHSwo1yrcUM35-Kd5UiZs5mCXebgHueroxWuXq_yaYLE3CU6_1agcv44aSOhCG",
"height": 132702
},
"id": "c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s"
}
},
{
"node": {
"block": {
"id": "xbBn0jc4zrqAXmDqczhkzypG0G1b_Nj0XqJZ73UETsQNb0lIc3WIXaVvvv0YtOJS",
"height": 97283
},
"id": "qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg"
}
}
]
}
}
}
but my installation replies:
{
"data": {
"transactions": {
"edges": [
{
"node": {
"block": {
"id": "ImTHSwo1yrcUM35-Kd5UiZs5mCXebgHueroxWuXq_yaYLE3CU6_1agcv44aSOhCG",
"height": 132702
},
"id": "c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s"
}
}
]
}
}
}
You can see that the transaction qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg is missing on my server.
Moreover:
select count(*) from transactions where id='c3T92aWcYEt79uok96MkH6TyXuKageX-1cn1oW46v7s';
returns 1.
select count(*) from transactions where id='qul1Elv7Xg6LtyTkPqDj0i5jNr38aOLvi9qsDMoO9Jg';
returns 0.
This could be explained by my server being only partially synchronized, but my server does show a transaction at a higher block height. So, there is a bug.
Knex transactions should be phased out in favor of the COPY command.

- Entries should be generated in a /cache folder.
- Entries are flushed and sync progress is updated after each successful COPY.
- Make sure the implementation is redundant (fault-tolerant) as well as performant.

A sketch of the COPY flush follows below.
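Assuming the pg and pg-copy-streams packages and an illustrative transactions table:

```ts
// Stream a cached CSV batch into Postgres with COPY, which is far
// faster than row-by-row INSERTs through knex.
import { createReadStream } from 'fs';
import { pipeline } from 'stream/promises';
import { Client } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';

export async function flushBatch(client: Client, csvPath: string): Promise<void> {
  const copyStream = client.query(
    copyFrom('COPY transactions FROM STDIN WITH (FORMAT csv, HEADER true)')
  );
  await pipeline(createReadStream(csvPath), copyStream);
  // On success the caller should delete the cache entry and persist the
  // new sync height, so a crash never replays or loses a batch.
}
```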
Currently, the gateway parses and indexes ANS-102 bundles, which allows them to be queried through the GraphQL API. However, the gateway doesn't store the data of these data items for serving to clients. The gateway needs to store data item data (possibly compressed in some way), as data items cannot be retrieved from miners efficiently. The gateway should store and serve data item data to clients.
The Arweave HTTP API lets you retrieve the wallet balance of addresses on Arweave, but the gateway doesn't proxy these requests to nodes. It would be cool to be able to query balances using the GraphQL API.

I wonder how a trustless implementation of this would work. Performing an aggregation query over the indexed transactions to compute a wallet balance might not scale.
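A plain proxy would at least cover the HTTP side. A sketch, assuming an Express route and a trusted node URL (both illustrative):

```ts
// Proxy wallet balance lookups to a node. The node's
// GET /wallet/{address}/balance endpoint returns the balance in winston.
import express from 'express';
import fetch from 'node-fetch';

const NODE_URL = 'http://localhost:1984';
const app = express();

app.get('/wallet/:address/balance', async (req, res) => {
  const upstream = await fetch(`${NODE_URL}/wallet/${req.params.address}/balance`);
  res.status(upstream.status).send(await upstream.text());
});
```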
The Gateway does not support "Range" requests. Range requests are required for streaming video and music on iOS Safari.

Use the following video to test. Transaction ID: qgDlmUVFGtzAleAK5QuaNBN4EF8ZUMTAaM0ElluDDr8
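A minimal sketch of the 206/Content-Range handshake Safari expects (buffer-based for brevity; a real implementation would stream, and a full one would also handle suffix ranges):

```ts
// Honor "Range: bytes=start-end" when serving transaction data.
import express from 'express';

export function sendWithRange(
  req: express.Request,
  res: express.Response,
  data: Buffer,
  contentType: string
): void {
  res.setHeader('Accept-Ranges', 'bytes');
  res.setHeader('Content-Type', contentType);
  const match = /^bytes=(\d*)-(\d*)$/.exec(req.headers.range ?? '');
  if (!match) {
    res.send(data);
    return;
  }
  const start = match[1] ? parseInt(match[1], 10) : 0;
  const end = match[2] ? parseInt(match[2], 10) : data.length - 1;
  res.status(206); // Partial Content
  res.setHeader('Content-Range', `bytes ${start}-${end}/${data.length}`);
  res.send(data.subarray(start, end + 1));
}
```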
Create a new command, dev:verify, that:

- Starts at the latest block and works down the chain.
- Verifies that each transaction and tag is indexed in the database; if not, inserts it.
- Shows a progress bar indicating the ETA to completion.
Sometimes the request
query {
transactions(
first: 1
tags: [{ name: "Content-Type", values: ["text/pdf"] }]
) {
edges {
cursor
node {
id
owner {
address
}
tags {
name
value
}
}
}
}
}
is answered with an erroneous empty "edges": [] response:
{
"data": {
"transactions": {
"edges": []
}
}
}
The empty edges responses appear to occur completely at random. docker-compose did not crash while this bug was happening.

Gateway commit: f6d5d370806b8efe808218ccbcf3f4592afc25e9
root@many-files:~# tail -f ~manyfiles/arweave.log
server_1 |
server_1 | info: [database] could not retrieve tx o9a-tUOc0RtS0WEf6jdVrXmnqMDYG9DHWXv0Sx9hOF8 at height 325855 , missing tx stored in .rescan
server_1 |
server_1 | info: [database] could not retrieve tx Q5jlQh4ZEwthUsOod6NMuEitAu1t6x8oQzbd9NkZS_Q at height 325914 , attempting to retrieve again
server_1 |
server_1 | info: [database] could not retrieve tx pISJjCTPD6GjhMxMccjy1TbkvXbb-XSVWbL1drRpicc at height 325914 , attempting to retrieve again
server_1 |
server_1 | info: [database] could not retrieve tx Q5jlQh4ZEwthUsOod6NMuEitAu1t6x8oQzbd9NkZS_Q at height 325914 , missing tx stored in .rescan
server_1 |
server_1 | info: [database] could not retrieve tx pISJjCTPD6GjhMxMccjy1TbkvXbb-XSVWbL1drRpicc at height 325914 , missing tx stored in .rescan
server_1 |
server_1 | info: [database] could not retrieve tx umfo0SXYxufnWuXsv8F57x0zS2VBQ9yYysCjiFtDuoo at height 325971 , attempting to retrieve again
server_1 |
server_1 | info: [database] could not retrieve tx umfo0SXYxufnWuXsv8F57x0zS2VBQ9yYysCjiFtDuoo at height 325971 , missing tx stored in .rescan
Be able to query transactions by status.
Add a new status filter to the GraphQL Schema.
Legacy code from the AWS version.
query {
transactions(status: "any" | "confirmed" | "pending") {
edges {
node {
id
}
}
}
}
In theory someone could have a path that is "drop db". Just putting this up here for reference.
If I add "Content-Type"
to INDICES
, it produces nonsensical [object Object]
at transactions.Content-Type
column in DB what break queries' results:
INDICES=["App-Name", "app", "domain", "namespace"]
Add comments to the user not to do that.
I am syncing the blockchain third time because of bugs.
Existing snapshot functionality is redundant.

- Migrate all snapshot generation guides to the \COPY command.
- Update all documentation to reflect using PG COPY.
- Remove all code related to legacy snapshots.
Currently, it's possible to set the "Content-Type" tag as the transaction data content type. It's used as an HTTP header when the transaction data is rendered in a browser.
It would be nice to enable support for all the other "Content-*" headers (especially "Content-Encoding"). This would be very useful because senders could compress their data with gzip and set "Content-Encoding: gzip"; browsers would then automatically decompress the tx data while rendering it. This approach could save more than 50% on tx rewards.
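A sketch of what forwarding whitelisted Content-* tags could look like (the whitelist is an assumption, not an agreed spec):

```ts
// Map whitelisted Content-* tags onto HTTP response headers when the
// transaction data is served.
const FORWARDED_HEADERS = ['Content-Type', 'Content-Encoding', 'Content-Language'];

export function applyContentHeaders(
  res: { setHeader(name: string, value: string): void },
  tags: { name: string; value: string }[]
): void {
  for (const tag of tags) {
    if (FORWARDED_HEADERS.includes(tag.name)) {
      res.setHeader(tag.name, tag.value);
    }
  }
}
```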
The arweave.net gateway supports creating and uploading transactions. It additionally seeds uploaded chunks to its peers to ensure that data is not lost in the event of a node failure. This gateway currently doesn't support that.
I am trying to send the custom headers request-public-key and x-request-signature to https://<gateway>/<trxId> from Chrome.

I tried to add CORS so that it can accept cross-site requests with the custom headers mentioned above; refer to the pull request. This potentially fixed the problem when running without Docker via yarn start or npm run dev:start, but when the gateway is running inside a Docker container via sudo docker container up --build -d, the browser still reports a CORS error.

To reproduce the problem:
1. sudo docker container up --build -d
2. Send a GET request to https://<gateway>/<trxId> from the browser and append some custom headers like headers: { "key1": "value1" } — the browser reports a CORS error.

Without Docker:
1. yarn start or npm run dev:start
2. Send the same GET request from the browser with the same custom headers — the response is 200.

My main concern is why it works and returns 200 when run without Docker, but fails with a CORS error when the application is containerized.

Docker version: Docker version 20.10.7, build f0df350
docker-compose version: docker-compose version 1.29.2, build 5becea4c
OS: Ubuntu 20.04 LTS
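For reference, a sketch of an explicit CORS policy that admits these headers, using the cors middleware (the configuration values are illustrative). Since the middleware itself is runtime-agnostic, comparing the effective environment/config inside the container would be the next debugging step:

```ts
// Allow cross-site requests carrying the custom headers; the cors
// middleware also answers the OPTIONS preflight automatically.
import express from 'express';
import cors from 'cors';

const app = express();
app.use(
  cors({
    origin: '*',
    allowedHeaders: ['request-public-key', 'x-request-signature', 'Content-Type'],
  })
);
```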
All active queries that require sorting should target tables in sixth normal form. Primary-key-indexed data storing regular table metadata should be in second normal form. Sixth normal form helps ensure optimal performance on these queries.

General guideline for restructuring the tables (subject to change during implementation; see the DDL sketch below):

- The primary key for the blocks table is changed to height, an integer.
- The transactions table is converted into an auto-increment table containing id as an integer and tx_hash as the actual transaction id.
- Key metadata for transactions and tags is refactored to reference this id as an integer instead of a varchar.
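Illustrative DDL for the restructuring, kept as raw SQL strings in a TypeScript module (names and types are assumptions sketching the idea, not a final schema):

```ts
// Hypothetical target schema: blocks keyed by integer height,
// transactions keyed by a surrogate integer id, with the base64url tx
// hash stored alongside it and tags referencing the integer id.
export const restructureSql = `
  CREATE TABLE transactions_v2 (
    id      BIGSERIAL PRIMARY KEY,
    tx_hash VARCHAR(64) NOT NULL UNIQUE
  );

  CREATE TABLE tags_v2 (
    tx_id BIGINT NOT NULL REFERENCES transactions_v2 (id),
    name  TEXT   NOT NULL,
    value TEXT   NOT NULL
  );
`;
```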
Hi,
I found that it's currently not possible to filter for mempool-only txs; they are included in the normal results. I tried block: {max: 1} and block: {max: 0}, but neither works.
This would be useful in some cases. May I suggest adding this capability to the gateway?
Thanks
I'm starting the app in Docker, connected to a local arweave-node. It starts populating the DB, and I get repeating errors:

postgres FATAL: sorry, too many clients already

The reason is that await connection.queryBuilder() creates a new connection each time, and the situation is made worse by a pool with a minimum of 10 connections:
https://github.com/ArweaveTeam/gateway/blob/master/src/database/connection.database.ts#L9

Removing this line and raising the DB connection limit does the trick for me, but that's a half-measure. The proper fix is to create a single connection and reuse it.
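A sketch of the proposed fix, building the knex instance once and reusing it everywhere (pool sizes illustrative):

```ts
// Construct the knex pool once at first use; every caller shares it
// instead of instantiating a connection per queryBuilder() call.
import knex, { Knex } from 'knex';

let cached: Knex | undefined;

export function getConnection(): Knex {
  if (!cached) {
    cached = knex({
      client: 'pg',
      connection: process.env.DATABASE_URL,
      pool: { min: 0, max: 10 },
    });
  }
  return cached;
}
```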
To reduce database storage requirements, I would like to be able to choose which transactions I want to index in the database based on various attributes such as tags, owner, and the recipient (for AR transfers).
Considering how complex the filtering logic could be I wonder what kind of format it should be defined in. Perhaps allow a custom JS function to be defined in some kind of config script? Assuming this will be a common use case for developers, it would be ideal not to need to fork the repo to implement our own filters.
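One possible shape for such a filter: a predicate exported from a config module and applied before a transaction is indexed (the metadata fields shown are assumptions for illustration):

```ts
// User-supplied indexing filter, loaded from a config script.
export interface TxMetadata {
  owner: string;
  target: string;
  tags: { name: string; value: string }[];
}

// Example: only index transactions from one app, plus AR transfers to
// one recipient (the address is a placeholder).
export function shouldIndex(tx: TxMetadata): boolean {
  const isMyApp = tx.tags.some(
    (t) => t.name === 'App-Name' && t.value === 'MyApp'
  );
  return isMyApp || tx.target === 'my-recipient-address';
}
```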
Update the gateway block height to be precached in memory, refreshing every minute. Refactor the utility function for performance reasons.
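A minimal sketch of the idea (the node URL and refresh interval are assumptions; fetchNetworkHeight stands in for the existing utility):

```ts
// Keep the network height in memory so request handlers never block on
// a node round-trip; refresh it once a minute in the background.
import fetch from 'node-fetch';

let cachedHeight = 0;

async function fetchNetworkHeight(): Promise<number> {
  const res = await fetch('http://localhost:1984/info');
  const info = (await res.json()) as { height: number };
  return info.height;
}

export function getCachedHeight(): number {
  return cachedHeight;
}

setInterval(async () => {
  try {
    cachedHeight = await fetchNetworkHeight();
  } catch {
    // Serve the stale height if the node is briefly unreachable.
  }
}, 60_000);
```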
In v1.0.0, when I get a tx_id value from the tags table, I find that the transactions table contains no record with an id equal to it. This looks like a bug (and it is confirmed by the always-empty tags arrays returned). I am checking whether v0.12.1 is also affected.
Sometimes nodes may not have transactions. These transactions are reported as missing and need to be indexed.

- There needs to be a .rescan flat file that stores the missing transactions.
- There needs to be a yarn dev:rescan command to restore those transactions in the database.
The Postgres database should have snapshots.

- You should be able to pull a snapshot via yarn dev:seed to retrieve the .txt.gz file.
- You should be able to generate your own snapshot via yarn dev:snapshot.
- Uploads should use PG COPY with .csv format.
Posting an invalid TX always responds with 200 OK as long as it is structurally valid. The previous behaviour was that if you posted a tx with one of:

And possibly some other conditions, you would get a 400 'Tx failed verification'.

Not sure how exactly, or if, we want to fix this. The old 'Tx failed verification' message wasn't the most helpful anyway, but it was at least some indication of failure. The new behaviour is that you'll get a 410 transaction failed from the status endpoint for some period after posting, before it goes to 404.

Note also: these transactions are served as un-mined transactions. For example,
https://dev.arweave.net/zj3YCDIOHlOXrgMSMruW4w7JfItgMjnErabbnYjwlhQ
was posted with an invalid signature.

The main concern is that this now errors later for developers than it did previously, so if someone accidentally adds a tag or changes the tx after signing, it appears to work.
Currently, the size of the built Docker image is >1GB. It should be possible to reduce this quite significantly, which will make spinning up new gateway instances faster and cheaper.

The size can be reduced in two ways: by using a multi-stage build with a slimmer base image, and by excluding development dependencies from node_modules in the final image. Dropping the dependency on ts-node brought in by knex would be great as well, though I'm not sure how one could do that if migrations need to be run in the production container.

Multi-stage builds: https://blog.logrocket.com/reduce-docker-image-sizes-using-multi-stage-builds/
Alpine and other Docker images: https://medium.com/swlh/alpine-slim-stretch-buster-jessie-bullseye-bookworm-what-are-the-differences-in-docker-62171ed4531d
Top-level transaction queries fail:
query {
transaction(id: "WLblzWaoQIqB_GK-S4cALm2pOPQfWQAnCtUSgzgqsGE") {
id
}
}
Response:
{
"errors": [
{
"message": "Cannot return null for non-nullable field Transaction.id.",
"locations": [
{
"line": 3,
"column": 5
}
],
"path": [
"transaction",
"id"
],
"extensions": {
"code": "INTERNAL_SERVER_ERROR"
}
}
],
"data": {
"transaction": null
}
}
Building on #82, the data provided to /chunk should be validated before it is propagated to nodes. The previous gateway implementation did this, but only validated data_path. The gateway should validate the provided chunk data as well.
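A conservative sketch of the structural checks (the full merkle validation of the chunk against data_root via data_path, as arweave-js performs, is elided here and would be the stronger check to add):

```ts
// Reject obviously malformed /chunk bodies before propagation.
const MAX_CHUNK_SIZE = 256 * 1024; // protocol chunk ceiling

function b64UrlDecode(s: string): Buffer {
  return Buffer.from(s.replace(/-/g, '+').replace(/_/g, '/'), 'base64');
}

export function chunkLooksValid(body: {
  chunk: string;
  data_path: string;
  data_root: string;
}): boolean {
  const chunk = b64UrlDecode(body.chunk);
  const dataPath = b64UrlDecode(body.data_path);
  const dataRoot = b64UrlDecode(body.data_root);
  return (
    chunk.length > 0 &&
    chunk.length <= MAX_CHUNK_SIZE &&
    dataPath.length > 0 &&
    dataRoot.length === 32 // sha-256 merkle root
  );
}
```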
It appears that the gateways, when redirecting to the unique subdomain, are not including the query string part of the URL. Query strings are very useful for front-end applications; the query string is part of the URL RFC and there shouldn't be any reason to drop it during a redirect.

By removing it, the gateway forces people who need it to link to https://txId.arweave.net/txId/?foo=bar directly, which somewhat breaks the idea of decentralization (we should be able to use ar://txId?foo=bar without naming a specific gateway) and also makes storing those URLs more expensive on systems where every byte has a cost.

For example, https://arweave.net/7nhx1ZjpG4xstYIxDQs6TITw30ntwOfcjP43txyxqXc?imageId=0 should redirect to https://5z4hdvmy5enyy3fvqiyq2cz2jscpbx2j5xaopxem7y33ohfrvf3q.arweave.net/7nhx1ZjpG4xstYIxDQs6TITw30ntwOfcjP43txyxqXc/?imageId=0 but only redirects to https://5z4hdvmy5enyy3fvqiyq2cz2jscpbx2j5xaopxem7y33ohfrvf3q.arweave.net/7nhx1ZjpG4xstYIxDQs6TITw30ntwOfcjP43txyxqXc/, which totally breaks the front end, since imageId is a parameter the JavaScript needs to select data.
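A sketch of a redirect that keeps the query string intact, assuming an Express handler (the helper below is hypothetical):

```ts
// req.originalUrl keeps "?imageId=0" intact, whereas building the
// Location header from the path alone drops it.
import express from 'express';

export function sandboxRedirect(
  req: express.Request,
  res: express.Response,
  subdomain: string
): void {
  // e.g. subdomain = "5z4hdvmy...", req.originalUrl = "/7nhx1Zjp...?imageId=0"
  res.redirect(302, `https://${subdomain}.arweave.net${req.originalUrl}`);
}
```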
Same thing as ArweaveTeam/testweave-docker#3
It would be very helpful to have the container image ready to use just by running docker pull. The image can be stored alongside this repo using the GitHub Container Registry, which is free for OSS. A GitHub Action can be set up to build the image every time a GitHub release is created.

Something like the workflow below should be sufficient. I can't set this up myself, as someone from the team will need to provide a credential for pushing the image to the registry. For setting up CR_PAT: https://docs.github.com/en/packages/guides/migrating-to-github-container-registry-for-docker-images#authenticating-with-the-container-registry
name: Build Release
on:
  release:
    types: [created]
jobs:
  build_image:
    name: Build image
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          ref: ${{ github.head_ref }}
      - name: Cache Docker layers
        uses: actions/cache@v2
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-
      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.CR_PAT }} # <- the repo owner needs to generate a credential for pushing to the registry
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      - name: Build and push image
        id: docker_build
        uses: docker/build-push-action@v2
        with:
          context: .
          file: ./docker/gateway.dockerfile
          push: true
          # image names must be lowercase
          tags: |
            ghcr.io/arweaveteam/gateway:latest
            ghcr.io/arweaveteam/gateway:${{ github.event.release.tag_name }}
          cache-from: type=local,src=/tmp/.buildx-cache
          cache-to: type=local,dest=/tmp/.buildx-cache
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
From the implementation here, it looks like transaction data is buffered entirely into memory before being sent to the client. Ideally, it should be streamed straight through to the client to reduce latency and to allow loading transactions larger than the ~2GB buffer limit Node supports.
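A sketch of pass-through streaming, assuming an Express response and an illustrative node URL (error handling reduced to the essentials):

```ts
// Pipe the upstream node response straight into the gateway response;
// memory use stays flat even for multi-gigabyte transactions.
import http from 'http';
import express from 'express';

const NODE_URL = 'http://localhost:1984';

export function streamTxData(txId: string, res: express.Response): void {
  http
    .get(`${NODE_URL}/tx/${txId}/data`, (upstream) => {
      res.status(upstream.statusCode ?? 502);
      upstream.pipe(res); // forward bytes as they arrive
    })
    .on('error', () => res.status(502).end());
}
```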
Be able to review the sync status of the chain with a GET request. It should be exposed at a /status [GET] endpoint. An example of what the output should be:

{
  "synced": 100,
  "height": 100000,
  "eta": "90210s"
}
curl -L -v https://arweave.net/x--IDzM94TZ2ezVz1UrHw1jaQkVpvyBym5h0QvH_Ang/xyz
The TxId here is a path manifest, but xyz does not exist in it. Instead of getting a 404, we get a 504 Gateway Timeout after a long wait.
Currently, CodeCov reports around 32% coverage, as seen here. We should work towards 80%+ code coverage.

When TestWeave is ready, integrate TestWeave into the unit tests for mock testing. This should be bumped to high priority once TestWeave is ready to integrate.
ANS-102 transactions should be parsed and formatted when you go to the /[tx-id] route. All payloads should be formatted based on the Content-Type tag on retrieval. Videos should be formatted with #15, seen here.
Be able to query blocks by Unix timestamp. Add a new timestamp filter to the GraphQL schema:
query {
transactions(timestamp: {min: [unix], max: [unix]}) {
edges {
node {
id
}
}
}
}
When routed to the /[tx-id] route, content should be served from its own sandboxed subdomain. Transaction manifests should be autogenerated and formatted as SHA-256. Legacy source code for path manifests can be referenced here: https://github.com/ArweaveTeam/gateway/tree/feature/v2.1-chunks
In v1.0.0, the CSV files created by docker-compose up look like this (every row is just the header line repeated):
$ docker-compose exec server tail /app/snapshot/transaction.csv
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
format|id|signature|owner|owner_address|target|reward|last_tx|height|tags|quantity|content_type|data_size|data_root|Content-Type|domain|namespace
- Deprecate COPY for syncing data.
- Use child_process to create passive worker processes that insert data.
- These threaded insertions should also apply to the verify scripts.
- Provide Postgres optimization suggestions.
- Create general benchmarking guidelines and RAM requirements based on DB size.
- Update the guides on dropping and regenerating indices for PG COPY.
Your system of creating indices from the INDICES environment variable is both insufficiently flexible and over-complicated. The reasonable way to store metadata is a composite PostgreSQL index over both name and value, with JOINs used in queries (as sketched below). Then there is no need to create a separate DB table column for each index, and every index can be used for filtering and sorting.
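A sketch of the suggested layout, with the SQL kept in strings (table and column names illustrative):

```ts
// One generic tags table with a composite (name, value) index replaces
// the per-INDICES columns; lookups join back to transactions.
export const tagIndexSql = `
  CREATE INDEX IF NOT EXISTS tags_name_value_idx ON tags (name, value);
`;

export const findByTagSql = `
  SELECT t.*
  FROM transactions t
  JOIN tags g ON g.tx_id = t.id
  WHERE g.name = $1 AND g.value = $2;
`;
```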
For my use case, I want to be able to discover and peer with untrusted Arweave nodes to retrieve data as opposed to defining my own list of trusted nodes.
To do so, the gateway will need to first discover nodes to peer with, and then verify that the data it receives from those nodes is correct.
The generateTransactionChunksAsync() function I wrote here could be used for that verification, though for client streaming, whether or not the data is valid can only be known once the entire stream has completed.