neondatabase / neon Goto Github PK

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, branching, and bottomless storage.

License: Apache License 2.0

Rust 74.20% Python 18.96% Makefile 0.19% PLpgSQL 0.03% Dockerfile 0.08% C 5.85% Shell 0.15% TLA 0.17% HTML 0.01% C# 0.01% Java 0.01% Swift 0.04% JavaScript 0.17% C++ 0.14%

postgres postgresql serverless database rust

neon's Introduction

Neon

Neon is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.

Quick start

Try the Neon Free Tier to create a serverless Postgres instance. Then connect to it with your preferred Postgres client (psql, dbeaver, etc) or use the online SQL Editor. See Connect from any application for connection instructions.

Alternatively, compile and run the project locally.

Architecture overview

A Neon installation consists of compute nodes and the Neon storage engine. Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.

The Neon storage engine consists of two major components:

Pageserver: Scalable storage backend for the compute nodes.
Safekeepers: The safekeepers form a redundant WAL service that received WAL from the compute node, and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.

See developer documentation in SUMMARY.md for more information.

Running local installation

Installing dependencies on Linux

Install build dependencies and other applicable packages

On Ubuntu or Debian, this set of packages should be sufficient to build the code:

apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python3-poetry lsof libicu-dev

On Fedora, these packages are needed:

dnf install flex bison readline-devel zlib-devel openssl-devel \
  libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
  protobuf-devel libcurl-devel openssl poetry lsof libicu-devel libpq-devel python3-devel \
  libffi-devel

On Arch based systems, these packages are needed:

pacman -S base-devel readline zlib libseccomp openssl clang \
postgresql-libs cmake postgresql protobuf curl lsof

Building Neon requires 3.15+ version of protoc (protobuf-compiler). If your distribution provides an older version, you can install a newer version from here.

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing dependencies on macOS (12.3.1)

Install XCode and dependencies

xcode-select --install
brew install protobuf openssl flex bison icu4c pkg-config

# add openssl to PATH, required for ed25519 keys generation in neon_local
echo 'export PATH="$(brew --prefix openssl)/bin:$PATH"' >> ~/.zshrc

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Install PostgreSQL Client

# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq

Rustc version

The project uses rust toolchain file to define the version it's built with in CI for testing and local builds.

This file is automatically picked up by rustup that installs (if absent) and uses the toolchain version pinned in the file.

rustup users who want to build with another toolchain can use the rustup override command to set a specific toolchain for the project's directory.

non-rustup users most probably are not getting the same toolchain automatically from the file, so are responsible to manually verify that their toolchain matches the version in the file. Newer rustc versions most probably will work fine, yet older ones might not be supported due to some new features used by the project or the crates.

Building on Linux

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`nproc` -s"
# Remove -s for the verbose build log

make -j`nproc` -s

Building on OSX

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`sysctl -n hw.logicalcpu` -s"
# Remove -s for the verbose build log

make -j`sysctl -n hw.logicalcpu` -s

Dependency installation notes

To run the psql client, install the postgresql-client package or modify PATH and LD_LIBRARY_PATH to include pg_install/bin and pg_install/lib, respectively.

To run the integration tests or Python scripts (not required to use the code), install Python (3.9 or higher), and install the python3 packages using ./scripts/pysync (requires poetry>=1.3) in the project directory.

Running neon database

Start pageserver and postgres on top of it (should be called from repo root):

# Create repository in .neon with proper paths to binaries and data
# Later that would be responsibility of a package install script
> cargo neon init
Initializing pageserver node 1 at '127.0.0.1:64000' in ".neon"

# start pageserver, safekeeper, and broker for their intercommunication
> cargo neon start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2918372
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon".
pageserver started, pid: 2918386
Starting safekeeper at '127.0.0.1:5454' in '.neon/safekeepers/sk1'.
safekeeper 1 started, pid: 2918437

# create initial tenant and use it as a default for every future neon_local invocation
> cargo neon tenant create --set-default
tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one

# create postgres compute node
> cargo neon endpoint create main

# start postgres compute node
> cargo neon endpoint start main
Starting new endpoint main (PostgreSQL v14) on timeline de200bd42b49cc1814412c7e592dd6e9 ...
Starting postgres at 'postgresql://[email protected]:55432/postgres'

# check list of running postgres instances
> cargo neon endpoint list
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS
 main      127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main         0/16B5BA8  running

Now, it is possible to connect to postgres and run some queries:

> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
INSERT 0 1
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

And create branches and run postgres on them:

# create branch named migration_check
> cargo neon timeline branch --branch-name migration_check
Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c. Ancestor timeline: 'main'

# check branches tree
> cargo neon timeline list
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]

# create postgres on that branch
> cargo neon endpoint create migration_check --branch-name migration_check

# start postgres on that branch
> cargo neon endpoint start migration_check
Starting new endpoint migration_check (PostgreSQL v14) on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Starting postgres at 'postgresql://[email protected]:55434/postgres'

# check the new list of running postgres instances
> cargo neon endpoint list
 ENDPOINT         ADDRESS          TIMELINE                          BRANCH NAME      LSN        STATUS
 main             127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main             0/16F9A38  running
 migration_check  127.0.0.1:55434  b3b863fa45fa9e57e615f9f2d944e601  migration_check  0/16F9A70  running

# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
> psql -p 55434 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

postgres=# insert into t values(2,2);
INSERT 0 1

# check that the new change doesn't affect the 'main' postgres
> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

If you want to run tests afterwards (see below), you must stop all the running pageserver, safekeeper, and postgres instances you have just started. You can terminate them all with one command:

> cargo neon stop

More advanced usages can be found at Control Plane and Neon Local.

Handling build failures

If you encounter errors during setting up the initial tenant, it's best to stop everything (cargo neon stop) and remove the .neon directory. Then fix the problems, and start the setup again.

Running tests

Rust unit tests

We are using cargo-nextest to run the tests in Github Workflows. Some crates do not support running plain cargo test anymore, prefer cargo nextest run instead. You can install cargo-nextest with cargo install cargo-nextest.

Integration tests

Ensure your dependencies are installed as described here.

git clone --recursive https://github.com/neondatabase/neon.git

CARGO_BUILD_FLAGS="--features=testing" make

./scripts/pytest

By default, this runs both debug and release modes, and all supported postgres versions. When testing locally, it is convenient to run just one set of permutations, like this:

DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest

Flamegraphs

You may find yourself in need of flamegraphs for software in this repository. You can use flamegraph-rs or the original flamegraph.pl. Your choice!

Important

If you're using lld or mold, you need the --no-rosegment linker argument. It's a general thing with Rust / lld / mold, not specific to this repository. See this PR for further instructions.

Cleanup

For cleaning up the source tree from build artifacts, run make clean in the source directory.

For removing every artifact from build and configure steps, run make distclean, and also consider removing the cargo binaries in the target directory, as well as the database in the .neon directory. Note that removing the .neon directory will remove your database, with all data in it. You have been warned!

Documentation

docs Contains a top-level overview of all available markdown documentation.

sourcetree.md contains overview of source tree layout.

To view your rustdoc documentation in a browser, try running cargo doc --no-deps --open

See also README files in some source directories, and rustdoc style documentation comments.

Other resources:

SELECT 'Hello, World': Blog post by Nikita Shamgunov on the high level architecture
Architecture decisions in Neon: Blog post by Heikki Linnakangas
Neon: Serverless PostgreSQL!: Presentation on storage system by Heikki Linnakangas in the CMU Database Group seminar series

Postgres-specific terms

Due to Neon's very close relation with PostgreSQL internals, numerous specific terms are used. The same applies to certain spelling: i.e. we use MB to denote 1024 * 1024 bytes, while MiB would be technically more correct, it's inconsistent with what PostgreSQL code and its documentation use.

To get more familiar with this aspect, refer to:

Neon glossary
PostgreSQL glossary
Other PostgreSQL documentation and sources (Neon fork sources can be found here)

Join the development

Read CONTRIBUTING.md to learn about project code style and practices.
To get familiar with a source tree layout, use sourcetree.md.
To learn more about PostgreSQL internals, check http://www.interdb.jp/pg/index.html

neon's People

Contributors

Stargazers

Watchers

Forkers

adsharma isgasho funbringer chengbin001 arssher nkotlyarov phan-pivotal umd-dslam markhannum bowenxiao1999 zshell31 theseusyang reshke ctring woonhak elchinoo phoenix24 chhetripradeep hlinnaka pcaruana js00070 suryatmodulus botranvan kianmeng em3ndez ngaut techiev2 cyberflamego linecode rogervaas ezreal1997 lmjw iamfork gurpreet-singh121 fresaip avinassh eternalerrors akankushjnvku mukika fawazghali zinohome rbqren000 pokareczkilona jacobjohansen sysadmin-tools crypto-forks crt-fork metaver5o ryanrussell archiveproject ming535 nanderoo marceloboeira makimarc willie-lin brandnewx cxz saqib-trywe williewillie-lin zxcjhg sekmet makeable xyzzy3000 jdatcmd ashutoshpw cuulee ricosfeifei fusiongalaxy elitan rmsrob mrcodechef chenjianxin cloudbox-network shism2 jianlirong wang-shun ahlfors vadim2404 ansrivas fairhopeweb gujun720 ranjanprj mohamedtaoufik kkxiaotikk webclinic017 alvin-tosh git-blue husen8891 wuqunfei 22317a57 akarasavov cnut-wp jhinboard wa7sa34cx andresrsanchez db-extreme alexander-irbis mkos11 imneov 5l1v3r1

neon's Issues

Create storage manager api/hook and extract smgr into extension

Information missing from WAL records that is needed by primary

Some WAL records leave out information that is not needed in replica, but is still needed by the primary. That's a problem with Zenith, where we rely on the WAL redo to reconstruct a page not only for replicas, but for the original primary instance, too.

I went through all the *_mask() functions, and all the problematic cases seem to be in the heapam:

commandid is set to 1 in heap insert/update/delete records
speculative token is not WAL-logged

Crash in spgscan.c

During test_regress compute node crashes after SELECT count(*) FROM quad_box_tbl WHERE b |&> box '((100,200),(300,500))';

TRAP: FailedAssertion("offset >= FirstOffsetNumber && offset <= max", File: "/Users/stas/code/zenith/tmp_install/build/../../vendor/postgres/src/backend/access/spgist/spgscan.c", Line: 833, PID: 14607)
0   postgres                            0x0000000106e38c1a ExceptionalCondition + 234
1   postgres                            0x0000000106735fa6 spgWalk + 1414
2   postgres                            0x0000000106735a0a spggetbitmap + 106
3   postgres                            0x00000001066f2ef4 index_getbitmap + 388
4   postgres                            0x00000001069b3dc5 MultiExecBitmapIndexScan + 325
5   postgres                            0x0000000106995d12 MultiExecProcNode + 162
6   postgres                            0x00000001069b31e0 BitmapHeapNext + 224
7   postgres                            0x000000010699a6b2 ExecScanFetch + 722
8   postgres                            0x000000010699a2d9 ExecScan + 153
9   postgres                            0x00000001069b2d3b ExecBitmapHeapScan + 59
10  postgres                            0x0000000106995c62 ExecProcNodeFirst + 82

Consider using `cargo fmt`

Hi everyone!

Please consider formatting the code with cargo fmt before the code style problem gets out of control. When a project is too big, losing the ability to git blame could become an instant nonstarter.

Page stream/batch for sequential scans

Refactoring and code cleanup for restore_s3 and restore_datadir

Add S3 offloading mechanics to safekeepers

Subj.

I'll use term wal_service for safekeer_proxy .

I suggest not try invent something complicated now and just put config param / option to safekeeper, that we were able to start one of them with --s3-offload option.

I imagine something following:

each safekeeper has its own synced_lsn (recent wal synced to disk), s3_lsn (wal pushed to s3) both of that is reported back to proxy. s3_lsn is only reported by safekeeper with --s3-offload flag.
proxy sees majority_max(synced_lsn) and decides VCL and sees majority_max(s3_lsn) and decides trim_lsn.
WAL older trim_lsn can be deleted.

In this setup only one safekeeper will advance s3_lsn, but by reporting it to the wal_service it will affect trim_lsn, and allow his peer safekeepers to delete old WAL.

If safekeeper with --s3-offload will die, then other safekeepers will stop trimming WAL until one of them would be restarted with --s3-offload.

Pass regressions tests in compute+pageserver setup

That's is covering ticket for all the problems that will arise out of running pg_regress among with pageserver. Goal here is to get all redo edge cases / problems early on.

Some support routines to make testing easier:

> cd contrib/zenith_store
> make start # will start compute+pageserver and target them to each other using PostgresNode.pm
> make run-regress # starts pg_regress on compute

This task may done in a few steps:

Fix all pageserver crashes, ignoring test diffs
Fix test diffs itself, add proper synchronization of pageserver page version with latest evicted or written LSN.
When pg_regress passes it would make sense to write loop over all collected pages at pageserver and restore them all -- sicne there is possibility that not all pages were asked by processing node. This code would also make be helpful for snapshot export from pageserver.

Add snapshots to the cli

We may model them as a basebackups now.

create -- create basebackup
export -- sends basebackup to destination (pageserver)
import -- receives and saves basebackup from a remote postgres via replication channel and save that

Create LSN datatype

This pattern is quite tedious:

format!("current LSN is {:X}/{:X}", lsn >> 32, lsn & 0xffffffff);

We should have a datatype with Display trait, so that you could write just:

format!("current LSN is {}", lsn);

smgr for receiving pages from the remote page store

Async functions and blocking I/O don't mix

The pageserver code seems to freely mix async functions and blocking I/O. I first noticed this in the rocksdb change (#54), but I don't have to look far to find other examples: restore_relfile and write_wal_file (#43) both do blocking I/O and are called from async functions.

I think this is a big problem.

Mixing blocking and async code together just doesn't work. At its best, we're wasting effort in building async code, only to have the code unable to achieve any concurrency because there is I/O blocking the thread. At worst, the combined binary will have confusing stalls, deadlocks, and performance bugs that are hard to debug.

To make this code async-friendly, we would either need to push the I/O work to standalone threads, or use the async equivalents.

Async file I/O is not too hard: just swap in tokio::fs for std::fs and start propagating .await and async fn as far as needed. But something like rocksdb is a lot harder: as far as I can tell, there's no such thing as an async rocksdb.

Another alternative would be to give up on the async model, and just do everything with explicit threads.

Observability

Each event in safekeeper/pageserver should define some counters/aggregates, that would be at first stored and recalculated locally, then pushed somewhere in the general direction of metrics store like cloud watch or prometheus. Also we may adopt all the opentracing stuff -- define 'span' for each continuous event and send them to jaeger.

If we adopt this earlier that would greatly simplify benchmarking: instead of ad-hoc setup
for each test we can just run load and look into grafana/jaeger to understand what is happening.

Subtasks

Smgr: cache n_blocks info for relations and avoid spamming pageserver with corresponding requests.

Teach Page Server's WAL receiver to respond to KeepAlives

Currently, the page server ignores PrimaryKeepAlive messages. See walreceiver.rs:

            ReplicationMessage::PrimaryKeepAlive(_keepalive) => {
                trace!("received PrimaryKeepAlive");
                // FIXME: Reply, or the connection will time out
            }

Because of that, the primary will terminate the connection after a timeout.

clarify libcurl dependency?

My first build:

Compiling postgres
In file included from /home/eric/zenith/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/smgr/lazyrestore.c:39:
/home/eric/zenith/zenith/tmp_install/build/../../vendor/postgres/src/include/zenith_s3/s3_ops.h:17:10: fatal error: curl/curl.h: No such file or directory
   17 | #include <curl/curl.h>
      |          ^~~~~~~~~~~~~
compilation terminated.
make[4]: *** [../../../../src/Makefile.global:923: lazyrestore.o] Error 1
make[3]: *** [/home/eric/zenith/zenith/tmp_install/build/../../vendor/postgres/src/backend/common.mk:39: smgr-recursive] Error 2
make[3]: *** Waiting for unfinished jobs....
/home/eric/zenith/zenith/tmp_install/build/../../vendor/postgres/src/zenith_s3/s3_ops.c:4:10: fatal error: curl/curl.h: No such file or directory
    4 | #include <curl/curl.h>
      |          ^~~~~~~~~~~~~
compilation terminated.
make[3]: *** [../../src/Makefile.global:923: s3_ops.o] Error 1
make[3]: *** Waiting for unfinished jobs....
/home/eric/zenith/zenith/tmp_install/build/../../vendor/postgres/src/zenith_s3/s3_sign.c:3:10: fatal error: curl/curl.h: No such file or directory
    3 | #include <curl/curl.h>  /* for curl_slist */
      |          ^~~~~~~~~~~~~

I would have expected ./configure to complain if libcurl isn't installed.

Also, on my machine (Ubuntu) I have a few different libcurl -dev package options:

libcurl4-gnutls-dev
  development files and documentation for libcurl (GnuTLS flavour)

libcurl4-nss-dev
  development files and documentation for libcurl (NSS flavour)

libcurl4-openssl-dev
  development files and documentation for libcurl (OpenSSL flavour)

I installed the openssl variant, but haven't tried the others. Maybe we should document the recommended package for debian-flavored environments?

Storage format for the pageserver

Right now everything is stored in-memory on pageserver. One way for persistency is just to have LSM of pages/records sorted by (db, ... , pageno, lsn). But own LSM sounds like a lot of works to do. Can we come up with something better/simpler here? And could we provide better performance compared to RocksDB?

Crash at "VACUUM FULL pg_class"

psql (14devel)
Type "help" for help.

regression=# vacuum full pg_class;
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING:  terminating connection because of crash of another server process

Backtrace:

Core was generated by `postgres: zenith postgres [local] VACUUM                '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055f6917bffd3 in RelationReloadNailed (relation=0x7fdacb6eaef0) at relcache.c:2338
2338				relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
(gdb) bt
#0  0x000055f6917bffd3 in RelationReloadNailed (relation=0x7fdacb6eaef0) at relcache.c:2338
zenithdb/postgres#1  0x000055f6917c0448 in RelationClearRelation (relation=0x7fdacb6eaef0, rebuild=true) at relcache.c:2476
zenithdb/postgres#2  0x000055f6917c0b07 in RelationFlushRelation (relation=0x7fdacb6eaef0) at relcache.c:2739
zenithdb/postgres#3  0x000055f6917c0caf in RelationCacheInvalidateEntry (relationId=1259) at relcache.c:2810
zenithdb/postgres#4  0x000055f6917b17cb in LocalExecuteInvalidationMessage (msg=0x55f692e58918) at inval.c:595
zenithdb/postgres#5  0x000055f6917b151f in ProcessInvalidationMessages (hdr=0x55f692de5b60, func=0x55f6917b16c4 <LocalExecuteInvalidationMessage>) at inval.c:466
zenithdb/postgres#6  0x000055f6917b20cc in CommandEndInvalidationMessages () at inval.c:1098
zenithdb/postgres#7  0x000055f69123e754 in AtCCI_LocalCache () at xact.c:1488
zenithdb/postgres#8  0x000055f69123e138 in CommandCounterIncrement () at xact.c:1058
zenithdb/postgres#9  0x000055f6912899b7 in reindex_relation (relid=1259, flags=18, params=0x7ffdbec14658) at index.c:3998
zenithdb/postgres#10 0x000055f691337d0b in finish_heap_swap (OIDOldHeap=1259, OIDNewHeap=16384, is_system_catalog=true, swap_toast_by_content=false, check_constraints=false, is_internal=true, 
    frozenXid=539, cutoffMulti=1, newrelpersistence=112 'p') at cluster.c:1418
zenithdb/postgres#11 0x000055f6913367ab in rebuild_relation (OldHeap=0x7fdacb6eaef0, indexOid=0, verbose=false) at cluster.c:611
zenithdb/postgres#12 0x000055f6913361e3 in cluster_rel (tableOid=1259, indexOid=0, params=0x7ffdbec14788) at cluster.c:427
zenithdb/postgres#13 0x000055f6913d9b94 in vacuum_rel (relid=1259, relation=0x55f692d63a60, params=0x7ffdbec14990) at vacuum.c:1940
zenithdb/postgres#14 0x000055f6913d7b98 in vacuum (relations=0x55f692e3d618, params=0x7ffdbec14990, bstrategy=0x55f692e3d4f0, isTopLevel=true) at vacuum.c:459
zenithdb/postgres#15 0x000055f6913d7636 in ExecVacuum (pstate=0x55f692d84a80, vacstmt=0x55f692d63b48, isTopLevel=true) at vacuum.c:252
zenithdb/postgres#16 0x000055f69163ec57 in standard_ProcessUtility (pstmt=0x55f692d63f48, queryString=0x55f692d63040 "vacuum full pg_class;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
    dest=0x55f692d64038, qc=0x7ffdbec14dd0) at utility.c:826
zenithdb/postgres#17 0x000055f69163e42f in ProcessUtility (pstmt=0x55f692d63f48, queryString=0x55f692d63040 "vacuum full pg_class;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
    dest=0x55f692d64038, qc=0x7ffdbec14dd0) at utility.c:525
zenithdb/postgres#18 0x000055f69163d1d9 in PortalRunUtility (portal=0x55f692dc8860, pstmt=0x55f692d63f48, isTopLevel=true, setHoldSnapshot=false, dest=0x55f692d64038, qc=0x7ffdbec14dd0) at pquery.c:1159
zenithdb/postgres#19 0x000055f69163d406 in PortalRunMulti (portal=0x55f692dc8860, isTopLevel=true, setHoldSnapshot=false, dest=0x55f692d64038, altdest=0x55f692d64038, qc=0x7ffdbec14dd0) at pquery.c:1305
zenithdb/postgres#20 0x000055f69163c8e1 in PortalRun (portal=0x55f692dc8860, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x55f692d64038, altdest=0x55f692d64038, qc=0x7ffdbec14dd0)
    at pquery.c:779
zenithdb/postgres#21 0x000055f691636409 in exec_simple_query (query_string=0x55f692d63040 "vacuum full pg_class;") at postgres.c:1240
zenithdb/postgres#22 0x000055f69163a958 in PostgresMain (argc=1, argv=0x7ffdbec15060, dbname=0x55f692d90748 "postgres", username=0x55f692d90728 "zenith") at postgres.c:4394
zenithdb/postgres#23 0x000055f691574241 in BackendRun (port=0x55f692d84320) at postmaster.c:4484
zenithdb/postgres#24 0x000055f691573b73 in BackendStartup (port=0x55f692d84320) at postmaster.c:4206
zenithdb/postgres#25 0x000055f69156ff0f in ServerLoop () at postmaster.c:1730
zenithdb/postgres#26 0x000055f69156f75c in PostmasterMain (argc=3, argv=0x55f692d5c9b0) at postmaster.c:1402
zenithdb/postgres#27 0x000055f69146e9fa in main (argc=3, argv=0x55f692d5c9b0) at main.c:213

This happens in the 'vacuum` regression test.

Handle splits in per-page wal

As expected there are problems with multi-page WAL records. Didn't look closely yet, but it seems that redo tries to read page from a different block_id and now there is now place to read it from.

    frame zenithdb/postgres#3: 0x000000010728b3b4 postgres`ExceptionalCondition(conditionName="P_INCOMPLETE_SPLIT(pageop)", errorType="FailedAssertion", fileName="nbtxlog.c", lineNumber=152) at assert.c:69:2
    frame zenithdb/postgres#4: 0x0000000106b828d4 postgres`_bt_clear_incomplete_split + 372
    frame zenithdb/zenith#40: 0x0000000106b82116 postgres`btree_xlog_newroot + 598
    frame zenithdb/zenith#39: 0x0000000106b7ecdc postgres`btree_redo + 540
    frame zenithdb/zenith#38: 0x00000001077cc15f zenith_store.so`get_page_by_key + 463

Add module name to logs in pageserver

Rebuild crate binaries on integration tests rebuild

Right now we call compiled binaries from integration tests, so cargo do not know there is actual dependency

Store non-rel wal records on pageservers

Handle race condition between applying WAL and requesting pages from Pagers

This race condition in GetPage@LSN was described in Socrates article (4.5 Secondary Compute Node).
When replica requests some page from Pager with LSN equal to the current WAL apply LSN (lets call it LSN_0),
then there are three possible issues:

Pager may return page "from future" (with LSN=LSN_1 greater than specified LSN_0). It is true for Socrates: "According to the GetPage@LSN protocol, the Page Server may return a page from the future". I am not sure that our pager should follow this rule. Actually it complicates switching replica to the particular snapshot. But if such situation can happen, then smgr should wait some time, given a chance to wal receiver to apply WAL up to this LSN_1.
Pagers may return page with LSN-LSN_1 < LSN_0, but page was updated between LSN_1 and LSN_0. It can happen if pager didn't yet apply WAL till LSN_0. that this situation should be prevented at Pager which should suspend GetPage@LSN request until WAL is applied till this LSN.
Pager returns page with LSN_0, but while waiting for this response, replica has applied WAL till LSN_2 which contains records updating this page. In this case we have three choices:

Suspend applying WAL if it tries to update page requested from Pager (so, definitely, we should remember fact of requesting page from Pager).
Remember WAL records touching requested page and apply them when page is received
Resubmit GetPage@LSN request once such collision was detected.
I think that suspending WAL apply is the best approach, because it is much simpler than trying to maintain queue of delayed WAL records and applying the for the concrete page. And there is no risk of starvation as in case of retrying GetPage request.

safekeeper in rust

create_index regression test failure because of duplicates

https://github.com/zenithdb/zenith/runs/2457481429

Regression from top-of-tree (2021-04-30)

I realize this is a new work-in-progress, but perhaps this would serve as a useful regression test

#!/bin/sh

export PGHOST=127.0.0.1
export PGPORT=55432
export PGDATABASE=postgres

[ -f chembl_28_postgresql.tar.gz ] || {
wget ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_28_postgresql.tar.gz
tar -zxf chembl_28_postgresql.tar.gz
}
createuser chembl
createdb chembl_28 -O chembl
pg_restore -O -j 4 -U chembl -d chembl_28 chembl_28/chembl_28_postgresql/chembl_28_postgresql.dmp

While restoring I observed the following errors after trying to branching the database while the restore was running

% RUST_BACKTRACE=full ./target/debug/zenith branch eric
thread 'main' panicked at 'Missing start-point', zenith/src/main.rs:237:13
stack backtrace:
   0:        0x1030d008c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h9955000502fddd8a
   1:        0x1031068d0 - core::fmt::write::hd6aaffe828fad50f
   2:        0x1030cf7e4 - std::io::Write::write_fmt::h8939db9f062fb6a9
   3:        0x1030e663c - std::panicking::default_hook::{{closure}}::h22d061fa0f32cbb9
   4:        0x1030e61b0 - std::panicking::default_hook::h900d59958999a29b
   5:        0x1030e6af0 - std::panicking::rust_panic_with_hook::h407c595b7d5fe810
   6:        0x102c5a568 - std::panicking::begin_panic::{{closure}}::h160eaef961b35fb8
                               at /private/tmp/rust-20210325-61749-fsgy2a/rustc-1.51.0-src/library/std/src/panicking.rs:520:9
   7:        0x102c58bc0 - std::sys_common::backtrace::__rust_end_short_backtrace::hcd38a220f638de80
                               at /private/tmp/rust-20210325-61749-fsgy2a/rustc-1.51.0-src/library/std/src/sys_common/backtrace.rs:141:18
   8:        0x103108c70 - std::panicking::begin_panic::hd8fa8c5428944368
                               at /private/tmp/rust-20210325-61749-fsgy2a/rustc-1.51.0-src/library/std/src/panicking.rs:519:12
   9:        0x1027c2040 - zenith::run_branch_cmd::h257a5f0220d7d10f
                               at /Users/eradman/git.oss/zenith/zenith/src/main.rs:237:13
  10:        0x1027c02ec - zenith::main::h49c8481c8bce9d2a
                               at /Users/eradman/git.oss/zenith/zenith/src/main.rs:98:39
  11:        0x1027bccc0 - core::ops::function::FnOnce::call_once::hcf21086145e5d2a3
                               at /private/tmp/rust-20210325-61749-fsgy2a/rustc-1.51.0-src/library/core/src/ops/function.rs:227:5
  12:        0x1027c42cc - std::sys_common::backtrace::__rust_begin_short_backtrace::hb1bc45af75409a3c
                               at /private/tmp/rust-20210325-61749-fsgy2a/rustc-1.51.0-src/library/std/src/sys_common/backtrace.rs:125:18
  13:        0x1027bea10 - std::rt::lang_start::{{closure}}::hecd713fbcf38ff77
                               at /private/tmp/rust-20210325-61749-fsgy2a/rustc-1.51.0-src/library/std/src/rt.rs:66:18
  14:        0x1030ea1f0 - std::rt::lang_start_internal::hcd175898c32ebdc5
  15:        0x1027be9e8 - std::rt::lang_start::hc1dd096f18cf7119
                               at /private/tmp/rust-20210325-61749-fsgy2a/rustc-1.51.0-src/library/std/src/rt.rs:65:5
  16:        0x1027c3674 - _main

from .zenith/pgdatadirs/pg1/log:

2021-04-30 10:46:18.442 EDT [23331] HINT:  Please REINDEX it.
2021-04-30 10:46:18.442 EDT [23331] CONTEXT:  COPY component_sequences, line 92
2021-04-30 10:46:18.442 EDT [23331] STATEMENT:  COPY component_sequences (component_id, component_type, accession, sequence, sequence_md5sum, description, tax_id, organism, db_source, db_version) FROM stdin;
	
2021-04-30 10:46:18.788 EDT [23447] LOG:  [ZENITH_SMGR] libpqpagestore: connected to 'host=127.0.0.1 port=64000'
2021-04-30 10:46:18.843 EDT [23447] ERROR:  index "pk_actsm_id" contains unexpected zero page at block 0
2021-04-30 10:46:18.843 EDT [23447] HINT:  Please REINDEX it.
2021-04-30 10:46:18.843 EDT [23447] CONTEXT:  while cleaning up index "pk_actsm_id" of relation "public.activity_supp_map"
	automatic vacuum of table "chembl_28.public.activity_supp_map"
2021-04-30 10:46:18.855 EDT [23447] ERROR:  catalog is missing 1 attribute(s) for relid 58831
2021-04-30 10:46:18.855 EDT [23447] CONTEXT:  automatic vacuum of table "chembl_28.public.activity_supp"
TRAP: FailedAssertion("relid == targetRelId", File: "/Users/eradman/git.oss/zenith/tmp_install/build/../../vendor/postgres/src/backend/utils/cache/relcache.c", Line: 1090, PID: 23447)
0   postgres                            0x0000000104ef5de8 ExceptionalCondition + 268
1   postgres                            0x0000000104edee5c RelationBuildDesc + 324
2   postgres                            0x0000000104ede0f8 RelationIdGetRelation + 404
3   postgres                            0x00000001047858a8 try_relation_open + 192
4   postgres                            0x0000000104a5f714 vacuum_open_relation + 180
5   postgres                            0x0000000104a5e980 vacuum_rel + 372
6   postgres                            0x0000000104a5e07c vacuum + 1664
7   postgres                            0x0000000104c00670 autovacuum_do_vac_analyze + 136
8   postgres                            0x0000000104bff390 do_autovacuum + 3356
9   postgres                            0x0000000104bfc938 AutoVacWorkerMain + 1244
10  postgres                            0x0000000104bfc444 StartAutoVacWorker + 204
11  postgres                            0x0000000104c1c108 StartAutovacuumWorker + 260
12  postgres                            0x0000000104c1535c sigusr1_handler + 1064
13  libsystem_platform.dylib            0x000000019ae0dc44 _sigtramp + 56
14  postgres                            0x0000000104c16c6c ServerLoop + 364
15  postgres                            0x0000000104c1446c PostmasterMain + 5804
16  postgres                            0x0000000104aff8c8 main + 856
17  libdyld.dylib                       0x000000019ade1f34 start + 4
2021-04-30 10:46:18.903 EDT [22094] LOG:  server process (PID 23447) was terminated by signal 6: Abort trap: 6
2021-04-30 10:46:18.903 EDT [22094] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE public.activity_supp_map
2021-04-30 10:46:18.903 EDT [22094] LOG:  terminating any other active server processes
2021-04-30 10:46:18.909 EDT [23449] LOG:  PID 23332 in cancel request did not match any process
2021-04-30 10:46:18.911 EDT [23450] LOG:  PID 23334 in cancel request did not match any process
2021-04-30 10:46:18.912 EDT [22094] LOG:  all server processes terminated; reinitializing
2021-04-30 10:46:18.913 EDT [23451] LOG:  database system was interrupted; last known up at 2021-04-30 10:46:16 EDT
2021-04-30 10:46:18.943 EDT [23451] LOG:  database system was not properly shut down; automatic recovery in progress
2021-04-30 10:46:18.945 EDT [23451] LOG:  redo starts at 5/E9003BB8
2021-04-30 10:46:19.420 EDT [23451] LOG:  invalid record length at 6/3525A78: wanted 24, got 0
2021-04-30 10:46:19.420 EDT [23451] LOG:  redo done at 6/3523A28 system usage: CPU: user: 0.42 s, system: 0.05 s, elapsed: 0.47 s
2021-04-30 10:46:19.423 EDT [22094] LOG:  database system is ready to accept connections

% uname -a
Darwin Erics-Mac-mini.local 20.3.0 Darwin Kernel Version 20.3.0: Thu Jan 21 00:06:51 PST 2021; root:xnu-7195.81.3~1/RELEASE_ARM64_T8101 arm64

Handle CREATE DATABASE in pageserver

When new database is created, postgres performs copydir from template to new database dir. In most cases, it will be only catalog files.
We need to copy all pages in page_cache when XLOG_DBASE_CREATE received,
Or we can implement it by including template database pages to the replay chain.

It is necessary to pass regression tests.
Now there is a crutch to read such files locally from filesystem: neondatabase/postgres@daec929

Improve on transmute for bindgen structs

Is there a way to improve upon this?

    // Initialize an all-zeros ControlFileData struct
    pub fn new() -> ControlFileData {
        let controlfile: ControlFileData;

        let b = [0u8; SIZEOF_CONTROLDATA];
        controlfile = unsafe {
            std::mem::transmute::<[u8; SIZEOF_CONTROLDATA], ControlFileData>(b)
        };

        return controlfile;
    }
}

Ideally we could have some automatically derived functions that convert to/from [u8].

basic integration tests for walkeeper

package zenith cli as a pip module

and bundle pg binaries inside to run on a mac and linux

File::write does not guarantee all bytes were written

Clippy says:

error: written amount is not handled. Use `Write::write_all` instead
   --> pageserver/src/walredo.rs:384:13
    |
384 |             config.write(b"shared_buffers=128kB\n")?;
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[deny(clippy::unused_io_amount)]` on by default
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unused_io_amount

error: written amount is not handled. Use `Write::write_all` instead
   --> pageserver/src/walredo.rs:385:13
    |
385 |             config.write(b"fsync=off\n")?;
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unused_io_amount

Implement sandbox to run WAL redo routines in

Creating an SP-GiST index crashes with page servers

Creating an SP-Gist index is currently failing. From the regression tests:

CREATE INDEX quad_box_tbl_idx ON quad_box_tbl USING spgist(b);
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
connection to server was lost

The root cause is that SP-Gist index build takes some shortcuts wrt. WAL-logging. It builds the index by inserting each tuple one-by-one. It uses the shared buffer cache as usual for that. But it doesn't WAL-log anything in this stage. Instead, it scans the whole relation at the end of the CREATE INDEX operation, and writes all the pages to WAL. That doesn't work, when you don't have local storage at all. With the Page Servers, we rely on WAL redo to reconstruct pages, and if the pages have not been WAL-logged yet when they are evicted from the page cache, they are lost. So when the index building needs to read back a page that it had created earlier, it gets back an all-zeros page, and trips an assertion failure

This is actually fairly new in PostgreSQL, before commit 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b, it was done in the "naive" way, which would've worked with the page server.

page cache atomic variable accesses use wrong ordering

page_cache.rs has quite a few places where we see code like this:

self.last_valid_lsn.store(lsn, Ordering::Relaxed);
// ...
self.last_valid_lsn.load(Ordering::Relaxed),

This is not a valid use of Ordering::Relaxed. We may get away with it for a while, but the semantics of Relaxed would allow the compiler/CPU to do things we clearly don't intend.

The load and store accesses should be Acquire and Release, respectively. I don't recall what the right ordering should be for things like fetch_add...

Spring cleaning

Go through the repos and "git rm" obsolete stuff:

Is contrib/zenith_store/memstore.* still needed? We have the Rust pageserver now.
Which of the programs in src/bin/zenith_* are still needed?

Update all the READMEs

Import data in the pageserver

Now pageserver accepts only incoming WAL. If we want to import/restore database in it we also need to accept just pages. I would model that as stored procedure that accepts pages via COPY protocol. Format may be just base backup data stream.

Set up continuous integration

When you push a commit, a Continuous Integration server somewhere starts up, runs all regression tests, and publishes the results on the web.

Implement CLI prototype

My proposal is here:

https://github.com/libzenith/rfcs/blob/003-laptop-cli.md/003-laptop-cli.md

but feel free to suggest changes or even different concept.

I think we can iterate now faster with CLI writter in python, but later we should change this to some compiled language to simplify distribution and do not depend on anything on end-user laptop.

Introduce proper network messaging

Now in a lot of places we use custom serialization of messages, we can adopt serde or protobuf or smth else. If we define good practices here earlier it would be less work in future.

Way I see it is that we may stick with general layout of postgres protocol, but serialize / deserialize
things we put in copy messages. Or just use custom protocols (and lose ability psql to that endpoint and
query smth).

Protobuf is boring and means auto-generated code and some associated pain, but it brings following
niceties:

Message definitions are separated from code and serializers can be generated for a bunch of languages, also that definition serve as good overview of api's
enforce no-deleted-fields policy and forward / backward compatibility.

So we need to do some research and decide what to use.

track latest evicted lsn and wait for it in getPage request

upload objects < FirstBootstrapObjectId to the page store

To store catalog in the page store we need to have genbki objects there.

Page server consumes 100% CPU when idle

Tasks: 324 total,   1 running, 322 sleeping,   0 stopped,   1 zombie
%Cpu(s): 10.8 us,  4.6 sy,  0.0 ni, 84.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7837.1 total,    349.9 free,   5690.8 used,   1796.5 buff/cache
MiB Swap:   8072.0 total,   5805.0 free,   2267.0 used.   1353.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                          
1444906 heikki    20   0 1340240 639656  18188 S 100.0   8.0  65:33.04 pageserver                       
   3045 heikki    20   0  590544  57248  23876 S   6.2   0.7  44:41.82 xfce4-terminal                   
      1 root      20   0  167840   8748   5660 S   0.0   0.1   0:12.65 systemd                          
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.18 kthreadd                         
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                           
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                       
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                     
     10 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tasks_rude_                  
     11 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tasks_trace                  
     12 root      20   0       0      0      0 S   0.0   0.0   0:08.03 ksoftirqd/0                      
     13 root      20   0       0      0      0 I   0.0   0.0   3:02.13 rcu_sched                        
     14 root      rt   0       0      0      0 S   0.0   0.0   0:01.41 migration/0                      
     15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0

perf report:

Samples: 29K of event 'cycles', Event count (approx.): 24216597242
  Children      Self  Command          Shared Object      Symbol
+   99.08%     0.00%  Page Service th  pageserver         [.] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
+   99.07%     0.00%  Page Service th  pageserver         [.] tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
+   99.07%     0.00%  Page Service th  pageserver         [.] <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
+   99.06%     0.01%  Page Service th  pageserver         [.] pageserver::page_service::page_service_main::{{closure}}::{{closure}}
+   99.05%     0.02%  Page Service th  pageserver         [.] <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
+   99.03%     0.01%  Page Service th  pageserver         [.] pageserver::page_service::Connection::run::{{closure}}
+   99.01%     0.01%  Page Service th  pageserver         [.] <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
+   98.99%     0.01%  Page Service th  pageserver         [.] pageserver::page_service::Connection::process_query::{{closure}}
+   98.98%     0.00%  Page Service th  pageserver         [.] <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
+   98.96%     0.88%  Page Service th  pageserver         [.] pageserver::page_service::Connection::handle_pagerequests::{{closure}}
+   95.53%     0.50%  Page Service th  pageserver         [.] <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
+   93.12%     1.76%  Page Service th  pageserver         [.] pageserver::page_service::Connection::read_message::{{closure}}
+   87.36%     1.44%  Page Service th  pageserver         [.] <tokio::io::util::read_buf::ReadBuf<R,B> as core::future::future::Future>::poll
+   74.02%     0.57%  Page Service th  pageserver         [.] <&mut T as tokio::io::async_read::AsyncRead>::poll_read
+   73.32%     0.29%  Page Service th  pageserver         [.] <tokio::io::util::buf_writer::BufWriter<W> as tokio::io::async_read::AsyncRead>::poll_read
+   72.40%     0.34%  Page Service th  pageserver         [.] <tokio::net::tcp::stream::TcpStream as tokio::io::async_read::AsyncRead>::poll_read
+   71.69%     0.30%  Page Service th  pageserver         [.] tokio::net::tcp::stream::TcpStream::poll_read_priv
+   71.40%     1.48%  Page Service th  pageserver         [.] tokio::io::poll_evented::PollEvented<E>::poll_read
+   68.54%     0.14%  Page Service th  pageserver         [.] tokio::io::driver::registration::Registration::poll_read_io
+   68.39%     0.79%  Page Service th  pageserver         [.] tokio::io::driver::registration::Registration::poll_io
+   44.22%     0.41%  Page Service th  pageserver         [.] tokio::io::poll_evented::PollEvented<E>::poll_read::{{closure}}
+   41.12%     0.09%  Page Service th  pageserver         [.] <&mio::net::tcp::stream::TcpStream as std::io::Read>::read
+   41.03%     0.08%  Page Service th  pageserver         [.] mio::io_source::IoSource<T>::do_io
+   40.94%     0.10%  Page Service th  pageserver         [.] mio::sys::unix::IoSourceState::do_io
+   40.82%     0.15%  Page Service th  pageserver         [.] <&mio::net::tcp::stream::TcpStream as std::io::Read>::read::{{closure}}
+   40.67%     0.72%  Page Service th  pageserver         [.] <&std::net::tcp::TcpStream as std::io::Read>::read
+   40.00%     1.77%  Page Service th  libc-2.31.so       [.] __libc_recv
+   24.27%     0.00%  Page Service th  [unknown]          [.] 0xffffffffa8c0008c
+   23.40%     1.30%  Page Service th  pageserver         [.] tokio::io::driver::registration::Registration::poll_ready
+   15.18%    15.18%  Page Service th  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
+   14.98%     0.00%  Page Service th  [unknown]          [.] 0xffffffffa8aadbe4
+    8.58%     0.00%  Page Service th  [unknown]          [.] 0xffffffffa8aaa483
+    8.17%     0.00%  Page Service th  [unknown]          [.] 0xffffffffa88b1a65
+    7.38%     0.18%  Page Service th  pageserver         [.] tokio::coop::poll_proceed
+    7.20%     0.27%  Page Service th  pageserver         [.] std::thread::local::LocalKey<T>::with
+    6.95%     0.98%  Page Service th  pageserver         [.] tokio::io::read_buf::ReadBuf::filled
+    6.78%     0.83%  Page Service th  pageserver         [.] std::thread::local::LocalKey<T>::try_with
+    6.37%     6.37%  Page Service th  [kernel.kallsyms]  [k] syscall_return_via_sysret
+    6.31%     0.00%  Page Service th  pageserver         [.] 0x0000561016cad088
+    6.02%     0.49%  Page Service th  pageserver         [.] core::slice::index::<impl core::ops::index::Index<I> for [T]>::index
+    5.69%     5.69%  Page Service th  libc-2.31.so       [.] __memmove_avx_unaligned_erms
+    5.53%     0.12%  Page Service th  pageserver         [.] tokio::io::driver::Handle::inner
+    5.49%     0.71%  Page Service th  pageserver         [.] <core::ops::range::RangeTo<usize> as core::slice::index::SliceIndex<[T]>>::index
+    5.41%     1.18%  Page Service th  pageserver         [.] alloc::sync::Weak<T>::upgrade
+    5.37%     5.37%  Page Service th  [kernel.kallsyms]  [k] entry_SYSCALL_64
+    5.36%     0.57%  Page Service th  pageserver         [.] core::cell::Cell<T>::set
+    4.81%     1.11%  Page Service th  pageserver         [.] <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index
+    4.75%     0.00%  Page Service th  [unknown]          [.] 0xffffffffa88b1985
+    4.26%     0.57%  Page Service th  pageserver         [.] tokio::coop::poll_proceed::{{closure}}
+    4.14%     0.00%  Page Service th  [unknown]          [.] 0xffffffffa8c00030
Tip: Create an archive with symtabs to analyse on other machine: perf archive

Investigate why & fix. Probably something silly and straightforward..

Deal with unlogged tables

Currently, the smgr API doesn't differentiate unlogged tables. We will try to read/write them through the page servers, too. Can we detect and force them to be local-only or something?

Figure out repo structure

Create website specs for designer

Implement integration tests for whole setup

page_service doesn't handle messages split across TCP packets correctly

I noticed while reading the code that FeMessage::parse function consumes the bytes it reads from the buffer, even if the whole message is not present. That causes trouble if the message is split across multiple TCP packets. To reproduce, I created a proxy that forwards the data one byte at a time, i.e. each byte is sent as separate TCP packet:

# start pageserver, listening on port 64000
./target/debug/pageserver -l 127.0.0.1:64000

# in another terminal, create proxy:
socat -b1 TCP-LISTEN:64002,reuseaddr TCP:localhost:64000

# connect with psql:
psql -p 64002 -h localhost

# pageserver panics:
thread 'tokio-runtime-worker' panicked at 'cannot advance past `remaining`: 80 <= 8', /home/heikki/.cargo/registry/src/github.com-1ecc6299db9ec823/bytes-1.0.1/src/bytes_mut.rs:953:9

If you replace -b1 above with a higher number, it works.

Tests in test_wall_acceptor.rs are broken by clap argument changes

Commit 3b09a74: Implement offloading of old WAL files to S3 in walkeeper breaks all of the tests in test_wal_acceptor.rs.

I tried removing the duplicate "listen" clap argument, but the tests still fail, trying to use the --pageserver argument, which was removed.

Fix race at postgres instance + walreceiver start

Now the flag is set on pageserver side.
https://github.com/zenithdb/zenith/blob/main/pageserver/src/page_cache.rs#L292
If we request some user data earlier than the walreceiver got the record and updated the flag, page_cache fails to wait for the most recent version. Failure looks like this:
https://github.com/zenithdb/zenith/runs/2402413454

TODO: Move this logic to postgres side extension.

Snapshots part of CLI

host several datadirs in one pageserver
walreceiver call me pls
import snapshot (in backup format) from standalone postgres and S3
start postgres from snapshot
savepoint-like snapshots