pendulum-project / ntpd-rs


A full-featured implementation of the Network Time Protocol, including NTS support.

Home Page: https://trifectatech.org

License: Other

Rust 98.82% Shell 1.18%

ntp ntp-client ntp-protocol ntp-server ntpd time-synchronization clock time

ntpd-rs's People

atezet djc jsha tshepang erikjee cyberflamego jauderho step-security-bot fanweixiao tngo04 mayhemheroes ohkinozomu 0xcrust sanmai-nl moderation andrewaylett valpackett sylvestre elarasuz chertov lisetroos mfiumara jamestiotio phil-skillwon paulgear mikaelurankar mickvangelderen shabbirhasan1 ekeranen sivizius wolfi-chainguard-demo emiliewu dcampbell24 terminaldweller xylim98 chainguard-wolfi-bites-back mu-l luishsr krioyo artisdom nkwilson

ntpd-rs's Issues

Implement clock adjustment system calls

Preferably without dependencies on non-core crates (prefer not to use nix, see also #14). If this requires some unsafe code then that would ideally be put in a separate crate.

Write documentation for the management client

The management client (#138) will need end-user documentation written for it once it is closer to its final shape.

Fix PeerSnapshot not updating on peer reset

Logging

Set up the basics for logging and add some logging in relevant places.

Implement proper peer management

Have a data structure, and sufficient signaling to the owner of that structure, to keep it up to date.

Cleanup NTP-proto

Current work has created a bit of a mess. We need to:

choose better names
look at where code is still dead but needed in the future
re-evaluate the public/private-ness of fields/functions on types

Properly implement seconds counter `c.t`

c.t is special; in the spec it is a seconds counter. Idea: use Rust's Instant.
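A minimal sketch of that idea, assuming the counter only needs whole seconds since some monotonic reference point. The type `SecondsCounter` and its method names are hypothetical and do not come from the ntpd-rs code base.

```rust
use std::time::Instant;

/// Hypothetical stand-in for the spec's `c.t` seconds counter.
#[derive(Debug, Clone, Copy)]
pub struct SecondsCounter {
    /// Monotonic reference point from which seconds are counted.
    start: Instant,
}

impl SecondsCounter {
    /// Creates a counter that reads zero at `start`.
    pub fn starting_at(start: Instant) -> Self {
        Self { start }
    }

    /// Whole seconds elapsed between the reference point and `now`.
    /// `saturating_duration_since` yields zero if `now` precedes `start`.
    pub fn seconds_at(&self, now: Instant) -> u64 {
        now.saturating_duration_since(self.start).as_secs()
    }
}
```

Passing `now` in explicitly, instead of calling `Instant::now()` inside the method, keeps the counter easy to test.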

Add SRV record peer discovery mechanism

One easy way to discover peer addresses would be by using SRV records in DNS. This would make the client a lot easier to use in many cloud-based environments.

Peer should detect network interface changes

Mainly for devices on unstable internet connections, e.g. a laptop that switches Wi-Fi networks.

Discussion: Failure mode upon detecting programming errors

There are currently a few places, and in the future there will probably be a few more, where we can and do perform checks that essentially represent invariants that should always hold, regardless of any input provided from external sources. As such, failure of these checks directly indicates a bug in our code, and the question then becomes what the behaviour of these checks should be in release builds.

Given the specific nature of NTP, especially for an NTP client, I am of the opinion that the safer option is to actively blow up when such errors are detected. Assuming the error is detected early enough, the system time is hopefully still reasonably correct, and without further corrections it should not drift far in the short term. At the same time, blowing up makes the issue very visible to whoever manages the server running the client. By contrast, silently ignoring the error or trying to work around it could result in incorrect steering of the clock (since the software is now in a state that was never anticipated). Incorrect steering could produce significant deviation from UTC fairly quickly, and it is far less visible to whoever manages the server, which increases the chance that a faulty situation persists for a long time.

Is this the view we want to take as a project, or are there arguments to the contrary that I am forgetting about here?
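In Rust terms, the question is roughly whether such checks use `assert!`, which stays active in release builds, or `debug_assert!`, which is compiled out of release builds by default. The sketch below shows the "blow up" option; the invariant and the function name are invented for illustration only.

```rust
/// Hypothetical invariant: the selection step can never pick more peers
/// than are known. `assert!` is evaluated in release builds too, so a
/// violation panics instead of letting steering continue on bad state.
pub fn check_selection_invariant(known_peers: usize, selected_peers: usize) {
    assert!(
        selected_peers <= known_peers,
        "invariant violated: selected {} of {} peers",
        selected_peers,
        known_peers
    );
}
```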

Run beta and nightly Rust in CI

Helps find problems early.

Implement support for DENY/RSTR kiss codes

Implement proper killing-off of associations with peers that want nothing to do with us.
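As a rough sketch of the decision involved: RFC 5905 (section 7.4) carries kiss codes as four ASCII characters in the reference ID of a stratum 0 packet, and asks clients to demobilize the association on DENY and RSTR. The enum and function names below are hypothetical, not the project's API.

```rust
/// What the client should do after receiving a kiss-o'-death packet.
#[derive(Debug, PartialEq, Eq)]
pub enum KissAction {
    /// DENY or RSTR: stop using this association entirely.
    Demobilize,
    /// RATE: keep the association but poll less often.
    ReducePollRate,
    /// Any other code: no special handling in this sketch.
    Ignore,
}

/// Interprets the reference ID as a kiss code. Kiss codes are only
/// meaningful when the stratum field is 0, so other packets yield `None`.
pub fn kiss_action(stratum: u8, reference_id: [u8; 4]) -> Option<KissAction> {
    if stratum != 0 {
        return None; // not a kiss-o'-death packet
    }
    Some(match &reference_id {
        b"DENY" | b"RSTR" => KissAction::Demobilize,
        b"RATE" => KissAction::ReducePollRate,
        _ => KissAction::Ignore,
    })
}
```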

Log warn/error tracing messages to sentry

Fix race condition that may occur when two reset-all-peer events happen in short succession

The following sequence of events is possible:

Clock controller initiates peer reset
Peer A resets, controller handles
Peer B resets, controller handles
Peer C resets, does measurement, but controller is busy
Clock controller initiates second peer reset
Peer C reset handled by clock controller
Clock steering uses peer C state from just before second reset.

As a result, the steering code uses incorrect (stale) state from peer C.
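One possible way to close this window, sketched under the assumption that every reset gets a fresh epoch number and every measurement is tagged with the epoch in which it was taken. This is an illustration of the technique, not the fix adopted in ntpd-rs; all names are hypothetical.

```rust
/// A measurement reported by a peer, tagged with the reset epoch
/// that was current for the peer when the measurement was taken.
#[derive(Debug, Clone, Copy)]
pub struct Measurement {
    pub reset_epoch: u64,
    pub offset_seconds: f64,
}

/// Hypothetical controller that tracks the current reset epoch.
#[derive(Debug, Default)]
pub struct ClockController {
    current_epoch: u64,
    accepted: Vec<Measurement>,
}

impl ClockController {
    /// Starts a new reset: discards everything gathered so far and
    /// returns the new epoch for distribution to the peers.
    pub fn initiate_reset(&mut self) -> u64 {
        self.current_epoch += 1;
        self.accepted.clear();
        self.current_epoch
    }

    /// Accepts only measurements taken in the current epoch, so state
    /// from before the latest reset can never reach the steering code.
    pub fn handle_measurement(&mut self, measurement: Measurement) -> bool {
        if measurement.reset_epoch != self.current_epoch {
            return false;
        }
        self.accepted.push(measurement);
        true
    }

    /// Number of measurements currently usable for steering.
    pub fn accepted_count(&self) -> usize {
        self.accepted.len()
    }
}
```

In the scenario above, peer C's measurement would carry the first epoch and be rejected once the second reset has started.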

Write documentation on configuration

Describe which configuration options our ntp client provides and how they can be used.

Also should describe current scope of project.

Fix incorrect logic around timestamps for packet acceptability

Deny kiss packets whose origin timestamp is incorrect

This is a departure from the NTP specification; however, the security gains, especially against DoS attacks, are such that it is worth it.
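A minimal sketch of such a check, assuming timestamps are handled as raw 64-bit NTP values and that the client remembers the transmit timestamp of its most recent request. The function name and signature are hypothetical.

```rust
/// Returns true when a kiss packet echoes, in its origin timestamp field,
/// the transmit timestamp of our most recent outstanding request.
pub fn kiss_origin_matches(packet_origin: u64, last_sent_transmit: Option<u64>) -> bool {
    match last_sent_transmit {
        Some(sent) => packet_origin == sent,
        // Nothing outstanding, so the kiss packet is unsolicited.
        None => false,
    }
}
```

An off-path attacker who cannot observe the request has to guess the timestamp, which is what makes spoofed kiss packets hard to land.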

Chrony replacement

We are currently using chrony to synchronize with AWS clocks. Would this tool be able to replace chrony? If so how would that work, roughly?

test observer.rs

start task, see if behavior of unix socket is right

Kernel-level send timestamping

Figure out a good api and implement kernel-level send timestamping (stretch goal)

Write a readme

Write a (short) readme explaining what this repo is and what the current state of it is.

Configuration mechanism daemon

Implement a proper mechanism for configuring the daemon, e.g. which parameter values are needed by proto and which servers to connect to.

BUG: No warning when started with empty peer list

When started with an empty peer list, the daemon gives no indication of an issue; instead it stays completely silent.

Implement dynamic management of peers

In preparation for pools, we should have tools that allow peers to be added and removed dynamically (i.e. at runtime).

Missing abs() When Checking Panic Threshold

reported by jsha

I notice a bug in the panic calculation: offset_too_large compares an offset against the panic threshold without first calling .abs() on it. This means that negative offsets will never be considered too large and will never cause a panic. By contrast, checking the step threshold does call .abs().
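The fix described here amounts to comparing the magnitude of the offset. The sketch below assumes offsets and thresholds are plain `f64` seconds; the real `offset_too_large` signature is not shown in the report, so the parameters are an assumption.

```rust
/// Returns true when the offset exceeds the panic threshold in either
/// direction. Without `.abs()`, a large negative offset would pass.
pub fn offset_too_large(offset_seconds: f64, panic_threshold_seconds: f64) -> bool {
    offset_seconds.abs() > panic_threshold_seconds
}
```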

Run cargo deny in CI

This helps check dependency licenses, security vulnerabilities and other stuff. Something like this:

https://github.com/InstantDomain/instant-distance/blob/main/.github/workflows/rust.yml#L79

Clock adjustment

Implement the state machine needed for actually doing clock adjustments. (good luck)

test that peer sends the messages/has the state updates that we expect

somehow manually trigger the poll

Add JSON based log output

Add a command line option/config option that enables json based output instead of the current text based format.

implement clock selection

https://datatracker.ietf.org/doc/html/rfc5905#section-11.2.1

https://datatracker.ietf.org/doc/html/rfc5905#appendix-A.5.5.1

Add support for pools

In NTP/Chrony there is support for pools:

A pool uses multiple DNS query results to the pool address to get additional peers to connect to. A single pool can instantiate multiple peers. This is different from a traditional server directive which only instantiates a single peer connection.
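A rough sketch of the selection step, assuming the DNS results are already available as socket addresses (for example from `std::net::ToSocketAddrs`). The function `pick_pool_peers` and its parameters are hypothetical.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Picks up to `wanted` new peers from the resolved pool addresses,
/// skipping addresses already in use and duplicates within the results.
pub fn pick_pool_peers(
    resolved: impl IntoIterator<Item = SocketAddr>,
    in_use: &HashSet<SocketAddr>,
    wanted: usize,
) -> Vec<SocketAddr> {
    let mut picked = Vec::new();
    for addr in resolved {
        if picked.len() == wanted {
            break; // the pool has enough peers
        }
        if !in_use.contains(&addr) && !picked.contains(&addr) {
            picked.push(addr);
        }
    }
    picked
}
```

If fewer than `wanted` addresses come back, the caller could repeat the DNS query later to fill the remaining slots.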

Extract configuration parameters from ntp-proto

A number of constants currently used in ntp-proto really should be configurable. Make these configurable by the caller.

test dynamic configuration

have the client code talk to an RwLock with the relevant data, check that it is updated
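Such a test could look roughly like the following, using only the standard library. `DynamicConfig` and its single field are placeholders; the real configuration type is not described here.

```rust
use std::sync::{Arc, RwLock};

/// Placeholder for the dynamically adjustable part of the configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub step_threshold_seconds: f64,
}

/// What the client-handling code would do on receiving an update.
pub fn apply_update(shared: &Arc<RwLock<DynamicConfig>>, new_threshold: f64) {
    let mut guard = shared.write().expect("config lock poisoned");
    guard.step_threshold_seconds = new_threshold;
}

/// What the rest of the daemon would do when it needs the current value.
pub fn current_threshold(shared: &Arc<RwLock<DynamicConfig>>) -> f64 {
    shared.read().expect("config lock poisoned").step_threshold_seconds
}
```

A test can hold one clone of the `Arc`, run the update on another thread, and then check that the reader observes the new value.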

Write documentation on deployment concerns

Write additional documentation describing operational procedures and concerns that should be taken into account.

Will NTS (RFC8915) be within the scope of this project?

https://datatracker.ietf.org/doc/html/rfc8915

Write developer documentation

Write updated documentation describing the code structure and the main design decisions.

test socket.rs

composition of read/write is identity
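The property can be sketched with plain standard-library UDP sockets on the loopback interface. The project's socket.rs presumably wraps an async socket with timestamping, so this only illustrates the shape of the test; `udp_round_trip` is a hypothetical helper.

```rust
use std::io;
use std::net::UdpSocket;
use std::time::Duration;

/// Sends `payload` from one loopback UDP socket to another and
/// returns the bytes that arrived at the receiver.
pub fn udp_round_trip(payload: &[u8]) -> io::Result<Vec<u8>> {
    // Port 0 lets the operating system choose free ports.
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    // Fail instead of hanging forever if the datagram is lost.
    receiver.set_read_timeout(Some(Duration::from_secs(2)))?;

    sender.send_to(payload, receiver.local_addr()?)?;

    let mut buf = [0u8; 1024];
    let (len, _from) = receiver.recv_from(&mut buf)?;
    Ok(buf[..len].to_vec())
}
```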

Transfer mechanism system state to peer

Ensure that peers have access to (a copy of) needed system state values.

Add consul as a peer discovery method

Often used in cloud based environments, see https://www.consul.io/

Run the program in as unprivileged an environment as possible

This would probably involve setting up capabilities (CAP_SYS_TIME specifically for our case) and setuid mechanics (to switch away to a user with no meaningful permissions on the system).

Exceeding panic threshold does not cause proper termination

When the panic threshold is triggered, the ntp-daemon goes into an unresponsive state but does not properly terminate. It should just terminate abnormally.

Implement method for exporting metrics to prometheus

This is hopefully covered well by the socket mechanism needed for #119

test system.rs

reset behavior

Fix peer poll interval never decreasing

Due to the logic used to determine peer polling intervals, they can never decrease. As this is unwanted, figure out better logic that does allow the polling interval to decrease.

Better error handling for well known failure scenarios that require operator intervention

Some failure modes:

when the client can't start up because of permissions
when the client can't start up because of a configuration failure
when a sudden time jump is detected in the middle of normal operation

I think we may want to try and emit specific exit codes for these well known failure modes, so that they can be distinguished from other errors and panics. We also want to specifically make sure that we emit an error level log message before exiting the program to make sure that such a message pops up in a system that monitors the log messages.
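A sketch of what this could look like. The error names mirror the three failure modes listed above, but the specific exit code values are placeholders picked for illustration, not values the project has agreed on.

```rust
use std::fmt;

/// Well-known failures that need operator intervention.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FatalError {
    Permissions,
    Configuration,
    TimeJump,
}

impl FatalError {
    /// Distinct exit code per failure mode (placeholder values).
    pub fn exit_code(self) -> i32 {
        match self {
            FatalError::Permissions => 10,
            FatalError::Configuration => 11,
            FatalError::TimeJump => 12,
        }
    }
}

impl fmt::Display for FatalError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let text = match self {
            FatalError::Permissions => "insufficient permissions to start",
            FatalError::Configuration => "invalid configuration",
            FatalError::TimeJump => "sudden time jump detected",
        };
        f.write_str(text)
    }
}

/// Emits an error-level message first, then returns the code to exit with.
pub fn report(error: FatalError) -> i32 {
    eprintln!("ERROR: {}", error);
    error.exit_code()
}
```

A binary would typically end with `std::process::exit(report(error))`, so that the log line is written before the process terminates.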

Peer reset mechanism

Ensure we can reset peer measurement state after clock stepping. This includes canceling or ignoring the result of the current poll if it has already started. The peer should confirm the occurrence of the reset back to where it was initiated from.

NOTE: Polling state (how often we are allowed to poll and such) should be kept intact.
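The requirements above could be sketched as follows. All type and field names are hypothetical; the point is only that a reset clears measurement state, marks an in-flight poll as to be ignored, leaves the polling state alone, and hands back an acknowledgement.

```rust
/// Confirmation sent back to whoever initiated the reset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResetAck {
    pub reset_id: u64,
}

#[derive(Debug)]
pub struct PeerState {
    /// Polling state: survives a reset.
    pub poll_interval_log2: i8,
    /// Measurement state: cleared on reset.
    pub offsets: Vec<f64>,
    /// True while a request is outstanding.
    pub poll_in_flight: bool,
    /// Set when the outstanding response must be thrown away.
    pub ignore_in_flight: bool,
}

impl PeerState {
    /// Clears measurements and arranges for an in-flight poll to be ignored.
    pub fn reset(&mut self, reset_id: u64) -> ResetAck {
        self.offsets.clear();
        self.ignore_in_flight = self.poll_in_flight;
        ResetAck { reset_id }
    }

    /// Records a response unless it belongs to a poll from before the reset.
    pub fn handle_response(&mut self, offset_seconds: f64) {
        self.poll_in_flight = false;
        if self.ignore_in_flight {
            self.ignore_in_flight = false;
            return;
        }
        self.offsets.push(offset_seconds);
    }
}
```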

Client for observing state and dynamically changing configuration

We have two Unix sockets, by default:

/run/ntpd-rs/log-level
/run/ntpd-rs/config

The log-level socket is unprotected; the config socket needs additional permissions.

we can use https://docs.rs/tokio/latest/tokio/net/struct.UnixStream.html

for sending data over the socket, use https://docs.rs/postcard/latest/postcard/ ? (or send json as bytes?)

client --set-log-level=debug

client --step-if-bigger=1000 --step-first-updates=10

then we also need some observability features, some ideas

client peers list # lists all remotes we are connected to
client peers watch # show for each connected peer its `PeerStatus`
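To make the socket exchange concrete, here is a dependency-free sketch that uses the standard library's blocking `UnixStream` instead of tokio, and a plain newline-terminated text command instead of postcard or JSON. The `set-log-level` wire format and both function names are assumptions made purely for illustration.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;

/// Client side: writes one newline-terminated command to the socket.
pub fn send_log_level(stream: &mut UnixStream, level: &str) -> io::Result<()> {
    stream.write_all(format!("set-log-level {}\n", level).as_bytes())
}

/// Daemon side: reads one line and extracts the requested level,
/// or returns `None` if the line is not a set-log-level command.
pub fn receive_log_level(stream: UnixStream) -> io::Result<Option<String>> {
    let mut line = String::new();
    BufReader::new(stream).read_line(&mut line)?;
    Ok(line
        .trim_end()
        .strip_prefix("set-log-level ")
        .map(str::to_owned))
}
```

In a test, `UnixStream::pair()` gives two connected ends without touching the filesystem; the daemon itself would instead listen on a path such as `/run/ntpd-rs/log-level`.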

test clap argument parsing

give clap a vector of strings, and assert the right things are set

Implement kernel-level software timestamping

We need some way to make software timestamping work properly with tokio. We would prefer thin unsafe wrappers for the system calls, placed in separate crates, and few dependencies. This shouldn't be much code, and it is probably better to own it ourselves than to depend on something like nix; libc is acceptable in my view.

Peer process error recovery

Figure out which errors need extra work to recover from, and implement what is necessary for those.