BitTorrent Infrastructure Project In Rust
License: Apache License 2.0
Because of the implications of the `Timer` capacity, we want to make sure we prevent any panics surfaced to the user due to not enough timer slots being available per peer. We should have the user set the number of peers they want in the `PeerManager` at any one time, and configure the `Timer` capacity based on that.
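A minimal sketch of deriving the timer capacity from the user-supplied peer limit. `PeerManagerBuilder` and the two-slots-per-peer ratio are assumptions for illustration, not the crate's actual API:

```rust
// Hypothetical builder: derive the timer capacity from the user-supplied
// peer limit so we never run out of timer slots at runtime.
struct PeerManagerBuilder {
    peer_capacity: usize,
}

impl PeerManagerBuilder {
    fn with_peer_capacity(peer_capacity: usize) -> PeerManagerBuilder {
        PeerManagerBuilder { peer_capacity }
    }

    /// Assumed ratio: a couple of timer slots (e.g. keep-alive plus a
    /// request timeout) reserved per peer.
    fn timer_capacity(&self) -> usize {
        self.peer_capacity
            .checked_mul(2)
            .expect("timer capacity overflow")
    }
}

fn main() {
    let builder = PeerManagerBuilder::with_peer_capacity(100);
    println!("timer slots: {}", builder.timer_capacity());
}
```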
Both of these are nice-to-haves which should significantly improve performance for users on a fast network (where disk transfer speed is the bottleneck, due to our implementation).
Two components are required: `FileHandleCache` and `ReadWriteCache`. Each of these could implement `FileSystem` and take some inner object implementing `FileSystem`, so that calls are forwarded to the underlying `FileSystem` (like `NativeFileSystem`) when the cache doesn't have an entry. This allows us to easily layer different `FileSystem`s on top of one another.
Both of these will require us to support a new message, `IDiskMessage::SyncTorrent(InfoHash)`. This will also introduce a new method, `FileSystem::sync(Self::File)`, which will sync the contents of the file to disk. On `NativeFileSystem` this could be a no-op (though we should decide whether we should call `fsync` here...). For `ReadWriteCache`, this could flush any cached file contents to the underlying `FileSystem`, and for `FileHandleCache`, this could drop all of the file handles.
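The layering idea above can be sketched as follows. The trait shape (associated `File` type, a `sync` method, string paths) is an assumption based on this discussion, not the crate's actual `FileSystem` API:

```rust
use std::io;

// Each cache implements `FileSystem` and forwards to an inner
// `FileSystem` when it can't satisfy a call itself.
trait FileSystem {
    type File;

    fn open_file(&mut self, path: &str) -> io::Result<Self::File>;

    /// Sync the file to disk: no-op for a native fs, flush for a
    /// write cache, drop handles for a handle cache.
    fn sync(&mut self, file: &mut Self::File) -> io::Result<()>;
}

struct NativeFileSystem;

impl FileSystem for NativeFileSystem {
    type File = String; // stand-in for a real file handle

    fn open_file(&mut self, path: &str) -> io::Result<Self::File> {
        Ok(path.to_owned())
    }

    fn sync(&mut self, _file: &mut Self::File) -> io::Result<()> {
        Ok(()) // possibly fsync here...
    }
}

// A cache layer that wraps any inner `FileSystem`.
struct FileHandleCache<F> {
    inner: F,
}

impl<F: FileSystem> FileSystem for FileHandleCache<F> {
    type File = F::File;

    fn open_file(&mut self, path: &str) -> io::Result<Self::File> {
        // On a cache miss, forward to the underlying file system.
        self.inner.open_file(path)
    }

    fn sync(&mut self, file: &mut Self::File) -> io::Result<()> {
        // Drop any cached handles here, then forward.
        self.inner.sync(file)
    }
}

fn main() -> io::Result<()> {
    // Layers compose: FileHandleCache<ReadWriteCache<NativeFileSystem>>, etc.
    let mut fs = FileHandleCache { inner: NativeFileSystem };
    let mut file = fs.open_file("a.torrent")?;
    fs.sync(&mut file)
}
```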
A separate thing which may be a nice feature in the future would be to have different allocation methods for `AddTorrent`, implemented in terms of `FileSystem`:

* Allocating files up front (on `AddTorrent`)
* Writing through the `FileSystem`, then untangling the files later (possibly when `SyncTorrent` is sent to us)

Linux:
test benches::bench_native_fs_1_mb_pieces_128_kb_blocks ... bench: 2,122,092 ns/iter (+/- 137,163)
test benches::bench_native_fs_1_mb_pieces_16_kb_blocks ... bench: 5,743,897 ns/iter (+/- 1,613,292)
test benches::bench_native_fs_1_mb_pieces_2_kb_blocks ... bench: 26,742,021 ns/iter (+/- 11,962,446)
Windows (Antivirus Disabled):
test benches::bench_native_fs_1_mb_pieces_128_kb_blocks ... bench: 2,922,467 ns/iter (+/- 236,198)
test benches::bench_native_fs_1_mb_pieces_16_kb_blocks ... bench: 22,277,612 ns/iter (+/- 3,752,197)
test benches::bench_native_fs_1_mb_pieces_2_kb_blocks ... bench: 156,519,336 ns/iter (+/- 11,694,039)
Windows (Antivirus Enabled):
test benches::bench_native_fs_1_mb_pieces_128_kb_blocks ... bench: 39,069,596 ns/iter (+/- 8,133,512)
test benches::bench_native_fs_1_mb_pieces_16_kb_blocks ... bench: 270,543,506 ns/iter (+/- 23,768,410)
test benches::bench_native_fs_1_mb_pieces_2_kb_blocks ... bench: Too Damn Long!
Windows Localhost (via Deluge):
Setup:
3.9GB Torrent
2MB Piece Size
16KB Block Size
Results (Best Timings):
200 MB/s Download
? (Reported 60MB/s Disk Activity)
Rust strings must be valid UTF-8. Bencoding is a binary format where keys are primarily binary data which just usually happens to be ASCII or UTF-8.
In some cases they do not represent valid UTF-8 sequences, e.g. HTTP scrape responses contain binary infohashes as dictionary keys.
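One way around this is to key dictionaries by raw bytes rather than by `String`, so non-UTF-8 keys round-trip without loss. A small self-contained illustration:

```rust
use std::collections::BTreeMap;

// Bencode dictionary keys are raw bytes, so store them as `Vec<u8>`
// rather than `String`: binary infohash keys from scrape responses
// are then handled uniformly with the usual ASCII keys.
fn main() {
    let mut dict: BTreeMap<Vec<u8>, i64> = BTreeMap::new();

    // A key that is NOT valid UTF-8 (0xC0 is never a valid byte).
    let binary_key = vec![0xC0, 0xFF, 0xEE, 0x00];
    assert!(String::from_utf8(binary_key.clone()).is_err());

    // It still works fine as a map key.
    dict.insert(binary_key.clone(), 42);
    assert_eq!(dict.get(&binary_key), Some(&42));
}
```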
`MetainfoFile` currently provides methods for initializing itself from either some bytes or a file at a given `Path`. However, the `MetainfoBuilder` API doesn't provide corresponding methods for saving the file out to a given `Path`, only retrieving the output bytes. We should either add functionality for saving out to a `Path` to the `MetainfoBuilder`, or remove the ability to initialize a `MetainfoFile` from a `Path`.
`MetainfoBuilder` typically terminates on either the `build_from_file()` or `build_from_directory()` function calls. The user should not have to differentiate between the two methods, since they already know what their provided `Path` points at.
If clients are using the `MetainfoBuilder` in our library to build metainfo files from large files, we want to provide them a way of obtaining the percentage of the files that have been processed, which would allow them to show a loading bar of some sort to the users of their application.
This is easy enough to do: our master hasher is constantly sending out pieces for workers to process along with the associated piece index. All we have to do is divide that piece index by the total number of pieces and send that value, probably an `f64` between 0 and 1, to either a user-provided callback or a channel.
I like the callback idea because it would allow us to push large(ish) amounts of updates without taking up memory, as opposed to a user who either did not care about the current status or did not realize they were keeping their end of the channel open, with messages being queued and taking up memory. However, I am not sure I want to execute user-provided code in our master hasher loop due to slowdowns and/or panics.
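The channel variant can be sketched with `std::sync::mpsc`; the function name and shape here are illustrative, not the builder's real API:

```rust
use std::sync::mpsc;
use std::thread;

// The master hasher reports progress as an f64 in [0, 1] over a channel.
// Send errors are ignored: the user may have dropped their receiver.
fn hash_pieces(total_pieces: usize, progress: mpsc::Sender<f64>) {
    for piece_index in 0..total_pieces {
        // ... hash the piece here ...
        let fraction = (piece_index + 1) as f64 / total_pieces as f64;
        let _ = progress.send(fraction);
    }
}

fn main() {
    let (send, recv) = mpsc::channel();
    let worker = thread::spawn(move || hash_pieces(4, send));

    // The receiver iterator ends when the hasher drops its sender.
    for fraction in recv {
        println!("progress: {:.0}%", fraction * 100.0);
    }
    worker.join().unwrap();
}
```

A callback version would replace the `Sender` with an `Fn(f64)`, at the cost of running user code inside the hasher loop, which is exactly the concern raised above.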
Peer wire protocol headers include a 4-byte message id. For most purposes, this `u32` value needs to be used as a `usize` value. We should validate that the cast from `u32` to `usize` doesn't overflow, and if it does, we should terminate the connection and propagate an appropriate error; currently we just panic.
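The standard library's `TryFrom` expresses this check directly. A minimal sketch (the function name is illustrative):

```rust
use std::convert::TryFrom;

// Validate the u32 -> usize conversion instead of panicking. On 32/64-bit
// targets this always succeeds, but on a 16-bit-usize target the raw cast
// would silently truncate; here it becomes a recoverable error instead.
fn message_length(raw: u32) -> Result<usize, &'static str> {
    usize::try_from(raw)
        .map_err(|_| "message length overflows usize; terminating connection")
}

fn main() {
    assert_eq!(message_length(16), Ok(16));
}
```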
Many of our crates try to be agnostic over the execution mechanism for their futures. However, we currently hardcode against `Handle` to spin off asynchronous tasks.
It would be nice if our crates instead depended on some trait in `bip_util` that exposed a function to pass a future to, which would be executed when some event loop or other mechanism starts up.
Issue for tracking what is implemented and what is left to implement for the `bip_peer` module, which will include an API for programmatically queueing up torrent files for download given a `MetainfoFile` or `MagnetLink`.
The basic idea is that the `TorrentClient` communicates with the selection strategy thread over a two-way channel. From the client to the strategy thread, we can stop, start, pause, or remove torrents from the download queue. We can also provide configuration options to limit upload/download bandwidth, either client-wide or on a per-torrent basis. From the strategy thread to the client thread, we can provide notifications for when torrents are done or if any errors occurred.
The selection strategy thread is concerned with sending and receiving high level peer wire protocol messages, initiating peer chokes/unchokes, and deciding what piece to transmit or receive next and from what peer. Each peer is pinned to a channel which is connected to one of potentially many peer protocols, the strategy thread doesn't care what protocol. If a peer disconnects in the protocol layer, a message is sent to the strategy layer alerting it that the peer is no longer connected to us.
The peer protocol layer is concerned with reading messages off the wire and deserializing them into peer wire protocol message heads (variable length data is ignored at this point). Special regions of memory may be set aside for bitfield messages; I'm not sure if we should eat the cost of pre-allocating or allocate on demand (they are only sent once per peer, so on demand might not be bad).
The disk manager is what both layers use as an intermediary for sending and receiving pieces. If we determine in the selection strategy layer that we should send a piece to a peer, instead of loading that data in and sending it through the channel to the peer protocol layer, we will ask the disk manager to load in that data if it isn't already in memory. We will then receive a token for that request and send the token down to the peer protocol layer which will tell the disk manager to notify it when the piece has been loaded. It will then be able to access the memory for that piece. For receiving, the peer protocol layer will tell the disk manager to allocate memory for the incoming piece and get notified when it is ready. It will then be able to write the piece directly to that region of memory. I am not sure whether to do checksumming at this point or defer it to the selection strategy layer so that is TBD. After the write occurs, a message will be sent up to the selection strategy thread letting it know what piece it received from what peer.
This may change as I go about implementation. I want to make it easy to provide HTTP or SOCKS proxies in the future, so I may have to go one layer below the protocol layer for that. At the same time, I want to reduce the number of threads that a `TorrentClient` requires; currently, taking into account just TCP peers, it will take at least 8 threads (including 4 worker threads for the disk manager, but not including the thread running the user code that is calling into the `TorrentClient`).
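The token handoff between the strategy layer, disk manager, and protocol layer described above can be sketched roughly as follows. `LoadToken`, `DiskRequest`, and `DiskManager::submit` are illustrative names, not the crate's actual message types:

```rust
// The strategy layer asks the disk manager to load a block, receives a
// token, and hands only the token to the protocol layer, which waits on
// the disk manager's "loaded" notification instead of receiving the
// block data through a channel.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct LoadToken(u64);

enum DiskRequest {
    /// Load piece data into memory, notifying the token holder when ready.
    LoadBlock { piece_index: u32, token: LoadToken },
    /// Reserve memory for an incoming block so the protocol layer can
    /// write it directly, avoiding a copy through the channel.
    ReserveBlock { piece_index: u32, token: LoadToken },
}

struct DiskManager {
    next_token: u64,
}

impl DiskManager {
    fn new() -> DiskManager {
        DiskManager { next_token: 0 }
    }

    /// Issue a unique token along with the request to enqueue.
    fn submit(&mut self, piece_index: u32) -> (LoadToken, DiskRequest) {
        let token = LoadToken(self.next_token);
        self.next_token += 1;
        (token, DiskRequest::LoadBlock { piece_index, token })
    }
}

fn main() {
    let mut disk = DiskManager::new();
    let (token, _request) = disk.submit(7);
    // The strategy layer would now send `token` down to the protocol
    // layer; the block bytes themselves never cross that channel.
    println!("issued {:?}", token);
}
```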
Disk Manager:
Handshaker:
Peer Protocol Layer:
Selection Strategy Layer:
Torrent Client:
Playing around with the examples, it looks like for the handshaking and peer wire protocols we can't really take advantage of `tokio-service` or `tokio-proto`, as both of those are oriented towards request/response communication (let alone long-lived connections, perhaps?).
Right now, I am looking at `tokio-core` and `futures`. However, I don't feel like each of the components we are building should spin up its own core and communicate with the others. Instead, it would be ideal if `bip_handshake` and `bip_peer` depended solely on `futures` and exported their own futures that could be connected in some way (peer handshake -> peer connect) to be run in one `Core` by the end application, ideally as frictionless as possible.
Will update with more information.
Currently any peer would be able to drain the client machine of memory by sending a message with a large payload (this only affects variable length message fields).
We should add a max message length to `PeerProtocolCodec` so that when the codec checks the number of bytes the message will use, it can kill the connection and propagate an appropriate error (one that clients can identify, so they can filter the peer in the `Handshaker`).
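The check itself is small. A std-only sketch of the guard, outside any codec trait; `MAX_MESSAGE_LEN` is an assumed value, not one taken from the crate:

```rust
// Read the 4-byte big-endian length prefix and reject anything over a
// configured maximum *before* buffering or allocating for the payload.
const MAX_MESSAGE_LEN: u32 = 2 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum DecodeError {
    /// Not enough bytes buffered yet; try again later.
    NeedMoreBytes,
    /// Clients can match on this variant to drop and filter the peer.
    MessageTooLong(u32),
}

fn peek_message_len(buf: &[u8]) -> Result<u32, DecodeError> {
    if buf.len() < 4 {
        return Err(DecodeError::NeedMoreBytes);
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]);
    if len > MAX_MESSAGE_LEN {
        // Kill the connection instead of allocating `len` bytes.
        Err(DecodeError::MessageTooLong(len))
    } else {
        Ok(len)
    }
}

fn main() {
    assert_eq!(peek_message_len(&[0, 0, 0, 13]), Ok(13));
    assert_eq!(
        peek_message_len(&[0xFF, 0xFF, 0xFF, 0xFF]),
        Err(DecodeError::MessageTooLong(0xFFFF_FFFF))
    );
}
```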
The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback, and has protections from
patent trolls and an explicit contribution licensing clause. However, the
Apache license is incompatible with GPLv2. This is why Rust is dual-licensed as
MIT/Apache (the "primary" license being Apache, MIT only for GPLv2 compat), and
doing so would be wise for this project. This also makes this crate suitable
for inclusion in the Rust standard distribution and other projects using dual
MIT/Apache.
To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright) and then add the following to
your README:
## License
Licensed under either of
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you shall be dual licensed as above, without any
additional terms or conditions.
and in your license headers, use the following boilerplate (based on that used in Rust):
// Copyright (c) 2015 t developers
// Licensed under the Apache License, Version 2.0
// <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT
// license <LICENSE-MIT or http://opensource.org/licenses/MIT>,
// at your option. All files in the project carrying such
// notice may not be copied, modified, or distributed except
// according to those terms.
And don't forget to update the `license` metadata in your `Cargo.toml`!
Using milestones and issues can help people understand the project's direction.
Issues can be used as tasks-to-complete and can be assigned to developers.
We can follow the CoreUtils way for example.
This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to
#rust-offtopic
on IRC to discuss.
You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.
TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.
The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.
Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with
it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.
To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright) and then add the following to
your README:
## License
Licensed under either of
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.
and in your license headers, use the following boilerplate (based on that used in Rust):
// Copyright (c) 2016 redox-rs developers
//
// Licensed under the Apache License, Version 2.0
// <LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0> or the MIT
// license <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. All files in the project carrying such notice may not be copied,
// modified, or distributed except according to those terms.
Be sure to add the relevant `LICENSE-{MIT,APACHE}` files. You can copy these from the Rust repo for a plain-text version.
And don't forget to update the `license` metadata in your `Cargo.toml` to:
license = "MIT/Apache-2.0"
I'll be going through projects which agree to be relicensed and have approval
by the necessary contributors and doing this changes, so feel free to leave
the heavy lifting to me!
To agree to relicensing, comment with :
I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option
Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.
Currently failed handshakes will take up room in the buffer we allocate for connection tokens. We need to wait for `Slab` to support something like `insert_with_opt`, because we want our connection to keep track of the `Token` it is associated with so it can set its own timeouts; however, the creation of the connection may fail, and in that case the `Token` would go unused.
Doing this will let us support a continuous stream of handshakes without filling up with stale handshakes.
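The reserve-then-fill pattern can be sketched in std-only Rust; `TokenSlab` below is a hand-rolled stand-in for the real `Slab`, with `insert_with_opt` semantics as described above:

```rust
// Hand out the token *before* constructing the connection, and leave the
// slot empty if construction fails, so failed handshakes don't leak slots.
struct TokenSlab<T> {
    slots: Vec<Option<T>>,
}

impl<T> TokenSlab<T> {
    fn new(capacity: usize) -> TokenSlab<T> {
        let mut slots = Vec::with_capacity(capacity);
        for _ in 0..capacity {
            slots.push(None);
        }
        TokenSlab { slots }
    }

    /// Insert with a closure that already knows its own token. If the
    /// closure returns `None`, the slot stays free (`insert_with_opt`).
    fn insert_with_opt<F>(&mut self, make: F) -> Option<usize>
    where
        F: FnOnce(usize) -> Option<T>,
    {
        let token = self.slots.iter().position(|slot| slot.is_none())?;
        let value = make(token)?; // construction failed -> token unused
        self.slots[token] = Some(value);
        Some(token)
    }
}

fn main() {
    let mut slab: TokenSlab<&'static str> = TokenSlab::new(2);

    // Failed handshake: no slot is consumed.
    assert_eq!(slab.insert_with_opt(|_token| None), None);

    // Successful handshake: the connection learns its token at creation
    // time, so it can register its own timeouts against it.
    let token = slab.insert_with_opt(|_token| Some("connection")).unwrap();
    assert_eq!(token, 0);
}
```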
Document how we expose the `name` field in torrent files as just a single `File` with an `Option::None` directory. This may confuse people using the library who are already familiar with the internals of torrent files.
Our `HandshakeFilter` could also filter on whether we initiated the handshake or a remote peer initiated it.
This would be pretty trivial to implement (create some enum, `HandshakeInitiator`, with two variants, pass those in when we initiate/complete a handshake, and test our filters).
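The enum and a filter check are a few lines; the variant names and the filter function are illustrative, following the suggestion above:

```rust
// Tag each handshake with who initiated it so a filter can discriminate.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum HandshakeInitiator {
    /// We opened the connection.
    Local,
    /// A remote peer connected to us.
    Remote,
}

// Hypothetical filter: only accept handshakes we initiated ourselves.
fn passes_filter(initiator: HandshakeInitiator) -> bool {
    initiator == HandshakeInitiator::Local
}

fn main() {
    assert!(passes_filter(HandshakeInitiator::Local));
    assert!(!passes_filter(HandshakeInitiator::Remote));
}
```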
Currently we stash errors like payload checks, protocol errors, block size checks, etc. in `io::Error`. However, `tokio_io::codec::Decoder` and `tokio_io::codec::Encoder` would allow us to specify some wrapper error.
Currently we store informative strings in the custom error type, so it's not too difficult to track down why we severed a peer connection. But if we want clients to be able to easily check why we severed the connection, and ban peers based on that, we need to export some enum of all possible error types they may want to ban on (as well as a catch-all `io::Error` variant).
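A sketch of what that wrapper error could look like; the variant names are illustrative, not the crate's actual error type. The `From<io::Error>` impl matters because the codec traits require their error type to be convertible from `io::Error`:

```rust
use std::fmt;
use std::io;

// Bannable variants clients can match on, plus a catch-all io variant.
#[derive(Debug)]
enum PeerProtocolError {
    InvalidPayloadLength { expected: usize, found: usize },
    BlockSizeTooLarge(usize),
    ProtocolViolation(&'static str),
    Io(io::Error),
}

impl fmt::Display for PeerProtocolError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            PeerProtocolError::InvalidPayloadLength { expected, found } => {
                write!(f, "invalid payload length: expected {}, found {}", expected, found)
            }
            PeerProtocolError::BlockSizeTooLarge(size) => {
                write!(f, "block size too large: {}", size)
            }
            PeerProtocolError::ProtocolViolation(msg) => {
                write!(f, "protocol violation: {}", msg)
            }
            PeerProtocolError::Io(ref err) => write!(f, "io error: {}", err),
        }
    }
}

impl From<io::Error> for PeerProtocolError {
    fn from(err: io::Error) -> PeerProtocolError {
        PeerProtocolError::Io(err)
    }
}

fn main() {
    // A client deciding whether to ban, without parsing error strings:
    let err = PeerProtocolError::BlockSizeTooLarge(1 << 30);
    let should_ban = match err {
        PeerProtocolError::Io(_) => false, // transient, don't ban
        _ => true,                         // protocol misbehavior, ban
    };
    assert!(should_ban);
    println!("{}", PeerProtocolError::ProtocolViolation("bad message id"));
}
```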
Currently, we forward both initiated and completed handshakes to a central future that executes a handshake (with timeout), which then gets forwarded on to our `Stream` to buffer completed handshakes until the user pulls them out.
Even though we have timeouts on handshakes, it would be nice to be able to do n handshakes in parallel. For that, I believe we would need some sort of futures-compatible MPMC implementation.
Some mechanism for persisting DHT nodes to disk is necessary to make our DHT truly decentralized, rather than relying on bootstrap nodes as is currently the case. This would also give us a faster start-up time, depending on how we implement it for the DHT.
A naive implementation would be fairly easy; however, we should decide whether this is something we provide to the client, or whether we just provide a `RoutingTable` dump to the client and have them persist our nodes.
I prefer the former, as it would let us be in charge of the encoding of the peer information, and it would simplify what clients have to do to load those peers back in when starting the DHT up again, since we know how the node information was stored.
Currently any crate using `mio` relies on the default values set for `EventLoop`, which could lead to inconsistent behavior for our APIs when the capacities set by our crates are larger than the default capacities set by the `EventLoop`. Therefore, we should be using `EventLoopConfig` for all such crates.
We are probably doing something wrong with the Sink interface here.
Currently we are boxing iterators for objects implementing `TorrentView`. This allows us to create different `Metainfo` parsers while exposing the fields from all parsers regardless of how those fields are stored at any moment. This does incur some overhead, especially since we are creating a new `Box` for every iteration of `FilePath`. Whether we want this flexibility with the incurred overhead is debatable.
Currently most of our iterators that have two levels of boxing are consumed when the second-level boxed iterator is requested. We could allow iterators to be cloned, letting users request the iterator willy-nilly, but I don't want to gloss over the fact that this is expensive, so I would like to reflect that fact within the API.
Most of this may not be a problem when/if generic return types are implemented, but I am leaving this open for discussion.
Hi, many thanks for starting this project!
I've started looking at how to use `BTHandshaker::new()`, read up on mio, and wondered why I don't need to pass in my `EventLoop`. Instead, `handshaker.stream(info_hash)` returns a channel reader that I'm supposed to block on in my per-torrent thread? Especially when handling multiple torrents, it'd be great to continue to use mio in the calling code.
Which of the two paradigms do you intend for htracker and the wire protocol, async or threaded?
Sorry, this may not be the appropriate place to ask this, but how do I even connect to a swarm?
Consider several scenarios:
It's been said that there is DHT support, but I've been studying the code for hours and could not locate the place to connect to the DHT.
Currently we decode from a `&[u8]`, but since our actual protocol codec uses `BytesMut`, we should probably decode from that, as some messages that contain arbitrary byte payloads, like the bitfield and piece messages, are currently allocating again.
If we wanted zero copy end to end, we should make it so that the corresponding messages instead store the `BytesMut` slices, and that we can then pass these through to the `DiskManager` to store the bytes (when applicable). When we get those bytes sent back to us via `ODiskManagerMessage`, we can then drop them immediately, and the network side of things should be able to reuse that region of memory again.
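The idea can be illustrated std-only with `Arc` as a stand-in for `Bytes` (with the real `bytes` crate, `split_to(..).freeze()` yields a shared-buffer slice without any hand-rolled type):

```rust
use std::ops::Range;
use std::sync::Arc;

// A message keeps a cheap reference into the shared receive buffer
// instead of copying the payload out.
#[derive(Clone)]
struct SharedSlice {
    buffer: Arc<Vec<u8>>,
    range: Range<usize>,
}

impl SharedSlice {
    fn as_slice(&self) -> &[u8] {
        &self.buffer[self.range.clone()]
    }
}

// A piece message stores the slice, so passing it to the disk manager
// moves a pointer and a range rather than the block data itself.
struct PieceMessage {
    piece_index: u32,
    block: SharedSlice,
}

fn main() {
    // <len=9><id=7><index=5><block bytes 1,2,3,4> (shape illustrative)
    let receive_buffer = Arc::new(vec![0u8, 0, 0, 9, 7, 0, 0, 0, 5, 1, 2, 3, 4]);
    let msg = PieceMessage {
        piece_index: 5,
        block: SharedSlice { buffer: receive_buffer.clone(), range: 9..13 },
    };

    assert_eq!(msg.piece_index, 5);
    assert_eq!(msg.block.as_slice(), &[1, 2, 3, 4]);
    // When the disk manager drops `msg`, the reference count falls and
    // the network side can reuse the buffer region.
    assert_eq!(Arc::strong_count(&receive_buffer), 2);
}
```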
Right now it is dependent on the order in which `walkdir` gives them to us. However, for some torrents you may want bigger file(s) in the front (streaming), files ordered to match piece boundaries as closely as possible (selective file downloading), alphanumeric file name ordering, or perhaps a custom user-defined ordering.
We should be able to support all of these use cases.
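A pluggable ordering could be a small enum applied to the walked entries before building the metainfo. `FileEntry` here is an illustrative stand-in for whatever `walkdir` plus the builder actually produce:

```rust
#[derive(Debug, PartialEq)]
struct FileEntry {
    path: String,
    len: u64,
}

enum FileOrdering {
    /// Biggest files first (streaming).
    LargestFirst,
    /// Alphanumeric by path.
    Alphanumeric,
}

// Sort the walked entries before the builder lays out pieces.
fn order_files(mut files: Vec<FileEntry>, ordering: FileOrdering) -> Vec<FileEntry> {
    match ordering {
        FileOrdering::LargestFirst => files.sort_by(|a, b| b.len.cmp(&a.len)),
        FileOrdering::Alphanumeric => files.sort_by(|a, b| a.path.cmp(&b.path)),
    }
    files
}

fn main() {
    let files = vec![
        FileEntry { path: "b.mkv".into(), len: 700 },
        FileEntry { path: "a.nfo".into(), len: 1 },
    ];
    let ordered = order_files(files, FileOrdering::LargestFirst);
    assert_eq!(ordered[0].path, "b.mkv");
}
```

A custom user-defined ordering would be a third variant holding a boxed comparator; piece-boundary matching needs the piece length as an extra input, so it likely belongs on the builder itself.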
Currently our lazy bencode parser uses recursion to decode and encode data. This makes it trivial for anyone to crash an application using `bip_bencode` where the data is coming off the network. With a maximum stack of, say, 80 stack frames, they could crash our services using a minimum of 160 bytes (e.g. just nest a bunch of lists, `l l l l l l ... e e e e e e`).
The obvious solution is to implement an iterative decoder/encoder. However, since we are already introducing complexity into the `bip_bencode` module by doing so, we could also reach for yet another performance boost in our implementation.
What we used to have implemented was a level 1 bencode parser, where all dictionary keys and byte arrays were allocated on the heap. We then moved to a level 2 bencode parser, where those two structures were just references, but we still allocated bencode lists and dictionaries on the heap. libtorrent has a nice blog post where they go over the implementation of a level 3 bencode parser that has both a borrowed list of bytes, as well as a (heap-allocated) list of tokens pointing into the list of bytes.
The benefit that a level 3 bencode parser brings is great token locality as well as amortized heap allocations, since we are using a single (fairly small) heap-allocated structure instead of many small ones. We can also see that encoding is as easy as returning the already-made list of bytes (essentially a no-op). The downside is we have to copy data used in our macros to that list of bytes and make corresponding tokens for them. However, we may be able to pre-compute the needed pre-allocation capacity for macro-related bencode construction (depending on what the macros allow). Additionally, dictionary searching goes down to (at best) log n retrieval speed instead of constant, depending on whether we want to store key offsets, and inserting ANYTHING into an already-made structure will push (potentially many) bytes back, which could get expensive depending on the usage patterns. For this reason, we may want to keep the current implementation and just add the new one alongside it, as the current implementation is great at cheaply inserting data into an already-made bencode structure.
Will benchmark against the current implementation to see what kind of performance boost we get, to decide if it is worth it or not.
Edit: The vulnerability only requires the list/dictionary to be started, which means `l l l l l ...` would suffice, meaning a maximum of 80 stack frames requires a minimum of 80 bytes.
The `BTHandshaker` implementation which is part of the `dht_support` branch is structured to provide a common object for performing handshakes with connections from sources such as trackers, DHTs, local peer discovery, or other discovery mechanisms.
At the moment, some initial benchmarking has been done, and under full load the handshaker blows up and causes system instability. Initially it looked like it was creating too many system handles, so a `ThreadPool` has been used to cap the number of thread handles generated. This improved stability greatly, but it looks like it is also possible to create too many `TcpStream`s on both connection initiation and completion via `TcpListener`.
At this point, the `ThreadPool` should act as a throttle for connection initiation, but the `TcpListener` thread may be causing too many handles to be created. We should investigate whether a `SyncSender` can solve this, and where the sweet spot is in terms of performance for capping the number of completion worker messages.
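`SyncSender` gives exactly the throttle effect in question: the channel is bounded, so once the buffer is full further sends block (or fail with `try_send`), and the listener thread cannot run ahead of the completion workers. A small demonstration; the capacity value is an assumption to be tuned:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Assumed cap on in-flight completion messages; the "sweet spot" from
// the discussion above would be found by benchmarking.
const CAP: usize = 2;

fn main() {
    let (send, recv) = sync_channel::<&'static str>(CAP);

    send.send("completion 1").unwrap();
    send.send("completion 2").unwrap();

    // The buffer is full: a non-blocking send is rejected, so no new
    // TcpStream handle is accepted until a worker drains a message.
    match send.try_send("completion 3") {
        Err(TrySendError::Full(_)) => println!("throttled"),
        other => panic!("expected Full, got {:?}", other),
    }

    assert_eq!(recv.recv().unwrap(), "completion 1");
}
```

A blocking `send` on the listener thread would instead apply backpressure directly, pausing `accept` until capacity frees up.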
To support BEP 9 (http://www.bittorrent.org/beps/bep_0009.html) in `bip_peer`, users need to be able to parse, as well as build (serialize), `InfoDictionary`s directly.
We should see whether we want to make this a configuration option (or separate method) on `MetainfoBuilder`, or a separate builder. Subsequently, we should offer users a method similar to `MetainfoFile::from_bytes`, but for an `InfoDictionary`.
Since we now have macros that allow us to easily create `Bencode` objects in code, we could look into both the memory and performance benefits of switching `Bencode::Dict` `String` keys and `Bencode::Bytes` `Vec<u8>` objects over to `Cow<'a, T>` objects.
As long as this doesn't cause regressions in performance for our current use case (parsing large amounts of bencoded data from a file), it can be implemented on the current `Bencode` object. Otherwise, we can always move it to a new object implementing `BencodeView` and adjust the current macros accordingly.
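The payoff of `Cow` in this setting, sketched with std only: keys borrowed from the input buffer stay borrowed (zero-copy), while keys constructed in code via macros become owned. The function here is illustrative, not the crate's API:

```rust
use std::borrow::Cow;

// Parsed keys borrow from the input; macro-built keys own their data.
fn dict_key<'a>(from_input: Option<&'a str>) -> Cow<'a, str> {
    match from_input {
        Some(key) => Cow::Borrowed(key),          // parsed: no allocation
        None => Cow::Owned(String::from("name")), // constructed in code
    }
}

fn main() {
    let input = "announce";
    let borrowed = dict_key(Some(input));
    let owned = dict_key(None);

    assert!(matches!(borrowed, Cow::Borrowed(_)));
    assert!(matches!(owned, Cow::Owned(_)));
    // Both variants compare and display like plain strings.
    assert_eq!(&*borrowed, "announce");
    assert_eq!(&*owned, "name");
}
```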
There is a large performance drop between these two commits. The code in question is run on a sample of 20,000 random torrents. It only reads the torrent files and then executes `bip_bencode::Bencode::decode(&bytes).unwrap()`. An example torrent file is attached.
The first commit, 202e3a7, specified in `Cargo.toml` as
`bip_bencode = { git = "https://github.com/GGist/bip-rs", rev = "202e3a7" }`
has a runtime of 300ms.
Just changing the revision to 4b08461 results in a runtime of 530ms.
This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to
#rust-offtopic
on IRC to discuss.
You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.
TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.
The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.
Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with
it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.
To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright, due to not being a "creative
work", e.g. a typo fix) and then add the following to your README:
## License
Licensed under either of
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.
and in your license headers, if you have them, use the following boilerplate
(based on that used in Rust):
// Copyright 2016 bittorrent-rs developers
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.
It's commonly asked whether license headers are required. I'm not comfortable
making an official recommendation either way, but the Apache license
recommends it in their appendix on how to use the license.
Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these from the Rust repo for a plain-text version.

And don't forget to update the license metadata in your Cargo.toml to:
license = "MIT/Apache-2.0"
I'll be going through projects which agree to be relicensed and have approval
by the necessary contributors and doing these changes, so feel free to leave
the heavy lifting to me!
To agree to relicensing, comment with:
I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.
Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.
While testing the `TrackerClient` in `bip_utracker`, it looks like many requests sent to a single tracker in a short amount of time trigger something akin to request throttling. Judging from the packets our client was emitting, the throttling applied only to connection id requests.

We should ideally cache connection ids so that if multiple requests are sent to the same tracker (ip, port) in a short amount of time, we avoid being throttled (or at least, throttled so easily). We could store the ids in a map and, when creating a new request, populate the connection id for the `ConnectTimer` if there is one in the map.
We also need to account for two things: build-up of unused connection ids in long-running clients (a memory leak), and renewal of connection ids.
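A sketch of such a cache, assuming (per BEP 15) that a connection id may be reused for up to one minute after it is received; the type names and layout here are hypothetical, not the current `bip_utracker` API:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Hypothetical cache mapping a tracker's (ip, port) to its connection id.
/// Entries expire after 60 seconds, which both respects the BEP 15 reuse
/// window and bounds growth for long-running clients.
struct ConnectionIdCache {
    ids: HashMap<SocketAddr, (u64, Instant)>,
    ttl: Duration,
}

impl ConnectionIdCache {
    fn new() -> ConnectionIdCache {
        ConnectionIdCache {
            ids: HashMap::new(),
            ttl: Duration::from_secs(60),
        }
    }

    /// Store an id freshly received from a connect response.
    fn insert(&mut self, tracker: SocketAddr, id: u64) {
        self.ids.insert(tracker, (id, Instant::now()));
    }

    /// Fetch a still-valid id, evicting it if it has expired.
    fn get(&mut self, tracker: &SocketAddr) -> Option<u64> {
        let entry = self.ids.get(tracker).copied();
        match entry {
            Some((id, received)) if received.elapsed() < self.ttl => Some(id),
            Some(_) => {
                self.ids.remove(tracker);
                None
            }
            None => None,
        }
    }

    /// Drop every expired entry; running this periodically (or bounding the
    /// map with an LRU policy) prevents unused ids from accumulating.
    fn purge_expired(&mut self) {
        let ttl = self.ttl;
        self.ids.retain(|_, &mut (_, received)| received.elapsed() < ttl);
    }
}

fn main() {
    let tracker: SocketAddr = "127.0.0.1:6969".parse().unwrap();
    let mut cache = ConnectionIdCache::new();
    assert_eq!(cache.get(&tracker), None); // miss: must send a connect request
    cache.insert(tracker, 0x1234_5678);
    assert_eq!(cache.get(&tracker), Some(0x1234_5678)); // hit: skip the connect
}
```

On a cache miss the client would fall back to the normal connect request and insert the id it receives, so the cache stays transparent to the request path.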
During lookups, we use both alpha to specify how many requests we want to send in parallel on the initial lookup, and beta which is how many requests we want to send in parallel on responses with nodes whose ids are closer to our target id.
Currently, these parallel lookups could have overlap in the nodes that they request from. In practice, I have seen ~15 parallel lookups that all converge on one node. So it seems like all of the parallel lookups found the closest node they could and all ended up requesting from it. Doing stuff like this, coupled with clients potentially executing many searches in a short period of time, could get our client's node banned by the nodes that our lookups converge on.
Instead, we should filter out nodes we have already requested from when deciding whether a response is worth iterating on. This may reduce the amount of contact information we receive, although I wouldn't expect it to be by a large margin, since the missing contact information would come from nodes that are not as close as the ones we have already found.
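A minimal sketch of that filter, assuming 20-byte node ids; a set of already-requested nodes doubles as the dedup check, so parallel branches of a lookup never converge on and hammer the same node:

```rust
use std::collections::HashSet;

/// Stand-in for a DHT node id (20 bytes, as in mainline DHT).
type NodeId = [u8; 20];

/// Hypothetical per-lookup filter: remembers every node this lookup has
/// already requested from.
struct LookupFilter {
    requested: HashSet<NodeId>,
}

impl LookupFilter {
    fn new() -> LookupFilter {
        LookupFilter { requested: HashSet::new() }
    }

    /// Returns true the first time a node is seen; afterwards the node is
    /// filtered out and should not be contacted again by any branch.
    fn should_request(&mut self, node: NodeId) -> bool {
        // HashSet::insert returns false when the value was already present.
        self.requested.insert(node)
    }
}

fn main() {
    let mut filter = LookupFilter::new();
    let node = [0xAB; 20];
    assert!(filter.should_request(node));  // first branch contacts it
    assert!(!filter.should_request(node)); // later branches skip it
}
```

The filter would be consulted when a response's nodes are ranked against the target id, before scheduling the next round of parallel requests.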
Because the calling code needs to set up and invoke both anyway, it could also:
That way a utracker client wouldn't depend on the handshaker.
Do you intend to start a project that combines the various crates into a functioning application?
Just stumbled across this project and noticed that it uses mio. Any thoughts about switching to Tokio and Futures?
Will have to double check that we aren't breaking spec, but in general, having to deal with the current types has been painful in downstream libraries (like `bip_disk`).

- `File::length()` should return a `u64`
- `InfoDictionary::piece_length()` should return a `u64`

On a non-spec-breaking note:

- `File::paths()` should return a `&Path` (so we should allocate a `PathBuf` internally)

We should update the return value to `Option<&Path>` instead of `Option<&str>`.
Currently experiencing sporadic performance with the `MetainfoBuilder`. Initially I suspected it was due to the SHA-1 library that `bip_util` uses. However, after switching that out, while the performance is a lot better (the old library may have been allocating while hashing), there are still cases where the piece hasher workers experience major slowdowns.

Profiling has shown that the slowdown occurs when computing the SHA-1 value. However, I suspect it has something to do with the OS paging to disk for the memory-mapped bytes while computing the SHA-1 values. This is because when the slowdown occurs, the processors go from being maxed out to hovering around 20%, which is around what they would idle at outside of the benchmark.

The slowdown could be related to many factors, as I have not been able to pin down what makes it occur exactly. It could be related to piece length (lengths could be slightly larger than the page size), workers getting out of sync (unlucky OS scheduling causes different threads to be on pieces VERY far away from one another, causing more disk paging), or something I have not thought of.

In the meantime, I will be trying out other approaches: namely, switching out mmap for reading files directly into pre-allocated buffers, sending those buffers to the workers to hash directly, and then re-using the buffers to read in more data.
Currently `BTHandshaker` exports `mio::tcp::TcpStream` because that is the `TcpStream` implementation we use internally to make the handshake. We should instead convert completed-handshake `TcpStream`s into `std::net::TcpStream`, which would involve going through `net2` to re-bind the socket as a blocking socket.

When `mio` adds support for accessing the `SOCKET` handle on Windows, we can move the conversion code into `bip_util` and have implementations for both the Windows and unix conversions.
BEP 3 specifies that the infohash must be calculated from the raw representation of the dictionary as found in the source, not from round-tripping through the decoder and encoder. This requires the ability to find the offset and length of a particular value in the raw source buffer.
Similarly, BEP 44 requires extraction of the raw data for the `v` key.
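A hypothetical sketch of the offset tracking this requires: a scanner that walks the bencoded bytes and reports the raw byte range of a top-level dictionary value (e.g. `info`), which can then be hashed exactly as it appears in the source, with no decode/re-encode round trip:

```rust
/// Return the position just past the bencoded value starting at `pos`.
fn skip_value(buf: &[u8], pos: usize) -> Option<usize> {
    match *buf.get(pos)? {
        // Integer: i<digits>e
        b'i' => Some(buf[pos..].iter().position(|&b| b == b'e')? + pos + 1),
        // List or dict: l...e / d...e, skip contained values recursively.
        b'l' | b'd' => {
            let mut p = pos + 1;
            while *buf.get(p)? != b'e' {
                p = skip_value(buf, p)?;
            }
            Some(p + 1)
        }
        // String: <len>:<bytes>
        b'0'..=b'9' => {
            let colon = buf[pos..].iter().position(|&b| b == b':')? + pos;
            let len: usize = std::str::from_utf8(&buf[pos..colon]).ok()?.parse().ok()?;
            Some(colon + 1 + len)
        }
        _ => None,
    }
}

/// (start, end) byte range of the value for `key` in a top-level dictionary.
fn raw_value_range(buf: &[u8], key: &[u8]) -> Option<(usize, usize)> {
    if *buf.first()? != b'd' {
        return None;
    }
    let mut p = 1;
    while *buf.get(p)? != b'e' {
        let key_end = skip_value(buf, p)?; // keys are bencoded strings
        let colon = buf[p..key_end].iter().position(|&b| b == b':')? + p;
        let this_key = &buf[colon + 1..key_end];
        let val_end = skip_value(buf, key_end)?;
        if this_key == key {
            return Some((key_end, val_end));
        }
        p = val_end;
    }
    None
}

fn main() {
    let src = b"d8:announce3:url4:infod6:lengthi5e4:name1:aee";
    let (start, end) = raw_value_range(src, b"info").unwrap();
    // The raw info dictionary bytes, ready to be SHA-1 hashed for the infohash.
    assert_eq!(&src[start..end], b"d6:lengthi5e4:name1:ae");
}
```

A production decoder would more likely record these offsets while parsing rather than re-scan, but the range-over-the-source-buffer output is the piece BEP 3 and BEP 44 both need.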
Right now, we have `Extensions` in `bip_handshake` as well as built-in extension messages in `bip_peer`.

However, we should allow users to define their own extensions as part of the handshaking process, and subsequently send those messages using our `PeerManager` in `bip_peer` when the unioned `Extensions` struct indicates the peer supports such a message.
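On the handshake side, support checks reduce to bit tests on the 8 reserved bytes; for example, BEP 10's extension protocol is signalled by bit `0x10` of reserved byte 5, and a message may only be sent when both sides advertise the bit. A sketch with stand-in types (the real `Extensions` struct wraps this differently):

```rust
/// Reserved bytes exchanged during the BitTorrent handshake (8 bytes).
type Reserved = [u8; 8];

/// BEP 10 extension protocol support lives in bit 0x10 of reserved byte 5.
/// Combining both sides' bits (a bitwise AND of capabilities) tells us
/// whether an extension message may be sent to this peer.
fn both_support_extension_protocol(ours: Reserved, theirs: Reserved) -> bool {
    (ours[5] & theirs[5] & 0x10) != 0
}

fn main() {
    let mut ours = [0u8; 8];
    ours[5] |= 0x10;
    let mut theirs = [0u8; 8];
    theirs[5] |= 0x10;

    assert!(both_support_extension_protocol(ours, theirs));
    assert!(!both_support_extension_protocol(ours, [0u8; 8])); // peer lacks it
}
```

User-defined extensions would register which reserved bit they claim during handshaking, and `PeerManager` would gate outgoing custom messages on the same kind of combined-bits check.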
The `bip` ecosystem is meant to contain a modular set of crates that expose functionality and services to clients wishing to leverage BitTorrent infrastructure in their applications. We want to provide flexibility so that a client can painlessly integrate either a single crate or many crates from our ecosystem into their application. In the case of a single crate, that crate should provide a usable interface to clients; in the case of many crates, those crates should provide a unified interface to clients.

Because of this, we cannot afford to export a per-crate asynchronous or synchronous interface for clients to use, as that would force a specific architecture on our clients for the purposes of tying our API into their application.
Provide a generic interface that clients can use to accept callbacks from every peer discovery service that our ecosystem offers. This callback interface should accept at the bare minimum:
My proposal is to modify the current `Handshaker` to adhere to the following interface:
```rust
/// Handshaker for peer discovery services which may or may not contain request metadata.
trait Handshaker: Send {
    /// Type that the metadata will be passed back to the client as.
    type Envelope;

    /// PeerId exposed to peer discovery services.
    fn id(&self) -> PeerId;

    /// Port exposed to peer discovery services.
    fn port(&self) -> u16;

    /// Connect to the given address with the InfoHash, expecting the given PeerId.
    fn connect(&mut self, expected: Option<PeerId>, hash: InfoHash, addr: SocketAddr);

    /// Sets a new filter to filter requests based on an InfoHash and SocketAddr.
    fn filter(&mut self, filter: Box<Fn(InfoHash, SocketAddr) -> bool + Send>);

    /// Send the given metadata back to the client.
    fn metadata(&mut self, data: Self::Envelope);
}
```
`Handshaker` implementations would typically accept some channel that an `Envelope` can be sent over, and make sure that the result they yield in response to a `connect` is convertible to an `Envelope`. Similarly, when a client goes to use the `Handshaker` in a peer discovery service, the metadata returned from that service must also be convertible to an `Envelope`.
As an example, let's see how we would integrate a `BTHandshaker`, as well as one or more peer discovery services, with a `mio` event loop. A concrete `BTHandshaker` would accept a `mio` channel that can send `Envelope` types. The `BTHandshaker` impl would assert that the user has provided a `From` implementation for creating an `Envelope` from a `TcpStream`. The client then goes over to a `TrackerClient` and tries to create one using our `BTHandshaker` as the generic `Handshaker`. The `TrackerClient` impl would assert that the user has provided a `From` implementation for creating an `Envelope` from `SomeMetadata`. Similarly, for every peer discovery service the client uses, this would be enforced for the service-specific metadata.

For services which receive no metadata, a generic `Handshaker` would be accepted and no constraint would be put on the contained `Envelope`.

With this example, we can see how a `mio` event loop is now integrated with a number of peer discovery services and can accept both metadata and connections over a single channel.
- `From` impls for the initial `BTHandshaker` as well as the services it uses
- `Handshaker` makes _few_ assumptions about the underlying transport
- `Handshaker` is manipulated in a peer discovery service's own thread, so synchronous programming requires a bit more effort
- `BTHandshaker` requires users to wrap types such as `mio`'s `Sender`
We need message containers for common extension messages, as well as some standard extension protocol container, which itself can contain multiple extension protocols (along with their messages):

- `InfoDictionary` (http://www.bittorrent.org/beps/bep_0009.html)

Currently, we take a read lock on our `Filters` in the event loop, which is less than ideal. We would like some way to get a read lock on that structure which works well with our event loop.
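One assumed alternative (not necessarily the right fit here, and not the current design) is copy-on-write snapshots: the event loop grabs a cheap `Arc` clone instead of holding a read lock across its work, and writers swap in a whole new filter set:

```rust
use std::sync::{Arc, Mutex};

// Illustrative stand-in for the real Filters structure.
struct Filters {
    blocked_ports: Vec<u16>,
}

/// Readers never hold a lock across event-loop work; the mutex is held
/// only long enough to clone or replace the Arc.
struct SharedFilters {
    current: Mutex<Arc<Filters>>,
}

impl SharedFilters {
    fn snapshot(&self) -> Arc<Filters> {
        self.current.lock().unwrap().clone() // lock held only for the clone
    }

    fn replace(&self, next: Filters) {
        *self.current.lock().unwrap() = Arc::new(next);
    }
}

fn main() {
    let shared = SharedFilters {
        current: Mutex::new(Arc::new(Filters { blocked_ports: vec![] })),
    };

    let before = shared.snapshot();
    shared.replace(Filters { blocked_ports: vec![6881] });
    let after = shared.snapshot();

    // Old snapshots stay valid; new readers see the replacement.
    assert!(before.blocked_ports.is_empty());
    assert_eq!(after.blocked_ports, vec![6881]);
}
```

The trade-off is that updates rebuild the whole structure, which suits read-heavy filters; a lock-free pointer swap (e.g. an atomics-based Arc swap) would remove even the brief mutex hold.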
Any chance some examples of how to use the different crates together will be added?
When we serialize a `Bencode` object, we explicitly sort the keys before encoding them so that we follow the bencode specification.

Since then, we have derived `Debug` for `Bencode` and allowed `Bencode` building via macros. In the current design we do not sort or validate the data going through these two avenues, so we would have to sort, or verify the sorting of, dictionary keys every time we do a `Debug` print or build a `Bencode` object via macros. It will be significantly easier (and maybe more performant, depending on our usage) if we just switch from a `HashMap` to a `BTreeMap` in `Bencode::Dict`.
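A sketch of why the switch helps: `BTreeMap` iterates in sorted key order, so encoding becomes a plain in-order walk with no explicit sort step. This toy string-to-string encoder is not the real `Bencode` type; note the spec sorts keys as raw byte strings, which matches `str` ordering for ASCII keys like these:

```rust
use std::collections::BTreeMap;

/// Toy bencode encoder for a string-to-string dictionary.
fn bencode_dict(dict: &BTreeMap<&str, &str>) -> String {
    let mut out = String::from("d");
    // BTreeMap iteration is already in sorted key order, so the spec's
    // "keys must appear sorted" rule holds for free.
    for (key, value) in dict {
        out.push_str(&format!("{}:{}", key.len(), key));
        out.push_str(&format!("{}:{}", value.len(), value));
    }
    out.push('e');
    out
}

fn main() {
    let mut dict = BTreeMap::new();
    dict.insert("spam", "b"); // inserted out of order on purpose
    dict.insert("eggs", "a");
    // Keys come out sorted (eggs before spam) without any sort call.
    assert_eq!(bencode_dict(&dict), "d4:eggs1:a4:spam1:be");
}
```

The same property makes a derived `Debug` print deterministic and spec-ordered, which is exactly what the `HashMap` version cannot guarantee.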