mamba-org / rattler
Rust crates to work with the Conda ecosystem.
License: BSD 3-Clause "New" or "Revised" License
When executing JLAP one can notice the synchronous execution of the patching step. We should wrap the deserialization, patching and serialization steps in tokio::task::spawn_blocking.
We prototyped this yesterday and one of the challenges is the ownership & lifetime of some of the variables. One workaround would be to use the https://docs.rs/async-scoped/latest/async_scoped/ crate.
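A minimal sketch of what that could look like, assuming the raw repodata and the patches are owned values that can be moved into the closure (spawn_blocking requires 'static) and using the json-patch crate for the patch step:

use serde_json::Value;

async fn apply_jlap_patches(raw: String, patches: Vec<json_patch::Patch>) -> anyhow::Result<String> {
    // Run the CPU-heavy deserialize/patch/serialize pipeline on a blocking
    // thread so it does not stall the async executor.
    tokio::task::spawn_blocking(move || -> anyhow::Result<String> {
        let mut doc: Value = serde_json::from_str(&raw)?;
        for patch in &patches {
            json_patch::patch(&mut doc, patch)?;
        }
        Ok(serde_json::to_string(&doc)?)
    })
    .await? // a JoinError here means the blocking task panicked

}

Moving ownership into the closure sidesteps the lifetime issue at the cost of cloning anything the caller still needs; async-scoped would avoid that clone.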
When installing numpy on macOS ARM64, the new libsolv implementation picks libgfortran5-12.2.0-h0eea778_32. Conda and mamba choose libgfortran5-12.3.0-ha3a6a3e_0.
I wonder if that has to do with libgfortran-5.0.0-12_3_0_hd922786_0 vs. libgfortran-5.0.0-12_2_0_hd922786_32. If we compare / sort by build number vs. build string we might pick the older one (because it has a much higher build number).
Currently a rattler user has to choose the cache dir, but I think it would be convenient if we exported the default value used (dirs::cache_dir().join("rattler/cache")).
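A minimal sketch of such a helper (the name default_cache_dir is hypothetical):

use std::path::PathBuf;

/// The default cache directory rattler uses when the user does not choose one.
pub fn default_cache_dir() -> Option<PathBuf> {
    Some(dirs::cache_dir()?.join("rattler/cache"))
}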
Our matchspec implementation currently doesn't implement build number comparison, like >6 in python 3.9.3[build_number=">6"]. This would be a nice addition!
Conda implementation: https://github.com/conda/conda/blob/main/conda/models/version.py#L613
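A minimal sketch of what such a matcher could look like (the type and its names are hypothetical, not the actual rattler or conda API):

/// A build number constraint such as `>6`, `>=6` or a plain `6`.
enum BuildNumberSpec {
    Eq(u64),
    Gt(u64),
    Ge(u64),
    Lt(u64),
    Le(u64),
}

impl BuildNumberSpec {
    fn parse(spec: &str) -> Option<Self> {
        if let Some(rest) = spec.strip_prefix(">=") {
            rest.parse().ok().map(Self::Ge)
        } else if let Some(rest) = spec.strip_prefix("<=") {
            rest.parse().ok().map(Self::Le)
        } else if let Some(rest) = spec.strip_prefix('>') {
            rest.parse().ok().map(Self::Gt)
        } else if let Some(rest) = spec.strip_prefix('<') {
            rest.parse().ok().map(Self::Lt)
        } else {
            spec.parse().ok().map(Self::Eq)
        }
    }

    fn matches(&self, build_number: u64) -> bool {
        match *self {
            Self::Eq(n) => build_number == n,
            Self::Gt(n) => build_number > n,
            Self::Ge(n) => build_number >= n,
            Self::Lt(n) => build_number < n,
            Self::Le(n) => build_number <= n,
        }
    }
}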
Some parts of rattler use the extendhash crate. However, the same functionality is also provided by the md-5 and sha2 crates, which are also used. Since md-5 and sha2 seem more complete we should get rid of extendhash.
For reproducibility reasons we want to ensure that fields are serialized in a deterministic (alphabetical) order.
To achieve this, we have a macro in rattler_macros
that ensures that all struct fields are sorted alphabetically.
We need to apply this to more structs, such as:
IndexJson
AboutJson
It would be nice to implement the jlap format for partial repodata updates (using a specific sequence of JSON patch applications). That makes interactive usage faster because only a subset of the entire data needs to be fetched over the network.
Conda added support for FreeBSD (conda/conda#12647). We should also add it.
In rattler-build we have the pin_subpackage and pin_compatible functions. They work by specifying a max_pin and min_pin with x.x.x syntax (where each x stands for one version element).
If we have a package A with version 1.2.3.4 and we use pin_subpackage(A, max_pin="x.x", min_pin="x.x") we are expected to convert this to A >=1.2,<1.3.
For this it would be ideal if we could extract sub-versions based on an index (e.g. version[..2] to get the first two elements, and then also version[..2].bump() to create the upper bound), as in the sketch below.
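A minimal sketch of the desired semantics using plain string manipulation (sub_version and bump are hypothetical helpers, not the real Version API):

/// Keep only the first `n` elements: sub_version("1.2.3.4", 2) == "1.2".
fn sub_version(version: &str, n: usize) -> String {
    version.split('.').take(n).collect::<Vec<_>>().join(".")
}

/// Increment the last element: bump("1.2") == "1.3".
fn bump(version: &str) -> String {
    let mut parts: Vec<u64> = version.split('.').filter_map(|p| p.parse().ok()).collect();
    if let Some(last) = parts.last_mut() {
        *last += 1;
    }
    parts.iter().map(|p| p.to_string()).collect::<Vec<_>>().join(".")
}

fn main() {
    // pin_subpackage(A, max_pin="x.x", min_pin="x.x") on version 1.2.3.4:
    let lower = sub_version("1.2.3.4", 2);
    assert_eq!(format!("A >={lower},<{}", bump(&lower)), "A >=1.2,<1.3");
}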
It would be cool if we integrated dependabot to get automatic dependency updates.
I think usize is platform dependent and will be a u32 on 32-bit systems and a u64 on 64-bit systems. I think that doesn't make much sense for structs that are parsed from or written to JSON, since it'd be better to strictly encode the maximum size for these integers regardless of the host CPU architecture.
WDYT @baszalmstra?
What I meant to write was >=3.8,<3.9 but missing the , gives you really weird behavior, as it only uses the >=3.8. Conda invalidates this request, as should rattler.
Should we parse depends and constrains into matchspecs in IndexJson or repodata?
For type safety and because it is nice, we should deserialize the timestamp to the proper timestamp type. @baszalmstra prefers chrono::DateTime<Utc>.
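A minimal sketch, assuming the on-disk timestamp is milliseconds since the Unix epoch (as it is in repodata):

use chrono::{DateTime, TimeZone, Utc};

fn parse_timestamp(millis: i64) -> Option<DateTime<Utc>> {
    // `single()` rejects out-of-range or ambiguous values.
    Utc.timestamp_millis_opt(millis).single()
}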
We want to also expose the awesomeness of this library to Python so it because easy to use from within the Python ecosystem as well.
Currently we're storing SHA256 and MD5 hashes as strings. I am wondering if it would be better to store them as typed byte arrays and de-/serialize them at the serde level only?
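A minimal sketch of the serialization half, keeping the bytes typed and only rendering hex at the serde boundary (serialize_sha256 is a hypothetical helper that could be wired up with #[serde(serialize_with = "...")] on the struct field):

use serde::Serializer;

/// Serialize a raw SHA256 digest as a lowercase hex string.
fn serialize_sha256<S: Serializer>(bytes: &[u8; 32], serializer: S) -> Result<S::Ok, S::Error> {
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    serializer.serialize_str(&hex)
}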
Only noarch should always be required -- a missing win-64, linux-64, etc. folder should be ignored.
To improve the chances of reproducible packages, we should always make sure that the map keys of the JSON files are sorted alphabetically when being serialized, e.g. when writing out an index.json, about.json or paths.json file.
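As a minimal illustration, serde_json already writes BTreeMap keys in sorted order, while HashMap iteration order is arbitrary:

use std::collections::BTreeMap;

fn main() {
    let mut depends = BTreeMap::new();
    depends.insert("zlib", "1.2.13");
    depends.insert("bzip2", "1.0.8");
    // BTreeMap iterates in key order, so the output is deterministic.
    assert_eq!(
        serde_json::to_string(&depends).unwrap(),
        r#"{"bzip2":"1.0.8","zlib":"1.2.13"}"#
    );
}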
I wonder if we should parse the contents of RunExportsJson as Vec<MatchSpec>? The only issue I have is that we can't guarantee roundtrip parsing (e.g. I think the output of matchspec to str is sometimes different, but equivalent, from what it parses).
I think currently the files in the cache aren't updated if a package with the same name (but different hash) is requested. Happens mostly when building new packages locally :)
Currently, the tests for package extraction just check if the process doesn't error out. This doesn't really test if a package is properly extracted.
The tests for validating packages do use the same code to extract a package and validate its content but it would be nice if we could compute a sha256 hash of an extracted package and validate that that is indeed what we would expect.
We can use the rstest crate to create test cases with the package to extract and an expected hash as input, e.g.:
#[rstest]
#[case("some_package.conda", "somecomplexshahashnumber")]
fn validate_extracted_packages(#[case] path: &str, #[case] hash: &str) {
    // ...
}
When developing this feature some care is required to ensure that symlinks are properly hashed. I think it would be better to hash the link itself instead of the content it points to. This is because ../a and ./../a refer to the same file but are different symlinks.
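A minimal sketch of feeding the raw link target into the hash instead of following the link (hash_entry is a hypothetical helper):

use sha2::{Digest, Sha256};
use std::path::Path;

fn hash_entry(path: &Path, hasher: &mut Sha256) -> std::io::Result<()> {
    let metadata = std::fs::symlink_metadata(path)?; // does not follow links
    if metadata.file_type().is_symlink() {
        // Hash the link target itself, so ../a and ./../a hash differently.
        hasher.update(std::fs::read_link(path)?.as_os_str().as_encoded_bytes());
    } else {
        hasher.update(std::fs::read(path)?);
    }
    Ok(())
}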
Unfortunately, there is at least one instance of a MatchSpec that currently crashes rattler: jinja2 >2.10* in jupyter-server. We could either parse this as 2.10* or we patch the repodata.
Support for matching package records by hash is missing. MatchSpec does not include these fields, and parsing them should also be added. This should work:
python[sha256=deadbeef, md5=something]
Currently, when using key-value pairs in matchspecs we require there to be quotes around the values. This shouldn't actually be necessary; it would be nice if we could refactor the matchspec parsing a little to make sure this is not required.
So our parser currently only accepts:
foo[build="py2*"]
but it should also support:
foo[build=py2*]
This can be a little tricky in the case of version specs, because there it's ambiguous whether the comma indicates that a next key should follow or whether it's part of the version spec. We should check how conda handles this:
foo[version=1.3,2.0]
All public methods in rattler crates should have proper docs. We can force this with #![warn(missing_docs)] or even #![deny(missing_docs)]. Some crates already have this but we need to play catch-up for some of the others.
conda/mamba know about strict and less strict channel priority and I think we should (at least) implement strict channel priority.
Instead of iterating over all archives in the package extraction tests we can use the rstest crate to create individual test cases for all archives. This will give us better insight into exactly which package caused a test failure. It will likely also give us a little better overview of which test cases take a long time.
By using cases from rstest we do lose the ability to use a glob pattern to find all test cases, but I think it would also be good to not test all package archives contained in this repository per se, and to use more targeted tests with some problematic cases instead.
So instead of something like this (current implementation):
#[test]
fn test_extract_conda() {
    let temp_dir = Path::new(env!("CARGO_TARGET_TMPDIR"));
    println!("Target dir: {}", temp_dir.display());
    for file_path in
        find_all_archives().filter(|path| ArchiveType::try_from(path) == Some(ArchiveType::Conda))
    {
        // ..
    }
}
We do something like this:
#[rstest]
#[case("conda-22.11.1-py38haa244fe_1.conda")]
#[case("mamba-1.1.0-py39hb3d9227_2.conda")]
fn test_extract_conda(#[case] input: &str) {
    // ..
}
We can also use rstest_reuse to reuse some of the test cases.
We use very dumb logic to detect the current shell. I would like to have a proper implementation. We can base it on: https://docs.rs/clap_complete/latest/clap_complete/shells/enum.Shell.html#method.from_env
Ideally, we also have a function to return the "default" shell for the platform. This can then be used when spawning a shell when the current one cannot be used.
conda / mamba have a flag to add pip as a python dependency. This isn't done in the repodata to prevent a circular dependency situation (I think), but the package manager does add it "on the fly".
This is useful because most people / developers expect pip to be installed alongside Python, and it's also good if it's pulled in when another package that depends on Python is installed. E.g. installing some noarch package like rich should also automatically install pip.
Conda and Mamba both support having multiple cache directories. Rattler currently only supports a single directory. The "layered" cache can have multiple readable directories and only one writable directory. Conda/Mamba tries to write a magical file to all these directories at startup to determine which one is writable.
Rattler should also facilitate "layered" package caches. We could introduce a trait for PackageCache that is implemented for both the current implementation as well as a layered version.
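A minimal sketch of what such a trait could look like (all names are hypothetical):

use std::path::PathBuf;

/// One layer of a package cache; possibly read-only.
trait PackageCacheLayer {
    /// Look up a package in this layer, returning its path if present.
    fn find(&self, pkg: &str) -> Option<PathBuf>;
    /// Whether new packages may be written to this layer.
    fn is_writable(&self) -> bool;
}

/// A cache composed of several read layers; earlier layers shadow later ones.
struct LayeredPackageCache {
    layers: Vec<Box<dyn PackageCacheLayer>>,
}

impl LayeredPackageCache {
    fn find(&self, pkg: &str) -> Option<PathBuf> {
        self.layers.iter().find_map(|layer| layer.find(pkg))
    }
}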
Apparently it's possible to give zstd a hint for the "decompressed size" of the tarball and that is supposed to improve memory usage when decoding.
The conda team added this as described here: https://conda.discourse.group/t/conda-package-streaming-0-8-0-and-conda-package-handling-2-1-0-released/267
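On the Rust side, a minimal sketch using the zstd crate's bulk API, assuming the decompressed size is known up front (e.g. read from the frame header or package metadata):

fn decompress_with_hint(compressed: &[u8], decompressed_size: usize) -> std::io::Result<Vec<u8>> {
    // The capacity parameter lets zstd allocate the output buffer once.
    zstd::bulk::decompress(compressed, decompressed_size)
}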
We are currently using libsolv as the solver backend in rattler_solve through FFI calls. When passing the information about the available packages to libsolv, parts of the data (such as a package's build string or license) need to be converted from a Rust string to a NUL-terminated string (using the CString Rust type). This assumes that the original string does not contain NUL characters, which is not always guaranteed.
In its current state, rattler is unable to solve an environment if the repodata.json for a particular channel includes strings with \u0000. Adding the following package to test-data/channels/dummy/linux-64/repodata.json, for instance, causes all tests related to that file to fail:
"baz-1.0-unix_py36h1af98f8_2\u0000.tar.bz2": {
"build": "unix_py36h1af98f8_2\u0000",
"build_number": 1,
"depends": [
"__unix"
],
"license": "MIT",
"license_family": "MIT",
"md5": "bc13aa58e2092bcb0b97c561373d3905",
"name": "bar",
"sha256": "97ec377d2ad83dfef1194b7aa31b0c9076194e10d995a6e696c9d07dd782b14a",
"size": 414494,
"subdir": "linux-64",
"timestamp": 1605110689658,
"version": "1.2.3"
}
I see two alternatives to go about this:
- Construct CStrings that are always valid, by replacing NUL characters with something else (similar to String::from_utf8_lossy). I will push a PR for this shortly, so you can see how it would look.
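For the CString alternative, a minimal sketch of the lossy conversion (the replacement character is an arbitrary choice here):

use std::ffi::CString;

/// Build a CString that can never fail, by replacing interior NUL characters.
fn c_string_lossy(s: &str) -> CString {
    CString::new(s.replace('\0', "\u{FFFD}")).expect("no interior NUL bytes remain")
}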
I thought that we had to change the parsing of matchspecs, but after double-checking it seems to be correct. However, we had some resolve issues with libabseil and constraints of the following sort:
libabseil 20230125.0 cxx17*
should match libabseil-20230125.0-cxx17_hb7217d7_0. I'll try to add a test for this.
The MatchSpec currently also matches 3.1 or other versions. In fact, the <4 should be completely ignored, and the matchspec should match only 2.4.*.
In the IndexJson struct we could parse platform (or rather the subdir field) into the Platform enum (with an escape hatch for Platform::Other(String), I would argue).
Similarly, I would say that we can remove the arch and subdir fields and only reconstruct them for serialization?
Basically, the relationship is as follows: subdir is <platform>-<arch>, with the special case that for 64 the arch is x86_64 and for 32 the arch is x86.
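A minimal sketch of reconstructing subdir from the two parts (a hypothetical free function, not the rattler API):

/// Rebuild `subdir` from platform and arch, e.g. ("linux", "x86_64") -> "linux-64".
fn subdir(platform: &str, arch: &str) -> String {
    let arch = match arch {
        "x86_64" => "64",
        "x86" => "32",
        other => other,
    };
    format!("{platform}-{arch}")
}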
Should we create a Channel type that we can use in RepoDataRecord and store it, or a smart pointer to a shared / cached channel? Or should we type it as a URL?
It would be nice to be able to call a .url() and .name() method on the channel to get the short or expanded versions.
We're looking to bolster the robustness of Rattler's matchspec implementation, particularly in the area of input parsing. While we currently have some testing mechanisms in place, we'd like to ensure we are fully covered, especially with respect to certain error scenarios.
Two primary errors that need to be checked for are:
StringMatcherParseError
ParseMatchSpecError
We recommend taking a look at Conda's tests for some inspiration:
https://github.com/conda/conda/blob/9e8425844a28ffad0c4a3adcf28a2e769f965947/tests/models/test_match_spec.py
This is a fantastic opportunity for anyone looking to make their first contribution, as this issue is primarily about adding more tests. We welcome incremental progress, so don't feel pressured to craft a massive Pull Request - a single test at a time is perfectly fine!
If you have any questions or need any assistance, feel free to ask here or join our conversation on Discord. We're excited to see your contributions!
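A minimal sketch of the kind of test this issue asks for, assuming MatchSpec implements FromStr (the invalid input and the expected error variant should be checked against the actual parser):

use rattler_conda_types::MatchSpec;
use std::str::FromStr;

#[test]
fn rejects_malformed_matchspec() {
    // A bracket key without a value should fail to parse.
    assert!(MatchSpec::from_str("python[version=]").is_err());
}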
We also need to dump a JSON struct in the rattler-build part, and ideally it looks the same as the Python JSON dumps. It could be useful to reuse the PythonFormatter there, too.
The file format for conda-lock seems to have changed slightly. Instead of having all locked packages under packages, it's now first grouped by platform.
We should implement the solver error message algorithm that's found in mamba (for the libsolv solver).
We can enable the zstdmt feature on the zstd crate and use multithreading, as described here:
https://docs.rs/zstd/latest/zstd/stream/write/struct.Encoder.html#method.multithread
This applies to the package writing functions here: https://github.com/mamba-org/rattler/blob/main/crates/rattler_package_streaming/src/write.rs
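A minimal sketch following the linked docs (requires the zstdmt cargo feature on the zstd crate):

use std::io::Write;

fn compress_multithreaded(dest: std::fs::File, data: &[u8]) -> std::io::Result<()> {
    let mut encoder = zstd::stream::write::Encoder::new(dest, 19)?;
    encoder.multithread(4)?; // number of zstd worker threads
    encoder.write_all(data)?;
    encoder.finish()?;
    Ok(())
}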
Multiple processes are able to write to the package cache at the same time. At this point, they are completely unaware of each other. This could cause problems when multiple processes try to write to the same cache.
To work around this issue we want to introduce file locking. Conda and Mamba both already have a system in place to facilitate this. We should mimic their behavior for compatibility to ensure that when both Rattler and Conda/Mamba write to the cache there are no issues.
@wolfv about the mamba implementation:
The different operating systems (UNIX and Windows) support methods to lock files (e.g. on unix it's fcntl whatever). We started out with the idea of writing the PID of the locking process into the file but that was pretty brittle. We can just rely on the OS to make sure a file is locked or unlocked.
There are a few crates for this but I don't know how good they are.
Network drives don't support file locking so we would need a different solution there.
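A minimal sketch using the fs2 crate (just one of the candidate crates; this does not yet mimic the conda/mamba lock file layout, and the network-drive caveat still applies):

use fs2::FileExt;
use std::fs::OpenOptions;
use std::path::Path;

fn with_cache_lock(cache_dir: &Path, f: impl FnOnce() -> std::io::Result<()>) -> std::io::Result<()> {
    let lock_file = OpenOptions::new()
        .create(true)
        .write(true)
        .open(cache_dir.join(".lock"))?;
    lock_file.lock_exclusive()?; // blocks until no other process holds the lock
    let result = f();
    lock_file.unlock()?;
    result
}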
I think we need to expose it as a feature to be able to use reqwest with rustls-tls (and not openssl), correct?
The tuple field of the NoArchType is private and that makes it hard to construct it IIUC.
The default reqwest client doesn't seem to handle file URLs. I think we could fix that in the package cache implementation (or should we make a client wrapper?)
We currently keep the hashes as String, but it would be nice for type safety to deserialize them into proper types. We have rattler_digest::Sha256Hash and Md5Hash now that serve this purpose in PathsJson.
Since environments link to files from a central package cache, we need a mechanism to clean that cache from time to time. For that, it would be great to register existing environments in a json file or similar somewhere. Maybe we should have this mechanism in rattler (if we expect the rattler default cache directory to be used often).
To improve the reproducibility of packages, we should serialize the paths in alphabetical order. This means we need to sort the Vec<PathsEntry> alphabetically on the _path attribute before serializing.
Similarly, the has_prefix and files text files should always be sorted deterministically (alphabetically).
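A minimal sketch of the sorting step, assuming the PathsEntry field that serializes as _path is called relative_path (check the actual struct definition):

use rattler_conda_types::package::PathsEntry;

fn sort_paths(entries: &mut [PathsEntry]) {
    // Deterministic order: sort on the field that serializes as `_path`.
    entries.sort_by(|a, b| a.relative_path.cmp(&b.relative_path));
}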