mamba-org / rattler

Rust crates to work with the Conda ecosystem.

License: BSD 3-Clause "New" or "Revised" License

Rust 88.38% Python 11.23% HTML 0.26% CSS 0.11% Batchfile 0.01% Shell 0.01%
Topics: conda, rust

rattler's People

Contributors

0xbe7a, aochagavia, baszalmstra, benjaminlowry, clement-chaneching, dependabot[bot], dholth, github-actions[bot], hadim, hofer-julian, iamthebot, jaimergp, johnhany97, johnwillliam, kassoulait, manulpatel, mariusvniekerk, nichmor, orhun, pavelzw, ruben-arts, tdejager, travishathaway, tusharsadhwani, vlad-ivanov-name, wackyator, wolfv, yeungonion

rattler's Issues

jlap improvements

When executing JLAP, one can see that the patching step runs synchronously. We should wrap the deserialization, patching, and serialization steps in tokio::task::spawn_blocking.

We prototyped this yesterday, and one of the challenges is the ownership and lifetimes of some of the variables.

One workaround would be to use the https://docs.rs/async-scoped/latest/async_scoped/ crate.
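
A minimal sketch of the spawn_blocking approach, sidestepping the ownership issue by moving owned data into the closure (the function name and the use of anyhow/json_patch here are illustrative, not rattler's actual code):

use tokio::task::spawn_blocking;

async fn apply_patches_blocking(
    raw: String,
    patches: Vec<json_patch::Patch>,
) -> anyhow::Result<String> {
    spawn_blocking(move || {
        // Deserialize, patch, and re-serialize entirely on a blocking thread
        // so the async executor is never stalled by this CPU-bound work.
        let mut doc: serde_json::Value = serde_json::from_str(&raw)?;
        for patch in &patches {
            json_patch::patch(&mut doc, patch)?;
        }
        Ok(serde_json::to_string(&doc)?)
    })
    .await?
}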

libsolv-rs is choosing slightly older libgfortran5 (on macos-arm64)

When installing numpy on macOS ARM64, the new libsolv implementation picks libgfortran5-12.2.0-h0eea778_32.

Conda and mamba choose libgfortran5-12.3.0-ha3a6a3e_0.

I wonder if that has to do with libgfortran-5.0.0-12_3_0_hd922786_0 vs. libgfortran-5.0.0-12_2_0_hd922786_32. If we compare / sort by build number rather than build string, we might pick the older one (because it has a much higher build number).

Use default_cache_dir in create

Currently a rattler user has to choose the cache dir, but I think it would be convenient if we exported the default value used (dirs::cache_dir().join("rattler/cache")).
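
A minimal sketch of the proposed export (the function name is hypothetical):

// Returns the default rattler cache directory, if the platform has one.
pub fn default_cache_dir() -> Option<std::path::PathBuf> {
    dirs::cache_dir().map(|dir| dir.join("rattler/cache"))
}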

Remove dependency on `extendhash`

Some parts of rattler use the extendhash crate. However, the same functionality is also provided by the md-5 and sha2 crates, which are also used. Since md-5 and sha2 seem more complete, we should get rid of extendhash.
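
For reference, the equivalent functionality via the sha2 crate that rattler already depends on:

use sha2::{Digest, Sha256};

// Compute a hex-encoded SHA256 digest without extendhash.
fn sha256_hex(bytes: &[u8]) -> String {
    format!("{:x}", Sha256::digest(bytes))
}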

Use the `#[sorted]` macro on more conda json types

For reproducibility reasons we want to ensure that fields are serialized in a deterministic (alphabetical) order.

To achieve this, we have a macro in rattler_macros that ensures that all struct fields are sorted alphabetically.

We need to apply this to more structs (a sketch follows the list), such as

  • IndexJson
  • AboutJson
  • possibly more
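
A sketch of how this could look, assuming the attribute is exposed as rattler_macros::sorted; the fields shown are illustrative, the point being that the macro rejects any non-alphabetical field ordering at compile time:

#[rattler_macros::sorted]
#[derive(serde::Serialize)]
pub struct IndexJson {
    pub arch: Option<String>,
    pub build: String,
    pub name: String,
    pub version: String,
    // ...
}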

Implement the `jlap` format

It would be nice to implement the jlap format for partial repodata updates (using a specific sequence of JSON patch applications).

That makes interactive usage faster because only a subset of the entire data needs to be fetched over the network.

version: access element range / bump elements range

In rattler-build we have the pin_subpackage and pin_compatible functions. They work by specifying a max_pin and min_pin with x.x.x syntax (where each x stands for one version element).

If we have a package A with version 1.2.3.4 and we use

pin_subpackage(A, max_pin="x.x", min_pin="x.x")

we are expected to convert this to A >=1.2,<1.3.

For this it would be ideal if we could extract sub-versions based on an index (e.g. version[..2] to get the first two elements, and then also version[..2].bump() to create the upper bound).
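
A naive, self-contained illustration of the requested semantics, operating on dot-separated string segments instead of the real Version type:

// Take the first `elements` version segments and return (lower, upper)
// bounds, where the upper bound has its last kept segment bumped by one.
fn pin_bounds(version: &str, elements: usize) -> (String, String) {
    let parts: Vec<u64> = version
        .split('.')
        .take(elements)
        .map(|p| p.parse().expect("numeric version element"))
        .collect();
    let lower = parts.iter().map(u64::to_string).collect::<Vec<_>>().join(".");
    let mut bumped = parts;
    *bumped.last_mut().expect("at least one element") += 1;
    let upper = bumped.iter().map(u64::to_string).collect::<Vec<_>>().join(".");
    (lower, upper)
}

// pin_bounds("1.2.3.4", 2) == ("1.2".to_string(), "1.3".to_string()),
// i.e. pin_subpackage(A, max_pin="x.x", min_pin="x.x") => A >=1.2,<1.3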

Remove `usize` from types in conda_package_types

usize is platform dependent: it will be a u32 on 32-bit systems and a u64 on 64-bit systems.

I think that doesn't make much sense for structs that are parsed from or written to JSON, since it'd be better to strictly encode the maximum size of these integers regardless of the host CPU architecture.
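
For example (the struct and field here are illustrative), a fixed-width integer keeps the serialized representation identical on 32-bit and 64-bit hosts:

#[derive(serde::Serialize, serde::Deserialize)]
pub struct FileEntry {
    pub size_in_bytes: u64, // fixed width instead of usize
}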

WDYT @baszalmstra ?

Invalidate `>=3.8<3.9`

What I meant to write was >=3.8,<3.9, but omitting the , gives really weird behavior, as only the >=3.8 part is used.

Conda rejects this spec, and so should rattler.
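
A sketch of the expected behavior as a test, assuming the rejection would land in VersionSpec's FromStr implementation:

#[test]
fn rejects_missing_comma() {
    // A spec missing the comma should fail to parse instead of silently
    // dropping the second constraint.
    assert!(">=3.8<3.9".parse::<rattler_conda_types::VersionSpec>().is_err());
}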

Bindings for Python

We want to also expose the awesomeness of this library to Python, so it becomes easy to use from within the Python ecosystem as well.

  • We can start by binding small library parts to Python like version, matchspec, etc.
  • It would be nice to use Python async for the async parts (like downloading).
  • We can use pyo3 to generate the bindings (see the sketch below).
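
A minimal pyo3 sketch (pyo3 0.19-era APIs; module and function names are illustrative, not the eventual rattler API), exposing version parsing to Python:

use pyo3::prelude::*;

// Parse a conda version string and return its normalized form, raising a
// Python ValueError on invalid input.
#[pyfunction]
fn parse_version(version: &str) -> PyResult<String> {
    let parsed: rattler_conda_types::Version = version
        .parse()
        .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("{e}")))?;
    Ok(format!("{parsed}"))
}

#[pymodule]
fn py_rattler(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(parse_version, m)?)?;
    Ok(())
}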

Typed checksums

Currently we're storing SHA256 and MD5 hashes as strings. I am wondering if it would be better to store them as typed byte arrays and de-/serialize them at the serde level only?
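
One possible shape, with hex (de)serialization handled only at the serde boundary (using the hex crate's serde support; the struct is illustrative):

#[derive(serde::Serialize, serde::Deserialize)]
pub struct Checksums {
    // Stored as fixed-size byte arrays in memory, hex strings on the wire.
    #[serde(with = "hex::serde")]
    pub sha256: [u8; 32],
    #[serde(with = "hex::serde")]
    pub md5: [u8; 16],
}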

RunExportsJson - parse into MatchSpec?

I wonder if we should parse the contents of RunExportsJson as Vec<MatchSpec>?

The only issue I have is that we can't guarantee round-trip parsing (e.g. I think the string output of a matchspec is sometimes different from, but equivalent to, what was parsed).

Test extraction using an expected hash

Currently, the tests for package extraction just check that the process doesn't error out. This doesn't really test whether a package was properly extracted.

The tests for validating packages do use the same code to extract a package and validate its content, but it would be nice if we could compute a sha256 hash of an extracted package and verify that it matches what we expect.

We can use the rstest crate to create test cases with the package to extract and an expected hash as an input. e.g.:

#[rstest]
#[case("some_package.conda", "somecomplexshahashnumber")]
fn validate_extracted_packages(#[case] path: &str, #[case] hash: &str) {
    // ...
}

When developing this feature, some care is required to ensure that symlinks are properly hashed. I think it would be better to hash the link itself instead of the content it points to, because ../a and ./../a refer to the same file but are different symlinks.
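
A sketch of that caveat: feed the link target itself into the hash rather than the contents behind it (the choice of hasher here is illustrative):

use sha2::{Digest, Sha256};
use std::path::Path;

// Hash the symlink's target string, so ../a and ./../a hash differently
// even though they resolve to the same file.
fn hash_symlink(link: &Path, hasher: &mut Sha256) -> std::io::Result<()> {
    let target = std::fs::read_link(link)?;
    hasher.update(target.to_string_lossy().as_bytes());
    Ok(())
}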

`jinja2 >2.10*` results in a crash

Unfortunately, there is at least one instance of a MatchSpec that currently crashes rattler:

jinja2 >2.10* in jupyter-server.

We could either parse this as 2.10* or patch the repodata.

Implement parsing key value pairs of matchspec without quotes

Currently, when using key-value pairs in matchspecs, we require quotes around the values. This shouldn't actually be necessary; it would be nice if we could refactor the matchspec parsing a little to lift this requirement.

So our parser currently only accepts:

foo[build="py2*"]

but it should also support

foo[build=py2*]

This can be a little tricky in the case of version specs, because there it's ambiguous whether the comma indicates that the next key follows or whether it's part of the version spec. We should check how conda handles this.

foo[version=1.3,2.0]

Force docs on public functions and structs

All public methods in rattler crates should have proper docs. We can force this with #![warn(missing_docs)] or even #![deny(missing_docs)]. Some crates already have this, but we need to play catch-up for the others.

Use `rstest` for extraction

Instead of iterating over all archives in the package extraction tests, we can use the rstest crate to create individual test cases for all archives. This will give us better insight, when a test fails, into which exact package caused the issue. It will likely also give us a somewhat better overview of which test cases take a long time.

By using cases from rstest we lose the ability to use a glob pattern to find all test cases, but I think it would also be good not to test every package archive contained in this repository per se, and instead use more targeted tests with some problematic cases.

So instead of something like this (current implementation):

#[test]
fn test_extract_conda() {
    let temp_dir = Path::new(env!("CARGO_TARGET_TMPDIR"));
    println!("Target dir: {}", temp_dir.display());

    for file_path in
        find_all_archives().filter(|path| ArchiveType::try_from(path) == Some(ArchiveType::Conda))
    {
        // ..
    }
}

We do something like this:

#[rstest]
#[case("conda-22.11.1-py38haa244fe_1.conda")]
#[case("mamba-1.1.0-py39hb3d9227_2.conda")]
fn test_extract_conda(#[case] input: &str) {
    // ..
}

We can also use rstest_reuse to reuse some of the test cases.

Implement the `add_pip_to_python` weirdness

conda / mamba have a flag to add pip as a python dependency. This isn't done in the repodata, to prevent a circular dependency situation (I think), but the package manager does add it "on the fly".

This is useful because most people / developers expect pip to be installed alongside Python, and it's also good if pip is pulled in when another package that depends on Python is installed.

E.g. installing some noarch package like rich should also automatically install pip.
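
A sketch of the "on the fly" injection; PackageRecord is stood in for by a minimal struct, and the real matching would go through proper name handling:

// Illustrative stand-in for rattler's PackageRecord:
struct PackageRecord {
    name: String,
    depends: Vec<String>,
}

// Inject pip as a dependency of every python record after loading repodata.
fn add_pip_to_python(records: &mut [PackageRecord]) {
    for record in records.iter_mut().filter(|r| r.name == "python") {
        record.depends.push("pip".to_string());
    }
}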

Implement layered package caches

Conda and Mamba both support having multiple cache directories. Rattler currently only supports a single directory. The "layered" cache can have multiple readable directories and only one writable directory. Conda/Mamba try to write a magical file to all these directories at startup to determine which one is writable.

Rattler should also facilitate "layered" package caches. We could introduce a trait for PackageCache that is implemented both for the current implementation and for a layered version.
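
A sketch of the trait idea (names are hypothetical): many readable layers, exactly one writable layer.

use std::path::{Path, PathBuf};

trait LayeredPackageCache {
    /// Look up a cached package in any readable layer.
    fn find(&self, file_name: &str) -> Option<PathBuf>;
    /// The single directory new packages are written to.
    fn writable_dir(&self) -> &Path;
}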

rattler_solve, NUL characters and robustness

We are currently using libsolv as the solver backend in rattler_solve through FFI calls. When passing information about the available packages to libsolv, parts of the data (such as a package's build string or license) need to be converted from a Rust string to a NUL-terminated string (using the CString Rust type). This assumes that the original string does not contain NUL characters, which is not always guaranteed.

In its current state, rattler is unable to solve an environment if the repodata.json for a particular channel includes strings with \u0000. Adding the following package to test-data/channels/dummy/linux-64/repodata.json, for instance, causes all tests related to that file to fail:

"baz-1.0-unix_py36h1af98f8_2\u0000.tar.bz2": {
  "build": "unix_py36h1af98f8_2\u0000",
  "build_number": 1,
  "depends": [
    "__unix"
  ],
  "license": "MIT",
  "license_family": "MIT",
  "md5": "bc13aa58e2092bcb0b97c561373d3905",
  "name": "bar",
  "sha256": "97ec377d2ad83dfef1194b7aa31b0c9076194e10d995a6e696c9d07dd782b14a",
  "size": 414494,
  "subdir": "linux-64",
  "timestamp": 1605110689658,
  "version": "1.2.3"
}

I see two ways to go about this:

  1. Ignore the issue. Other C-based tools in the conda ecosystem would probably break as well if there are NUL escapes in repodata.json, and maybe we can just trust the maintainers of a channel to take that into consideration.
  2. Use a custom method to construct CStrings that are always valid, by replacing NUL characters with something else (similar to String::from_utf8_lossy). I will push a PR for this shortly, so you can see what it would look like.
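
A sketch of option 2: always produce a valid CString by replacing interior NUL bytes, in the spirit of String::from_utf8_lossy (the replacement character is an arbitrary choice):

use std::ffi::CString;

fn c_string_lossy(s: &str) -> CString {
    // Replace interior NUL bytes so CString::new cannot fail.
    let sanitized: Vec<u8> = s
        .bytes()
        .map(|b| if b == 0 { b'_' } else { b })
        .collect();
    CString::new(sanitized).expect("interior NUL bytes were just replaced")
}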

Double check that libabseil matchspec matches a package

I thought that we had to change the parsing of matchspecs, but after double-checking it seems to be correct. However, we had some resolve issues with libabseil and constraints of the following sort:

libabseil 20230125.0 cxx17* should match libabseil-20230125.0-cxx17_hb7217d7_0.

I'll try to add a test for this.
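
A sketch of such a test (the record construction is elided; this assumes MatchSpec parsing and matching as exposed by rattler_conda_types):

#[test]
fn libabseil_spec_matches() {
    let spec: rattler_conda_types::MatchSpec =
        "libabseil 20230125.0 cxx17*".parse().unwrap();
    // ...construct a record for libabseil-20230125.0-cxx17_hb7217d7_0 and
    // assert that `spec` matches it.
}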

`~=2.4,<4` is evaluated wrongly

The MatchSpec currently also matches 3.1 and other versions. In fact, the <4 should be completely ignored, and the matchspec should match only 2.4.*.

Parse `platform` into `Option<Platform>` instead of `Option<String>`

In the IndexJson struct we could parse platform (or rather the subdir field) into the Platform enum (with an escape hatch for Platform::Other(String), I would argue).

Similarly, I would say that we can remove the arch and subdir fields and only reconstruct them for serialization.

Basically, the relationship is as follows: subdir is <platform>-<arch>, with the special case that an arch of 64 maps to x86_64 and an arch of 32 maps to x86.
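
A small illustration of that mapping, splitting a subdir like linux-64 into platform and arch with the x86 special cases:

fn split_subdir(subdir: &str) -> Option<(&str, &str)> {
    let (platform, arch) = subdir.split_once('-')?;
    let arch = match arch {
        "64" => "x86_64",
        "32" => "x86",
        other => other, // e.g. aarch64, ppc64le
    };
    Some((platform, arch))
}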

Channel in `RepoDataRecord` is a simple String

Should we create a Channel type that we can use in RepoDataRecord, storing either the channel itself or a smart pointer to a shared / cached channel? Or should we type it as a URL?

It would be nice to be able to call a .url() and .name() method on the channel to get the short or expanded versions.

Test: Check invalidation of matchspecs that are formatted incorrectly

We're looking to bolster the robustness of rattler's matchspec implementation, particularly in the area of input parsing. While we currently have some testing mechanisms in place, we'd like to ensure we are fully covered, especially with respect to certain error scenarios.

Two primary errors that need to be checked for are:

  • StringMatcherParseError
  • ParseMatchSpecError

We recommend taking a look at Conda's tests for some inspiration:
https://github.com/conda/conda/blob/9e8425844a28ffad0c4a3adcf28a2e769f965947/tests/models/test_match_spec.py

This is a fantastic opportunity for anyone looking to make their first contribution, as this issue is primarily about adding more tests. We welcome incremental progress, so don't feel pressured to craft a massive Pull Request; a single test at a time is perfectly fine!

If you have any questions or need any assistance, feel free to ask here or join our conversation on Discord. We're excited to see your contributions!

Make PythonFormatter pub?

We also need to dump a JSON struct in rattler-build, and ideally it should look the same as Python's JSON dumps. It could be useful to reuse the PythonFormatter there, too.

Support conda-lock v2 file format

The file format for conda-lock seems to have changed slightly: instead of having all locked packages under packages, it's now grouped by platform first.

Package cache locking

Multiple processes are able to write to the package cache at the same time. At this point, they are completely unaware of each other. This could cause problems when multiple processes try to write to the same cache.

To work around this issue we want to introduce file locking. Conda and Mamba both already have a system in place to facilitate this. We should mimic their behavior for compatibility to ensure that when both Rattler and Conda/Mamba write to the cache there are no issues.

@wolfv about the mamba implementation:

The different operating systems (UNIX and Windows) provide methods to lock files (e.g. on UNIX it's fcntl). We started out with the idea of writing the PID of the locking process into the file, but that was pretty brittle. We can just rely on the OS to make sure a file is locked or unlocked.

There are a few crates for this, but I don't know how good they are.

Network drives don't support file locking, so we would need a different solution there.
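
A sketch using one of those crates (fs2 is shown here as an example): take an exclusive OS-level lock on a marker file before writing; the lock is released when the handle is closed.

use fs2::FileExt;
use std::fs::File;
use std::path::Path;

fn lock_cache(cache_dir: &Path) -> std::io::Result<File> {
    let file = File::create(cache_dir.join(".lock"))?;
    // Blocks until the lock is acquired (fcntl on UNIX, LockFileEx on Windows).
    file.lock_exclusive()?;
    Ok(file)
}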

Handle "file://" URLs in package cache

The default reqwest client doesn't seem to handle file:// URLs. I think we could fix that in the package cache implementation (or should we make a client wrapper?).
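
A sketch of handling it in the cache layer: short-circuit file:// URLs before they reach reqwest (the function name is illustrative):

// Returns Some(contents) for file:// URLs, None for anything that should
// still go through the HTTP client.
fn read_local(url: &url::Url) -> Option<std::io::Result<Vec<u8>>> {
    if url.scheme() == "file" {
        let path = url.to_file_path().ok()?;
        Some(std::fs::read(path))
    } else {
        None
    }
}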

register environments globally?

Since environments link to files from a central package cache, we need a mechanism to clean that cache from time to time. For that, it would be great to register existing environments in a json file or similar somewhere. Maybe we should have this mechanism in rattler (if we expect the rattler default cache directory to be used often).

PathsJson, make sure that we serialize in alphabetical order

To improve the reproducibility of packages, we should serialize the paths in alphabetical order. This means we need to sort the Vec<PathsEntry> alphabetically on the _path attribute before serializing.

Similarly, the has_prefix and files text files should always be sorted deterministically (alphabetically).
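
A sketch of the sort, with an illustrative stand-in for the paths.json entry type; relative_path is whatever serializes as _path:

use std::path::PathBuf;

struct PathsEntry {
    relative_path: PathBuf,
}

// Deterministic ordering before serialization.
fn sort_entries(entries: &mut [PathsEntry]) {
    entries.sort_by(|a, b| a.relative_path.cmp(&b.relative_path));
}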
