mamba-org / rattler
Rust crates to work with the Conda ecosystem.
License: BSD 3-Clause "New" or "Revised" License
When executing JLAP one can notice the synchronous execution of the patching step. We should wrap the deserialization, patching and serialization steps in tokio::task::spawn_blocking.
We prototyped this yesterday and one of the challenges is the ownership & lifetime of some of the variables. One workaround would be to use the https://docs.rs/async-scoped/latest/async_scoped/ crate.
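A minimal sketch of what that could look like, assuming the raw repodata and the patches are owned values that can be moved into the closure (spawn_blocking requires 'static) and using the json-patch crate for the patch step:

use serde_json::Value;

async fn apply_jlap_patches(raw: String, patches: Vec<json_patch::Patch>) -> anyhow::Result<String> {
    // Run the CPU-heavy deserialize/patch/serialize pipeline on a blocking
    // thread so it does not stall the async executor.
    tokio::task::spawn_blocking(move || -> anyhow::Result<String> {
        let mut doc: Value = serde_json::from_str(&raw)?;
        for patch in &patches {
            json_patch::patch(&mut doc, patch)?;
        }
        Ok(serde_json::to_string(&doc)?)
    })
    .await? // a JoinError here means the blocking task panicked

}

Moving ownership into the closure sidesteps the lifetime issue at the cost of cloning anything the caller still needs; async-scoped would avoid that clone.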
When installing numpy on macOS ARM64, the new libsolv implementation picks libgfortran5-12.2.0-h0eea778_32. Conda and mamba choose libgfortran5-12.3.0-ha3a6a3e_0.
I wonder if that has to do with libgfortran-5.0.0-12_3_0_hd922786_0 vs. libgfortran-5.0.0-12_2_0_hd922786_32. If we compare / sort by build number vs. build string we might pick the older one (because it has a much higher build number).
Currently a rattler user has to choose the cache dir, but I think it would be convenient if we exported the default value used (dirs::cache_dir().join("rattler/cache")).
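A minimal sketch of such a helper (the name default_cache_dir is hypothetical):

use std::path::PathBuf;

/// The default cache directory rattler uses when the user does not choose one.
pub fn default_cache_dir() -> Option<PathBuf> {
    Some(dirs::cache_dir()?.join("rattler/cache"))
}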
Our matchspec implementation currently doesn't implement build number comparison, like >6 in python 3.9.3[build_number=">6"]. This would be a nice addition!
Conda implementation: https://github.com/conda/conda/blob/main/conda/models/version.py#L613
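A minimal sketch of what such a matcher could look like (the type and its names are hypothetical, not the actual rattler or conda API):

/// A build number constraint such as `>6`, `>=6` or a plain `6`.
enum BuildNumberSpec {
    Eq(u64),
    Gt(u64),
    Ge(u64),
    Lt(u64),
    Le(u64),
}

impl BuildNumberSpec {
    fn parse(spec: &str) -> Option<Self> {
        if let Some(rest) = spec.strip_prefix(">=") {
            rest.parse().ok().map(Self::Ge)
        } else if let Some(rest) = spec.strip_prefix("<=") {
            rest.parse().ok().map(Self::Le)
        } else if let Some(rest) = spec.strip_prefix('>') {
            rest.parse().ok().map(Self::Gt)
        } else if let Some(rest) = spec.strip_prefix('<') {
            rest.parse().ok().map(Self::Lt)
        } else {
            spec.parse().ok().map(Self::Eq)
        }
    }

    fn matches(&self, build_number: u64) -> bool {
        match *self {
            Self::Eq(n) => build_number == n,
            Self::Gt(n) => build_number > n,
            Self::Ge(n) => build_number >= n,
            Self::Lt(n) => build_number < n,
            Self::Le(n) => build_number <= n,
        }
    }
}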
Some parts of rattler use the extendhash crate. However, the same functionality is also provided by the md-5 and sha2 crates, which are also used. Since md-5 and sha2 seem more complete we should get rid of extendhash.
For reproducibility reasons we want to ensure that fields are serialized in a deterministic (alphabetical) order.
To achieve this, we have a macro in rattler_macros
that ensures that all struct fields are sorted alphabetically.
We need to apply this to more structs, such as:
IndexJson
AboutJson
It would be nice to implement the jlap format for partial repodata updates (using a specific sequence of JSON patch applications). That makes interactive usage faster because only a subset of the entire data needs to be fetched over the network.
Conda added support for FreeBSD (conda/conda#12647). We should also add it.
In rattler-build we have the pin_subpackage and pin_compatible functions. They work by specifying a max_pin and min_pin with x.x.x syntax (where each x stands for one version element).
If we have a package A with version 1.2.3.4 and we use pin_subpackage(A, max_pin="x.x", min_pin="x.x") we are expected to convert this to A >=1.2,<1.3.
For this it would be ideal if we could extract sub-versions based on an index (e.g. version[..2] to get the first two elements, and then also version[..2].bump() to create the upper bound), as in the sketch below.
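A minimal sketch of the desired semantics using plain string manipulation (sub_version and bump are hypothetical helpers, not the real Version API):

/// Keep only the first `n` elements: sub_version("1.2.3.4", 2) == "1.2".
fn sub_version(version: &str, n: usize) -> String {
    version.split('.').take(n).collect::<Vec<_>>().join(".")
}

/// Increment the last element: bump("1.2") == "1.3".
fn bump(version: &str) -> String {
    let mut parts: Vec<u64> = version.split('.').filter_map(|p| p.parse().ok()).collect();
    if let Some(last) = parts.last_mut() {
        *last += 1;
    }
    parts.iter().map(|p| p.to_string()).collect::<Vec<_>>().join(".")
}

fn main() {
    // pin_subpackage(A, max_pin="x.x", min_pin="x.x") on version 1.2.3.4:
    let lower = sub_version("1.2.3.4", 2);
    assert_eq!(format!("A >={lower},<{}", bump(&lower)), "A >=1.2,<1.3");
}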
It would be cool if we integrated dependabot to get automatic dependency updates.
I think usize is platform dependent and will be a u32 on 32-bit systems and a u64 on 64-bit systems. I think that doesn't make much sense for structs that are parsed from or written to JSON, since it'd be better to strictly encode the maximum size for these integers regardless of the host CPU architecture.
WDYT @baszalmstra?
What I meant to write was >=3.8,<3.9 but missing the , gives you really weird behavior, as it only uses the >=3.8. Conda invalidates this request, as should rattler.
Should we parse depends and constrains into matchspecs in IndexJson or repodata?
For type safety and because it is nice, we should deserialize the timestamp to the proper timestamp type. @baszalmstra prefers chrono::DateTime<Utc>.
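A minimal sketch, assuming the on-disk timestamp is milliseconds since the Unix epoch (as it is in repodata):

use chrono::{DateTime, TimeZone, Utc};

fn parse_timestamp(millis: i64) -> Option<DateTime<Utc>> {
    // `single()` rejects out-of-range or ambiguous values.
    Utc.timestamp_millis_opt(millis).single()
}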
We want to also expose the awesomeness of this library to Python so it because easy to use from within the Python ecosystem as well.
Currently we're storing SHA256 and MD5 hashes as strings. I am wondering if it would be better to store them as typed byte arrays and de-/serialize them at the serde level only?
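A minimal sketch of the serialization half, keeping the bytes typed and only rendering hex at the serde boundary (serialize_sha256 is a hypothetical helper that could be wired up with #[serde(serialize_with = "...")] on the struct field):

use serde::Serializer;

/// Serialize a raw SHA256 digest as a lowercase hex string.
fn serialize_sha256<S: Serializer>(bytes: &[u8; 32], serializer: S) -> Result<S::Ok, S::Error> {
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    serializer.serialize_str(&hex)
}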
Only noarch should always be required -- a missing win-64, linux-64, etc. folder should be ignored.
To improve the chances of reproducible packages, we should always make sure that the map keys of the JSON files are sorted alphabetically when being serialized, e.g. when writing out an index.json, about.json or paths.json file.
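As a minimal illustration, serde_json already writes BTreeMap keys in sorted order, while HashMap iteration order is arbitrary:

use std::collections::BTreeMap;

fn main() {
    let mut depends = BTreeMap::new();
    depends.insert("zlib", "1.2.13");
    depends.insert("bzip2", "1.0.8");
    // BTreeMap iterates in key order, so the output is deterministic.
    assert_eq!(
        serde_json::to_string(&depends).unwrap(),
        r#"{"bzip2":"1.0.8","zlib":"1.2.13"}"#
    );
}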
I wonder if we should parse the contents of RunExportsJson as Vec<MatchSpec>? The only issue I have is that we can't guarantee roundtrip parsing (e.g. I think the output of matchspec to str is sometimes different, but equivalent, from what it parses).
I think currently the files in the cache aren't updated if a package with the same name (but different hash) is requested. Happens mostly when building new packages locally :)
Currently, the tests for package extraction just check if the process doesn't error out. This doesn't really test if a package is properly extracted.
The tests for validating packages do use the same code to extract a package and validate its content but it would be nice if we could compute a sha256 hash of an extracted package and validate that that is indeed what we would expect.
We can use the rstest crate to create test cases with the package to extract and an expected hash as input, e.g.:
#[rstest]
#[case("some_package.conda", "somecomplexshahashnumber")]
fn validate_extracted_packages(#[case] path: &str, #[case] hash: &str) {
    // ...
}
When developing this feature some care is required to ensure that symlinks are properly hashed. I think it would be better to hash the link itself instead of the content it points to. This is because ../a and ./../a refer to the same file but are different symlinks.
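A minimal sketch of feeding the raw link target into the hash instead of following the link (hash_entry is a hypothetical helper):

use sha2::{Digest, Sha256};
use std::path::Path;

fn hash_entry(path: &Path, hasher: &mut Sha256) -> std::io::Result<()> {
    let metadata = std::fs::symlink_metadata(path)?; // does not follow links
    if metadata.file_type().is_symlink() {
        // Hash the link target itself, so ../a and ./../a hash differently.
        hasher.update(std::fs::read_link(path)?.as_os_str().as_encoded_bytes());
    } else {
        hasher.update(std::fs::read(path)?);
    }
    Ok(())
}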
Unfortunately, there is at least one instance of a MatchSpec that currently crashes rattler: jinja2 >2.10* in jupyter-server. We could either parse this as 2.10* or we patch the repodata.
Support for matching package records by hash is missing. MatchSpec does not include these fields, and parsing them should also be added. This should work:
python[sha256=deadbeef, md5=something]
Currently, when using key-value pairs in matchspecs we require there to be quotes around the values. This shouldn't actually be necessary; it would be nice if we could refactor the matchspec parsing a little to make sure this is not required.
So our parser currently only accepts:
foo[build="py2*"]
but it should also support:
foo[build=py2*]
This can be a little tricky in the case of version specs, because there it's ambiguous whether the comma indicates that a next key should follow or whether it's part of the version spec. We should check how conda handles this:
foo[version=1.3,2.0]
All public methods in rattler crates should have proper docs. We can force this with #![warn(missing_docs)] or even #![deny(missing_docs)]. Some crates already have this but we need to play catch-up for some of the others.
conda/mamba know about strict and less strict channel priority and I think we should (at least) implement strict channel priority.
Instead of iterating over all archives in the package extraction tests we can use the rstest crate to create individual test cases for all archives. This will give us better insight into exactly which package caused a test failure. It will likely also give us a little better overview of which test cases take a long time.
By using cases from rstest we do lose the ability to use a glob pattern to find all test cases, but I think it would also be good to not test all package archives contained in this repository per se, and to use more targeted tests with some problematic cases instead.
So instead of something like this (current implementation):
#[test]
fn test_extract_conda() {
    let temp_dir = Path::new(env!("CARGO_TARGET_TMPDIR"));
    println!("Target dir: {}", temp_dir.display());
    for file_path in
        find_all_archives().filter(|path| ArchiveType::try_from(path) == Some(ArchiveType::Conda))
    {
        // ..
    }
}
We do something like this:
#[rstest]
#[case("conda-22.11.1-py38haa244fe_1.conda")]
#[case("mamba-1.1.0-py39hb3d9227_2.conda")]
fn test_extract_conda(#[case] input: &str) {
    // ..
}
We can also use rstest_reuse to reuse some of the test cases.
We use very dumb logic to detect the current shell. I would like to have a proper implementation. We can base it on: https://docs.rs/clap_complete/latest/clap_complete/shells/enum.Shell.html#method.from_env
Ideally, we also have a function to return the "default" shell for the platform. This can then be used when spawning a shell when the current one cannot be used.
conda / mamba have a flag to add pip as a python dependency. This isn't done in the repodata to prevent a circular dependency situation (I think), but the package manager does add it "on the fly".
This is useful because most people / developers expect pip to be installed alongside Python, and it's also good if it's pulled in when another package that depends on Python is installed. E.g. installing some noarch package like rich should also automatically install pip.
Conda and Mamba both support having multiple cache directories. Rattler currently only supports a single directory. The "layered" cache can have multiple readable directories and only one writable directory. Conda/Mamba tries to write a magical file to all these directories at startup to determine which one is writable.
Rattler should also facilitate "layered" package caches. We could introduce a trait for PackageCache that is implemented for both the current implementation as well as a layered version.
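A minimal sketch of what such a trait could look like (all names are hypothetical):

use std::path::PathBuf;

/// One layer of a package cache; possibly read-only.
trait PackageCacheLayer {
    /// Look up a package in this layer, returning its path if present.
    fn find(&self, pkg: &str) -> Option<PathBuf>;
    /// Whether new packages may be written to this layer.
    fn is_writable(&self) -> bool;
}

/// A cache composed of several read layers; earlier layers shadow later ones.
struct LayeredPackageCache {
    layers: Vec<Box<dyn PackageCacheLayer>>,
}

impl LayeredPackageCache {
    fn find(&self, pkg: &str) -> Option<PathBuf> {
        self.layers.iter().find_map(|layer| layer.find(pkg))
    }
}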
Apparently it's possible to give zstd a hint for the "decompressed size" of the tarball and that is supposed to improve memory usage when decoding.
The conda team added this as described here: https://conda.discourse.group/t/conda-package-streaming-0-8-0-and-conda-package-handling-2-1-0-released/267
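On the Rust side, a minimal sketch using the zstd crate's bulk API, assuming the decompressed size is known up front (e.g. read from the frame header or package metadata):

fn decompress_with_hint(compressed: &[u8], decompressed_size: usize) -> std::io::Result<Vec<u8>> {
    // The capacity parameter lets zstd allocate the output buffer once.
    zstd::bulk::decompress(compressed, decompressed_size)
}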
We are currently using libsolv as the solver backend in rattler_solve through FFI calls. When passing the information about the available packages to libsolv, parts of the data (such as a package's build string or license) need to be converted from a Rust string to a NUL-terminated string (using the CString Rust type). This assumes that the original string does not contain NUL characters, which is not always guaranteed.
In its current state, rattler is unable to solve an environment if the repodata.json for a particular channel includes strings with \u0000. Adding the following package to test-data/channels/dummy/linux-64/repodata.json, for instance, causes all tests related to that file to fail:
"baz-1.0-unix_py36h1af98f8_2\u0000.tar.bz2": {
"build": "unix_py36h1af98f8_2\u0000",
"build_number": 1,
"depends": [
"__unix"
],
"license": "MIT",
"license_family": "MIT",
"md5": "bc13aa58e2092bcb0b97c561373d3905",
"name": "bar",
"sha256": "97ec377d2ad83dfef1194b7aa31b0c9076194e10d995a6e696c9d07dd782b14a",
"size": 414494,
"subdir": "linux-64",
"timestamp": 1605110689658,
"version": "1.2.3"
}
I see two alternatives to go about this:
- Construct CStrings that are always valid, by replacing NUL characters with something else (similar to String::from_utf8_lossy). I will push a PR for this shortly, so you can see how it would look.
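For the CString alternative, a minimal sketch of the lossy conversion (the replacement character is an arbitrary choice here):

use std::ffi::CString;

/// Build a CString that can never fail, by replacing interior NUL characters.
fn c_string_lossy(s: &str) -> CString {
    CString::new(s.replace('\0', "\u{FFFD}")).expect("no interior NUL bytes remain")
}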
I thought that we had to change the parsing of matchspecs, but after double-checking it seems to be correct. However, we had some resolve issues with libabseil and constraints of the following sort:
libabseil 20230125.0 cxx17*
should match libabseil-20230125.0-cxx17_hb7217d7_0. I'll try to add a test for this.
The MatchSpec currently also matches 3.1 or other versions. In fact, the <4 should be completely ignored, and the matchspec should match only 2.4.*.
In the IndexJson struct we could parse platform (or rather the subdir field) into the Platform enum (with an escape hatch for Platform::Other(String), I would argue).
Similarly, I would say that we can remove the arch and subdir fields and only reconstruct them for serialization?
Basically, the relationship is as follows: subdir is <platform>-<arch>, with the special case that for 64 the arch is x86_64 and for 32 the arch is x86.
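A minimal sketch of reconstructing subdir from the two parts (a hypothetical free function, not the rattler API):

/// Rebuild `subdir` from platform and arch, e.g. ("linux", "x86_64") -> "linux-64".
fn subdir(platform: &str, arch: &str) -> String {
    let arch = match arch {
        "x86_64" => "64",
        "x86" => "32",
        other => other,
    };
    format!("{platform}-{arch}")
}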
Should we create a Channel type that we can use in RepoDataRecord and store it, or a smart pointer to a shared / cached channel? Or should we type it as a URL?
It would be nice to be able to call a .url() and .name() method on the channel to get the short or expanded versions.
We're looking to bolster the robustness of Rattler's matchspec implementation, particularly in the area of input parsing. While we currently have some testing mechanisms in place, we'd like to ensure we are fully covered, especially with respect to certain error scenarios.
Two primary errors that need to be checked for are:
StringMatcherParseError
ParseMatchSpecError
We recommend taking a look at Conda's tests for some inspiration:
https://github.com/conda/conda/blob/9e8425844a28ffad0c4a3adcf28a2e769f965947/tests/models/test_match_spec.py
This is a fantastic opportunity for anyone looking to make their first contribution, as this issue is primarily about adding more tests. We welcome incremental progress, so don't feel pressured to craft a massive Pull Request - a single test at a time is perfectly fine!
If you have any questions or need any assistance, feel free to ask here or join our conversation on Discord. We're excited to see your contributions!
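A minimal sketch of the kind of test this issue asks for, assuming MatchSpec implements FromStr (the invalid input and the expected error variant should be checked against the actual parser):

use rattler_conda_types::MatchSpec;
use std::str::FromStr;

#[test]
fn rejects_malformed_matchspec() {
    // A bracket key without a value should fail to parse.
    assert!(MatchSpec::from_str("python[version=]").is_err());
}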
We also need to dump a JSON struct in the rattler-build part, and ideally it looks the same as the Python JSON dumps. It could be useful to reuse the PythonFormatter there, too.
The file format for conda-lock seems to have changed slightly. Instead of having all locked packages under packages, it's now first grouped by platform.
We should implement the solver error message algorithm that's found in mamba (for the libsolv solver).
We can enable the zstdmt feature on the zstd crate and use multithreading, as described here:
https://docs.rs/zstd/latest/zstd/stream/write/struct.Encoder.html#method.multithread
This applies to the package writing functions here: https://github.com/mamba-org/rattler/blob/main/crates/rattler_package_streaming/src/write.rs
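A minimal sketch following the linked docs (requires the zstdmt cargo feature on the zstd crate):

use std::io::Write;

fn compress_multithreaded(dest: std::fs::File, data: &[u8]) -> std::io::Result<()> {
    let mut encoder = zstd::stream::write::Encoder::new(dest, 19)?;
    encoder.multithread(4)?; // number of zstd worker threads
    encoder.write_all(data)?;
    encoder.finish()?;
    Ok(())
}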
Multiple processes are able to write to the package cache at the same time. At this point, they are completely unaware of each other. This could cause problems when multiple processes try to write to the same cache.
To work around this issue we want to introduce file locking. Conda and Mamba both already have a system in place to facilitate this. We should mimic their behavior for compatibility to ensure that when both Rattler and Conda/Mamba write to the cache there are no issues.
@wolfv about the mamba implementation:
The different operating systems (UNIX and Windows) support methods to lock files (e.g. on unix it's fcntl whatever). We started out with the idea of writing the PID of the locking process into the file but that was pretty brittle. We can just rely on the OS to make sure a file is locked or unlocked.
There are a few crates for this but I don't know how good they are.
Network drives don't support file locking so we would need a different solution there.
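A minimal sketch using the fs2 crate (just one of the candidate crates; this does not yet mimic the conda/mamba lock file layout, and the network-drive caveat still applies):

use fs2::FileExt;
use std::fs::OpenOptions;
use std::path::Path;

fn with_cache_lock(cache_dir: &Path, f: impl FnOnce() -> std::io::Result<()>) -> std::io::Result<()> {
    let lock_file = OpenOptions::new()
        .create(true)
        .write(true)
        .open(cache_dir.join(".lock"))?;
    lock_file.lock_exclusive()?; // blocks until no other process holds the lock
    let result = f();
    lock_file.unlock()?;
    result
}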
I think we need to expose it as a feature to be able to use reqwest with rustls-tls (and not openssl), correct?
The tuple field of the NoArchType is private and that makes it hard to construct it IIUC.
The default reqwest client doesn't seem to handle file URLs. I think we could fix that in the package cache implementation (or should we make a client wrapper?)
We currently keep the hashes as String, but it would be nice for type safety to deserialize them into proper types. We have rattler_digest::Sha256Hash and Md5Hash now that serve this purpose in PathsJson.
Since environments link to files from a central package cache, we need a mechanism to clean that cache from time to time. For that, it would be great to register existing environments in a json file or similar somewhere. Maybe we should have this mechanism in rattler (if we expect the rattler default cache directory to be used often).
To improve the reproducibility of packages, we should serialize the paths in alphabetical order. This means we need to sort the Vec<PathsEntry> alphabetically on the _path attribute before serializing.
Similarly, the has_prefix and files text files should always be sorted deterministically (alphabetically).
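A minimal sketch of the sorting step, assuming the PathsEntry field that serializes as _path is called relative_path (check the actual struct definition):

use rattler_conda_types::package::PathsEntry;

fn sort_paths(entries: &mut [PathsEntry]) {
    // Deterministic order: sort on the field that serializes as `_path`.
    entries.sort_by(|a, b| a.relative_path.cmp(&b.relative_path));
}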