Comments (5)
verify was never tested for anything other than standard POSIX file systems. For non-POSIX, I'm open to many ideas.
I'm happy to add checkpointing support to verify as a temporary workaround (to pick up where it left off when interrupted).
Now, in regards to your idea:
- I don't like the guessing notion of your idea
- Python package metadata and naming is not super clean ... it's an old language

Python packages and their naming + metadata aren't as strictly enforced as one would like, so making guesses is going to land you in a bad place.
Now that we have the simple API in JSON, maybe pull that index file you've generated, then work through each project from that list and use the success/failed checkpoint feature I alluded to above. Due to the size of the mirrors, it's hard to do this efficiently without getting more metadata and applying it to PyPI, every time I've thought about it. But happy to be proven wrong.
from bandersnatch.
I have not found any clear MUST for project name and package name, so I think you're probably right about the "guessing": we should not do that.
So about the checkpoint part, my thoughts are as follows:
Separate verify into stages
- scan all JSON files to get the project list and the designated package list
- delete projects that no longer exist
- scan all package files to get the actual package list
- compare the designated package list with the actual package list, deleting or downloading packages as needed
We can have persistent storage in both stage 1 and stage 3, storing the package list and project list in text files, one project/package per line; if there are any properties (active, inactive, update time, etc.), separate them with commas, or maybe just use CSV to keep it simple.
During stages 1 and 3, there's also a checkpoint to indicate which project/package is next; my suggestion would be to start a new file every 10000 projects/packages, so an interruption would lose at most 10000 projects' worth of progress. The file structure would look like:
verify_data/
├── designated_package
│   ├── package_list.csv.1
│   └── package_list.csv.2
├── package
│   ├── package_list.csv.1
│   └── package_list.csv.2
└── project
    ├── project_list.csv.1
    └── project_list.csv.2
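The chunked CSV writing could be sketched roughly like this (a hypothetical helper, not existing bandersnatch code; `CHUNK_SIZE` matches the 10000-entry checkpoint interval suggested above):

```python
import csv
import os

CHUNK_SIZE = 10_000  # checkpoint interval from the proposal above


def write_chunked_list(out_dir: str, prefix: str, rows) -> int:
    """Write (name, *properties) rows as prefix.csv.1, prefix.csv.2, ...
    one file per CHUNK_SIZE rows, so an interruption loses at most one chunk.
    Returns the number of chunk files written."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_no, writer, fh = 0, None, None
    for i, row in enumerate(rows):
        if i % CHUNK_SIZE == 0:
            if fh:
                fh.close()  # a finished chunk is durable on disk
            chunk_no += 1
            path = os.path.join(out_dir, f"{prefix}.csv.{chunk_no}")
            fh = open(path, "w", newline="")
            writer = csv.writer(fh)
        writer.writerow(row)
    if fh:
        fh.close()
    return chunk_no
```

For example, 25000 packages would land in three files, the last holding 5000 rows.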
Stage 2 and stage 4 are similar: read the files generated in the previous stage, load them into memory (a dict, for O(1) lookups), and compare.
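Those two comparison stages essentially reduce to a set difference; a minimal sketch (the function name is made up for illustration):

```python
def diff_lists(designated: set, actual: set):
    """Compare the designated list (from the JSON metadata scan) with the
    actual on-disk list. Returns (to_download, to_delete)."""
    to_download = designated - actual  # expected but missing on disk
    to_delete = actual - designated    # on disk but no longer expected
    return to_download, to_delete
```

For example, `diff_lists({"a", "b", "c"}, {"b", "c", "d"})` would schedule `a` for download and `d` for deletion.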
Restore from interruption
If verify is interrupted, first determine which stage we were in. We can have a separate verify_status.json file recording the current stage.
Read the last line of the last package_list.csv.* or project_list.csv.* file, then iterate the source from the beginning, skipping entries until we reach that checkpoint, and continue processing from there.
If we were interrupted during stage 2 or stage 4, that's fine; these two stages are both fast: just read the designated list and the actual list, compare them, and schedule the deletions.
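The resume logic could look something like this (`verify_status.json`, the directory layout, and `load_resume_point` are all assumptions from this proposal, not existing bandersnatch behavior):

```python
import json
import os


def load_resume_point(data_dir: str):
    """Find where an interrupted run stopped. Returns (stage, last_name):
    last_name is the final checkpointed entry in the newest chunk file of
    the interrupted stage, or None for a fresh run / restartable stage."""
    status_path = os.path.join(data_dir, "verify_status.json")
    if not os.path.exists(status_path):
        return 1, None  # fresh run: start at stage 1
    with open(status_path) as f:
        stage = json.load(f)["stage"]
    subdir = {1: "project", 3: "package"}.get(stage)
    if subdir is None:  # stages 2 and 4 are cheap: just rerun them
        return stage, None
    chunk_dir = os.path.join(data_dir, subdir)
    # chunk files are named e.g. project_list.csv.1, .csv.2, ...
    chunks = sorted(os.listdir(chunk_dir), key=lambda n: int(n.rsplit(".", 1)[1]))
    if not chunks:
        return stage, None
    last = None
    with open(os.path.join(chunk_dir, chunks[-1])) as f:
        for line in f:
            last = line
    return stage, last.split(",")[0] if last else None
```

The caller would then iterate its input from the beginning and skip entries until it sees `last_name` again.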
from bandersnatch.
My other thought is using tables in a SQLite database for the state, committing as you go with something like aiosqlite, moving rows from a simple todo table to a done table. Or just have a run id you use until finished. But I'm open to ideas.
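A synchronous sqlite3 sketch of that todo/done idea (the comment suggests aiosqlite; the schema and function names here are illustrative, not existing bandersnatch code):

```python
import sqlite3


def init_state_db(path: str) -> sqlite3.Connection:
    """Every package starts in `todo`; as verify finishes one it moves to
    `done`. Use ':memory:' for path to try it without touching disk."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS todo (name TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS done (name TEXT PRIMARY KEY)")
    db.commit()
    return db


def mark_done(db: sqlite3.Connection, name: str) -> None:
    db.execute("DELETE FROM todo WHERE name = ?", (name,))
    db.execute("INSERT OR IGNORE INTO done (name) VALUES (?)", (name,))
    db.commit()  # commit per package so a crash loses at most one item
```

On restart, whatever is still in `todo` is exactly the remaining work, with no chunk-file bookkeeping needed.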
The process seems sane. How do you want to structure the code? Refactor verify.py? I've always wanted to merge more of it into other classes and use the storage plugins directly for POSIX filesystems vs. other storage. Let's get a similar agreement there before you get stuck into writing it. I'm happy to chat on Discord too if you want more real-time discussion.
from bandersnatch.
I think everything we're doing here is to make the global info sharable across all stages of the verify process.
If the project accepts an external database or an embedded database, I'd recommend these options:
- sqlite3: rather simple
- RocksDB: similar to sqlite3, but pure KV, so no SQL knowledge required
- external databases like MySQL or PostgreSQL (not recommended; they make this project heavier)
The database file must be placed on a POSIX filesystem, as SQLite/RocksDB on S3 would be inefficient.
About the code structure, I don't have many thoughts right now; I think maybe I can create a demo first.
from bandersnatch.
I tried to use verify on my host and it failed: it took 7 days and 20 GB+ of memory, and was finally OOM-killed by the kernel.
There are a few problems that I want to solve:
- packages may have changed during such a long run; we should scan packages first to make sure no file gets deleted wrongly
- memory usage is too high; maybe I can get a package list first, then scan the JSON files, and for each file that exists in a JSON, delete it from the package list. When the JSON iteration is over, what's left in the package list would be the packages that need cleaning. That would make the memory usage shrink steadily during the iteration.
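That shrinking-set pass could be sketched like this (hypothetical helper; `referenced` stands in for streaming filenames out of the JSON metadata):

```python
def find_stale_packages(on_disk, referenced):
    """Start with the full on-disk package set and discard every file that
    a JSON metadata record still references; whatever survives the
    iteration is safe to clean up. The working set only ever shrinks, so
    peak memory is bounded by the initial on-disk listing."""
    remaining = set(on_disk)
    for name in referenced:       # stream names, one JSON file at a time
        remaining.discard(name)   # no-op if already removed or unknown
    return remaining
```

For example, with `{"a.whl", "b.whl", "c.whl"}` on disk and only `a.whl` and `b.whl` referenced, `c.whl` is the one left to delete.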
from bandersnatch.