pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/

License: Academic Free License v3.0

Python 99.61% Dockerfile 0.39%
mirror mirroring pypi pypi-mirror-client pep381 bandersnatch

bandersnatch's Introduction


This is a PyPI mirror client according to PEP 381 + PEP 503 + PEP 691 http://www.python.org/dev/peps/pep-0381/.

  • bandersnatch >=6.0 implements PEP691
  • bandersnatch >=4.0 supports Linux, MacOSX + Windows
  • Documentation

bandersnatch maintainers are looking for more help! Please refer to our MAINTAINER documentation to see the roles and responsibilities. We also ask that you read our Mission Statement to ensure it aligns with your thoughts for this project.

  • If interested contact @cooperlees

Installation

The following instructions will place the bandersnatch executable in a virtualenv under bandersnatch/bin/bandersnatch.

  • bandersnatch requires >= Python 3.8.0

Docker

This will pull the latest build. Please use a specific tag if desired.

  • Docker image includes /bandersnatch/src/runner.py to periodically run a bandersnatch mirror
    • Please run /bandersnatch/src/runner.py --help for usage
  • With docker, we recommend bind mounting in a read only bandersnatch.conf
    • Defaults to /conf/bandersnatch.conf
docker pull pypa/bandersnatch
docker run pypa/bandersnatch bandersnatch --help

Docker Compose

Bandersnatch setup using docker-compose is available here

pip

This installs the latest stable, released version.

python3 -m venv bandersnatch
bandersnatch/bin/pip install bandersnatch
bandersnatch/bin/bandersnatch --help

Quickstart

  • Run bandersnatch mirror - it will create an empty configuration file for you in /etc/bandersnatch.conf.
  • Review /etc/bandersnatch.conf and adapt to your needs.
  • Run bandersnatch mirror again. It will populate your mirror with the current status of all PyPI packages. Current mirror package size can be seen here: https://pypi.org/stats/
  • A blocklist or allowlist can be created to cut down your mirror size. You might want to Analyze PyPI downloads to determine which packages to add to your list.
  • Run bandersnatch mirror regularly to update your mirror with any intermediate changes.

Webserver

Configure your webserver to serve the web/ sub-directory of the mirror. For PEP691 support we need to respect the format the client requests.
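As an illustration of what "respecting the format the client requests" means, here is a minimal sketch of PEP 691 content negotiation in Python. The media type strings come from PEP 691; the function names are hypothetical, and falling back to text/html when nothing matches is an assumption of this sketch rather than documented bandersnatch or banderx behaviour. In practice this logic lives in your webserver configuration.

```python
from typing import List, Tuple

# Media types defined by PEP 691 for the simple repository API.
PEP691_JSON = "application/vnd.pypi.simple.v1+json"
PEP691_HTML = "application/vnd.pypi.simple.v1+html"
LEGACY_HTML = "text/html"


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        entries.append((media_type, quality))
    # sorted() is stable, so equal-q entries keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_index_format(accept_header: str) -> str:
    """Pick the index representation to serve (hypothetical helper)."""
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type in (PEP691_JSON, PEP691_HTML, LEGACY_HTML):
            return media_type
    # Assumption: serve plain HTML when the client names nothing we know.
    return LEGACY_HTML
```

A client that sends only the JSON media type gets JSON, while older clients that send text/html (or no Accept header at all) keep getting the HTML index.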

For an nginx example, please look at our banderx docker container and nginx.conf example configuration.

  • Note that it is a good idea to have your webserver publish the HTML index files correctly with UTF-8 as the charset. The index pages will work without it but if humans look at the pages the characters will end up looking funny.

  • Make sure that the webserver uses UTF-8 to look up unicode path names. nginx gets this right by default - not sure about others.

For more information, visit our official documentation for instructions on how to use an NGINX example Docker image.

If you are looking for a docker-compose example, head over here

Cron jobs

You need to set up one cron job to run the mirror itself.

Here's a sample that you could place in /etc/cron.d/bandersnatch:

    LC_ALL=en_US.utf8
    */2 * * * * root bandersnatch mirror 2>&1 | logger -t bandersnatch[mirror]

This assumes that you have a logger utility installed that will convert the output of the commands to syslog entries.

systemd timers are another alternative.

Maintenance

bandersnatch does not keep much local state in addition to the mirrored data. In general you can just keep rerunning bandersnatch mirror to make it fix errors.

If you want to force bandersnatch to check everything against the master PyPI:

  • Run bandersnatch mirror --force-check to move status files, if they exist in your mirror directory, in order to get a full sync.

Be aware that full syncs likely take hours depending on PyPI's performance and your network latency and bandwidth.

Other Commands

  • bandersnatch delete --help - Allows you to specify package(s) to be removed from your mirror (dangerous)
  • bandersnatch verify --help - Crawls your repo and fixes any missed files + deletes any unowned files found (dangerous)

Operational notes

Case-sensitive filesystem needed

You need to run bandersnatch on a case-sensitive filesystem.

OS X natively does this OK even though the filesystem is not strictly case-sensitive and bandersnatch will work fine when running on OS X. However, tarring a bandersnatch data directory and moving it to, e.g. Linux with a case-sensitive filesystem will lead to inconsistencies. You can fix those by deleting the status files and have bandersnatch run a full check on your data.

Windows requires elevated prompt

Bandersnatch makes use of symbolic links. On Windows, this permission is turned off by default for non-admin users. In order to run bandersnatch on Windows either call it from an elevated command prompt (i.e. right-click, run-as Administrator) or give yourself symlink permissions in the group policy editor.

Many sub-directories needed

The PyPI has a quite extensive list of packages that we need to maintain in a flat directory. Filesystems with small limits on the number of sub-directories per directory can run into a problem like this:

    2013-07-09 16:11:33,331 ERROR: Error syncing package: zweb@802449
    OSError: [Errno 31] Too many links: '../pypi/web/simple/zweb'

Specifically, we recommend avoiding ext3. ext4 and newer do not have the limitation of 32k sub-directories.

Client Compatibility

A bandersnatch static mirror is compatible only with the "static", cacheable parts of PyPI that are needed to support package installation. It does not support the more dynamic APIs of PyPI that may be used by various clients for other purposes.

An example of an unsupported API is PyPI's XML-RPC interface, which is used when running pip search.

Bandersnatch Mission

The bandersnatch project strives to:

  • Mirror all static objects of the Python Package Index (https://pypi.org/)
  • bandersnatch's main goal is to support the main global index to local syncing only
  • This will allow organizations to have lower latency access to PyPI and save bandwidth on their WAN connections and more importantly the PyPI CDN
  • Custom features and requests may be accepted if they can be of a plugin form
    • e.g. refer to the blocklist and allowlist plugins

Contact

If you have questions or comments, please submit a bug report to https://github.com/pypa/bandersnatch/issues/new

Code of Conduct

Everyone interacting in the bandersnatch project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PSF Code of Conduct.

Kudos

This client is based on the original pep381client by Martin v. Loewis.

Richard Jones was very patient answering questions at PyCon 2013 and made the protocol more reliable by implementing some PyPI enhancements.

Thanks to Christian Theune for creating and maintaining bandersnatch for many years!

bandersnatch's People

Contributors

asottile, cooperlees, ctheune, dependabot[bot], dralley, dstufft, dwighthubbard, electricworry, ewdurbin, gerrod3, greatbahram, happyaron, ichard26, indrat, jacobian, jezdez, leoquote, loewis, mbacicc, nlaurance-pyie, pre-commit-ci[bot], pronix, pyup-bot, rene-d, rkm, sanketdg, tau3, techalchemy, techciel, yeraydiazdiaz

bandersnatch's Issues

Add a verify/cleanup [--delete] option

Not sure on which name - Cleanup / verify

Let's walk the file system to:
a) ensure we have all the packages (e.g. like a full sync)
b) If option supplied (--delete), remove all files found that are not found in the saved JSON metadata

Also clean up non working deleting code from each run and rely on this.

  • Update documentation explaining how this verify / cleanup is the only way to clean up unneeded packages now

Initializing Plugins Twice

Looking at output, I feel we shouldn't see these log messages more than once:

2019-01-14T00:16:48.8984740Z 2019-01-14 00:16:48,898 DEBUG: Initialized release plugin 'blacklist_release', filtering []
2019-01-14T00:16:48.9056080Z 2019-01-14 00:16:48,905 DEBUG: Initialized release plugin 'blacklist_release', filtering []
2019-01-14T00:16:48.9108980Z 2019-01-14 00:16:48,910 INFO: Initialized prerelease plugin with [re.compile('.+rc\\d$'), re.compile('.+a(lpha)?\\d$'), re.compile('.+b(eta)?\\d$')]
2019-01-14T00:16:48.9114570Z 2019-01-14 00:16:48,910 INFO: Initialized prerelease plugin with [re.compile('.+rc\\d$'), re.compile('.+a(lpha)?\\d$'), re.compile('.+b(eta)?\\d$')]

Check and see if we need logic or log cleanup here. I will try and look, but if you get a second @dwighthubbard it would be appreciated.

Case Sensitivity Issue

I have created a full mirror of PyPI using Bandersnatch to use on my air-gapped systems. This is typically very effective, but I ran into an interesting issue today.

Pip is case-insensitive, so if I tell it I want to install the package InSilicoSeq, instead of looking for web/simple/InSilicoSeq it automatically looks for web/simple/insilicoseq.

Bandersnatch's mirroring retains the case of the package folders in "simple", and that is breaking things. The URLs from pypi.org are automatically lowercased (in the example above, https://pypi.org/simple/insilicoseq/ vs https://pypi.org/project/InSilicoSeq/).

Can you please do something to resolve this to reflect the case-insensitive nature of the packages on PyPI?
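For reference, the lookup that pip performs follows the name normalization rule from PEP 503: runs of "-", "_" and "." collapse to a single "-" and the result is lowercased. A minimal sketch of that rule is below; the simple_path helper and the web/simple/ layout it prints are only illustrative, taken from the paths mentioned in this report.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_path(name: str) -> str:
    """Hypothetical helper: where a client would look for this project."""
    return f"web/simple/{normalize(name)}/"


print(simple_path("InSilicoSeq"))
```

A mirror that stores its "simple" directories under the normalized name would match what pip requests regardless of the case the user typed.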

integration with safety-db

Hi all,

I am providing PyPI mirror to local users using bandersnatch. It works great!
However if some package is insecure I am bringing home some potentially risky stuff.

I am wondering if implementing some bandersnatch filtering based on Python insecure database would be possible.

Does this make sense?
Does something like that exist already?
Are you aware of possible integration with other security db source (e.g. CVE)?

Thanks
GP

local double copy of PyPI via HTTP

Hi all,

in my organization we keep a local copy of PyPI using bandersnatch. Let's call this copy DMZ repo.
We would like to create (and keep up to date) a second copy of PyPI (let's call it internal repo) in another zone of our network mirroring from the DMZ repo.

The DMZ repo exposes the file structure downloaded via bandersnatch on a simple HTTP server (no HTTPS).
When I try to connect bandersnatch to the DMZ repo to create the internal copy, I get this error:

2018-07-26 16:54:47,058 ERROR bandersnatch.master - Master URL http://dmz is not https scheme
Traceback (most recent call last):
  File "/opt/repos/PyPI/venv/bin/bandersnatch", line 11, in <module>
    sys.exit(main())
  File "/opt/repos/PyPI/venv/lib64/python3.6/site-packages/bandersnatch/main.py", line 112, in main
    args.func(config)
  File "/opt/repos/PyPI/venv/lib64/python3.6/site-packages/bandersnatch/main.py", line 21, in mirror
    config.getfloat('mirror', 'timeout'),
  File "/opt/repos/PyPI/venv/lib64/python3.6/site-packages/bandersnatch/master.py", line 35, in __init__
    raise ValueError("Master URL {0} is not https scheme".format(url))
ValueError: Master URL http://dmz is not https scheme

Apparently bandersnatch wants HTTPS. If I remove this conditional statement in master.py, it looks like bandersnatch works OK.

Why is HTTPS enforced?

Thanks
GP

example1 & example2 included in default blacklist

I've found that if you don't specify a blacklist in bandersnatch.conf, the bandersnatch mirror command acts as if you have included the example blacklist from the documentation.

here is my configuration file at /etc/bandersnatch.conf:

[mirror]

; The directory where the mirror data will be stored.
directory = /srv/pypi

; Save JSON metadata into the web tree:
json = false

; The PyPI server which will be mirrored.
master = https://pypi.org

; The network socket timeout to use for all connections.
timeout = 10

; Number of worker threads to use for parallel downloads.
workers = 3

; Whether to hash package indexes
; Recommended setting: the default of false for full pip/pypi compatibility.
hash-index = false

; Whether to stop a sync quickly after an error is found or whether to continue
; syncing but not marking the sync as successful. Value should be "true" or
; "false".
stop-on-error = false

; Number of consumers which verify metadata
verifiers = 3

And here is the beginning of the output:

$ bandersnatch mirror
2018-11-26 22:41:07,145 INFO: bandersnatch/3.1.1 (cpython 3.6.7-final0, Linux x86_64)
2018-11-26 22:41:07,146 INFO: Setting up mirror directory: /srv/pypi/
2018-11-26 22:41:07,146 INFO: Setting up mirror directory: /srv/pypi/web/simple
2018-11-26 22:41:07,147 INFO: Setting up mirror directory: /srv/pypi/web/packages
2018-11-26 22:41:07,147 INFO: Setting up mirror directory: /srv/pypi/web/local-stats/days
2018-11-26 22:41:07,147 INFO: Generation file missing. Reinitialising status files.
2018-11-26 22:41:07,148 INFO: Status file missing. Starting over.
2018-11-26 22:41:07,148 INFO: Syncing with https://pypi.org.
2018-11-26 22:41:07,148 INFO: Current mirror serial: 0
2018-11-26 22:41:07,148 INFO: Syncing all packages.
2018-11-26 22:41:11,704 INFO: Package 'example2' is blacklisted
2018-11-26 22:41:11,704 ERROR: example2 not found in packages to sync - blacklist_project has no effect here ...
2018-11-26 22:41:11,949 INFO: Package 'example1' is blacklisted
2018-11-26 22:41:11,949 ERROR: example1 not found in packages to sync - blacklist_project has no effect here ...
...

Note the mention of example1 and example2, which was unexpected.

Verify failures

I decided to check out the latest development of bandersnatch to see how the verification is coming along. (I had pointed out previously that with the new package paths it was not straightforward to identify purged files for removal.)

I can see that you're well on the way to a solution using the json metadata (good idea!). However, I'm finding the running of "bandersnatch verify" a bit fragile. It seems that a single exception causes the asyncio event loop to die:

2018-06-21 08:49:59,249 INFO: Finished validating jvc-proxy
2018-06-21 08:49:59,258 INFO: Finished validating djangocms-background-media
Task was destroyed but it is pending!
task: <Task pending coro=<TCPConnector._resolve_host() running at /opt/mirror/access/bandersnatch-venv/lib/python3.6/site-packages/aiohttp/connector.py:733> wait_for=<Future finished result=[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('151.101.61.63', 443)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2a04:4e42:f::319', 443, 0, 0))]> cb=[shield.<locals>._done_callback() at /usr/lib/python3.6/asyncio/tasks.py:679]>
Traceback (most recent call last):
  File "/opt/mirror/access/bandersnatch-venv/bin/bandersnatch", line 11, in <module>
    load_entry_point('bandersnatch', 'console_scripts', 'bandersnatch')()
  File "/opt/mirror/access/bandersnatch/src/bandersnatch/main.py", line 152, in main
    loop.run_until_complete(bandersnatch.verify.metadata_verify(config, args))
  File "/usr/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "/opt/mirror/access/bandersnatch/src/bandersnatch/verify.py", line 174, in metadata_verify
    config, all_package_files, mirror_base, json_files, args, executor
  File "/opt/mirror/access/bandersnatch/src/bandersnatch/verify.py", line 148, in async_verify
    await asyncio.gather(*coros)
  File "/opt/mirror/access/bandersnatch/src/bandersnatch/verify.py", line 95, in verify
    await url_fetch(jpkg["url"], pkg_file, executor)
  File "/opt/mirror/access/bandersnatch/src/bandersnatch/verify.py", line 128, in url_fetch
    async with session.get(url, timeout=timeout) as response:
  File "/opt/mirror/access/bandersnatch-venv/lib/python3.6/site-packages/aiohttp/client.py", line 784, in __aenter__
    self._resp = await self._coro
  File "/opt/mirror/access/bandersnatch-venv/lib/python3.6/site-packages/aiohttp/client.py", line 409, in _request
    break
  File "/opt/mirror/access/bandersnatch-venv/lib/python3.6/site-packages/aiohttp/helpers.py", line 671, in __exit__
    raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError
exception calling callback for <Future at 0x7fbb5e3af0b8 state=finished returned str>
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/lib/python3.6/asyncio/futures.py", line 414, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/usr/lib/python3.6/asyncio/base_events.py", line 621, in call_soon_threadsafe
    self._check_closed()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 358, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
exception calling callback for <Future at 0x7fbb5e265ba8 state=finished returned str>
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/lib/python3.6/asyncio/futures.py", line 414, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/usr/lib/python3.6/asyncio/base_events.py", line 621, in call_soon_threadsafe
    self._check_closed()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 358, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

I'm completely new to asyncio, so I'm not sure what is the best approach. Can asyncio be configured to ignore errors and continue to the next task? Or is diligent handling of failures required as normal?
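On the question of continuing past errors: one option in the standard library is asyncio.gather(..., return_exceptions=True), which returns exceptions as results instead of raising the first one. The sketch below uses a hypothetical verify_one coroutine as a stand-in for the real per-package work; it is not the code in verify.py.

```python
import asyncio
from typing import List, Tuple


async def verify_one(name: str) -> str:
    """Hypothetical stand-in for verifying a single package."""
    await asyncio.sleep(0)
    if name == "bad":
        raise asyncio.TimeoutError(f"timed out verifying {name}")
    return name


async def verify_all(names: List[str]) -> Tuple[List[str], List[Tuple[str, BaseException]]]:
    # return_exceptions=True keeps one failure from aborting the whole batch.
    results = await asyncio.gather(
        *(verify_one(name) for name in names), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [(n, r) for n, r in zip(names, results) if isinstance(r, BaseException)]
    return ok, failed


ok, failed = asyncio.run(verify_all(["jvc-proxy", "bad", "zweb"]))
print("verified:", ok)
print("failed:", [name for name, _ in failed])
```

The failures still need to be inspected and logged afterwards, so this does not replace diligent error handling; it only stops a single timeout from tearing down the event loop.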

Thanks. I'd be happy to help out on this one if I can.

Refactor verify.py to use less memory

Refactor verify.py to iterate through the file system JSON files better and use less memory.

  • Potentially use an asyncio.Queue and have X (workers) consumers
    -- Store only filepath in the queue and not coros

At the moment, loading all the coros into a list and using asyncio.gather takes ~1.3 GB of RAM. This is neither cool nor scalable.
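A rough sketch of the queue idea, assuming a hypothetical verify_json_file coroutine in place of the real verification logic: only file paths are queued, the queue is bounded, and a fixed number of consumers drain it, so memory use no longer grows with the number of packages.

```python
import asyncio
from pathlib import Path
from typing import Iterable, List


async def verify_json_file(path: Path) -> str:
    """Hypothetical stand-in for verifying one package's JSON metadata."""
    await asyncio.sleep(0)
    return path.name


async def consumer(queue: asyncio.Queue, verified: List[str], failed: List[Path]) -> None:
    while True:
        path = await queue.get()
        try:
            verified.append(await verify_json_file(path))
        except Exception:
            # Record the failure and keep consuming instead of dying.
            failed.append(path)
        finally:
            queue.task_done()


async def verify_paths(paths: Iterable[Path], workers: int = 3) -> List[str]:
    # The bounded queue holds a handful of paths, never a coroutine per file.
    queue: asyncio.Queue = asyncio.Queue(maxsize=workers * 2)
    verified: List[str] = []
    failed: List[Path] = []
    tasks = [
        asyncio.create_task(consumer(queue, verified, failed))
        for _ in range(workers)
    ]
    for path in paths:
        await queue.put(path)  # blocks while the queue is full
    await queue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return verified


names = asyncio.run(
    verify_paths([Path("a.json"), Path("b.json"), Path("c.json")], workers=2)
)
print(sorted(names))
```

Because paths can come from a generator that walks the file system, nothing forces the full list into memory at once.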

3.2.0 Still see prerelease_name log messages

With no plugins enabled in config, I still see alpha log messages on my prod usage. Will try to confirm with CI I actually fixed this.

[2019-01-30 12:45:09,976] INFO: Downloading: https://files.pythonhosted.org/packages/9f/96/22506f10b29def2991c4ac773c05e154ee223dc3e22785ef15ab857f71eb/graphistry-1.0a3.tar.gz (package.py:344)
[2019-01-30 12:45:09,994] INFO: Initialized prerelease plugin with [re.compile('.+rc\\d$'), re.compile('.+a(lpha)?\\d$'), re.compile('.+b(eta)?\\d$')] (prerelease_name.py:25)

Timeouts are not handling the timeout exception cleanly

When using the current git release, timeouts cause unhandled exceptions instead of logged timeout errors:

2018-05-23 16:47:21,785 ERROR: Error syncing package: tf-nightly-gpu@3887722
Traceback (most recent call last):
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/python/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/opt/python/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/opt/python/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/python/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/opt/python/lib/python3.6/ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "/opt/python/lib/python3.6/ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "/opt/python/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/util/retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/connectionpool.py", line 389, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=10.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/package.py", line 117, in sync
    self.sync_release_files()
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/package.py", line 183, in sync_release_files
    release_file['digests']['sha256']
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/package.py", line 307, in download_file
    r = self.mirror.master.get(url, required_serial=None, stream=True)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/master.py", line 44, in get
    r = self.session.get(path, timeout=self.timeout, **kw)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
    return self.request('GET', url, **kwargs)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=10.0)

Create bandersnatch Travis Integration Tests

Let's either hit test PyPI or an even smaller aiohttp PyPI to do a full sync (mirror) and a verify, to test code diffs. This might catch some things that unit tests will not. For example:

  • Hitting test PyPI will allow API changes to be found, but take longer + resources
  • Hitting our own small PyPI will be faster but the API will need to be kept up to date + create more work.

I'm leaning towards hitting test PyPI - Thoughts?

Setup automagic Third Party Dep upgrading like Warehouse

  • Double check how we test third parties
    -- Ensure we're testing core APIs we hit (e.g. not over mocking)
    -- Maybe make some Integration tests
  • Turn on auto updating so we always stay up to date on trunk

Then also look at releasing to PyPI more often once this is set up.

bandersnatch generated a no link `black/index.html` "simple api" file this morning

All 3 of my internal production mirrors this morning around 7:10-7:40am PST USA time (15:10-15:40 UTC) received a new serial for black (4731664) and attempted to sync black. No errors were reported in the bandersnatch logs.

[2019-01-23 07:37:00,407] INFO: Syncing package: black (serial 4731664) (package.py:101)
[2019-01-23 07:37:00,474] INFO: Storing index page: black (package.py:231)

All the package files remained, as expected, since bandersnatch does not delete package files (sdists, wheels, etc.), but bandersnatch generated a black index.html with no links:

<!DOCTYPE html>
<html>
  <head>
    <title>Links for black</title>
  </head>
  <body>
    <h1>Links for black</h1>

  </body>
</html>

From a spot check it ONLY seems to have affected black. Will log a job with Warehouse to see if they have any thoughts on why the serial has changed, as there is no new version at https://pypi.org/project/black/

Can any other bandersnatch users out there let me know if they too got an empty black index.html?

Workaround

wget https://pypi.org/simple/black
mv black index.html
sed -i 's/files\.pythonhosted\.org/your.mirror.awesome/g' index.html
scp index.html your.pypi.mirror:/srv/pypi/web/simple/{b}/black/

Whitelist or list of packages to sync

To create local mirrors with only known, trusted packages, it would be nice to have an option to whitelist packages. For example, like this:

[whitelist]
package =
pip
setuptools
....
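To make the proposal concrete, here is a sketch of how such a section could be read with Python's configparser. The [whitelist] section and package key are taken from the suggestion above and are not an existing bandersnatch option at the time of this request; note that configparser only treats the following lines as part of the value when they are indented.

```python
import configparser
from typing import List

# Hypothetical config mirroring the proposal; continuation lines are indented
# because configparser requires that for multi-line values.
EXAMPLE = """\
[whitelist]
package =
    pip
    setuptools
"""


def read_whitelist(text: str) -> List[str]:
    """Return the whitelisted package names, or [] if the section is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("whitelist", "package", fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


print(read_whitelist(EXAMPLE))
```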

Best Regards,
Dawid

Blacklist not working in 3.2.0

I set up a PyPI mirror using bandersnatch v3.2.0 and a blacklist generated by pypistats. I let bandersnatch mirror download all weekend while the office traffic on the network was minimal. My local mirror is now over a terabyte, but the downloads were only into packages starting with the letter k. When I search my mirror folder, I am seeing blacklisted packages that have been downloaded. I restarted bandersnatch so I could see any plugin initialization messages and I see the whitelist plugin being initialized (even though I do not have one in my .conf file) but not the blacklist.

Here are the first 20 lines that bandersnatch spits out.
2019-02-12 07:48:37,171 INFO: bandersnatch/3.2.0 (cpython 3.6.7-final0, Linux x86_64)
2019-02-12 07:48:37,238 INFO: Status file /mnt/mirror/pypi/status missing. Starting over.
2019-02-12 07:48:37,238 INFO: Syncing with https://pypi.org.
2019-02-12 07:48:37,238 INFO: Current mirror serial: 0
2019-02-12 07:48:37,238 INFO: Resuming interrupted sync from local todo list.
2019-02-12 07:48:37,586 INFO: Initialized project plugin 'whitelist_project', filtering []
2019-02-12 07:48:37,589 INFO: No project filters are enabled. Skipping filtering
2019-02-12 07:48:37,589 INFO: Trying to reach serial: 4796058
2019-02-12 07:48:37,589 INFO: 85976 packages to sync.
2019-02-12 07:48:38,151 INFO: Syncing package: 2 (serial 1386393)
2019-02-12 07:48:38,151 INFO: Syncing package: 2dfly-manbanzhen (serial 4165025)
2019-02-12 07:48:38,155 INFO: Syncing package: 3d_bin_container_packing (serial 4047427)
2019-02-12 07:48:38,499 INFO: 3d_bin_container_packing no longer exists on PyPI
2019-02-12 07:48:38,505 INFO: Syncing package: ACPI (serial 4097070)
2019-02-12 07:48:38,527 INFO: 2dfly-manbanzhen no longer exists on PyPI
2019-02-12 07:48:38,528 INFO: 2 no longer exists on PyPI
2019-02-12 07:48:38,528 INFO: Syncing package: AMON-py (serial 4188130)
2019-02-12 07:48:38,529 INFO: Syncing package: Abaqus-RunINPFiles (serial 4490341)
2019-02-12 07:48:38,545 INFO: ACPI no longer exists on PyPI
2019-02-12 07:48:38,551 INFO: Syncing package: Accengage (serial 4006324)

I also see a large number of the following message throughout my log.
2019-02-12 07:48:51,073 INFO: Initialized prerelease plugin with [re.compile('.+rc\d$'), re.compile('.+a(lpha)?\d$'), re.compile('.+b(eta)?\d$')]

I have attached my .conf file for reference
bandersnatch.conf.txt

Make Whitelist Optionally Resolve Dependencies

It seems the whitelist option does not mirror package dependencies! That's strange, isn't it?
In my opinion, that makes it useless unless the package has no dependencies (and that case is rare).
If the whitelist could fetch a package and its dependencies, people could use the whitelist option as a cache server.

Why was it written like this, if you don't mind me asking?

Any suggestions for fixing this issue?

PyPI stale caches

At work, we're using a slightly modified older version of Bandersnatch to mirror PyPI. About 10 days ago the mirroring job started failing - a lot - because the PyPI cache we were hitting was stale and refusing to update. After 10 days of failure (where we got behind by about 4,000 updates), I finally forced Bandersnatch to hit a different cache by changing some request headers.

All was well for about 24 hours, then that cache started having stale issues. I've since updated our Bandersnatch code to use a query parameter of the current timestamp in order to unconditionally bust the caches.

That's obviously a sub-optimal solution, but it got us back up and running. It also got me thinking: instead of just bailing after a certain number of stale page errors, what if Bandersnatch fell back to using the cache-bust hack? It could retry 3 times to see if the caches get updated, and if they don't, it would force the cache to be busted in order to retrieve the proper version.
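To make the proposal easier to discuss, here is a sketch of the retry-then-bust flow. Everything in it is hypothetical: StalePage, the cache_bust parameter name and the injected fetch callable are placeholders for illustration, not existing Bandersnatch code.

```python
import time
from typing import Callable, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class StalePage(Exception):
    """Hypothetical error raised when a fetched page has an old serial."""


def add_cache_buster(url: str, now: Optional[float] = None) -> str:
    """Append a timestamp query parameter so caches treat the URL as new."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    stamp = int(time.time() if now is None else now)
    query.append(("cache_bust", str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_fresh(url: str, fetch: Callable[[str], str], retries: int = 3) -> str:
    """Try the normal URL a few times, then fall back to busting the cache."""
    for _ in range(retries):
        try:
            return fetch(url)
        except StalePage:
            continue
    # Last resort: a unique URL that no cache has seen before.
    return fetch(add_cache_buster(url))
```

Keeping the cache-bust as a fallback means the normal case still benefits from the CDN, and only persistently stale pages pay the cost of an uncached request.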

I'm willing to make a PR for this (I'm trying to upgrade our Bandersnatch at work), but I didn't want to do so without a discussion first. Maybe there's a better way to do this?

Bandersnatch does not exit cleanly when KeyboardInterrupt is triggered

Bandersnatch generates an exception message and hangs when keyboard interrupt (ctrl-c) is pressed to force it to exit.

Example output, notice the shell prompt does not appear after hitting ctrl-c because bandersnatch did not exit:

$ bandersnatch mirror
^CTraceback (most recent call last):
  File "/var/virtualenv/public_mirror/bin/bandersnatch", line 11, in <module>
    sys.exit(main())
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/main.py", line 105, in main
    args.func(config)
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/main.py", line 60, in mirror
    changed_packages = mirror.synchronize()
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/mirror.py", line 105, in synchronize
    self.sync_packages()
  File "/var/virtualenv/public_mirror/lib/python3.6/site-packages/bandersnatch/mirror.py", line 197, in sync_packages
    package_syncer(packages, self.workers, self.stop_on_error)
  File "/opt/python/lib/python3.6/asyncio/base_events.py", line 454, in run_until_complete
    self.run_forever()
  File "/opt/python/lib/python3.6/asyncio/base_events.py", line 421, in run_forever
    self._run_once()
  File "/opt/python/lib/python3.6/asyncio/base_events.py", line 1395, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/python/lib/python3.6/selectors.py", line 445, in select
    fd_event_list = self._epoll.poll(timeout, max_ev)
KeyboardInterrupt

Hitting ctrl-c multiple times will eventually cause bandersnatch to exit.

Misleading/wrong tests?

I might be missing something here, but I'm noticing a lot of tests use this "pattern":

def test_mirror_sync_package_error_no_early_exit(mirror, requests):
    mirror.master.all_packages = mock.Mock()
    mirror.master.all_packages.return_value = {"foo": 1}

    # test code

It's mocking a nonexistent method. I removed the lines and the tests still pass. It's weird and also doesn't give me a lot of confidence in the test suite being useful, especially when considering that the code coverage reports that all of the bandersnatch classes are completely uncovered. I know they're not, but it's confusing to say the least:

[image: code coverage report]
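
Assigning a Mock to an attribute that does not exist succeeds silently, which would explain why removing those lines changes nothing. A minimal sketch of how `mock.patch.object` surfaces the problem instead; the `Master` class and its `changed_packages` method are hypothetical stand-ins, not bandersnatch's real class:

```python
from unittest import mock


class Master:
    """Hypothetical stand-in with one real method."""

    def changed_packages(self, last_serial):
        return {}


def patching_fails(obj, name):
    # patch.object refuses to patch an attribute that does not exist
    # (unless create=True is passed), so typos and stale names are caught.
    try:
        with mock.patch.object(obj, name):
            pass
    except AttributeError:
        return True
    return False


master = Master()
# Plain assignment never complains, even for a method that is not there.
master.all_packages = mock.Mock(return_value={"foo": 1})
```

Using `patch.object` (or a mock built with `spec_set`) in the tests would make a mocked-but-nonexistent method fail loudly instead of passing.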

Add package serial to generated Simple HTML

PyPI.org / Warehouse adds the package serial into the generated HTML as a comment. We should do the same.

The serial can be found in the JSON we use to find package files etc.

{"info": {"last_serial": XXXX}}

HTML

</html>
<!--SERIAL 696969-->

We should pass that into the html generation function and add it into the HTML.
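
A rough sketch of what that could look like, assuming the JSON shape quoted above. The function name, its parameters, and the page layout are illustrative only, not bandersnatch's actual generator:

```python
import html
from typing import Any, Dict, Iterable, Tuple


def render_simple_page(
    project: str,
    files: Iterable[Tuple[str, str]],
    package_json: Dict[str, Any],
) -> str:
    # int() ensures nothing but a number ends up inside the comment.
    serial = int(package_json["info"]["last_serial"])
    links = "\n".join(
        f'    <a href="{html.escape(url, quote=True)}">{html.escape(filename)}</a><br/>'
        for filename, url in files
    )
    return (
        "<!DOCTYPE html>\n<html>\n  <head>\n"
        f"    <title>Links for {html.escape(project)}</title>\n"
        "  </head>\n  <body>\n"
        f"{links}\n"
        "  </body>\n</html>\n"
        # The serial goes after the closing tag, as in the example above.
        f"<!--SERIAL {serial}-->"
    )
```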

Pinning Deps reduces package Flexibility

If you don't use a virtualenv and have large monorepos, pinning deps hard can cause a lot of unneeded pain. For example, at the moment bandersnatch has everything set via ==, so in a shared repo, if someone moves requests or any other bandersnatch dependency forward, bandersnatch will complain and fail to execute.

Suggestions (I would love some feedback):

  • Change requirements to >= to at least allow moving forward and ONLY pin if a newer version causes issues - Could introduce more travis runs when new versions come out etc.
  • Find the lowest version we work with and >= it in requirements.txt
  • Go back to un-versioned setup.py install_requires and leave it up to the deployment to respect requirements.txt or not

How large is a full pypi mirror?

Just wondering. I'm at nearly 600GB right now - wondering how far there is to go.
Please date your responses as people in the future may come here for answers.

adding a new package to whitelist section does not work

First of all, thanks for the whitelist plugin, @dwighthubbard and other friends. I had a problem with the whitelist plugin; consider this config file, for example:

[mirror]
directory = /srv/pypi
json = false
master = https://pypi.org
timeout = 10
workers = 8
hash-index = false
stop-on-error = false
verifiers = 3

[whitelist]
packages =
    bs4
    requests
    click

After bandersnatch downloaded all the related files, I wanted to add more packages, but this time bandersnatch did not fetch the new packages. I think there is a problem with the target_serial attribute.
The output I got:

2018-11-28 14:42:13,353 INFO: bandersnatch/3.1.1 (cpython 3.6.7-final0, Linux x86_64)
2018-11-28 14:42:13,354 INFO: Syncing with https://pypi.org.
2018-11-28 14:42:13,354 INFO: Current mirror serial: 4538127
2018-11-28 14:42:13,354 INFO: Syncing based on changelog.
2018-11-28 14:42:14,395 DEBUG: Project blacklist is []
2018-11-28 14:42:14,396 DEBUG: Initialized project plugin 'blacklist_project', filtering []
2018-11-28 14:42:14,399 DEBUG: Project whitelist is ['django', 'bs4', 'click', 'pytest', 'aiohttp', 'requests']
2018-11-28 14:42:14,399 INFO: Initialized project plugin 'whitelist_project', filtering ['django', 'bs4', 'click', 'pytest', 'aiohttp', 'requests']                                                                              
2018-11-28 14:42:14,401 INFO: Trying to reach serial: 4538140
2018-11-28 14:42:14,401 INFO: 0 packages to sync.
2018-11-28 14:42:14,402 DEBUG: Starting to sync packages 8 at once
2018-11-28 14:42:14,402 ERROR: Problem with package syncs: []
2018-11-28 14:42:14,402 INFO: Generating global index page.
2018-11-28 14:42:14,403 INFO: New mirror serial: 4538140
2018-11-28 14:42:14,404 INFO: 0 packages had changes

Content of pypi directory:

.
├── generation
├── output
├── status
└── web
    ├── last-modified
    ├── local-stats
    │   └── days
    ├── packages
    │   ├── 00
     .................
    │   └── ff
    └── simple
        ├── bs4
        ├── click
        ├── index.html
        └── requests

182 directories, 5 files

What's the problem? And is there any way to monitor the current status of bandersnatch, such as how many packages have been downloaded so far?

Bandersnatch works for Python 3.6.1 or later rather than Python 3.6 or later --> change in the docs?

Hi
Development doc https://bandersnatch.readthedocs.io/en/latest/CONTRIBUTING.html#pre-install mentions that Python 3.6 or later is required; however, Python 3.6.0 fails to run the 'src/bandersnatch/__init__.py' file, whose code is below:

#!/usr/bin/env python3
from typing import NamedTuple

class _VersionInfo(NamedTuple):
    major: int
    minor: int
    micro: int
    releaselevel: str
    serial: int

    @property
    def version_str(self) -> str:
        release_level = f".{self.releaselevel}" if self.releaselevel else ""
        return f"{self.major}.{self.minor}.{self.micro}{release_level}"


__version_info__ = _VersionInfo(
    major=3,
    minor=0,
    micro=0,
    releaselevel="dev0",
    serial=0,  # Not currently in use with Bandersnatch versioning
)
__version__ = __version_info__.version_str

The error I got in Python 3.6.0 was:

Traceback (most recent call last):
  File "<input>", line 24, in <module>
AttributeError: '_VersionInfo' object has no attribute 'version_str'

Also see last note about Python 3.6.1 on https://docs.python.org/3/library/typing.html#typing.NamedTuple as shown below:

Changed in version 3.6: Added support for PEP 526 variable annotation syntax.
Changed in version 3.6.1: Added support for default values, methods, and docstrings.

os.uname not supported on Windows

I just tried to set up bandersnatch on a Windows host and got an error because os.uname() is not supported. Either switch to platform.uname() or document in the README that Windows isn't supported as a host.

(bandersnatch) C:\Users\john\software\bandersnatch>bandersnatch mirror
Traceback (most recent call last):
  File "C:\Users\john\software\Python36-32\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\john\software\Python36-32\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\john\software\bandersnatch\Scripts\bandersnatch.exe\__main__.py", line 5, in <module>
  File "c:\users\john\software\bandersnatch\lib\site-packages\bandersnatch\main.py", line 3, in <module>
    import bandersnatch.master
  File "c:\users\john\software\bandersnatch\lib\site-packages\bandersnatch\master.py", line 1, in <module>
    from .utils import USER_AGENT
  File "c:\users\john\software\bandersnatch\lib\site-packages\bandersnatch\utils.py", line 23, in <module>
    USER_AGENT = user_agent()
  File "c:\users\john\software\bandersnatch\lib\site-packages\bandersnatch\utils.py", line 12, in user_agent
    system = os.uname()
AttributeError: module 'os' has no attribute 'uname'
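
A minimal sketch of the platform.uname() route suggested above. The `version` parameter and the exact user-agent layout are assumptions for illustration and may not match what bandersnatch's `user_agent()` really produces:

```python
import platform
import sys


def user_agent(version: str = "0.0.0") -> str:
    # platform.uname() is available on Windows, unlike os.uname().
    system = platform.uname()
    implementation = platform.python_implementation().lower()
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    return (
        f"bandersnatch/{version} "
        f"({implementation} {python_version}, {system.system} {system.machine})"
    )
```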

Only download latest versions

Is there any way to configure bandersnatch to only download the latest version of packages? I don't need nor want 600GB+ of every released version of every package. I'm not running an official mirror, I just want a local copy of latest packages.

Add support for `data-requires-python` attribute

According to PEP 503:

A repository MAY include a data-requires-python attribute on a file link. This exposes the Requires-Python metadata field, specified in PEP 345, for the corresponding release. Where this is present, installer tools SHOULD ignore the download when installing to a Python version that doesn't satisfy the requirement. For example:

<a href="..." data-requires-python="&gt;=3">...</a>
In the attribute value, < and > have to be HTML encoded as &lt; and &gt;, respectively.

Without this feature, users will encounter errors because the latest version of a package may conflict with the running Python version.

For example, using a mirror synced by bandersnatch:

# python2 -m pip install ipython -U
Looking in indexes: https://mirrors.ustc.edu.cn/pypi/web/simple
Collecting ipython
  Downloading https://mirrors.ustc.edu.cn/pypi/web/packages/5b/e3/4b3082bd7f6908af828561b0129b5064bff4a13e6acadb321fc2d939a605/ipython-7.0.1.tar.gz (5.1MB)
    100% |################################| 5.1MB 7.7MB/s 
    Complete output from command python setup.py egg_info:
    
    IPython 7.0+ supports Python 3.5 and above.
    When using Python 2.7, please install IPython 5.x LTS Long Term Support version.
    Python 3.3 and 3.4 were supported up to IPython 6.x.
    
    See IPython `README.rst` file for more information:
    
        https://github.com/ipython/ipython/blob/master/README.rst
    
    Python sys.version_info(major=2, minor=7, micro=15, releaselevel='candidate', serial=1) detected.
    
    
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-Tm_NkK/ipython/

There will be no error if you switch to the official PyPI repository, since the data-requires-python attribute is provided there.
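
A small sketch of how a file link with the attribute from the PEP 503 excerpt above could be rendered. `file_link` is a hypothetical helper, not an existing bandersnatch function:

```python
import html
from typing import Optional


def file_link(url: str, filename: str, requires_python: Optional[str] = None) -> str:
    attrs = f'href="{html.escape(url, quote=True)}"'
    if requires_python:
        # html.escape turns < and > into &lt; and &gt;, as PEP 503 requires.
        escaped = html.escape(requires_python, quote=True)
        attrs += f' data-requires-python="{escaped}"'
    return f"<a {attrs}>{html.escape(filename)}</a>"
```

When a release has no Requires-Python metadata, the attribute is simply omitted.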

Add a Windows CI Run

Let's ensure we stay working on Windows.

  1. Check all unit tests pass on Windows
  2. Add us to appveyor or any other service that gives us Windows containers etc.

Update the blacklist filtering to allow specifying project and version

Currently blacklisting blocks an entire package/project.

Add an enhancement that allows blocking specific releases of a package by specifying the package together with PEP 440 version specifiers.

In addition the blacklisting should be implemented in a manner that allows it to be extended in the future.
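
A deliberately simplified sketch of the idea, assuming one entry per line. It only understands bare project names and exact `name==version` pins; full PEP 440 specifiers (`>=`, `~=`, and so on) would need a real specifier parser. The function names are hypothetical:

```python
from typing import Dict, Iterable, Optional, Set

# None means "block every release of this project".
Rules = Dict[str, Optional[Set[str]]]


def parse_blacklist(lines: Iterable[str]) -> Rules:
    blocked: Rules = {}
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        name, sep, version = entry.partition("==")
        name = name.strip().lower()
        if not sep:
            # Bare project name: block the whole project.
            blocked[name] = None
        elif blocked.get(name, set()) is not None:
            # Exact pin: block just this release, unless the
            # whole project is already blocked.
            blocked.setdefault(name, set()).add(version.strip())
    return blocked


def is_blocked(rules: Rules, name: str, version: str) -> bool:
    key = name.lower()
    if key not in rules:
        return False
    versions = rules[key]
    return versions is None or version in versions
```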

verify's JSON updates keeps failures as valid JSON and never deletes packages

The --json-update parameter suffers from a couple of issues.

  • If a package has been deleted, the response from pypi.org will be a 404 error page. This is saved to the JSON directory as if it were a success. The status should be checked so that only 200 responses are saved.
  • Furthermore, if a non-200 response is returned and the --delete parameter is also set, the original JSON file should be deleted so that its files are recognised as orphans that need to be deleted.
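
The two points above could be sketched roughly as follows. `store_json_response` and its return values are hypothetical, and the actual verify code is structured differently:

```python
from pathlib import Path


def store_json_response(
    json_path: Path, status: int, body: bytes, delete: bool = False
) -> str:
    if status == 200:
        # Only a successful response is written to the JSON directory.
        json_path.parent.mkdir(parents=True, exist_ok=True)
        json_path.write_bytes(body)
        return "saved"
    if delete and json_path.exists():
        # Removing the stale JSON lets its files be detected as orphans.
        json_path.unlink()
        return "deleted"
    # Non-200 without --delete: leave whatever is on disk untouched.
    return "kept"
```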

Move from threading to asyncio

Move from queue and threading to asyncio.

1 - Use executors until we deprecate xmlrpc2 (Done)
2 - Move to native async libraries once we get a new API for PyPI

  • 2a: Move to aiohttp-xmlrpc (Done)
  • 2b: Move from requests + threading to aiohttp (same workers limit from config)
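
For step 1, a minimal sketch of running blocking sync calls through an executor while keeping a workers limit. `sync_all` and `sync_one` are hypothetical names, and `asyncio.get_running_loop()` needs Python 3.7 or later:

```python
import asyncio
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def sync_all(
    packages: Iterable[str],
    sync_one: Callable[[str], T],
    workers: int = 3,
) -> List[T]:
    loop = asyncio.get_running_loop()
    # The semaphore plays the role of the "workers" config option.
    semaphore = asyncio.Semaphore(workers)

    async def worker(name: str) -> T:
        async with semaphore:
            # sync_one is blocking, so it runs in the default thread pool.
            return await loop.run_in_executor(None, sync_one, name)

    # gather() returns results in the same order as the input.
    return await asyncio.gather(*(worker(name) for name in packages))
```

Once the blocking calls are replaced by native async ones (step 2), the `run_in_executor` call would go away and the semaphore could stay as the concurrency limit.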

How do I set multiple mirrors in bandersnatch?

Hello fellas
Is it possible to have multiple masters in the bandersnatch configuration?
And would it be possible to use bandersnatch to clone a devpi package index as well?

Thanks

Enforce Dir + File permissions everywhere + Retry

Due to GlusterFS fun I sometimes hit PermissionError with different bandersnatch IO operations. Let's ensure we're doing all we can to get correct permissions and possibly add retry logic on PermissionError.

[2018-06-16 18:45:07,467] ERROR: Error syncing package: cip-log@3937070 (package.py:146)
Traceback (most recent call last):
  File "bandersnatch/package.py", line 117, in sync
  File "bandersnatch/master.py", line 45, in get
  File "requests/models.py", line 935, in raise_for_status
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://pypi.org/pypi/cip-log/json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bandersnatch/package.py", line 121, in sync
  File "bandersnatch/package.py", line 327, in delete
  File "/usr/local/fbcode/gcc-5-glibc-2.23/lib/python3.6/shutil.py", line 476, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/usr/local/fbcode/gcc-5-glibc-2.23/lib/python3.6/shutil.py", line 474, in rmtree
    fd = os.open(path, os.O_RDONLY)
PermissionError: [Errno 13] Permission denied: '/data/pypi/web/pypi/cip-log'

When I check the directory:

ls -lha /data/pypi/web/pypi/cip-log
d---------      2 nobody nobody  24 Jun 12 00:14 .

Let's see if we can make bandersnatch try to set correct permissions and retry 'x' time(s). I'd suggest once.
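
A minimal sketch of the fix-permissions-and-retry idea for the rmtree case in the traceback above. The helper names are hypothetical, and the chmod can itself fail with PermissionError when the process does not own the directory:

```python
import os
import shutil
import stat


def _make_tree_accessible(root: str) -> None:
    # Give the owner rwx on the root, then on each subdirectory before
    # os.walk descends into it (top-down walk), skipping symlinks.
    os.chmod(root, stat.S_IRWXU)
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                os.chmod(path, stat.S_IRWXU)


def rmtree_with_retry(path: str, retries: int = 1) -> None:
    for attempt in range(retries + 1):
        try:
            shutil.rmtree(path)
            return
        except PermissionError:
            if attempt == retries:
                raise  # out of retries, surface the original error
            _make_tree_accessible(path)
```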
