pypi / inspector


🕵️ File browser for distributions on PyPI

Home Page: https://inspector.pypi.io

License: Apache License 2.0

Dockerfile 4.36% Makefile 2.82% Procfile 0.14% Python 54.37% HTML 16.96% Shell 2.07% CSS 19.28%

inspector's Issues

Don't 404 when package has been removed

Currently we depend on PyPI's JSON API to render a given project/release page. When that project/release has been removed, inspector returns a 404 as well.

Ideally, we'd still be able to display all data for removed releases, but this isn't currently possible.

May be blocked on #5.

Feature: support inspection of test.pypi.org packages

As a way to examine packages that have been uploaded to test.pypi.org as well.

Some folks may only upload their packages to the test server, and ask users to install from test.pypi.org via chats with specific pip commands.

Currently inspector will show a 404 for any package on test.pypi.org as it only supports retrieval from the production index.

inspector/inspector/main.py

Line 57 in cfd7b09

resp = requests.get(f"https://pypi.org/pypi/{project_name}/json")

I’m not sure if this should manifest as a separate test-inspector instance and differ via config, or if there’s another way we should support test retrieval like a specific header/query string.
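One way to picture the config-based option is a lookup from an index name to its base URL, so the same code path can serve either index. This is only a sketch: `INDEX_URLS` and `json_api_url` are hypothetical names, not part of inspector, and how the index would be selected (separate instance, header, or query string) is left open.

```python
# Hypothetical mapping from an index name to its base URL.
INDEX_URLS = {
    "pypi": "https://pypi.org",
    "testpypi": "https://test.pypi.org",
}


def json_api_url(project_name: str, index: str = "pypi") -> str:
    """Build the JSON API URL for a project on the chosen index."""
    try:
        base = INDEX_URLS[index]
    except KeyError:
        raise ValueError(f"unknown index: {index!r}") from None
    return f"{base}/pypi/{project_name}/json"
```

The existing `requests.get(...)` call could then take `json_api_url(project_name, index)` instead of a hard-coded URL.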

Update Application to Single Page Application (SPA)

As the project evolves, it's crucial to enhance the user experience by transforming the application into a Single Page Application (SPA). Currently, the application relies on traditional page navigation, which can result in longer loading times and a less fluid user interface. By converting it into a SPA, we can improve the overall performance and provide a more seamless browsing experience.

The primary goals of this task are as follows:

Implement a client-side routing mechanism: Integrate a Python-based library or framework (e.g., Flask, Django, or FastAPI) that enables client-side routing. This will allow us to handle navigation within the application without full page reloads.

Refactor the existing codebase: Modify the application's architecture to support the SPA model. This involves breaking down the user interface into modular components that can be dynamically loaded and rendered as needed.

Implement asynchronous data retrieval: Utilize AJAX or similar techniques to retrieve data from the server asynchronously, without requiring full page reloads. This will enable smoother transitions and improve overall performance.

Enhance user experience: Implement visual indicators or loading spinners to provide feedback during data fetching or navigation transitions. This will help users understand that the application is still actively processing their requests.

By transforming our application into a SPA, we can significantly enhance the user experience, reduce loading times, and create a more modern and responsive web application.

Feel free to add any additional ideas, suggestions, or insights to further improve this transition. Let's collaborate and work towards making our application a more efficient and user-friendly SPA!

Please note: This task may require refactoring and modifications to the existing codebase. Let's discuss the implementation strategy and any potential challenges together.

Let me know if you need any further adjustments or information in the issue description!

Shorter URLs

Currently URLs look something like this:

https://inspector.pypi.io/project/pip/22.1/packages/f3/77/23152f90de45957b59591c34dcb39b78194eb67d088d4f8799e9aa9726c4/pip-22.1-py3-none-any.whl/pip/_internal/models/format_control.py

That's pretty long! Obviously this is done because that gives enough information to fetch the URL from files.pythonhosted.org, but it might be nice to use shorter URLs, and query PyPI to get the long URL for the file distribution?

We could go as simple as:

https://inspector.pypi.io/file/pip/pip-22.1-py3-none-any.whl/pip/_internal/models/format_control.py

That's enough information to know the project name (since sdists don't have a well formed name) and the filename (which we can then look up on PyPI's /simple/<project>/ page), and get the long URL.
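A rough sketch of that lookup, assuming the project's file listing has already been fetched from the Simple API in its JSON form (PEP 691), where each entry in `files` carries a `filename` and a `url`. The helper name `find_file_url` is made up for illustration, and fetching and caching are left out.

```python
from typing import Optional


def find_file_url(simple_index: dict, filename: str) -> Optional[str]:
    """Return the long file URL for `filename`, or None if it is not listed.

    `simple_index` is assumed to be the parsed PEP 691 JSON response
    for a single project.
    """
    for file_info in simple_index.get("files", []):
        if file_info.get("filename") == filename:
            return file_info.get("url")
    return None
```

Returning `None` instead of raising leaves the 404 handling to the route.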

We could even go a bit simpler, and do:

https://inspector.pypi.io/file/pip-22.1-py3-none-any.whl/pip/_internal/models/format_control.py

Note all this embeds is the filename, we would need a way to look up the URL given nothing but the filename, but filenames are unique on PyPI, so we could just have a route on PyPI that does a redirect of filename to pythonhosted.org and does that look up for us.

The main thing we'd lose is that these links would then "die" if the file is removed from PyPI but still exists in files.pythonhosted.org. Maybe with #5 we could store the filename => file url mapping as we load them, which would mean they would continue to work in the future.

Alternatively, maybe still support the long URLs, and have a button to turn the short url into a permalink (think how github does).

Alternatively, maybe this is a silly idea and we should just stick with the long URLs :)

Serve 404 from Inspector instead of pypi.org

When using Inspector in an iframe, if the package lookup isn't found, a 404 from pypi.org is served.

This makes setting frame-src directives in a content security policy longer, since now it has to allow two domains, instead of serving the 404 directly.

In or around here:

inspector/inspector/main.py

Lines 67 to 68 in 5756f29

    
    if resp.status_code != 200:
        return redirect(pypi_project_url, 307)

Add Search Button to Each Page for Easy Package Navigation

This proposal suggests a set of enhancements to the application's user interface (UI), including the addition of a search button to each page and transitioning to a Single Page Application (SPA) architecture. Additionally, it is proposed to enable seamless navigation between files and versions within the application.

Proposed Enhancements

UI Improvements

Refine the UI to enhance user-friendliness, efficiency, and intuitiveness. This includes improving the layout, styling, and responsiveness of the application across different devices.

Search Button on Each Page

Add a search button prominently to each page to simplify content navigation. This feature will allow users to quickly search for specific information within the application.

Seamless Navigation between Files/Versions

Implement a navigation mechanism that enables users to switch between different files and versions without the need to go back to the previous tag. This feature will streamline the browsing experience and provide quick access to the desired content.

Transition to Single Page Application (SPA)

Restructure the application's architecture to adopt a Single Page Application (SPA) approach. This transition will eliminate page refreshes, resulting in a faster and more seamless browsing experience.

Expected Benefits

Improved user experience: The UI enhancements will make the application more visually appealing and user-friendly.
Enhanced navigation: The addition of a search button on each page and seamless navigation between files/versions will improve efficiency in finding and accessing desired content.
Seamless browsing experience: Transitioning to a SPA architecture will eliminate page reloads and provide a smoother user experience.

Please provide any additional information or specific requirements you may have regarding the proposed enhancements.

Provide a way to diff between two package versions/artifacts

This would be incredibly useful in understanding what a new package contains.

Code that requires horizontal scrolling can easily be missed.

I encountered this package that appeared like this in my browser:

Because this was on macOS, there was no horizontal scrollbar indicating that there was more text further to the right.

I added white-space: pre-wrap to the code block and this is what I found:

This solution messes with the line numbers, but it made it obvious where the malicious code was.

Ability to search files by sha256

This will be a generic method of reporting without meta information about the project, paths, etc.

This will be handy for some researchers and for automation purposes.
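To make the idea concrete, here is a minimal sketch that maps SHA-256 digests to member paths for one zip-based distribution. It assumes the digest refers to individual files inside a distribution rather than to the distribution archive itself; `sha256_index` is a hypothetical helper, and a real search would need the index persisted somewhere.

```python
import hashlib
import io
import zipfile


def sha256_index(archive_bytes: bytes) -> dict[str, list[str]]:
    """Map each member's SHA-256 hex digest to the paths that have it."""
    index: dict[str, list[str]] = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            digest = hashlib.sha256(zf.read(info)).hexdigest()
            # Identical files share a digest, so keep every matching path.
            index.setdefault(digest, []).append(info.filename)
    return index
```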

Cert error while trying to access https://inspector.pypi.io/

Problem: The current certificate on inspector.pypi.io is invalid. Because the site uses HSTS, Chrome, Edge, and Firefox do not allow adding a certificate exception unless HSTS is disabled, which is very insecure.

Firefox:

Websites prove their identity via certificates. Firefox does not trust this site because it uses a
certificate that is not valid for inspector.pypi.io. The certificate is only valid for the following
names: *.ingress.cmh1.psfhosted.org, test.pypi.org, upload.pypi.org, *.cmh1.psfhosted.com,
*.pyfound.org, *.ingress.cmh1.psfhosted.com, *.cmh1.psfhosted.org

Text Element Overflow

https://inspector.pypi.io/project/opentty/1.1/packages/ef/a1/5e2ca733dabac920962eef040f3be41efa92c6881f15d1c110f965359b3b/opentty-1.1.tar.gz/opentty-1.1/opentty.py#line.760

Line 760 is causing a container overflow and subsequently causing the page to assume an incorrect width.

Handle binary files, etc.

We should do something other than 500-error for things like https://inspector.pypi.io/project/tensorflow/2.9.1/packages/51/86/f5db15a6403a8ecf377807e93cdcd5cddb2f57e73604143cc02917d24db4/tensorflow-2.9.1-cp310-cp310-macosx_10_14_x86_64.whl/tensorflow/libtensorflow_framework.2.9.1.dylib
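One possible approach, sketched here as an assumption rather than inspector's actual behaviour, is to attempt a UTF-8 decode and treat failure (or an embedded NUL byte) as "binary", so the view can show a placeholder instead of raising. `decode_or_none` is a hypothetical helper name.

```python
from typing import Optional


def decode_or_none(data: bytes) -> Optional[str]:
    """Return the text if `data` looks like UTF-8 text, else None."""
    if b"\x00" in data:
        # NUL bytes are a cheap, common signal for binary content.
        return None
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

A caller could render a "binary file not shown" message whenever this returns `None`.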

feature: automatically identify code removed previously for being malicious.

A couple of ideas for approaching this (just spitballing; better solutions may well exist):

taking a cryptographic hash of a file (language-agnostic but inflexible to minor code changes)
computing a locality-sensitive hash of the malicious file using opcode disassembly or AST features (Python-specific)
- the similarity of another file to a known malicious file could then be measured as the Levenshtein distance between the two files' hashes.

This would obviously require a database of some sort (and committing thereto malicious file hashes in response to reports).
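As a small illustration of the trade-off between the first two ideas, the sketch below compares a plain cryptographic hash with a hash taken over the dumped AST. Note that the second is not a locality-sensitive hash: it is still an exact-match digest, just one that ignores formatting and comments. Both function names are hypothetical.

```python
import ast
import hashlib


def exact_hash(source: str) -> str:
    """SHA-256 of the raw source text; any byte change alters it."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def ast_hash(source: str) -> str:
    """SHA-256 of the dumped AST; whitespace and comments do not affect it."""
    tree = ast.parse(source)
    # Leave out line/column attributes so layout changes are ignored.
    normalized = ast.dump(tree, annotate_fields=False, include_attributes=False)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Renaming a variable or changing a constant still changes `ast_hash`, so catching lightly modified re-uploads would need the fuzzier matching described above.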

Inspector "Project Removed" Indicator Can Be Inaccurate

REF: #110

Problem: Inspector can serve a 'Project Removed' response when a package has not yet been removed.

Background: When a package is uploaded, in our experience, it can often take a moment for PyPI to serve the appropriate content on the package's page, while Inspector is able to serve the contents of the files relatively immediately.

Steps to Reproduce:

Identify a recently uploaded package.
Visit the inspector link of said package prior to the content being served on PyPI.

Example:
We were alerted to pipcryptov2 at 2:46PM.
I visited the Inspector URL to confirm malicious content. I was met with a package removed notification.

The PyPI page initially 404'd, but refreshing it moments later provided the appropriate webpage, and the package had not yet been removed.

Discussion: I understand this is probably a transient issue and likely not impactful as a whole to the service, as very few people are visiting inspector within the time frame between a package being uploaded and the PyPI content being served. Given that we tend to respond within ~60 seconds of receiving notification of a package upload, this is likely an issue that will only affect our service and similar services, so from our end, we can inform our team accurately that this should be ignored unless responding to a package significantly after the fact.

Support .tar.gz files

Currently only supports .zip files
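A minimal sketch of how both formats could be listed with the standard library, dispatching on the filename suffix. `list_members` is a hypothetical helper, and the input is assumed to be already in memory.

```python
import io
import tarfile
import zipfile


def list_members(filename: str, data: bytes) -> list[str]:
    """List regular-file paths inside a .tar.gz or zip-based distribution."""
    if filename.endswith((".tar.gz", ".tgz")):
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
            return [m.name for m in tf.getmembers() if m.isfile()]
    # Wheels and .zip sdists are both plain zip archives.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [i.filename for i in zf.infolist() if not i.is_dir()]
```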

Set up Sentry

For error reporting

Don't load entire distribution into memory

Currently this fetches the distribution from PyPI into a BytesIO object, after doing a requests.get() call (not streaming).

That means that while we're inside of _get_dist, we'll currently be using 2x the file size of the distribution worth of extra RAM, and outside of it we'll be using 1x the file size of extra RAM.

This should probably buffer to a temporary file and use streaming requests so that a large distribution doesn't kill us on memory.

This might just be #5 but I wanted to call it out explicitly since this applies even if we're storing the files somewhere.
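The buffering step could look roughly like the sketch below, which copies any readable binary stream into an anonymous temporary file in fixed-size chunks so that at most one chunk is held in memory at a time. `spool_to_tempfile` is a hypothetical name. With requests, the source stream would typically be `resp.raw` from a `requests.get(url, stream=True)` call.

```python
import shutil
import tempfile
from typing import BinaryIO


def spool_to_tempfile(stream: BinaryIO, chunk_size: int = 64 * 1024) -> BinaryIO:
    """Copy `stream` to a temporary file and return it rewound to the start."""
    tmp = tempfile.TemporaryFile()
    # copyfileobj reads and writes chunk_size bytes at a time.
    shutil.copyfileobj(stream, tmp, chunk_size)
    tmp.seek(0)
    return tmp
```

The returned file object is seekable, so it can be passed to `zipfile.ZipFile` or `tarfile.open(fileobj=...)` in place of a `BytesIO`.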

Prevent Cross Site Scripting (XSS)

All file contents are placed in the HTML without anything preventing XSS.
Simple example: https://inspector.pypi.io/project/inspector-test-package/0.0.0/packages/71/9a/24c8c3286a09bd3f82e17723562493128c6dc89e8fe177b3697bd31bb524/inspector-test-package-0.0.0.tar.gz/inspector-test-package-0.0.0/inspector-test-package/__init__.py
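The core of the fix is escaping file contents before they are placed into the page. The sketch below shows this with the standard library's `html.escape`; `render_code_block` is a made-up helper, and a template engine's autoescaping would normally do this job in practice.

```python
import html


def render_code_block(code: str) -> str:
    """Wrap untrusted file contents in a <code> element, HTML-escaped."""
    # html.escape converts &, <, > and (by default) quote characters.
    return f'<code class="language-python">{html.escape(code)}</code>'
```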

Set up CDN

The following routes will ~never change once a distribution is published:

/project/<project_name>/<version>/packages/<first>/<second>/<rest>/<distname>/
/project/<project_name>/<version>/packages/<first>/<second>/<rest>/<distname>/<path:filepath>

We should put these behind a CDN with a very long-lived expiry.
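For a CDN to hold these responses for a long time, the origin would usually send a long-lived `Cache-Control` header on exactly these routes. The sketch below is one way to decide that from the path alone. The one-year lifetime, the `immutable` directive, and the helper name are all assumptions made for illustration.

```python
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def cache_headers(path: str) -> dict[str, str]:
    """Return long-lived caching headers for distribution routes only."""
    parts = path.strip("/").split("/")
    # project/<name>/<version>/packages/<first>/<second>/<rest>/<distname>[/...]
    is_distribution_route = (
        len(parts) >= 8 and parts[0] == "project" and parts[3] == "packages"
    )
    if is_distribution_route:
        return {"Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}, immutable"}
    # Project and release listings can change, so they get no long-lived header here.
    return {}
```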

Link from release and package back to project page

It would be useful if pages like:

Also linked back to https://inspector.pypi.io/project/hatchling/

So for example, add a "hatchling" link between "Inspector" and "hatchling==0.6" here:

Method for selecting multiple lines

Right now this requires manually editing the anchor in the url from something like line.1 to line.1-20.

Ideally this would be similar to GitHub (click & drag to select multiple lines) but I don't think the JS framework we're using supports that currently.

Bug: Line numbers above 9999 are wrapping improperly

Issue: Line numbers past 9999 wrap in their column element.

https://inspector.pypi.io/project/rp/0.1.914/packages/07/67/ceeb07d5b8165c270e729f6fb950061b7afea5283ecb546b6e1bed915ea8/rp-0.1.914.tar.gz/rp-0.1.914/rp/r.py#line.10250

This was missed behavior in the line wrapping change (#146). I'm aware of this, bringing it to your attention while I stumble my way through a fix here locally. Line linking still behaves correctly, just looks a bit strange graphically.

Provide a file tree

Currently this has just a flat listing of files in directories:

a/b/c/foo.txt
a/b/c/bar.txt

We should make this a tree instead:

a
- b
  - c
    - foo.txt
    - bar.txt
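A small sketch of the conversion from a flat listing to the tree above, using nested dicts (directories) and `None` (files). The function names are illustrative only.

```python
def build_tree(paths):
    """Turn 'a/b/c.txt'-style paths into nested dicts; files map to None."""
    tree = {}
    for path in paths:
        node = tree
        *dirs, name = path.split("/")
        for directory in dirs:
            node = node.setdefault(directory, {})
        node[name] = None
    return tree


def render_tree(tree, depth=0):
    """Render the nested dict as indented '- name' lines."""
    lines = []
    for name, child in tree.items():
        prefix = "  " * (depth - 1) + "- " if depth else ""
        lines.append(prefix + name)
        if child is not None:
            lines.extend(render_tree(child, depth + 1))
    return lines
```

Dicts preserve insertion order, so entries appear in the order they were listed.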

Figure out a datastore

Currently this loads all files into memory, and does some rudimentary in-process caching of files.

Ideally this would be replaced with something that would perform slightly better, without having to hold all of PyPI in memory forever. Something like redis with a 24 hour timeout.

Any potential solution should probably not store files on disk.

Show file sizes in index

When browsing an index of a package, it's helpful to see file sizes as well.

I don't know if that data is available in the contexts yet, but wanted to file this while it was in my head.
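For zip-based distributions, the uncompressed size of each member is available from the archive's metadata without reading the file contents. The sketch below assumes the archive bytes are at hand. `member_sizes` is a hypothetical helper. For tarballs, `tarfile.TarInfo.size` is the analogous field.

```python
import io
import zipfile


def member_sizes(data: bytes) -> list[tuple[str, int]]:
    """Return (path, uncompressed size in bytes) for each file in a zip."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [(i.filename, i.file_size) for i in zf.infolist() if not i.is_dir()]
```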

Error loading whl packages' contents

When processing the contents of a project's .whl packages (zip files), every file's content is shown as "None".

Sort versions correctly

Currently versions are sorted alphanumerically; this should use the sorting that https://github.com/pypa/packaging provides instead.

inspector.pypi.io shows 502 Bad Gateway

https://inspector.pypi.io/ shows:

502 Bad Gateway
nginx/1.13.9

IPv6 Inspector

Can we add AAAA records please :). Happy to help if I can.

We only have legacy IP offered today:

[cooper@work ~]$ host inspector.pypi.io
inspector.pypi.io is an alias for inspector.cmh1.psfhosted.org.
inspector.cmh1.psfhosted.org has address 13.58.193.163
inspector.cmh1.psfhosted.org has address 18.217.27.127
inspector.cmh1.psfhosted.org has address 3.13.211.4

feature: try to detect language from filename for syntax highlighting

Source:

inspector/inspector/templates/code.html

Lines 8 to 9 in 1ef5cda

    
    <pre id="line" class="line-numbers linkable-line-numbers language-python">
    <code class="language-python">{{- code }}</code>

When viewing a non-python file via Inspector, like a README.md, the browser highlights the contents incorrectly.

Prism supports a lot of languages and advises on using their autoloader for languages.
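On the server side, the template's hard-coded `language-python` class could be replaced by one derived from the file extension. The sketch below uses a small, deliberately incomplete mapping and falls back to Prism's `language-none` class for unknown extensions. The mapping and the function name are assumptions made for illustration.

```python
import os

# Deliberately incomplete: file extension -> Prism language identifier.
PRISM_LANGUAGES = {
    ".py": "python",
    ".md": "markdown",
    ".json": "json",
    ".toml": "toml",
    ".yml": "yaml",
    ".yaml": "yaml",
    ".ini": "ini",
    ".sh": "bash",
    ".js": "javascript",
    ".c": "c",
}


def prism_class(filename: str) -> str:
    """Return the CSS class Prism should use for `filename`."""
    ext = os.path.splitext(filename)[1].lower()
    return f"language-{PRISM_LANGUAGES.get(ext, 'none')}"
```

With the autoloader enabled, Prism would then fetch the grammar for whichever class is emitted.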

Support disassembling `.pyc` files

We've seen some instances of malware being hidden in .pyc files (example here). Currently this tool refuses to display .pyc files because they are binary. Instead, we should attempt to disassemble the bytecode to some degree and display as much as possible in the UI.
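A rough sketch of what the disassembly step might look like with the standard library. It carries two important caveats: `marshal` data is specific to the Python version that wrote it, so this only works when the `.pyc` was compiled by the same interpreter version that is reading it. In addition, `marshal` is documented as not being secure against maliciously constructed data, so running this on untrusted uploads would need isolation. The 16-byte header size applies to CPython 3.7 and later, and `disassemble_pyc` is a hypothetical name.

```python
import dis
import io
import marshal

PYC_HEADER_SIZE = 16  # magic + flags + two 4-byte fields (CPython 3.7+)


def disassemble_pyc(data: bytes) -> str:
    """Return a textual disassembly of the code object stored in a .pyc."""
    # Skip the header; the rest is a marshalled code object.
    code = marshal.loads(data[PYC_HEADER_SIZE:])
    out = io.StringIO()
    dis.dis(code, file=out)
    return out.getvalue()
```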

Sort version numbers as version numbers.

Currently the version numbers are sorted as strings (e.g. 0.1, 0.10, 0.11, 0.2) rather than as version numbers (e.g. 0.1, 0.2, 0.10, 0.11).

See for instance https://inspector.pypi.io/project/whey/
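The difference can be shown with a simplified sort key that compares dotted versions numerically. This sketch only handles purely numeric versions such as `0.10`. Pre-releases, post-releases, epochs and the rest of PEP 440 are out of scope, and the `Version` class from pypa/packaging is the proper tool for those.

```python
def release_key(version: str) -> tuple[int, ...]:
    """Sort key for purely numeric dotted versions, e.g. '0.10' -> (0, 10)."""
    return tuple(int(part) for part in version.split("."))


versions = ["0.1", "0.10", "0.11", "0.2"]
as_strings = sorted(versions)                   # ['0.1', '0.10', '0.11', '0.2']
as_versions = sorted(versions, key=release_key)  # ['0.1', '0.2', '0.10', '0.11']
```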
