hash-http-content's Introduction

hash-http-content

This is a Python library to retrieve the contents of a given URL via HTTP (or HTTPS) and hash the processed contents.

Content processing

If an encoding is detected, this package converts the content to UTF-8 before proceeding.
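The conversion step can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the helper name `to_utf8` and the `detected_encoding` parameter are hypothetical, and how the encoding is detected in the first place is left out.

```python
from typing import Optional


def to_utf8(raw: bytes, detected_encoding: Optional[str]) -> bytes:
    """Re-encode raw bytes as UTF-8 when a source encoding is known.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    if detected_encoding is None:
        # No encoding was detected, so the bytes are left untouched.
        return raw
    # Decode with the detected encoding, then encode as UTF-8.
    return raw.decode(detected_encoding).encode("utf-8")
```

Normalizing to one encoding first means that the same text served as Latin-1 on one request and as UTF-8 on another would hash to the same value.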

Additional content processing is currently implemented for the following types of content:

  • HTML
  • JSON

HTML

HTML content is processed by using the pyppeteer package to execute any JavaScript on a retrieved page. The result is then parsed by Beautiful Soup to reduce the content to the human-visible portions of the page.
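To give a feel for the "reduce to visible text" step, here is a standard-library sketch that uses `html.parser` in place of Beautiful Soup and does not execute JavaScript at all. The class name, the function name, and the set of tags treated as invisible are all assumptions made for illustration; the package's own rules may differ.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside tags assumed to be invisible."""

    # Assumption: these tags never contribute human-visible text.
    HIDDEN_TAGS = {"head", "script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0  # how many hidden tags we are nested in
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the visible text chunks joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Hashing only the visible text keeps the digest stable when markup, inline scripts, or styles change but the content a reader sees does not.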

JSON

JSON content is processed with the json module from the Python standard library. The content is parsed and then re-serialized in a deterministic manner, so that styling differences between otherwise identical documents do not affect the hash.
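One way to get deterministic output from the `json` module is to sort keys and fix the separators. The sketch below assumes that approach; the function names, the choice of options, and the `sha256` default are illustrative and not confirmed by the project.

```python
import hashlib
import json


def canonical_json(text: str) -> str:
    """Parse JSON text and re-serialize it in a fixed form.

    Assumption: sorted keys and compact separators are enough to
    remove whitespace and key-order differences.
    """
    return json.dumps(
        json.loads(text),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )


def hash_json(text: str, algorithm: str = "sha256") -> str:
    """Hash the canonical form of a JSON document (hypothetical helper)."""
    data = canonical_json(text).encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()
```

With this in place, `{"a": 1, "b": 2}` and `{ "b" : 2, "a" : 1 }` produce the same digest, because both serialize to `{"a":1,"b":2}` before hashing.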

Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

License

This project is in the worldwide public domain.

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.

hash-http-content's People

Contributors

dav3r, dependabot[bot], felddy, hillaryj, jsf9k, mcdonnnj

hash-http-content's Issues

Explore switching from `pyppeteer` to `playwright-python`

💡 Summary

We should consider switching the browser interaction package we use from pyppeteer to playwright-python.

Motivation and context

According to the pyppeteer README:

Attention: This repo is unmaintained and has been outside of minor changes for a long time. Please consider playwright-python as an alternative.

Reliance on unmaintained packages can result in undesirable situations such as breaking behavior that is unlikely to be fixed and feature stagnation.

Implementation notes

As this is an entirely different tool, functionality will have to be verified to ensure a similar experience.

Acceptance criteria

How do we know when this work is done?

  • Switching packages is found to be viable.
  • We switch to using playwright-python.
  • We update cisagov/vdp-scanner-docker to use the new release.

Update mypy Workaround if It Is No Longer Needed

💡 Summary

We currently work around a perceived issue with type hinting with mypy in:

# mypy relies on typeshed (https://github.com/python/typeshed) for
# stdlib type hinting, but it does not have the correct type hints for
# hashlib.new(). The PR I submitted to fix them
# (https://github.com/python/typeshed/pull/4973) was approved, but I
# am not sure if mypy will still have issues with the usage of this
# keyword in non Python 3.9 (when the usedforsecurity kwarg was added)
# environments. I believe the earliest I can test this will be in mypy
# v0.900, and I have made
# https://github.com/cisagov/hash-http-content/issues/3 to document
# the status of this workaround.
# hasher = hashlib.new(hash_algorithm, usedforsecurity=False)
hasher = getattr(hashlib, "new")(hash_algorithm, usedforsecurity=False)

After a new version of mypy with type hint updates (possibly v0.900) is released, we should see if this workaround is still necessary to pass linting.
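For context, the `usedforsecurity` keyword only exists on Python 3.9 and later, which is part of why the type hints are awkward. The sketch below shows one way a caller could use the direct form while tolerating older interpreters. The helper name `new_hasher` is hypothetical, and the assumption that older versions reject the keyword with `TypeError` is mine, not the project's.

```python
import hashlib


def new_hasher(hash_algorithm: str):
    """Create a hash object, marking it as not used for security.

    Hypothetical helper. Assumes interpreters without the
    usedforsecurity keyword raise TypeError when it is passed.
    """
    try:
        # Preferred, direct usage (Python 3.9+).
        return hashlib.new(hash_algorithm, usedforsecurity=False)
    except TypeError:
        # Fallback for interpreters that do not know the keyword.
        return hashlib.new(hash_algorithm)
```

The flag does not change the digest that is produced; it only signals that the hash is not being used in a security context, which matters on builds that restrict some algorithms.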

Motivation and context

Workarounds should only be used as long as there is something to work around.

Implementation notes

The preferred usage is commented out, so testing is as simple as switching between the current and preferred usage.

Acceptance criteria

  • mypy hook passes with the preferred usage in place
  • Code is updated

or

  • mypy hook still fails with the preferred usage in place
  • Comment is updated to reflect the necessity of the workaround
