
broken_link_checker's People

Contributors

elhmn, ngdream, pythonbrad, rmpr


broken_link_checker's Issues

fix: provided argument for host

I think it would be better to align this with the other arguments in broken_link_checker/main.py.
At line 10, instead of

parser.add_argument('HOST', help='Eg: http://example.com')

it should be :

parser.add_argument('-h', '--host', help='Eg: http://example.com')
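One caveat worth checking before adopting the flag form: by default argparse reserves -h for its built-in help option, so registering -h for the host raises a conflict error. The sketch below illustrates this and uses -H as an alternative short flag. Both the -H spelling and the helper names are assumptions for illustration, not something the project defines.

```python
import argparse


def build_parser():
    """Build a parser where the host is passed as an option (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="broken_link_checker")
    # '-H' is an assumed short flag: '-h' is already used by argparse's --help.
    parser.add_argument("-H", "--host", required=True, help="Eg: http://example.com")
    return parser


def short_h_conflicts():
    """Return True if registering '-h' clashes with the default help option."""
    parser = argparse.ArgumentParser()
    try:
        parser.add_argument("-h", "--host", help="Eg: http://example.com")
    except argparse.ArgumentError:
        return True
    return False
```

Another option would be to create the parser with add_help=False and keep -h for the host, at the cost of losing the usual -h help shortcut.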

What do you think @pythonbrad ? Feel free to share your opinion on this !

feat: project as module

This is more of a suggestion, but I think it would be better to have an __init__.py file inside the broken_link_checker directory so that, from anywhere else in the project, we can import like this:

from broken_link_checker.notifier import X, Y

What do you think @pythonbrad ? Feel free to share your opinion on this !

feat: being able to provide more links at once

I think we need to be able to provide several links at once instead of running the command line once per link.
It would be great to accept more than one link at a time, something like:

python broken_link_checker/main.py --file targets.txt ....

where targets.txt will contain the list of links we want to check:

www.example.com
sangoku.fr
https://osscameroon.com
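A minimal sketch of how such a file could be read, assuming one target per line. The function names, the https default for entries without a scheme, and the handling of # comment lines are all assumptions made for illustration.

```python
def normalize_target(line, default_scheme="https"):
    """Return a URL with a scheme, or None for blank and comment lines."""
    target = line.strip()
    if not target or target.startswith("#"):
        return None
    if "://" not in target:
        # Assumption: entries such as 'sangoku.fr' are meant as https URLs.
        target = f"{default_scheme}://{target}"
    return target


def read_targets(lines):
    """Collect unique targets from an iterable of lines, keeping their order."""
    targets = []
    for line in lines:
        target = normalize_target(line)
        if target and target not in targets:
            targets.append(target)
    return targets
```

With a real file this could be called as read_targets(open("targets.txt")), since a file object iterates over its lines.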

What do you think @pythonbrad ? Feel free to share your opinion on this !

host argument missing in example

The following example in README.md and README-PYPI.md is missing the host argument and isn't running as is.

python -m broken_link_checker https://example.com --delay 1

extend Python versions support

The selected minimum Python version is 3.9, while versions 3.6 to 3.8 still account for more than 80% of Python installations (for websites).

Dropping the encoding parameter for logging.basicConfig (or having a special handling for python>=3.9) and using string annotations for unions could help extend the supported versions.
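A minimal sketch of both ideas, assuming a hypothetical configure_logging helper and a hypothetical describe function (neither name comes from the project). The encoding argument of logging.basicConfig was added in Python 3.9, so it is only passed on versions that accept it, and the union annotation is written as a string so it is never evaluated at definition time.

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Call logging.basicConfig, passing 'encoding' only where it is supported."""
    kwargs = {
        "level": level,
        "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    }
    if sys.version_info >= (3, 9):
        # 'encoding' is rejected by basicConfig on Python 3.8 and older.
        kwargs["encoding"] = "utf-8"
    logging.basicConfig(**kwargs)
    return kwargs


def describe(url: "str | None" = None) -> str:
    """The string annotation keeps this definition importable on older versions."""
    return url if url is not None else "<no url>"
```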

feat: configuration file

Right now it's fine to pass all parameters on the command line, but it would be great to also be able to pass a configuration file:

python broken_link_checker/main.py --file targets.txt --config conf.txt

in this case we have a file for targets urls we want to check and a configuration file with:

[config-broken-link]
DELAY=1
[email protected]::aSecretPassWord
[email protected],[email protected]
  • the delay
  • the credentials (email::password)
  • the list of recipients (it could be great to have more than only one recipient)
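A file of this shape could be read with the standard configparser module. In the sketch below, only the section name, the DELAY key and the email::password and comma-separated formats come from the description above. The CREDENTIALS and RECIPIENTS key names and the example addresses are placeholders invented for illustration.

```python
import configparser

# Sample content; key names other than DELAY and all addresses are placeholders.
SAMPLE = """\
[config-broken-link]
DELAY=1
CREDENTIALS=sender@example.com::aSecretPassWord
RECIPIENTS=admin@example.com, ops@example.com
"""


def parse_config(text):
    """Parse the configuration text into a plain dictionary."""
    # interpolation=None so that a '%' inside a password is kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["config-broken-link"]
    email, _, password = section["CREDENTIALS"].partition("::")
    recipients = [r.strip() for r in section["RECIPIENTS"].split(",") if r.strip()]
    return {
        "delay": section.getfloat("DELAY"),
        "email": email,
        "password": password,
        "recipients": recipients,
    }
```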

What do you think @pythonbrad ? Feel free to share your opinion on this !

feat: full fetch even for js generated websites

We should have this information:

  • all the links tested
  • all the links succeeded
  • all the links failed

I'm asking because some websites, like osscameroon.com, are built using React. Therefore requests.get is not going to "load" the page and its links, just a small HTML document with a placeholder div (as far as I can remember; I haven't touched React in a long time) where the JS bundle is supposed to render the SPA.

PS: "NOT TESTED on my end", I may be wrong, but i think you should check the 'content' result you fetch from PWA.

How to deal with this?
Use a JS-capable crawler or, if you don't want to change your current code logic, requests_html is the way to go (it downloads a Chromium binary that is called to render pages in the background).
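To make the problem concrete, here is a small sketch using only the standard library. It extracts links from raw HTML the way a non-rendering checker would see it. The two sample documents are invented for illustration and are not copies of any real site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical SPA shell: the links only exist after the script has run.
SPA_SHELL = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'
# Hypothetical server-rendered page: the links are present in the raw HTML.
STATIC_PAGE = '<html><body><a href="/about">About</a><a href="/blog">Blog</a></body></html>'
```

On the shell document, extract_links finds nothing to check, which is why a rendering step would be needed before link extraction for this kind of site.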

I'm not sure I'll be available tonight for a live coding session, but feel free to share your questions here in the chat @pythonbrad!

Good job so far

Verify the foreign links

The checker should also verify the foreign (external) links present on the website being checked.
These can be download links, reference links, and so on.
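One way to tell foreign links from internal ones is to compare host names after resolving each link against the site URL. This is a minimal sketch with hypothetical function names. It treats any link whose host differs from the site's host as foreign, which is an assumption about what "foreign" should mean here.

```python
from urllib.parse import urljoin, urlparse


def is_foreign(link, site):
    """Return True when the link points to a different host than the site."""
    absolute = urljoin(site, link)
    return urlparse(absolute).netloc.lower() != urlparse(site).netloc.lower()


def split_links(links, site):
    """Split links into (internal, foreign) lists of absolute URLs."""
    internal, foreign = [], []
    for link in links:
        absolute = urljoin(site, link)
        (foreign if is_foreign(link, site) else internal).append(absolute)
    return internal, foreign
```

Foreign links would typically only be checked for availability and not crawled further, so that the checker stays on the target website.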

@Sanix-Darker What do you think about it?

fix: problem of circular checking

Problem

The checker checks the same page many times (infinite checking loop).

Log

source -> child
/ -> /files/borismbarga.pdf
/imgs/holoshoot_1 -> /imgs/files/borismbarga.pdf (load the home page)
/imgs/files/borismbarga.pdf -> /imgs/files/files/borismbarga.pdf (load the home page)
/imgs/files/files/borismbarga.pdf -> /imgs/files/files/files/borismbarga.pdf (load the home page)
/imgs/imgs/boris_profile.jpg -> /imgs/imgs/files/borismbarga.pdf (load the home page)
/imgs/files/files/files/borismbarga.pdf -> /imgs/files/files/files/files/borismbarga.pdf (load the home page)
/imgs/imgs/files/borismbarga.pdf -> /imgs/imgs/files/files/borismbarga.pdf (load the home page)
/imgs/files/imgs/boris_profile.jpg -> /imgs/files/imgs/files/borismbarga.pdf (load the home page)
/imgs/files/files/files/files/borismbarga.pdf -> /imgs/files/files/files/files/files/borismbarga.pdf (load the home page)
/imgs/imgs/files/files/borismbarga.pdf -> /imgs/imgs/files/files/files/borismbarga.pdf (load the home page)
/imgs/files/imgs/files/borismbarga.pdf -> /imgs/files/imgs/files/files/borismbarga.pdf (load the home page)
/imgs/files/files/imgs/boris_profile.jpg -> /imgs/files/files/imgs/files/borismbarga.pdf (load the home page)
/imgs/files/files/files/files/files/borismbarga.pdf -> /imgs/files/files/files/files/files/files/borismbarga.pdf (load the home page)
/imgs/imgs/imgs/boris_profile.jpg -> /imgs/imgs/imgs/files/borismbarga.pdf (load the home page)

The error begins at the second URL.

Solution

  • Try to verify the redirect URL in case of redirection. ✅
  • Try to implement an algorithm to avoid redundancy (maybe by comparing the source and child content). ✅
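The two ideas above can be sketched as a crawl loop that remembers both the URLs it has visited (including the final URL after a redirect) and a hash of each page body it has already processed. This is an illustrative sketch, not the project's implementation. The fetch and extract_links callbacks and the fake site used in the demo are assumptions.

```python
import hashlib
from collections import deque
from urllib.parse import urljoin


def crawl(start, fetch, extract_links):
    """Visit pages breadth-first, skipping known URLs and known content.

    fetch(url) must return (final_url, body); extract_links(body) returns hrefs.
    """
    seen_urls = set()
    seen_bodies = set()
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        if url in seen_urls:
            continue
        seen_urls.add(url)
        final_url, body = fetch(url)
        seen_urls.add(final_url)  # remember where a redirect ended up
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen_bodies:
            continue  # same content as an earlier page: do not expand it again
        seen_bodies.add(digest)
        visited.append(final_url)
        for href in extract_links(body):
            queue.append(urljoin(final_url, href))
    return visited


# Fake site reproducing the log: every unknown path serves the home page,
# whose relative links would otherwise generate ever-deeper paths.
HOME = "home page"
LINKS = {HOME: ["files/borismbarga.pdf", "imgs/holoshoot_1"]}


def fake_fetch(url):
    return url, HOME


def fake_links(body):
    return LINKS.get(body, [])
```

Comparing content hashes only catches byte-identical pages. Pages that embed a timestamp or token would need a looser comparison.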

Migration from urllib3 to requests

Reason

To simplify development, we are thinking of migrating to requests.

Error occurred

The broken link checker, when run against an https website, returns this kind of error:
2022-04-24 21:12:35,948 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))': /
2022-04-24 21:12:36,283 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))': /
2022-04-24 21:12:36,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))': /
cURL and requests both manage to open this website.

Potential solution (on macOS)

Navigate to your Applications/Python 3.X/ folder and double click the Install Certificates.command to fix this.

Question

What do you think about it?

fix: docstrings placement

I think it's more relevant and logical that instead of having:

"""
   some docstring for method something
"""
def something():
   ....

it's better to have this :

def something():
   """
       some docstring for method something
   """
   ....
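The difference is not only stylistic: Python only treats a string as a docstring when it is the first statement inside the function body. A string placed before the def is just an unused expression. The short sketch below shows this with two hypothetical functions.

```python
VERSION = "example"  # any statement, so the string below is not the module docstring

"""
   some docstring for method something
"""
def something_before():
    return None


def something_after():
    """
       some docstring for method something
    """
    return None
```

Here something_before.__doc__ is None, while something_after.__doc__ holds the text, which is what help() and documentation tools rely on.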

The way docstrings are used right now is a little bit confusing for me 🤔

What do you think @pythonbrad ? Feel free to share your opinion on this !

refactor: Separate check and load_url

The function check should just check. And the function load_url should just load urls.
The separation of these functions will help in the writing of tests.
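A rough sketch of what the separation could look like. The signatures here are assumptions, since the issue does not give the current ones: check becomes a pure decision on a status code, and load_url only performs the fetch through an injected callable.

```python
def check(status_code):
    """Pure check: decide whether a status code means the link works."""
    return 200 <= status_code < 400


def load_url(url, fetch_status):
    """Only load the URL; fetch_status(url) is injected and returns a status code."""
    return fetch_status(url)


def find_broken(urls, fetch_status):
    """Combine the two steps and return the URLs considered broken."""
    return [url for url in urls if not check(load_url(url, fetch_status))]
```

With this split, check can be tested without any network access, and load_url can be tested with a fake fetch_status function.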

Provide a short report

It would be nice if blc provided an option to get a shorter version of the report.
Something like
blc https://example.com -brief
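A minimal sketch of such an option with argparse. It uses the conventional double-dash spelling --brief, which is an assumption (the example above writes -brief), and the render helper and its one-line summary format are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="blc")
    parser.add_argument("host", help="Eg: http://example.com")
    # Assumed flag name; the summary format below is only an illustration.
    parser.add_argument("--brief", action="store_true", help="print only a summary line")
    return parser


def render(broken_links, brief):
    """Return either a one-line summary or the full list of broken links."""
    if brief:
        return f"{len(broken_links)} broken link(s)"
    return "\n".join(broken_links)
```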

Drop the notifier

The application can write its output to the console.
If at some point we want notifications, we can take this output and pipe it to an external application.

Eg. blc https://example.com | sendmail [email protected]

feat: documentation of the broken links

It would be useful for the admin to know why a link is considered broken.
Eg:

Hello, your website <http://example.com> contains 2 broken links:
http://example.com/a/b/c: Not found
http://example.com/d: Name or service not known
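A report of this shape could be produced from a mapping of broken URLs to reasons. The function name and the dictionary structure below are assumptions for illustration.

```python
def format_report(site, broken):
    """Build the report text; 'broken' maps each broken URL to its reason."""
    lines = [f"Hello, your website <{site}> contains {len(broken)} broken links:"]
    lines.extend(f"{url}: {reason}" for url, reason in broken.items())
    return "\n".join(lines)


EXAMPLE = {
    "http://example.com/a/b/c": "Not found",
    "http://example.com/d": "Name or service not known",
}
```

Calling format_report("http://example.com", EXAMPLE) yields the three lines shown in the example above.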

tests: add some tests

Because the project is just starting, I think we should add some tests in a directory called tests at the root. You don't necessarily need to write all the tests now, but we should keep in mind that they are expected by a milestone!

What do you think @pythonbrad ? Feel free to share your opinion on this !

feat: add a Makefile

A Makefile is a good way to simplify the commands that need to be run. Feel free to add a Makefile to the project for things like:

  • set up an environment or installing dependencies
    so instead of pip install -r re.... it could be make install
  • start the command line checker

What do you think @pythonbrad ? Feel free to share your opinion on this !
