simplecto / screenshots
Simple Website Screenshots as a Service (Django, Selenium, Docker, Docker-compose)
License: MIT License
There is no input validation on the URL for screenshots. You can put in anything.
Malformed URLs will crash the screenshotting process because it assumes proper URLs.
Force a prefix of http:// or https://.
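A minimal sketch of what that check could look like, using only the standard library. The name normalize_url and the choice to default bare hosts to http:// are assumptions for illustration, not something the repo defines.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def normalize_url(raw: str) -> str:
    """Return the URL with an http(s) scheme, or raise ValueError.

    Hypothetical helper: bare hosts such as "example.com" get an
    "http://" prefix (an assumption; "https://" would work as well).
    """
    candidate = raw.strip()
    if not candidate:
        raise ValueError("empty URL")
    if "://" not in candidate:
        candidate = "http://" + candidate
    parts = urlsplit(candidate)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    _ = parts.port  # raises ValueError when the port is not numeric
    return candidate
```

Calling this before the screenshot is queued would let the form reject bad input instead of the worker crashing on it later.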
If you want cropped versions then you can do that using the resizer
Sometimes this happens when the browser is simply waiting for a response (currently set at 60 seconds)
This should be handled whereby the worker updates the status of the screenshot to failed with an explanation.
Currently we get a warning about needing to save as PNG with selenium driver
/Users/sheraz/src/screenshot/venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py:1031: UserWarning: name used for saved screenshot does not match file type. It should end with a `.png` extension
"type. It should end with a `.png` extension", UserWarning)
This repo seems to kill all three.
there should be a screenshot_file table that points to the screenshot table.
That way every time a new screenshot is made it does not overwrite the previous ones.
This will be helpful for when we want to track the screenshots over time.
screenshot_file table:
id (uuid)
screenshot_id
created_At
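The proposed table could be sketched roughly like this. It uses sqlite3 only to keep the example self-contained; the real project would presumably use a Django model. The screenshot table's url column, the snake_case created_at spelling, and the helper names are assumptions.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE screenshot (
    id  TEXT PRIMARY KEY,
    url TEXT NOT NULL                -- assumed column, for illustration
);
CREATE TABLE screenshot_file (
    id            TEXT PRIMARY KEY,  -- uuid
    screenshot_id TEXT NOT NULL REFERENCES screenshot(id),
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database holding both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_file(conn: sqlite3.Connection, screenshot_id: str) -> str:
    """Record a new file for a screenshot without touching older rows."""
    file_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO screenshot_file (id, screenshot_id) VALUES (?, ?)",
        (file_id, screenshot_id),
    )
    return file_id
```

Each new capture becomes a new row, so the history for one screenshot is just the rows sharing its screenshot_id, ordered by created_at.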
Yea, I'm late to the party here, but I recall there was a problem with zombie processes when the browser hung.
It is mid-2022, and I JUST learned about tini / docker run --init, which runs as PID 1 in the container and reaps these zombie processes
https://docs.docker.com/compose/compose-file/#init
Perhaps run this on a VM for a while and monitor?
We can see from the screenshots that some of them need a little more time to complete once the screen is resized. Let's add a delay with a default of 5 seconds, and the option to refresh it with a longer interval.
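Something along these lines might do, assuming the worker holds a Selenium WebDriver. The function name capture_after_resize and its parameters are hypothetical; set_window_size and save_screenshot are the standard WebDriver methods.

```python
import time


def capture_after_resize(driver, path, width, height, delay=5.0):
    """Resize the window, give the page time to settle, then capture.

    `driver` is expected to be a Selenium WebDriver (or anything with
    the same two methods). `delay` is in seconds and defaults to the
    5 seconds suggested above; pass a larger value for slow pages.
    """
    driver.set_window_size(width, height)
    time.sleep(delay)
    return driver.save_screenshot(path)
```

A fixed sleep is the simplest option; waiting on an explicit page condition would be more precise but needs per-site knowledge.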
create a worker to only do delete house-cleaning
when a screenshot is marked as deleted, the worker should first delete the original screenshot folder or individual screenshot files
when the actual images are removed then it should delete the entry from the database
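A rough sketch of that ordering, files first and database entry second. A plain dict stands in for the database here, and the record shape (deleted, folder) is an assumption made only so the example runs on its own.

```python
import shutil
from pathlib import Path


def purge_deleted(records, image_root):
    """Remove files for records marked deleted, then drop the records.

    `records` maps a screenshot id to {"deleted": bool, "folder": str};
    this stands in for the real database table. A record is only
    dropped once its folder is confirmed gone, so a failed file
    deletion is retried on the next run.
    """
    purged = []
    for shot_id, rec in list(records.items()):
        if not rec["deleted"]:
            continue
        folder = Path(image_root) / rec["folder"]
        shutil.rmtree(folder, ignore_errors=True)
        if not folder.exists():
            del records[shot_id]
            purged.append(shot_id)
    return purged
```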
There will be duplicates recorded when using www or not.
We should have a way for the create_screenshot method to first look up the domain and then redirect to the appropriate screenshot if it already exists.
If it does not exist then we continue as planned.
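The lookup could be keyed on a canonical host, for example as below. Both function names are hypothetical, and treating www.example.com and example.com as the same site is an assumption that does not hold for every domain.

```python
from urllib.parse import urlsplit


def canonical_host(url: str) -> str:
    """Lowercased host with a single leading "www." removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def find_existing(url: str, known: dict):
    """Return the stored screenshot for this URL's host, or None.

    `known` maps canonical hosts to existing screenshots; it stands
    in for a database query in create_screenshot.
    """
    return known.get(canonical_host(url))
```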
There are some nice projects out here for selenium grid. Maybe look at that for the next round of improvements
if status is pending, then keep spinner.
if status is SUCCESS then load image
figure out how to make intercooler js refresh until a different page comes back
Now that this has been open-sourced we should start removing any mentions of simplecto from the codebase. Move those into dynamic settings like django-solo or as part of the environment variables.
show url and image with link to the full page
selenium.common.exceptions.WebDriverException: Message: Failed to decode response from marionette
It seems that sentry has a hard time pulling out all the context variables such as the actual input that caused the error. We should catch / wrap for this exception, send it to sentry, and then continue on.
if this is done then you should update the actual screenshot status to error or something else.
Right now it is possible that a screenshot will be left in PENDING state if it is killed while making the screenshot. This is because the worker does not handle a kill or shutdown properly.
It should listen for the termination signal (SIGTERM, since SIGKILL itself cannot be caught) and return the screenshot status back to NEW
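One way this might look, assuming the shutdown arrives as SIGTERM or SIGINT (what docker stop and Ctrl-C send by default). The class GracefulWorker and the dict used for the current screenshot are stand-ins for the real worker and model.

```python
import signal


class GracefulWorker:
    """Hypothetical worker wrapper that resets in-flight work on shutdown."""

    def __init__(self):
        self.stopping = False
        self.current = None  # screenshot being processed, or None

    def install(self):
        """Register the handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self._on_signal)
        signal.signal(signal.SIGINT, self._on_signal)

    def _on_signal(self, signum, frame):
        # Ask the main loop to exit and hand the unfinished job back.
        self.stopping = True
        if self.current is not None and self.current["status"] == "PENDING":
            self.current["status"] = "NEW"
```

The worker loop would then run `while not worker.stopping:` instead of `while True:`, setting worker.current before each capture and clearing it afterwards.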
After the user kicks off a screenshot request and lands on the view page, they should see a notification asking them to wait while the screenshot is finished.
The page should refresh every 5 seconds until the page is done loading.
If the page status is not pending or success then it should show an error to the user.
Hello,
I see in screenshot_worker.py that you use a while True loop in handle:
def handle(self, *args, **options):
    while True:
When I run manage.py screenshot_worker in a terminal, I think it will never stop. How do I run manage.py runserver 0.0.0.0:8000?
Chrome can auto-translate pages, so we should try to screenshot those.
There is no domain name lookup validation on the url
you can borrow this from favicon repo. we had that there
Some websites are not configured properly with their SSL and redirection. It is entirely possible that you can hit a page like https://azcentral.com from outside the US, and their GDPR/protection routers will drop traffic, provide invalid SSL settings, or simply hang.
However, when you hit http://azcentral.com, it will work as expected and redirect you to https://eu.azcentral.com
We should have a number of attempts before giving up.
if http fails, then stop.
if https fails, then try http
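The fallback order above could be sketched like this. The function name, the injected fetch callable, and the default of two attempts per scheme are all assumptions; in the worker, fetch would wrap the actual browser call.

```python
def fetch_with_fallback(host, fetch, attempts=2):
    """Try https first, then http, with a few attempts per scheme.

    `fetch` is any callable that takes a URL and raises on failure.
    Returns (url, result) for the first success. If http also fails
    we stop and raise, chaining the last error seen.
    """
    last_error = None
    for scheme in ("https", "http"):
        url = f"{scheme}://{host}"
        for _ in range(attempts):
            try:
                return url, fetch(url)
            except Exception as exc:  # broad on purpose in this sketch
                last_error = exc
    raise RuntimeError(f"all attempts failed for {host}") from last_error
```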
We already catch webdriver exceptions and handle them. Sending them into sentry does nothing but confuse things.
This is a misuse of sentry. We should simply log these handled cases somewhere. Either log it as STDERR / STDOUT or just stick it in a database table and we can look at it later.
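For the plain-logging option, a sketch with the standard logging module might look like the following. The logger name, the run_handled helper, the dict standing in for the screenshot row, and the FAILED status value are assumptions; in the worker, `handled` would be the webdriver exception classes already being caught.

```python
import logging

logger = logging.getLogger("screenshots.worker")  # assumed logger name


def run_handled(job, shot, handled=(Exception,)):
    """Run job(shot); log handled failures instead of reporting them.

    `shot` is a dict standing in for the screenshot row. On a handled
    exception the error is logged, the status is set to FAILED with
    an explanation, and the worker carries on.
    """
    try:
        job(shot)
        shot["status"] = "SUCCESS"
    except handled as exc:
        logger.warning("screenshot failed for %s: %s", shot.get("url"), exc)
        shot["status"] = "FAILED"
        shot["error"] = str(exc)
    return shot
```

With no extra configuration, warnings from this logger go to STDERR, which Docker already collects.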
override the save() method on the model to force lowercase
then perhaps for the sake of learning make a custom migration which does the same
There should be a weekly cronjob that will delete all original files in the images folder that do not have a corresponding entry in the database. Those files should be considered deleted.
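The core of such a job could be as small as this. The function name is hypothetical, it assumes a flat images folder, and `known_names` stands in for the set of file names the database still references.

```python
from pathlib import Path


def delete_orphans(image_root, known_names):
    """Delete files in image_root that the database no longer lists.

    Only looks at files directly inside image_root (no recursion).
    Returns the sorted names of the files that were removed.
    """
    removed = []
    for path in Path(image_root).iterdir():
        if path.is_file() and path.name not in known_names:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Wrapped in a management command, this could be scheduled from cron once a week.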
build a REST endpoint to get a screenshot.
it receives a POST to the endpoint
JSON format
the request payload as POST
in the response payload
webhook payload as POST
Source page: zerohedge.com
JS to click consent button:
document.getElementsByClassName('qc-cmp-secondary-button')[0].click()
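From the worker, the same click could be issued through Selenium's execute_script before the capture. This is only a sketch: dismiss_consent is a hypothetical helper, and the script adds a length check so pages without the button do not raise.

```python
CONSENT_JS = """
var buttons = document.getElementsByClassName(arguments[0]);
if (buttons.length > 0) { buttons[0].click(); return true; }
return false;
"""


def dismiss_consent(driver, class_name="qc-cmp-secondary-button"):
    """Click the first element with the given class, if there is one.

    `driver` is expected to be a Selenium WebDriver. Returns True
    when a button was found and clicked, False otherwise.
    """
    return bool(driver.execute_script(CONSENT_JS, class_name))
```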
Show a little window with the following: