Some documents serve blurred page images from CloudFront (e.g. https://d3tvd1u91rr79.cloudfront.net/a2aa9a4720ce8e5692f963a70e3cfcc9/html/pages/blurred/page3.webp...). The blurring can be bypassed simply by removing the `/blurred/` segment from the CloudFront URL.
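As a quick illustration, the bypass is just a path substitution. The URL below is the example from above; `unblur_url` is a hypothetical helper name, not part of StuHack:

```python
# Sketch of the bypass: drop the "/blurred/" path segment from a CloudFront
# page-image URL. unblur_url is a hypothetical helper for illustration only.
def unblur_url(url: str) -> str:
    return url.replace("/pages/blurred/", "/pages/")

blurred = ("https://d3tvd1u91rr79.cloudfront.net/"
           "a2aa9a4720ce8e5692f963a70e3cfcc9/html/pages/blurred/page3.webp")
print(unblur_url(blurred))
# https://d3tvd1u91rr79.cloudfront.net/a2aa9a4720ce8e5692f963a70e3cfcc9/html/pages/page3.webp
```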
This means the StuHack extension does not work for these documents: StuHack currently relies on the blurring being applied client-side with the CSS style `filter: blur(2px)` and simply undoes that filter. I wrote a Python script that, given a document URL, saves the document as a PDF by identifying the CloudFront CDN URL and constructing the non-blurred URL for each page.
```python
import re
from io import BytesIO
from typing import List, Tuple

import requests
from PIL import Image

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
    "AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9"
)


def get_total_pages(document_link: str) -> int:
    """Count pages via the highest data-page-no attribute (hex) in the page HTML."""
    r = requests.get(document_link, headers={"User-Agent": USER_AGENT})
    matches = re.findall(r'data-page-no="([0-9a-f]+)"', r.text, re.IGNORECASE)
    return max(int(match, 16) for match in matches)


def get_image_info(document_link: str) -> Tuple[str, str, str]:
    """Extract the CloudFront subdomain, document identifier, and signed policy."""
    r = requests.get(document_link, headers={"User-Agent": USER_AGENT})
    regex = (
        r'src="https://(?P<subdomain>[\w.-]+)\.cloudfront\.net/'
        r'(?P<identifier>[\w.-]+)/html/bg1\.png\?Policy=(?P<policy>[^"]+)"'
    )
    match = re.search(regex, r.text, re.IGNORECASE)
    if not match:
        raise ValueError("Image information not found in the document link.")
    # The policy string is HTML-escaped in the page source; unescape the ampersands.
    policy = match.group("policy").replace("&amp;", "&")
    return match.group("subdomain"), match.group("identifier"), policy


def download_images(subdomain: str, identifier: str, policy: str,
                    total_pages: int) -> List[Image.Image]:
    images = []
    for i in range(1, total_pages + 1):
        # Non-blurred path: pages/page{i}.webp instead of pages/blurred/page{i}.webp
        url = (f"https://{subdomain}.cloudfront.net/{identifier}"
               f"/html/pages/page{i}.webp?Policy={policy}")
        r = requests.get(url, allow_redirects=True)
        try:
            images.append(Image.open(BytesIO(r.content)))
        except Exception as e:
            raise ValueError(
                f"Failed to download page {i}. Looks like this document is using "
                "client-side blurring -- try the StuHack Chrome extension instead."
            ) from e
        print(f"[~] Downloaded page {i} of {total_pages}")
    return images


def save_document(output_file: str, images: List[Image.Image]) -> None:
    if not images:
        raise ValueError("No images found.")
    # Pillow's PDF writer requires RGB images, so convert before saving.
    pages = [img.convert("RGB") for img in images]
    pages[0].save(output_file, save_all=True, append_images=pages[1:])
    print(f"[+] Document saved to {output_file}")


def main() -> None:
    document_link = input("Enter the link to the document: ")
    if "studocu" not in document_link:
        print("Invalid link")
        return
    output_file = input("Enter the output file name (e.g., output.pdf): ")
    try:
        total_pages = get_total_pages(document_link)
        print(f"[+] Identified {total_pages} pages in the document.")
        subdomain, identifier, policy = get_image_info(document_link)
        images = download_images(subdomain, identifier, policy, total_pages)
        save_document(output_file, images)
    except Exception as e:
        print(f"[-] An error occurred: {e}")


if __name__ == "__main__":
    main()
```
You'll need to install requests and Pillow (`typing` is part of the standard library, so it doesn't need installing):

```
pip install requests pillow
```
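If you want to sanity-check the two regexes without hitting the network, here's a quick sketch against a made-up HTML snippet (the markup and Policy value are illustrative, not real StuDocu page source):

```python
import re

# Minimal fake page source; the attribute/URL shapes mirror what the script expects.
sample = '''
<div data-page-no="1"></div>
<div data-page-no="a"></div>
<img src="https://d3tvd1u91rr79.cloudfront.net/a2aa9a4720ce8e5692f963a70e3cfcc9/html/bg1.png?Policy=abc&amp;Signature=xyz">
'''

# Page count: data-page-no values are parsed as hex, so "a" -> 10.
pages = max(int(m, 16)
            for m in re.findall(r'data-page-no="([0-9a-f]+)"', sample, re.IGNORECASE))
print(pages)  # 10

img_re = (r'src="https://(?P<subdomain>[\w.-]+)\.cloudfront\.net/'
          r'(?P<identifier>[\w.-]+)/html/bg1\.png\?Policy=(?P<policy>[^"]+)"')
m = re.search(img_re, sample, re.IGNORECASE)
print(m.group("identifier"))                    # a2aa9a4720ce8e5692f963a70e3cfcc9
print(m.group("policy").replace("&amp;", "&"))  # abc&Signature=xyz
```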
Thought I would leave this here for whoever needs it, since I know there are a few open issues about the extension not working.
TL;DR
- Some documents blur images on the backend rather than with the client-side `filter: blur(2px)`
- Try the StuHack extension first. If it doesn't work, the document is probably using backend-blurred images
- If the extension doesn't work, try the Python script above
I was thinking of making a PR to integrate this into StuHack, but my JavaScript knowledge is pretty lacking. I also don't know my way around a lot of the Web API stuff, and I noticed StuDocu does things like re-adding `filter: blur(2px)` whenever it gets removed. Figured some folks smarter than me could implement my logic into the extension so it supports both types of documents.