How to download all AWS Whitepapers
AWS Whitepapers Downloader
I have published a new repository on GitHub: https://github.com/KasteM34/aws-whitepapers-downloader
This repository automates the task of fetching the latest AWS whitepapers for local use, especially useful if you’re building a dataset to train a large language model (LLM). It addresses a specific challenge: the AWS whitepapers page is JavaScript-heavy, so traditional HTML parsing tools like BeautifulSoup can’t easily detect all the required PDF links. By leveraging Selenium, the script navigates the site as a real browser would, discovers dynamic links, and downloads them.
Why Build This?
- Training an LLM – Collecting a broad range of AWS whitepapers is valuable for teaching a model about cloud architecture, security, or migration best practices.
- Structuring AWS whitepapers – The script sorts downloaded PDFs into category folders (e.g. “security,” “compute,” “databases”), making them easier to browse and reference later.
How It Works
- Scrape with Selenium
  - A headless Chrome browser visits the main AWS whitepapers page.
  - It waits a few seconds for the page to load JavaScript-driven elements.
  - It clicks through multiple pages to gather all PDF links.
- Link Validation
  - Only valid PDF URLs are added to a list, ignoring short or malformed filenames.
- Categorizing Files
  - Each PDF URL is analyzed against predefined keyword sets (found in categories.py).
  - The script places PDFs into matching subfolders like “security,” “analytics,” or “misc.”
- Downloading PDFs
  - The script creates a new directory (named “aws_whitepapers” by default).
  - It uses Python’s “requests” library to download the PDFs to each category folder.
  - Any errors (like 403 or 404) are caught and will be reported.
- Final Summary
  - After completion, it displays the number of files in each category.
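To make the link validation and categorization steps more concrete, here is a minimal sketch of how they could look. It is not the repository's actual code: the keyword sets, the `MIN_STEM_LENGTH` threshold, and the function names `is_valid_pdf_url` and `categorize` are all assumptions made for illustration. The real keyword sets live in categories.py.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical keyword sets; the real ones are defined in the repository's categories.py.
CATEGORIES = {
    "security": ("security", "iam", "encryption"),
    "analytics": ("analytics", "redshift", "data-lake"),
    "databases": ("database", "dynamodb", "rds"),
}

# Assumed cutoff for what counts as a "short" filename (length without ".pdf").
MIN_STEM_LENGTH = 4


def is_valid_pdf_url(url):
    """Return True for http(s) URLs whose filename ends in .pdf and is not too short."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    name = PurePosixPath(parsed.path).name
    if not name.lower().endswith(".pdf"):
        return False
    return len(name) - len(".pdf") >= MIN_STEM_LENGTH


def categorize(url):
    """Return the first category whose keyword appears in the URL path, else 'misc'."""
    path = urlparse(url).path.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in path for keyword in keywords):
            return category
    return "misc"
```

Plain substring matching is crude (a short keyword can match inside an unrelated word), so a real keyword list would need to be chosen with some care.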
Example Usage
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 downloader.py
Depending on your operating system, you might need to install the latest version of ChromeDriver. On a Debian-based system such as Ubuntu, for example:
sudo rm /usr/local/bin/chromedriver
sudo apt-get update
sudo apt-get install chromium-chromedriver
Watch as it opens a headless browser, cycles through pages, downloads relevant PDFs, and organizes them according to their category. If a PDF is restricted or fails to download, the script logs an error but continues with the next file.
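The "log the error and carry on" behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's real implementation: `download_all`, `fetch_with_requests`, the 30-second timeout, and the shape of the `urls_by_category` mapping are all hypothetical. The fetch step is passed in as a parameter so the loop can be tried out without network access.

```python
from collections import Counter
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def fetch_with_requests(url):
    """Return the body of `url`, raising on HTTP errors such as 403 or 404."""
    import requests  # third-party; imported lazily so the sketch loads without it

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    return response.content


def download_all(urls_by_category, dest="aws_whitepapers", fetch=fetch_with_requests):
    """Save each URL under dest/<category>/ and keep going when one download fails."""
    saved = Counter()
    failed = []
    for category, urls in urls_by_category.items():
        folder = Path(dest) / category
        folder.mkdir(parents=True, exist_ok=True)
        for url in urls:
            filename = PurePosixPath(urlparse(url).path).name
            try:
                (folder / filename).write_bytes(fetch(url))
            except OSError as error:
                # requests' exceptions derive from OSError, so HTTP failures land here.
                print(f"Failed to download {url}: {error}")
                failed.append(url)
                continue
            saved[category] += 1
    return saved, failed
```

The returned `saved` counter holds the number of files written per category, which is enough to print a final summary, and `failed` lists the URLs that were skipped.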
Final Thoughts
If you need a comprehensive set of AWS whitepapers, whether for study, internal training, archiving, or LLM development, this script streamlines the entire process. Relying on Selenium lets the script pick up the JavaScript-populated links that a static HTML parser would miss, and the categorization helps you keep everything neatly organized. Fork it, extend it, or just run it as-is whenever you need an up-to-date AWS reference library. Happy downloading!