How to download all AWS Whitepapers
AWS Whitepapers Downloader
I recently created a new GitHub repository that simplifies the process of fetching the latest AWS whitepapers for local use.
You can find it here: aws-whitepapers-downloader.
This tool is particularly handy if you’re gathering a large collection of AWS documentation, whether for study, internal use, or even training a large language model (LLM).
The core challenge it solves is the JavaScript-heavy AWS whitepapers page. Static HTML parsers (like BeautifulSoup) only see the initial page source, so they miss links that are added after the page loads. By using Selenium, the script behaves like a real browser: it navigates through the site, identifies links rendered by JavaScript, and then downloads the corresponding PDFs.
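To make the idea concrete, here is a minimal sketch of collecting PDF links from a JavaScript-rendered page with Selenium. It is not the repository's actual code: the entry URL, the `a[href*='.pdf']` selector, the 15-second timeout, and the function names `pdf_links` and `collect_pdf_links` are all assumptions for illustration, and it handles a single page only, with no pagination.

```python
from urllib.parse import urlparse

WHITEPAPERS_URL = "https://aws.amazon.com/whitepapers/"  # assumed entry page


def pdf_links(hrefs):
    """Keep unique http(s) links whose path ends in .pdf, preserving order."""
    seen, result = set(), []
    for href in hrefs:
        if not href:
            continue
        parsed = urlparse(href)
        is_pdf = parsed.path.lower().endswith(".pdf")
        if parsed.scheme in ("http", "https") and is_pdf and href not in seen:
            seen.add(href)
            result.append(href)
    return result


def collect_pdf_links(url=WHITEPAPERS_URL, timeout=15):
    """Render one page in headless Chrome and return the PDF links on it."""
    # Imported lazily so pdf_links() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    # Recent Chrome versions accept "--headless=new"; older ones use "--headless".
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumed selector: wait until at least one PDF anchor has been rendered.
        # Raises TimeoutException if none appears within `timeout` seconds.
        anchors = WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, "a[href*='.pdf']")
        )
        hrefs = [anchor.get_attribute("href") for anchor in anchors]
    finally:
        driver.quit()  # always release the browser, even on errors
    return pdf_links(hrefs)
```

Calling `collect_pdf_links()` needs Chrome and a matching chromedriver on the machine. Waiting on an explicit condition is used here instead of a fixed sleep so that the script continues as soon as the links exist.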
Why Build This?
- LLM Training: Having an extensive set of AWS whitepapers is valuable for providing a model with real-world examples of cloud infrastructure, security, migration best practices, and more.
- Organizing AWS Whitepapers: The script automatically sorts the downloaded PDFs into categories (like “security,” “compute,” and “databases”), allowing you to maintain a tidy library of reference materials.
How It Works
- Scraping with Selenium
  - The script launches a headless Chrome browser and navigates to the main AWS whitepapers page.
  - It waits a few seconds to ensure all JavaScript-driven elements have fully loaded.
  - It then iterates through multiple pages to collect all the PDF links.
- Link Validation
  - Only valid PDF URLs are added to the download list, ignoring any malformed or suspicious links.
- Categorizing Files
  - Each PDF URL is checked against a list of predefined keywords (specified in categories.py).
  - Based on these keywords, PDFs are placed into corresponding subfolders like “security,” “analytics,” or “misc.”
- Downloading PDFs
  - A folder named aws_whitepapers (by default) is created to store all files.
  - The script uses Python’s requests library to handle the downloads.
  - Any errors encountered (e.g., 403 or 404) are recorded, but the script continues with the next file.
- Final Summary
  - Once everything is complete, it presents a brief report showing how many files landed in each category.
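The validation, categorization, and summary steps above can be sketched roughly as follows. The keyword map is hypothetical (the real one lives in categories.py and may look quite different), and the helper names `is_valid_pdf_url`, `categorize`, `plan_downloads`, and `summarize` are illustrative rather than taken from the repository.

```python
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical keyword map; the repository's categories.py may differ.
CATEGORIES = {
    "security": ["security", "encryption", "compliance"],
    "databases": ["database", "dynamodb", "aurora"],
    "analytics": ["analytics", "redshift", "big-data"],
}


def is_valid_pdf_url(url):
    """Accept only http(s) URLs with a host and a path ending in .pdf."""
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".pdf")
    )


def categorize(url, categories=CATEGORIES, default="misc"):
    """Return the first category whose keyword appears in the file name."""
    filename = Path(urlparse(url).path).name.lower()
    for category, keywords in categories.items():
        if any(keyword in filename for keyword in keywords):
            return category
    return default


def plan_downloads(urls, root="aws_whitepapers"):
    """Map each valid PDF URL to <root>/<category>/<filename>."""
    plan = {}
    for url in urls:
        if not is_valid_pdf_url(url):
            continue  # skip malformed or non-PDF links
        filename = Path(urlparse(url).path).name
        plan[url] = Path(root) / categorize(url) / filename
    return plan


def summarize(plan):
    """Count how many files are destined for each category folder."""
    return Counter(path.parent.name for path in plan.values())
```

Matching on the file name rather than the full URL avoids false hits from directory names such as `whitepapers`. Plain substring matching is still crude, so short keywords need care.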
Example Usage
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 downloader.py
As soon as you run the script, it will open a headless browser, step through each page of whitepapers, and download the relevant PDFs. Each file is deposited into its designated category folder. If a file is blocked or missing, the script logs the issue and moves on.
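The "log the issue and move on" behavior could look something like the sketch below. It assumes a `plan` dictionary mapping each URL to a target path, and the `download_all` name and the injectable `get` parameter are illustrative choices, not the repository's API. By default it falls back to `requests.get`.

```python
from pathlib import Path


def download_all(plan, get=None, timeout=30):
    """Download every URL in `plan` (url -> target path).

    Returns (saved, failed): the paths written and (url, error) pairs.
    """
    if get is None:
        import requests  # imported lazily so a stub can be passed in instead
        get = requests.get

    saved, failed = [], []
    for url, target in plan.items():
        target = Path(target)
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # turns 403/404 into an exception
        except OSError as exc:
            # requests' exceptions derive from OSError (IOError), so HTTP
            # errors and connection problems both end up here.
            failed.append((url, str(exc)))
            continue  # record the failure and move on to the next file
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(response.content)
        saved.append(target)
    return saved, failed
```

Passing the fetch function in as a parameter keeps the loop easy to exercise without network access. In normal use you would call `download_all(plan)` and print the `failed` list at the end.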
Updating Chrome Driver
Depending on your operating system, you may need the latest version of chromedriver. On Ubuntu/Debian-based systems, you can use:
sudo rm /usr/local/bin/chromedriver
sudo apt-get update
sudo apt-get install chromium-chromedriver
Make sure that the versions of Chrome and chromedriver match to avoid any compatibility issues.
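One way to check for a mismatch is to compare the major versions reported by each binary's `--version` flag. The following is a small sketch, not part of the repository. The command names are assumptions that depend on how Chrome was installed (for example `google-chrome`, `chromium`, or `chromium-browser`), and `major_version` and `versions_match` are hypothetical helpers.

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text like 'ChromeDriver 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(browser_cmd="google-chrome", driver_cmd="chromedriver"):
    """Return True if browser and driver report the same major version."""

    def run(cmd):
        # Raises FileNotFoundError if the command is not on PATH.
        result = subprocess.run(
            [cmd, "--version"], capture_output=True, text=True, check=True
        )
        return result.stdout

    browser = major_version(run(browser_cmd))
    driver = major_version(run(driver_cmd))
    return browser is not None and browser == driver
```

For a Chromium install you might call `versions_match(browser_cmd="chromium-browser")`, adjusting the name to whatever binary is on your PATH.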
Final Thoughts
If you’re looking to build a thorough AWS reference library (whether for exam prep, team training, or creating a dataset for an LLM), this downloader offers an efficient, organized approach. Because Selenium renders the page like a real browser, PDF links added via JavaScript are picked up as well, and the built-in categorization system helps keep everything orderly.
Feel free to fork the repository, add your own features, or just run it as-is whenever you need an up-to-date collection of AWS whitepapers. Give it a try and streamline your reference-gathering process!