How to download all AWS Whitepapers

AWS Whitepapers Downloader

I have published a new repository on GitHub: https://github.com/KasteM34/aws-whitepapers-downloader

This repository automates the task of fetching the latest AWS whitepapers for local use, especially useful if you’re building a dataset to train a large language model (LLM). It addresses a specific challenge: the AWS whitepapers page is JavaScript-heavy, so traditional HTML parsing tools like BeautifulSoup can’t easily detect all the required PDF links. By leveraging Selenium, the script navigates the site as a real browser would, discovers dynamic links, and downloads them.
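To make the Selenium idea concrete, here is a minimal sketch of collecting PDF links from a JavaScript-rendered page. It is not the repository's actual code: the function names `make_headless_chrome` and `collect_pdf_links` are hypothetical, the page URL is supplied by the caller, and pagination is left out because the right selector for the "next page" control is not something this post specifies.

```python
import time


def make_headless_chrome():
    """Create a headless Chrome driver (needs selenium and a matching chromedriver)."""
    # Imported lazily so the link-collection logic below can be read or
    # exercised with a stand-in driver when Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # older Chrome builds use "--headless"
    return webdriver.Chrome(options=options)


def collect_pdf_links(driver, url, wait_seconds=5):
    """Load `url`, wait for scripts to render, and return unique .pdf hrefs in page order."""
    driver.get(url)
    time.sleep(wait_seconds)  # simple fixed wait for JavaScript-driven elements
    links = []
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    for anchor in driver.find_elements("css selector", "a[href]"):
        href = anchor.get_attribute("href") or ""
        # Ignore any query string when checking the extension.
        if href.lower().split("?")[0].endswith(".pdf") and href not in links:
            links.append(href)
    return links
```

In practice you would call `collect_pdf_links(make_headless_chrome(), page_url)` and call `driver.quit()` when finished. A fixed sleep is the simplest option; an explicit wait on a known element would be more robust if the page structure is known.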

Why Build This?

  1. Training an LLM – Collecting a broad range of AWS whitepapers is valuable for teaching a model about cloud architecture, security, or migration best practices.
  2. Structuring AWS whitepapers – The script sorts downloaded PDFs into category folders (e.g. “security,” “compute,” “databases”), making them easier to browse and reference later.

How It Works

  1. Scrape with Selenium
    • A headless Chrome browser visits the main AWS whitepapers page.
    • It waits a few seconds for the page to load JavaScript-driven elements.
    • It clicks through multiple pages to gather all PDF links.
  2. Link Validation
    • Only valid PDF URLs are added to the list; URLs with short or malformed filenames are skipped.
  3. Categorizing Files
    • Each PDF URL is analyzed against predefined keyword sets (found in categories.py).
    • The script places PDFs into matching subfolders like “security,” “analytics,” or “misc.”
  4. Downloading PDFs
    • The script creates a new directory (named “aws_whitepapers” by default).
    • It uses Python’s “requests” library to download the PDFs to each category folder.
    • HTTP errors (such as 403 or 404) are caught and reported.
  5. Final Summary
    • After completion, it displays the number of files in each category.
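The validation and categorization steps above can be sketched roughly as follows. Everything here is illustrative: the keyword sets in `CATEGORIES` are made up (the real ones live in the repository's categories.py), `MIN_STEM_LENGTH` is an assumed threshold for what counts as a "short" filename, and the function names are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical keyword sets, for illustration only.
CATEGORIES = {
    "security": ["security", "encryption", "compliance"],
    "databases": ["database", "dynamodb", "aurora"],
    "analytics": ["analytics", "redshift"],
}

MIN_STEM_LENGTH = 3  # assumed cutoff for "short" filenames


def pdf_filename(url):
    """Return the PDF filename in `url`, or None if the link is not a usable PDF URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    name = posixpath.basename(parsed.path)
    stem, ext = posixpath.splitext(name)
    if ext.lower() != ".pdf" or len(stem) < MIN_STEM_LENGTH:
        return None
    return name


def categorize(url):
    """Return the first category with a keyword in the URL, else "misc"."""
    lowered = url.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "misc"


def group_by_category(urls):
    """Drop invalid links and group the rest as {category: [url, ...]}."""
    grouped = {}
    for url in urls:
        if pdf_filename(url) is None:
            continue
        grouped.setdefault(categorize(url), []).append(url)
    return grouped
```

Plain substring matching is easy to extend, but short keywords can match unrelated words, so longer, more specific keywords tend to give cleaner folders.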

Example Usage

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 downloader.py

Depending on your operating system, you may need to install the latest version of ChromeDriver. On Debian or Ubuntu, for example:

sudo rm /usr/local/bin/chromedriver
sudo apt-get update
sudo apt-get install chromium-chromedriver

The script launches a headless browser, cycles through the pages, downloads the PDFs it finds, and organizes them by category. If a PDF is restricted or fails to download, the script logs an error and continues with the next file.
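The "log the error and keep going" behaviour might look something like the sketch below. It is an assumption-laden illustration rather than the repository's code: `download_all` and `fetch_with_requests` are hypothetical names, and the `fetch` parameter exists only so the loop can be tried without network access.

```python
import os


def fetch_with_requests(url, timeout=30):
    """Download `url` and return its bytes; raises on HTTP errors such as 403 or 404."""
    import requests  # third-party; imported lazily so the loop can run with another fetcher

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def download_all(urls_by_category, base_dir="aws_whitepapers", fetch=fetch_with_requests):
    """Save each PDF under base_dir/<category>/ and return (saved_counts, failed_urls)."""
    counts, failures = {}, []
    for category, urls in urls_by_category.items():
        folder = os.path.join(base_dir, category)
        os.makedirs(folder, exist_ok=True)
        for url in urls:
            # Use the last path segment (without any query string) as the filename.
            filename = url.split("?")[0].rstrip("/").rsplit("/", 1)[-1]
            try:
                data = fetch(url)
            except Exception as error:
                # Report the failure and move on to the next file.
                print(f"Failed: {url} ({error})")
                failures.append(url)
                continue
            with open(os.path.join(folder, filename), "wb") as handle:
                handle.write(data)
            counts[category] = counts.get(category, 0) + 1
    return counts, failures
```

The returned `counts` dictionary doubles as the per-category summary, and `failures` gives a list of URLs to retry later.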

Final Thoughts

If you need a comprehensive set of AWS whitepapers, whether for study, internal training, archiving, or LLM development, this script streamlines the entire process. Using Selenium lets it capture the links that only appear once JavaScript has run, and the categorization keeps everything neatly organized. Fork it, extend it, or just run it as-is whenever you need an up-to-date AWS reference library. Happy downloading!