How to download all AWS Whitepapers

AWS Whitepapers Downloader

I recently created a new GitHub repository that simplifies the process of fetching the latest AWS whitepapers for local use.

You can find it here: aws-whitepapers-downloader.

This tool is particularly handy if you’re gathering a large collection of AWS documentation, whether for study, internal use, or even training a large language model (LLM).

The core challenge it solves is the JavaScript-heavy AWS whitepapers page. Conventional HTML parsers such as BeautifulSoup only see the initial HTML, before any scripts run, so they can miss links that are rendered later. By using Selenium, the script behaves like a real browser: it loads each page, lets the JavaScript run, picks up the rendered links, and then downloads the corresponding PDFs.
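To make that concrete, here is a minimal sketch of the idea rather than the repository's actual code. The function names, the fixed wait, and the listing URL in the trailing comment are my own assumptions for illustration. It asks the browser for every rendered anchor and keeps the unique ones that point at a PDF.

```python
import time


def make_headless_chrome():
    """Start a headless Chrome session (needs selenium and a matching chromedriver)."""
    # Imported lazily so the link-collecting helper below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # headless flag for recent Chrome builds
    return webdriver.Chrome(options=options)


def rendered_pdf_links(driver, url, wait_seconds=5):
    """Load url, give JavaScript time to render, and return unique PDF hrefs in page order."""
    driver.get(url)
    time.sleep(wait_seconds)  # crude fixed wait; an explicit WebDriverWait is an alternative
    hrefs = driver.execute_script(
        "return Array.from(document.querySelectorAll('a[href]')).map(a => a.href);"
    )
    seen, links = set(), []
    for href in hrefs or []:
        # Ignore any query string when checking the extension.
        if href.lower().split("?")[0].endswith(".pdf") and href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Example usage (assumed listing URL; pagination is left out of this sketch):
#   driver = make_headless_chrome()
#   try:
#       print(rendered_pdf_links(driver, "https://aws.amazon.com/whitepapers/"))
#   finally:
#       driver.quit()
```

Because `rendered_pdf_links` only relies on `get` and `execute_script`, it can be exercised with a stand-in driver object when no browser is available.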

Why Build This?

  1. LLM Training
    Having an extensive set of AWS whitepapers is valuable for providing a model with real-world examples of cloud infrastructure, security, migration best practices, and more.
  2. Organizing AWS Whitepapers
    The script automatically sorts the downloaded PDFs into categories (like “security,” “compute,” and “databases”), allowing you to maintain a tidy library of reference materials.

How It Works

  1. Scraping with Selenium
    • The script launches a headless Chrome browser and navigates to the main AWS whitepapers page.
    • It waits a few seconds to ensure all JavaScript-driven elements have fully loaded.
    • It then iterates through multiple pages to collect all the PDF links.
  2. Link Validation
    • Only valid PDF URLs are added to the download list; malformed or suspicious links are skipped.
  3. Categorizing Files
    • Each PDF URL is checked against a list of predefined keywords (specified in categories.py).
    • Based on these keywords, PDFs are placed into corresponding subfolders like “security,” “analytics,” or “misc.”
  4. Downloading PDFs
    • A folder named aws_whitepapers (by default) is created to store all files.
    • The script uses Python’s requests library to handle the downloads.
    • Any errors encountered (e.g., 403 or 404) are recorded, but the script continues with the next file.
  5. Final Summary
    • Once everything is complete, it presents a brief report showing how many files landed in each category.
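The validation, categorization, and summary steps above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a copy of the repository: the `CATEGORIES` map is a made-up stand-in for whatever categories.py really contains, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword map; the real one lives in categories.py and may differ.
CATEGORIES = {
    "security": ["security", "encryption"],
    "analytics": ["analytics", "redshift"],
    "databases": ["database", "dynamodb"],
}


def is_valid_pdf_url(url):
    """Accept only http(s) URLs with a host and a path ending in .pdf."""
    parts = urlparse(url)
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and parts.path.lower().endswith(".pdf")
    )


def categorize(url):
    """Return the first category whose keyword appears in the URL path, else 'misc'."""
    path = urlparse(url).path.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in path for keyword in keywords):
            return category
    return "misc"


def summarize(urls):
    """Count valid PDF URLs per category, skipping anything that fails validation."""
    return Counter(categorize(u) for u in urls if is_valid_pdf_url(u))
```

Note that this sketch matches keywords as plain substrings and lets the first matching category win, so short or overlapping keywords can misfile a paper.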

Example Usage

python3 -m venv venv 
source venv/bin/activate
pip3 install -r requirements.txt
python3 downloader.py

As soon as you run the script, it will open a headless browser, step through each page of whitepapers, and download the relevant PDFs. Each file is deposited into its designated category folder. If a file is blocked or missing, the script logs the issue and moves on.
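The "log the issue and move on" behavior might look roughly like the sketch below. The names `fetch_with_requests` and `download_all` are hypothetical, and the timeout value is my own choice; only the use of requests and the aws_whitepapers default folder come from the description above.

```python
from pathlib import Path
from urllib.parse import urlparse


def fetch_with_requests(url):
    """Return the response body, raising on HTTP errors such as 403 or 404."""
    import requests  # third-party; imported lazily so a different fetcher can be swapped in

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def download_all(urls_by_category, root="aws_whitepapers", fetch=fetch_with_requests):
    """Save each PDF under root/<category>/ and return a list of (url, error) failures."""
    failures = []
    for category, urls in urls_by_category.items():
        folder = Path(root) / category
        folder.mkdir(parents=True, exist_ok=True)
        for url in urls:
            filename = Path(urlparse(url).path).name
            try:
                (folder / filename).write_bytes(fetch(url))
            except Exception as exc:  # broad on purpose: record the failure and keep going
                failures.append((url, str(exc)))
    return failures
```

Passing the fetcher in as a parameter keeps the loop easy to try out offline, and the returned failure list can feed the final report.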

Updating Chrome Driver

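Depending on your operating system, you may need to update chromedriver so that it matches your installed browser. On Ubuntu/Debian-based systems, you can use the commands below; the first one removes an older, manually installed chromedriver from /usr/local/bin (if one exists) so that it does not take precedence over the packaged version: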

sudo rm /usr/local/bin/chromedriver
sudo apt-get update
sudo apt-get install chromium-chromedriver

Make sure that the major versions of Chrome (or Chromium) and chromedriver match; a mismatch typically prevents the browser session from starting.
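If you want to check the versions from Python, something like the following could work. It is only a sketch: the helper names are hypothetical, and the binary names in the usage comment are assumptions that vary by system.

```python
import re
import shutil
import subprocess


def major_version(version_output):
    """Extract the leading major version number from a '--version' output string."""
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def installed_major(binary):
    """Return the major version reported by a binary on PATH, or None if it is not found."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return major_version(result.stdout)


# Example usage (binary names are assumptions; adjust for your system):
#   chrome = installed_major("google-chrome") or installed_major("chromium-browser")
#   driver = installed_major("chromedriver")
#   print("match" if chrome is not None and chrome == driver else "mismatch or not found")
```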

Final Thoughts

If you’re looking to build a thorough AWS reference library, whether for exam prep, team training, or creating a dataset for an LLM, this downloader offers an efficient, organized approach. Using Selenium means the script can pick up PDF links that only appear after JavaScript runs, and the built-in categorization helps keep everything orderly.

Feel free to fork the repository, add your own features, or just run it as-is whenever you need an up-to-date collection of AWS whitepapers. Give it a try and streamline your reference-gathering process!