A Python-based web scraper that recursively crawls documentation websites and compiles them into organized markdown documents. Currently optimized for MDN Web Docs, with support for proper handling of inline code, special characters, and hierarchical document structure.
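The recursive crawl can be pictured as repeated link extraction: fetch a page, collect the documentation links it contains, and queue any that stay inside the section being crawled. This is a minimal stdlib sketch of the link-extraction step only; the actual scraper's parsing library and filtering rules are not shown here, so class and variable names below are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute links that stay inside the section being crawled."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page fragments; resolve relative links against the base URL
        if href and not href.startswith("#"):
            absolute = urljoin(self.base_url, href)
            # Only keep links within the documentation section
            if absolute.startswith(self.base_url):
                self.links.add(absolute)

page = '<a href="/en-US/docs/Web/CSS/color">color</a> <a href="#toc">top</a>'
extractor = LinkExtractor("https://developer.mozilla.org/en-US/docs/Web/CSS")
extractor.feed(page)
```

In this example only the `/color` link survives: the `#toc` fragment is skipped, and any link resolving outside the base URL would be discarded as well.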
- Recursive documentation crawling
- Clean markdown output with proper formatting
- Support for inline code blocks and special characters
- Hierarchical document structure preservation
- Navigation-friendly compiled documentation
- Test mode for limited scraping during development
- Duplicate code block detection and removal
- Source URL preservation
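One way to implement the duplicate code block detection listed above is to hash each block after normalizing whitespace, so trivially reformatted copies are caught too. The function below is a hedged sketch of that idea, not the scraper's actual implementation.

```python
import hashlib

def dedupe_code_blocks(blocks):
    """Drop code blocks whose whitespace-normalized text has been seen before."""
    seen = set()
    unique = []
    for block in blocks:
        # Collapse all runs of whitespace so reformatted copies hash the same
        key = hashlib.sha256(" ".join(block.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(block)
    return unique

blocks = ["a = 1", "a  =  1", "b = 2"]
deduped = dedupe_code_blocks(blocks)
```

Here `"a = 1"` and `"a  =  1"` normalize to the same key, so only the first copy is kept.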
- Python 3.8 or higher
- pip (Python package installer)
It's recommended to use a virtual environment to avoid conflicts with other Python projects:

- Create a virtual environment:

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment:

  ```bash
  # On macOS/Linux:
  source venv/bin/activate

  # On Windows:
  venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

To deactivate the virtual environment when you're done:

```bash
deactivate
```
```bash
python scraper.py <url>
```

Example:

```bash
python scraper.py https://developer.mozilla.org/en-US/docs/Web/CSS
```

For development or testing, use the `--test` flag to limit the number of documents scraped:

```bash
python scraper.py --test <url>
```
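The command-line interface shown above can be modeled with `argparse`: a positional URL plus a boolean `--test` switch. This is a sketch of that interface only; the real scraper may define additional options.

```python
import argparse

def build_parser():
    """Build a CLI parser matching the invocations shown above."""
    parser = argparse.ArgumentParser(
        description="Recursively scrape documentation into markdown."
    )
    parser.add_argument("url", help="Root documentation URL to crawl")
    parser.add_argument(
        "--test",
        action="store_true",
        help="Limit the number of documents scraped (for development)",
    )
    return parser

args = build_parser().parse_args(
    ["--test", "https://developer.mozilla.org/en-US/docs/Web/CSS"]
)
```

With `action="store_true"`, `args.test` is `False` unless the flag is present, so normal runs and test runs share the same code path.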
The scraper organizes its output in the following directory structure:

```
output/
├── docs/                            # Individual markdown files for each documentation page
└── compiled/
    ├── structure.json               # Documentation hierarchy
    └── compiled-documentation.md    # Single navigable document
```
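Since `structure.json` stores the documentation hierarchy, downstream tools can walk it recursively. The shape below (`title`/`url`/`children` keys) is an assumed schema for illustration; the actual file may differ.

```python
import json

# Hypothetical structure.json contents -- the real schema may differ.
structure = json.loads("""
{
  "title": "CSS",
  "url": "https://developer.mozilla.org/en-US/docs/Web/CSS",
  "children": [
    {"title": "color",
     "url": "https://developer.mozilla.org/en-US/docs/Web/CSS/color",
     "children": []}
  ]
}
""")

def count_pages(node):
    """Count every page in the hierarchy, including the root."""
    return 1 + sum(count_pages(child) for child in node["children"])
```

In this two-page example, `count_pages(structure)` returns 2: the root plus one child.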
Individual documentation files in `output/docs/` include:

- Clean markdown formatting
- Preserved source URLs
- Well-organized sections
- Proper handling of inline code and links

The compiled document includes:

- A table of contents with proper hierarchy
- HTML anchors for navigation
- Consistent formatting throughout
- Comprehensive coverage of all scraped content
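The table of contents and HTML anchors listed above require deriving a stable anchor id from each heading. Below is a sketch of one common scheme (lowercase, strip punctuation, hyphenate spaces, similar to GitHub's); the scraper's actual slug rules are assumptions here.

```python
import re

def slugify(heading):
    """Turn a heading into an anchor id: lowercase, punctuation stripped,
    whitespace collapsed to single hyphens (an assumed scheme)."""
    slug = heading.lower().strip()
    slug = re.sub(r"[^\w\s-]", "", slug)   # drop punctuation
    return re.sub(r"\s+", "-", slug)       # spaces -> hyphens

def toc_entry(heading, level):
    """One table-of-contents line linking to the heading's anchor."""
    indent = "  " * (level - 1)
    return f"{indent}- [{heading}](#{slugify(heading)})"
```

For example, `toc_entry("Usage", 2)` yields an indented list item linking to `#usage`, so nested headings produce the hierarchical table of contents described above.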
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.