A Python-based web scraper that recursively crawls documentation websites and compiles them into organized markdown documents. Currently optimized for MDN Web Docs, with support for proper handling of inline code, special characters, and hierarchical document structure.
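The recursive crawl can be pictured as repeated link extraction: fetch a page, collect the documentation links it contains, and queue any that stay inside the section being crawled. This is a minimal stdlib sketch of the link-extraction step only; the actual scraper's parsing library and filtering rules are not shown here, so class and variable names below are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute links that stay inside the section being crawled."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page fragments; resolve relative links against the base URL
        if href and not href.startswith("#"):
            absolute = urljoin(self.base_url, href)
            # Only keep links within the documentation section
            if absolute.startswith(self.base_url):
                self.links.add(absolute)

page = '<a href="/en-US/docs/Web/CSS/color">color</a> <a href="#toc">top</a>'
extractor = LinkExtractor("https://developer.mozilla.org/en-US/docs/Web/CSS")
extractor.feed(page)
```

In this example only the `/color` link survives: the `#toc` fragment is skipped, and any link resolving outside the base URL would be discarded as well.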
- Recursive documentation crawling
- Clean markdown output with proper formatting
- Support for inline code blocks and special characters
- Hierarchical document structure preservation
- Navigation-friendly compiled documentation
- Test mode for limited scraping during development
- Duplicate code block detection and removal
- Source URL preservation
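One way to implement the duplicate code block detection listed above is to hash each block after normalizing whitespace, so trivially reformatted copies are caught too. The function below is a hedged sketch of that idea, not the scraper's actual implementation.

```python
import hashlib

def dedupe_code_blocks(blocks):
    """Drop code blocks whose whitespace-normalized text has been seen before."""
    seen = set()
    unique = []
    for block in blocks:
        # Collapse all runs of whitespace so reformatted copies hash the same
        key = hashlib.sha256(" ".join(block.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(block)
    return unique

blocks = ["a = 1", "a  =  1", "b = 2"]
deduped = dedupe_code_blocks(blocks)
```

Here `"a = 1"` and `"a  =  1"` normalize to the same key, so only the first copy is kept.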
- Python 3.8 or higher
- pip (Python package installer)
It's recommended to use a virtual environment to avoid conflicts with other Python projects:

- Create a virtual environment:

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment:

  ```bash
  # On macOS/Linux:
  source venv/bin/activate

  # On Windows:
  venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

To deactivate the virtual environment when you're done:

```bash
deactivate
```
```bash
python scraper.py <url>
```

Example:

```bash
python scraper.py https://developer.mozilla.org/en-US/docs/Web/CSS
```

For development or testing, use the `--test` flag to limit the number of documents scraped:

```bash
python scraper.py --test <url>
```
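The command-line interface shown above can be modeled with `argparse`: a positional URL plus a boolean `--test` switch. This is a sketch of that interface only; the real scraper may define additional options.

```python
import argparse

def build_parser():
    """Build a CLI parser matching the invocations shown above."""
    parser = argparse.ArgumentParser(
        description="Recursively scrape documentation into markdown."
    )
    parser.add_argument("url", help="Root documentation URL to crawl")
    parser.add_argument(
        "--test",
        action="store_true",
        help="Limit the number of documents scraped (for development)",
    )
    return parser

args = build_parser().parse_args(
    ["--test", "https://developer.mozilla.org/en-US/docs/Web/CSS"]
)
```

With `action="store_true"`, `args.test` is `False` unless the flag is present, so normal runs and test runs share the same code path.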
The scraper organizes its output in the following directory structure:

```
output/
├── docs/                            # Individual markdown files for each documentation page
└── compiled/
    ├── structure.json               # Documentation hierarchy
    └── compiled-documentation.md    # Single navigable document
```
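Since `structure.json` stores the documentation hierarchy, downstream tools can walk it recursively. The shape below (`title`/`url`/`children` keys) is an assumed schema for illustration; the actual file may differ.

```python
import json

# Hypothetical structure.json contents -- the real schema may differ.
structure = json.loads("""
{
  "title": "CSS",
  "url": "https://developer.mozilla.org/en-US/docs/Web/CSS",
  "children": [
    {"title": "color",
     "url": "https://developer.mozilla.org/en-US/docs/Web/CSS/color",
     "children": []}
  ]
}
""")

def count_pages(node):
    """Count every page in the hierarchy, including the root."""
    return 1 + sum(count_pages(child) for child in node["children"])
```

In this two-page example, `count_pages(structure)` returns 2: the root plus one child.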
Individual documentation files in `output/docs/` include:

- Clean markdown formatting
- Preserved source URLs
- Well-organized sections
- Proper handling of inline code and links

The compiled document includes:

- A table of contents with proper hierarchy
- HTML anchors for navigation
- Consistent formatting throughout
- Comprehensive coverage of all scraped content
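The table of contents and HTML anchors listed above require deriving a stable anchor id from each heading. Below is a sketch of one common scheme (lowercase, strip punctuation, hyphenate spaces, similar to GitHub's); the scraper's actual slug rules are assumptions here.

```python
import re

def slugify(heading):
    """Turn a heading into an anchor id: lowercase, punctuation stripped,
    whitespace collapsed to single hyphens (an assumed scheme)."""
    slug = heading.lower().strip()
    slug = re.sub(r"[^\w\s-]", "", slug)   # drop punctuation
    return re.sub(r"\s+", "-", slug)       # spaces -> hyphens

def toc_entry(heading, level):
    """One table-of-contents line linking to the heading's anchor."""
    indent = "  " * (level - 1)
    return f"{indent}- [{heading}](#{slugify(heading)})"
```

For example, `toc_entry("Usage", 2)` yields an indented list item linking to `#usage`, so nested headings produce the hierarchical table of contents described above.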
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.