<project_overview> A Python tool that converts HTML documentation (particularly from MadCap Flare) to Markdown format while preserving folder structure and centralizing images with intelligent deduplication. </project_overview>
<conversion_rules>
- Input: HTML files (`.html`, `.htm`, `.xhtml`)
- Output: Markdown files (`.md`)
- Directory Structure: Preserved except for images
- Image Handling: Centralized in `static/img/{productname}` directory
- Filename Convention: All lowercase with underscores replacing spaces
- Path References: Absolute paths from parent output directory
</conversion_rules>
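As a rough illustration of the input, output, and filename rules, the sketch below maps one HTML source file to its Markdown destination. The function names are hypothetical, and it assumes that only the file name is normalized while folder names are kept as they are; the tool itself may behave differently.

```python
from pathlib import Path
from typing import Optional

# Input extensions listed in the conversion rules.
HTML_SUFFIXES = {".html", ".htm", ".xhtml"}


def normalize_name(name: str) -> str:
    """Lowercase a file name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def markdown_target(html_path: Path, input_dir: Path, output_dir: Path) -> Optional[Path]:
    """Return the .md destination for an HTML file, or None for other files.

    Assumption: folders keep their original names; only the file name
    is normalized.
    """
    if html_path.suffix.lower() not in HTML_SUFFIXES:
        return None
    relative = html_path.relative_to(input_dir)
    md_name = normalize_name(relative.with_suffix(".md").name)
    return output_dir / relative.parent / md_name
```

For example, `Product1/guide/Getting Started.htm` under the input directory would map to `Product1/guide/getting_started.md` under the output directory.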
<setup_instructions>
# 1. Clone repository
git clone [repository_url]
# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip install beautifulsoup4 markdownify
</setup_instructions>
<usage_examples>
python app.py /path/to/html/docs /path/to/output
</usage_examples>
<output_structure>
output/                          # Specified output directory
├── Product1/                    # Markdown files (structure preserved)
│   ├── guide/
│   │   └── intro.md
│   └── api/
│       └── reference.md
└── Product2/
    └── docs/
        └── overview.md
static/                          # Parallel to output directory
└── img/                         # Centralized images (not 'images')
    ├── image-manifest.json      # Deduplication tracking
    ├── Product1/
    │   ├── guide/
    │   │   └── screenshot.png
    │   └── api/
    │       └── diagram.png
    └── Product2/
        └── docs/
            └── logo.png
</output_structure>
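The layout above suggests that an image keeps its path relative to the input directory but is moved under `static/img/`, next to the output directory. A minimal sketch of that mapping follows. Both function names are hypothetical, and the leading `/` in the Markdown reference is an assumed reading of "absolute paths from parent output directory".

```python
from pathlib import Path, PurePosixPath


def image_destination(image_path: Path, input_dir: Path, output_dir: Path) -> Path:
    """Place an image under static/img/, a sibling of the output directory."""
    relative = image_path.relative_to(input_dir)
    return output_dir.parent / "static" / "img" / relative


def markdown_image_ref(image_path: Path, input_dir: Path) -> str:
    """Build the reference written into Markdown, using forward slashes."""
    relative = image_path.relative_to(input_dir)
    return "/" + PurePosixPath("static", "img", *relative.parts).as_posix()
```

With an input root of `/docs` and an output directory of `/site/output`, the image `/docs/Product1/guide/screenshot.png` would be copied to `/site/static/img/Product1/guide/screenshot.png` and referenced as `/static/img/Product1/guide/screenshot.png`.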
<processing_phases>
- Scan for images and build reference map
- Create anchor mappings for cross-references
- Build deduplication hash table
</processing_phases>
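The deduplication step listed above could be sketched as a table keyed by content hash. This is only an illustration: the choice of SHA-256, the set of image extensions, and the function names are assumptions, not details confirmed by the project.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

# Assumed set of image extensions; the real tool may recognize others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def file_digest(path: Path) -> str:
    """Hash a file's contents in chunks so large images stay cheap in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_dedup_table(input_dir: Path) -> Dict[str, List[Path]]:
    """Group image files under input_dir by content hash."""
    table: Dict[str, List[Path]] = {}
    for path in sorted(input_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            table.setdefault(file_digest(path), []).append(path)
    return table
```

Any hash that maps to more than one path marks a set of byte-identical images, so only one copy needs to be written to `static/img/`.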
<error_scenarios>
- Log warning but continue processing
- Record in image-manifest.json
- Preserve image reference in Markdown
</error_scenarios>
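The three steps above might be combined as shown below. The source does not say which failure they apply to, so this sketch assumes a referenced image that cannot be found on disk. The logger name, the `missing` manifest key, and the function names are all hypothetical.

```python
import json
import logging
import shutil
from pathlib import Path

logger = logging.getLogger("html2md")  # hypothetical logger name


def copy_or_record_missing(src: Path, dest: Path, manifest: dict) -> bool:
    """Copy an image, or warn and record it when the source is missing.

    Returns False for a missing image so the caller can leave the
    original reference in the Markdown untouched.
    """
    if not src.is_file():
        logger.warning("Image not found, keeping reference: %s", src)
        # "missing" is an assumed manifest key, not the tool's real schema.
        manifest.setdefault("missing", []).append(str(src))
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return True


def save_manifest(manifest: dict, manifest_path: Path) -> None:
    """Write the manifest as JSON, creating parent folders if needed."""
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Returning a boolean instead of raising keeps a single missing image from stopping the whole conversion run.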
<cli_options>
| Option | Type | Description | Default |
|---|---|---|---|
| `input_dir` | Required | Source HTML directory | - |
| `output_dir` | Required | Destination for Markdown | - |
| `--verbose`, `-v` | Flag | Show detailed progress | False |
| `--overwrite` | Flag | Overwrite existing files | False |
| `--skip-images` | Flag | Convert without copying images | False |
</cli_options>
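The options in the table map naturally onto `argparse`. The sketch below shows one way `app.py` could declare them. The `build_parser` name and the description text are assumptions, and the actual script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two positional arguments and three flags from the table."""
    parser = argparse.ArgumentParser(
        prog="app.py",
        description="Convert HTML documentation to Markdown.",
    )
    parser.add_argument("input_dir", help="Source HTML directory")
    parser.add_argument("output_dir", help="Destination for Markdown")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show detailed progress")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing files")
    parser.add_argument("--skip-images", action="store_true",
                        help="Convert without copying images")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Because every flag uses `store_true`, each one defaults to `False`, which matches the Default column.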