HTML to Markdown Converter - Claude Instructions

<project_overview> A Python tool that converts HTML documentation (particularly from MadCap Flare) to Markdown format while preserving folder structure and centralizing images with intelligent deduplication. </project_overview>

Core Functionality

<conversion_rules>

- **Input**: HTML files (.html, .htm, .xhtml)
- **Output**: Markdown files (.md)
- **Directory Structure**: Preserved except for images
- **Image Handling**: Centralized in static/img/{productname} directory
- **Filename Convention**: All lowercase with underscores replacing spaces
- **Path References**: Absolute paths from parent output directory

</conversion_rules>
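As a rough illustration of the filename convention above, the following sketch lowercases a file name, replaces spaces with underscores, and swaps the HTML extension for `.md`. The function names `normalize_filename` and `markdown_relpath` are hypothetical and are not taken from `app.py`. Leaving directory names untouched is an assumption based on the output structure shown in this document.

```python
from pathlib import PurePosixPath


def normalize_filename(name: str) -> str:
    """Lowercase a file name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def markdown_relpath(html_relpath: str) -> str:
    """Map a source-relative HTML path to a Markdown path.

    Assumption: directory names are preserved as-is; only the file
    name is normalized and its extension replaced with .md.
    """
    path = PurePosixPath(html_relpath)
    return str(path.with_name(normalize_filename(path.stem) + ".md"))


example = markdown_relpath("Product1/guide/Getting Started.htm")
```

Here `example` evaluates to `Product1/guide/getting_started.md`. Whether the real tool also normalizes directory names is not stated here, so treat that part as a guess.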

Key Features

- Detects identical images using content hashing
- Stores only one copy of duplicate images
- Tracks usage in `image-manifest.json`
- Updates all internal `.html` links to `.md`
- Maintains anchor links between documents
- Resolves cross-file references automatically
- All images stored in `/static/img/{mirror_doc_directory}`
- One image folder per product
- Only referenced images are copied
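The content-hashing idea behind image deduplication can be sketched as follows. This is a minimal illustration, not the tool's actual implementation. The choice of SHA-256 and the helper names `file_digest` and `group_duplicates` are assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file's bytes in chunks (SHA-256 is an assumed choice)."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_duplicates(image_paths):
    """Group image paths by content digest.

    Returns {digest: [paths]}; the first path in each list can serve
    as the single stored copy, the rest are duplicates of it.
    """
    groups = {}
    for path in image_paths:
        groups.setdefault(file_digest(Path(path)), []).append(str(path))
    return groups
```

Files with identical bytes land in the same group regardless of their names, so only one copy per group needs to be written to the static directory.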

Installation & Setup

<setup_instructions>

# 1. Clone repository
git clone [repository_url]

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install beautifulsoup4 markdownify

</setup_instructions>

Usage

<usage_examples>

```bash
# Basic conversion
python app.py /path/to/html/docs /path/to/output

# Conversion with detailed progress output
python app.py /path/to/html/docs /path/to/output --verbose
```

</usage_examples>

Output Structure

<output_structure>

output/                    # Specified output directory
├── Product1/             # Markdown files (structure preserved)
│   ├── guide/
│   │   └── intro.md
│   └── api/
│       └── reference.md
└── Product2/
    └── docs/
        └── overview.md

static/                   # Parallel to output directory
└── img/                 # Centralized images (not 'images')
    ├── image-manifest.json  # Deduplication tracking
    ├── Product1/
    │   ├── guide/
    │   │   └── screenshot.png
    │   └── api/
    │       └── diagram.png
    └── Product2/
        └── docs/
            └── logo.png

</output_structure>

Implementation Details

<processing_phases>

First pass:

- Scan for images and build reference map
- Create anchor mappings for cross-references
- Build deduplication hash table

Second pass:

- Convert HTML to Markdown
- Update all link references
- Copy unique images to static directory
- Generate image-manifest.json

</processing_phases>
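The link-update step can be sketched as a small function that rewrites a single `href`. This is an illustration under assumptions, not the code in `app.py`. The name `rewrite_href` is hypothetical, and the sketch only swaps the extension while leaving anchors, query strings, and external links untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Extensions treated as HTML, matching the input types listed in this document.
HTML_SUFFIX = re.compile(r"\.(html|htm|xhtml)$", re.IGNORECASE)


def rewrite_href(href: str) -> str:
    """Point an internal HTML link at its Markdown counterpart."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        # External links (https:, mailto:, //host/...) are returned unchanged.
        return href
    new_path = HTML_SUFFIX.sub(".md", parts.path)
    # Reassemble the link, keeping any query string and #anchor.
    return urlunsplit(("", "", new_path, parts.query, parts.fragment))
```

For example, `rewrite_href("guide/intro.htm#setup")` returns `guide/intro.md#setup`. A fuller version would presumably also apply the filename convention and consult the anchor mappings built earlier.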

Critical Requirements

- Never modify source files
- Preserve all internal links
- Handle MadCap Flare-specific HTML structures
- Maintain readable Markdown output
- Optimize image storage through deduplication
- Generate comprehensive image manifest

Error Handling

<error_scenarios>

Missing images:

- Log warning but continue processing
- Record in image-manifest.json
- Preserve image reference in Markdown

Malformed HTML:

- Attempt best-effort conversion
- Log parsing errors with file path
- Continue with next file

File conflicts:

- Check for existing files
- Option to overwrite or skip
- Log conflicts

</error_scenarios>
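The "log and continue" behaviour described above might look roughly like the loop below. It is a sketch under assumptions: `convert_all`, the `convert_one` callback, and the summary dictionary are hypothetical names, and the real tool may structure this differently.

```python
import logging
from pathlib import Path

logger = logging.getLogger("converter")


def convert_all(html_files, convert_one, overwrite=False):
    """Convert each file, logging failures instead of aborting the run.

    convert_one is a hypothetical callback that takes a source path
    and returns (destination_path, markdown_text).
    """
    summary = {"converted": [], "skipped": [], "failed": []}
    for src in html_files:
        try:
            dest, markdown = convert_one(src)
        except Exception as exc:
            # Log the parsing error with the file path, then move on.
            logger.warning("Failed to convert %s: %s", src, exc)
            summary["failed"].append(str(src))
            continue
        dest = Path(dest)
        if dest.exists() and not overwrite:
            # Existing file and no overwrite requested: skip and log.
            logger.warning("Skipping existing file %s", dest)
            summary["skipped"].append(str(dest))
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(markdown, encoding="utf-8")
        summary["converted"].append(str(dest))
    return summary
```

Returning a summary rather than raising keeps one broken file from stopping a large documentation run, and the source files are only ever read.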

Performance Considerations

- **Expected Speed**: ~1-2 seconds per file
- **Memory Usage**: Scales with image deduplication table
- **Disk Usage**: Reduced through image deduplication
- **Large Documentation Sets**: Two-pass processing for efficiency

Troubleshooting Guide

Image not referenced in HTML or missing from source

1. Verify image exists in source
2. Check if referenced in HTML
3. Review image-manifest.json
4. Confirm static/img structure

Cross-reference anchors not found

1. Check anchor mappings in verbose output
2. Verify target document exists
3. Confirm anchor ID consistency

Command Reference

<cli_options>

| Option | Type | Description | Default |
|--------|------|-------------|---------|
| `input_dir` | Required | Source HTML directory | - |
| `output_dir` | Required | Destination for Markdown | - |
| `--verbose, -v` | Flag | Show detailed progress | False |
| `--overwrite` | Flag | Overwrite existing files | False |
| `--skip-images` | Flag | Convert without copying images | False |
</cli_options>
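For reference, the options in the table could be declared with `argparse` along these lines. Whether `app.py` actually uses `argparse` is an assumption, and `build_parser` is a hypothetical name. Only the option names, help texts, and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (assumed layout)."""
    parser = argparse.ArgumentParser(
        prog="app.py",
        description="Convert HTML documentation to Markdown.",
    )
    parser.add_argument("input_dir", help="Source HTML directory")
    parser.add_argument("output_dir", help="Destination for Markdown")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show detailed progress")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing files")
    parser.add_argument("--skip-images", action="store_true",
                        help="Convert without copying images")
    return parser


# Parse an explicit argument list rather than sys.argv for illustration.
args = build_parser().parse_args(["docs/html", "out", "-v", "--skip-images"])
```

With this argument list, `args.verbose` and `args.skip_images` are `True` and `args.overwrite` stays at its default of `False`.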

Testing Checklist

- [ ] Basic HTML to Markdown conversion
- [ ] Image deduplication across multiple files
- [ ] Cross-file link resolution
- [ ] MadCap Flare specific elements
- [ ] Large documentation set performance
- [ ] Edge cases (empty files, broken HTML)

Future Enhancements

- Support for custom CSS preservation
- Batch processing with progress bar
- Configuration file support
- Plugin system for custom transformations
