netwrix/flarewell

HTML to Markdown Converter - Claude Instructions

<project_overview> A Python tool that converts HTML documentation (particularly from MadCap Flare) to Markdown format while preserving folder structure and centralizing images with intelligent deduplication. </project_overview>

Core Functionality

<conversion_rules>

  • Input: HTML files (.html, .htm, .xhtml)
  • Output: Markdown files (.md)
  • Directory Structure: Preserved except for images
  • Image Handling: Centralized in static/img/{productname} directory
  • Filename Convention: All lowercase with underscores replacing spaces
  • Path References: Absolute paths from parent output directory </conversion_rules>
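The filename convention above can be sketched as a small helper. This is a minimal illustration, not the tool's actual code: the name `to_markdown_name` is hypothetical, and the assumption is that only the filename (not its parent directories) is normalized.

```python
from pathlib import PurePosixPath

# Input extensions listed in the conversion rules.
HTML_SUFFIXES = {".html", ".htm", ".xhtml"}


def to_markdown_name(filename: str) -> str:
    """Hypothetical helper: map an HTML filename to its Markdown filename."""
    path = PurePosixPath(filename)
    if path.suffix.lower() not in HTML_SUFFIXES:
        raise ValueError(f"not an HTML file: {filename}")
    # All lowercase, with underscores replacing spaces.
    stem = path.stem.lower().replace(" ", "_")
    return f"{stem}.md"
```

For example, `to_markdown_name("Getting Started.htm")` yields `getting_started.md`.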

Key Features

- Detects identical images using content hashing
- Stores only one copy of duplicate images
- Tracks usage in `image-manifest.json`
- Updates all internal `.html` links to `.md`
- Maintains anchor links between documents
- Resolves cross-file references automatically
- All images stored in `/static/img/{mirror_doc_directory}`
- One image folder per product
- Only referenced images are copied
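Content-hash deduplication could look roughly like the sketch below. The hash algorithm (SHA-256) and the names `file_digest` and `map_to_canonical` are assumptions made for illustration; the source does not say which hash the tool uses.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file's bytes in chunks so large images are not loaded at once."""
    digest = hashlib.sha256()  # assumed algorithm, not confirmed by the source
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def map_to_canonical(paths):
    """Map every image path to the first path seen with identical content."""
    first_seen = {}  # digest -> first path with that content
    canonical = {}   # every path -> the path whose copy is kept
    for path in paths:
        canonical[path] = first_seen.setdefault(file_digest(path), path)
    return canonical
```

Only the paths that map to themselves would need to be copied; every other reference can point at its canonical copy.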

Installation & Setup

<setup_instructions>

# 1. Clone repository
git clone [repository_url]

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install beautifulsoup4 markdownify

</setup_instructions>

Usage

<usage_examples>

```bash
# Basic conversion
python app.py /path/to/html/docs /path/to/output

# With detailed progress output
python app.py /path/to/html/docs /path/to/output --verbose
```

Output Structure

<output_structure>

output/                    # Specified output directory
├── Product1/             # Markdown files (structure preserved)
│   ├── guide/
│   │   └── intro.md
│   └── api/
│       └── reference.md
└── Product2/
    └── docs/
        └── overview.md

static/                   # Parallel to output directory
└── img/                 # Centralized images (not 'images')
    ├── image-manifest.json  # Deduplication tracking
    ├── Product1/
    │   ├── guide/
    │   │   └── screenshot.png
    │   └── api/
    │       └── diagram.png
    └── Product2/
        └── docs/
            └── logo.png

</output_structure>
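The layout above implies a mapping from each source image to a location under `static/img`. The sketch below is one reading of it, with a hypothetical helper name (`image_destination`) and the assumption that "absolute paths from parent output directory" means references of the form `/static/img/...`.

```python
from pathlib import Path, PurePosixPath


def image_destination(image: Path, source_root: Path, output_dir: Path) -> tuple[Path, str]:
    """Return (copy destination, Markdown reference) for one source image."""
    # Keep the product and document directories, e.g. Product1/guide/screenshot.png.
    relative = image.relative_to(source_root)
    # static/ sits next to the output directory, not inside it.
    destination = output_dir.parent / "static" / "img" / relative
    # Assumed reference form: absolute from the output directory's parent.
    reference = "/" + PurePosixPath("static", "img", *relative.parts).as_posix()
    return destination, reference
```

With `source_root="src"` and `output_dir="site/output"`, the image `src/Product1/guide/screenshot.png` would be copied to `site/static/img/Product1/guide/screenshot.png` and referenced as `/static/img/Product1/guide/screenshot.png`.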

Implementation Details

<processing_phases>

  • Scan for images and build reference map
  • Create anchor mappings for cross-references
  • Build deduplication hash table
  • Convert HTML to Markdown
  • Update all link references
  • Copy unique images to static directory
  • Generate image-manifest.json
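As an illustration of the link-update step, the sketch below rewrites an internal link's extension to `.md` while keeping its anchor. The function name `rewrite_link` is hypothetical, and the sketch deliberately leaves out the lowercase/underscore filename normalization, which a real implementation would also need to apply.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches the HTML extensions the converter accepts as input.
HTML_EXT = re.compile(r"\.(html|htm|xhtml)$", re.IGNORECASE)


def rewrite_link(href: str) -> str:
    """Point internal HTML links at .md files; leave everything else alone."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return href  # external link (http:, mailto:, //host/...)
    if not HTML_EXT.search(parts.path):
        return href  # pure anchor, image, or other non-HTML target
    new_path = HTML_EXT.sub(".md", parts.path)
    # The fragment (anchor) and query are carried over unchanged.
    return urlunsplit(parts._replace(path=new_path))
```

For instance, `guide/intro.htm#setup` becomes `guide/intro.md#setup`, while `https://example.com/page.html` is returned untouched.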

Critical Requirements

- Never modify source files
- Preserve all internal links
- Handle MadCap Flare-specific HTML structures
- Maintain readable Markdown output
- Optimize image storage through deduplication
- Generate comprehensive image manifest

Error Handling

<error_scenarios>

  • Log warning but continue processing
  • Record in image-manifest.json
  • Preserve image reference in Markdown
  • Attempt best-effort conversion
  • Log parsing errors with file path
  • Continue with next file
  • Check for existing files
  • Option to overwrite or skip
  • Log conflicts
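The "log and continue" behavior described above can be sketched as a loop that isolates failures per file. The names `convert_all` and `convert_one` and the logger name are placeholders for illustration, not the tool's real API.

```python
import logging

logger = logging.getLogger("converter")  # hypothetical logger name


def convert_all(files, convert_one):
    """Convert each file, logging failures instead of aborting the run.

    `convert_one` is any callable that converts a single file and raises on
    error. Returns the list of files that failed.
    """
    failed = []
    for path in files:
        try:
            convert_one(path)
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            logger.warning("Failed to convert %s: %s", path, exc)
            failed.append(path)
    return failed
```

Returning the failures lets the caller report them at the end of the run or record them alongside the manifest.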

Performance Considerations

- **Expected Speed**: ~1-2 seconds per file
- **Memory Usage**: Scales with image deduplication table
- **Disk Usage**: Reduced through image deduplication
- **Large Documentation Sets**: Two-pass processing for efficiency

Troubleshooting Guide

Image not referenced in HTML or missing from source

  1. Verify image exists in source
  2. Check if referenced in HTML
  3. Review image-manifest.json
  4. Confirm static/img structure

Cross-reference anchors not found

  1. Check anchor mappings in verbose output
  2. Verify target document exists
  3. Confirm anchor ID consistency
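When checking anchor ID consistency by hand, it can help to list the anchors a target document actually defines. The sketch below uses only the standard library; `collect_anchors` is a hypothetical helper, and treating both `id` attributes and legacy `<a name="...">` as anchors is an assumption about what the source HTML contains.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect values usable as link targets: any id, plus <a name="...">."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def collect_anchors(html: str) -> set[str]:
    """Return the set of anchor names defined in an HTML string."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.anchors
```

If a link's fragment is missing from the returned set, the anchor was either renamed or lives in a different document.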

Command Reference

<cli_options>

| Option | Type | Description | Default |
|---|---|---|---|
| `input_dir` | Required | Source HTML directory | - |
| `output_dir` | Required | Destination for Markdown | - |
| `--verbose`, `-v` | Flag | Show detailed progress | False |
| `--overwrite` | Flag | Overwrite existing files | False |
| `--skip-images` | Flag | Convert without copying images | False |
</cli_options>
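The options above map naturally onto `argparse`. The following is a sketch of how such a parser might be declared; the function name `build_parser` is an assumption, and the real `app.py` may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="app.py",
        description="Convert HTML documentation to Markdown.",
    )
    parser.add_argument("input_dir", help="Source HTML directory")
    parser.add_argument("output_dir", help="Destination for Markdown")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show detailed progress")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing files")
    parser.add_argument("--skip-images", action="store_true",
                        help="Convert without copying images")
    return parser
```

With `store_true`, each flag defaults to `False`, which matches the defaults listed in the table.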

Testing Checklist

- [ ] Basic HTML to Markdown conversion
- [ ] Image deduplication across multiple files
- [ ] Cross-file link resolution
- [ ] MadCap Flare specific elements
- [ ] Large documentation set performance
- [ ] Edge cases (empty files, broken HTML)

Future Enhancements

- Support for custom CSS preservation
- Batch processing with progress bar
- Configuration file support
- Plugin system for custom transformations

About

Say goodbye to MadCap Flare and convert your project to markdown!
