Improve Ingestion for Web Documents #116

Merged
merged 3 commits into from
Jun 4, 2025

sumukshashidhar
Member

No description provided.

@sumukshashidhar sumukshashidhar requested a review from alozowski May 27, 2025 23:09
        return None
    except ImportError:
        logger.error(
            "Trafilatura library is not installed. Please install it (e.g., `pip install trafilatura`) "
Collaborator

Since trafilatura is now a hard dependency in pyproject.toml, we could probably remove the ImportError block. What do you think?

Member Author

yes! removed the import error block!

file_ext = os.path.splitext(file_path)[1].lower()
content: str | None = None

if file_ext == ".md":
Collaborator

This section:

if file_ext == ".md":
    ...
elif file_ext in [".html", ".htm"]:
    ...
else:
    ...

This block is getting a bit long with the different file types. Maybe we could move it into a separate helper function? Something like get_markdown_content(file_path, markdown_processor) could make the main function easier to read and maintain – especially if we support more formats later

Member Author

changed it!
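
For readers following the thread, here is a minimal sketch of the kind of helper suggested above. It is not the code that was merged. The reviewer's proposed signature takes `file_path` and `markdown_processor`; the third parameter `html_extractor`, the `read_text` helper, and the choice to model both converters as plain callables are assumptions made so the example runs on the standard library alone.

```python
import os
from pathlib import Path
from typing import Callable, Optional

# Hypothetical converter types (assumptions, not the project's real interfaces):
#   markdown_processor: takes a file path, returns Markdown text or None
#   html_extractor:     takes raw HTML text, returns Markdown text or None
PathConverter = Callable[[str], Optional[str]]
HtmlConverter = Callable[[str], Optional[str]]


def read_text(file_path: str) -> str:
    """Read a file as UTF-8 text (hypothetical helper)."""
    return Path(file_path).read_text(encoding="utf-8")


def get_markdown_content(
    file_path: str,
    markdown_processor: PathConverter,
    html_extractor: HtmlConverter,
) -> Optional[str]:
    """Return Markdown for file_path, dispatching on the file extension."""
    file_ext = os.path.splitext(file_path)[1].lower()

    if file_ext == ".md":
        # Already Markdown: pass the content through unchanged.
        return read_text(file_path)
    if file_ext in (".html", ".htm"):
        # Web documents go through the HTML extractor.
        return html_extractor(read_text(file_path))
    # Every other format is delegated to the general-purpose processor.
    return markdown_processor(file_path)
```

With the branching isolated like this, the calling function reduces to a single `content = get_markdown_content(...)` call, and supporting another format means adding one branch in one place.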

@sumukshashidhar sumukshashidhar requested a review from alozowski June 4, 2025 10:53
@sumukshashidhar sumukshashidhar merged commit f698308 into main Jun 4, 2025
6 checks passed
Josephrp pushed a commit to Josephrp/yourbench that referenced this pull request Jun 5, 2025
…-markitdown

Improve Ingestion for Web Documents

2 participants