Improve Ingestion for Web Documents #116
Conversation
yourbench/pipeline/ingestion.py
Outdated
    return None
except ImportError:
    logger.error(
        "Trafilatura library is not installed. Please install it (e.g., `pip install trafilatura`) "
Since `trafilatura` is now a hard dependency in `pyproject.toml`, we could probably remove the `ImportError` block. What do you think?
Yes! Removed the `ImportError` block.
yourbench/pipeline/ingestion.py
Outdated
file_ext = os.path.splitext(file_path)[1].lower()
content: str | None = None

if file_ext == ".md":
This section:
if file_ext == ".md":
    ...
elif file_ext in [".html", ".htm"]:
    ...
else:
    ...
This block is getting a bit long with the different file types. Maybe we could move it into a separate helper function? Something like `get_markdown_content(file_path, markdown_processor)` could make the main function easier to read and maintain, especially if we support more formats later.
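For illustration, the suggested helper might look roughly like the sketch below. This is not the code from the PR. The `html_extractor` parameter, the `Converter` type alias, and the choice to pass the processors in as plain callables are assumptions made so the example runs with the standard library only; in the real pipeline the callables would presumably wrap trafilatura and the existing markdown processor.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical type: a converter takes a file path and returns Markdown text,
# or None when nothing could be extracted.
Converter = Callable[[str], Optional[str]]


def get_markdown_content(
    file_path: str,
    markdown_processor: Converter,
    html_extractor: Converter,
) -> Optional[str]:
    """Return Markdown for file_path, dispatching on the file extension.

    Assumption: both processors are plain callables. The real objects in
    the pipeline may expose a different interface.
    """
    file_ext = Path(file_path).suffix.lower()

    if file_ext == ".md":
        # Markdown files need no conversion; read them as-is.
        return Path(file_path).read_text(encoding="utf-8")
    if file_ext in (".html", ".htm"):
        # Web documents go through the HTML extractor.
        return html_extractor(file_path)
    # Every other format falls back to the generic processor.
    return markdown_processor(file_path)
```

With this shape, the main ingestion function only calls `get_markdown_content(...)` and checks for `None`, and supporting a new format means adding one branch in one place.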
changed it!
…-markitdown Improve Ingestion for Web Documents
No description provided.