This repository contains the code, data, and results for the paper titled "GraphRAG on technical documents - impact of knowledge graph schema" by Henri Scaffidi, Prof. Melinda Hodkiewicz, Dr. Caitlin Woods, and Nicole Roocke (2025).
The project assesses how 1) a domain-specific knowledge graph schema and 2) the selection of local or global GraphRAG search options impact the quality of GraphRAG responses to questions about technical documents. We use Microsoft's GraphRAG framework, which is available under an MIT license, for all experiments.
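As an illustration of the two search options, the sketch below runs the same question through both local and global search via GraphRAG's command-line query module. The invocation (`python -m graphrag.query` with `--root` and `--method`) follows Microsoft's GraphRAG getting-started documentation for earlier releases; exact flags vary between GraphRAG versions, and the pipeline directory and question shown here are placeholders, not values from this repository.

```python
# Sketch: issue the same competency question via GraphRAG's local and global search.
# Assumptions: the CLI form follows Microsoft's GraphRAG getting-started guide
# (flag names differ between GraphRAG versions); the root directory and question
# below are placeholders rather than values from this repository.
import subprocess

ROOT = "src/pipeline_baseline"  # placeholder: a pipeline directory with a built index
QUESTION = "What are the key findings of the report?"  # placeholder question

for method in ("local", "global"):
    result = subprocess.run(
        ["python", "-m", "graphrag.query",
         "--root", ROOT, "--method", method, QUESTION],
        capture_output=True, text=True, check=True,
    )
    print(f"--- {method} search ---")
    print(result.stdout)
```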
The `src` directory of the repository contains the following:
- Python code used to run the GraphRAG pipelines (adapted from Microsoft's GraphRAG notebooks)
- Four sub-directories containing the settings and data for each of our four GraphRAG pipelines, which differ in the knowledge graph schema specified via `entity_types` in `settings.yaml` (see the sketch below)
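Each pipeline's `settings.yaml` lists the entity types that GraphRAG extracts when building its knowledge graph. The sketch below prints the schema configured for each pipeline so the four variants can be compared side by side; it assumes hypothetical sub-directory names and the `entity_extraction.entity_types` key used in Microsoft's default GraphRAG settings template, which may differ between GraphRAG versions.

```python
# Sketch: compare the knowledge graph schema configured for each pipeline.
# Assumptions: the four sub-directory names below are placeholders, and entity
# types live under entity_extraction -> entity_types in settings.yaml (as in
# Microsoft's default GraphRAG settings template; key paths can vary by version).
from pathlib import Path

import yaml  # pip install pyyaml

PIPELINE_DIRS = [
    "src/pipeline_baseline",   # placeholder names, not the actual
    "src/pipeline_schema_1",   # sub-directory names in this repository
    "src/pipeline_schema_2",
    "src/pipeline_schema_3",
]

for pipeline in PIPELINE_DIRS:
    settings_path = Path(pipeline) / "settings.yaml"
    with settings_path.open() as f:
        settings = yaml.safe_load(f)
    entity_types = settings.get("entity_extraction", {}).get("entity_types", [])
    print(f"{pipeline}: {entity_types}")
```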
The `data` directory of the repository contains the following:
- `mriwa_report_subset_txt`: The .txt versions of the seven MRIWA technical reports analysed in this project. All MRIWA reports are publicly accessible as PDFs through MRIWA's Project Portfolio. We used PyPDF2 to extract the PDF text to .txt files (see the sketch below).
- `mriwa_cqa`: The set of MRIWA-defined competency questions and answers used to evaluate the GraphRAG pipelines in this project.
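The snippet below is a minimal sketch of the PDF-to-text step, assuming PyPDF2 3.x and its `PdfReader` interface; the input and output paths are placeholders rather than the actual file layout of this repository.

```python
# Sketch: extract text from the MRIWA report PDFs into .txt files.
# Assumes PyPDF2 3.x (PdfReader API); the directories below are placeholders,
# not the actual file layout used in this repository.
from pathlib import Path

from PyPDF2 import PdfReader

PDF_DIR = Path("mriwa_report_subset_pdf")    # placeholder input directory
TXT_DIR = Path("data/mriwa_report_subset_txt")
TXT_DIR.mkdir(parents=True, exist_ok=True)

for pdf_path in PDF_DIR.glob("*.pdf"):
    reader = PdfReader(pdf_path)
    # Concatenate the extracted text of every page in the report.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    (TXT_DIR / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")
```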
The `results` directory of the repository contains the GraphRAG pipelines' responses, using both local and global search, to all MRIWA-defined competency questions.
The `supplementary_materials` directory of the repository contains the following:
- GraphRAG performance analysis marking scheme and results
- Cost analysis
- Entity tagging example using a domain-specific knowledge graph schema on MRIWA report text
- Cross-validation of our performance analysis results using RAGAS
- Distribution of page count across MRIWA technical reports
- MRIWA report sample selection process