This repository contains tools for scraping, processing, and exploring the Martin Luther King Jr. Assassination Declassified Records from the National Archives using Retrieval-Augmented Generation (RAG) technology.
Dr. Martin Luther King Jr.'s legacy is one of courage, justice, and transformation. The declassified records surrounding his assassination (hosted by the National Archives) are a vital part of the historical record. This project aims to make these documents more accessible and searchable using modern AI and data processing technologies.
The project consists of these main components:
- Web Scraper: Scripts to scrape MLK assassination records from the National Archives website
- S3 Uploader: Tools to upload the scraped documents to Amazon S3 for storage
- Transforming Archive Documents: The documents from the National Archives were processed with the Unstructured UI and stored in an Elasticsearch database
- RAG Application: A Jupyter notebook "MLK_Archive_RAG_Application.ipynb" that implements a question-answering system using the processed documents
- Release of Processed Results: The processed data from the MLK archive documents is publicly available via an AWS S3 bucket: http://example-transformations-mlk-archive.s3-website-us-east-1.amazonaws.com/
.
├── MLK_Archive_RAG_Application.ipynb   # Jupyter notebook with RAG implementation
├── mlk_archive_to_s3/                  # Scripts for scraping and S3 upload
│   ├── download_to_s3.py               # Script to download data to S3
│   ├── scrape_mlk_records.py           # Script to scrape MLK records
│   ├── mlk_records_*.csv               # CSV file with MLK records
│   ├── mlk_records_*.json              # JSON file with MLK records
│   └── mlk_urls_*.txt                  # Text file with MLK URLs
└── s3_hosting/                         # Static hosting files
    ├── generate_index.py               # Script to generate index page
    └── index.html                      # Static index page
The processed MLK archive documents are available for download:
Download mlk-archive-public.jsonl
Each line in the JSONL file represents a document element with the following structure:
{
  "element_id": "ab049307ff7695d08f1e798d5372d51b",
  "embeddings": [0.04380892589688301, -0.007506858557462692, -0.013627462089061737, ...],
  "text": "Prefix: This chunk appears near the beginning of an FBI investigation document (File #BH 44-1740) from April 1968 that details inquiries into Eric S. Galt's activities in Birmingham, Alabama...; Original: 1 BH 44-1740 DTD:scb\n\nLAUNDRIES AND CLEANERS, BIRMINGHAM, ALABAMA:...",
  "type": "CompositeElement",
  "record_id": "edb7ea45-00ba-5ab1-a58c-30da4ff50de5",
  "metadata": {
    "filename": "44-at-2386_hs1-852715321_158-01-part_3_of_4.pdf",
    "filetype": "application/pdf",
    "languages": ["eng", "por"],
    "page_number": 1,
    "text_as_html": "<div class=\"Page\" data-page-number=\"1\" />...",
    "orig_elements": "eJztnQtvWzmSqP8KkVngzgB+8P3I9m1AsZXE07Ed2E7vDrYXQZEsxkLLkiHJSecu...",
    "data_source-url": "https://example-transformations-mlk-archive.s3.us-east-1.amazonaws.com/mlk-archive/44-at-2386_hs1-852715321_158-01-part_3_of_4.pdf",
    "data_source-version": "9940ff50258de939ef2bb1331c7e2fe3-2",
    "data_source-record_locator-protocol": "s3",
    "data_source-record_locator-remote_file_path": "https://example-transformations-mlk-archive.s3.us-east-1.amazonaws.com/mlk-archive/",
    "data_source-record_locator-metadata-source-url": "https://www.archives.gov/files/research/mlk/releases/2025/0721/44-at-2386_hs1-852715321_158-01-part_3_of_4.pdf",
    "data_source-record_locator-metadata-content-length": "9695272",
    "data_source-record_locator-metadata-download-date": "2025-07-22 15:03:58",
    "data_source-date_created": "1753211039.0",
    "data_source-date_modified": "1753211039.0",
    "data_source-date_processed": "1753223725.6787593",
    "entities-items": [
      {"entity": "FBI", "type": "ORGANIZATION"},
      {"entity": "Eric S. Galt", "type": "PERSON"},
      {"entity": "Birmingham", "type": "LOCATION"},
      {"entity": "Alabama", "type": "LOCATION"}
    ],
    "entities-relationships": [
      {"from": "Eric S. Galt", "relationship": "rented", "to": "Safe Deposit Box No. 5517"},
      {"from": "Eric S. Galt", "relationship": "affiliated_with", "to": "Birmingham Trust National Bank"}
    ]
  }
}
- element_id: Unique identifier for each document element
- embeddings: Vector embeddings for semantic search (OpenAI text-embedding-3-large [dim 3072])
- text: The processed text content with contextual prefix and original text
- type: Element type (e.g., CompositeElement, NarrativeText, Title)
- record_id: Unique identifier linking related elements from the same document
- metadata: Rich metadata including:
- Source document information (filename, filetype, page numbers)
- Processing timestamps and versions
- Named entities and their relationships
- Original source URLs from the National Archives
- HTML representation of the content
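The fields described above can be read with nothing more than the standard library. The following is a minimal sketch, assuming a local copy of the JSONL file and assuming that the `data_source-date_*` values are Unix epoch seconds stored as strings (as the sample suggests). The helper names `iter_elements` and `summarize` are illustrative and are not part of this repository.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def iter_elements(lines):
    """Yield one parsed document element per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(element):
    """Pull a few commonly used fields out of one element."""
    meta = element.get("metadata", {})
    # Assumption: date fields are Unix epoch seconds stored as strings.
    processed = meta.get("data_source-date_processed")
    processed_at = (
        datetime.fromtimestamp(float(processed), tz=timezone.utc)
        if processed is not None
        else None
    )
    return {
        "element_id": element["element_id"],
        "filename": meta.get("filename"),
        "page_number": meta.get("page_number"),
        "processed_at": processed_at,
        # Count named entities by type (e.g. PERSON, LOCATION).
        "entity_types": Counter(e["type"] for e in meta.get("entities-items", [])),
        "embedding_dim": len(element.get("embeddings", [])),
    }


# Usage with a local copy (the path is an assumption):
# with open("mlk-archive-public.jsonl", encoding="utf-8") as f:
#     for element in iter_elements(f):
#         print(summarize(element))
```

Reading line by line keeps memory use flat, which matters because every element carries a 3072-dimension embedding.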
Note: The steps below were completed prior to this notebook. You do not need to rerun them—they're included here to explain how the records were made searchable.
The declassified MLK assassination records were processed using the Unstructured platform in a multi-step ETL pipeline to make them AI-ready and searchable:
- Original documents—including PDFs, images, and other file types—were streamed from the National Archives to Amazon S3, providing secure and scalable cloud storage.
- National Archives: https://www.archives.gov/research/mlk
- AWS Files: http://example-transformations-mlk-archive.s3-website-us-east-1.amazonaws.com/
The Unstructured platform processed each document through a series of enrichment steps:
- VLM Partitioning: Vision language models (VLMs) segmented each document into meaningful sections, preserving layout and context. Because most documents were scanned images of typed pages, which makes OCR challenging, VLMs were chosen for partitioning. Claude 3.7 Sonnet was used as the VLM provider.
- Title-Based Chunking: Documents were split into semantically coherent chunks using structural cues (like section headers) to improve context retention. A "Chunk by Title" chunking strategy with contextual chunking was used. The chunking parameters were:
  {
    "contextual_chunking": true,
    "combine_text_under_n_characters": 3000,
    "include_original_elements": true,
    "max_characters": 5500,
    "multipage_sections": true,
    "new_after_n_characters": 3500,
    "overlap": 350,
    "overlap_all": true
  }
- Named Entity Recognition (NER): Entities such as people, organizations, locations, and dates were extracted to enhance downstream filtering and relevance. OpenAI GPT-4o was used with the default NER prompt. For more information about NER, please see our documentation: https://docs.unstructured.io/ui/enriching/ner
- Vector Embedding: Each chunk was embedded using OpenAI's text-embedding-3-large model (3072 dims), enabling semantic similarity search.
This end-to-end pipeline transformed the raw historical documents into a searchable, structured knowledge base—optimized for natural language queries and intelligent retrieval. Unstructured made it possible to transform 243,496 pages of grainy text in a single day.
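To give a feel for how the title-based chunking limits interact, here is a deliberately simplified sketch. It is not Unstructured's implementation: it ignores overlap, contextual prefixes, and the splitting of single oversized elements, and the input shape (a list of `(kind, text)` pairs) is an assumption made for illustration. Only the three parameter names are borrowed from the configuration shown above.

```python
def chunk_by_title(elements, max_characters=5500, new_after_n_characters=3500,
                   combine_text_under_n_characters=3000):
    """Group (kind, text) pairs into chunks that prefer to break at titles.

    Simplified illustration only: overlap, contextual prefixes, and
    splitting of single oversized elements are not handled here.
    """
    chunks, current, current_len = [], [], 0
    for kind, text in elements:
        # Elements inside a chunk are joined with a two-character separator.
        added_len = len(text) + (2 if current else 0)
        # A title starts a new chunk, unless the current chunk is still small.
        title_break = kind == "Title" and current_len >= combine_text_under_n_characters
        # Soft limit: close the chunk once it has grown past this size.
        soft_break = current_len >= new_after_n_characters
        # Hard limit: never let a multi-element chunk exceed this size.
        hard_break = current_len + added_len > max_characters
        if current and (title_break or soft_break or hard_break):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added_len = len(text)
        current.append(text)
        current_len += added_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With the defaults above, a short section (under 3000 characters) is merged with the section that follows it rather than emitted as a tiny chunk, which is the intent behind `combine_text_under_n_characters`.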
- The enriched document chunks—with metadata and vector embeddings—were indexed into Elasticsearch, enabling:
- Fast full-text and semantic (vector) search
- Metadata-based filtering and sorting
- Scalable querying across large document sets
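Semantic (vector) search ranks chunks by how close their embeddings are to the embedding of the query. The sketch below shows the idea with plain cosine similarity over elements loaded from the JSONL file. It is an in-memory illustration only, not how Elasticsearch implements vector search internally, and the `top_k` helper is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vector, elements, k=3):
    """Return the k (score, element) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, e["embeddings"]), e) for e in elements]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The query vector has to come from the same embedding model as the stored chunks (text-embedding-3-large, 3072 dimensions); vectors from different models are not comparable.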
Access to this database is available using the following credentials:
ELASTICSEARCH_HOSTS: "https://mlk-archive-public.es.eastus.azure.elastic-cloud.com"
ELASTICSEARCH_API_KEY: "S0I5ak5aZ0JwcE44OWFmcEpBb3M6dTlpYnVQbk9Ub2dKNk15LUpkT0JwUQ=="
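As a rough illustration of how these credentials might be used, the sketch below builds (but does not send) an Elasticsearch kNN search request with the standard library. The index name is hypothetical because it is not stated here, and the vector field name `embeddings` is an assumption based on the JSONL sample; check the notebook for the actual values.

```python
import json
import urllib.request


def build_knn_request(host, api_key, index, query_vector, k=5, num_candidates=50):
    """Build a POST request for Elasticsearch's _search endpoint with a kNN clause."""
    body = {
        "knn": {
            "field": "embeddings",  # assumption: name of the dense vector field
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Elasticsearch API keys are sent with the "ApiKey" scheme.
            "Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (index name is hypothetical; query_vector must be a 3072-dim embedding):
# req = build_knn_request(ELASTICSEARCH_HOSTS, ELASTICSEARCH_API_KEY, "some-index", query_vector)
# with urllib.request.urlopen(req) as resp:
#     hits = json.load(resp)["hits"]["hits"]
```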
The processed output of the ETL pipeline is available in an Elasticsearch database, as explained in the Jupyter notebook "MLK_Archive_RAG_Application.ipynb". A JSONL copy of the processed data is also available to download and use for your own research.
- OpenAI API key. On your Jupyter notebook server, you must set the environment variable OPENAI_API_KEY to this API key. To learn how, see your Jupyter notebook server provider's documentation.
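A quick check at the top of a notebook session can confirm the variable is visible before any API call is made. This is a small sketch; `require_env` is a hypothetical helper, not something the notebook is known to define.

```python
import os


def require_env(name, environ=os.environ):
    """Return the named environment variable, or fail early with a clear message."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; set it on the Jupyter server before running the notebook"
        )
    return value


# Usage:
# openai_api_key = require_env("OPENAI_API_KEY")
```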
- Create a virtual environment:
  python -m venv mlk_scraper_env
  source mlk_scraper_env/bin/activate
- Install required packages:
  pip install -r requirements.txt
Open and run the Jupyter notebook:
jupyter notebook MLK_Archive_RAG_Application.ipynb
The notebook contains a question-answering system that allows you to ask questions about the MLK assassination records.
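In a RAG setup, the retrieved chunks are typically placed into the prompt that is sent to the language model. The sketch below shows one plausible way to assemble such a prompt from retrieved elements. It is an illustration under stated assumptions, not the notebook's actual code: the `build_prompt` name, the prompt wording, and the `max_context_chars` budget are all hypothetical.

```python
def build_prompt(question, chunks, max_context_chars=12000):
    """Assemble a grounded question-answering prompt from retrieved chunks."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk.get("metadata", {})
        source = f'{meta.get("filename", "unknown file")}, page {meta.get("page_number", "?")}'
        block = f"[{i}] ({source})\n{chunk['text']}"
        # Always keep the first excerpt; stop once the context budget is spent.
        if parts and used + len(block) > max_context_chars:
            break
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the numbered excerpts below. "
        "Cite excerpt numbers, and say so if the excerpts do not contain the answer.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering each excerpt and naming its source file lets the answer point back to specific archive documents, which matters for historical material.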
- National Archives for providing access to the declassified MLK assassination records