
WAF Harvesting optimization #5261

Open
1 task
jbrown-xentity opened this issue May 23, 2025 · 1 comment
Assignees
Labels
H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

Comments

@jbrown-xentity
Contributor

jbrown-xentity commented May 23, 2025

User Story

In order to be able to harvest large WAF sources, data.gov managers want the process to be optimized.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN the IOOS harvest source is set up in datagov-harvester
    WHEN a harvest is run
    THEN the harvest completes within 24 hours
    AND data is available on catalog-next

Background

We tried running for up to 19 hours and bumping the task's memory to 5 GB, but the harvest still could not complete.
We have 7 harvest sources with over 10K records.

There is additional context in the Slack thread.

Security Considerations (required)

None

Sketch

We currently do a complete pull of the source, get the hash, do the comparison, and then set up the create, update, and delete lists. Since this doesn't seem tenable, the following logic should hopefully speed things up.

  1. The traverse_waf function gives a list of all the URLs, and we use the URLs as identifiers. Before we start downloading files, we should use this list for a quick comparison (see the sketch after this list).
    1. We can compare a few things: does anything in our system need to be removed because it has been removed from the source? Now we have our delete list...
    2. Are there timestamps in the HTML of the WAF? If so, use them to determine whether we even need to pull the file (i.e. a simple compare) or whether it can be ignored.
  2. Does this URL exist in our system today?
    1. If not, download it and add it to the add list; save it to the DB and then delete the local in-memory record.
    2. If yes, download it, hash it, and compare against the previous hash; if different, mark it for update, and if not, discard it.
  3. Once the extract-and-compare above is done, pull records from the DB (one at a time?) and process them (transform, validate, load).
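A minimal sketch of the compare step described above. The names here are assumptions, not the actual datagov-harvester API: `remote` is traverse_waf's output reshaped into a `{file_url: last_modified-or-None}` dict, and `existing` maps each file URL already in our DB to a record object exposing `source_hash` and `date_created`.

```python
import hashlib
import urllib.request


def build_sync_lists(remote, existing):
    """Build delete/add/update lists without holding every file in memory at once."""
    # 1. Anything in our DB that the WAF no longer lists goes on the delete list.
    to_delete = [url for url in existing if url not in remote]

    to_add, to_update = [], []
    for url, last_modified in remote.items():
        record = existing.get(url)
        if record is None:
            to_add.append(url)               # new file: download and save later
            continue
        # 2. If the WAF exposes timestamps, skip files that clearly haven't changed.
        if last_modified is not None and last_modified <= record.date_created:
            continue
        # 3. Otherwise download just this one file, hash it, and compare hashes.
        content = urllib.request.urlopen(url).read()
        if hashlib.sha256(content).hexdigest() != record.source_hash:
            to_update.append(url)
        # `content` goes out of scope each iteration, so only one file body
        # is held in memory at a time.

    return to_add, to_update, to_delete
```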

All of this may become a moot point if memory still overflows, i.e. on the initial load, where every object from the source must be downloaded, hashed, processed, and loaded. I believe we may need to create a process for "picking up where we left off". These harvest records will have an action (create, update, etc.), a date_created, a date_finished, and a status. I would like to utilize these fields to create another task that could "pick up" jobs that crashed for various reasons and try to complete them.
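A rough sketch of that "pick up where we left off" task, using SQLAlchemy-style queries. The model class, its fields (harvest_job_id, date_created, date_finished, status), the staleness cutoff, and the process_record callable are all hypothetical placeholders, not confirmed names from the codebase.

```python
from datetime import datetime, timedelta, timezone


def find_unfinished_records(session, harvest_record_model, job_id,
                            stale_after=timedelta(hours=2)):
    """Return records from a job that were started but never reached a terminal state."""
    cutoff = datetime.now(timezone.utc) - stale_after
    return (
        session.query(harvest_record_model)
        .filter(
            harvest_record_model.harvest_job_id == job_id,
            harvest_record_model.date_finished.is_(None),
            harvest_record_model.date_created < cutoff,
        )
        .all()
    )


def resume_job(session, harvest_record_model, job_id, process_record):
    """Re-run transform/validate/load for records a crashed job never finished."""
    for record in find_unfinished_records(session, harvest_record_model, job_id):
        process_record(record)               # hypothetical: transform, validate, load
        record.status = "success"
        record.date_finished = datetime.now(timezone.utc)
        session.commit()                     # commit per record so a crash loses little
```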

@FuhuXia
Member

FuhuXia commented May 28, 2025

To traverse the WAF and read paths/filenames and file timestamps, we can borrow code from ckanext-spatial.
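Not the ckanext-spatial code itself, just a hedged sketch of the same idea: scrape a WAF index page and pull out each XML file's name plus the "last modified" timestamp shown next to it, assuming the common Apache-style listing layout with ISO dates.

```python
import re
from datetime import datetime
from urllib.request import urlopen

# e.g.  <a href="record1.xml">record1.xml</a>   2025-05-23 14:02   12K
APACHE_ROW = re.compile(
    r'<a href="(?P<name>[^"?/]+\.xml)">[^<]+</a>\s*'
    r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2})'
)


def list_waf(waf_url):
    """Return {file_url: last_modified} parsed from a WAF directory listing."""
    html = urlopen(waf_url).read().decode("utf-8", errors="replace")
    results = {}
    for match in APACHE_ROW.finditer(html):
        file_url = waf_url.rstrip("/") + "/" + match.group("name")
        results[file_url] = datetime.strptime(match.group("date"), "%Y-%m-%d %H:%M")
    return results
```

Listings from IIS or other servers use different date formats, so a real implementation (like ckanext-spatial's) needs several patterns rather than this single regex.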

@tdlowden tdlowden moved this to 📥 Queue in data.gov team board May 29, 2025
@rshewitt rshewitt moved this from 📥 Queue to 🏗 In Progress [8] in data.gov team board Jun 3, 2025
@rshewitt rshewitt self-assigned this Jun 3, 2025