
WAF Harvesting optimization #5261

Open
1 task
jbrown-xentity opened this issue May 23, 2025 · 1 comment
Assignees
Labels
H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

Comments

@jbrown-xentity
Contributor

jbrown-xentity commented May 23, 2025

User Story

In order to be able to harvest large WAF sources, data.gov managers want the process to be optimized.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN the IOOS harvest source is set up in datagov-harvester
    WHEN a harvest is run
    THEN the harvest completes within 24 hours
    AND data is available on catalog-next

Background

We tried running for up to 19 hours and bumping the task's memory to 5 GB, but the harvest still could not complete.
We have 7 harvest sources with over 10K records.

There is additional context in the Slack thread.

Security Considerations (required)

None

Sketch

We currently do a complete pull of the source, get the hash, do the comparison, and then set up the create, update, and delete lists. Since this doesn't seem tenable, the following logic should hopefully speed things up.

  1. The traverse_waf function gives a list of all the URLs, and we use the URLs as identifiers. Before we start downloading files, we should use this list for a quick comparison (see the sketch after this list).
    1. We can compare a few things: does anything in our system need to be removed because it has been removed from the source? Now we have our delete list...
    2. Are there timestamps in the HTML of the WAF? If so, use them to determine whether we even need to pull the file (i.e. a simple compare) or whether it can be ignored.
  2. Does this URL exist in our system today?
    1. If not, download it and add it to the add list; save it to the DB and then delete the local in-memory record.
    2. If yes, download it, hash it, and compare against the previous hash; if different, mark it for update, and if not, discard it.
  3. Once the extract-and-compare above is done, pull records from the DB (one at a time?) and process them (transform, validate, load).
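A minimal sketch of the compare step described above. The names here are assumptions, not the actual datagov-harvester API: `remote` is traverse_waf's output reshaped into a `{file_url: last_modified-or-None}` dict, and `existing` maps each file URL already in our DB to a record object exposing `source_hash` and `date_created`.

```python
import hashlib
import urllib.request


def build_sync_lists(remote, existing):
    """Build delete/add/update lists without holding every file in memory at once."""
    # 1. Anything in our DB that the WAF no longer lists goes on the delete list.
    to_delete = [url for url in existing if url not in remote]

    to_add, to_update = [], []
    for url, last_modified in remote.items():
        record = existing.get(url)
        if record is None:
            to_add.append(url)               # new file: download and save later
            continue
        # 2. If the WAF exposes timestamps, skip files that clearly haven't changed.
        if last_modified is not None and last_modified <= record.date_created:
            continue
        # 3. Otherwise download just this one file, hash it, and compare hashes.
        content = urllib.request.urlopen(url).read()
        if hashlib.sha256(content).hexdigest() != record.source_hash:
            to_update.append(url)
        # `content` goes out of scope each iteration, so only one file body
        # is held in memory at a time.

    return to_add, to_update, to_delete
```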

All of this may become a moot point if memory still overflows, i.e. on the initial load, where every object from the source must be downloaded, hashed, processed, and loaded. I believe we may need to create a process for "picking up where we left off". These harvest records will have an action (create, update, etc.), a date_created, a date_finished, and a status. I would like to utilize these fields to create another task that could "pick up" jobs that crashed for various reasons and try to complete them.
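A rough sketch of that "pick up where we left off" task, using SQLAlchemy-style queries. The model class, its fields (harvest_job_id, date_created, date_finished, status), the staleness cutoff, and the process_record callable are all hypothetical placeholders, not confirmed names from the codebase.

```python
from datetime import datetime, timedelta, timezone


def find_unfinished_records(session, harvest_record_model, job_id,
                            stale_after=timedelta(hours=2)):
    """Return records from a job that were started but never reached a terminal state."""
    cutoff = datetime.now(timezone.utc) - stale_after
    return (
        session.query(harvest_record_model)
        .filter(
            harvest_record_model.harvest_job_id == job_id,
            harvest_record_model.date_finished.is_(None),
            harvest_record_model.date_created < cutoff,
        )
        .all()
    )


def resume_job(session, harvest_record_model, job_id, process_record):
    """Re-run transform/validate/load for records a crashed job never finished."""
    for record in find_unfinished_records(session, harvest_record_model, job_id):
        process_record(record)               # hypothetical: transform, validate, load
        record.status = "success"
        record.date_finished = datetime.now(timezone.utc)
        session.commit()                     # commit per record so a crash loses little
```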

@FuhuXia
Member

FuhuXia commented May 28, 2025

To traverse the WAF and read paths/filenames and file timestamps, we can borrow code from ckanext-spatial.
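Not the ckanext-spatial code itself, just a hedged sketch of the same idea: scrape a WAF index page and pull out each XML file's name plus the "last modified" timestamp shown next to it, assuming the common Apache-style listing layout with ISO dates.

```python
import re
from datetime import datetime
from urllib.request import urlopen

# e.g.  <a href="record1.xml">record1.xml</a>   2025-05-23 14:02   12K
APACHE_ROW = re.compile(
    r'<a href="(?P<name>[^"?/]+\.xml)">[^<]+</a>\s*'
    r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2})'
)


def list_waf(waf_url):
    """Return {file_url: last_modified} parsed from a WAF directory listing."""
    html = urlopen(waf_url).read().decode("utf-8", errors="replace")
    results = {}
    for match in APACHE_ROW.finditer(html):
        file_url = waf_url.rstrip("/") + "/" + match.group("name")
        results[file_url] = datetime.strptime(match.group("date"), "%Y-%m-%d %H:%M")
    return results
```

Listings from IIS or other servers use different date formats, so a real implementation (like ckanext-spatial's) needs several patterns rather than this single regex.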

@tdlowden tdlowden moved this to 📥 Queue in data.gov team board May 29, 2025
@rshewitt rshewitt moved this from 📥 Queue to 🏗 In Progress [8] in data.gov team board Jun 3, 2025
@rshewitt rshewitt self-assigned this Jun 3, 2025