User Story
In order to harvest large WAF sources, data.gov managers want the harvest process to be optimized.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
GIVEN the IOOS harvest source is set up in datagov-harvester
WHEN a harvest is run
THEN the harvest completes within 24 hours
AND data is available on catalog-next
Background
We tried running for up to 19 hours and bumping the task's memory to 5 GB, but the harvest was still unable to complete.
We have 7 harvest sources with over 10K records.
Other context is in the Slack thread.
Security Considerations (required)
None
Sketch
We currently do a complete pull of the source, get the hash, do the comparison, and then set up the create, update, and delete lists. Since this doesn't seem tenable, the following logic should hopefully speed things up (a rough code sketch follows the list below).
The traverse_waf function gives a list of all the URLs. We use the URLs as identifiers, so before we start downloading files we should use this list for a quick comparison.
We can compare a few things: does anything in our system need to be removed because it has been removed from the source? That gives us our delete list.
Are there timestamps in the HTML of the WAF? If so, use them to determine whether we even need to pull the file (i.e. a simple timestamp compare) or whether it can be skipped.
Does this URL exist in our system today?
If not, download it and add it to the create list; save it to the DB and then delete the in-memory record.
If yes, download and hash it, then compare against the previous hash; if different, mark it for update, and if not, discard it.
Once the extract-and-compare step above is done, pull records from the DB (one at a time?) and process them (transform, validate, load).
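A minimal sketch of that URL-first comparison in Python. The dict shapes, the SHA-256 hash, and the idea that traverse_waf's URL list can be paired with a Last-Modified value scraped from the WAF listing are assumptions for illustration, not the harvester's actual API:

```python
import hashlib
import urllib.request


def plan_harvest(remote_entries, known_records):
    """Decide create/update/delete lists before downloading everything.

    remote_entries: {url: last_modified_or_None} built from traverse_waf's URL list
    known_records:  {url: {"hash": str, "last_modified": str_or_None}} from our DB
    (both shapes are assumptions for illustration)
    """
    # 1. URLs we know about that are gone from the source -> delete list.
    to_delete = sorted(set(known_records) - set(remote_entries))

    to_create, to_update = [], []
    for url, last_modified in remote_entries.items():
        if url not in known_records:
            # 2. Brand new URL: goes straight on the create list, no compare needed.
            to_create.append(url)
            continue
        known = known_records[url]
        # 3. If the WAF listing exposes a timestamp and it hasn't changed, skip the download.
        if last_modified and last_modified == known.get("last_modified"):
            continue
        # 4. Otherwise download, hash, and compare against the stored hash.
        with urllib.request.urlopen(url) as resp:
            digest = hashlib.sha256(resp.read()).hexdigest()
        if digest != known.get("hash"):
            to_update.append(url)

    return to_create, to_update, to_delete
```

This keeps only one file in memory at a time and never downloads files whose listing timestamp is unchanged, which is where most of the savings on a 10K-record WAF should come from.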
All of this may become a moot point if memory still overflows, i.e. in the initial load where every object from the source must be downloaded, hashed, processed, and loaded. I believe we may need to create a process for "picking up where we left off". These harvest records will have an action (create, update, etc.), a date_created, a date_finished, and a status. I would like to use these fields to create another task that could "pick up" jobs that crashed for various reasons and try to complete them.
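A rough sketch of what that recovery task could look like, assuming a psycopg-style DB connection, a hypothetical harvest_record table with the fields described above, and an injected enqueue helper; the two-hour staleness cutoff is arbitrary:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical query: the table and column names follow the fields described
# above (action, status, date_created, date_finished), not a confirmed schema.
STALLED_SQL = """
    SELECT id, action FROM harvest_record
    WHERE status = 'in_progress'
      AND date_finished IS NULL
      AND date_created < %(cutoff)s
"""


def resume_stalled(conn, enqueue, stale_after=timedelta(hours=2)):
    """Find records whose job apparently crashed and re-enqueue them."""
    cutoff = datetime.now(timezone.utc) - stale_after
    with conn.cursor() as cur:
        cur.execute(STALLED_SQL, {"cutoff": cutoff})
        for record_id, action in cur.fetchall():
            # Re-run the record under its original action (create, update, ...).
            enqueue(record_id, action)
```

Run periodically (e.g. as a scheduled task alongside the harvester), this would let an interrupted initial load resume instead of starting over.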