H2.0 runner memory optimization spike #5256


Closed
1 task
jbrown-xentity opened this issue May 19, 2025 · 3 comments
Assignees
Labels
H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

Comments

@jbrown-xentity
Contributor

Purpose

We want to optimize memory usage on cloud.gov, but we're not sure what the current process will require from the system.

Given the question above, testing is needed to establish the facts that will inform next steps.

One day of effort has been allocated; once complete, findings will be demonstrated and specific next actions will be decided.

Acceptance Criteria

[ACs should be clearly demo-able/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN our largest source (IOOS) is harvestable
    WHEN 1 day expires
    THEN the amount of required memory to successfully harvest the source is known
    AND a memory increase recommendation is made
    AND any optimization/fixes are proposed (if necessary)

Background

See https://datagov-harvest-admin-dev.app.cloud.gov/harvest_source/554d15db-6080-4441-b4b2-d045451d6967, which is currently crashing regularly.
If memory usage turns out to be high, we may want to investigate what a "typical" source's memory requirements are and consider an S/M/L approach.
May relate to, or be blocked by, #5254.

Sketch

Start at 2 G and increase as failures occur. Report success once the import stage starts. Review the code for possible optimizations.
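The probe loop described above could be sketched as follows. This is a hedged sketch, not the team's actual tooling: the app name "harvest-runner" is a placeholder, and the `cf scale` invocation assumes the standard Cloud Foundry CLI used on cloud.gov.

```python
# Sketch of the probing approach: start the harvester at 2 G and step the
# memory limit up each time the job is OOM-killed.
# The app name "harvest-runner" is a placeholder, not a confirmed name.
import subprocess


def memory_ladder(start_gb: int = 2, max_gb: int = 8):
    """Yield memory limits to try, in GB, one step at a time."""
    gb = start_gb
    while gb <= max_gb:
        yield gb
        gb += 1


def scale_app(app: str, gb: int) -> None:
    # On cloud.gov (Cloud Foundry), app memory is adjusted with `cf scale`.
    subprocess.run(["cf", "scale", app, "-m", f"{gb}G", "-f"], check=True)


if __name__ == "__main__":
    for gb in memory_ladder():
        print(f"next attempt: {gb}G")
        # then: scale_app("harvest-runner", gb) and rerun the harvest job,
        # stopping once the job survives to the import stage
```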

@jbrown-xentity jbrown-xentity self-assigned this May 19, 2025
@jbrown-xentity jbrown-xentity added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label May 19, 2025
@jbrown-xentity jbrown-xentity moved this to 🏗 In Progress [8] in data.gov team board May 19, 2025
@jbrown-xentity
Contributor Author

So the harvester never made it past the compare stage. It ran for 5.1 hours, and the memory usage climbed slowly and consistently. You can see that usage here.
Some unfortunate things:

  • There was no logging during the extraction process, so I had no way of knowing how far along we were or how much further we had to go.
  • We did get a log line that extraction completed, but the job ran out of memory at the hashing step. It was already right at the 3 G limit, so it's unclear how much more headroom it will need.
  • I downloaded a metadata file from the WAF. It was 129 KB (0.129 MB). 35K of those is about 4,500 MB, or roughly 4.5 G. So just holding the downloads in memory is roughly 4.5 G.
  • We have 7 sources with > 10K datasets. Of those, 2 are DCAT-US and 5 are WAF.
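The back-of-envelope estimate above can be checked quickly. The figures come from the comment (one sampled 129 KB file, roughly 35K records in the source); decimal units are assumed.

```python
# Rough memory estimate for holding every WAF record in memory at once.
record_kb = 129          # size of one sampled metadata file, in KB
record_count = 35_000    # approximate number of records in the source

total_mb = record_kb * record_count / 1000   # KB -> MB (decimal units)
total_gb = total_mb / 1000                   # MB -> GB

print(f"{total_mb:.0f} MB ~= {total_gb:.1f} G")  # 4515 MB ~= 4.5 G
```

So the raw downloads alone roughly fill a 4 G container before any parsing, hashing, or comparison overhead is counted.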

To get the largest sources working, we will probably need at least 4 G. However, we probably don't need that for most jobs. I'd like to have a working session next week to consider implementing t-shirt sizing of harvest sources so we can size jobs accordingly. The logic will be a bit more complex, but not much.
A 5 G test is ongoing; I expect it to take at least 5 hours to extract from the source, and then possibly longer to sync with CKAN. I will review the statistics in the morning.
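A minimal sketch of the t-shirt-sizing idea, assuming jobs are sized by the source's dataset count. The thresholds and memory values below are illustrative placeholders, not anything the team has decided.

```python
# Hypothetical S/M/L sizing: pick a memory limit for a harvest job based on
# how many datasets the source held last time. Thresholds are placeholders.
def job_memory_gb(dataset_count: int) -> int:
    if dataset_count < 1_000:
        return 2   # S: the common case
    if dataset_count < 10_000:
        return 4   # M
    return 6       # L: e.g. the 7 sources with > 10K datasets

print(job_memory_gb(500), job_memory_gb(5_000), job_memory_gb(35_000))  # 2 4 6
```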

@FuhuXia
Member

FuhuXia commented May 22, 2025

shocking

[attached image]

@jbrown-xentity
Contributor Author

Unfortunately this job has still not finished. Worse, it hasn't moved on: its memory usage hasn't changed significantly since 4 p.m. yesterday. New Relic logs show that the task never made it past the external records prep. As of this writing, the task is still running and holding its memory.

I've made a ticket (#5261) to follow up on this spike, as I don't believe the current solution is tenable for a large source like IOOS. We will discuss in office hours whether the proposed sketch is appropriate, and what other optimizations are possible.

@jbrown-xentity jbrown-xentity moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board May 23, 2025
@neilmb neilmb moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Jun 2, 2025
@neilmb neilmb closed this as completed by moving to ✔ Done in data.gov team board Jun 2, 2025