Index lifecycle management execution metrics #457

Open
wants to merge 1 commit into
base: master

@mokrinsky mokrinsky commented Jul 18, 2021

I already suggested this in #306, but I closed that PR before the repo was transferred to prometheus-community. It seems worth opening another pull request, because I believe I'm not the only one who'll find these metrics useful.

In my daily routine I'd like to monitor ILM execution stats: how many indices are covered by ILM policies, how many errors there are, and so on. This is a simple implementation of that goal which I have used since 7.3.2; I'm now on 7.11.x and it still works. I haven't seen any changes to ILM recently, so I assume it's compatible with any 7.* release and maybe even earlier ones.

Example of metrics available:

elasticsearch_ilm_index_status{action="rollover",index="foo_2",phase="hot",step="check-rollover-ready"} 1
elasticsearch_ilm_index_status{action="shrink",index="foo_3",phase="warm",step="shrunk-shards-allocated"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_4",phase="warm",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_5",phase="hot",step="complete"} 1
elasticsearch_ilm_index_status{action="complete",index="foo_6",phase="new",step="complete"} 1
elasticsearch_ilm_index_status{action="",index="foo_7",phase="",step=""} 0

The numeric value shows whether a given index is covered by an ILM policy at all (in the example above, index foo_7 has no policy attached; the others have one). Everything else in the labels comes straight from the _all/_ilm/explain API result.
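
To make the mapping concrete, here is a minimal Python sketch (not the exporter's Go code) of how an _all/_ilm/explain response could be turned into the sample lines above. The function name `ilm_status_lines` is hypothetical. It assumes each index entry carries the `managed`, `phase`, `action`, and `step` fields, and it does not escape label values.

```python
def ilm_status_lines(explain):
    """Render one elasticsearch_ilm_index_status sample per index.

    `explain` is the parsed JSON body of GET _all/_ilm/explain.
    Hypothetical helper for illustration; label values are not escaped.
    """
    lines = []
    for name, info in sorted(explain.get("indices", {}).items()):
        # Unmanaged indices have no phase/action/step, so default to "".
        labels = {
            "action": info.get("action", ""),
            "index": name,
            "phase": info.get("phase", ""),
            "step": info.get("step", ""),
        }
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        # 1 if an ILM policy manages the index, 0 otherwise.
        value = 1 if info.get("managed") else 0
        lines.append(f"elasticsearch_ilm_index_status{{{label_str}}} {value}")
    return lines


if __name__ == "__main__":
    sample = {
        "indices": {
            "foo_2": {"managed": True, "phase": "hot",
                      "action": "rollover", "step": "check-rollover-ready"},
            "foo_7": {"managed": False},
        }
    }
    print("\n".join(ilm_status_lines(sample)))
```

An unmanaged index such as foo_7 ends up with empty labels and the value 0, matching the last sample line.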

Signed-off-by: Nikolay Mokrinsky (ML) <[email protected]>
@wojtas911
wojtas911 commented Aug 26, 2021

Hi,
Can we ask when it will be merged, please?
It's not a lot of code, so it could be reviewed quite quickly, and these metrics would be very useful.
Thank you
Kudos @mokrinsky 🥇

@tgrondier

Isn't this missing labels such as cluster?

@mokrinsky
Author

@tgrondier yes, they are actually missing. My Prometheus environment adds a cluster label by default, so I didn't notice its absence in the exporter. I'll fix it soon.
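
For illustration only, a minimal Python sketch of what a sample with a cluster label could look like. It is not the exporter's Go code: `format_metric` and `add_cluster` are hypothetical helpers, and the sketch assumes the cluster name is read from the `cluster_name` field of the `GET /` response.

```python
def format_metric(name, labels, value):
    """Render one Prometheus text-format sample, labels sorted by name.

    Hypothetical helper; label values are not escaped in this sketch.
    """
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}} {value}"


def add_cluster(labels, root_info):
    """Return a copy of `labels` with a `cluster` label added.

    `root_info` is assumed to be the parsed JSON body of GET /,
    which includes a "cluster_name" field.
    """
    return {**labels, "cluster": root_info["cluster_name"]}


if __name__ == "__main__":
    labels = {"action": "rollover", "index": "foo_2",
              "phase": "hot", "step": "check-rollover-ready"}
    root = {"cluster_name": "prod"}  # made-up cluster name
    print(format_metric("elasticsearch_ilm_index_status",
                        add_cluster(labels, root), 1))
```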

@arapaho arapaho mentioned this pull request Jan 3, 2022
@paulojmdias

Hi 👋

Any news on this PR?

@mokrinsky will you finish the work needed to get it ready to merge?

@Evesy
Contributor

Evesy commented May 12, 2022

Also interested if this is going to be picked up & finished off

We've pulled this change into our fork and it does do the job. I think there's room for improvement when it comes to using these metrics for alerting, specifically around actions that can be retried:

  • The metric does not tell you whether an action can be retried, and if so how many retries have been attempted
  • When an action is retrying, the Error metric disappears while the action is retried

Both of these factors make it a bit more difficult to alert on. For our case we'd want to alert on:

  • A failed action that is not retriable
  • A failed action that is retriable but has already failed n retries

Not sure exactly what the metrics would look like for this. It's difficult as the ILM explain API itself hides the error state when the action is retrying.
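
As a rough sketch of the alerting rule described above, written in Python rather than the exporter's Go: `should_alert` and `max_retries` are hypothetical names. The sketch assumes the explain entry exposes `step`, `is_auto_retryable_error`, and `failed_step_retry_count`. Whether those fields stay visible while a retry is in progress depends on the Elasticsearch version and should be checked against a real cluster.

```python
def should_alert(info, max_retries=3):
    """Decide whether one index's ILM explain entry warrants an alert.

    Hypothetical helper. `info` is one entry from the "indices" map of
    the ILM explain response. Field availability is an assumption.
    """
    if not info.get("managed"):
        # No policy attached, so nothing can fail.
        return False
    retries = info.get("failed_step_retry_count", 0)
    in_error = info.get("step") == "ERROR"
    retryable = info.get("is_auto_retryable_error", False)
    if in_error and not retryable:
        # Case 1: a failed action that is not retriable.
        return True
    # Case 2: a retriable action that has used up its retry budget.
    return retries >= max_retries


if __name__ == "__main__":
    stuck = {"managed": True, "step": "ERROR",
             "is_auto_retryable_error": False}
    print(should_alert(stuck))
```

If the exporter exposed the retry count and the retryable flag as their own metrics, the same rule could be written as an alert expression instead of being hard-coded.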

5 participants