Sorting across the whole data set doesn't work when using a point-in-time search with slicing #101096

Open
@valasatava

Description

Elasticsearch Version

8.9.1

Installed Plugins

No response

Java Version

20.0.2

OS Version

5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I'm trying to pull sorted results from Elasticsearch. There can be millions of documents, and it takes a very long time to fetch all of them. I'm looking for ways to speed this up.

I implemented sliced searches with a point-in-time (PIT), which improves the time, but the results are no longer globally sorted. Each slice is sorted only within itself, whereas I need the results returned in sort order across the whole data set.

For example, this search for slice 1:

GET _search
{
  "slice": {
    "id": 1,
    "max": 5
  },
  "pit": {
    "id": "tOaGBAEXY29tYmluZWRfbW9sX2RlZmluaXRpb24WaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAWTUpXdDdzb1BUYXlUd1NSS0l4THFMUQAAAAAAAAOpjBZwLUZROWxiWVJycWtUQnRFWk1iek9nAAEWaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAA"
  },
  "_source": ["_none_"],
  "docvalue_fields": ["rcsb_id"], 
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "rcsb_id": {
        "order": "asc"
      }
    }
  ]
}

returns "006" as the ID of its first document (only the first two hits are shown):

{
  "pit_id": "tOaGBAEXY29tYmluZWRfbW9sX2RlZmluaXRpb24WaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAWTUpXdDdzb1BUYXlUd1NSS0l4THFMUQAAAAAAAAOpjBZwLUZROWxiWVJycWtUQnRFWk1iek9nAAEWaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAA",
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8227,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "combined_mol_definition",
        "_id": "006-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "006"
          ]
        },
        "sort": [
          "006",
          38024
        ]
      },
      {
        "_index": "combined_mol_definition",
        "_id": "00B-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "00B"
          ]
        },
        "sort": [
          "00B",
          47562
        ]
      }

and the same search for slice 2 returns "001" first:

{
  "pit_id": "tOaGBAEXY29tYmluZWRfbW9sX2RlZmluaXRpb24WaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAWTUpXdDdzb1BUYXlUd1NSS0l4THFMUQAAAAAAAAOpjBZwLUZROWxiWVJycWtUQnRFWk1iek9nAAEWaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAA",
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8257,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "combined_mol_definition",
        "_id": "001-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "001"
          ]
        },
        "sort": [
          "001",
          56945
        ]
      },
      {
        "_index": "combined_mol_definition",
        "_id": "003-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "003"
          ]
        },
        "sort": [
          "003",
          63266
        ]
      }
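
To make the expected ordering concrete: since each slice comes back sorted on `rcsb_id` only within itself, a globally sorted stream can currently only be recovered by merging the slices on the client. The following is a minimal sketch of such a k-way merge in Python, using only the standard library. The function name `merge_sorted_slices` is hypothetical, the hit lists are abbreviated from the two responses above, and the sketch assumes that comparing the `sort` arrays as plain Python lists matches the order Elasticsearch uses, which may not hold for every keyword value.

```python
import heapq
from typing import Dict, Iterable, Iterator, List


def merge_sorted_slices(slices: Iterable[List[Dict]]) -> Iterator[Dict]:
    """Lazily merge per-slice hit lists, each already sorted by hit["sort"]."""
    # heapq.merge does a k-way merge of inputs that are already sorted;
    # the key compares the "sort" arrays element by element.
    return heapq.merge(*slices, key=lambda hit: hit["sort"])


# Abbreviated hits copied from the two responses above (assumed sample data).
slice_1 = [
    {"_id": "006-CHEM_COMP", "sort": ["006", 38024]},
    {"_id": "00B-CHEM_COMP", "sort": ["00B", 47562]},
]
slice_2 = [
    {"_id": "001-CHEM_COMP", "sort": ["001", 56945]},
    {"_id": "003-CHEM_COMP", "sort": ["003", 63266]},
]

merged_ids = [hit["_id"] for hit in merge_sorted_slices([slice_1, slice_2])]
print(merged_ids)
```

Because `heapq.merge` accepts any iterables, each argument could also be a generator that pages through one slice, so the merge would not need every hit in memory at once.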

Steps to Reproduce

Step 1: mappings

{
  "mappings": {
    "properties": {
      "rcsb_id": {
        "type": "keyword",
        "eager_global_ordinals": true,
        "fields": {
          "normalized": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}

Step 2: index creation

Step 3: opening point-in-time

Step 4: requesting slices
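
As a rough illustration of this step, the per-slice request bodies can be generated from the search shown in the problem description, changing only `slice.id`. This is a sketch under assumptions: `build_slice_requests` is a hypothetical helper, the PIT ID is a placeholder for the value returned in step 3, and the bodies still have to be sent to `_search` by whatever client is in use.

```python
from typing import Dict, List


def build_slice_requests(pit_id: str, max_slices: int) -> List[Dict]:
    """Return one search body per slice; only slice.id differs between them."""
    return [
        {
            "slice": {"id": slice_id, "max": max_slices},
            "pit": {"id": pit_id},
            "_source": ["_none_"],
            "docvalue_fields": ["rcsb_id"],
            "query": {"match_all": {}},
            "sort": [{"rcsb_id": {"order": "asc"}}],
        }
        for slice_id in range(max_slices)  # slice ids run from 0 to max - 1
    ]


# Placeholder PIT id; the real value comes from opening the point-in-time.
bodies = build_slice_requests("<pit id from step 3>", 5)
print([body["slice"] for body in bodies])
```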

Logs (if relevant)

No response
