ddmitov/reteti

Reteti

Reteti is a lexical search experiment using a partitioned index of hashed words in object storage.

Design Objectives

  1. Fast lexical search with index data based entirely on object storage
  2. Usability in serverless or scale-to-zero applications for scalability and cost control
  3. Adaptability to different cloud environments or on-premise systems

Features

  • All index data is stored only in object storage.

  • Reteti is language-agnostic and does not use language-specific stemmers.

  • Storage and compute are decoupled and Reteti can be used in serverless functions.

  • The index and text locations are independent from one another.

Workflow

  • Texts are split into words using a normalizer and a pre-tokenizer from the Tokenizers Python module.

  • Words are hashed and their positions are saved in Arrow files under hash prefixes in object storage.

  • Only the Arrow files for the hashed words in the search request are read during search.

  • Words are represented by their hashes or alias integers during search.

  • Search is performed using DuckDB SQL.
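
The indexing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Reteti's actual code: lowercasing and whitespace splitting stand in for the Tokenizers normalizer and pre-tokenizer, an in-memory dictionary stands in for Arrow files in object storage, and the hash function (BLAKE2b), the two-character prefix length and the names `hash_word` and `build_partitions` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def hash_word(word: str) -> str:
    """Return a hex digest for a word (BLAKE2b is an assumed choice)."""
    return hashlib.blake2b(word.encode("utf-8"), digest_size=8).hexdigest()


def build_partitions(texts: dict[int, str], prefix_length: int = 2) -> dict:
    """Group (hash, text_id, position) rows by hash prefix.

    Each key of the result plays the role of one partition file.
    """
    partitions = defaultdict(list)
    for text_id, text in texts.items():
        # Stand-in for the normalizer and pre-tokenizer:
        # lowercase the text and split it on whitespace.
        for position, word in enumerate(text.lower().split()):
            word_hash = hash_word(word)
            prefix = word_hash[:prefix_length]
            partitions[prefix].append((word_hash, text_id, position))
    return dict(partitions)


texts = {1: "the quick brown fox", 2: "the lazy dog"}
partitions = build_partitions(texts)

# A search for "fox" only needs the partition that holds its hash prefix.
fox_hash = hash_word("fox")
fox_rows = [row for row in partitions[fox_hash[:2]] if row[0] == fox_hash]
print(fox_rows)
```

Because a word's partition is derived from its hash alone, a search request can work out which partitions to fetch without consulting any central dictionary.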

Word Definition

A word is any sequence of lowercase Unicode alphanumeric characters between two whitespace characters.

Demo

A Gradio demo is available on Fly.io.
It is a scale-to-zero application and its object storage is managed by Tigris Data.

Search Criteria

Reteti selects the IDs of texts that match the following criteria:

  1. They have the full set of unique word hashes presented in the search request.
  2. They have one or more sequences of word hashes identical to the sequence of word hashes in the search request.
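
The two criteria can be illustrated with a small pure-Python sketch. It is not the DuckDB SQL that Reteti runs: plain integers stand in for word hashes, each document is assumed to be an ordered list of those hashes, and the function names are hypothetical.

```python
def has_all_hashes(request: list[int], document: list[int]) -> bool:
    """Criterion 1: every unique request hash occurs in the document."""
    return set(request) <= set(document)


def has_sequence(request: list[int], document: list[int]) -> bool:
    """Criterion 2: the request occurs as a contiguous run in the document."""
    length = len(request)
    return any(
        document[start:start + length] == request
        for start in range(len(document) - length + 1)
    )


def select_ids(request: list[int], documents: dict[int, list[int]]) -> list[int]:
    """Return the IDs of documents that satisfy both criteria."""
    return [
        text_id
        for text_id, hashes in documents.items()
        if has_all_hashes(request, hashes) and has_sequence(request, hashes)
    ]


documents = {
    1: [10, 20, 30, 40],  # contains 20 followed by 30
    2: [30, 20, 50],      # contains both hashes, but not in request order
    3: [10, 40],          # contains neither hash
}
selected = select_ids([20, 30], documents)
print(selected)
```

In this sketch the set check is the cheaper test and runs first, so documents missing any request hash are discarded before the sequence comparison.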

Ranking Criterion

The ranking criterion is matching word frequency: the number of search request words found in a document divided by the total number of words in the document. Short documents with a high number of matching words appear at the top of the search results.
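
As a rough illustration of this ratio, the sketch below ranks two toy documents. It assumes that every occurrence of a request word in the document counts as a match, which is one possible reading of the definition; the function name `matching_word_frequency` and the sample data are hypothetical.

```python
def matching_word_frequency(request: list[str], document: list[str]) -> float:
    """Matching words in the document divided by all words in the document."""
    if not document:
        return 0.0
    request_words = set(request)
    # Assumption: each occurrence of a request word counts once.
    matches = sum(1 for word in document if word in request_words)
    return matches / len(document)


documents = {
    "short": ["giraffe", "calf"],
    "long": ["the", "giraffe", "calf", "was", "saved", "by", "the", "community"],
}
request = ["giraffe", "calf"]

scores = {
    doc_id: matching_word_frequency(request, words)
    for doc_id, words in documents.items()
}
# Highest frequency first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Both documents contain the two request words, but the short one scores 2/2 and the long one 2/8, so the short document ranks first.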

Name

Reteti was a giraffe calf orphaned during a severe drought around 2018 in Northern Kenya and saved thanks to the kindness and efforts of a local community.

Today we use complex data processing technologies thanks to the knowledge, persistence and efforts of many people of a large global community. Just like the small Reteti, we owe much to our community and should always be thankful to its members for their goodwill and contributions!

This program is licensed under the terms of the Apache License 2.0.

Author

Dimitar D. Mitov, 2024 - 2025
