Skip to content

RegionTermIndexer: Port from C++ #178

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sreekanth-cb
Copy link

Attempt to port a functional version of S2RegionTermIndexer

Copy link

google-cla bot commented May 14, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

// -----------------
//
// Given a query region, we want to find all of the document regions that
// intersect it. The first step is to represent all the regions as S2Cell
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughout, use the Go type names instead of the C++ type names. (e.g Cell vs S2Cell)

var defaultMaxCells = int(8)

type Options struct {
///////////////// Options Inherited From S2RegionCoverer ////////////////
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go doesn't inherit. Can say see RegionCoverer for more details on these options.

return &RegionTermIndexer{options: option}
}

func (rti *RegionTermIndexer) GetTerm(termTyp TermType, id CellID,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use a single letter for the type 'r' vs 'rti' in the method reciever.

var marker = string('$')

const (
ANCESTOR TermType = iota + 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preface this enum values with the type so we know what it's for when we see it. e.g., TermTypeAncestor.

iota is sufficient, there's no need to start it at +1.

// covering term, but as an optimization we always index these cells as
// ancestor terms only. This is possible because query regions will never
// contain a descendant of such cells. Note that this is true even when
// max_level() != true_max_level() (see S2RegionCoverer::Options).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update these methods to match the naming they have above.

@@ -0,0 +1,441 @@
// Copyright 2025 Google Inc. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this from "Google Inc." to "The S2 Geometry Project Authors."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we've done that in any files yet, though it may be a good idea.

The copyright story is perhaps complicated because a lot of this file is copied verbatim from a C++ file which has a Google Inc copyright line.

}
}

func TestIndexRegionsQueryRegionsOptimizeTime(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the style of other test names in this package, preface the test case name with the file/type here to make it clearer where a failing test is coming from. e.g.
TestRegionTermIndexer......

func TestIndexPointsQueryRegionsOptimizeTime(t *testing.T) {
options := Options{minLevel: 0,
levelMod: 2,
maxLevel: 30,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use the actual MaxLevel constant here and throughout.

maxCells: 8,
}
testRandomCaps(t, options, CAP)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the MaxLevelSetLoosely and MarkerCharacter tests?

// DEFAULT: false
optimizeSpace bool
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact we need to explicitly set levelMod on every unit test feels like a point of trouble that's going to bite people down the road. A default options instance should work safely without the user needing to know to set certain fields.

e.g. Anyone who forgets to set it will get stuck with infinite loops because several methods do this:

for (level - rti.options.levelMod) >= rti.options.minLevel {

I think there should be a DefaultRegionTermIndexerOptions() that gives the caller a safe variable to start from when they need an options.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will still be a foot gun. @alan-strohm @flwyd Thoughts on idiomatic ways to check for valid options?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional options would be a good pattern here. (See the go-cmp project for a good example of Options in use.)

I also question whether the Options type needs to be public: could there be a NewRegionTermIndexer constructor that takes a list of RegionTermIndexerOption functions which set state in the RegionTermIndexer and the RegionCoverer? It looks like most of the fields in Options just get copied over to the Coverer.

Do users of the indexer need to be able to inspect the option state (public fields), or would a set-only interface, as in the functional options pattern, suffice?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do users of the indexer need to be able to inspect the option state (public fields), or would a set-only interface, as in the functional options pattern, suffice?

The accessors are only so the builder and related functions can get the options. Clients only need to set them.

@jmr jmr changed the title Introducing RegionTermIndexer RegionTermIndexer: Port from C++ May 14, 2025
@jmr
Copy link
Collaborator

jmr commented May 14, 2025

@jmr
Copy link
Collaborator

jmr commented May 14, 2025

It looks like this is a copy of https://github.com/blevesearch/geo/commits/master/s2/region_term_indexer.go [helping to resolve #134] but you wrote that anyway, then changed email addresses. It would save time to provide a bit more background in the PR.

// indexes but want fast serving, it might be reasonable to set
// max_cells() == 100 during indexing and max_cells() == 8 for queries.
//
// DEFAULT: 8 (coarse approximations)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"DEFAULT" is a bit misleading here. Document the value in DefaultRegionTermIndexerOptions and reference that instead?

// limitations under the License.

//
// Indexing Strategy
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the RegionTermIndexer type (and its documentation) at the top so that someone browsing this file can quickly understand what it's about. This comment is an implementation detail coming from the .cc file; having the code that's ported from .h first is a more natural introduction to the file. That move will also help the reader understand terms like "document" in this comment.

Also nit: This long block comment might read a little nicer with /* … */ rather than line-comments. (Starting each line with * isn't needed.)

@@ -0,0 +1,441 @@
// Copyright 2025 Google Inc. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we've done that in any files yet, though it may be a good idea.

The copyright story is perhaps complicated because a lot of this file is copied verbatim from a C++ file which has a Google Inc copyright line.

return trueMax
}

// RegionTermIndexer is a helper struct for adding spatial data to an
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"helper class" is a term that's popular in the C++ world that I don't find particularly informative. Perhaps "RegionTermIndexer is a tool for adding spatial data…" or "RegionTermIndexer adds spatial data…"

// and then building an "inverted index" that maps each term to a list of
// documents (and document positions) where that term occurs.
//
// This class deals with the problem of converting spatial data into index
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go doesn't have classes.

// terms, which can then be indexed along with the other document information.
//
// Spatial data is represented using the S2Region type. Useful S2Region
// subtypes include:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for this type will be on the same page as the docs for s2.Region and all these implementing types. Is it sufficient to say this indexer works with any type that implements the Region interface?

indexTerms += len(terms)
}

for i := 0; i < NUM_ITERATIONS; i++ {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These loops look very similar. Add a comment that explains the difference in some way.

sort.Ints(expected)
sort.Ints(actual)

if !reflect.DeepEqual(expected, actual) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a case for cmp.Diff

t.Errorf("expected: %+v Vs actual: %+v", expected, actual)
}

queryTerms += len(terms)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this variable ever read?


const (
POINT QueryType = iota + 1
CAP
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to test Region types other than Cap?

deduplicate(&actual)

sort.Ints(expected)
sort.Ints(actual)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could ensure uniqueness by using map[int]struct{} (or bool) rather than a slice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants