-
Notifications
You must be signed in to change notification settings - Fork 186
RegionTermIndexer: Port from C++ #178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
// ----------------- | ||
// | ||
// Given a query region, we want to find all of the document regions that | ||
// intersect it. The first step is to represent all the regions as S2Cell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throughout, use the Go type names instead of the C++ type names. (e.g Cell vs S2Cell)
var defaultMaxCells = int(8) | ||
|
||
type Options struct { | ||
///////////////// Options Inherited From S2RegionCoverer //////////////// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go doesn't inherit. Can say see RegionCoverer for more details on these options.
return &RegionTermIndexer{options: option} | ||
} | ||
|
||
func (rti *RegionTermIndexer) GetTerm(termTyp TermType, id CellID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just use a single letter for the type 'r' vs 'rti' in the method reciever.
var marker = string('$') | ||
|
||
const ( | ||
ANCESTOR TermType = iota + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preface this enum values with the type so we know what it's for when we see it. e.g., TermTypeAncestor.
iota is sufficient, there's no need to start it at +1.
// covering term, but as an optimization we always index these cells as | ||
// ancestor terms only. This is possible because query regions will never | ||
// contain a descendant of such cells. Note that this is true even when | ||
// max_level() != true_max_level() (see S2RegionCoverer::Options). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update these methods to match the naming they have above.
@@ -0,0 +1,441 @@ | |||
// Copyright 2025 Google Inc. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this from "Google Inc." to "The S2 Geometry Project Authors."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like we've done that in any files yet, though it may be a good idea.
The copyright story is perhaps complicated because a lot of this file is copied verbatim from a C++ file which has a Google Inc copyright line.
} | ||
} | ||
|
||
func TestIndexRegionsQueryRegionsOptimizeTime(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the style of other test names in this package, preface the test case name with the file/type here to make it clearer where a failing test is coming from. e.g.
TestRegionTermIndexer......
func TestIndexPointsQueryRegionsOptimizeTime(t *testing.T) { | ||
options := Options{minLevel: 0, | ||
levelMod: 2, | ||
maxLevel: 30, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use the actual MaxLevel constant here and throughout.
maxCells: 8, | ||
} | ||
testRandomCaps(t, options, CAP) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the MaxLevelSetLoosely and MarkerCharacter tests?
// DEFAULT: false | ||
optimizeSpace bool | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact we need to explicitly set levelMod on every unit test feels like a point of trouble that's going to bite people down the road. A default options instance should work safely without the user needing to know to set certain fields.
e.g. Anyone who forgets to set it will get stuck with infinite loops because several methods do this:
for (level - rti.options.levelMod) >= rti.options.minLevel {
I think there should be a DefaultRegionTermIndexerOptions() that gives the caller a safe variable to start from when they need an options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will still be a foot gun. @alan-strohm @flwyd Thoughts on idiomatic ways to check for valid options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functional options would be a good pattern here. (See the go-cmp project for a good example of Options in use.)
I also question whether the Options type needs to be public: could there be a NewRegionTermIndexer constructor that takes a list of RegionTermIndexerOption functions which set state in the RegionTermIndexer and the RegionCoverer? It looks like most of the fields in Options just get copied over to the Coverer.
Do users of the indexer need to be able to inspect the option state (public fields), or would a set-only interface, as in the functional options pattern, suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do users of the indexer need to be able to inspect the option state (public fields), or would a set-only interface, as in the functional options pattern, suffice?
The accessors are only so the builder and related functions can get the options. Clients only need to set them.
It looks like this is a copy of https://github.com/blevesearch/geo/commits/master/s2/region_term_indexer.go [helping to resolve #134] but you wrote that anyway, then changed email addresses. It would save time to provide a bit more background in the PR. |
// indexes but want fast serving, it might be reasonable to set | ||
// max_cells() == 100 during indexing and max_cells() == 8 for queries. | ||
// | ||
// DEFAULT: 8 (coarse approximations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"DEFAULT" is a bit misleading here. Document the value in DefaultRegionTermIndexerOptions
and reference that instead?
// limitations under the License. | ||
|
||
// | ||
// Indexing Strategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the RegionTermIndexer type (and its documentation) at the top so that someone browsing this file can quickly understand what it's about. This comment is an implementation detail coming from the .cc
file; having the code that's ported from .h
first is a more natural introduction to the file. That move will also help the reader understand terms like "document" in this comment.
Also nit: This long block comment might read a little nicer with /* … */
rather than line-comments. (Starting each line with *
isn't needed.)
@@ -0,0 +1,441 @@ | |||
// Copyright 2025 Google Inc. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like we've done that in any files yet, though it may be a good idea.
The copyright story is perhaps complicated because a lot of this file is copied verbatim from a C++ file which has a Google Inc copyright line.
return trueMax | ||
} | ||
|
||
// RegionTermIndexer is a helper struct for adding spatial data to an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"helper class" is a term that's popular in the C++ world that I don't find particularly informative. Perhaps "RegionTermIndexer is a tool for adding spatial data…" or "RegionTermIndexer adds spatial data…"
// and then building an "inverted index" that maps each term to a list of | ||
// documents (and document positions) where that term occurs. | ||
// | ||
// This class deals with the problem of converting spatial data into index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go doesn't have classes.
// terms, which can then be indexed along with the other document information. | ||
// | ||
// Spatial data is represented using the S2Region type. Useful S2Region | ||
// subtypes include: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for this type will be on the same page as the docs for s2.Region and all these implementing types. Is it sufficient to say this indexer works with any type that implements the Region interface?
indexTerms += len(terms) | ||
} | ||
|
||
for i := 0; i < NUM_ITERATIONS; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These loops look very similar. Add a comment that explains the difference in some way.
sort.Ints(expected) | ||
sort.Ints(actual) | ||
|
||
if !reflect.DeepEqual(expected, actual) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a case for cmp.Diff
t.Errorf("expected: %+v Vs actual: %+v", expected, actual) | ||
} | ||
|
||
queryTerms += len(terms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this variable ever read?
|
||
const ( | ||
POINT QueryType = iota + 1 | ||
CAP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to test Region types other than Cap?
deduplicate(&actual) | ||
|
||
sort.Ints(expected) | ||
sort.Ints(actual) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, you could ensure uniqueness by using map[int]struct{}
(or bool) rather than a slice.
Attempt to port a functional version of S2RegionTermIndexer