leave_k_out_split produces incorrect values in test set

I noticed odd behavior in [leave_k_out_split](https://github.com/benfred/implicit/blob/871e0c7229b012108131b6211cd617e23a3b24bf/implicit/evaluation.pyx#L143) - specifically, under certain circumstances (many rows with one value?), the number returned for the withheld test set value is different from the actual value. This causes `train + test` to fail to reconstruct the original matrix.

Script to reproduce:

```python
import implicit

implicit.__version__
#> '0.6.2'

from implicit.evaluation import leave_k_out_split
from scipy import sparse

ratings = sparse.csr_matrix(
    [[3, 2, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 1, 1]]
)

train, test = leave_k_out_split(ratings, K=1, random_state=42)

diff = (train + test) - ratings

diff.nnz
#> 1

train.todense()
#> matrix([[3, 2, 0, 0],
#>         [1, 0, 0, 0],
#>         [0, 1, 0, 0],
#>         [0, 0, 1, 0],
#>         [0, 1, 0, 1]])

test.todense()
#> matrix([[0, 0, 4, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 1, 0]])

diff.todense()
#> matrix([[0, 0, 3, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0]])
```

<sup>Created at 2022-12-23 13:59:28 CST by [reprexlite](https://github.com/jayqi/reprexlite) v0.5.0</sup>

I believe the issue is in `_take_tails` - the returned `test_idx` array has multiple copies of the first user index returned, so we end up with a test set value that is copied multiple times.

(As an aside, the [call](https://github.com/benfred/implicit/blob/871e0c7229b012108131b6211cd617e23a3b24bf/implicit/evaluation.pyx#L208) to `_take_tails` when `shuffled=True` does not pass on the `rng`, so the random state cannot be maintained.)



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

leave_k_out_split produces incorrect values in test set #639

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

leave_k_out_split produces incorrect values in test set #639

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions