refactor: Generic CompliantSelector #2064


Merged

dangotbanned merged 69 commits into main from more-dedup-1 on Mar 2, 2025

Conversation

@dangotbanned (Member) commented on Feb 21, 2025

Tracking

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Expecting quite a nice reduction in copy/pasted code

  • _arrow
  • _dask
  • _duckdb
  • _pandas_like
  • _spark_like
  • Investigate / fix 3.8 protocol bug

Comment on lines 156 to 159
def __init__(self: Self, context: _FullContext, /) -> None:
    self._implementation = context._implementation
    self._backend_version = context._backend_version
    self._version = context._version
@dangotbanned (Member Author) commented on Feb 21, 2025:

FIXME

Seems like Protocol(s) with a constructor are not triggered for <3.11?
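For context, a minimal repro of the pattern in question (the names here are illustrative, not the PR's classes), assuming this is CPython's bpo-44806, where a concrete subclass doesn't inherit a `Protocol`'s `__init__` at runtime (fixed in 3.11 and backported to later 3.9/3.10 patch releases, but never to 3.8):

from typing import Protocol


class Context:
    _backend_version = (1, 0)


class FromContext(Protocol):
    # Constructor defined on the protocol, intended to be inherited.
    def __init__(self, context: Context, /) -> None:
        self._backend_version = context._backend_version


class Concrete(FromContext):
    """Non-protocol subclass relying on the inherited `__init__`."""


obj = Concrete(Context())
# Works on 3.11+ (and patched 3.9/3.10); on affected versions the
# protocol machinery replaces `__init__`, so the body above never runs
# and `_backend_version` is never set.
assert obj._backend_version == (1, 0)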


Important

I'm ignoring 3.8 issues for the time being.
Absolute worst-case - I'll just use regular classes.


Looking for a solution w/


This could be the way to go:

Will be attempting something like (python/typing#572 (comment))


Well it certainly is a solution (#2064 (comment))

😞

@dangotbanned (Member Author) commented:

@MarcoGorelli I'm looking at the lazy backends atm

Are selectors implemented for them in a lazy way?
I would've expected they'd be handled quite a bit differently to avoid materialization

def by_dtype(self: Self, dtypes: Iterable[DType | type[DType]]) -> DaskSelector:
    def func(df: DaskLazyFrame) -> list[dx.Series]:
        return [
            df._native_frame[col] for col in df.columns if df.schema[col] in dtypes
        ]

    def evaluate_output_names(df: DaskLazyFrame) -> Sequence[str]:
        return [col for col in df.columns if df.schema[col] in dtypes]

    return selector(self, func, evaluate_output_names)

I'm still getting to know these backends, so maybe this is cheap to do?
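For Dask at least, `df.schema` is derived from frame metadata (`dtypes`), so the comprehensions above shouldn't force any compute. A usage sketch, assuming the public `narwhals.selectors` API with a Dask-backed frame:

import dask.dataframe as dd
import pandas as pd

import narwhals as nw
import narwhals.selectors as ncs

# Build a lazy Dask frame with mixed dtypes.
ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}), npartitions=1)

# Selecting by dtype through narwhals stays lazy: only the schema is
# consulted, and nothing is materialized until `.compute()`.
lf = nw.from_native(ddf)
print(lf.select(ncs.by_dtype(nw.Int64)).to_native().compute())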

@dangotbanned dangotbanned marked this pull request as ready for review March 1, 2025 20:23
Comment on lines +59 to +60
EvalSeries: TypeAlias = Callable[[FrameT], Sequence[SeriesT]]
EvalNames: TypeAlias = Callable[[FrameT], Sequence[str]]
@dangotbanned (Member Author) commented on Mar 1, 2025:

I'd probably move these to nw.typing and add docs to them.

There are lots of other places they could be used; e.g. they'd simplify these two annotations:

call: Callable[[PandasLikeDataFrame], Sequence[PandasLikeSeries]],

evaluate_output_names: Callable[[PandasLikeDataFrame], Sequence[str]],

IMO this isn't as noisy:

call: EvalSeries[PandasLikeDataFrame, PandasLikeSeries],
evaluate_output_names: EvalNames[PandasLikeDataFrame],
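For reference, a self-contained sketch of the declarations (the unbound `TypeVar`s are an assumption; presumably they'd be bounded to the compliant protocols):

from typing import Callable, Sequence, TypeVar

from typing_extensions import TypeAlias

FrameT = TypeVar("FrameT")
SeriesT = TypeVar("SeriesT")

# Evaluate a compliant frame into its output series...
EvalSeries: TypeAlias = Callable[[FrameT], Sequence[SeriesT]]
# ...or into just the output column names.
EvalNames: TypeAlias = Callable[[FrameT], Sequence[str]]

# Parameterized at use sites, e.g.:
#   call: EvalSeries[PandasLikeDataFrame, PandasLikeSeries]
#   evaluate_output_names: EvalNames[PandasLikeDataFrame]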

@MarcoGorelli (Member) commented:

love this, will do a final check later

Comment on lines +278 to +280
def __repr__(self: Self) -> str:  # pragma: no cover
    s = f"depth={self._depth}, " if is_tracks_depth(self._implementation) else ""
    return f"{type(self).__name__}({s}function_name={self._function_name})"
@dangotbanned (Member Author) commented on Mar 2, 2025:

I was originally planning to use `utils.is_tracks_depth` in more places than just `__repr__`:

narwhals/narwhals/utils.py, lines 1392 to 1394 in d03cc22

def is_tracks_depth(obj: Implementation, /) -> TypeIs[_TracksDepth]:  # pragma: no cover
    # Return `True` for implementations that utilize `CompliantExpr._depth`.
    return obj.is_pandas_like() or obj in {Implementation.PYARROW, Implementation.DASK}

The idea I had was to move most of the implementation of these methods up into the protocols.

You'd only need to declare the types (e.g. ArrowSelector & ArrowExpr):

*SelectorNamespace._selector

def _selector(
    self,
    call: EvalSeries[ArrowDataFrame, ArrowSeries],
    evaluate_output_names: EvalNames[ArrowDataFrame],
    /,
) -> ArrowSelector:
    return ArrowSelector(
        call,
        depth=0,
        function_name="selector",
        evaluate_output_names=evaluate_output_names,
        alias_output_names=None,
        backend_version=self._backend_version,
        version=self._version,
    )

*Selector._to_expr

def _to_expr(self: Self) -> ArrowExpr:
    return ArrowExpr(
        self._call,
        depth=self._depth,
        function_name=self._function_name,
        evaluate_output_names=self._evaluate_output_names,
        alias_output_names=self._alias_output_names,
        backend_version=self._backend_version,
        version=self._version,
    )

But I stopped short of that - because it should probably live in *Expr.

You'd have an even greater reduction in repetition, as it would account for all the arg mapping in:

class ArrowExpr(CompliantExpr["ArrowDataFrame", ArrowSeries]):
    _implementation: Implementation = Implementation.PYARROW

    def __init__(
        self: Self,
        call: Callable[[ArrowDataFrame], Sequence[ArrowSeries]],
        *,
        depth: int,
        function_name: str,
        evaluate_output_names: Callable[[ArrowDataFrame], Sequence[str]],
        alias_output_names: Callable[[Sequence[str]], Sequence[str]] | None,
        backend_version: tuple[int, ...],
        version: Version,
        call_kwargs: dict[str, Any] | None = None,
    ) -> None:
        self._call = call
        self._depth = depth
        self._function_name = function_name
        self._evaluate_output_names = evaluate_output_names
        self._alias_output_names = alias_output_names
        self._backend_version = backend_version
        self._version = version
        self._call_kwargs = call_kwargs or {}

    def __repr__(self: Self) -> str:  # pragma: no cover
        return f"ArrowExpr(depth={self._depth}, function_name={self._function_name})"

    def __call__(self: Self, df: ArrowDataFrame) -> Sequence[ArrowSeries]:
        return self._call(df)

    def broadcast(self, kind: Literal[ExprKind.AGGREGATION, ExprKind.LITERAL]) -> Self:
        # Mark the resulting ArrowSeries with `_broadcast = True`.
        # Then, when extracting native objects, `extract_native` will
        # know what to do.
        def func(df: ArrowDataFrame) -> list[ArrowSeries]:
            results = []
            for result in self(df):
                result._broadcast = True
                results.append(result)
            return results

        return self.__class__(
            func,
            depth=self._depth,
            function_name=self._function_name,
            evaluate_output_names=self._evaluate_output_names,
            alias_output_names=self._alias_output_names,
            backend_version=self._backend_version,
            version=self._version,
            call_kwargs=self._call_kwargs,
        )

    @classmethod
    def from_column_names(
        cls: type[Self],
        *column_names: str,
        backend_version: tuple[int, ...],
        version: Version,
    ) -> Self:
        from narwhals._arrow.series import ArrowSeries

        def func(df: ArrowDataFrame) -> list[ArrowSeries]:
            try:
                return [
                    ArrowSeries(
                        df._native_frame[column_name],
                        name=column_name,
                        backend_version=df._backend_version,
                        version=df._version,
                    )
                    for column_name in column_names
                ]
            except KeyError as e:
                missing_columns = [x for x in column_names if x not in df.columns]
                raise ColumnNotFoundError.from_missing_and_available_column_names(
                    missing_columns=missing_columns, available_columns=df.columns
                ) from e

        return cls(
            func,
            depth=0,
            function_name="col",
            evaluate_output_names=lambda _df: column_names,
            alias_output_names=None,
            backend_version=backend_version,
            version=version,
        )

    @classmethod
    def from_column_indices(
        cls: type[Self],
        *column_indices: int,
        backend_version: tuple[int, ...],
        version: Version,
    ) -> Self:
        from narwhals._arrow.series import ArrowSeries

        def func(df: ArrowDataFrame) -> list[ArrowSeries]:
            return [
                ArrowSeries(
                    df._native_frame[column_index],
                    name=df._native_frame.column_names[column_index],
                    backend_version=df._backend_version,
                    version=df._version,
                )
                for column_index in column_indices
            ]

        return cls(
            func,
            depth=0,
            function_name="nth",
            evaluate_output_names=lambda df: [df.columns[i] for i in column_indices],
            alias_output_names=None,
            backend_version=backend_version,
            version=version,
        )
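For illustration, a minimal sketch of the inheritance mechanics this idea relies on (`CompliantReprProto` and `ArrowLikeExpr` are invented names, not the PR's classes): a non-abstract method body defined on a `Protocol` is inherited by concrete subclasses, so each backend would only pin its types and attributes:

from typing import Protocol


class CompliantReprProto(Protocol):
    _depth: int
    _function_name: str

    # Non-abstract protocol method: concrete subclasses inherit the
    # body, so only attributes/types need declaring per backend.
    def __repr__(self) -> str:  # pragma: no cover
        return (
            f"{type(self).__name__}("
            f"depth={self._depth}, function_name={self._function_name})"
        )


class ArrowLikeExpr(CompliantReprProto):
    def __init__(self, depth: int, function_name: str) -> None:
        self._depth = depth
        self._function_name = function_name


print(repr(ArrowLikeExpr(0, "selector")))
# -> ArrowLikeExpr(depth=0, function_name=selector)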

@dangotbanned (Member Author) commented on Mar 2, 2025:

This PR revealed some patterns that I think could be better encoded into the Protocol definitions (see the sketch after this list):

  • Single or multi-`Implementation`
  • Tracks depth
    • As mentioned, we need that for the Dask special case
    • None of the other lazy backends track depth
  • `(Lazy|Eager)*`
  • `Reuse*`
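A hedged sketch of one way those variants could be encoded (all names below are invented for illustration, not the PR's API): small orthogonal protocols that backends opt into, which would also give `is_tracks_depth`-style narrowing structurally:

from typing import Protocol, runtime_checkable


@runtime_checkable
class TracksDepth(Protocol):
    # pandas-like, PyArrow, and (as a special case) Dask.
    _depth: int


class LazyNoDepth(Protocol):
    # The other lazy backends: no `_depth` to maintain.
    _function_name: str


def describe(expr: object) -> str:
    # `isinstance` on a runtime-checkable protocol narrows structurally.
    if isinstance(expr, TracksDepth):
        return f"depth={expr._depth}"
    return "depth not tracked"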

@dangotbanned (Member Author) commented:

@MarcoGorelli feel free to move this into another issue - just trying to connect the dots between a lot of the stuff I've been working on 🙂

@MarcoGorelli (Member) commented:

thanks - I'm not really sure about having `_depth` long-term; I think we'll need a better way to do this in the future, so I'd suggest not spending too long on that specifically

@MarcoGorelli (Member) left a review:

thanks @dangotbanned - just a comment about moving a comment from just being in the PR to putting it in the code (a practice I'd encourage in general), then feel free to merge

@dangotbanned (Member Author) commented:

> thanks @dangotbanned - just a comment about moving a comment from just being in the PR to putting it in the code (a practice I'd encourage in general), then feel free to merge

Thanks for the review @MarcoGorelli

@dangotbanned dangotbanned merged commit 22ce463 into main Mar 2, 2025
28 checks passed
@dangotbanned dangotbanned deleted the more-dedup-1 branch March 2, 2025 18:42
dangotbanned added a commit that referenced this pull request Mar 5, 2025
- Long-term, should probably be defined in `nw.typing`
- Or just generally used in parts of each backend impl

#2064 (comment)