refactor: Generic CompliantSelector
#2064
Conversation
Less ambiguous, thinking `iter_columns` will be a better name to reserve for https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.iter_columns.html
Column names and selecting series is the core part of `selectors`
Will investigate later
Experimenting with `pandas` only to start
All tests are passing locally, hoping to do `_arrow` next and see if this holds up
Well that went smoothly
Possible via #2060
narwhals/_selectors.py (Outdated)
```python
def __init__(self: Self, context: _FullContext, /) -> None:
    self._implementation = context._implementation
    self._backend_version = context._backend_version
    self._version = context._version
```
This part was resolved in 5a36c81.
Seems 3.8 has a different issue: https://github.com/narwhals-dev/narwhals/actions/runs/13462903656/job/37622190616?pr=2064
Important
I'm ignoring 3.8 issues for the time being. Absolute worst case, I'll just use regular classes.
This could be the way to go - will be attempting something like python/typing#572 (comment).
Well it certainly is a solution (#2064 (comment))
😞
Now is always empty #2059
@MarcoGorelli I'm looking at the lazy backends atm. Are these
narwhals/narwhals/_dask/selectors.py (Lines 41 to 51 in f0a5fbb)
needed? I'm still getting to know these backends, so maybe this is cheap to do?
Lazy-support needs to be able to override them
- Working locally - `pyright` is very unhappy with `dx.Series` being used
- Haven't got a local install - Expecting to work the same as `_dask` in CI
```python
EvalSeries: TypeAlias = Callable[[FrameT], Sequence[SeriesT]]
EvalNames: TypeAlias = Callable[[FrameT], Sequence[str]]
```
I'd probably move these to nw.typing and add docs to them. There's lots of other places they could be used to simplify these two types:
narwhals/narwhals/_pandas_like/expr.py (Line 53 in 9387ecd)
```python
call: Callable[[PandasLikeDataFrame], Sequence[PandasLikeSeries]],
```
narwhals/narwhals/_pandas_like/expr.py (Line 57 in 9387ecd)
```python
evaluate_output_names: Callable[[PandasLikeDataFrame], Sequence[str]],
```
IMO this isn't as noisy:
narwhals/narwhals/_pandas_like/selectors.py (Lines 26 to 27 in ec5dd0c)
```python
call: EvalSeries[PandasLikeDataFrame, PandasLikeSeries],
evaluate_output_names: EvalNames[PandasLikeDataFrame],
```
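These aliases are easy to demo in isolation. Here's a hedged sketch (not the actual narwhals definitions - `ToyFrame` and the helper functions are invented for illustration) of how the two generic aliases collapse repeated `Callable` signatures:

```python
from __future__ import annotations

from typing import Callable, Sequence, TypeVar

FrameT = TypeVar("FrameT")
SeriesT = TypeVar("SeriesT")

# A function evaluating a frame into its (selected) series ...
EvalSeries = Callable[[FrameT], Sequence[SeriesT]]
# ... and one evaluating a frame into the matching output names.
EvalNames = Callable[[FrameT], Sequence[str]]


class ToyFrame:
    """Stand-in for a backend dataframe; columns are plain lists."""

    def __init__(self, columns: dict[str, list[int]]) -> None:
        self.columns = columns


def select_all(df: ToyFrame) -> Sequence[list[int]]:
    return list(df.columns.values())


def names_all(df: ToyFrame) -> Sequence[str]:
    return list(df.columns.keys())


# Parameterizing the aliases keeps signatures short at each use site.
call: EvalSeries[ToyFrame, list[int]] = select_all
evaluate_output_names: EvalNames[ToyFrame] = names_all

df = ToyFrame({"a": [1, 2], "b": [3, 4]})
print(list(evaluate_output_names(df)))  # ['a', 'b']
```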
love this, will do a final check later
```python
def __repr__(self: Self) -> str:  # pragma: no cover
    s = f"depth={self._depth}, " if is_tracks_depth(self._implementation) else ""
    return f"{type(self).__name__}({s}function_name={self._function_name})"
```
I was originally planning to use utils.is_tracks_depth more than just in __repr__:
utils.is_tracks_depth (Lines 1392 to 1394 in d03cc22)
```python
def is_tracks_depth(obj: Implementation, /) -> TypeIs[_TracksDepth]:  # pragma: no cover
    # Return `True` for implementations that utilize `CompliantExpr._depth`.
    return obj.is_pandas_like() or obj in {Implementation.PYARROW, Implementation.DASK}
```
The idea I had was to move most of the implementation of these methods up into the protocols. You'd only need to declare the types (e.g. ArrowSelector & ArrowExpr):

*SelectorNamespace._selector
narwhals/narwhals/_arrow/selectors.py (Lines 20 to 34 in d03cc22)
```python
def _selector(
    self,
    call: EvalSeries[ArrowDataFrame, ArrowSeries],
    evaluate_output_names: EvalNames[ArrowDataFrame],
    /,
) -> ArrowSelector:
    return ArrowSelector(
        call,
        depth=0,
        function_name="selector",
        evaluate_output_names=evaluate_output_names,
        alias_output_names=None,
        backend_version=self._backend_version,
        version=self._version,
    )
```
*Selector._to_expr
narwhals/narwhals/_arrow/selectors.py (Lines 43 to 52 in d03cc22)
```python
def _to_expr(self: Self) -> ArrowExpr:
    return ArrowExpr(
        self._call,
        depth=self._depth,
        function_name=self._function_name,
        evaluate_output_names=self._evaluate_output_names,
        alias_output_names=self._alias_output_names,
        backend_version=self._backend_version,
        version=self._version,
    )
```
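To make the "declare the types only" idea concrete, here's a minimal hedged sketch of a shared `_to_expr` default on a generic base class. All names here (`CompliantSelector`, `_expr_type`, the toy `Expr`) are invented for illustration, not taken from the codebase:

```python
from __future__ import annotations

from typing import Any, Callable, Generic, TypeVar

ExprT = TypeVar("ExprT", bound="Expr")


class Expr:
    """Toy stand-in for a compliant expression."""

    def __init__(self, call: Callable[..., Any], *, function_name: str) -> None:
        self._call = call
        self._function_name = function_name


class CompliantSelector(Generic[ExprT]):
    # Each backend declares its Expr type exactly once ...
    _expr_type: type[ExprT]

    def __init__(self, call: Callable[..., Any], *, function_name: str) -> None:
        self._call = call
        self._function_name = function_name

    def _to_expr(self) -> ExprT:
        # ... and the shared default forwards the stored attributes,
        # so no per-backend copy/paste of the argument mapping.
        return self._expr_type(self._call, function_name=self._function_name)


class ArrowExpr(Expr): ...


class ArrowSelector(CompliantSelector[ArrowExpr]):
    _expr_type = ArrowExpr


sel = ArrowSelector(lambda df: df, function_name="selector")
print(type(sel._to_expr()).__name__)  # ArrowExpr
```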
But I stopped short of that - because it should probably live in *Expr.
You'd have an even greater reduction in repetition, as it would account for all the arg mapping in:
narwhals/narwhals/_arrow/expr.py (Lines 34 to 151 in 140833c)
```python
class ArrowExpr(CompliantExpr["ArrowDataFrame", ArrowSeries]):
    _implementation: Implementation = Implementation.PYARROW

    def __init__(
        self: Self,
        call: Callable[[ArrowDataFrame], Sequence[ArrowSeries]],
        *,
        depth: int,
        function_name: str,
        evaluate_output_names: Callable[[ArrowDataFrame], Sequence[str]],
        alias_output_names: Callable[[Sequence[str]], Sequence[str]] | None,
        backend_version: tuple[int, ...],
        version: Version,
        call_kwargs: dict[str, Any] | None = None,
    ) -> None:
        self._call = call
        self._depth = depth
        self._function_name = function_name
        self._evaluate_output_names = evaluate_output_names
        self._alias_output_names = alias_output_names
        self._backend_version = backend_version
        self._version = version
        self._call_kwargs = call_kwargs or {}

    def __repr__(self: Self) -> str:  # pragma: no cover
        return f"ArrowExpr(depth={self._depth}, function_name={self._function_name}, "

    def __call__(self: Self, df: ArrowDataFrame) -> Sequence[ArrowSeries]:
        return self._call(df)

    def broadcast(self, kind: Literal[ExprKind.AGGREGATION, ExprKind.LITERAL]) -> Self:
        # Mark the resulting ArrowSeries with `_broadcast = True`.
        # Then, when extracting native objects, `extract_native` will
        # know what to do.
        def func(df: ArrowDataFrame) -> list[ArrowSeries]:
            results = []
            for result in self(df):
                result._broadcast = True
                results.append(result)
            return results

        return self.__class__(
            func,
            depth=self._depth,
            function_name=self._function_name,
            evaluate_output_names=self._evaluate_output_names,
            alias_output_names=self._alias_output_names,
            backend_version=self._backend_version,
            version=self._version,
            call_kwargs=self._call_kwargs,
        )

    @classmethod
    def from_column_names(
        cls: type[Self],
        *column_names: str,
        backend_version: tuple[int, ...],
        version: Version,
    ) -> Self:
        from narwhals._arrow.series import ArrowSeries

        def func(df: ArrowDataFrame) -> list[ArrowSeries]:
            try:
                return [
                    ArrowSeries(
                        df._native_frame[column_name],
                        name=column_name,
                        backend_version=df._backend_version,
                        version=df._version,
                    )
                    for column_name in column_names
                ]
            except KeyError as e:
                missing_columns = [x for x in column_names if x not in df.columns]
                raise ColumnNotFoundError.from_missing_and_available_column_names(
                    missing_columns=missing_columns, available_columns=df.columns
                ) from e

        return cls(
            func,
            depth=0,
            function_name="col",
            evaluate_output_names=lambda _df: column_names,
            alias_output_names=None,
            backend_version=backend_version,
            version=version,
        )

    @classmethod
    def from_column_indices(
        cls: type[Self],
        *column_indices: int,
        backend_version: tuple[int, ...],
        version: Version,
    ) -> Self:
        from narwhals._arrow.series import ArrowSeries

        def func(df: ArrowDataFrame) -> list[ArrowSeries]:
            return [
                ArrowSeries(
                    df._native_frame[column_index],
                    name=df._native_frame.column_names[column_index],
                    backend_version=df._backend_version,
                    version=df._version,
                )
                for column_index in column_indices
            ]

        return cls(
            func,
            depth=0,
            function_name="nth",
            evaluate_output_names=lambda df: [df.columns[i] for i in column_indices],
            alias_output_names=None,
            backend_version=backend_version,
            version=version,
        )
```
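One hedged way to factor out that repeated constructor arg mapping: a single rebuild helper (called `_with_call` below - the name and signatures are illustrative, not the narwhals API) that maps stored attributes back into the constructor in exactly one place, so methods like `broadcast` only supply the new callable:

```python
from __future__ import annotations

from typing import Any, Callable


class BaseExpr:
    """Toy expression carrying the attributes the constructor needs back."""

    def __init__(
        self,
        call: Callable[[Any], Any],
        *,
        depth: int,
        function_name: str,
    ) -> None:
        self._call = call
        self._depth = depth
        self._function_name = function_name

    def _with_call(self, call: Callable[[Any], Any]) -> "BaseExpr":
        # The single place that maps stored attributes into the constructor.
        return type(self)(
            call,
            depth=self._depth,
            function_name=self._function_name,
        )

    def broadcast(self) -> "BaseExpr":
        def func(df: Any) -> Any:
            return self._call(df)

        # No repeated kwargs at the call site.
        return self._with_call(func)


e = BaseExpr(lambda df: df, depth=0, function_name="col")
e2 = e.broadcast()
print(e2._function_name, e2._depth)  # col 0
```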
This PR revealed some patterns that I think could be better encoded into the Protocol definitions:
Single or multi-Implementation
- Single defines on the class
- Multi defines on the instance, and passes it around to keep track
- Currently the *Like* classes - although I think that should be revisited

Tracks depth
- As mentioned, we need that for the dask special-case
- All other lazy backends don't track depth

(Lazy|Eager)*
- I've made a start with that for *Selector
- Planning the same for *Expr

Reuse*
- I mentioned in this PR that I was thinking about renaming to EagerOnly*
- I think it would be worth having the distinction between those cases and Polars*
@MarcoGorelli feel free to move this into another issue - just trying to connect the dots between a lot of the stuff I've been working on 🙂
thanks - i'm not really sure about having _depth long-term, i think we'll need a better way to do this in the future, so i'd suggest not spending too long on that specifically
thanks @dangotbanned - just a comment about moving a comment from just being in the PR to putting it in the code (a practice I'd encourage in general), then feel free to merge
- Required some compatibility aliasing for `pandas`, `pyarrow` - They're faux-lazy #2064 (comment)
Thanks for the review @MarcoGorelli
- Long-term, should probably be defined in `nw.typing` - Or just generally used in parts of each backend impl #2064 (comment)
Tracking
- 3.8 #2084
- pyspark #2051
- CompliantDataFrame #2115
- CompliantSelector #2064 (comment)
- CompliantSelector #2064 (comment)
- CompliantSelector #2064 (comment)
- CompliantSelector #2064 (comment)
- CompliantSelector #2064 (comment)
- DataFrame.iter_columns #2101

What type of PR is this? (check all applicable)
Related issues
- CompliantExpr._version #2060
- kwargs from selector function #2058
- *.selectors #2057
- TypeVar used in (SparkLike|DuckDB)Expr #2044

Checklist
If you have comments or can explain your changes, please do so below
Expecting quite a nice reduction in copy/pasted code
_arrow
_dask
_duckdb
_pandas_like
_spark_like
Investigate/fix 3.8 protocol bug