Skip to content

chore(typing): enable typing checks for pyspark #2051

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Feb 27, 2025

Conversation

EdAbati
Copy link
Collaborator

@EdAbati EdAbati commented Feb 20, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@dangotbanned
Copy link
Member

@EdAbati do we want this merged before or after #2038?

@EdAbati EdAbati marked this pull request as ready for review February 24, 2025 09:31
@dangotbanned
Copy link
Member

I'd really like this PR to be resolving warnings vs just adding ignore comments.
I do understand we don't have stubs to work with, so @EdAbati let me know if this is an unreasonable ask

@EdAbati
Copy link
Collaborator Author

EdAbati commented Feb 24, 2025

Thanks for the comments, I'll move the unpivot fix out so we can concentrate on discussing typing here. I'll reply later

@EdAbati EdAbati changed the title fix: pyspark typing and related unpivot bug fix: enable typing check for pyspark Feb 25, 2025
@dangotbanned dangotbanned changed the title fix: enable typing checks for pyspark chore(typing): enable typing checks for pyspark Feb 25, 2025
@dangotbanned
Copy link
Member

dangotbanned commented Feb 25, 2025

https://github.com/narwhals-dev/narwhals/actions/runs/13527240491/job/37800780977?pr=2051

This one is gonna take a little extra work to fix

image

Update

(11bde2e) is a pretty fast-and-loose solution to the problem.

https://github.com/EdAbati/narwhals/blob/11bde2e95c93d10f4cc04a389a259da2a05d1f6b/narwhals/_spark_like/dataframe.py#L43

We should think about moving the Implementation-dispatch into two subclasses:

  • Keep all the shared behavior in SparkLikeLazyFrame
  • Make that generic over TypeVar("T", DataFrame, SQLFrameDataFrame)
  • Move _implementation into a ClassVar on each subclass
  • Then those subclasses just have the unique parts defined
    • native_dataframe, _F, _native_dtypes, _Window, etc

Important

Food for thought - not proposing these changes become part of this PR

@dangotbanned dangotbanned marked this pull request as draft February 25, 2025 17:38
@dangotbanned dangotbanned marked this pull request as ready for review February 25, 2025 18:10
Copy link
Member

@dangotbanned dangotbanned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this PR @EdAbati!

Had a lot of fun spiraling through the issues that came up here.
I'll leave this open to let someone else merge - since I've made a lot of changes that I'd also want reviewed πŸ˜„

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit I still get easily lost in the typing shenanigans - can't spot anything that does not sound reasonable, but it's not a review that you should trust either πŸ™ˆ


from narwhals._spark_like.expr import SparkLikeExpr
from narwhals._spark_like.group_by import SparkLikeLazyGroupBy
from narwhals._spark_like.namespace import SparkLikeNamespace
from narwhals.dtypes import DType
from narwhals.utils import Version

SQLFrameDataFrame: TypeAlias = _SQLFrameDataFrame[Any, Any, Any, Any, Any]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn that's a hell of a generic πŸ˜‚

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn that's a hell of a generic πŸ˜‚

Ikr!
All the Unknown(s) in this gave me quite the scare
https://github.com/narwhals-dev/narwhals/actions/runs/13527240491/job/37800780977?pr=2051

Comment on lines +86 to +89
if TYPE_CHECKING:
from pyspark.sql import types

return types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any drawback to this? I have barely seen these conditional in non-import blocks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@dangotbanned dangotbanned Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🀯🀯

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any drawback to this? I have barely seen these conditional in non-import blocks

I'm not sure I'd say drawback - but I would recommend only trying this as a last resort.
Here - I'm using it to describe something the type system doesn't support yet.

Another use was #2064 (comment) - for a backwards compat bug fix.

I'd much rather do things in a simple way whenever possible

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a similar thing here to convert these functions into typeguards

https://github.com/zen-xu/pyarrow-stubs/blob/c482d8f146a8cf2cff55cc4060148fe4883f9309/pyarrow-stubs/types.pyi

if TYPE_CHECKING:
from typing import TypeVar
from typing_extensions import TypeAlias
from typing_extensions import TypeIs
from narwhals._arrow.series import ArrowSeries
from narwhals._arrow.typing import ArrowArray
from narwhals._arrow.typing import ArrowChunkedArray
from narwhals._arrow.typing import Incomplete
from narwhals._arrow.typing import StringArray
from narwhals.dtypes import DType
from narwhals.typing import _AnyDArray
from narwhals.utils import Version
# NOTE: stubs don't allow for `ChunkedArray[StructArray]`
# Intended to represent the `.chunks` property storing `list[pa.StructArray]`
ChunkedArrayStructArray: TypeAlias = ArrowChunkedArray
_T = TypeVar("_T")
def is_timestamp(t: Any) -> TypeIs[pa.TimestampType[Any, Any]]: ...
def is_duration(t: Any) -> TypeIs[pa.DurationType[Any]]: ...
def is_list(t: Any) -> TypeIs[pa.ListType[Any]]: ...
def is_large_list(t: Any) -> TypeIs[pa.LargeListType[Any]]: ...
def is_fixed_size_list(t: Any) -> TypeIs[pa.FixedSizeListType[Any, Any]]: ...
def is_dictionary(
t: Any,
) -> TypeIs[pa.DictionaryType[Any, Any, Any]]: ...
def extract_regex(
strings: ArrowChunkedArray,
/,
pattern: str,
*,
options: Any = None,
memory_pool: Any = None,
) -> ChunkedArrayStructArray: ...
else:
from pyarrow.compute import extract_regex
from pyarrow.types import is_dictionary # noqa: F401
from pyarrow.types import is_duration
from pyarrow.types import is_fixed_size_list
from pyarrow.types import is_large_list
from pyarrow.types import is_list
from pyarrow.types import is_timestamp

Copy link
Member

@dangotbanned dangotbanned Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention, but (#2051 (comment)) would be the direction to go in - to avoid needing TYPE_CHECKING blocks.

Keeping everything in one class means you need to either do this - or introduce Union(s) and isinstance checks to resolve them on every usage - to type this correctly.

Currently, we branch on Implementation but we can't represent the relationship between Implementation and these unions:

  • pyspark.sql.types or sqlframe.base.types
  • pyspark.sql.functions or sqlframe.base.functions
  • pyspark.sql.Window or sqlframe.base.window.Window
  • pyspark.sql.dataframe.DataFrame.sparkSession or sqlframe.base.dataframe.BaseDataFrame.session

They're a mix of modules, classes and object attributes

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from a quick-ish scan this looks good, but i didn't check the details, though if you're all happy with it feel free to ship it 🚒

@dangotbanned
Copy link
Member

dangotbanned commented Feb 27, 2025

from a quick-ish scan this looks good, but i didn't check the details, though if you're all happy with it feel free to ship it 🚒

68747470733a2f2f692e666c756666792e63632f34357a51333477304b775758736768304b4d346735715a34766a526a307877322e676966

Ah perfect thanks @MarcoGorelli yeah I'm happy with it if you are β™₯️

Edit: ffs every time with the double gif 🀦

@dangotbanned dangotbanned merged commit 5383756 into narwhals-dev:main Feb 27, 2025
27 of 28 checks passed
dangotbanned added a commit that referenced this pull request Mar 5, 2025
Porting over (#2051), didn't realize this was delclared twice until (#2132)
MarcoGorelli pushed a commit that referenced this pull request Mar 7, 2025
* chore(typing): Add typing for `SparkLikeExpr` properties

Porting over (#2051), didn't realize this was delclared twice until (#2132)

* chore: fix typing and simplify `SparkLikeExprStringNamespace.to_datetime`

Resolves (https://github.com/narwhals-dev/narwhals/actions/runs/13682912007/job/38259412675?pr=2152)

* rename function

* use single chars in set

* fix: Remove timezone offset replacement

* test: Adds `test_to_datetime_tz_aware`

Resolves #2152 (comment)

* test: possibly fix `pyarrow` in ci?

Maybe this was just a TZDATA issue locally?
https://github.com/narwhals-dev/narwhals/actions/runs/13699734154/job/38310256617?pr=2152

* test: xfail polars `3.8`, fix false positive pyarrow

https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310487932?pr=2152
https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310488783?pr=2152

* test: narrower xfail, tz-less expected?

Not even sure what `pyarrow` is doing here https://github.com/narwhals-dev/narwhals/actions/runs/13700021595/job/38311197947?pr=2152

* test: account for `pyarrow` version changes

https://github.com/narwhals-dev/narwhals/actions/runs/13700267075/job/38312036397?pr=215

* test: maybe fix `pyspark`

https://github.com/narwhals-dev/narwhals/actions/runs/13700361438/job/38312364899?pr=2152

* revert: go back to typing fixes only

Addresses #2152 (review)

* chore: ignore `format` shadowing

#2152 (review)

* keep logic the same I hope

#2152 (comment)

---------

Co-authored-by: Edoardo Abati <[email protected]>
dangotbanned added a commit to MarcoGorelli/narwhals that referenced this pull request Mar 12, 2025
- Added in narwhals-dev#2187, narwhals-dev#2051
- No longer needed since `sqlframe` only typing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants