Skip to content

chore(typing): SparkLikeExpr properties #2152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Mar 7, 2025
Merged

Conversation

dangotbanned
Copy link
Member

@dangotbanned dangotbanned commented Mar 5, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Porting over (#2051), didn't realize this was declared twice until (#2132)

Porting over (#2051), didn't realize this was delclared twice until (#2132)
@dangotbanned dangotbanned marked this pull request as ready for review March 5, 2025 19:11
@dangotbanned
Copy link
Member Author

dangotbanned commented Mar 5, 2025

@MarcoGorelli if you merge this ahead of #2132 - you'll be able to get more useful IDE feedback from SparkLikeExpr.(_F|_Window) 😊

I checked out the PR and was confused why I didn't get hints from #2051

Comment on lines 126 to 127
def is_naive(obj: Any) -> bool:
return obj is not None and {"%s", "%z", "Z"}.isdisjoint(obj)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def is_naive(obj: Any) -> bool:
return obj is not None and {"%s", "%z", "Z"}.isdisjoint(obj)
def is_naive(format: str) -> bool:
return {"%s", "%z", "Z"}.isdisjoint(format)

just to signal that this checks the date time format (should we maybe call it is_naive_format?).
Also based on how it is used, format will always be a non-null str right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @EdAbati, yeah is_naive_format sounds good to me

I renamed format to obj to avoid a ruff warning for shadowing builtins.format

Also narrowing from Any to str sounds good as well.
I originally wrote it using a TypeIs but removed that since it gave a false-positive on unreachability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy for you to make those changes and merge πŸ™Œ

Copy link
Member Author

@dangotbanned dangotbanned Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think the current set needs to be changed to:

{"s", "z", "Z"}

I'm relying on format splitting into a sequence of characters - but two of them are 2-char sequences.

I wonder why that didn't fail a test?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why that didn't fail a test?

I actually haven't seen a test covering this πŸ‘€ double checking with the datetime wizard @MarcoGorelli , is it something that we are not testing on purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm reading this wrong, but I dont think we have coverage for any backend with a tz-aware format for .str.to_datetime(...)

https://github.com/narwhals-dev/narwhals/blob/3103b20afc1a523137f74003478c77a9debc2781/tests/expr_and_series/str/to_datetime_test.py

Copy link
Member Author

@dangotbanned dangotbanned Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli really need some help on writing a test for this πŸ™

Right now I've got something kinda working for a lot of backends.
But it fails to parse for sqlframe, pyarrow.

Also polars.Expr.str.to_datetime seems to have more options - that might be able to preserve the timezone offset?

test_to_datetime_tz_aware

@pytest.mark.parametrize(
    ("fmt", "data", "expected_pandas", "expected_polars_duckdb"),
    [
        (
            "%Y-%m-%d %H:%M:%S%z",
            {"a": ["2020-01-01 12:34:56+0200"]},
            "2020-01-01 12:34:56+02:00",
            "2020-01-01 10:34:56+00:00",
        )
    ],
)
def test_to_datetime_tz_aware(
    constructor: Constructor,
    fmt: str,
    data: dict[str, Sequence[str]],
    expected_pandas: str,
    expected_polars_duckdb: str,
) -> None:
    constructor_str = str(constructor)
    if "pandas" in constructor_str or "dask" in constructor_str:
        expected = expected_pandas
    elif "polars" in constructor_str or "duckdb" in constructor_str:
        expected = expected_polars_duckdb
    else:
        expected = "fails"
    result = (
        nw.from_native(constructor(data))
        .lazy()
        .select(b=nw.col("a").str.to_datetime(fmt))
        .collect()
        .item(row=0, column="b")
    )

    assert str(result) == expected

I can fully understand nope-ing out of handling timezones for all these different packages πŸ˜’

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright so (e19dcdf) -> (194ef83) now has a single test

sqlframe works, pyspark apprently doesn't https://github.com/narwhals-dev/narwhals/actions/runs/13700361438/job/38312364899?pr=2152

Absolutely don't want to do timezones again 😩

Copy link
Member Author

@dangotbanned dangotbanned Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FBruzzesi I see you added strptime_to_pyspark_format in #1826

AFAICT there were never any tests for tz-aware timestamps and right now I can't get one to work 😞

Tried my best, but as of (ce3145d) I'm giving up

@dangotbanned dangotbanned added the help wanted Extra attention is needed label Mar 6, 2025
@dangotbanned dangotbanned changed the title chore(typing): Add typing for SparkLikeExpr properties chore: SparkLikeExpr property typing, fix str.to_datetime Mar 6, 2025
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into this!

I think this PR is doing too much, please either address typing or fix str.to_datetime

the former should be fine, some comments for the latter:

  • can we keep format even if it overrides the builtin?
  • we should probably make sure that we consistently parse into UTC, so there should only be a single expected for test_to_datetime_tz_aware

I can take care of the timezone handling separately, but it does need splitting out

@dangotbanned dangotbanned changed the title chore: SparkLikeExpr property typing, fix str.to_datetime chore(typing): SparkLikeExpr properties Mar 7, 2025
@dangotbanned dangotbanned removed help wanted Extra attention is needed fix tests labels Mar 7, 2025
@dangotbanned
Copy link
Member Author

dangotbanned commented Mar 7, 2025

#2152 (review)

Thanks for the review @MarcoGorelli
I think you're right, I bit off a lil more than I could chew here.

Caught me at a good time with the feedback - hopefully I've addressed it all in:

I can take care of the timezone handling separately, but it does need splitting out

Phew - I'm in no hurry to deal w/ timezones again πŸ˜…

Edit

Just for some clarity - I touched that method to resolve these https://github.com/narwhals-dev/narwhals/actions/runs/13682912007/job/38259412675?pr=2152

Looking back now - I'm not sure that was super clear

Comment on lines 126 to 127
def is_naive_format(format: str) -> bool: # noqa: A002
return {"s", "z", "Z"}.isdisjoint(format)
Copy link
Member

@MarcoGorelli MarcoGorelli Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't really like how we're changing the logic here, 'z' in a format string means the 'z' literal, it's '%z' we need to check for

the rest looks good though, well done!

PR to address time zone aware parsing: #2166

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MarcoGorelli

I'm happy for that bit to be reverted and just keep the splitting out into the new function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure! if you're happy with #2166 to add a test and adjust some logic, happy to then take this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully got it done (311c7ea)

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @dangotbanned much appreciated!

@MarcoGorelli MarcoGorelli merged commit 734c641 into main Mar 7, 2025
29 checks passed
@MarcoGorelli MarcoGorelli deleted the spark-expr-typing branch March 7, 2025 23:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants