-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Use new parseHiveDate for OpenX reader to remove any characters after yyyy-mm-dd #25792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we have to options here:
- Preserve compatibility with the (original) OpenX JSON SerDe, which means preserving the semantics of doing
new SimpleDateFormat("yyyy-MM-dd").parse(value)
(see: JavaStringDateObjectInspector.parse) - Match Hive's parsing semantics at one variation of the following:
2a. Later versions of Hive 3 which support parsing timestamps separating the date and time with either a space or ISO 8601 style timestamps separating the two with 'T' (see: Hive 3 Date.valueOf)
2b. Hive 4 plus style parsing which ignores anything after the date portion (see: Hive 4 Date.valueOf)
If we're preserving the original OpenX serde semantics, that means ignoring any contents of the string after the date portion (similar to option 2b, matching Hive 4+)
I would support matching the permissive OpenX SimpleDateFormat equivalent behavior, but only for the OpenX SerDe (as is the approach here, avoiding using the shared HiveFormatUtils.parseHiveDate
method).
cc: @electrum, @dain - any thoughts on what the compatibility goal should be here and whether special casing this logic in OpenX is the right approach to take?
...o-hive-formats/src/main/java/io/trino/hive/formats/line/openxjson/OpenXJsonDeserializer.java
Outdated
Show resolved
Hide resolved
5cd3410
to
e810185
Compare
e810185
to
46ed36e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes mostly LGTM, with some minor changes recommended.
...o-hive-formats/src/main/java/io/trino/hive/formats/line/openxjson/OpenXJsonDeserializer.java
Outdated
Show resolved
Hide resolved
...o-hive-formats/src/main/java/io/trino/hive/formats/line/openxjson/OpenXJsonDeserializer.java
Outdated
Show resolved
Hide resolved
...o-hive-formats/src/main/java/io/trino/hive/formats/line/openxjson/OpenXJsonDeserializer.java
Outdated
Show resolved
Hide resolved
46ed36e
to
df34882
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change seems reasonable in spirit, but I'm concerned about most of the tests being Trino only. Trino only tests should only be used when there is an expected difference between Hive and Trino, and as I understand it, the point of this PR is to make the code more compatible with Hive.
assertValueTrinoOnly(DATE, "\"1986-01-01T00:00:00.000Z\"", sqlDate); | ||
assertValueTrinoOnly(DATE, "\"1986-01-01AA00:00:00.000Z\"", sqlDate); | ||
|
||
// Test bad yyyy-mm-dd formats | ||
// Cascading when number of digits are allowed | ||
assertValueTrinoOnly(DATE, "\"1986-01-33\"", null); | ||
assertValueTrinoOnly(DATE, "\"2023-80-80\"", null); | ||
assertValueTrinoOnly(DATE, "\"9999-99-99\"", null); | ||
|
||
// Digits are more than allowed in year, month, or day | ||
assertValueTrinoOnly(DATE, "\"02025-01-01\"", null); | ||
assertValueTrinoOnly(DATE, "\"2025-011-01\"", null); | ||
assertValueTrinoOnly(DATE, "\"2025-01-011\"", null); | ||
|
||
// Digits are more than allowed in year, month, or day and cascading | ||
assertValueTrinoOnly(DATE, "\"202510-01-01\"", null); | ||
assertValueTrinoOnly(DATE, "\"2025-100-01\"", null); | ||
assertValueTrinoOnly(DATE, "\"2025-01-100\"", null); | ||
assertValueTrinoOnly(DATE, "\"5881580-07-11\"", null); | ||
assertValueTrinoOnly(DATE, "\"-5877641-06-23\"", null); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are all of these "Trino Only" tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is that the "Hive" implementation uses the starburst OpenX JsonSerDe which uses Hive's date parsing, whereas the original OpenX SerDe actually uses SimpleDateFormat
which implies different semantics in the edge cases being addressed here. Correct me if I'm wrong on that @ljw9111
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that makes sense. I remember now, the Hive version of OpenX had a weird license and I'm not sure it even ran correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertValueTrinoOnly
also tests on Hive in here that it either produces value other than specified expected value (in this case null
) or fails. I used assertValueTrinoOnly
as we only care that it produces null
on Trino.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, thanks @ljw9111!
d983a16
to
ffe7f62
Compare
ffe7f62
to
afb54c2
Compare
Description
The parseHiveDate method in HiveFormatUtils.java that the native OpenX reader was using only supported
a space delimiter to remove any characters after 'yyyy-mm-dd'. As a result, while '2025-01-04 00:00:00.000Z'
was correctly parsed as '2025-01-04', strings like '2025-01-04T00:00:00.000Z' or '2025-01-04AA00:00:00.000Z'
were throwing exceptions and being parsed as null.
This new parseHiveDate method removes any characters after 'yyyy-mm-dd', regardless of the delimiter using regex.
This change adopts Hive 4's fix in https://github.com/apache/hive/blob/84b42876441ec8984f1a39367ae77627cc96a275/common/src/java/org/apache/hadoop/hive/common/type/Date.java#L179-L200
https://github.com/apache/hive/blob/84b42876441ec8984f1a39367ae77627cc96a275/common/src/java/org/apache/hadoop/hive/common/type/Date.java#L86-L89
The only difference is that it uses ResolverStyle.LENIENT while Hive 4's fix uses ResolverStyle.STRICT
Additionally, we introduce strict parsing for the date format yyyy-mm-dd, where we don't allow more than 4 digits for year, 2 digits for month, and 2 digits for day.
Thus, following date values will be parsed differently
02025-01-01 => null (after), 2025-01-01 (before)
2025-011-01 => null (after), 2025-11-01 (before)
2025-01-011 => null (after), 2025-01-11 (before)
202510-01-01 => null (after), +202510-01-01 (before)
2025-100-01 => null (after), 2033-04-01 (before)
2025-01-100 => null (after), 2025-04-10 (before)
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(O) Release notes are required, with the following suggested text: