Skip to content

Use new parseHiveDate for OpenX reader to remove any characters after yyyy-mm-dd #25792

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 5, 2025

Conversation

ljw9111
Copy link
Contributor

@ljw9111 ljw9111 commented May 14, 2025

Description

The parseHiveDate method in HiveFormatUtils.java that the native OpenX reader was using only supported
a space delimiter to remove any characters after 'yyyy-mm-dd'. As a result, while '2025-01-04 00:00:00.000Z'
was correctly parsed as '2025-01-04', strings like '2025-01-04T00:00:00.000Z' or '2025-01-04AA00:00:00.000Z'
were throwing exceptions and being parsed as null.
This new parseHiveDate method removes any characters after 'yyyy-mm-dd', regardless of the delimiter using regex.

This change adopts Hive 4's fix in https://github.com/apache/hive/blob/84b42876441ec8984f1a39367ae77627cc96a275/common/src/java/org/apache/hadoop/hive/common/type/Date.java#L179-L200
https://github.com/apache/hive/blob/84b42876441ec8984f1a39367ae77627cc96a275/common/src/java/org/apache/hadoop/hive/common/type/Date.java#L86-L89

The only difference is that it uses ResolverStyle.LENIENT while Hive 4's fix uses ResolverStyle.STRICT

Additionally, we introduce strict parsing for the date format yyyy-mm-dd, where we don't allow more than 4 digits for year, 2 digits for month, and 2 digits for day.

Thus, following date values will be parsed differently
02025-01-01 => null (after), 2025-01-01 (before)
2025-011-01 => null (after), 2025-11-01 (before)
2025-01-011 => null (after), 2025-01-11 (before)
202510-01-01 => null (after), +202510-01-01 (before)
2025-100-01 => null (after), 2033-04-01 (before)
2025-01-100 => null (after), 2025-04-10 (before)

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(O) Release notes are required, with the following suggested text:

## Hive connector
* The OpenX JSON reader now ignores any extra characters after parsing a timestamp value as a date.
  Previously, it only supported parsing timestamps with a space separator between the date and
  timestamp, but now it supports parsing with other separators such as `T` used by ISO 8601
  timestamps. Dates are now validated to enforce no more than 4 digit years, 2 digit months, and 2 digit
  day of month values. ({issue}`25792`)

@cla-bot cla-bot bot added the cla-signed label May 14, 2025
@ljw9111 ljw9111 requested a review from dain May 14, 2025 15:44
Copy link
Member

@pettyjamesm pettyjamesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we have to options here:

  1. Preserve compatibility with the (original) OpenX JSON SerDe, which means preserving the semantics of doing new SimpleDateFormat("yyyy-MM-dd").parse(value) (see: JavaStringDateObjectInspector.parse)
  2. Match Hive's parsing semantics at one variation of the following:
    2a. Later versions of Hive 3 which support parsing timestamps separating the date and time with either a space or ISO 8601 style timestamps separating the two with 'T' (see: Hive 3 Date.valueOf)
    2b. Hive 4 plus style parsing which ignores anything after the date portion (see: Hive 4 Date.valueOf)

If we're preserving the original OpenX serde semantics, that means ignoring any contents of the string after the date portion (similar to option 2b, matching Hive 4+)

I would support matching the permissive OpenX SimpleDateFormat equivalent behavior, but only for the OpenX SerDe (as is the approach here, avoiding using the shared HiveFormatUtils.parseHiveDate method).

cc: @electrum, @dain - any thoughts on what the compatibility goal should be here and whether special casing this logic in OpenX is the right approach to take?

@ljw9111 ljw9111 self-assigned this May 14, 2025
@ljw9111 ljw9111 force-pushed the openx-date-parse-issue branch from 5cd3410 to e810185 Compare May 14, 2025 18:46
@ljw9111 ljw9111 force-pushed the openx-date-parse-issue branch from e810185 to 46ed36e Compare May 23, 2025 20:23
Copy link
Member

@pettyjamesm pettyjamesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes mostly LGTM, with some minor changes recommended.

@ljw9111 ljw9111 force-pushed the openx-date-parse-issue branch from 46ed36e to df34882 Compare May 27, 2025 20:22
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems reasonable in spirit, but I'm concerned about most of the tests being Trino only. Trino only tests should only be used when there is an expected difference between Hive and Trino, and as I understand it, the point of this PR is to make the code more compatible with Hive.

Comment on lines 896 to 917
assertValueTrinoOnly(DATE, "\"1986-01-01T00:00:00.000Z\"", sqlDate);
assertValueTrinoOnly(DATE, "\"1986-01-01AA00:00:00.000Z\"", sqlDate);

// Test bad yyyy-mm-dd formats
// Cascading when number of digits are allowed
assertValueTrinoOnly(DATE, "\"1986-01-33\"", null);
assertValueTrinoOnly(DATE, "\"2023-80-80\"", null);
assertValueTrinoOnly(DATE, "\"9999-99-99\"", null);

// Digits are more than allowed in year, month, or day
assertValueTrinoOnly(DATE, "\"02025-01-01\"", null);
assertValueTrinoOnly(DATE, "\"2025-011-01\"", null);
assertValueTrinoOnly(DATE, "\"2025-01-011\"", null);

// Digits are more than allowed in year, month, or day and cascading
assertValueTrinoOnly(DATE, "\"202510-01-01\"", null);
assertValueTrinoOnly(DATE, "\"2025-100-01\"", null);
assertValueTrinoOnly(DATE, "\"2025-01-100\"", null);
assertValueTrinoOnly(DATE, "\"5881580-07-11\"", null);
assertValueTrinoOnly(DATE, "\"-5877641-06-23\"", null);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all of these "Trino Only" tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that the "Hive" implementation uses the starburst OpenX JsonSerDe which uses Hive's date parsing, whereas the original OpenX SerDe actually uses SimpleDateFormat which implies different semantics in the edge cases being addressed here. Correct me if I'm wrong on that @ljw9111

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that makes sense. I remember now, the Hive version of OpenX had a weird license and I'm not sure it even ran correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertValueTrinoOnly also tests on Hive in here that it either produces value other than specified expected value (in this case null) or fails. I used assertValueTrinoOnly as we only care that it produces null on Trino.

Copy link
Member

@pettyjamesm pettyjamesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, thanks @ljw9111!

@ljw9111 ljw9111 force-pushed the openx-date-parse-issue branch 2 times, most recently from d983a16 to ffe7f62 Compare June 4, 2025 19:05
@ljw9111 ljw9111 force-pushed the openx-date-parse-issue branch from ffe7f62 to afb54c2 Compare June 4, 2025 21:03
@pettyjamesm pettyjamesm merged commit 463f607 into trinodb:master Jun 5, 2025
59 checks passed
@github-actions github-actions bot added this to the 476 milestone Jun 5, 2025
@ljw9111 ljw9111 deleted the openx-date-parse-issue branch June 13, 2025 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants