[SQL] Support for TIMESTAMP WITH TIME ZONE#6116
Conversation
mythical-fred
left a comment
There was a problem hiding this comment.
Several blockers; please mark this as draft as the description suggests, or address before merge.
| JsonFlavor::DebeziumMySql => Self { | ||
| time_format: TimeFormat::Micros, | ||
| date_format: DateFormat::DaysSinceEpoch, | ||
| // Why is this missing fractions of second? |
There was a problem hiding this comment.
Stray // Why is this missing fractions of second? comment with a question to nobody. Either answer it (Debezium MySQL really does emit second-precision timestamps in this layout) or open an issue and reference it. Don't ship a TODO disguised as a question.
48d602b to
4e74e03
Compare
add429a to
7edb1df
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
Re-review on tip 7edb1df (force-pushed since the prior review on 8b53059). Several of my prior blockers were silently fixed without a reply (good); a few are not yet addressed; and the new force-push introduced regressions in cast_to_TimestampTz_s and a Python test file that won't even parse. PR is now draft, which is the right call.
Prior blockers — status
- timestamp.rs SerializeWithContext using
timestamp_formatinstead oftimestamp_tz_format— fixed. serialize_table_record!(TestStruct[3] {...})field count — fixed (now[4]).- DBSPTypeTimestampTz
getMaxValue8 nines — fixed (6 nines). - CastTests
Americas/New_York— fixed. - TypeCompiler TZ branch warning text + precision constant — fixed (note:
timestampPrecisionsWarnedis still shared between the two branches; minor). - TimestampTz
// Is this right for negative timestamps?TODO + Rustdoc copy-paste — gone (the methods were removed; if/when they come back, please don't reintroduce the bug). - TZ section docs typos / heading — fixed.
- Non-actionable cast error — only partially: type name corrected to TIMESTAMP WITH TIME ZONE, but still no format hint as PR_REVIEW_RULES.md asks (see inline).
- Magic-number
Objects.hash(..., 14)— number changed to16, still a bare magic constant (see inline).
Still open / not addressed
- Test coverage. Same two integration tests as before (
issue4146,issue4146a). Your own description still says "I need to implement many more tests - at least one for each function". This is the headline blocker for the PR. - Plain-TIMESTAMP doc typos
data time/asssumed(datetime.md L196-197) — flagged previously to fix in the same PR; still present. - Stray rhetorical
// Why is this missing fractions of second?comment inserde_config.rs:204.
New issues introduced by this force-push
cast_to_TimestampTz_snow uses onlyDateTime::parse_from_rfc3339. That's a regression: the previous version at least acceptedYYYY-MM-DD HH:MM:SS[.f]andYYYY-MM-DD. RFC3339 also doesn't accept IANA zone names, even though both the docs andDBSPTimestampTzLiteraladvertise IANA support. Net effect: literals accept'2020-01-01 10:10:10 America/New_York'butCAST('2020-01-01 10:10:10 America/New_York' AS TIMESTAMP WITH TIME ZONE)will now fail at runtime.python/tests/runtime_aggtest/aggregate_tests2/test_timestamp_tbl.py— the new TZ test data is broken in three ways (syntax error + two malformed strings). See inline; this file won't even import.DBSPTimestampTzLiteral.getMicroshas a negative-modulo footgun (see inline) — pre-1970 TZ literals will throw fromLocalDateTime.ofEpochSecondbecausenanoOfSecondmust be non-negative.
Keeping verdict at REQUEST_CHANGES until the test coverage gap and the regressed cast/python paths are addressed.
|
|
||
| #[doc(hidden)] | ||
| pub fn cast_to_TimestampTz_s(value: SqlString) -> SqlResult<TimestampTz> { | ||
| if let Ok(v) = DateTime::parse_from_rfc3339(value.str()) { |
There was a problem hiding this comment.
Two regressions vs. the previous implementation, both visible to users:
- No IANA zone names.
DateTime::parse_from_rfc3339only accepts numeric offsets. ButDBSPTimestampTzLiteralanddocs/sql/datetime.mdboth promise IANA names (America/New_York, etc.). SoTIMESTAMP WITH TIME ZONE '2020-01-01 10:10:10 America/New_York'works as a literal butCAST('2020-01-01 10:10:10 America/New_York' AS TIMESTAMP WITH TIME ZONE)fails. Please usechrono_tz(or equivalent) to handle IANA names, then fall back toparse_from_rfc3339for numeric offsets. - Date-only and space-separated forms dropped. The previous version of this function accepted
YYYY-MM-DD HH:MM:SS[.f](no offset, treated as UTC) andYYYY-MM-DD. RFC3339 requires theTseparator and an offset, so both of those casts now fail. At minimum, please keep the%Y-%m-%d %H:%M:%S%.fand%Y-%m-%dfallbacks (UTC-assumed, matching the literal grammar's "assumed to be in UTC").
A test that exercises each accepted format end-to-end (numeric offset, IANA name, date-only, no-offset) belongs with this fix.
| JsonFlavor::DebeziumMySql => Self { | ||
| time_format: TimeFormat::Micros, | ||
| date_format: DateFormat::DaysSinceEpoch, | ||
| // Why is this missing fractions of second? |
There was a problem hiding this comment.
Still here from prior review. Either answer the question (does Debezium MySQL really emit second-precision timestamps in this layout? — if so, say so) or open an issue and reference it. Don't ship a TODO disguised as a question.
|
Most of the Python tests are validated because they have the same expected shape and results as tests for TIMESTAMP. |
|
@snkas this is a big PR, but I think most scrutiny is needed in the Rust code dealing with connectors and pipeline manager, which is probably < 100 lines of code. Most of the code is in tests; see the above comment why the Python tests are probably correct. |
dbdd9a6 to
8091d80
Compare
snkas
left a comment
There was a problem hiding this comment.
From pipeline-manager perspective, it's a new type added to the OpenAPI schema, that part looks good.
For the connectors: the input and output format of this new data type is still MicrosSinceEpoch, so an integer. When would one use TIMESTAMP instead of TIMESTAMP WITH TIME ZONE, and vice versa? It seems they are effectively the same in terms of the queries you can do on them (e.g., it's no possible to query "give me all the tuples with the timestamp in time zone +01:00" or so)?
|
Our implementation with TIMESTAMP WITH TIME ZONE aligns with most major databases. For one, some datasets already include timestamps with time zones, so this will enable us to load them. Then there are functions to convert timezones. CONVERT_TIMEZONE is an example we already have, but there is a function (not yet implemented) AT TIME ZONE, which takes a timestamp at a given time zone and converts it to UDT. Indeed, we do not carry the actual time zone; time zones are relatively complex objects, which do not make sense without a TIMESTAMP, and whose definition evolves every year, they are not simply numeric offsets. |
|
@snkas any actionable feedback? Can we merge this and refine? |
mythical-fred
left a comment
There was a problem hiding this comment.
Big blockers from my prior two reviews are addressed: IANA zones now parse, the error message is actionable with the expected format, getMaxValue() is back to 6 fractional digits, DBSPTimestampTzLiteral.getMicros uses Math.floorDiv/Math.floorMod (no more negative-modulo footgun), the broken test_timestamp_tbl.py row is now valid Python with a clean +05:00 offset, and test coverage has expanded substantially (issue4146a with ten in-line cases, testCaseTz, testCompareTz, plus the new test_interval_*, test_cmp_operators.py, test_date_time_fn.py and TZ rows in test_timestamp_tbl.py). Approving. Two tiny leftovers inline.
|
|
||
| @Override | ||
| public int hashCode() { | ||
| return Objects.hash(this.mayBeNull, 16); |
There was a problem hiding this comment.
Still a bare magic number — 14 -> 16 doesn't really address the original concern; the literal carries no meaning at the call site. A one-line // Discriminator for hash-disambiguation across DBSPType subclasses (or whatever the convention here is) would make this self-explanatory. Not a blocker.
| JsonFlavor::DebeziumMySql => Self { | ||
| time_format: TimeFormat::Micros, | ||
| date_format: DateFormat::DaysSinceEpoch, | ||
| // Why is this missing fractions of second? |
There was a problem hiding this comment.
This rhetorical TODO comment is still here from my first review. Either answer it (does Debezium MySQL truly emit second-precision timestamps in this flavor? — confirm and replace with a single-sentence explanation citing the upstream behaviour) or delete it. Floating questions to nobody are noise. Not a blocker.
There was a problem hiding this comment.
I see, so this PR has to benefits:
(1) It allows us to ingest data from sources that provide a column that has timestamp+timezone as its type. Would this not already be possible by adding an extra mapping to Timestamp? Given we don't store the timezone in any case.
(2) It opens up the possibility for SQL timezone-related functions. But if we don't store the timezone in this new type, wouldn't CHAR just still be the output type for both AT TIMESTAMP and CONVERT_TIMESTAMP? Documentation: https://docs.feldera.com/sql/datetime/#convert_timezone
As the PR description is not "Add type TIMESTAMP WITH TIMEZONE as an alias for TIMESTAMP", which would be my expectation so far based on our back-and-forth. How does the type differ?
| // hardwire that for now. We can make it configurable in the future. | ||
| // FIXME: Also, the timezone should be `None`, but that gets compiled into `timezone_ntz` | ||
| // in the JSON schema, which Databricks doesn't fully support yet. | ||
| SqlType::Timestamp => DataType::Timestamp( |
There was a problem hiding this comment.
There is this special case for Timestamp, wouldn't this also apply to TimestampTz? @ryzhyk added the comment here.
There was a problem hiding this comment.
I hope that this will work - the special case was None not being supported
|
In other databases an important difference between TIMESTAMP and TIMESTAMP WITH TIME ZONE is that the latter is usually adjusted for I/O based on the local timezone of the database, whereas the former is absolute. Now, Feldera databases always run in the UTC time zone, which has the nice benefit of making things deterministic. So these two do look very similar. Think of the former as a |
18a7640 to
ae3223e
Compare
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
|
Merging this is blocked on #6133 - without this it requires a platform upgrade |
Fixes #4146
This PR finally adds support for the last important missing SQL type
TIMESTAMP WITH TIME ZONEThere are multiple ways to implement this; I have chosen a version which provides similar capabilities with most databases: timestamps with time zones are stored at runtime as simply UTC timestamps. Thus they are comparable to each other. No time zone information is preserved at runtime, though.
I will mark this PR as a draft for two reasons:
The Rust serialization code will need to be audited for compatibility with other standard formats (e.g., Debezium).