Conversation
Also, run black -l 120 data_diff, as mentioned here: https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md#code-style
@@ -18,7 +18,7 @@ jobs:
   matrix:
     os: [ubuntu-latest]
     python-version:
-      - "3.7.1"
+      - "3.7"
Why remove the .1?
Earlier, I didn't manage to update the lock file for 3.7 when I added Databricks, so I bumped it to 3.7.1, and that worked.
Now I've updated the lock file with an older version of Python, so I decided to go back to 3.7.
Oh, I see. I thought we had it at 3.7.1 for a different reason, but if that's the case then it's fine.
Btw, what did you change to make updating the lockfile work?
Nothing special; it just updated without a resolution error, although it failed with one before. How Poetry works is still a kind of magic to me...
data_diff/databases/clickhouse.py
Outdated
@@ -0,0 +1,153 @@
from typing import Optional, Type

import clickhouse_driver.dbapi.connection
This import shouldn't be outside of import_clickhouse().
Done.
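For context, this is roughly the lazy-import pattern being referred to: the driver is only imported when a ClickHouse connection is actually created, so the package isn't required for other databases. A minimal sketch; the real helper in data-diff may differ:

def import_clickhouse():
    # Deferred import: only pulled in when ClickHouse is actually used.
    import clickhouse_driver

    return clickhouse_driver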
data_diff/databases/clickhouse.py
Outdated
    type_repr = type_repr[len(nullable_prefix):].lstrip('(').rstrip(')')

if type_repr.startswith('Decimal'):
    type_repr = 'Decimal'
Aren't we throwing away valuable information here, like precision/scale? For example, see how it's done in Presto: https://github.com/datafold/data-diff/blob/master/data_diff/databases/presto.py#L113
Aren't we throwing away valuable information here, like precision/scale?

No, we are not. This information comes from information_schema anyway.
But type_repr is the information that comes from information_schema, no?
Yes, it is. But it's also stored in the numeric_precision and numeric_scale columns of information_schema.
I understand now. Well, ideally I would like to read it anyway, and just add an assert to make sure it's the same as in information_schema. But I won't insist on it.
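A rough sketch of what reading precision/scale out of the type string could look like, in the spirit of the linked Presto handler (illustrative names, not data-diff's actual code):

import re
from typing import Optional, Tuple

DECIMAL_RE = re.compile(r"^Decimal\((\d+),\s*(\d+)\)$")

def parse_decimal(type_repr: str) -> Optional[Tuple[int, int]]:
    # "Decimal(38, 6)" -> (38, 6); anything else -> None.
    m = DECIMAL_RE.match(type_repr)
    if m:
        precision, scale = map(int, m.groups())
        return precision, scale
    return None

The result could then be asserted against the numeric_precision / numeric_scale columns, as suggested above.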
data_diff/databases/clickhouse.py
Outdated
def _parse_type_repr(self, type_repr: str) -> Optional[Type[ColType]]:
    nullable_prefix = 'Nullable'
    if type_repr.lower().startswith(nullable_prefix.lower()):
        type_repr = type_repr[len(nullable_prefix):].lstrip('(').rstrip(')')
This isn't readable. Please use a regexp.
Rewrote it in a more readable form, but without a regexp.
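For reference, the regexp variant the reviewer suggested might look something like this — a sketch under the assumption that only the outer Nullable(...) wrapper needs stripping, not the code that was merged:

import re

NULLABLE_RE = re.compile(r"^Nullable\((.+)\)$", re.IGNORECASE)

def strip_nullable(type_repr: str) -> str:
    # "Nullable(Decimal(38, 6))" -> "Decimal(38, 6)"
    m = NULLABLE_RE.match(type_repr)
    return m.group(1) if m else type_repr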
data_diff/databases/clickhouse.py
Outdated
    prec = coltype.precision
    timestamp = f'toDateTime64(round(toUnixTimestamp64Micro(toDateTime64({value}, 6)) / 1000000, {prec}), 6)'
    return self.to_string(timestamp)
else:
"else" clause not needed. I suggest using pylint on this module to catch things like that.
Done.
def _convert_db_precision_to_digits(self, p: int) -> int:
    # Done the same as for PostgreSQL but need to rewrite in another way
    # because it does not help for float with a big integer part.
I don't understand this comment. Can you explain it? What floats, for example, does it not help with?
It's a dirty hack to preserve compatibility with Postgres.
ClickHouse's Float32 uses the C float type under the hood. I guess C float is used for Postgres' 4-byte real (or float4, which is the same) too, because it behaves the same way. It still remains a mystery to me why 2 is subtracted here for PostgreSQL. I think it's not the weird precision issues mentioned in the comment for Postgres, but the normal behavior of a 4-byte floating-point number, which has a precision of 6. For a float this means it can store no more than 6-7 digits on average. Let me clarify with an example for Postgres (the same holds for ClickHouse):
with tmp as (
    select 0.123456789::float4 x
    union
    select 10.123456789::float4 x
    union
    select 100.123456789::float4 x
    union
    select 1000.123456789::float4 x
)
select
    x::decimal(38, 6)
from tmp
order by x;
Result:
x
-------------
0.123457
10.123500
100.123000
1000.120000
(4 rows)
Each time we increase the integer part tenfold, we lose one digit of the fractional part. So a precision of 6-7 for float means it can store 6-7 digits for the whole number, i.e., the integer plus fractional parts.
I see in the unit tests that we have the number 100.98, and I think the tests pass because of the subtraction of 2 from the precision for Postgres. But if you add a number with a larger integer part, for example 1000.98, the tests fail.
In my opinion, we cannot cast float to decimal(38, some_precision), because the required precision depends on the integer part of the float value.
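For illustration, the same effect in a minimal Python sketch (assuming NumPy, which is not part of this PR): a float32 keeps roughly 6-7 significant digits in total, so formatting at 6 significant digits shows the fractional digits disappearing as the integer part grows.

import numpy as np

# Rough analogue of the SQL example above: the fractional digits
# vanish as the integer part grows, because float32 only carries
# about 6-7 significant digits overall.
for x in (0.123456789, 10.123456789, 100.123456789, 1000.123456789):
    print(f"{float(np.float32(x)):.6g}")

# 0.123457
# 10.1235
# 100.123
# 1000.12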
It does look like the precision of floats is a little more complicated than we currently account for.
But following your example, I can think of another problem:
with tmp as (
    select 123456789::float4 x
)
select
    x::decimal(38, 6)
from tmp;
x
------------------
123457000.000000
So even the integer parts aren't safe!
I do think we can make two assumptions that help us a lot:
- The value already sits in each database, and was probably given a datatype that makes sense on both ends.
- We only need to be as precise as the tool that copied it (probably in a migration).
But yeah, I agree this is a weak point, and it deserves some more attention.
@pik94 It's not "C language float
s," it's the IEEE 754 floating-point standard that defines what float
should look like and how it should behave for all digital computers in order to ensure interoperability. C (and virtually every other programming language) does it that way because the CPU does it that way, and the CPU does it that way because the standard says that's the way it must be done.
@masonwheeler, yes, it's more accurate to say that it's the behavior of floating-point numbers as described in the IEEE 754 standard. I just wrote float because that's the term used in the ClickHouse docs.
# For examples above it looks like:
# select toString(toDecimal128(1.10, 2 + 1) + toDecimal128(0.001, 3)); -- the result is 1.101
# After that, cut an extra symbol from the string, i.e. 1.101 -> 1.10
# So, the algorithm is:
Why is this better than calling RPAD, once with "." and once with "0"?
It sounds like your method might lead to all kinds of strange precision errors.
Why is this better than calling RPAD, once with "." and once with "0"?

The problem is obtaining the number of digits to apply rpad here (it may vary, in contrast to the timestamp format).

It sounds like your method might lead to all kinds of strange precision errors.

Does it? I've just added a small number with a higher precision here, and I don't use it to round anything.
I can't say for sure. It just feels unstable, but maybe I'm worried for nothing.
I'll give it a deeper look later.
At the moment I see only one corner case: if a customer's numeric type has a precision higher than 38 digits (for example, Decimal256, or Decimal(100, 40), etc.), since casting to Decimal128(39) is not possible. However, I'm not sure people use such high precision.
Also, I tried to rewrite it using RPAD but ran into the following problem. When we don't add the small value, casting a float to a decimal has an unpleasant side effect. For example, suppose we want to cast to a precision of 5:
with tmp as (
    select toFloat64(0.001201923076923077) as col
)
select
    col,
    round(col, 5),
    toDecimal128(round(col, 5), 5) as precision_5,
    toDecimal128(round(col, 5), 6) as precision_6
from tmp
order by col;
Result:
0.001201923076923077, 0.0012, 0.00119, 0.001200
As you can see, the number was corrupted at 5-digit precision.
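For reference, here is a rough Python analogue of the trick from the quoted comment, using the stdlib decimal module (a hypothetical helper, not the PR's code): compute at one extra digit of precision, add a small offset, then cut the last character from the string.

from decimal import Decimal

# Analogue of:
#   toString(toDecimal128(1.10, prec + 1) + toDecimal128(0.001, prec + 1))
# followed by cutting the last character: "1.101" -> "1.10".
def normalize(value: str, prec: int) -> str:
    quantum = Decimal(1).scaleb(-(prec + 1))    # e.g. Decimal("0.001") for prec=2
    widened = Decimal(value).quantize(quantum)  # one extra digit of precision
    return str(widened + quantum)[:-1]          # drop the sentinel digit

print(normalize("1.10", 2))  # -> "1.10"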
data_diff/databases/clickhouse.py
Outdated
    'DateTime64': Timestamp,
}

ROUNDS_ON_PREC_LOSS = True
In the tests you wrote "Clickhouse does not round microseconds when inserting in contrast to PostgreSQL and MySQL.", so it sounds like ROUNDS_ON_PREC_LOSS should be False.
Done.
You took the snowflake connector too far back; it doesn't work for Python 3.10 - https://github.com/datafold/data-diff/runs/8187675999?check_suite_focus=true (the unit tests in this PR didn't run Snowflake).
This PR has been merged. It doesn't show because it was force-pushed after the fact.
@erezsh hm... Is there a way to run it here?
Let's worry about that next time :)