This repository was archived by the owner on May 17, 2024. It is now read-only.

Support of Clickhouse #217

Closed
wants to merge 12 commits into from

Conversation

Contributor

@pik94 pik94 commented Aug 26, 2022

No description provided.

@pik94 pik94 force-pushed the clickhouse-integration branch 6 times, most recently from 21653a7 to 65bda1a on August 28, 2022 20:58
@pik94 pik94 force-pushed the clickhouse-integration branch from 65bda1a to e7406a8 on August 29, 2022 15:33
@pik94 pik94 marked this pull request as ready for review August 29, 2022 16:30
@pik94 pik94 self-assigned this Aug 29, 2022
Contributor

@erezsh erezsh left a comment

Also, run black -l 120 data_diff, as mentioned here - https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md#code-style

@@ -18,7 +18,7 @@ jobs:
     matrix:
       os: [ubuntu-latest]
       python-version:
-        - "3.7.1"
+        - "3.7"
Contributor

Why remove the .1?

Contributor Author

Earlier, I did not manage to update the lock file for 3.7 when I added Databricks, so I bumped it to 3.7.1, and it worked.

Now I've updated the lock file with an older version of Python, so I decided to go back to 3.7.

Contributor

Oh, I see. I thought we had it at 3.7.1 for a different reason, but if that's the case then it's fine.

Btw, what did you change to make updating the lockfile work?

Contributor Author

Nothing special; it just updated without a resolver error, although it had failed with one before. How poetry works is still a kind of magic to me...

@@ -0,0 +1,153 @@
from typing import Optional, Type

import clickhouse_driver.dbapi.connection
Contributor

We shouldn't have this import outside of import_clickhouse()

Contributor Author

Done.

        type_repr = type_repr[len(nullable_prefix):].lstrip('(').rstrip(')')

    if type_repr.startswith('Decimal'):
        type_repr = 'Decimal'
Contributor

Aren't we throwing valuable information here, like precision/scale? For example, see how it's done in Presto: https://github.com/datafold/data-diff/blob/master/data_diff/databases/presto.py#L113

Contributor Author

> Aren't we throwing valuable information here, like precision/scale?

No, we are not. This information comes from information_schema anyway.

Contributor

But type_repr is the information that comes from information_schema, no?

Contributor Author

@pik94 pik94 Aug 30, 2022

Yes, it is. But it's also stored in the numeric_precision and numeric_scale columns of information_schema.

Contributor

@erezsh erezsh Aug 30, 2022

I understand now. Well, ideally I would like to read it anyway, and just add an assert to make sure it's the same as in information_schema. But I won't insist on it.
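
For reference, a minimal sketch of that approach (the regex and the info_schema_precision/info_schema_scale parameters are hypothetical, not code from this PR):

import re

def parse_decimal(type_repr: str, info_schema_precision: int, info_schema_scale: int):
    m = re.fullmatch(r"Decimal\((\d+),\s*(\d+)\)", type_repr)
    if m is None:
        return None
    precision, scale = int(m.group(1)), int(m.group(2))
    # Cross-check the parsed values against information_schema instead of discarding them.
    assert (precision, scale) == (info_schema_precision, info_schema_scale)
    return precision, scale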

def _parse_type_repr(self, type_repr: str) -> Optional[Type[ColType]]:
    nullable_prefix = 'Nullable'
    if type_repr.lower().startswith(nullable_prefix.lower()):
        type_repr = type_repr[len(nullable_prefix):].lstrip('(').rstrip(')')
Contributor

This isn't readable. Please use a regexp.

Contributor Author

Rewrote it in a more readable form but without regexp.
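
For reference, a regexp version of the Nullable-stripping might look like this (a sketch only, not the code that landed in the PR):

import re

NULLABLE_RE = re.compile(r"^Nullable\((.*)\)$", re.IGNORECASE)

def strip_nullable(type_repr: str) -> str:
    # 'Nullable(Decimal(38, 6))' -> 'Decimal(38, 6)'; anything else is returned unchanged.
    m = NULLABLE_RE.match(type_repr)
    return m.group(1) if m else type_repr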

        prec = coltype.precision
        timestamp = f'toDateTime64(round(toUnixTimestamp64Micro(toDateTime64({value}, 6)) / 1000000, {prec}), 6)'
        return self.to_string(timestamp)
    else:
Contributor
"else" clause not needed. I suggest using pylint on this module to catch things like that.

Contributor Author

Done.
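
For anyone unfamiliar with the pylint check in question (no-else-return, R1705), the pattern is generic; the names below are illustrative, not the PR's code:

def normalize(value: str, is_timestamp: bool) -> str:
    if is_timestamp:
        return f"toString({value})"
    # No `else:` needed -- the `if` branch already returned.
    return value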


def _convert_db_precision_to_digits(self, p: int) -> int:
    # Done the same as for PostgreSQL but need to rewrite in another way
    # because it does not help for float with a big integer part.
Contributor

I don't understand this comment. Can you explain it? What floats, for example, it doesn't help with?

Contributor Author

@pik94 pik94 Aug 30, 2022

It's a dirty hack to keep compatibility with Postgres.
ClickHouse's Float32 uses the C float type under the hood. I guess a C float backs Postgres' 4-byte real (or float4, which is the same thing) too, because it shows the same behavior. It still remains a mystery to me why 2 is subtracted here for PostgreSQL. I think it's not the weird precision issues mentioned in the comment for Postgres, but the normal behavior of a 4-byte floating-point number, which has a precision of 6: a float can store no more than 6-7 significant digits on average. Let me clarify with an example for Postgres (the same holds for ClickHouse):

with tmp as (
    select 0.123456789::float4 x
    union
    select 10.123456789::float4 x
    union
    select 100.123456789::float4 x
    union
    select 1000.123456789::float4 x
)
select
    x::decimal(38, 6)
from tmp
order by x;

Result:

      x      
-------------
    0.123457
   10.123500
  100.123000
 1000.120000
(4 rows)

Every time the integer part grows 10 times, we lose one digit of the fractional part. So a precision of 6-7 for float means it can store 6-7 digits for the whole number, i.e. the integer and fractional parts combined.

I see that the unit tests use the number 100.98, and I think they pass only because of the subtraction of 2 from the precision for Postgres. But if you add a number with a larger integer part, for example 1000.98, the tests fail.

In my opinion, we cannot cast float to decimal(38, some_precision), because the usable precision depends on the integer part of the float value.
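
The same effect is easy to reproduce in Python (a standalone sketch, not PR code): forcing a value through IEEE 754 single precision, as float4/Float32 do, keeps only about 6-7 significant digits in total.

import struct

def as_float32(x: float) -> float:
    # Round-trip through a 4-byte IEEE 754 float, like Postgres float4 / ClickHouse Float32.
    return struct.unpack('f', struct.pack('f', x))[0]

for v in (0.123456789, 10.123456789, 100.123456789, 1000.123456789):
    # Only the first ~7 significant digits survive the round trip; the bigger
    # the integer part, the fewer fractional digits remain.
    print(f"{v} -> {as_float32(v):.6f}")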

Contributor

@erezsh erezsh Aug 30, 2022

It does look like the precision of floats is a little more complicated than we currently account for.

But following your example, I can think of another problem:

with tmp as (
    select 123456789::float4 x
)
select
    x::decimal(38, 6)
from tmp;

        x
------------------
 123457000.000000

So even the integer parts aren't safe!

I do think we can make two assumptions that help us a lot -

  1. The value already sits in each database, and was probably given a datatype that makes sense on both ends

  2. We only need to be as precise as the tool that copied it (probably in a migration)

But yeah, I agree this is a weak point, and it deserves some more attention.

@masonwheeler masonwheeler Aug 31, 2022

@pik94 It's not "C language floats," it's the IEEE 754 floating-point standard that defines what float should look like and how it should behave for all digital computers in order to ensure interoperability. C (and virtually every other programming language) does it that way because the CPU does it that way, and the CPU does it that way because the standard says that's the way it must be done.

Contributor Author

@masonwheeler, yes, it's more accurate to say that it's the behavior of floating-point numbers described in the IEEE 754 standard. I just referred to the C float because it's what the ClickHouse docs mention.

# For examples above it looks like:
# select toString(toDecimal128(1.10, 2 + 1) + toDecimal128(0.001, 3)); -- the result is 1.101
# After that, cut an extra symbol from the string, i.e. 1.101 -> 1.10
# So, the algorithm is:
Contributor

Why is this better than calling RPAD, once with "." and once with "0" ?

It sounds like your method might lead to all kinds of strange precision errors.

Contributor Author

@pik94 pik94 Aug 30, 2022

> Why is this better than calling RPAD, once with "." and once with "0" ?

The problem is obtaining the number of digits needed to apply RPAD here (it may vary, in contrast to the timestamp format).

> It sounds like your method might lead to all kinds of strange precision errors.

Does it? I just add a small number with higher precision here; I don't use it to round anything.

Contributor

I can't say for sure. It just feels unstable, but maybe I'm worried for nothing.

I'll give it a deeper look later.

Contributor Author

@pik94 pik94 Sep 3, 2022

At the moment I see only one corner case: when a customer's numeric type has a precision higher than 38 digits (for example, Decimal256, or Decimal(100, 40), etc.). Casting to Decimal128(39) is not possible. However, I'm not sure people use such high precision.

I also tried to rewrite it using RPAD but ran into the following problem. When we do not add the small value, casting a float to decimal has an unpleasant side effect. For example, suppose we want to cast to a precision of 5:

with tmp as (
    select toFloat64(0.001201923076923077) as col
)
select
    col,
    round(col, 5),
    toDecimal128(round(col, 5), 5) as precision_5,
    toDecimal128(round(col, 5), 6) as precision_6
from tmp
order by col;

Result:

0.001201923076923077, 0.0012, 0.00119, 0.001200

As you can see, the number was corrupted at 5-digit precision.
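
A plausible explanation, sketched in Python (not PR code): round() returns a binary float, and the double closest to 0.0012 is slightly below it, so a decimal cast that truncates at scale 5 lands on 0.00119.

from decimal import Decimal

# Show the exact binary value behind round(..., 5):
print(Decimal(round(0.001201923076923077, 5)))
# -> 0.0011999999... (just under 0.0012), which truncates to 0.00119 at scale 5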

    'DateTime64': Timestamp,
}

ROUNDS_ON_PREC_LOSS = True
Contributor

In the tests you wrote "Clickhouse does not round microseconds when inserting in contrast to PostgreSQL and MySQL.", so it sounds like ROUNDS_ON_PREC_LOSS should be False.

Contributor Author

Done.

@glebmezh glebmezh linked an issue Sep 1, 2022 that may be closed by this pull request
@pik94 pik94 force-pushed the clickhouse-integration branch from fc65ea5 to af3b726 on September 4, 2022 12:56
@erezsh
Contributor

erezsh commented Sep 5, 2022

You took the snowflake connector too far back, it doesn't work for Python 3.10 - https://github.com/datafold/data-diff/runs/8187675999?check_suite_focus=true

(the unittests in this PR didn't run snowflake)

erezsh added a commit that referenced this pull request Sep 5, 2022
@erezsh
Contributor

erezsh commented Sep 5, 2022

This PR has been merged. It doesn't show because it was force-pushed after the fact.

@erezsh erezsh closed this Sep 5, 2022
@pik94
Contributor Author

pik94 commented Sep 5, 2022

> You took the snowflake connector too far back, it doesn't work for Python 3.10 - https://github.com/datafold/data-diff/runs/8187675999?check_suite_focus=true
>
> (the unittests in this PR didn't run snowflake)

@erezsh hm... Is there a way to run it here?

@erezsh
Contributor

erezsh commented Sep 5, 2022

Let's worry about that next time :)


Successfully merging this pull request may close these issues.

Add support to ClickHouse