This repository was archived by the owner on May 17, 2024. It is now read-only.

Support of Clickhouse #217

Closed
wants to merge 12 commits into from

Conversation

Contributor

@pik94 pik94 commented Aug 26, 2022

No description provided.

@pik94 pik94 force-pushed the clickhouse-integration branch 6 times, most recently from 21653a7 to 65bda1a on August 28, 2022 20:58
@pik94 pik94 force-pushed the clickhouse-integration branch from 65bda1a to e7406a8 on August 29, 2022 15:33
@pik94 pik94 marked this pull request as ready for review August 29, 2022 16:30
@pik94 pik94 self-assigned this Aug 29, 2022
Contributor

@erezsh erezsh left a comment

Also, run black -l 120 data_diff, as mentioned here - https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md#code-style

@@ -18,7 +18,7 @@ jobs:
     matrix:
       os: [ubuntu-latest]
       python-version:
-        - "3.7.1"
+        - "3.7"
Contributor

Why remove the .1?

Contributor Author

Earlier, I did not manage to update the lock file for 3.7 when I added Databricks, so I bumped it to 3.7.1, and it worked.

Now I've updated the lock file with an older version of Python, so I decided to go back to 3.7.

Contributor

Oh, I see. I thought we had it at 3.7.1 for a different reason, but if that's the case then it's fine.

Btw, what did you change to make updating the lockfile work?

Contributor Author

Nothing special; it just updated without a resolver error, although it had failed with one before. How poetry works is still a kind of magic to me...

@@ -0,0 +1,153 @@
from typing import Optional, Type

import clickhouse_driver.dbapi.connection
Contributor

We shouldn't have this import outside of import_clickhouse()

Contributor Author

Done.

        type_repr = type_repr[len(nullable_prefix):].lstrip('(').rstrip(')')

    if type_repr.startswith('Decimal'):
        type_repr = 'Decimal'
Contributor

Aren't we throwing valuable information here, like precision/scale? For example, see how it's done in Presto: https://github.com/datafold/data-diff/blob/master/data_diff/databases/presto.py#L113

Contributor Author

> Aren't we throwing valuable information here, like precision/scale?

No, we are not. This information comes from information_schema anyway.

Contributor

But type_repr is the information that comes from information_schema, no?

Contributor Author

@pik94 pik94 Aug 30, 2022

Yes, it is. But it's also stored in the numeric_precision and numeric_scale columns of information_schema.

Contributor

@erezsh erezsh Aug 30, 2022

I understand now. Well, ideally I would like to read it anyway, and just add an assert to make sure it's the same as in information_schema. But I won't insist on it.
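
For reference, a minimal sketch of that approach (the regex and the info_schema_precision/info_schema_scale parameters are hypothetical, not code from this PR):

import re

def parse_decimal(type_repr: str, info_schema_precision: int, info_schema_scale: int):
    m = re.fullmatch(r"Decimal\((\d+),\s*(\d+)\)", type_repr)
    if m is None:
        return None
    precision, scale = int(m.group(1)), int(m.group(2))
    # Cross-check the parsed values against information_schema instead of discarding them.
    assert (precision, scale) == (info_schema_precision, info_schema_scale)
    return precision, scale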

def _parse_type_repr(self, type_repr: str) -> Optional[Type[ColType]]:
    nullable_prefix = 'Nullable'
    if type_repr.lower().startswith(nullable_prefix.lower()):
        type_repr = type_repr[len(nullable_prefix):].lstrip('(').rstrip(')')
Contributor

This isn't readable. Please use a regexp.

Contributor Author

Rewrote it in a more readable form but without regexp.
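
For reference, a regexp version of the Nullable-stripping might look like this (a sketch only, not the code that landed in the PR):

import re

NULLABLE_RE = re.compile(r"^Nullable\((.*)\)$", re.IGNORECASE)

def strip_nullable(type_repr: str) -> str:
    # 'Nullable(Decimal(38, 6))' -> 'Decimal(38, 6)'; anything else is returned unchanged.
    m = NULLABLE_RE.match(type_repr)
    return m.group(1) if m else type_repr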

        prec = coltype.precision
        timestamp = f'toDateTime64(round(toUnixTimestamp64Micro(toDateTime64({value}, 6)) / 1000000, {prec}), 6)'
        return self.to_string(timestamp)
    else:
Contributor
"else" clause not needed. I suggest using pylint on this module to catch things like that.

Contributor Author

Done.
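
For anyone unfamiliar with the pylint check in question (no-else-return, R1705), the pattern is generic; the names below are illustrative, not the PR's code:

def normalize(value: str, is_timestamp: bool) -> str:
    if is_timestamp:
        return f"toString({value})"
    # No `else:` needed -- the `if` branch already returned.
    return value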


def _convert_db_precision_to_digits(self, p: int) -> int:
    # Done the same as for PostgreSQL but need to rewrite in another way
    # because it does not help for float with a big integer part.
Contributor

I don't understand this comment. Can you explain it? What floats, for example, it doesn't help with?

Contributor Author

@pik94 pik94 Aug 30, 2022

It's a dirty hack to keep compatibility with Postgres.
ClickHouse's Float32 uses the C float type under the hood. I guess a C float backs Postgres' 4-byte real (or float4, which is the same thing) too, because it shows the same behavior. It still remains a mystery to me why 2 is subtracted here for PostgreSQL. I think it's not the weird precision issues mentioned in the comment for Postgres, but the normal behavior of a 4-byte floating-point number, which has a precision of 6: a float can store no more than 6-7 significant digits on average. Let me clarify with an example for Postgres (the same holds for ClickHouse):

with tmp as (
    select 0.123456789::float4 x
    union
    select 10.123456789::float4 x
    union
    select 100.123456789::float4 x
    union
    select 1000.123456789::float4 x
)
select
    x::decimal(38, 6)
from tmp
order by x;

Result:

      x      
-------------
    0.123457
   10.123500
  100.123000
 1000.120000
(4 rows)

Every time the integer part grows 10 times, we lose one digit of the fractional part. So a precision of 6-7 for float means it can store 6-7 digits for the whole number, i.e. the integer and fractional parts combined.

I see that the unit tests use the number 100.98, and I think they pass only because of the subtraction of 2 from the precision for Postgres. But if you add a number with a larger integer part, for example 1000.98, the tests fail.

In my opinion, we cannot cast float to decimal(38, some_precision), because the usable precision depends on the integer part of the float value.
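
The same effect is easy to reproduce in Python (a standalone sketch, not PR code): forcing a value through IEEE 754 single precision, as float4/Float32 do, keeps only about 6-7 significant digits in total.

import struct

def as_float32(x: float) -> float:
    # Round-trip through a 4-byte IEEE 754 float, like Postgres float4 / ClickHouse Float32.
    return struct.unpack('f', struct.pack('f', x))[0]

for v in (0.123456789, 10.123456789, 100.123456789, 1000.123456789):
    # Only the first ~7 significant digits survive the round trip; the bigger
    # the integer part, the fewer fractional digits remain.
    print(f"{v} -> {as_float32(v):.6f}")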

Contributor

@erezsh erezsh Aug 30, 2022

It does look like the precision of floats is a little more complicated than we currently account for.

But following your example, I can think of another problem:

with tmp as (
    select 123456789::float4 x
)
select
    x::decimal(38, 6)
from tmp;

        x
------------------
 123457000.000000

So even the integer parts aren't safe!

I do think we can make two assumptions that help us a lot -

  1. The value already sits in each database, and was probably given a datatype that makes sense on both ends

  2. We only need to be as precise as the tool that copied it (probably in a migration)

But yeah, I agree this is a weak point, and it deserves some more attention.

@masonwheeler masonwheeler Aug 31, 2022

@pik94 It's not "C language floats," it's the IEEE 754 floating-point standard that defines what float should look like and how it should behave for all digital computers in order to ensure interoperability. C (and virtually every other programming language) does it that way because the CPU does it that way, and the CPU does it that way because the standard says that's the way it must be done.

Contributor Author

@masonwheeler, yes, it's more accurate to say that it's the behavior of floating-point numbers described in the IEEE 754 standard. I just referred to the C float because it's what the ClickHouse docs mention.

# For examples above it looks like:
# select toString(toDecimal128(1.10, 2 + 1) + toDecimal128(0.001, 3)); -- the result is 1.101
# After that, cut an extra symbol from the string, i.e. 1.101 -> 1.10
# So, the algorithm is:
Contributor

Why is this better than calling RPAD, once with "." and once with "0" ?

It sounds like your method might lead to all kinds of strange precision errors.

Contributor Author

@pik94 pik94 Aug 30, 2022

> Why is this better than calling RPAD, once with "." and once with "0" ?

The problem is obtaining the number of digits needed to apply RPAD here (it may vary, in contrast to the timestamp format).

> It sounds like your method might lead to all kinds of strange precision errors.

Does it? I just add a small number with higher precision here; I don't use it to round anything.

Contributor

I can't say for sure. It just feels unstable, but maybe I'm worried for nothing.

I'll give it a deeper look later.

Contributor Author

@pik94 pik94 Sep 3, 2022

At the moment I see only one corner case: when a customer's numeric type has a precision higher than 38 digits (for example, Decimal256, or Decimal(100, 40), etc.). Casting to Decimal128(39) is not possible. However, I'm not sure people use such high precision.

I also tried to rewrite it using RPAD but ran into the following problem. When we do not add the small value, casting a float to decimal has an unpleasant side effect. For example, suppose we want to cast to a precision of 5:

with tmp as (
    select toFloat64(0.001201923076923077) as col
)
select
    col,
    round(col, 5),
    toDecimal128(round(col, 5), 5) as precision_5,
    toDecimal128(round(col, 5), 6) as precision_6
from tmp
order by col;

Result:

0.001201923076923077, 0.0012, 0.00119, 0.001200

As you can see, the number was corrupted at 5-digit precision.
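
A plausible explanation, sketched in Python (not PR code): round() returns a binary float, and the double closest to 0.0012 is slightly below it, so a decimal cast that truncates at scale 5 lands on 0.00119.

from decimal import Decimal

# Show the exact binary value behind round(..., 5):
print(Decimal(round(0.001201923076923077, 5)))
# -> 0.0011999999... (just under 0.0012), which truncates to 0.00119 at scale 5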

    'DateTime64': Timestamp,
}

ROUNDS_ON_PREC_LOSS = True
Contributor

In the tests you wrote "Clickhouse does not round microseconds when inserting in contrast to PostgreSQL and MySQL.", so it sounds like ROUNDS_ON_PREC_LOSS should be False.

Contributor Author

Done.

@glebmezh glebmezh linked an issue Sep 1, 2022 that may be closed by this pull request
@pik94 pik94 force-pushed the clickhouse-integration branch from fc65ea5 to af3b726 on September 4, 2022 12:56
@erezsh
Contributor

erezsh commented Sep 5, 2022

You took the snowflake connector too far back, it doesn't work for Python 3.10 - https://github.com/datafold/data-diff/runs/8187675999?check_suite_focus=true

(the unittests in this PR didn't run snowflake)

erezsh added a commit that referenced this pull request Sep 5, 2022
@erezsh
Contributor

erezsh commented Sep 5, 2022

This PR has been merged. It doesn't show because it was force-pushed after the fact.

@erezsh erezsh closed this Sep 5, 2022
@pik94
Contributor Author

pik94 commented Sep 5, 2022

> You took the snowflake connector too far back, it doesn't work for Python 3.10 - https://github.com/datafold/data-diff/runs/8187675999?check_suite_focus=true
>
> (the unittests in this PR didn't run snowflake)

@erezsh hm... Is there a way to run it here?

@erezsh
Contributor

erezsh commented Sep 5, 2022

Let's worry about that next time :)


Successfully merging this pull request may close these issues.

Add support to ClickHouse