This repository was archived by the owner on May 17, 2024. It is now read-only.

Support databricks #55

Merged
merged 7 commits into from
Jul 12, 2022

Conversation

pik94
Contributor

@pik94 pik94 commented Jun 13, 2022

Add test support of Databricks connector.

Contributor

@erezsh erezsh left a comment


Overall looks alright!

My only problem is with the changes to pyproject.toml.

pyproject.toml Outdated
@@ -23,7 +23,7 @@ classifiers = [
packages = [{ include = "data_diff" }]

[tool.poetry.dependencies]
- python = "^3.7"
+ python = "^3.7.1"
Contributor


What is this change for?

Contributor Author


Without it, poetry installs old versions of databricks-sql-connector (<2.0.2) and of pandas (which databricks depends on), and the diff fails at runtime. Bumping the patch version fixed it: newer versions of databricks-sql-connector and pandas get installed.

pyproject.toml Outdated
@@ -38,9 +38,11 @@ snowflake-connector-python = { version = "*", optional = true }
[tool.poetry.dev-dependencies]
mysql-connector-python = "*"
preql = "^0.2.14"
+ pandas = "^1.3"
Contributor


Why is this here?

Contributor Author


With a lower version, the databricks connector fails at runtime because an old version of pandas (<1.3) gets installed. But maybe this constraint can be removed: if databricks-sql-connector resolves to a good version (2.0.2), an appropriate pandas version should be pulled in as well. I'll check whether it can be removed.

@erezsh
Contributor

erezsh commented Jun 14, 2022

Also, needs to be rebased over master; there are conflicts to resolve. (sorry, we're at a fast moving dev stage)

@sirupsen
Contributor

@erezsh I think we'll also need the rounding support in here, no?

@erezsh
Contributor

erezsh commented Jun 14, 2022

If we decide to make it mandatory for implementing a Database, yes.

@pik94 pik94 force-pushed the support-databricks branch from b184294 to 4ea25bb Compare June 14, 2022 19:18
@pik94
Contributor Author

pik94 commented Jun 14, 2022

Also, needs to be rebased over master; there are conflicts to resolve. (sorry, we're at a fast moving dev stage)

@erezsh, I resolved the conflicts, but I see that the new version failed because it could not compare objects of the ColumnName class. I've added a fix for that in this PR.

@erezsh
Contributor

erezsh commented Jun 16, 2022

Hi, I'm sorry but a lot has changed in the last week.

Database classes now also need to "normalize" the values, based on the column type. For example, dates need to have the form of YYYY-MM-DD HH:MM:SS.FFFFFF. They also need to provide table schema, so we know the column types.
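To make the requirement concrete, here is a minimal sketch of such a normalization helper in Python (`normalize_timestamp` is a hypothetical name for illustration, not the actual data-diff API):

```python
from datetime import datetime

def normalize_timestamp(value: datetime) -> str:
    # Render a datetime as 'YYYY-MM-DD HH:MM:SS.FFFFFF' (hypothetical helper);
    # %f always produces exactly six fractional digits.
    return value.strftime("%Y-%m-%d %H:%M:%S.%f")

print(normalize_timestamp(datetime(2022, 6, 16, 12, 30, 0, 123)))
# 2022-06-16 12:30:00.000123
```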

We plan to make the 0.1 release sometime next week. I believe the pace of internal changes will slow down after that.

@erezsh
Contributor

erezsh commented Jun 27, 2022

We've released v0.1 and things should be a bit more stable from now on.

@pik94 pik94 force-pushed the support-databricks branch from 4ea25bb to 64b1be4 Compare July 2, 2022 13:18
@pik94 pik94 marked this pull request as draft July 2, 2022 13:20
@pik94 pik94 force-pushed the support-databricks branch from 64b1be4 to a8df33d Compare July 2, 2022 16:29
@pik94 pik94 force-pushed the support-databricks branch from a8df33d to c3f416e Compare July 6, 2022 13:18
@pik94 pik94 marked this pull request as ready for review July 6, 2022 13:22
@pik94 pik94 force-pushed the support-databricks branch from c3f416e to b48a703 Compare July 6, 2022 13:28
Contributor

@erezsh erezsh left a comment


Overall looks good!

Just a few comments.

Also, can you provide instructions on how to set up a Databricks server in order to test this?

# dsn user - access token
# dsn password - http path (starting with /)

return cls(http_path=dsn.password, access_token=dsn.user, server_hostname=dsn.hostname, **kw)
Contributor


I don't know how I feel about putting the http_path as the password.

Can you tell me why you chose this, instead of adding it to the path, or as a query argument?

Contributor Author

@pik94 pik94 Jul 6, 2022


adding it to the path

http_path starts with a / symbol, so adding it to the path leads to a parsing error. We could just tell users not to include the slash, but I bet the first run would almost always fail: the Databricks UI copies http_path with the slash, so people might be confused on their first attempt at diffing.

a query argument

I assumed that query params come across as optional, i.e. not required, so people might skip it on the first run, and that would lead to an error as well.

So, my concern is simply not to frustrate people on their first attempt at diffing.

Contributor


We can provide a good error, telling users that this parameter is necessary.

I think it's less confusing than using the password. Also, we're going to remove the password whenever we print the configuration to the logs. That means the http_path will appear as ****.

P.S. We can also ensure that it's formatted with only one leading slash; shouldn't be too hard to do.
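A sketch of that leading-slash normalization (hypothetical helper name, shown only to illustrate the idea):

```python
def normalize_http_path(raw: str) -> str:
    # Accept "sql/2.0/x", "/sql/2.0/x", or even "//sql/2.0/x" and
    # always return the path with exactly one leading slash.
    return "/" + raw.lstrip("/")

print(normalize_http_path("sql/2.0/abcdeabcde"))   # /sql/2.0/abcdeabcde
print(normalize_http_path("/sql/2.0/abcdeabcde"))  # /sql/2.0/abcdeabcde
```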

Contributor Author

@pik94 pik94 Jul 6, 2022


OK, in that case I'll update the URI to the format databricks://:access_token@account/http_path/catalog/schema.

Contributor


Maybe account should be where user is? Seems to make sense.

(I don't know databricks, do they have some convention about this?)

Contributor Author


I haven't heard of a convention for this, but I think we can use account as the user here; it looks pretty natural.

Contributor Author


I checked it more carefully. I think the easiest way is the following URI: databricks://http_path:access_token@server_name/catalog/schema.

An http_path looks like /sql/2.0/abcdeabcde, so parsing of databricks://server_name:access_token@http_path/catalog/schema leads to http_path=sql, catalog=2.0, schema=abcdeabcde. I think it's not critical to have http_path as the user. WDYT?
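The ambiguity can be demonstrated with Python's urllib, using the placeholder values from the comment above: because http_path itself contains slashes, the path segments cannot be split back into http_path, catalog, and schema unambiguously.

```python
from urllib.parse import urlparse

# http_path (/sql/2.0/abcdeabcde) embedded in the URI path: the parser
# has no way to tell where http_path ends and catalog/schema begin.
uri = "databricks://access_token@server_name/sql/2.0/abcdeabcde/catalog/schema"
segments = urlparse(uri).path.strip("/").split("/")
print(segments)  # ['sql', '2.0', 'abcdeabcde', 'catalog', 'schema']
```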


elif issubclass(type_cls, Decimal):
    # TYPE_NAME has the format DECIMAL(x,y)
    items = row.TYPE_NAME[8:].strip(')').split(',')
Contributor


Sorry for the nit-pick, but better to use rstrip()

Contributor Author


Np, changed it :)

@pik94 pik94 force-pushed the support-databricks branch from b48a703 to f074d1a Compare July 6, 2022 15:27
@pik94
Contributor Author

pik94 commented Jul 6, 2022

Also, can you provide instructions on how to set-up a databricks server in order to test this?

Oh, I'm not sure I know how to do that; I use our SaaS cluster. @glebmezh, can we give Erez access to our Databricks so that he can test this PR as well?

@glebmezh
Contributor

glebmezh commented Jul 7, 2022

@erezsh granted access

@pik94 pik94 force-pushed the support-databricks branch from f074d1a to f45598c Compare July 7, 2022 18:55
@erezsh
Contributor

erezsh commented Jul 11, 2022

@pik94 The tests are failing for numbers. They are expected to be 0.00000000..., but instead Databricks formats them as 0E-10.

For example:

======================================================================
test_types_mysql_num65_10_databricks_double_50 (tests.test_database_types.TestDiffCrossDatabaseTables)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\parameterized\parameterized.py", line 533, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "C:\code\datafold\xdiff\tests\test_database_types.py", line 630, in test_types
    self.assertEqual(expected, diff)
AssertionError: Lists differ: [] != [('-', ('1', '0.0000000000')), ('+', ('1',[116 chars]0'))]

Second list contains 6 additional elements.
First extra element 0:
('-', ('1', '0.0000000000'))

- []
+ [('-', ('1', '0.0000000000')),
+  ('+', ('1', '0E-10')),
+  ('-', ('14', '0.0000000000')),
+  ('+', ('14', '0E-10')),
+  ('-', ('15', '0.0000000000')),
+  ('+', ('15', '0E-10'))]
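This is standard Decimal behavior in Python: str() on a zero with a negative exponent yields scientific notation, while fixed-point formatting gives the expected form.

```python
from decimal import Decimal

d = Decimal("0E-10")
print(str(d))          # 0E-10         (default scientific notation)
print(format(d, "f"))  # 0.0000000000  (fixed-point, as the test expects)
```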

@erezsh
Contributor

erezsh commented Jul 11, 2022

@pik94 I also had a look at the URI, and I don't think it's a good idea to put the "http_path" as the user, because then you'll have to escape the slashes.

I think this should be the URI instead:

databricks://:access_token@server_name/http_path?schema=..&catalog=..

This also reflects that "schema" and "catalog" are optional parameters.
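Parsed with urllib, this format separates every component cleanly (the token, host, path, and query values below are placeholders):

```python
from urllib.parse import urlparse, parse_qs

uri = "databricks://:my_token@server_name/sql/2.0/abcdeabcde?schema=default&catalog=hive"
u = urlparse(uri)
print(u.password)         # my_token
print(u.hostname)         # server_name
print(u.path)             # /sql/2.0/abcdeabcde  (the http_path, slash intact)
print(parse_qs(u.query))  # {'schema': ['default'], 'catalog': ['hive']}
```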

I already have a patch for this, so if you agree, I'll push it on to this PR.

@JCZuurmond
Contributor

👋 Hi, y'all. Just here to say that I would like to try data-diff on Databricks. Happy to see this PR being merged.

@erezsh erezsh mentioned this pull request Jul 12, 2022
@erezsh erezsh merged commit 66b7e52 into datafold:master Jul 12, 2022