This repository was archived by the owner on May 17, 2024. It is now read-only.

Support databricks #55

Merged
merged 7 commits into from
Jul 12, 2022

Conversation

pik94
Contributor

@pik94 pik94 commented Jun 13, 2022

Add test support of Databricks connector.

Contributor

@erezsh erezsh left a comment


Overall looks alright!

My only problem is with the changes to pyproject.toml.

pyproject.toml Outdated
@@ -23,7 +23,7 @@ classifiers = [
packages = [{ include = "data_diff" }]

[tool.poetry.dependencies]
- python = "^3.7"
+ python = "^3.7.1"
Contributor


What is this change for?

Contributor Author


Without it, poetry installs old versions of databricks-sql-connector (<2.0.2) and of pandas (which databricks depends on), and the diff fails at runtime. Bumping the patch version fixed it: newer versions of databricks-sql-connector and pandas get installed.

pyproject.toml Outdated
@@ -38,9 +38,11 @@ snowflake-connector-python = { version = "*", optional = true }
[tool.poetry.dev-dependencies]
mysql-connector-python = "*"
preql = "^0.2.14"
+ pandas = "^1.3"
Contributor


Why is this here?

Contributor Author


With a lower version, the databricks connector fails at runtime because an old version of pandas (<1.3) gets installed. But maybe this constraint can be removed: if databricks-sql-connector resolves to a good version (2.0.2), an appropriate pandas version should be pulled in as well. I'll check whether it can be removed.

@erezsh
Contributor

erezsh commented Jun 14, 2022

Also, needs to be rebased over master; there are conflicts to resolve. (sorry, we're at a fast moving dev stage)

@sirupsen
Contributor

@erezsh I think we'll also need the rounding support in here, no?

@erezsh
Contributor

erezsh commented Jun 14, 2022

If we decide to make it mandatory for implementing a Database, yes.

@pik94 pik94 force-pushed the support-databricks branch from b184294 to 4ea25bb Compare June 14, 2022 19:18
@pik94
Contributor Author

pik94 commented Jun 14, 2022

Also, needs to be rebased over master; there are conflicts to resolve. (sorry, we're at a fast moving dev stage)

@erezsh, I resolved the conflicts, but I see that the new version failed because it could not compare objects of the ColumnName class. I've added a fix for that in this PR.

@erezsh
Contributor

erezsh commented Jun 16, 2022

Hi, I'm sorry but a lot has changed in the last week.

Database classes now also need to "normalize" the values, based on the column type. For example, dates need to have the form of YYYY-MM-DD HH:MM:SS.FFFFFF. They also need to provide table schema, so we know the column types.
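To make the requirement concrete, here is a minimal sketch of such a normalization helper in Python (`normalize_timestamp` is a hypothetical name for illustration, not the actual data-diff API):

```python
from datetime import datetime

def normalize_timestamp(value: datetime) -> str:
    # Render a datetime as 'YYYY-MM-DD HH:MM:SS.FFFFFF' (hypothetical helper);
    # %f always produces exactly six fractional digits.
    return value.strftime("%Y-%m-%d %H:%M:%S.%f")

print(normalize_timestamp(datetime(2022, 6, 16, 12, 30, 0, 123)))
# 2022-06-16 12:30:00.000123
```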

We plan to make the 0.1 release sometime next week. I believe the pace of internal changes will slow down after that.

@erezsh
Contributor

erezsh commented Jun 27, 2022

We've released v0.1 and things should be a bit more stable from now on.

@pik94 pik94 force-pushed the support-databricks branch from 4ea25bb to 64b1be4 Compare July 2, 2022 13:18
@pik94 pik94 marked this pull request as draft July 2, 2022 13:20
@pik94 pik94 force-pushed the support-databricks branch from 64b1be4 to a8df33d Compare July 2, 2022 16:29
@pik94 pik94 force-pushed the support-databricks branch from a8df33d to c3f416e Compare July 6, 2022 13:18
@pik94 pik94 marked this pull request as ready for review July 6, 2022 13:22
@pik94 pik94 force-pushed the support-databricks branch from c3f416e to b48a703 Compare July 6, 2022 13:28
Contributor

@erezsh erezsh left a comment


Overall looks good!

Just a few comments.

Also, can you provide instructions on how to set up a Databricks server in order to test this?

# dsn user - access token
# dsn password - http path (starting with /)

return cls(http_path=dsn.password, access_token=dsn.user, server_hostname=dsn.hostname, **kw)
Contributor


I don't know how I feel about putting the http_path as the password.

Can you tell me why you chose this, instead of adding it to the path, or as a query argument?

Contributor Author

@pik94 pik94 Jul 6, 2022


adding it to the path

http_path starts with a / symbol, so adding it to the path leads to a parsing error. We could just tell users not to include the slash, but I bet the first run would almost always fail: the Databricks UI copies http_path with the slash, so people might be confused on their first attempt at diffing.

a query argument

I assumed that query params come across as optional, i.e. not required, so people might skip it on the first run, and that would lead to an error as well.

So, my concern is simply not to frustrate people on their first attempt at diffing.

Contributor


We can provide a good error, telling users that this parameter is necessary.

I think it's less confusing than using the password. Also, we're going to remove the password whenever we print the configuration to the logs. That means the http_path will appear as ****.

P.S. We can also ensure that it's formatted with only one leading slash; shouldn't be too hard to do.
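A sketch of that leading-slash normalization (hypothetical helper name, shown only to illustrate the idea):

```python
def normalize_http_path(raw: str) -> str:
    # Accept "sql/2.0/x", "/sql/2.0/x", or even "//sql/2.0/x" and
    # always return the path with exactly one leading slash.
    return "/" + raw.lstrip("/")

print(normalize_http_path("sql/2.0/abcdeabcde"))   # /sql/2.0/abcdeabcde
print(normalize_http_path("/sql/2.0/abcdeabcde"))  # /sql/2.0/abcdeabcde
```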

Contributor Author

@pik94 pik94 Jul 6, 2022


OK, in that case I'll update the URI to the format databricks://:access_token@account/http_path/catalog/schema.

Contributor


Maybe account should be where user is? Seems to make sense.

(I don't know databricks, do they have some convention about this?)

Contributor Author


I haven't heard of a convention for this, but I think we can use account as the user here; it looks pretty natural.

Contributor Author


I checked it more carefully. I think the easiest way is the following URI: databricks://http_path:access_token@server_name/catalog/schema.

An http_path looks like /sql/2.0/abcdeabcde, so parsing of databricks://server_name:access_token@http_path/catalog/schema leads to http_path=sql, catalog=2.0, schema=abcdeabcde. I think it's not critical to have http_path as the user. WDYT?
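The ambiguity can be demonstrated with Python's urllib, using the placeholder values from the comment above: because http_path itself contains slashes, the path segments cannot be split back into http_path, catalog, and schema unambiguously.

```python
from urllib.parse import urlparse

# http_path (/sql/2.0/abcdeabcde) embedded in the URI path: the parser
# has no way to tell where http_path ends and catalog/schema begin.
uri = "databricks://access_token@server_name/sql/2.0/abcdeabcde/catalog/schema"
segments = urlparse(uri).path.strip("/").split("/")
print(segments)  # ['sql', '2.0', 'abcdeabcde', 'catalog', 'schema']
```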


elif issubclass(type_cls, Decimal):
    # TYPE_NAME has the format DECIMAL(x,y)
    items = row.TYPE_NAME[8:].strip(')').split(',')
Contributor


Sorry for the nit-pick, but better to use rstrip()

Contributor Author


Np, changed it :)

@pik94 pik94 force-pushed the support-databricks branch from b48a703 to f074d1a Compare July 6, 2022 15:27
@pik94
Contributor Author

pik94 commented Jul 6, 2022

Also, can you provide instructions on how to set-up a databricks server in order to test this?

Oh, I'm not sure I know how to do that; I use our SaaS cluster. @glebmezh, can we give Erez access to our Databricks so that he can test this PR as well?

@glebmezh
Contributor

glebmezh commented Jul 7, 2022

@erezsh granted access

@pik94 pik94 force-pushed the support-databricks branch from f074d1a to f45598c Compare July 7, 2022 18:55
@erezsh
Contributor

erezsh commented Jul 11, 2022

@pik94 The tests are failing for numbers. They are expected to be 0.00000000..., but instead Databricks formats them as 0E-10.

For example:

======================================================================
test_types_mysql_num65_10_databricks_double_50 (tests.test_database_types.TestDiffCrossDatabaseTables)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\parameterized\parameterized.py", line 533, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "C:\code\datafold\xdiff\tests\test_database_types.py", line 630, in test_types
    self.assertEqual(expected, diff)
AssertionError: Lists differ: [] != [('-', ('1', '0.0000000000')), ('+', ('1',[116 chars]0'))]

Second list contains 6 additional elements.
First extra element 0:
('-', ('1', '0.0000000000'))

- []
+ [('-', ('1', '0.0000000000')),
+  ('+', ('1', '0E-10')),
+  ('-', ('14', '0.0000000000')),
+  ('+', ('14', '0E-10')),
+  ('-', ('15', '0.0000000000')),
+  ('+', ('15', '0E-10'))]
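This is standard Decimal behavior in Python: str() on a zero with a negative exponent yields scientific notation, while fixed-point formatting gives the expected form.

```python
from decimal import Decimal

d = Decimal("0E-10")
print(str(d))          # 0E-10         (default scientific notation)
print(format(d, "f"))  # 0.0000000000  (fixed-point, as the test expects)
```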

@erezsh
Contributor

erezsh commented Jul 11, 2022

@pik94 I also had a look at the URI, and I don't think it's a good idea to put the "http_path" as the user, because then you'll have to escape the slashes.

I think this should be the URI instead:

databricks://:access_token@server_name/http_path?schema=..&catalog=..

This also reflects that "schema" and "catalog" are optional parameters.
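Parsed with urllib, this format separates every component cleanly (the token, host, path, and query values below are placeholders):

```python
from urllib.parse import urlparse, parse_qs

uri = "databricks://:my_token@server_name/sql/2.0/abcdeabcde?schema=default&catalog=hive"
u = urlparse(uri)
print(u.password)         # my_token
print(u.hostname)         # server_name
print(u.path)             # /sql/2.0/abcdeabcde  (the http_path, slash intact)
print(parse_qs(u.query))  # {'schema': ['default'], 'catalog': ['hive']}
```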

I already have a patch for this, so if you agree, I'll push it on to this PR.

@JCZuurmond
Contributor

👋 Hi, y'all. Just here to say that I would like to try data-diff on Databricks. Happy to see this PR being merged.

@erezsh erezsh mentioned this pull request Jul 12, 2022
@erezsh erezsh merged commit 66b7e52 into datafold:master Jul 12, 2022