-
Notifications
You must be signed in to change notification settings - Fork 288
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks alright!
My only problem is with the changes to pyproject.toml.
pyproject.toml
Outdated
@@ -23,7 +23,7 @@ classifiers = [ | |||
packages = [{ include = "data_diff" }] | |||
|
|||
[tool.poetry.dependencies] | |||
python = "^3.7" | |||
python = "^3.7.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this change for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without it poetry install old versions of databricks-sql-connector (<2.0.2) and pandas (which is used by databricks), and it fails when running the diff. Increase of a patch version helped, and higher version of databricks and pandas are installed.
pyproject.toml
Outdated
@@ -38,9 +38,11 @@ snowflake-connector-python = { version = "*", optional = true } | |||
[tool.poetry.dev-dependencies] | |||
mysql-connector-python = "*" | |||
preql = "^0.2.14" | |||
pandas = "^1.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a lower version, the databricks connector fails because an old version of pandas (<1.3.*) when using. But maybe it can be removed because if databricks has a good version (2.0.2), pandas should have an appropriate good version as well. I'll check if it can be removed.
Also, needs to be rebased over |
@erezsh I think we'll also need the rounding support in here, no? |
If we decide to make it mandatory for implementing a Database, yes. |
b184294
to
4ea25bb
Compare
@erezsh , I resolved but I see that a new version failed because it could not compare object of ColumnName class. I've added a fix in this PR for it. |
Hi, I'm sorry but a lot has changed in the last week.
We plan to make the 0.1 release sometime next week. I believe the pace of internal changes will slow down after that. |
We've released v0.1 and things should be a bit more stable from now on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good!
Just a few comments.
Also, can you provide instructions on how to set-up a databricks server in order to test this?
data_diff/databases/connect.py
Outdated
# dsn user - access token | ||
# sdn password - http path (starting with /) | ||
|
||
return cls(http_path=dsn.password, access_token=dsn.user, server_hostname=dsn.hostname, **kw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how I feel about putting the http_path
as the password.
Can you tell me why you chose this, instead of adding it to the path, or as a query argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding it to the path
http_path starts with /
symbol so adding it to the path leads to an error when parsing. We can just notify users no to use the slash in their path but I bet what the first run will be always failed because from Databricks UI http_path is copied with the slash so people might be confused during their first attempt of diffing.
a query argument
I assumed that query params in terms of usage look like additional i.e. not necessarily so people might skip it for the first run, and it'll lead to an error as well.
So, my concern is just not to disappoint people in their first attempt of diffing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can provide a good error, telling users that this parameter is necessary.
I think it's less confusing than using the password. Also, we're going to remove the password whenever we print the configuration to the logs. That means the http_path will appear as ****
.
P.S. we can also ensure that it's formatted with only one leading slash. shouldn't be too hard to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll update the uri with a format databricks://:access_token@account/http_path/catalog/schema
in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe account should be where user is? Seems to make sense.
(I don't know databricks, do they have some convention about this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not heard about convention regarding this but I think we can use account as a user here, it looks pretty natural.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked it more carefully. I think the easiest way is to have the next uri
: databricks://http_path:access_token@server_name/catalog/schema
.
Http path looks like /sql/2.0/abcdeabcde
so when parsing ofdatabricks://server_name:access_token@http_path/catalog/schema
leads to http_path=sql
, catalog=2.0
, schema=abcdeabcde
. I think it's not critical to have http_path
as a user, WDYT?
data_diff/databases/databricks.py
Outdated
|
||
elif issubclass(type_cls, Decimal): | ||
# TYPE_NAME has a format DECIMAL(x,y) | ||
items = row.TYPE_NAME[8:].strip(')').split(',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the nit-pick, but better to use rstrip()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Np, changed it.)
Oh, I'm not sure that I know how to do that. I use our saas cluster. @glebmezh can we give access to Erez for our Databricks so that he can test this PR as well? |
@erezsh granted access |
@pik94 The tests are failing for numbers. They are expected to be For example:
|
@pik94 I also had a look at the URI, and I don't think it's a good idea to put the "http_path" as the user, because then you'll have to escape the slashes. I think this should be the URI instead:
This also reflects that "schema" and "catalog" are optional parameters. I already have a patch for this, so if you agree, I'll push it on to this PR. |
👋 Hi, y'all. Just here to say that I would like to try |
Add test support of Databricks connector.