Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.

Presto snowflake enhancement #187

Merged

Conversation

matthiasekundayo-eb
Copy link

@matthiasekundayo-eb matthiasekundayo-eb commented Jul 29, 2022

Changes entail:
Snowflake:
Allow auth by PKCS8 key for Snowflake
User private key specification

Presto:
Allow verify http session with certificate for Presto
Allow using basic authentication for Presto
Presto Port specification
Presto HTTP scheme specification
Presto certificate specification

For Testing:
Snowflake:
Create a Snowflake free trial account here,

Run the following in your local machine via CLI to generate a new SSH key pair:

cd ~/.ssh
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out snowflake_rsa_key.p8 -nocrypt
openssl rsa -in snowflake_rsa_key.p8  -pubout -out snowflake_rsa_key.pub
cat snowflake_rsa_key.pub | sed '1d;$d' | tr -d "\n" | pbcopy
cat snowflake_rsa_key.p8 | tr '\n' '|' | sed 's/\|/\\n/g' | pbcopy

Then take what is on your Clipboard, and in Snowflake, associate it to a user you’re trying to log in as:
alter user <SNOWFLAKE_USER> set rsa_public_key='FILL_IN_VALUE_FROM_CLIPBOARD'

Presto:

  1. Run the following on your local via CLI
openssl genrsa -des3 -out example.com.key 2048
openssl req -new -key example.com.key -out example.com.csr  #Provide 'localhost' as Common Name
cp example.com.key example.com.key.org
openssl rsa -in example.com.key.org -out example.com.key

Edit v3.ext file
vi v3.ext

Put the following in file v3.ext:

subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid:always,issuer:always
basicConstraints = CA:TRUE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment, keyAgreement, keyCertSign
subjectAltName = DNS:localhost
issuerAltName = issuer:copy

Run:
openssl x509 -req -in example.com.csr -signkey example.com.key -out example.com.crt -days 3650 -sha256 -extfile v3.ext

  1. Combine cert and private keys into a single file:
    cat example.com.crt example.com.key > combined.pem

  2. Put that combined.pem into dev/presto-conf/standalone/

  3. Put the example.com.crt into the data-diff project root

  4. Creat a new user password file with the following , providing test as the password as well

htpasswd -c password.db test
htpasswd -B -C 10 password.db test
  1. Put that password.db under dev/presto-conf/standalone/

  2. Create a new dev/presto-conf/standalone/password-authenticator.properties file with the following contents:

password-authenticator.name=file
file.password-file=/opt/presto/etc/password.db
  1. Add the following to dev/presto-conf/standalone/config.properties:
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/opt/presto/etc/combined.pem
http-server.authentication.type=PASSWORD
  1. Add this line to docker-compose.yml under ‘ports’ for ‘presto’
    - '8443:8443'

After the above steps are completed, mount the data-diff directory in a docker container, load sample data and test using a config.toml file like below:

[database.presto]
driver = "presto"
host = "<presto hostname>"
port = "<presto port>"
user = "<presto user>"
password = "<presto password>"
catalog = "<presto catalog>"
schema = "<presto schema>"
cert = "<path/to/presto/certificate.cer>" (this is the certificate from step 4 above)
auth = "basic"
http_scheme = "https"
source = "odbc"
​
[database.snowflake]
driver = "snowflake"
account = "snowflake account name"
database = "snowflake database name"
user = "<snowflake user name>"
warehouse = "<snowflake warehouse>"
role = "<snowflake role>"
schema = "<snowflake schema>"
key = "<path/to/snowflake/ssh/key.p8>" e.g. "/.ssh/snowflake_rsa_key.p8"
​
​
# Specify the default run params
[run.default]
verbose = true
​
# Specify params for a run 'test_diff'.
[run.test_diff]
verbose = false
# Source 1 ("left")
1.database = "presto"
1.table = "<table name>"
​
# Source 2 ("right")
2.database = "snowflake"
2.table = "<schema name>.<table name>" (case sensitive)

Then run data diff in the CLI using:
poetry run python3 -m data_diff --conf <path/to/config.toml/file> --run test_diff -v
E.g.
poetry run python3 -m data_diff --conf data_diff/config.toml --run test_diff -v

Copy link
Contributor

@nolar nolar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and for the detailed instruction on testing. I have a few moments to clarify, but generally, it looks good.

Copy link
Contributor

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly style comments.

Please try to reduce the amount of ifs and repetitive lines.

Bad arguments should fail explicitly rather than silently.

kw.pop("password")

if "cert" in kw: # if a certificate was specified in URI, verify session with cert
cert = kw.get("cert")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

cert = kw.pop("cert")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have corrected this

self.default_schema = kw.get("schema")

if (
"password" in kw and "auth" in kw and kw.get("auth") == "basic"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If auth == 'basic', the lack of user/password should throw an exception, rather than fail silently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have implemented this

format=serialization.PrivateFormat.PKCS8,
encryption_algorithm=serialization.NoEncryption(),
)
self._conn = snowflake.connector.connect(private_key=pkb, schema=f'"{schema}"', **kw) # replaces password
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you change this to

kw['private_key'] = pkb

There's no need to duplicate the connect() function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have implemented this

Copy link
Contributor

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More remarks regarding style

kw.get("auth") == "basic" and "user" in kw and "password" in kw
# if auth=basic, add basic authenticator for Presto
kw["auth"] = prestodb.auth.BasicAuthentication(kw.pop("user"), kw.pop("password"))
except:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please catch specific exceptions. I believe this one should be KeyError ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it from ValueError to KeyError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You misunderstood. My problem was with your use of except:.

Raising ValueError was actually the better raising KeyError.


try:
# checks if user and password are missing when auth=basic
kw.get("auth") == "basic" and "user" in kw and "password" in kw
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't do anything, won't even throw an exception.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It throws an exception if user or password is missing. I tested it

This is after replacing ValueError with KeyError:

try:
            # checks if user and password are missing when auth=basic
            kw.get("auth") == "basic" and "user" in kw and "password" in kw
            # if auth=basic, add basic authenticator for Presto
            kw["auth"] = prestodb.auth.BasicAuthentication(kw.pop("user"), kw.pop("password"))
        except:
            raise KeyError("User or password cannot be missing if auth==basic")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It throws an exception if user or password is missing. I tested it

The second line does. The first line doesn't do anything.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean having the code like this instead:

if kw.get("auth") == "basic":
            kw["auth"] = prestodb.auth.BasicAuthentication(kw.pop("user"), kw.pop("password"))

So that it throws error if user or password is missing?

Copy link
Contributor

@erezsh erezsh Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's more in line with what I'm thinking.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I have corrected that now

@erezsh
Copy link
Contributor

erezsh commented Aug 9, 2022

Hi @matthiasekundayo-eb , I tried to follow your instructions for setup on Snowflake, but got stuck. Any chance you can walk me through it live? On slack/discord/zoom/hangouts, or anywhere else.

@matthiasekundayo-eb
Copy link
Author

Hi @erezsh Will you be available anytime later today? I am on EST timezone.

@matthiasekundayo-eb
Copy link
Author

@erezsh Are you in Isreal?
If yes, then let's do 11:30am EST on Thursday Aug 11th or 11:30am EST on Friday Aug 12th.
Will any of those times work for you?

@erezsh
Copy link
Contributor

erezsh commented Aug 9, 2022

@matthiasekundayo-eb I'm in Amsterdam, but close enough. These dates/hours are fine. Where can I reach you?

@matthiasekundayo-eb
Copy link
Author

@erezsh I just sent over a Google Meet invite to your inbox.

Here is also a copy of the same invite:
Data-diff: Snowflake Setup Walk-Through
Thursday, August 11 · 11:30am – 12:00pm EST
Google Meet joining info
Video call link: https://meet.google.com/xhq-kyzk-swk
Or dial: ‪(CA) +1 778-729-8580‬ PIN: ‪323 066 228‬#
More phone numbers: https://tel.meet/xhq-kyzk-swk?pin=5417485608915

@erezsh
Copy link
Contributor

erezsh commented Aug 11, 2022

@matthiasekundayo-eb Okay, I tested Snowflake and it works with the key argument. However, when I also provide a password, I get -

ImportError: Failed to import test module: tests.test_database_types
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\cryptography\utils.py", line 38, in _check_byteslike
    memoryview(value)
TypeError: memoryview: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python310\lib\unittest\loader.py", line 436, in _find_test_path
    module = self._get_module_from_name(name)
  File "C:\Python310\lib\unittest\loader.py", line 377, in _get_module_from_name
    __import__(name)
  File "C:\code\datafold\xdiff\tests\test_database_types.py", line 30, in <module>
    CONNS = {k: db.connect.connect(v, N_THREADS) for k, v in CONN_STRINGS.items()}
  File "C:\code\datafold\xdiff\tests\test_database_types.py", line 30, in <dictcomp>
    CONNS = {k: db.connect.connect(v, N_THREADS) for k, v in CONN_STRINGS.items()}
  File "C:\code\datafold\xdiff\data_diff\databases\connect.py", line 204, in connect
    return connect_with_dict(db_conf, thread_count)
  File "C:\code\datafold\xdiff\data_diff\databases\connect.py", line 173, in connect_with_dict
    return cls(**d)
  File "C:\code\datafold\xdiff\data_diff\databases\snowflake.py", line 44, in __init__
    p_key = serialization.load_pem_private_key(
  File "C:\Python310\lib\site-packages\cryptography\hazmat\primitives\serialization\base.py", line 22, in load_pem_private_key
    return ossl.load_pem_private_key(data, password)
  File "C:\Python310\lib\site-packages\cryptography\hazmat\backends\openssl\backend.py", line 823, in load_pem_private_key
    return self._load_key(
  File "C:\Python310\lib\site-packages\cryptography\hazmat\backends\openssl\backend.py", line 1042, in _load_key
    utils._check_byteslike("password", password)
  File "C:\Python310\lib\site-packages\cryptography\utils.py", line 40, in _check_byteslike
    raise TypeError("{} must be bytes-like".format(name))
TypeError: password must be bytes-like

I think it should be able to accept string passwords, no?

@erezsh
Copy link
Contributor

erezsh commented Aug 11, 2022

@matthiasekundayo-eb For Presto, I'm getting this error:

  File "C:\Python310\lib\site-packages\parameterized\parameterized.py", line 533, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "C:\code\datafold\xdiff\tests\test_database_types.py", line 569, in test_types
    _drop_table_if_exists(dst_conn, dst_table)
  File "C:\code\datafold\xdiff\tests\common.py", line 98, in _drop_table_if_exists
    conn.query(f"DROP TABLE IF EXISTS {table}", None)
  File "C:\code\datafold\xdiff\data_diff\databases\base.py", line 101, in query
    res = self._query(sql_code)
  File "C:\code\datafold\xdiff\data_diff\databases\presto.py", line 65, in _query
    c = self._conn.cursor()
  File "C:\Users\erez\AppData\Roaming\Python\Python310\site-packages\prestodb\dbapi.py", line 166, in cursor
    request = self._create_request()
  File "C:\Users\erez\AppData\Roaming\Python\Python310\site-packages\prestodb\dbapi.py", line 141, in _create_request
    return prestodb.client.PrestoRequest(
  File "C:\Users\erez\AppData\Roaming\Python\Python310\site-packages\prestodb\client.py", line 256, in __init__
    raise ValueError("cannot use authentication with HTTP")
ValueError: cannot use authentication with HTTP

@matthiasekundayo-eb
Copy link
Author

Hi @erezsh,
For Snowflake, you can't use password and key at the same time to connect. You can either use basic auth (username & password) or use username & ssh.

For Presto, did you use http_scheme = "https"?

@erezsh
Copy link
Contributor

erezsh commented Aug 23, 2022

Hi @matthiasekundayo-eb , when I add http_scheme=https I get the following error in Python:

requests.exceptions.SSLError: HTTPSConnectionPool(host='127.0.0.1', port=8080): Max retries exceeded with url: /v1/statement (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

On the server side, I'm getting this error repeatedly:

presto_1    | 2022-08-23T09:51:55.588Z  WARN    http-worker-137 org.eclipse.jetty.util.SharedBlockingCallback   Blocker not complete Blocker@396571e4{null}
presto_1    | 2022-08-23T09:51:55.588Z  WARN    http-worker-137 org.eclipse.jetty.util.thread.strategy.EatWhatYouKill
presto_1    | java.lang.NullPointerException
presto_1    |   at io.airlift.http.server.HttpServerChannelListener.onRequestEnd(HttpServerChannelListener.java:62)
presto_1    |   at io.airlift.http.server.HttpServerChannelListener.onResponseBegin(HttpServerChannelListener.java:70)
presto_1    |   at org.eclipse.jetty.server.HttpChannelListeners.onResponseBegin(HttpChannelListeners.java:187)
presto_1    |   at org.eclipse.jetty.server.HttpChannel.sendResponse(HttpChannel.java:814)
presto_1    |   at org.eclipse.jetty.server.HttpChannel.sendResponse(HttpChannel.java:842)
presto_1    |   at org.eclipse.jetty.server.HttpChannel.onBadMessage(HttpChannel.java:775)
presto_1    |   at org.eclipse.jetty.server.HttpChannelOverHttp.badMessage(HttpChannelOverHttp.java:283)
presto_1    |   at org.eclipse.jetty.http.HttpParser.badMessage(HttpParser.java:1628)
presto_1    |   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1610)
presto_1    |   at org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:364)
presto_1    |   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:261)
presto_1    |   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
presto_1    |   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
presto_1    |   at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
presto_1    |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
presto_1    |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
presto_1    |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
presto_1    |   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
presto_1    |   at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
presto_1    |   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
presto_1    |   at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
presto_1    |   at java.base/java.lang.Thread.run(Thread.java:829)

Any idea what's causing it? It only happens for https, it seems.

As for snowflake, no problem, I can fix the error myself.

@matthiasekundayo-eb
Copy link
Author

matthiasekundayo-eb commented Aug 23, 2022

Hi @erezsh Try using port = "8443" in the config.toml file instead of 8080.
Could you please also share your config file with the password redacted?

@erezsh
Copy link
Contributor

erezsh commented Aug 23, 2022

Hi @matthiasekundayo-eb , the port helped, thanks! Now I'm getting prestodb.exceptions.HttpError: error 401: b'Access Denied: Invalid credentials'

I don't really know why, but tomorrow I'll try to redo the process as you described it, from scratch, just in case I made a mistake on the way.

@erezsh
Copy link
Contributor

erezsh commented Aug 24, 2022

@matthiasekundayo-eb I got it working. I think maybe I missed putting 'localhost' as Common Name. Either way, I think it's ready to merge. I will try to do so later today.

erezsh added a commit that referenced this pull request Aug 24, 2022
@erezsh erezsh merged commit fc5b6cf into datafold:master Aug 24, 2022
@erezsh
Copy link
Contributor

erezsh commented Aug 24, 2022

Merged. Thank you for your contribution, and sorry it took a while!

@matthiasekundayo-eb
Copy link
Author

matthiasekundayo-eb commented Aug 24, 2022

Thanks @erezsh

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants