Data-diff fails for Motherduck connections when using multiple databases as environments #858
Description
Describe the bug
When connecting to Motherduck, query_table_schema()
throws an assertion error if the table path exists in more than 1 database.
I have a motherduck project with multiple databases, 1 per environment. I use these environments to materialize dbt models. In this example, I have a schema called dbt_marts
with a table called orders
. This exists in 3 databases, namely production
, development
, and local_test
. The schema in all 3 tables is exactly the same.
I came across this bug when experimenting with data-diff.
data-diff-config.toml
:
[database.motherduck]
driver = "duckdb"
filepath = "md:production?motherduck_token=${MOTHERDUCK_TOKEN}"
[run.test]
# Source 1 ("left")
1.database = "motherduck"
1.table = "dbt_marts.orders"
# Source 1 ("right")
2.database = "motherduck"
2.table = "dbt_marts.orders"
then I ran the command: data-diff --conf data-diff-config.toml --run test2
which throws:
Traceback (most recent call last):
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/bin/data-diff", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 344, in main
return _data_diff(
^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 482, in _data_diff
schemas = list(differ._thread_map(_get_schema, safezip(dbs, table_paths)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 77, in _get_schema
return db.query_table_schema(table_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/databases/base.py", line 1042, in query_table_schema
assert len(d) == len(rows)
AssertionError
I traced it back to this line. It takes the output of the information_schema query over here and converts it to a dictionary. This line verifies that the number of rows in the dictionary (representing columns in the schema) is the same as the number of rows in the original result.
This fails in my case, because the information_schema returns the columns from all that meet WHERE table_name = '{name}' AND table_schema = '{schema}'
, regardless of the database in which they're in. During the dictionary conversion however, the set of rows is reduced to unique ones, so the assertion fails.
Not sure how this should be fixed. Not sure if it's expected that the information_schema returns columns from tables in other databases than the one specified in the connection parameters. One way to avoid this error is modifying the where clause as follows:
WHERE table_name = '{name}' AND table_schema = '{schema}' AND table_catalog = '[database-name]'
Describe the environment
Running MacOS with data-diff version v0.10.1 and duckdb 0.9.2