Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.
This repository was archived by the owner on May 17, 2024. It is now read-only.

Data-diff fails for Motherduck connections when using multiple databases as environments #858

Closed
@rob-teeuwen

Description

@rob-teeuwen

Describe the bug
When connecting to Motherduck, query_table_schema() throws an assertion error if the table path exists in more than 1 database.

I have a motherduck project with multiple databases, 1 per environment. I use these environments to materialize dbt models. In this example, I have a schema called dbt_marts with a table called orders. This exists in 3 databases, namely production, development, and local_test. The schema in all 3 tables is exactly the same.

I came across this bug when experimenting with data-diff.

data-diff-config.toml:

[database.motherduck]
  driver = "duckdb"
  filepath = "md:production?motherduck_token=${MOTHERDUCK_TOKEN}" 

[run.test]
  # Source 1 ("left")
  1.database = "motherduck"
  1.table = "dbt_marts.orders"

  # Source 1 ("right")
  2.database = "motherduck"
  2.table = "dbt_marts.orders"

then I ran the command: data-diff --conf data-diff-config.toml --run test2

which throws:

Traceback (most recent call last):
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/bin/data-diff", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 344, in main
    return _data_diff(
           ^^^^^^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 482, in _data_diff
    schemas = list(differ._thread_map(_get_schema, safezip(dbs, table_paths)))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 77, in _get_schema
    return db.query_table_schema(table_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/databases/base.py", line 1042, in query_table_schema
    assert len(d) == len(rows)
AssertionError

I traced it back to this line. It takes the output of the information_schema query over here and converts it to a dictionary. This line verifies that the number of rows in the dictionary (representing columns in the schema) is the same as the number of rows in the original result.

This fails in my case, because the information_schema returns the columns from all that meet WHERE table_name = '{name}' AND table_schema = '{schema}', regardless of the database in which they're in. During the dictionary conversion however, the set of rows is reduced to unique ones, so the assertion fails.

Not sure how this should be fixed. Not sure if it's expected that the information_schema returns columns from tables in other databases than the one specified in the connection parameters. One way to avoid this error is modifying the where clause as follows:

WHERE table_name = '{name}' AND table_schema = '{schema}' AND table_catalog = '[database-name]'

Describe the environment

Running MacOS with data-diff version v0.10.1 and duckdb 0.9.2

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingtriage

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions