Refactor catalog for handling disjoint intervals of data #2594

Closed
wants to merge 6 commits into from

faysou (Collaborator) commented May 3, 2025

Pull Request

  • Refactor catalog for handling disjoint intervals of data

    • Use closed time intervals to represent data contained in a catalog for a given data type
    • Encode the time range covered by a parquet file in its file name
    • Save new data only by creating a new file
    • Possibility to consolidate all files into a single file, or only a given time range into a single file, for maintenance purposes
    • Possibility to reset the file names of an existing catalog to the new format by reading the min and max timestamps of each parquet file and renaming the file accordingly
    • Where possible, take empty intervals into account in file names in order to create contiguous files (timestamps are integers, so contiguity is well defined); this avoids repeated queries for periods with no data, for example weekends
    • Use the timestamp information of files to read only the minimum number of parquet files required for a given catalog query
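The file-selection idea in the bullets above can be sketched roughly as follows. This is only an illustration: the `{start}_{end}.parquet` naming scheme and the names `Interval`, `interval_to_filename`, `filename_to_interval`, `files_for_query` and `are_contiguous` are assumptions made for this example, not the format or API used in this PR.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Closed interval [start, end] of integer timestamps (e.g. nanoseconds)."""
    start: int
    end: int


def interval_to_filename(iv: Interval) -> str:
    # Hypothetical naming scheme; the actual format in the PR may differ.
    return f"{iv.start}_{iv.end}.parquet"


def filename_to_interval(name: str) -> Interval:
    # Recover the closed interval from a name like "100_199.parquet".
    stem = name.removesuffix(".parquet")
    start, end = stem.split("_")
    return Interval(int(start), int(end))


def files_for_query(names: list[str], start: int, end: int) -> list[str]:
    """Return only the files whose closed interval intersects [start, end]."""
    selected = []
    for name in names:
        iv = filename_to_interval(name)
        # Two closed intervals intersect when each starts before the other ends.
        if iv.start <= end and start <= iv.end:
            selected.append(name)
    return sorted(selected, key=lambda n: filename_to_interval(n).start)


def are_contiguous(a: Interval, b: Interval) -> bool:
    """With integer timestamps, [0, 99] and [100, 199] leave no gap."""
    return a.end + 1 == b.start
```

With files named this way, a query for a time range only needs to open the files returned by `files_for_query`, without reading any parquet metadata first.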
  • Improve data engine to handle several market data queries and one catalog query

    • Also ensure that data received from a client or catalog falls within the bounds intended by a request
    • Data concatenated from several requests is then sorted
    • The code now assumes that a given symbol is in only one catalog at a time, which simplifies the implementation
    • When data is present in a catalog, only the minimum amount of data is requested from a live market data client, which is economical for paid historical data
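As a rough illustration of the points above, the sketch below computes which parts of a request are not covered by the catalog (and so would still have to be requested from a market data client), then clamps and sorts the combined records. The function names `missing_intervals` and `clamp_and_sort`, and the `(timestamp, payload)` record shape, are hypothetical and not taken from the PR.

```python
def missing_intervals(
    request: tuple[int, int],
    available: list[tuple[int, int]],
) -> list[tuple[int, int]]:
    """Closed integer sub-intervals of `request` not covered by `available`."""
    req_start, req_end = request
    gaps = []
    cursor = req_start  # first timestamp not yet known to be covered
    for start, end in sorted(available):
        if end < cursor:
            continue  # lies entirely before the uncovered region
        if start > req_end:
            break  # lies entirely after the request
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= req_end:
        gaps.append((cursor, req_end))
    return gaps


def clamp_and_sort(
    records: list[tuple[int, str]],
    start: int,
    end: int,
) -> list[tuple[int, str]]:
    """Drop records outside [start, end], then sort by timestamp."""
    kept = [r for r in records if start <= r[0] <= end]
    return sorted(kept, key=lambda r: r[0])


# Catalog already holds [10, 20] and [30, 40]; the request is [0, 100].
gaps = missing_intervals((0, 100), [(10, 20), (30, 40)])
# Only `gaps` would need to be requested from the paid data client.
```

Because the intervals are closed and the timestamps are integers, each gap starts one past the end of the previous covered interval and ends one before the start of the next.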
  • Improve handling of custom data in data requests and data subscriptions by better taking into account optional instrument_id

  • Migrate catalog implementation to Rust as well

  • Libraries used for intervals:

Type of change

  • New feature (non-breaking change which adds functionality)

How has this change been tested?

Existing tests are passing
Added tests for the new catalog features

@faysou faysou force-pushed the parquet_intervals branch 24 times, most recently from 89e2b6a to 3f64ab6 Compare May 10, 2025 16:27
@faysou faysou force-pushed the parquet_intervals branch 6 times, most recently from 0c5ed37 to 247d404 Compare May 13, 2025 08:00
@faysou faysou force-pushed the parquet_intervals branch from dd4bf1f to b31337d Compare May 16, 2025 10:36