Improve DataFrame.select_dtypes scaling to wide data frames #28317

Closed
@TomAugspurger

Timing select_dtypes on data frames with a fixed number of rows and a varying number of columns.

import numpy as np
import pandas as pd
from timeit import default_timer as tic

# Number of columns in each test frame; the row count stays fixed at 10.
ns = [0, 10, 100, 1_000, 10_000]
times = []

for n in ns:
    df = pd.DataFrame(np.random.randn(10, n))
    t0 = tic()
    df.select_dtypes(include='int')
    t1 = tic()

    times.append([t1 - t0])

# Elapsed seconds per call, indexed by column count.
results = pd.DataFrame(times, columns=['include'], index=ns)
results.plot()

The runtime looks O(n) in the number of columns. I think that can be improved (to whatever the cost of a set intersection is).
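One way to read the set-intersection idea is to test each distinct dtype once instead of testing every column. The sketch below is only an illustration under that assumption: `select_by_unique_dtype` is a hypothetical helper, not pandas API, and it only accepts NumPy abstract scalar types (such as `np.integer`) for `include`. Building the column mask still touches every column, but the per-column work shrinks to a dictionary lookup.

```python
import numpy as np
import pandas as pd


def select_by_unique_dtype(df, include):
    """Hypothetical helper: keep columns whose dtype matches `include`.

    `include` is a tuple of NumPy abstract scalar types, e.g. (np.integer,).
    """
    dtypes = df.dtypes
    # One issubclass check per distinct dtype, however many columns share it.
    verdict = {dt: issubclass(dt.type, include) for dt in dtypes.unique()}
    # Per-column work is reduced to a cheap dictionary lookup.
    mask = np.fromiter((verdict[dt] for dt in dtypes), dtype=bool, count=len(dtypes))
    return df.loc[:, mask]


example = pd.DataFrame({"a": [1, 2], "b": [1.0, 2.0], "c": [3, 4]})
print(list(select_by_unique_dtype(example, (np.integer,)).columns))
```

A frame with 10,000 float columns has a single distinct dtype, so the expensive type check would run once rather than 10,000 times.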

Edit: maybe it's O(log(n)), I never took CS :)
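To tell linear growth from something slower without eyeballing the plot, one option is to fit a straight line to the timings on a log-log scale. This is a rough sketch, not part of the original benchmark: `scaling_exponent` is a hypothetical helper, and it drops the n = 0 point because its logarithm is undefined.

```python
import numpy as np


def scaling_exponent(ns, times):
    """Hypothetical helper: slope of log(time) against log(n)."""
    ns = np.asarray(ns, dtype=float)
    times = np.asarray(times, dtype=float)
    # Logarithms need strictly positive inputs, so skip n = 0 and zero timings.
    keep = (ns > 0) & (times > 0)
    slope, _intercept = np.polyfit(np.log(ns[keep]), np.log(times[keep]), 1)
    return slope
```

A slope near 1 suggests O(n) and a slope near 2 suggests O(n^2). Logarithmic growth would not give a constant slope; the local slope would keep shrinking toward 0 as n grows. Wall-clock timings are noisy, so repeating each measurement before fitting would make the estimate more trustworthy.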

Labels: Dtype Conversions (unexpected or buggy dtype conversions), Performance (memory or execution speed performance)
