[BUG] RandomUnderSampler throws errors if pandas DataFrame has timestamps #970

Closed
tomateit opened this issue Feb 19, 2023 · 1 comment · Fixed by #1004

tomateit commented Feb 19, 2023

Describe the bug

RandomUnderSampler validates the X argument, but these checks are unnecessary because the contents of X do not affect which indices are resampled.
This becomes a problem when I pass a pandas DataFrame that contains a timestamp column.
The exception is not raised if I pass a NumPy array with timestamps.

Steps/Code to Reproduce

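from datetime import datetime

import imblearn.under_sampling
import pandas as pd

# One integer column and one datetime column: the mix that triggers the error.
df = pd.DataFrame({"label": [0, 0, 0, 1], "td": [datetime.now()] * 4})
rus = imblearn.under_sampling.RandomUnderSampler(random_state=2342374)
rus.fit_resample(df, df.label)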

Expected Results

No error is thrown.

Actual Results

TypeError: The DType <class 'numpy.dtype[int64]'> could not be promoted by <class 'numpy.dtype[datetime64]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[int64]'>, <class 'numpy.dtype[datetime64]'>)
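
The message reads like a NumPy dtype promotion failure, presumably raised when the DataFrame's column dtypes are combined during input validation (the exact call site is an assumption here, not something stated in the report). A minimal sketch that reproduces the same kind of failure with NumPy alone, using the two column dtypes from the example above:

```python
import numpy as np

# The two column dtypes of the DataFrame in the report.
label_dtype = np.dtype("int64")
td_dtype = np.dtype("datetime64[ns]")

try:
    # Ask NumPy for a single dtype that can hold both columns.
    np.result_type(label_dtype, td_dtype)
    promotion_failed = False
except TypeError as exc:
    promotion_failed = True
    print(f"TypeError: {exc}")

# An object array can hold both kinds of values, which is consistent with
# the report that passing a NumPy object array does not raise.
mixed = np.array([[0, np.datetime64("2023-02-19")]], dtype=object)
print(mixed.dtype)
```

This only illustrates the dtype conflict; it does not call imblearn or scikit-learn.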

Versions

Linux-5.15.0-60-generic-x86_64-with-glibc2.35
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
NumPy 1.24.1
SciPy 1.9.3
Scikit-Learn 1.2.1
Imbalanced-Learn 0.10.0

My current workaround

from datetime import datetime

import imblearn.under_sampling
import pandas as pd

df = pd.DataFrame({"label": [0, 0, 0, 1], "td": [datetime.now()] * 4})

rus = imblearn.under_sampling.RandomUnderSampler(random_state=2342374)

# Convert to a NumPy object array first, then rebuild the DataFrame.
downsampled_df, _ = rus.fit_resample(df.to_numpy(), df.label)
downsampled_df = pd.DataFrame(downsampled_df, columns=df.columns)

P.S. Huge thanks for this useful library.

@glemaitre
Member

The proposal is reasonable; we just have to check whether it can be made compatible with check_estimator from scikit-learn. A PR would be welcome.
