DOC: add pandas 3.0 migration guide for the string dtype #61705
@@ -87,5 +87,6 @@ Guides
    enhancingperf
    scale
    sparse
    migration-3-strings
    gotchas
    cookbook

@@ -0,0 +1,272 @@
{{ header }}

.. _string_migration_guide:

=========================================================
Migration guide for the new string data type (pandas 3.0)
=========================================================

The upcoming pandas 3.0 release introduces a new, default string data type. This
will most likely cause some work when upgrading to pandas 3.0, and this page
provides an overview of the issues you might run into and gives guidance on how
to address them.

This new dtype is already available in the pandas 2.3 release, and you can
enable it with:

.. code-block:: python

    pd.options.future.infer_string = True

This allows you to test your code before the final 3.0 release.

Background
----------

Historically, pandas has always used the NumPy ``object`` dtype as the default
to store text data. This has two primary drawbacks. First, ``object`` dtype is
not specific to strings: any Python object can be stored in an ``object``-dtype
array, not just strings, and seeing ``object`` as the dtype for a column with
strings is confusing for users. Second, this is not always very efficient (both
in terms of performance and memory usage).

Since pandas 1.0, an opt-in string data type has been available, but this has
not yet been made the default, and uses the ``pd.NA`` scalar to represent
missing values.

Pandas 3.0 changes the default dtype for strings to a new string data type,
a variant of the existing optional string data type but using ``NaN`` as the
missing value indicator, to be consistent with the other default data types.

To improve performance, the new string data type will use the ``pyarrow``
package by default, if installed (and otherwise it uses object dtype under the
hood as a fallback).

See `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
for more background and details.

.. - brief primer on the new dtype

.. - Main characteristics:
.. - inferred by default (Default inference of a string dtype)
.. - only strings (setitem with non string fails)
.. - missing values sentinel is always NaN and uses NaN semantics

.. - Breaking changes:
.. - dtype is no longer object dtype
.. - None gets coerced to NaN
.. - setitem raises an error for non-string data

.. Review comment: the above is not rendered?
.. Reply: No, these are comments; it was my outline when writing it (can remove
   this in the end).
.. Reply: no problem.

Brief intro to the new default string dtype
-------------------------------------------

By default, pandas will infer this new string dtype instead of object dtype for
string data (when creating pandas objects, such as in constructors or IO
functions).

Being a default dtype means that the string dtype will be used in IO methods or
constructors when the dtype is being inferred and the input is inferred to be
string data:

.. code-block:: python

    >>> pd.Series(["a", "b", None])
    0      a
    1      b
    2    NaN
    dtype: str

It can also be specified explicitly using the ``"str"`` alias:

.. code-block:: python

    >>> pd.Series(["a", "b", None], dtype="str")
    0      a
    1      b
    2    NaN
    dtype: str

In contrast to the current object dtype, the new string dtype will only store
strings. This also means that it will raise an error if you try to store a
non-string value in it (see below for more details).

Missing values with the new string dtype are always represented as ``NaN``, and
the missing value behaviour is similar to that of other default dtypes.

Otherwise, this new string dtype should work the same way you have been using
pandas with string data today. For example, all string-specific methods
through the ``str`` accessor will work the same:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser.str.upper()
    0      A
    1      B
    2    NaN
    dtype: str

.. note::

   The new default string dtype is an instance of the :class:`pandas.StringDtype`
   class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``,
   but for general usage we recommend using the shorter ``"str"`` alias.

Overview of behaviour differences and how to address them
---------------------------------------------------------

The dtype is no longer object dtype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When inferring string data, the data type of the resulting DataFrame column or
Series will silently start being the new ``"str"`` dtype instead of ``"object"``
dtype, and this can have some impact on your code.

Checking the dtype
^^^^^^^^^^^^^^^^^^

When checking the dtype, code might currently do something like:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", "c"])
    >>> ser.dtype == "object"

to check for columns with string data (by checking for the dtype being
``"object"``). This will no longer work in pandas 3+, since ``ser.dtype`` will
now be ``"str"`` with the new default string dtype, and the above check will
return ``False``.

To check for columns with string data, you should instead use:

.. code-block:: python

    >>> ser.dtype == "str"

**How to write compatible code?**

For code that should work on both pandas 2.x and 3.x, you can use the
:func:`pandas.api.types.is_string_dtype` function:

.. code-block:: python

    >>> pd.api.types.is_string_dtype(ser.dtype)
    True

This will return ``True`` for both the object dtype and the string dtypes.
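
As a minimal sketch of how this check could be applied to a whole DataFrame,
the example below collects the columns that hold text in a way that should
behave the same on pandas 2.x and 3.x. The helper name ``string_columns`` and
the example data are hypothetical and not part of pandas. It assumes that the
*dtype* is passed to the function: for an object-dtype column the result is
then ``True`` even if the column holds non-string Python objects.

.. code-block:: python

    import pandas as pd


    def string_columns(df):
        """Return the labels of columns with object dtype or a string dtype."""
        # Passing the dtype (rather than the column) keeps the check cheap, but
        # it cannot tell an object column of strings from one of mixed objects.
        return [
            col
            for col in df.columns
            if pd.api.types.is_string_dtype(df[col].dtype)
        ]


    df = pd.DataFrame({"name": ["a", "b"], "count": [1, 2], "score": [0.5, 1.5]})
    text_cols = string_columns(df)  # only the "name" column qualifies

If object-dtype columns with mixed content need to be excluded, an additional
check on the values themselves would be required.
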

Hardcoded use of object dtype
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have code where the dtype is hardcoded in constructors, like

.. code-block:: python

    >>> pd.Series(["a", "b", "c"], dtype="object")

this will keep using the object dtype. You will want to update this code to
ensure you get the benefits of the new string dtype.

**How to write compatible code?**

First, in many cases it can be sufficient to remove the specific data type, and
let pandas do the inference. But if you want to be specific, you can specify the
``"str"`` dtype:

.. code-block:: python

    >>> pd.Series(["a", "b", "c"], dtype="str")

This is actually compatible with pandas 2.x as well, since in pandas < 3,
``dtype="str"`` was essentially treated as an alias for object dtype.

The missing value sentinel is now always NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using object dtype, multiple possible missing value sentinels are
supported, including ``None`` and ``np.nan``. With the new default string dtype,
the missing value sentinel is always NaN (``np.nan``):

.. code-block:: python

    # with object dtype, None is preserved as None and seen as missing
    >>> ser = pd.Series(["a", "b", None], dtype="object")
    >>> ser
    0       a
    1       b
    2    None
    dtype: object
    >>> print(ser[2])
    None

    # with the new string dtype, any missing value like None is coerced to NaN
    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser
    0      a
    1      b
    2    NaN
    dtype: str
    >>> print(ser[2])
    nan

Generally this should be no problem when relying on missing value behaviour in
pandas methods (for example, ``ser.isna()`` will give the same result as before).
But if you relied on the exact value of ``None`` being present, this change can
impact your code.

**How to write compatible code?**

When checking for a missing value, instead of checking for the exact value of
``None`` or ``np.nan``, you should use the :func:`pandas.isna` function. This is
the most robust way to check for missing values, as it will work regardless of
the dtype and the exact missing value sentinel:

.. code-block:: python

    >>> pd.isna(ser[2])
    True

One caveat: this function works both on scalars and on array-likes, and in the
latter case it will return an array of boolean dtype. When using it in a boolean
context (for example, ``if pd.isna(..): ..``) be sure to only pass a scalar to
it.

.. Review comment (suggested change, content not captured): not to confuse with
   pandas nullable type should capitalize as named after George Boole?
.. Reply: numpy uses "boolean" as well, so would rather leave it like this, or
   can make it an "array of bools"
.. Reply: sure
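
The scalar versus array-like caveat above can be sketched as follows. The
variable names are only illustrative; the example relies on nothing beyond
:func:`pandas.isna` and standard Series reductions, and is meant to behave the
same with object dtype and with the new string dtype.

.. code-block:: python

    import pandas as pd

    ser = pd.Series(["a", "b", None])

    # Scalar input: the result is a single bool, so it is safe in an ``if``.
    last_is_missing = bool(pd.isna(ser.iloc[2]))

    # Array-like input: the result is a boolean Series. Reduce it explicitly
    # instead of writing ``if pd.isna(ser): ...``, which raises a ValueError
    # because the truth value of a Series is ambiguous.
    mask = pd.isna(ser)
    any_missing = bool(mask.any())
    n_missing = int(mask.sum())
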

"setitem" operations will now raise an error for non-string data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the new string dtype, any attempt to set a non-string value in a Series or
DataFrame will raise an error:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser[1] = 2.5
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    ...
    TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead.

.. Review comment: i notice you can do [...]
.. Reply: Yeah, I am not a super big fan of already allowing to assign [...]
.. Reply: sure. It doesn't create any issues really like with [...]

If you relied on the flexible nature of object dtype being able to hold any
Python object, but your initial data was inferred as strings, your code might be
impacted by this change.

**How to write compatible code?**

You can update your code to ensure you only set string values in such columns,
or otherwise you have to explicitly ensure the column has object dtype first.
This can be done by specifying the dtype explicitly in the constructor, or by
using the :meth:`~pandas.Series.astype` method:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser = ser.astype("object")
    >>> ser[1] = 2.5

This ``astype("object")`` call will be redundant when using pandas 2.x, but
this way such code can work for all versions.
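
The other option mentioned above, only setting string values, could look like
the following sketch. The variable names are illustrative, and converting with
the built-in ``str()`` is just one possible choice, assumed here for
simplicity. The assignment is valid both for object dtype and for the new
string dtype.

.. code-block:: python

    import pandas as pd

    ser = pd.Series(["a", "b", None])

    value = 2.5
    # Convert explicitly before assigning, so the column only ever holds
    # strings (or missing values), whichever dtype was inferred.
    ser.loc[1] = str(value)

Note that ``str()`` should not be applied to missing values, since it would
turn them into ordinary strings such as ``"nan"`` or ``"None"``; assign those
as missing values directly.
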

For existing users of the nullable ``StringDtype``
--------------------------------------------------

.. Review comment: if you really want to keep writing i have no objection, but
   by construction these are advanced users who i dont think need as much
   hand-holding
.. Reply: I mostly want to briefly mention in the docs (as I don't think we
   really do that anywhere, except for in the PDEP) that we made this backcompat
   as if you were using [...] (and maybe mention that if you were using it for
   getting the faster pyarrow one, but don't care about the missing value
   sentinel, you could also just use the default dtype now. But that might be a
   bit subjective/controversial to say, and indeed at that point they probably
   understand that themselves as well)

TODO