diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst
index 8d3ac0e396430..8cb2809583845 100644
--- a/doc/source/whatsnew/v3.0.0.rst
+++ b/doc/source/whatsnew/v3.0.0.rst
@@ -14,10 +14,108 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~

-.. _whatsnew_300.enhancements.enhancement1:
+.. _whatsnew_300.enhancements.string_dtype:

-Enhancement1
-^^^^^^^^^^^^
+Dedicated string data type by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Historically, pandas represented string columns with the NumPy ``object`` data
+type. This representation has numerous problems: it is not specific to strings
+(any Python object can be stored in an ``object``-dtype array, not just strings)
+and it is often inefficient (in both performance and memory usage).
+
+Starting with pandas 3.0, a dedicated string data type is enabled by default
+(backed by PyArrow under the hood, if installed, otherwise falling back to
+NumPy). This means that pandas will start inferring columns containing string
+data as the new ``str`` data type when creating pandas objects, such as in
+constructors or IO functions.
+
+Old behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: object
+
+New behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: str
+
+The string data type that is used in these scenarios will mostly behave as NumPy
+``object`` dtype would, including missing value semantics and general operations
+on these columns.
+
+The main characteristics of the new string data type:
+
+- It is inferred by default for string data (instead of ``object`` dtype).
+- The ``str`` dtype can only hold strings (or missing values), in contrast to
+  ``object`` dtype; setting a non-string value fails.
+- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
+  missing value semantics as the other default dtypes.
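
The characteristics listed above can be illustrated with a minimal sketch. It assumes pandas is installed and importable as ``pd``; the variable names are illustrative only. It uses ``pd.api.types.is_string_dtype`` and ``Series.isna`` because neither hard-codes the dtype or the missing value sentinel. This is one possible way to write version-tolerant checks, not a prescribed one.

```python
import pandas as pd

ser = pd.Series(["a", "b"])

# Fragile check: the result depends on whether strings are inferred as
# ``object`` (older pandas) or as the new ``str`` dtype.
is_object_dtype = ser.dtype == object

# More tolerant check: intended to hold for string data under either dtype.
is_string = bool(pd.api.types.is_string_dtype(ser))

# Detect missing values without comparing against a specific sentinel
# (``None`` versus ``np.nan``).
with_missing = pd.Series(["a", None])
missing_mask = with_missing.isna().tolist()
```

Code that branches on ``is_object_dtype`` is the kind of check that can change behavior after upgrading, whereas ``is_string`` and ``missing_mask`` do not rely on the exact dtype or sentinel.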
+
+Those intentional changes can have breaking consequences, for example when checking
+for the ``.dtype`` being object dtype or checking the exact missing value sentinel.
+
+TODO add link to migration guide for more details
+
+.. seealso::
+
+    `PDEP-14: Dedicated string data type for pandas 3.0 `__
+
+
+.. _whatsnew_300.enhancements.copy_on_write:
+
+Copy-on-Write
+^^^^^^^^^^^^^
+
+The new "copy-on-write" behavior in pandas 3.0 changes how pandas operates
+with respect to copies and views. A summary of the changes:
+
+1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
+   i.e. including accessing a DataFrame column as a Series) or any method returning a
+   new DataFrame or Series, always *behaves as if* it were a copy in terms of the user
+   API.
+2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
+   to do this is to directly modify that object itself.
+
+The main goal of this change is to make the user API more consistent and
+predictable. There is now a clear rule: *any* subset or returned
+Series/DataFrame **always** behaves as a copy of the original, and thus never
+modifies the original (before pandas 3.0, whether a derived object would be a
+copy or a view depended on the exact operation performed, which was often
+confusing).
+
+Because every single indexing step now behaves as a copy, this also means that
+"chained assignment" (updating a DataFrame with multiple setitem steps) will
+stop working. Because chained assignment now consistently never works, the
+``SettingWithCopyWarning`` is removed.
+
+The new behavioral semantics are explained in more detail in the
+:ref:`user guide about Copy-on-Write `.
+
+A secondary goal is to improve performance by avoiding unnecessary copies.
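
The user-visible rule summarized in the numbered list above can be sketched as follows. This is a hedged illustration, assuming pandas 2.x or later. On pandas 2.x the behavior is opt-in, so the sketch enables it through the ``mode.copy_on_write`` option; on 3.0 it is the default and the option is left untouched. The variable names are illustrative only.

```python
import pandas as pd

# Assumption: on pandas 2.x, Copy-on-Write must be enabled explicitly.
if int(pd.__version__.split(".")[0]) < 3:
    pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Accessing a column returns an object that behaves as a copy.
col = df["a"]
col.iloc[0] = 100  # modifies ``col`` only
unchanged = df["a"].tolist()

# To change ``df``, modify ``df`` itself in a single step.
df.loc[0, "a"] = 100
changed = df["a"].tolist()
```

Here ``unchanged`` still holds the original values, while ``changed`` reflects the direct update. A chained form such as ``df["a"][0] = 100`` falls under the "chained assignment" case described above and does not update ``df``.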
+As mentioned above, every new DataFrame or Series returned from an indexing
+operation or method *behaves* as a copy, but under the hood pandas will use
+views as much as possible, and only copy when needed to guarantee the "behaves
+as a copy" behavior (this is the actual "copy-on-write" mechanism used as an
+implementation detail).
+
+Some of the behavior changes described above are breaking changes in pandas
+3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
+2.3 to get deprecation warnings for a subset of those changes. The
+:ref:`migration guide ` explains the upgrade
+process in more detail.
+
+.. seealso::
+
+    `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write `__

 .. _whatsnew_300.enhancements.enhancement2:
