
Commit f553a5f

ENH: add to/from_parquet with pyarrow & fastparquet
1 parent ab49d1f commit f553a5f

20 files changed: +703 −12 lines

ci/install_travis.sh

Lines changed: 1 addition & 0 deletions
@@ -153,6 +153,7 @@ fi
 echo
 echo "[removing installed pandas]"
 conda remove pandas -y --force
+pip uninstall -y pandas

 if [ "$BUILD_TEST" ]; then

ci/requirements-2.7.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas

 echo "install 27"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet

ci/requirements-3.5.sh

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ source activate pandas

 echo "install 35"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1
-
 # pip install python-dateutil to get latest
 conda remove -n pandas python-dateutil --force
 pip install python-dateutil
+
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1

ci/requirements-3.5_OSX.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas

 echo "install 35_OSX"

-conda install -n pandas -c conda-forge feather-format==0.3.1
+conda install -n pandas -c conda-forge feather-format==0.3.1 fastparquet

ci/requirements-3.6.pip

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+brotlipy

ci/requirements-3.6.run

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ pymysql
 feather-format
 pyarrow
 psycopg2
+python-snappy
+fastparquet
 beautifulsoup4
 s3fs
 xarray

ci/requirements-3.6_DOC.sh

Lines changed: 1 addition & 1 deletion
@@ -6,6 +6,6 @@ echo "[install DOC_BUILD deps]"

 pip install pandas-gbq

-conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc
+conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc fastparquet

 conda install -n pandas -c r r rpy2 --yes

ci/requirements-3.6_WIN.run

Lines changed: 2 additions & 0 deletions
@@ -13,3 +13,5 @@ numexpr
 pytables
 matplotlib
 blosc
+fastparquet
+pyarrow

doc/source/install.rst

Lines changed: 1 addition & 0 deletions
@@ -237,6 +237,7 @@ Optional Dependencies
 * `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
+* `Apache Parquet <https://parquet.apache.org/>`__: either `pyarrow <http://arrow.apache.org/docs/python/>`__ (>= 0.4.1) or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ (>= 0.0.6) is necessary for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ libraries are available for compression support.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

 * `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
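One way to satisfy the optional parquet dependencies added above (a sketch, assuming the conda-forge channel and the version pins this commit uses in its CI scripts; adjust for your environment):

```shell
# Both parquet engines plus the optional compression libraries
# (pyarrow pin matches this commit's CI; newer versions may also work).
conda install -c conda-forge feather-format pyarrow=0.4.1 fastparquet python-snappy
pip install brotlipy
```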

doc/source/io.rst

Lines changed: 78 additions & 4 deletions
@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
 binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
 binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
 binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
+binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
 binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
 binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
 binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
@@ -209,7 +210,7 @@ buffer_lines : int, default None
     .. deprecated:: 0.19.0

        Argument removed because its value is not respected by the parser
-
+
 compact_ints : boolean, default False
     .. deprecated:: 0.19.0

@@ -4087,7 +4088,7 @@ control compression: ``complevel`` and ``complib``.
 ``complevel`` specifies if and how hard data is to be compressed.
 ``complevel=0`` and ``complevel=None`` disables
 compression and ``0<complevel<10`` enables compression.
-
+
 ``complib`` specifies which compression library to use. If nothing is
 specified the default library ``zlib`` is used. A
 compression library usually optimizes for either good
@@ -4102,9 +4103,9 @@ control compression: ``complevel`` and ``complib``.
 - `blosc <http://www.blosc.org/>`_: Fast compression and decompression.

   .. versionadded:: 0.20.2
-
+
   Support for alternative blosc compressors:
-
+
   - `blosc:blosclz <http://www.blosc.org/>`_ This is the
     default compressor for ``blosc``
   - `blosc:lz4
@@ -4545,6 +4546,79 @@ Read from a feather file.

    import os
    os.remove('example.feather')

+
+.. _io.parquet:
+
+Parquet
+-------
+
+.. versionadded:: 0.21.0
+
+`Parquet <https://parquet.apache.org/>`__ provides a partitioned binary columnar serialization for data frames. It is designed to
+make reading and writing data frames efficient, and to make sharing data across data analysis
+languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
+while still maintaining good read performance.
+
+Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
+dtypes, including extension dtypes such as datetime with tz.
+
+Several caveats:
+
+- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame``, and will raise an
+  error if a non-default one is provided. You can simply ``.reset_index(drop=True)`` in order to store the index.
+- Duplicate column names and non-string column names are not supported.
+- Categorical dtypes are currently not supported (for ``pyarrow``).
+- Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, ``fastparquet``, or ``auto``.
+If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
+then ``pyarrow`` is tried first, falling back to ``fastparquet``.
+
+See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
+
+.. note::
+
+   These engines are very similar and should read/write nearly identical parquet format files.
+   The libraries differ in their underlying dependencies (``fastparquet`` uses ``numba``, while ``pyarrow`` uses a C library).
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': list('abc'),
+                      'b': list(range(1, 4)),
+                      'c': np.arange(3, 6).astype('u1'),
+                      'd': np.arange(4.0, 7.0, dtype='float64'),
+                      'e': [True, False, True],
+                      'f': pd.date_range('20130101', periods=3),
+                      'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'h': pd.date_range('20130101', periods=3, freq='ns')})
+
+   df
+   df.dtypes
+
+Write to a parquet file.
+
+.. ipython:: python
+
+   df.to_parquet('example_pa.parquet', engine='pyarrow')
+   df.to_parquet('example_fp.parquet', engine='fastparquet')
+
+Read from a parquet file.
+
+.. ipython:: python
+
+   result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
+   result = pd.read_parquet('example_fp.parquet', engine='fastparquet')
+
+   result.dtypes
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('example_pa.parquet')
+   os.remove('example_fp.parquet')
+
 .. _io.sql:

 SQL Queries
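The engine-selection rules documented above (``engine='auto'`` consults ``pd.options.io.parquet.engine``, then tries ``pyarrow``, falling back to ``fastparquet``) can be sketched in plain Python. ``resolve_parquet_engine`` is a hypothetical illustration of the precedence, not part of the pandas API:

```python
import importlib.util


def resolve_parquet_engine(engine="auto", option="auto"):
    """Pick a parquet engine following the documented precedence.

    `option` stands in for the pd.options.io.parquet.engine setting.
    """
    if engine == "auto":
        # Fall back to the global option when no explicit engine is given.
        engine = option
    if engine == "auto":
        # Option is also 'auto': try pyarrow first, then fastparquet.
        for candidate in ("pyarrow", "fastparquet"):
            if importlib.util.find_spec(candidate) is not None:
                return candidate
        raise ImportError("unable to find a usable parquet engine")
    if engine not in ("pyarrow", "fastparquet"):
        raise ValueError("engine must be 'pyarrow', 'fastparquet' or 'auto'")
    return engine
```

An explicit engine always wins; only the double-``auto`` case probes which library is importable.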
