rle-array¶
Extension Array for Pandas that implements Run-length Encoding.
Table of Contents
Quick Start¶
Some basic setup first:
>>> import pandas as pd
>>> pd.set_option("display.max_rows", 40)
>>> pd.set_option("display.width", None)
We need some example data, so let’s create some pseudo-weather data:
>>> from rle_array.testing import generate_example
>>> df = generate_example()
>>> df.head(10)
date month year city country avg_temp rain mood
0 2000-01-01 1 2000 city_0 country_0 12.400000 False ok
1 2000-01-02 1 2000 city_0 country_0 4.000000 False ok
2 2000-01-03 1 2000 city_0 country_0 17.200001 False great
3 2000-01-04 1 2000 city_0 country_0 8.400000 False ok
4 2000-01-05 1 2000 city_0 country_0 6.400000 False ok
5 2000-01-06 1 2000 city_0 country_0 14.400000 False ok
6 2000-01-07 1 2000 city_0 country_0 14.300000 True ok
7 2000-01-08 1 2000 city_0 country_0 6.800000 False ok
8 2000-01-09 1 2000 city_0 country_0 10.100000 False ok
9 2000-01-10 1 2000 city_0 country_0 -1.200000 False ok
Due to the large number of attributes for locations and the date, the data size is quite large:
>>> df.memory_usage()
Index 128
date 32000000
month 4000000
year 8000000
city 32000000
country 32000000
avg_temp 16000000
rain 4000000
mood 32000000
dtype: int64
>>> df.memory_usage().sum()
160000128
To compress the data, we can use rle-array
:
>>> import rle_array
>>> df_rle = df.astype({
... "city": "RLEDtype[object]",
... "country": "RLEDtype[object]",
... "month": "RLEDtype[int8]",
... "mood": "RLEDtype[object]",
... "rain": "RLEDtype[bool]",
... "year": "RLEDtype[int16]",
... })
>>> df_rle.memory_usage()
Index 128
date 32000000
month 1188000
year 120000
city 32000
country 64
avg_temp 16000000
rain 6489477
mood 17153296
dtype: int64
>>> df_rle.memory_usage().sum()
72982965
This works better the longer the runs are. In the above example, it does not work too well for "rain"
.
Development Plan¶
The development of rle-array
has the following priorities (in decreasing order):
Correctness: All results must be correct. The Pandas-provided test suite must pass. Approximation are not allowed.
Transparency: The user can use
RLEDtype
andRLEArray
like other Pandas types. No special parameters or extra functions are required.Features: Support all features that Pandas offers, even if it is slow (but inform the user using a
pandas.errors.PerformanceWarning
).Simplicity: Do not use Python C Extensions or Cython (NumPy and Numba are allowed).
Memory Reduction: Do not decompress the encoded data when not required, try to do as many calculations directly on the compressed representation.
Performance: It should be quick, for large data ideally faster than working on the uncompressed data. Use Numba to speed up code.
Implementation¶
Imagine the following data array:
Index |
Data |
---|---|
1 |
“a” |
2 |
“a” |
3 |
“a” |
4 |
“x” |
5 |
“c” |
6 |
“c” |
7 |
“a” |
8 |
“a” |
There some data points valid for multiple entries in a row:
Index |
Data |
---|---|
1 |
“a” |
2 |
|
3 |
|
4 |
“x” |
5 |
“c” |
6 |
|
7 |
“a” |
8 |
These sections are also called runs and can be encoded by their value and their length:
Length |
Value |
---|---|
3 |
“a” |
1 |
“x” |
2 |
“c” |
2 |
“a” |
This representation is called Run-length Encoding. To integrate this encoding better with Pandas and NumPy and
to support operations like slicing and random access (e.g. via pandas.api.extensions.ExtensionArray.take()
), we
store the end position (the cum-sum of the length column) instead of the length:
End-position |
Value |
---|---|
3 |
“a” |
4 |
“x” |
6 |
“c” |
8 |
“a” |
The value array is an numpy.ndarray
with the same dtype as the original data and the end-positions are an
numpy.ndarray
with the dtype int64
.