Fletcher 0.3: A status report on the mission to get pandas hooked on Apache Arrow
It has now been nearly two years since the idea came up to use the
ExtensionArray interface to provide columns in
pandas that are backed by Apache Arrow.
fletcher was started as a prototype project to show how this idea can be brought together.
Since then there has been quite a lot of development in both
pandas and Apache Arrow.
fletcher remains a prototype that shows what this could look like, as essential functionality is still missing to use it productively.
With the two-year mark now approaching, I thought it was a good time to give a progress report and tag a new intermediate release.
Although I strongly advise against any production use of
fletcher, I would be curious to find out where the first entry point is that fails for users.
Thus, please try to use it in your project and report the first exception you encounter in our issue tracker.
This will be the focus of
fletcher development, as it will give us insight into what functionality is required to provide a minimal useful library.
Choosing between chunked & continuous arrays as storage backend
ExtensionArray instances in
fletcher were previously backed solely by pyarrow.ChunkedArray.
This was chosen as chunked arrays allow for the most flexibility, e.g. concatenating them can be done in constant time.
But with the flexibility on the user side also comes a lot of complexity in implementing algorithms on top of them.
Because of the chunking, you cannot use a simple linear index to access an element of an array. Instead, you always need to translate between the scalar index that gives the position in the whole array and the tuple
(chunked_index, index_in_chunk) that gives the position of an element relative to its containing chunk.
Thus, we now provide two different extension array implementations.
There is now the simpler
FletcherContinuousArray, which is backed by a
pyarrow.Array instance and thus is always a continuous memory segment.
FletcherArray, which is backed by a
pyarrow.ChunkedArray, is now renamed to FletcherChunkedArray.
While pyarrow.ChunkedArray allows for more flexibility in how the data is stored, the implementation of algorithms is more complex for it.
As this hinders contributions and also the adoption in downstream libraries, we now provide both implementations with an equal level of support.
We no longer provide the more generally named class
FletcherArray, as there is no clear opinion on whether it should point to FletcherContinuousArray or FletcherChunkedArray.
As usage increases, we might provide such an alias class again in the future.
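The index translation that chunked storage requires can be sketched in a few lines. This is a minimal illustration in plain Python, not code from fletcher: the function name `locate` and the use of nested lists in place of Arrow chunks are assumptions made for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(chunk_lengths, index):
    """Translate a position in the whole array into (chunked_index, index_in_chunk).

    `locate` is a hypothetical helper for illustration only.
    """
    # offsets[i] is the global position at which chunk i starts;
    # the final entry equals the total length of the array.
    offsets = [0, *accumulate(chunk_lengths)]
    total = offsets[-1]
    if not 0 <= index < total:
        raise IndexError(f"index {index} is out of bounds for length {total}")
    # bisect_right steps over empty chunks, whose start offset is the
    # same as the start offset of the chunk that follows them.
    chunked_index = bisect_right(offsets, index) - 1
    return chunked_index, index - offsets[chunked_index]


# Plain lists stand in for the chunks of a chunked array.
chunks = [[10, 11, 12], [], [13, 14], [15]]
lengths = [len(chunk) for chunk in chunks]

chunked_index, index_in_chunk = locate(lengths, 4)
value = chunks[chunked_index][index_in_chunk]
```

With a single continuous array, the same lookup would simply be `array[4]`, which is the kind of simplification the continuous variant offers to algorithm authors.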
Arithmetic, comparison and reduce operations
For numeric data,
pandas has added a test suite in the last year that provides a vast number of tests to check all kinds of numeric operations on ExtensionArray implementations.
With the help of this suite, we were able to implement these operations in fletcher.
The current implementation applies the mask on the input arrays and then delegates to numpy for the computations.
With this, we are on the same performance level as pandas.
In the future, we want to use numeric operations that are implemented directly in Apache Arrow C++ and make direct use of the validity bitmap instead.
This will save memory bandwidth and storage, and it will also make the actual numeric operations faster, as bitmap checking and computation will no longer be separate steps.
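To make the difference between the two approaches concrete, here is a rough sketch of a sum over an array with missing values. It is written in plain Python purely for illustration: the function names are made up, plain Python stands in for both numpy and Arrow C++, and nothing here is fletcher's actual code. The only thing taken from Arrow is the bitmap layout, where bit `i % 8` of byte `i // 8` is set when slot `i` holds a valid value.

```python
def is_valid(bitmap, i):
    # Arrow validity bitmaps use least-significant-bit numbering:
    # slot i is stored in bit (i % 8) of byte (i // 8).
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def masked_sum_two_step(values, bitmap):
    # Step 1: apply the mask and materialise the valid values.
    valid_values = [v for i, v in enumerate(values) if is_valid(bitmap, i)]
    # Step 2: run the actual computation on the intermediate copy.
    return sum(valid_values)


def masked_sum_fused(values, bitmap):
    # Bitmap check and accumulation happen in one pass,
    # so no intermediate copy of the data is created.
    total = 0
    for i, v in enumerate(values):
        if is_valid(bitmap, i):
            total += v
    return total


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Slots 1 and 8 are null; every other slot is valid.
bitmap = bytes([0b11111101, 0b00000010])

two_step = masked_sum_two_step(values, bitmap)
fused = masked_sum_fused(values, bitmap)
```

Both functions return the same result; the fused variant only differs in how much memory it touches on the way there.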
BooleanArray & StringArray in pandas and their fletcher counterparts
In the newest release, we have an implementation of a boolean array that supports missing values and behaves like a
pandas.Series of float type for
There was a blog post outlining its implementation.
In pandas 1.0, a new
BooleanArray was released with slightly different behaviour; we will adapt
fletcher to this in the next release.
Currently our tests are failing, and it looks like an inconsistency in pandas’ implementation, which we are investigating.
Besides BooleanArray, pandas 1.0 also added
StringArray, which checks that all objects in the column are strings but doesn’t improve performance.
Thus there is still the need for a fast string type like we are implementing in fletcher.
As a first step in this direction, we now support
.str.cat as an algorithm on fletcher string columns via
Missing kernels / operations / .. in Arrow or fletcher
One of the main things that makes
fletcher impractical at the moment is the lack of algorithm implementations on top of it.
You can select / slice / store
fletcher columns but executing operations like
zfill for strings or
dt.year on top of its columns is not possible yet.
These operations currently need a cast to an object-typed series, making them even slower than their current pandas counterparts.
Such operations are called kernels in the Apache Arrow C++ source code, where a rudimentary set of them already exists.
Sadly, a lot of common functionality is missing for the basic data types, and some of the existing kernels are only implemented for
pyarrow.Array and not for pyarrow.ChunkedArray.
Having good kernels available for
ChunkedArray in Arrow C++ itself is crucial, as applying a kernel to individual chunks often involves non-trivial transformations of intermediate results or of indices that were given as input.
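Two small examples show why running a kernel chunk by chunk is not just a loop. The sketch below uses plain Python lists in place of Arrow chunks, and both function names are hypothetical; it is meant to illustrate the kind of bookkeeping involved, not how Arrow or fletcher implement it.

```python
def chunked_mean(chunks):
    # Averaging the per-chunk means would weight small chunks too heavily,
    # so the intermediate result carried between chunks is (sum, count).
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of an empty array")
    return total / count


def chunked_argmax(chunks):
    # Each chunk only knows positions relative to itself, so the local
    # result has to be shifted by the number of elements seen so far.
    best_value, best_index, offset = None, None, 0
    for chunk in chunks:
        if chunk:
            local = max(range(len(chunk)), key=chunk.__getitem__)
            if best_value is None or chunk[local] > best_value:
                best_value, best_index = chunk[local], offset + local
        offset += len(chunk)
    if best_index is None:
        raise ValueError("cannot take the argmax of an empty array")
    return best_index


chunks = [[1.0, 5.0], [], [9.0, 2.0, 3.0]]
mean = chunked_mean(chunks)
position = chunked_argmax(chunks)
```

Even for these simple reductions, empty chunks and index offsets need explicit handling, and the effort grows quickly for kernels that take indices as input.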
With the pandas integration basics now in place in
fletcher, we will be able to concentrate on exactly these kernels.
As one of the points of
fletcher is to explore how to implement kernels on top of Arrow in the most efficient way with
numba, we first try to implement a kernel in
fletcher and only resort to Arrow if the implementation turns out to be too complex or too slow with numba.
One of the drawbacks of putting a kernel implementation into Arrow C++ is that we need to wait for an Arrow release to make it available to end users.
By implementing kernels first in
fletcher, we can make releases on our own and thus get them to users faster.
Afterwards, once Arrow is released, we can remove our implementation and point to the most likely (a bit) faster implementation in C++.
spatialpandas as an impact example
The main goal of fletcher is to have an impact on
pandas and Apache Arrow, but we are also very pleased that we have an impact on the wider ecosystem.
The influence of fletcher can be seen a bit in spatialpandas, an
ExtensionArray implementation for spatial/geometric operations.
spatialpandas also builds on top of
numba to implement certain spatial-specific data types and reuses basic code from
fletcher for accessing Arrow data.
As the next steps in
fletcher, we will focus on implementing more of the string and date(time) functionality.
These are the simple data types where
fletcher can bring massive improvements over the status quo in pandas.
This is because strings are currently implemented as
object dtype even when using the new
StringDtype and thus are not comparable in performance to the dtypes that are implemented with numpy-native types.
For date(time) columns, we can also improve a bit by allowing resolutions other than nanoseconds and by using a 32-bit data type for dates where 64 bits aren’t needed to represent the most commonly used timespans.
It would also be nice to have more operations on nested types, as they are currently unavailable in pandas, while
fletcher supports them through the use of Arrow.
But as the kernel implementations for them are much more complex, we are going to focus first on strings and dates.
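The point about date storage can be checked with a bit of arithmetic. The sketch below compares a 64-bit count of nanoseconds since the UNIX epoch with a 32-bit count of days since the epoch, which is how Arrow's date32 type stores dates. The variable names are made up for the example, and the year length of 365.2425 days is only an approximation used to get an order of magnitude.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
NS_PER_DAY = 24 * 60 * 60 * 10**9

# Signed 64-bit nanoseconds since the epoch: how many whole days fit?
max_days_ns64 = (2**63 - 1) // NS_PER_DAY
last_day_ns64 = EPOCH + timedelta(days=max_days_ns64)

# Signed 32-bit days since the epoch: the range in years,
# using an average year length of 365.2425 days.
max_days_int32 = 2**31 - 1
approx_years_int32 = max_days_int32 / 365.2425

# Storage per value with standard sizes (little-endian, no padding).
bytes_int32 = struct.calcsize("<i")
bytes_int64 = struct.calcsize("<q")

print(f"int64 nanoseconds reach until {last_day_ns64}")
print(f"int32 days cover about {approx_years_int32:,.0f} years")
print(f"bytes per value: {bytes_int32} vs {bytes_int64}")
```

Nanosecond timestamps in 64 bits run out in the year 2262, while day-resolution dates in 32 bits cover several million years at half the storage per value.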