-
Fast JDBC access in Python using pyarrow.jvm
·While most databases are accessible via ODBC, where turbodbc gives us an efficient way to turn results into a pandas.DataFrame, there are nowadays a lot of databases that either come only with a JDBC driver or whose non-JDBC drivers are not part of a free or open-source offering. To access these databases, you can use...
-
Taking DuckDB for a spin
·TL;DR: Recently, DuckDB, a database that promises to become the SQLite of analytics, was released and I took it for an initial test drive. Install it via conda install python-duckdb or pip install duckdb.
-
How we build Apache Arrow's manylinux wheels
·Apache Arrow is provided for Python users through two package managers, pip and conda. The first mechanism, providing binary, pip-installable Python wheels, is currently unmaintained, as highlighted on the mailing list. There have been shoutouts for help, e.g. on Twitter, saying that we need new contributors who look after the builds. We sadly cannot point...
-
Writing a boolean array for pandas that can deal with missing values
·When working with missing data in pandas, one often runs into issues, as the main way of handling it is to convert data into float columns. pandas provides efficient/native support for boolean columns through the numpy.dtype('bool'). Sadly, this dtype only supports True/False as possible values and offers no possibility for storing missing...
-
Why the NYC TLC trip record data is a nice training dataset for Data Engineers
·The New York City Taxi & Limousine Commission Trip Record Data is a really nice dataset for getting started with Data Engineering or for teaching it. It has several nice properties that make it quite useful, which we will show in this article. We will look at this data using only pandas, not introducing any other tooling. Many...