Uwe Korn's blog **Uwe L. Korn**, Data Scientist, Music Hacker and Food lover. https://uwekorn.com Tue, 04 May 2021 00:00:00 +0100 Automating miniforge updates using Github Actions <p><a href="https://github.com/conda-forge/miniforge"><code class="language-plaintext highlighter-rouge">miniforge</code> and its variants <code class="language-plaintext highlighter-rouge">miniforge-pypy</code> and <code class="language-plaintext highlighter-rouge">mambaforge-*</code></a> are the base installers for using <code class="language-plaintext highlighter-rouge">conda</code> with <code class="language-plaintext highlighter-rouge">conda-forge</code> as the default source for packages. They will provide you with a basic conda installation to get started. This means that as part of that, the newest installers should also bring the newest <code class="language-plaintext highlighter-rouge">conda</code> and <code class="language-plaintext highlighter-rouge">mamba</code> versions with them.</p> Tue, 04 May 2021 00:00:00 +0100 https://uwekorn.com/2021/05/04/automating-miniforge-updates.html https://uwekorn.com/2021/05/04/automating-miniforge-updates.html The implications of pickling ML models <p>When you have trained a machine learning model (pipeline), you will make predictions directly afterwards to assess its quality. When using the model actually for something useful, we also want to make predictions with it at a later point in time. This forces us to store the model to disk and think of a way to serialise it.</p> Mon, 26 Apr 2021 00:00:00 +0100 https://uwekorn.com/2021/04/26/implications-of-pickling-ml-models.html https://uwekorn.com/2021/04/26/implications-of-pickling-ml-models.html Deploying conda environments in (Docker) containers - The Cheatsheet! <p>Deploying conda environments inside a container looks like a straight-forward <code class="language-plaintext highlighter-rouge">conda install</code>. 
But with a bit more love for details, you can optimise the process so that the build is faster and the resulting container much smaller.</p> Wed, 03 Mar 2021 00:00:00 +0000 https://uwekorn.com/2021/03/03/deploying-conda-environments-in-docker-cheatsheet.html https://uwekorn.com/2021/03/03/deploying-conda-environments-in-docker-cheatsheet.html Deploying conda environments in (Docker) containers - how to do it right <p>Deploying conda environments inside a container looks like a straight-forward <code class="language-plaintext highlighter-rouge">conda install</code>. But with a bit more love for details, you can optimise the process so that the build is faster and the resulting container much smaller.</p> Mon, 01 Mar 2021 00:00:00 +0000 https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html Apache Arrow on the Apple M1 <p>In <a href="https://uwekorn.com/2021/01/04/first-two-weeks-with-the-m1.html">the previous blog post</a> I explained how I got a well-working setup on my M1 MacBook. With that in place, I mostly worked on getting my main work setup running. But as a core Apache Arrow developer, I was also very eager to go the extra mile and get Arrow (the C++ and Python part) working on the M1. As outlined in the previous post, I used <code class="language-plaintext highlighter-rouge">conda-forge</code> as the source for all dependencies and Arrow itself to build binary packages.</p> Mon, 11 Jan 2021 00:00:00 +0000 https://uwekorn.com/2021/01/11/apache-arrow-on-the-apple-m1.html https://uwekorn.com/2021/01/11/apache-arrow-on-the-apple-m1.html The first two weeks with the Apple M1 <p>Apple recently released new computers that contain their new M1 processors. I was quite excited about them because of the promises made by various benchmarks regarding performance and energy consumption but also because it is a new platform. 
Most things won’t work there and some assumptions about how we work today have to change if you want to use them.</p> Mon, 04 Jan 2021 00:00:00 +0000 https://uwekorn.com/2021/01/04/first-two-weeks-with-the-m1.html https://uwekorn.com/2021/01/04/first-two-weeks-with-the-m1.html Fast JDBC access in Python using pyarrow.jvm (2020 edition) <p>About a year ago, <a href="https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html">I benchmarked database access through JDBC in Python</a>. Recently, the maintainer of <a href="https://pypi.org/project/JPype1/"><code class="language-plaintext highlighter-rouge">jpype</code></a> gave <a href="https://github.com/Thrameos/jpype/issues/52">me a heads-up that they significantly improved performance on their side</a>. While this is actually the library I’m comparing my <code class="language-plaintext highlighter-rouge">pyarrow.jvm</code>-based approach to, I have a high appreciation for any performance tuning that is done. Thus I’m happily trying to recreate the setup I used a year ago to see how performance changed.</p> Wed, 30 Dec 2020 00:00:00 +0000 https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html Calculating levenshtein distances with fletcher <p><em>Levenshtein distance is a typical measure to compare two different strings. It gives you the minimal number of add, remove and replace operations needed to transform one string into another.</em></p> Tue, 08 Dec 2020 00:00:00 +0000 https://uwekorn.com/2020/12/08/levenshtein-distance-with-fletcher.html https://uwekorn.com/2020/12/08/levenshtein-distance-with-fletcher.html Trimming down pyarrow’s conda footprint (Part 2 of X) <p><em>We have again reduced the footprint of creating a conda environment with <code class="language-plaintext highlighter-rouge">pyarrow</code>. 
This time we have done some detective work on the package contents and removed contents from <code class="language-plaintext highlighter-rouge">thrift-cpp</code> and <code class="language-plaintext highlighter-rouge">pyarrow</code> that are definitely not needed at runtime.</em></p> Wed, 28 Oct 2020 00:00:00 +0000 https://uwekorn.com/2020/10/28/trimming-down-pyarrow-conda-2-of-x.html https://uwekorn.com/2020/10/28/trimming-down-pyarrow-conda-2-of-x.html Removing Python as a dependency of R <p><em>Surprisingly, Python was a runtime dependency of R on conda-forge. As R doesn’t need Python to run, this was a bit weird. We got rid of this by splitting up the GLib package.</em></p> Mon, 19 Oct 2020 00:00:00 +0100 https://uwekorn.com/2020/10/19/r-without-python.html https://uwekorn.com/2020/10/19/r-without-python.html Trimming down pyarrow’s conda footprint (Part 1 of X) <p><em>We have substantially reduced the footprint of creating a conda environment with <code class="language-plaintext highlighter-rouge">pyarrow</code>. While working on this, we have also substantially reduced the size of a base Python installation from conda-forge. All this was done without disabling any functionality. We reduced the size of a conda environment for pyarrow by nearly 50% and reduced the “pyarrow tax” for reading Parquet files with <code class="language-plaintext highlighter-rouge">pandas</code> to a tenth of its previous size. Additionally, we stripped 81MiB of unneeded files from every Python (3.8+) based conda environment installed from conda-forge.</em></p> Tue, 08 Sep 2020 00:00:00 +0100 https://uwekorn.com/2020/09/08/trimming-down-pyarrow-conda-1-of-x.html https://uwekorn.com/2020/09/08/trimming-down-pyarrow-conda-1-of-x.html Building R Arrow on Windows: A tale of two compilers <p>Windows support for Apache Arrow is pretty good. There are Python wheels, Python conda packages and a binary build for R on CRAN. 
One thing that has long been missing, though, is a conda package for R Arrow on Windows. Thanks to a lot of experimentation and some important suggestions by <a href="https://github.com/isuruf">Isuru Fernando</a> (Thanks!), we are able to use <code class="language-plaintext highlighter-rouge">conda install -c conda-forge r-arrow</code> successfully on Windows now.</p> Sun, 14 Jun 2020 00:00:00 +0100 https://uwekorn.com/2020/06/14/r-arrow-for-conda-windows.html https://uwekorn.com/2020/06/14/r-arrow-for-conda-windows.html The one pandas internal I teach all my new colleagues: the BlockManager <p>When new members join our team, they usually are already fluent in data analysis with <code class="language-plaintext highlighter-rouge">pandas</code> and know their way around the typical quirks. They know that they should use vectorised functions where possible and avoid using <code class="language-plaintext highlighter-rouge">apply</code> with a slow Python callable. There are two main reasons I teach them the <code class="language-plaintext highlighter-rouge">BlockManager</code> right at the beginning. The first reason is that it is a core architectural component that is neither visible from the API nor part of most tutorials through which people learn <code class="language-plaintext highlighter-rouge">pandas</code>. The other reason is that its impact on performance is neither obvious from the code you are using nor always the same (constant). 
Even after writing this post, I cannot reliably tell you the performance of a simple <code class="language-plaintext highlighter-rouge">df.loc[0:10, 'column'] = 1</code>; my answer will be “it depends!”.</p> Sun, 24 May 2020 00:00:00 +0100 https://uwekorn.com/2020/05/24/the-one-pandas-internal.html https://uwekorn.com/2020/05/24/the-one-pandas-internal.html Fletcher 0.3: A status report on the mission to get pandas hooked on Apache Arrow <p>It has now been nearly two years since the idea came up to use <code class="language-plaintext highlighter-rouge">pandas</code>’ new <code class="language-plaintext highlighter-rouge">ExtensionArray</code> interface to provide columns in <code class="language-plaintext highlighter-rouge">pandas</code> that are backed by Apache Arrow. <a href="https://github.com/xhochy/fletcher"><code class="language-plaintext highlighter-rouge">fletcher</code></a> was started as a prototype project to show how this idea can be brought together. Since then there has been quite a lot of development in both <code class="language-plaintext highlighter-rouge">pandas</code> and Apache Arrow. Still, <code class="language-plaintext highlighter-rouge">fletcher</code> remains a prototype that shows what this could look like, as essential functionality needed to use it productively is still missing. 
With the two-year mark now approaching, I thought it was a good time to give a progress report and tag a new intermediate release.</p> Tue, 25 Feb 2020 00:00:00 +0000 https://uwekorn.com/2020/02/25/fletcher-status-report.html https://uwekorn.com/2020/02/25/fletcher-status-report.html Fast JDBC access in Python using pyarrow.jvm <p>While most databases are accessible via ODBC, where <a href="https://github.com/blue-yonder/turbodbc">turbodbc</a> gives us an efficient way to turn results into a <code class="language-plaintext highlighter-rouge">pandas.DataFrame</code>, there are nowadays a lot of databases that either come solely with a JDBC driver or whose non-JDBC drivers are not part of a free or open-source offering. To access these databases, you can use <a href="https://github.com/baztian/jaydebeapi">JayDeBeApi</a>, which uses <a href="https://pypi.org/project/JPype1/">JPype</a> to call the JDBC driver. JPype starts a JVM inside the Python process and exposes the Java APIs as plain Python objects. While the convenience of use is really nice, this Java-Python bridge sadly comes at a high serialisation cost.</p> Sun, 17 Nov 2019 00:00:00 +0000 https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html Taking DuckDB for a spin <p><em>TL;DR: Recently, <a href="https://www.duckdb.org/">DuckDB</a>, a database that promises to become the SQLite-of-analytics, was released and I took it for an initial test drive</em>. 
Install it via <code class="language-plaintext highlighter-rouge">conda install python-duckdb</code> or <code class="language-plaintext highlighter-rouge">pip install duckdb</code>.</p> Sat, 19 Oct 2019 00:00:00 +0100 https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html How we build Apache Arrow's manylinux wheels <p>Apache Arrow is provided for Python users through two package managers, <code class="language-plaintext highlighter-rouge">pip</code> and <code class="language-plaintext highlighter-rouge">conda</code>. The first mechanism, providing binary, pip-installable Python wheels, is <a href="https://lists.apache.org/thread.html/128a2bec285ad45aa4189ebb39a15b39dcf6d91c4ab0278ff4f7cdea@%3Cdev.arrow.apache.org%3E">currently unmaintained as highlighted on the mailing list</a>. There have been shoutouts for help, e.g. <a href="https://twitter.com/ApacheArrow/status/1163919996214501377">on Twitter</a>, that we need new contributors who look after the builds. We sadly cannot point to all issues that arise; most issues surface slowly as people use new releases. Thus we need people who look after the wheel builds and understand what is done to provide these binaries to the end-user. Having worked on the Linux wheels for quite some time, I thought the best way to hand them over would be to give an introduction to the current build process.</p> Sun, 15 Sep 2019 00:00:00 +0100 https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html Writing a boolean array for pandas that can deal with missing values <p>When working with missing data in <code class="language-plaintext highlighter-rouge">pandas</code>, one often runs into issues as the main way to handle it is to convert data into <code class="language-plaintext highlighter-rouge">float</code> columns. 
<code class="language-plaintext highlighter-rouge">pandas</code> provides efficient/native support for boolean columns through the <code class="language-plaintext highlighter-rouge">numpy.dtype('bool')</code>. Sadly, this <code class="language-plaintext highlighter-rouge">dtype</code> only supports <code class="language-plaintext highlighter-rouge">True/False</code> as possible values and no possibility for storing missing values. Additionally, <code class="language-plaintext highlighter-rouge">numpy</code> uses a whole byte to store the <code class="language-plaintext highlighter-rouge">True/False</code> information while a single bit would be sufficient.</p> Mon, 02 Sep 2019 00:00:00 +0100 https://uwekorn.com/2019/09/02/boolean-array-with-missings.html https://uwekorn.com/2019/09/02/boolean-array-with-missings.html Why the NYC TLC trip record data is a nice training dataset for Data Engineers <p>The <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York City Taxi &amp; Limousine Commission Trip Record Data</a> is a really nice dataset to get started with Data Engineering or teaching it. It has several nice properties that make it quite useful that we will show in this article. We will look at this data using only <code class="language-plaintext highlighter-rouge">pandas</code>, not introducing any other tooling. Many properties are nice for teaching different problems in Data Engineering. Still, these properties are not so dominant that you would have to deal with all of them immediately but you can introduce concept-by-concept by using the data as an example problem.</p> Thu, 22 Aug 2019 00:00:00 +0100 https://uwekorn.com/2019/08/22/why-the-nyc-trd-is-a-nice-training-dataset.html https://uwekorn.com/2019/08/22/why-the-nyc-trd-is-a-nice-training-dataset.html Data Engineers: The best friends of Data Scientists you forgot to hire. <p>At the moment in Computer Science, there are two hot topics: AI and Blockchain. 
Behind these two buzzwords, there are industries striving to build successful products. Currently, I work in the sector often labelled as AI. Usually, it is also described with other terms like Machine Learning or Big Data. In this sector, the most sought-after job at the moment is that of a Data Scientist. Although the hype started years ago already, it still is in full swing. Recently, <a href="https://www.bloomberg.com/news/articles/2018-05-18/-sexiest-job-ignites-talent-wars-as-demand-for-data-geeks-soars">Bloomberg published an article describing it as America’s hottest job</a>. Everyone, not only the tech sector, is hiring Data Scientists. Even old-fashioned SMBs are looking to hire a data scientist. The demand for these people has also led to many jobs being renamed to Data Scientist. Furthermore, a lot of people are describing themselves as Data Scientists. Many jobs did not change; they were merely merged under a single name. Despite the variety, we assume that a Data Scientist has some kind of maths background and is able to code.</p> Wed, 13 Feb 2019 00:00:00 +0000 https://uwekorn.com/2019/02/13/data-engineers-the-best-friends-of-data-scientists-you-forgot.html https://uwekorn.com/2019/02/13/data-engineers-the-best-friends-of-data-scientists-you-forgot.html Data Science I/O - A baseline benchmark for 2019 <p>Data Science and Machine Learning are tasks that have their own requirements on I/O. Like many other tasks, they start out on tabular data in most cases. In contrast to a typical reporting task, they don’t work on aggregates but require the data on the most granular level. 
Some machine learning algorithms are able to directly work on aggregates but most workflows pass over the data in its most granular form.</p> Sun, 27 Jan 2019 00:00:00 +0000 https://uwekorn.com/2019/01/27/data-science-io-a-baseline-benchmark.html https://uwekorn.com/2019/01/27/data-science-io-a-baseline-benchmark.html PyFlame: profiling running Python processes <p>Identifying performance bottlenecks in long-running processes often involves careful instrumentation ahead of time or guessing where the root of the problem may be. A very welcome set of tools are the ones that help you diagnose problems of live systems without modifying them. One important tool I recently came across is the <a href="https://github.com/uber/pyflame">pyflame</a> profiler.</p> Fri, 05 Oct 2018 00:00:00 +0100 https://uwekorn.com/2018/10/05/pyflame.html https://uwekorn.com/2018/10/05/pyflame.html Use Numba to work with Apache Arrow in pure Python <p><a href="https://arrow.apache.org/">Apache Arrow</a> is an in-memory format for columnar data. In more “plain” English, it is a standard on how to store DataFrames/tables in memory, independent of the programming language. One of its most prominent uses is for the <code class="language-plaintext highlighter-rouge">@pandas_udf</code> decorator in <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html">Apache Spark</a> to move data quickly between Scala and Python/pandas.</p> Fri, 03 Aug 2018 00:00:00 +0100 https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html AHL Python Hackathon April 2018 <p>Three weeks ago MAN AHL organised an <a href="https://www.ahl.com/hackathon">open-source hackathon</a> at their London office. As part of the hackathon, people were meant to contribute to one of the PyData artifacts they regularly use. 
To support them in making their first contribution, AHL also arranged for several core committers of open-source projects to be present at the event. I joined in as the representative of the <a href="https://arrow.apache.org/">Apache Arrow</a> project.</p> Sat, 19 May 2018 00:00:00 +0100 https://uwekorn.com/2018/05/19/ahl-hackathon.html https://uwekorn.com/2018/05/19/ahl-hackathon.html Play interactively with Apache Arrow C++ in xeus-cling <p><em>Often, we use <code class="language-plaintext highlighter-rouge">pyarrow</code> in a Jupyter Notebook during work. With the <code class="language-plaintext highlighter-rouge">xeus-cling</code> kernel, we can also use the C++ APIs directly in an interactive fashion in Jupyter.</em></p> Sun, 17 Dec 2017 00:00:00 +0000 https://uwekorn.com/2017/12/17/play-interactively-with-arrow-cpp-in-xeus-cling.html https://uwekorn.com/2017/12/17/play-interactively-with-arrow-cpp-in-xeus-cling.html Akka Streams for extracting Wikipedia Articles <p><em>Use Akka Streams as a new technique to extract specific articles from the Wikipedia xml dump into single files without the need to fit all data into RAM.</em></p> Wed, 24 Feb 2016 00:00:00 +0000 https://uwekorn.com/2016/02/24/use-akka-streams-to-extract-wikipedia-articles.html https://uwekorn.com/2016/02/24/use-akka-streams-to-extract-wikipedia-articles.html Beats Music Support in Tomahawk (and the long journey on how we got there) <p><em>tl;dr: With the latest nightlies (<a href="http://download.tomahawk-player.org/nightly/windows/tomahawk-latest.exe">Win</a>, <a href="http://download.tomahawk-player.org/nightly/mac/Tomahawk-latest.dmg">Mac</a>) you can now use your Beats Music Subscription in <a href="http://www.tomahawk-player.org/">Tomahawk</a>. To use it just install the <a href="http://teom.org/axes/nightly/beatsmusic-0.1.1.axe">Beats Music Resolver</a>. 
Although Beats has a nice API, supporting it was a tough cruise through our underlying multimedia stack.</em></p> Fri, 18 Jul 2014 00:00:00 +0100 https://uwekorn.com/2014/07/18/beats-music-in-tomahawk.html https://uwekorn.com/2014/07/18/beats-music-in-tomahawk.html How to get global media keys support for Tomahawk in XFCE4 <p>Although there seems to be no native support for controlling a media player via the <a href="http://www.mpris.org">MPRIS specification</a> in <a href="http://www.xfce.org/">XFCE</a>, you can still set up global shortcuts to use the media keys on your keyboard to control Tomahawk regardless of which application currently has focus.</p> Thu, 03 Jul 2014 00:00:00 +0100 https://uwekorn.com/2014/07/03/tomahawk-xfce4-media-keys.html https://uwekorn.com/2014/07/03/tomahawk-xfce4-media-keys.html Replace QJson with Qt's own JSON handling in Qt5 <p><em>tl;dr: A simple wrapper to use QJson for Qt4 and the built-in JSON parser for Qt5 so that QJson is not required if built with Qt5: <a href="https://github.com/xhochy/qjson-qt5json-wrapper">qjson-qt5json-wrapper</a> (MIT-licensed, no <code class="language-plaintext highlighter-rouge">#ifdef</code> in your code).</em></p> Thu, 29 May 2014 00:00:00 +0100 https://uwekorn.com/2014/05/29/qjson-wrapper.html https://uwekorn.com/2014/05/29/qjson-wrapper.html Using Tomahawk resolvers in node.js <p><em>tl;dr: I wrote a node.js module so that you can use packaged Tomahawk AXE archives in your node.js application for querying music services with a unified interface, see <a href="https://npmjs.org/package/tomahawkjs">node-tomahawkjs</a></em></p> Tue, 28 Jan 2014 00:00:00 +0000 https://uwekorn.com/2014/01/28/node-tomahawkjs.html https://uwekorn.com/2014/01/28/node-tomahawkjs.html