Why you should use siuba, the dplyr of Python

2020/08/06

Introduction

If you find yourself doing data analysis in Python, you should check out siuba. siuba is a port of the R package dplyr, but you don’t need to know any R to get started.

siuba implements a domain-specific language for querying data. You can use siuba to query both local Python data frames and remote SQL databases.

Why a new “language” and not the pandas API?

  1. concise lambdas and tab-completion

  2. consistent output for common operations

  3. unified API for grouped and ungrouped data frames

All of these advantages and more are described by Michael Chow in his list of key features.

Let’s walk through each case with some example data.

from siuba import *
from siuba.data import mtcars

import pandas as pd
bridges = pd.read_csv("maryland_bridges.csv")

Concise lambdas with tab-complete

siuba introduces the symbol _. You can think of _ as a placeholder or pronoun for the actual name of the data. So instead of writing bridges.yr_built you can write _.yr_built in any data frame or siuba function.

In many ways, _ behaves like a concise lambda:

bridges[_.yr_built > 1990]
bridges[lambda _: _.yr_built > 1990]

And this new “lambda” doesn’t just save a few characters: it also supports tab-completion in IPython and Jupyter Notebooks, suggesting column names as well as pandas data frame and series methods.

This was my small contribution to siuba, with lots of help from Michael Chow. Add the following import to try it out:

import siuba.experimental.completer

Consistent output for common operations

Calculating the size of groups is a very common operation. siuba makes this and many other queries straightforward by providing helpers like count and distinct, and by simplifying the API to five common actions.

siuba’s output is also consistent: it never returns a data frame with a MultiIndex, which has its own API and idiosyncrasies.

(bridges >>
    count(decade=10 * (_.yr_built // 10), sort=True) >>
    head())
##    decade    n
## 0    1970  404
## 1    1960  319
## 2    1980  304
## 3    1990  279
## 4    1950  261

# The same query written with plain pandas
(
  bridges
  .assign(decade=lambda _: 10 * (_.yr_built // 10))
  .groupby("decade")
  .agg(n = ("decade", "size"))
  .reset_index()
  .sort_values("n", ascending=False)
  .head()
)
##     decade    n
## 12    1970  404
## 11    1960  319
## 13    1980  304
## 14    1990  279
## 10    1950  261
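The MultiIndex point is easy to reproduce in plain pandas. The sketch below uses a small made-up data frame (the `toy` data and its `county` and `length` columns are assumptions for illustration, not part of the bridges file) to show two common ways a MultiIndex appears, and the named-aggregation plus `reset_index()` pattern that flattens the result again.

```python
import pandas as pd

# Hypothetical toy data, not the Maryland bridges file
toy = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "decade": [1960, 1970, 1960, 1960],
    "length": [10.0, 20.0, 30.0, 50.0],
})

# Grouping by two keys puts both keys into a MultiIndex on the rows
by_two_keys = toy.groupby(["county", "decade"]).size()
print(type(by_two_keys.index).__name__)

# Several aggregations on one column produce MultiIndex columns
multi_agg = toy.groupby("county").agg({"length": ["mean", "max"]})
print(type(multi_agg.columns).__name__)

# Named aggregation followed by reset_index() returns flat columns
flat = (
    toy.groupby(["county", "decade"])
    .agg(n=("length", "size"), mean_length=("length", "mean"))
    .reset_index()
)
print(list(flat.columns))
```

The flat form is what the pandas version of the count query above builds by hand with `.agg(...)` and `.reset_index()`.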

Unified group API

When working with groups in pandas, you often have to juggle both the grouped and the ungrouped data frame. Additionally, the pandas grouped API differs subtly from the ungrouped one, as with the transform method shown below.

All of siuba’s core actions have consistent behavior in grouped and ungrouped data frames.

(mtcars >>
    group_by(_.cyl) >>
    mutate(hp=_.hp - _.hp.mean()) >>
    ungroup())

# The equivalent in plain pandas, which needs both the grouped and the original data frame
mtcars_cyl = mtcars.groupby("cyl")
(
  mtcars
  .assign(hp=mtcars_cyl.obj.hp - mtcars_cyl.hp.transform("mean"))
)
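The transform subtlety comes down to output shape. As a minimal sketch on made-up data (the `toy` frame below is an assumption, standing in for mtcars), a grouped `mean()` returns one value per group, while `transform("mean")` returns one value per original row, which is what makes the subtraction line up.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars
toy = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 6],
    "hp": [90, 110, 100, 120, 140],
})
grouped = toy.groupby("cyl")

# One value per group, indexed by cyl
group_means = grouped.hp.mean()

# One value per original row, aligned with toy's index
row_means = grouped.hp.transform("mean")

# Centering each row on its group mean needs the row-aligned version
centered = toy.assign(hp=toy.hp - row_means)

print(len(group_means), len(row_means))
print(centered)
```

Subtracting `group_means` directly would align on the `cyl` labels rather than on the row index, so it would not give the per-row result that the siuba `mutate` call expresses.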