Posted on Categories data science, Pragmatic Data Science, TutorialsTags , , , ,

A Richer Category for Data Wrangling

I’ve been writing a lot about a category theory interpretations of data-processing pipelines and some of the improvements we feel it is driving in both the data_algebra and in rquery/rqdatatable.

I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

In the earlier formalism our data transform pipelines were arrows over a category of sets of column names (sets of strings). We will call this the “column equality” version of our category.

These pipelines acted on Pandas tables or SQL tables, with one table marked as special. Marking one table as special (or using a “pointed set” notation) lets us use a nice compositional notation, without having to appeal to something like operads. Treating one table as the one of interest is fairly compatible with data science, as in data science often when working with many tables one is the primary model-frame and the rest are used to join in additional information.

The above formulation was really working well. But we have found a variation of the data_algebra with an even neater formalism.

The data_algebra objects have a very nice interpretation as arrows in a category whose objects are set families described by:

  • a set of required column names.
  • a set of forbidden column names.

We will call this the “set ordered” version of our category.

The arrows a and b compose as a >> b as long as:

  • All of the columns required by b are produced by a.
  • None of the columns forbidden by b are produced by a.

This is still an equality check of domains and co-domains, so as long as we maintain associativity we still have a nice category.

We can illustrate the below.

First we import our modules.

import sqlite3

import pandas

from data_algebra.data_ops import *
from data_algebra.arrow import fmt_as_arrow
import data_algebra.SQLite

We define our first arrow which is a transform that creates a new column x as the sum of the columns a and b.

a = TableDescription(
    table_name='table_a', 
    column_names=['a', 'b']). \
        extend({'c': 'a + b'})

a
TableDescription(
 table_name='table_a',
 column_names=[
   'a', 'b']) .\
   extend({
    'c': 'a + b'})
print(fmt_as_arrow(a))
[
 'table_a':
  at least [ a, b ]
   ->
  at least [ a, b, c ]
]

And we define our second arrow, b, which renames the column a to a new column name x.

b = TableDescription(table_name='table_b', column_names=['a']). \
        rename_columns({'x': 'a'})

b
TableDescription(
 table_name='table_b',
 column_names=[
   'a']) .\
   rename_columns({'x': 'a'})
print(fmt_as_arrow(b))
[
 'table_b':
  at least [ a ] , and none of [ x ]
   ->
  at least [ x ]
]

The rules are met, so we can combine these two arrows.

ab = a >> b

ab
TableDescription(
 table_name='table_a',
 column_names=[
   'a', 'b']) .\
   extend({
    'c': 'a + b'}) .\
   rename_columns({'x': 'a'})
print(fmt_as_arrow(ab))
[
 'table_a':
  at least [ a, b ] , and none of [ x ]
   ->
  at least [ b, c, x ]
]

Notice this produces a new arrow ab with appropriate required and forbidden columns. By associativity (one of the primary properties needed to be a category) we get that the arrow ab has an action on data frames the same as using the a action followed by the b action.

Let’s illustrate that here.

d = pandas.DataFrame({
    'a': [1, 2],
    'b': [30, 40]
})

d
a b
0 1 30
1 2 40
b.act_on(a.act_on(d))
x b c
0 1 30 31
1 2 40 42
ab.act_on(d)
x b c
0 1 30 31
1 2 40 42

.act_on() copies forward all columns consistent with the transform specification and used at the output. Missing columns are excess columns are checked for at the start of a calculation.

excess_frame = pandas.DataFrame({
    'a': [1], 
    'b': [2], 
    'd': [3],
    'x': [4]})

try:
    ab.act_on(excess_frame)
except ValueError as ve:
    print("caught ValueError: " + str(ve))

caught ValueError: Table table_a has forbidden columns: {‘x’}

The .transform() method, on the other hand, copies forward only declared columns.

ab.transform(excess_frame)
x b c
0 1 2 3

Notice in the above that the input x did not interfere with the calculation, and d was not copied forward. The idea is behavior during composition is very close to behavior during action/application, so we find more issues during composition.

However, .transform() does not associate with composition in the set ordered verson of our category, so is not an action of this category. The issue is: we have b.transform(a.transform(d)) is not equal to ab.transform(d) in this category.

b.transform(a.transform(d))
x
0 1
1 2

.transform() does associate with the arrows of the stricter identical column set category we demonstrated earlier, so it is an action of this category.

So we have two categories, .act_on() is the category compatible action for one, and .transform() for the other.

In both cases we still have result-oriented narrowing.

c = TableDescription(
    table_name='table_c', 
    column_names=['a', 'b', 'c']). \
        extend({'x': 'a + b'}). \
        select_columns({'x'})

c

TableDescription( table_name=’table_c’, column_names=[ ‘a’, ‘b’, ‘c’]) .
extend({ ‘x’: ‘a + b’}) .
select_columns([‘x’])

print(fmt_as_arrow(c))

[ ‘table_c’: at least [ a, b, c ] -> at least [ x ] ]

table_c = pandas.DataFrame({
    'a': [1, 2],
    'b': [30, 40],
    'c': [500, 600],
    'd': [7000, 8000]
})

table_c
a b c d
0 1 30 500 7000
1 2 40 600 8000
c.act_on(table_c)
x
0 31
1 42
c.transform(table_c)
x
0 31
1 42

.select_columns() conditions are propagated back through the calculation.

Another useful operator is .drop_columns() which drops columns if they are present, but does not raise an issue if the columns to be removed are already not present. .drop_columns() can be used to guarantee forbidden columns are not present. We could use .act_on() or excess_frame using .drop_columns() as follows.

tdr = describe_table(excess_frame). \
        drop_columns(['x'])

tdr

TableDescription( table_name=’data_frame’, column_names=[ ‘a’, ‘b’, ‘d’, ‘x’]) .
drop_columns([‘x’])

rab = tdr >> ab

rab

TableDescription( table_name=’data_frame’, column_names=[ ‘a’, ‘b’, ‘d’, ‘x’]) .
drop_columns([‘x’]) .
extend({ ‘c’: ‘a + b’}) .
rename_columns({‘x’: ‘a’})

The >> notation is composing the arrows. tdr >> ab is syntactic sugar for ab.apply_to(tdr). Both of these are the arrow composition operations.

rab.act_on(excess_frame)
x b d c
0 1 2 3 3

Remember, the original ab operator rejects excess_frame.

try:
    ab.act_on(excess_frame)
except ValueError as ve:
    print("caught ValueError: " + str(ve))

caught ValueError: Table table_a has forbidden columns: {‘x’}

We can also adjust the input-specification by composing pipelines with table descriptions.

a

TableDescription( table_name=’table_a’, column_names=[ ‘a’, ‘b’]) .
extend({ ‘c’: ‘a + b’})

bigger = TableDescription(
    table_name='bigger', 
    column_names=['a', 'b', 'x', 'y', 'z'])

bigger

TableDescription( table_name=’bigger’, column_names=[ ‘a’, ‘b’, ‘x’, ‘y’, ‘z’])

bigger_a = bigger >> a

bigger_a

TableDescription( table_name=’bigger’, column_names=[ ‘a’, ‘b’, ‘x’, ‘y’, ‘z’]) .
extend({ ‘c’: ‘a + b’})

print(fmt_as_arrow(bigger_a))

[ ‘bigger’: at least [ a, b, x, y, z ] -> at least [ a, b, c, x, y, z ] ]

Notice the new arrow (bigger_a) has a wider input specification. Appropriate checking is performed during the composition.

As always, we can also translate any of our operators to SQL.

db_model = data_algebra.SQLite.SQLiteModel()

print(bigger_a.to_sql(db_model=db_model, pretty=True))

SELECT "a" + "b" AS "c", "a", "y", "z", "x", "b" FROM "bigger"


The SQL translation does not use “*“, so it is similar to .transform() in that it only refers to known columns by name. This means we are safe from extra columns in the source tables. This means if we did derive an action acting on SQL or composition over SQL it would not associate with the data_algebra operator composition (just as .transform() did not).

So the column identity is a model of our SQL translation scheme (needs exact column definitions), and the set ordered category is a model of our in-memory data processing (accepts extra columns). These two systems being slightly different is mitigated by the fact each has an orderly model as a category. So we have tools, in principle, to formally talk about how they relate to each other (again using category theory).

Notice we no longer have to use the arrow-adapter classes (except for formatting), the data_algebra itself has been adjusted to a more direct categorical basis.

And that is some of how the data_algebra works on our new set-oriented category. In this formulation much less annotation is required from the user, while still allowing very detailed record-keeping. The detailed record-keeping lets us find issues while assembling the pipelines, not later when working with potentially large/slow data.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.