We'd like to have a central place to collect knowledge and opinions about data structures for statistics, finance, econometrics, and other related application domains. Since a number of projects having to do with this problem domain have appeared in recent times, having constructive discussions about the various design choices made would be beneficial to all parties.

Packages of interest

NumPy structured (record) arrays
la (larry)
tabular
scikits.timeseries
scikits.statsmodels
pandas (Talk on pandas from PyCon 2010)
Biopython
divisi

Comparison of packages

package name	pandas	la
web site	http://pandas.googlecode.com	http://larry.sourceforge.net
license	New BSD	Simplified BSD
programming languages	Python, Cython	Python, Cython
required dependencies	Python, Numpy, dateutil	Python, Numpy
optional dependencies	scikits.statsmodels, nose, scipy	h5py, Scipy, nose
build dependencies	C-compiler	optional: C-compiler
year started (open source)	2008 (2009)	2008 (2010)
data objects	1d: Series; 2d: DataFrame, DataMatrix; 3d: WidePanel, LongPanel	nd: larry
Number of dimensions supported	1-3d	nd > 0d
mutability	data mutable, expandable on 1 axis	data and label mutable; axis insertable
data container	1d: ndarray; 2d: dict of Series or 2D ndarray (DataMatrix); 3d: ndarray	ndarray
data types	homogenous, DataFrame/DataMatrix support heterogeneous	homogenous: float, int, str, object
index/label container	ndarray (Index class)	lists
index/label types	heterogenous, hashable	heterogenous, hashable
index/label constraints	unique along any one axis, hashable	unique along any one axis, hashable
missing values	NaN (float) or None (object)	NaN (float), partial support: "" (str), None (object)
binary operations on two data objects	union of indices	intersection of labels
IO	pickle, partial CSV	HDF5, partial support for CSV

pandas

pandas started in 2008 as a set of data structures for quantitative finance applications. It's original intent was to improve on ideas found in R's data.frame structure (hence the DataFrame class) and to enable Python to be a better tool for building robust systems. For example, the R data.frame does not support automatic data alignment and this is often desirable. Development on pandas proceeded with the following goals:

Want to make it easy to interactively explore the data in memory, particularly in an IPython session. R for example is quite good at this as is MATLAB and other libraries / research environments.
The user should not need to become a Python expert to become an expert in the use of pandas (in other words, where possible it tries not to make your life difficult)
No need to support higher than 3-dimensional data since such data sets are seldom found in statistical / financial applications. For this reason I avoided a generic n-dimensional approach (like larry, for example)
Performance is extremely important. This led to significant amounts of Cython code to optimize core operations (more on this below).

At a glance, the two workhorse data objects of pandas are the Series and the DataFrame:

from pandas import *
s = Series([0, 1, 2], index=['a', 'b', 'c'])
df = DataFrame({'col1' : s, 'col2' : s * 2})
>>> s
a    0
b    1
c    2
>>> df
     col1           col2
a    0              0
b    1              2
c    2              4
>>> df['col3'] = 6
>>> df
     col1           col2           col3
a    0              0              6
b    1              2              6
c    2              4              6

Series behaves like both a dict and an ndarray. Binary operations (+, -, etc.) match on labels. For R users, DataFrame is like a data.frame + automatic data alignment.

Series docs
DataFrame docs

larry

Here's larry in schematic form:

             date1    date2    date3
    'AAPL'   209.19   207.87   210.11
y = 'IBM'    129.03   130.39   130.55
    'DELL'    14.82    15.11    14.94

The larry above is stored internally as a Numpy array and a list of lists:

y.label = [['AAPL', 'IBM', 'DELL'], [date1, date2, date3]]
y.x = np.array([[209.19, 207.87, 210.11],
                [129.03, 130.39, 130.55],
                [ 14.82,  15.11,  14.94]])

A larry can have any number of dimensions except zero. Here, for example, is one way to create a one-dimensional larry:

>>> import la
>>> y = la.larry([1, 2, 3])

In the statement above the list is converted to a Numpy array and the labels default to range(n), where *n* in this case is 3.

For the most part larry acts like a Numpy array. And, whenever you want, you have direct access to the Numpy array that holds your data. For example if you have a function, *myfunc*, that works on Numpy arrays and doesn't change the shape or ordering of the array, then you can use it on a larry, *y*, like this:

y.x = myfunc(y.x)

One of the design goals of the larry data object is to behave in a way that a Numpy array user would expect. Here's a brief comparison to Numpy arrays:

Numpy	la
`arr = np.array([[1, 2], [3, 4]])`	`lar = larry([[1, 2], [3, 4]])`
`np.nansum(arr)`	`lar.sum()`
`arr.shape`, `arr.dtype`, `arr.ndim`, `arr.T`	`lar.shape`, `lar.dtype`, `lar.ndim`, `lar.T`
`arr.astype(float)`	`lar.astype(float)`
`arr1 + arr2`	`lar1 + lar2`
`arr[:,0]`	`lar[:,0]`
nd --> (nd + 1)	`lar.insertaxis(axis, label)`

Design and performance issues for data structures

For each item, we'd like to address the following points:

What is a relevant example (with code snippets)?
How are existing packages solving (or not solving) the problem?
For performance-sensitive operations, how fast is each package (provide a code snippet)? We might also try to quantify the overhead using a particular data structure over vanilla NumPy areas (homogeneous or structured)

Creation of data objects

One of the most basic operations is the creation of a data object. Here are examples of how to create three different data objects: datarray.DataArray, pandas.DataMatrix, and la.larry.

First let's set up the data we will use to create the data objects:

>>> import numpy as np
>>> N = 100
>>> arr = np.random.rand(N, N)
>>> idx1 = map(str, range(N))
>>> idx2 = map(str, range(N))

Create a datarray.DataArray:

>>> import datarray
>>> datarray.__version__
    '0.0.4'
>>> timeit datarray.DataArray(arr, [('row', idx1), ('col', idx2)])
    10000 loops, best of 3: 153 us per loop

Create a pandas.DataMatrix:

>>> import pandas
>>> pandas.__version__
    '0.2.1.dev'
>>> timeit pandas.DataMatrix(arr, idx1, idx2)
    10000 loops, best of 3: 50.1 us per loop

Create a la.larry:

>>> import la
>>> la.__version__
    '0.5.0dev'
>>> timeit la.larry(arr, [idx1, idx2])
    100000 loops, best of 3: 14 us per loop

Most of the time used to create a larry is spent doing an integrity check of the labels. If you skip the integrity check, you get a 10x speed up:

>>> timeit la.larry(arr, [idx1, idx2], integrity=False)
    1000000 loops, best of 3: 1.33 us per loop

Tabular data, mutability

Many statistical data sets arrive in tabular (2D) form-- typically a collection of vectors with associated labels. R provides the mutable data.frame class:

> df <- data.frame(a=rnorm(5), b=rnorm(5))
> df
           a           b
1 -0.4454339 -0.87899052
2  0.1119211  0.08378914
3  0.5823613  1.51314787
4 -0.4235883  1.51682599
5 -0.4429238 -1.86833929
> df$a
[1] -0.4454339  0.1119211  0.5823613 -0.4235883 -0.4429238
> df$c <- 1
> df
           a           b c
1 -0.4454339 -0.87899052 1
2  0.1119211  0.08378914 1
3  0.5823613  1.51314787 1
4 -0.4235883  1.51682599 1
5 -0.4429238 -1.86833929 1

For users of other statistical environments, I'd be interested to hear about the tabular data structures and what they can do. Is such mutability useful? I've found it quite useful in cases like this. The R data.frame does not provide any sort of row-oriented data alignment (see below), but it's worth asking what are the features of such a 2D data structure that are the most useful.

The pandas code for the above is nearly identical, with pythonic notation of course:

In [22]: df = DataFrame({'a' : randn(5), 'b' : randn(5)})

In [23]: df
Out[23]:
     a              b
0    -1.894         -1.373
1    -0.8486        -0.5785
2    1.838          -3.761
3    -0.6144        -0.7812
4    0.8473         0.9544


In [24]: df['a']
Out[24]:
0    -1.89423416608
1    -0.848596359606
2    1.83799383312
3    -0.614350513146
4    0.84729188782

In [25]: df['c'] = 1

In [26]: df
Out[26]:
     a              b              c
0    -1.894         -1.373         1
1    -0.8486        -0.5785        1
2    1.838          -3.761         1
3    -0.6144        -0.7812        1
4    0.8473         0.9544         1

In the context of misaligned data, I've seen a lot of pandas users use DataFrame purely as a container for aligning labeled data:

In [74]: df
Out[74]:
                       A              B
2000-01-03 00:00:00    -0.2306        0.3456
2000-01-04 00:00:00    -0.882         -0.1626
2000-01-05 00:00:00    0.2474         -0.1887
2000-01-06 00:00:00    -0.2704        0.1398
2000-01-07 00:00:00    1.498          1.628
2000-01-10 00:00:00    -1.645         0.09256
2000-01-11 00:00:00    0.5666         -1.128
2000-01-12 00:00:00    -0.5082        -0.5381
2000-01-13 00:00:00    0.2205         -0.5728
2000-01-14 00:00:00    -0.4916        -0.3333


In [76]: s
Out[76]:
2000-01-03 00:00:00    -0.115323455775
2000-01-04 00:00:00    -0.441017336819
2000-01-05 00:00:00    0.123677568341
2000-01-06 00:00:00    -0.135191054412
2000-01-07 00:00:00    0.749180511001
2000-01-10 00:00:00    -0.822250171143
2000-01-11 00:00:00    0.283300117597
2000-01-12 00:00:00    -0.254115521874

In [77]: df['C'] = s

In [78]: df
Out[78]:
                       A              B              C
2000-01-03 00:00:00    -0.2306        0.3456         -0.1153
2000-01-04 00:00:00    -0.882         -0.1626        -0.441
2000-01-05 00:00:00    0.2474         -0.1887        0.1237
2000-01-06 00:00:00    -0.2704        0.1398         -0.1352
2000-01-07 00:00:00    1.498          1.628          0.7492
2000-01-10 00:00:00    -1.645         0.09256        -0.8223
2000-01-11 00:00:00    0.5666         -1.128         0.2833
2000-01-12 00:00:00    -0.5082        -0.5381        -0.2541
2000-01-13 00:00:00    0.2205         -0.5728        NaN
2000-01-14 00:00:00    -0.4916        -0.3333        NaN

To combine two larrys, use merge:

>> lar1 = larry(randn(5,2))
>> lar2 = larry(randn(2,1), [[0, 4], ['c']])

>> lar1.merge(lar2)
label_0
    0
    1
    2
    3
    4
label_1
    0
    1
    c
x
array([[ 0.19271985,  0.38086157,  0.89416946],
       [ 0.17987281, -0.20143654,         NaN],
       [ 0.56567726,  2.13787551,         NaN],
       [-1.92165255,  0.22968882,         NaN],
       [ 1.14440607,  0.07242789, -0.56150006]])

If the two larrys overlap, you can use the update option:

>> lar3 = larry(np.ones((2,1)), [[0, 4], [0]])
>> lar1.merge(lar3, update=True)
label_0
    0
    1
    2
    3
    4
label_1
    0
    1
x
array([[ 1.        ,  0.38086157],
       [ 0.17987281, -0.20143654],
       [ 0.56567726,  2.13787551],
       [-1.92165255,  0.22968882],
       [ 1.        ,  0.07242789]])

Data alignment

Union versus intersection of data labels
Performance for differently-indexed data

Here is a pretty pathological but realistic example:

# Setup
import numpy as np
import pandas
import la
N = 10000
arr1 = np.random.randn(N)
arr2 = np.random.randn(N)
idx1 = range(N)
idx2 = range(N)[::-1]

# pandas
series1 = pandas.Series(arr1, index=idx1)
series2 = pandas.Series(arr2, index=idx2)

# larry
lar1 = la.larry(arr1, [idx1])
lar2 = la.larry(arr2, [idx2])

Here is with latest svn/bzr trunks on 5/30/10:

# aligned
In [5]: timeit series1 + serie1
10000 loops, best of 3: 36.4 us per loop
In [6]: timeit lar1 + lar1
10000 loops, best of 3: 72.5 us per loop

# unaligned
In [9]: timeit series1 + series2
100 loops, best of 3: 2.72 ms per loop
In [8]: timeit lar1 + lar2
100 loops, best of 3: 4.97 ms per loop

A 2d example:

# Setup
import numpy as np
import pandas
import la
N = 1000
arr1 = np.random.randn(N, N)
arr2 = np.random.randn(N, N)
idx1 = range(N)
idx2 = range(N)[::-1]

# pandas
dma1 = pandas.DataMatrix(arr1, idx1, idx1)
dma2 = pandas.DataMatrix(arr2, idx2, idx2)

# larry
lar1 = la.larry(arr1, [idx1, idx1])
lar2 = la.larry(arr2, [idx2, idx2])

Here is with latest svn/bzr trunks on 5/30/10:

# aligned
>> timeit dma1 + dma1
100 loops, best of 3: 16 ms per loop
>> timeit lar1 + lar1
100 loops, best of 3: 15.9 ms per loop

# unaligned
>> timeit dma1 + dma2
10 loops, best of 3: 93.5 ms per loop
>> timeit lar1 + lar2
10 loops, best of 3: 91.7 ms per loop

Reduction operations

Find the maximum value in a 2d data object of shape (1000, 1000):

import pandas
import la

N = 1000
arr = np.random.rand(N, N)
dma = pandas.DataMatrix(arr, range(N), range(N))
lar = la.larry(arr)

# max

dma.max().max()
0.99999943789619117
lar.max()
0.99999943789619117

>> timeit dma.max().max()
10 loops, best of 3: 18 ms per loop
>> timeit lar.max()
100 loops, best of 3: 14.1 ms per loop
>> timeit np.nanmax(arr)
100 loops, best of 3: 14.1 ms per loop

IO

Users of data objects often need the ability to quickly archive their data. Here is the time taken to save and load two data objects of shape (N, N):

Could someone run this benchmark on a Linux machine with NumPy > 1.3? I (Wes) ran these on Windows and I don't trust the OS.

In [8]: roundtrip_archive(100)
Numpy (npz)    0.0107 seconds
larry (HDF5)   0.0064 seconds
pandas (HDF5)  0.0050 seconds

In [9]: roundtrip_archive(500)
Numpy (npz)    0.1026 seconds
larry (HDF5)   0.0908 seconds
pandas (HDF5)  0.0836 seconds

In [10]: roundtrip_archive(1000)
Numpy (npz)    0.5945 seconds
larry (HDF5)   0.3517 seconds
pandas (HDF5)  0.2611 seconds

In [11]: roundtrip_archive(2000, iterations=2)
Numpy (npz)    4.6487 seconds
larry (HDF5)   0.8843 seconds
pandas (HDF5)  0.7653 seconds

In [12]: roundtrip_archive(4000, iterations=1)
Numpy (npz)   19.7412 seconds
larry (HDF5)   2.9064 seconds
pandas (HDF5)  3.3371 seconds

And here is the code used:

This needs pandas r180 and PyTables.

import time, os
import numpy as np

import la
import pandas

def timeit(f, iterations):
    start = time.clock()

    for i in xrange(iterations):
        f()

    return time.clock() - start

def roundtrip_archive(N, iterations=10):

    # Create data
    arr = np.random.randn(N, N)
    lar = la.larry(arr)
    dma = pandas.DataMatrix(arr, range(N), range(N))

    # filenames
    filename_numpy = 'c:/temp/numpy.npz'
    filename_larry = 'c:/temp/archive.hdf5'
    filename_pandas = 'c:/temp/pandas_tmp'

    # Delete old files
    try:
        os.unlink(filename_numpy)
    except:
        pass
    try:
        os.unlink(filename_larry)
    except:
        pass
    try:
        os.unlink(filename_pandas)
    except:
        pass

    # Time a round trip save and load
    numpy_f = lambda: numpy_roundtrip(filename_numpy, arr, arr)
    numpy_time = timeit(numpy_f, iterations) / iterations

    larry_f = lambda: larry_roundtrip(filename_larry, lar, lar)
    larry_time = timeit(larry_f, iterations) / iterations

    pandas_f = lambda: pandas_roundtrip(filename_pandas, dma, dma)
    pandas_time = timeit(pandas_f, iterations) / iterations

    print 'Numpy (npz)   %7.4f seconds' % numpy_time
    print 'larry (HDF5)  %7.4f seconds' % larry_time
    print 'pandas (HDF5) %7.4f seconds' % pandas_time

def numpy_roundtrip(filename, arr1, arr2):
    np.savez(filename, arr1=arr1, arr2=arr2)
    npz = np.load(filename)
    arr1 = npz['arr1']
    arr2 = npz['arr2']

def larry_roundtrip(filename, lar1, lar2):
    io = la.IO(filename)
    io['lar1'] = lar1
    io['lar2'] = lar2
    lar1 = io['lar1']
    lar2 = io['lar2']

def pandas_roundtrip(filename, dma1, dma2):
    from pandas.io.pytables import HDFStore
    store = HDFStore(filename)
    store['dma1'] = dma1
    store['dma2'] = dma2
    dma1 = store['dma1']
    dma2 = store['dma2']

def pandas_roundtrip_pickle(filename, dma1, dma2):
    dma1.save(filename)
    dma1 = pandas.DataMatrix.load(filename)
    dma2.save(filename)
    dma2 = pandas.DataMatrix.load(filename)

Sometimes you only want to load a small part of a large data object from your archive. larry supports loading pieces of a data object. Here are some examples:

io['lar2'][:,-1]
io['lar2'][0,:]

Joining / merging data sets

In pandas, it's very similar to elsewhere

In [65]: df1
Out[65]:
                       A              B
2000-01-03 00:00:00    -0.1174        -0.941
2000-01-04 00:00:00    -0.6034        -0.008094
2000-01-05 00:00:00    -0.3816        -0.9338
2000-01-06 00:00:00    -0.3298        -0.9548
2000-01-07 00:00:00    0.9576         0.4652
2000-01-10 00:00:00    -0.7208        -1.131
2000-01-11 00:00:00    1.568          0.8498
2000-01-12 00:00:00    0.3717         -0.2323
2000-01-13 00:00:00    -1.428         -1.997
2000-01-14 00:00:00    -1.084         -0.271

In [67]: df2
Out[67]:
                       C              D
2000-01-03 00:00:00    0.2833         -0.1937
2000-01-05 00:00:00    1.868          1.207
2000-01-07 00:00:00    -0.8586        -0.7367
2000-01-11 00:00:00    2.121          0.9104
2000-01-13 00:00:00    0.7856         0.9063


In [68]: df1.join(df2)
Out[68]:
                       A              B              C              D
2000-01-03 00:00:00    -0.1174        -0.941         0.2833         -0.1937
2000-01-04 00:00:00    -0.6034        -0.008094      NaN            NaN
2000-01-05 00:00:00    -0.3816        -0.9338        1.868          1.207
2000-01-06 00:00:00    -0.3298        -0.9548        NaN            NaN
2000-01-07 00:00:00    0.9576         0.4652         -0.8586        -0.7367
2000-01-10 00:00:00    -0.7208        -1.131         NaN            NaN
2000-01-11 00:00:00    1.568          0.8498         2.121          0.9104
2000-01-12 00:00:00    0.3717         -0.2323        NaN            NaN
2000-01-13 00:00:00    -1.428         -1.997         0.7856         0.9063
2000-01-14 00:00:00    -1.084         -0.271         NaN            NaN

In [70]: df1.join(df2, how='inner')
Out[70]:
                       A              B              C              D
2000-01-03 00:00:00    -0.1174        -0.941         0.2833         -0.1937
2000-01-05 00:00:00    -0.3816        -0.9338        1.868          1.207
2000-01-07 00:00:00    0.9576         0.4652         -0.8586        -0.7367
2000-01-11 00:00:00    1.568          0.8498         2.121          0.9104
2000-01-13 00:00:00    -1.428         -1.997         0.7856         0.9063

The index (row labels) are the default key for joining, but a column can also be used for a similar SQL-like join:

In [73]: df2
Out[73]:
                       C              D              key
2000-01-03 00:00:00    0.2833         -0.1937        0
2000-01-05 00:00:00    1.868          1.207          1
2000-01-07 00:00:00    -0.8586        -0.7367        0
2000-01-11 00:00:00    2.121          0.9104         1
2000-01-13 00:00:00    0.7856         0.9063         0


In [74]: df3 = DataFrame({'code' : {0 : 'foo', 1 : 'bar'}})

In [75]: df3
Out[75]:
     code
0    foo
1    bar


In [76]: df2.join(df3, on='key')
Out[76]:
                       C              D              code           key
2000-01-03 00:00:00    0.2833         -0.1937        foo            0
2000-01-05 00:00:00    1.868          1.207          bar            1
2000-01-07 00:00:00    -0.8586        -0.7367        foo            0
2000-01-11 00:00:00    2.121          0.9104         bar            1
2000-01-13 00:00:00    0.7856         0.9063         foo            0

Groupby-type operations

Time series / date functionality

Handling missing data

Statistics in computational biology and bioinformatics

I am writing here in response to Vincent Davis' message on the Biopython mailing list about feedback on statistical data structure design issues in biopython. While I have been a Biopython developer for many years, different users have different needs and perspectives, so what I write here is just my personal opinion rather than an official position of the Biopython consortium.

With respect to Python/NumPy/SciPy, computational biology and bioinformatics is different from other numerical sciences in three important ways:

Bioinformatics is dominated by Perl, and therefore people new to bioinformatics tend to choose Perl as a default rather than Python. Many people do their statistical analysis by a combination of Perl and R. For us in Biopython this means that we should make using Python as straightforward as possible.
People in bioinformatics come from a wide range of educational backgrounds. Some come from hard-core computational sciences such as physics, mathematics, or computer science, whereas others have a more traditional background in molecular biology and may not have a solid background in numerics. Some people starting in bioinformatics do not know what a compiler is (but otherwise they can be great scientists). Unfortunately, this means that tools available in SciPy are often not accessible simply because it is too difficult to install.
Some numerical libraries may be more important in bioinformatics than in other fields and vice versa. For example, to be able to calculate p-values it is important to have a library of special functions, whereas linear algebra may be less important.

What I really miss in NumPy are special functions and some basic statistical functions such as kernel density estimators. I know that these are exist in SciPy, but because of the second point this won't do me much good. I hope that some day at least the special functions can be made available in NumPy instead of SciPy.

See below for a discussion of MaskedArrays (numpy.ma)

MaskedArrays commentary

Masked arrays are slower than ndarrays since they do more. Out of the packages listed above, scikits.timeseries is the only one that uses MaskedArrays heavily.

Here's some performance for a 1000x10 array (first number is the time ratio, since my CPU is undoubtedly faster or slower than yours when you run these benchmarks):

Performance comparison
Operation	ndarray	MaskedArray	larry	pandas
arr * 2	1.0 (17 us)	4.95 (84.2us)	1.94 (32.9 us)	1.85 (31.6us)
arr * arr	1.0 (15.9 us)	6.42 (102us)	1.88 (29.9 us)	1.99 (31.7us)

code for these data structures

arr = randn(1000, 10)
marr = np.ma.masked_array(arr)
larr = larry(arr)
parr = DataMatrix(arr, range(1000), range(10))

The slowdown for using MaskedArrays does not seem acceptable for general purpose use-- however in applications where one needs to be able to distinguish between NaN (missing) and NaN (you took sqrt(-1)) then perhaps they should be used. But to pay 4-7x cost for every operation seems excessive.