Differences between revisions 16 and 17

Deletions are marked like this. Additions are marked like this.
Line 139: Line 139:
This is not ideal, however, because it required us to start with two arrays. Is there not some way of computing {{{dot(A.T, A}}} without making an extra copy? There is support for this in the [http://www.netlib.org/blas/ BLAS] through the method {{{_SYRK}}} method (see http://www.netlib.org/blas/blasqr.pdf). SciPy exposes this through the [http://docs.scipy.org/scipy/docs/scipy.linalg.blas/ scipy.linalg.blas] module (or directly through the [http://docs.scipy.org/scipy/docs/scipy.linalg.cblas/ scipy.linalg.cblas] or [http://docs.scipy.org/scipy/docs/scipy.linalg.fblas/ scipy.linalg.fblas] modules (which can be used with {{{C_CONTIGUOUS}}} arrays and {{{F_CONTIGUOUS}}} arrays respectively). Alas, this function is not yet implemented, however, {{{_GEMM}}} is. Here we use it: This is not ideal, however, because it required us to start with two arrays. Is there not some way of computing {{{dot(A.T, A}}} without making an extra copy? There is support for this in the [http://www.netlib.org/blas/ BLAS] through the method {{{_SYRK}}} method (see http://www.netlib.org/blas/blasqr.pdf). SciPy exposes this through [http://docs.scipy.org/scipy/docs/scipy.linalg.blas/ scipy.linalg.blas] (which can be used with {{{C_CONTIGUOUS}}} arrays and {{{F_CONTIGUOUS}}} arrays respectively). Alas, this function is not yet implemented, however, {{{_GEMM}}} is **it is now (scipy 0.13.0)**. Here we use it:
Line 142: Line 142:
>>> import scipy.linalg.fblas
>>> t = time();C = scipy.linalg.fblas.dgemm(alpha=1.0, a=A.T, b=A.T, trans_b=True);time() - t
>>> import scipy.linalg.blas
>>> t = time();C = scipy.linalg.blas.dgemm(alpha=1.0, a=A.T, b=A.T, trans_b=True);time() - t
Line 147: Line 147:
This gives the same performance as {{{dot}}} but with the advantage that we did not need to make an extra copy. Note that we passed {{{A.T}}} here so that the array was in Fortran order because we used the {{{fblas}}} version. One must be careful about this: passing the wrong type of array will not realize the performance gains. Here, for example, we pass the {{{C_CONTIGUOUS}}} arrays for which copies must me made. This gives the same performance as {{{dot}}} but with the advantage that we did not need to make an extra copy. Note that we passed {{{A.T}}} here so that the array was in Fortran order because {{{blas}}} is the Fortran BLAS. One must be careful about this: passing the wrong type of array will not realize the performance gains. Here, for example, we pass the {{{C_CONTIGUOUS}}} arrays for which copies must me made.
Line 150: Line 150:
>>> t = time();C = scipy.linalg.fblas.dgemm(alpha=1.0, a=A, b=A, trans_a=True);time() - t >>> t = time();C = scipy.linalg.blas.dgemm(alpha=1.0, a=A, b=A, trans_a=True);time() - t

This page collects tips and tricks to increase the speed of your code using numpy/scipy.

For general tips and tricks to improve the performance of your Python programs see http://wiki.python.org/moin/PythonSpeed/PerformanceTips.

Python built-ins vs. numpy functions

Note that the built-in python min function can be much slower (up to 300-500 times) than using the .min() method of an array. I.e.: use x.min() instead of min(x).

The same applies to max.

This is also true for the new any and all functions for Python >=2.5.

Beyond pure Python

Sometimes there are tasks for which pure python code can be too slow.

Possible solutions can be obtained via:

  • hand-written C extensions

  • psyco
  • pyrex
  • ctypes
  • f2py
  • weave
  • swig
  • boost
  • SIP
  • CXX

For a full discussion with examples on performance gains through interfacing with other languages see this article.

Examples

Tips and tricks for specific situations.

Finding the row and column of the min or max value of an array or matrix

A slow, but straightforward, way to find the row and column indices of the minimum value of an array or matrix x:

import numpy as np

def min_ij(x):
    i, j = np.where(x == x.min())
    return i[0], j[0]

This can be made quite a bit faster:

def min_ij(x):
    i, j = divmod(x.argmin(), x.shape[1])
    return i, j

The fast method is about 4 times faster on a 500 by 500 array.

Removing the i-th row and j-th column of a 2d array or matrix

The slow way to remove the i-th row and j-th column from a 2d array or matrix:

import numpy as np

def remove_ij(x, i, j):

    # Remove the ith row
    idx = range(x.shape[0])
    idx.remove(i)
    x = x[idx,:]

    # Remove the jth column
    idx = range(x.shape[1])
    idx.remove(j)
    x = x[:,idx]

    return x

The fast way, because it avoids making copies, to remove the i-th row and j-th column from a 2d array or matrix:

def remove_ij(x, i, j):
    # Row i and column j divide the array into 4 quadrants
    y = x[:-1,:-1]
    y[:i,j:] = x[:i,j+1:]
    y[i:,:j] = x[i+1:,:j]
    y[i:,j:] = x[i+1:,j+1:]
    return y

For a 500 by 500 array the second method is over 25 times faster.

Linear Algebra on Large Arrays

Sometimes the performance or memory behaviour of linear algebra on large arrays can suffer because of copying behind the scenes. Consider for example, computing the dot product of two large matrices with a small result:

>>> N = 1e6
>>> n = 40
>>> A = np.ones((N,n))
>>> C = np.dot(A.T, A)

Although C is only 40 by 40, inspecting the memory usage during the operation of dot will indicate that a copy is being made. The reason is that the dot product uses underlying BLAS operations which depend on the matrices being stored in contiguous C order. The transpose of operation does not effect a copy, but places the transposed array in Fortran order:

>>> A.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
   ...
>>> A.T.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
   ...

If we give dot two matrices that are both C_CONTIGUOUS, then the performance is better:

>>> from time import time
>>> AT_F = np.ones((n,N), order='F')
>>> AT_C = np.ones((n,N), order='C')
>>> t = time();C = np.dot(A.T, A);time() - t
3.9203271865844727
>>> t = time();C = np.dot(AT_F, A);time() - t
3.9461679458618164
>>> t = time();C = np.dot(AT_C, A);time() - t
2.4167969226837158

This is not ideal, however, because it required us to start with two arrays. Is there not some way of computing dot(A.T, A without making an extra copy? There is support for this in the BLAS through the method _SYRK method (see http://www.netlib.org/blas/blasqr.pdf). SciPy exposes this through scipy.linalg.blas (which can be used with C_CONTIGUOUS arrays and F_CONTIGUOUS arrays respectively). Alas, this function is not yet implemented, however, _GEMM is **it is now (scipy 0.13.0)**. Here we use it:

>>> import scipy.linalg.blas
>>> t = time();C = scipy.linalg.blas.dgemm(alpha=1.0, a=A.T, b=A.T, trans_b=True);time() - t
2.412520170211792

This gives the same performance as dot but with the advantage that we did not need to make an extra copy. Note that we passed A.T here so that the array was in Fortran order because blas is the Fortran BLAS. One must be careful about this: passing the wrong type of array will not realize the performance gains. Here, for example, we pass the C_CONTIGUOUS arrays for which copies must me made.

>>> t = time();C = scipy.linalg.blas.dgemm(alpha=1.0, a=A, b=A, trans_a=True);time() - t
5.0581560134887695

There is not much documentation about this, but some useful discussions can be found in the mailing list archives. Another related question is about inplace operations (again, avoiding copying). Some functions provide an option for this, and again, some of the BLAS functions can be used for this. Here are some related discussions:

PerformanceTips (last edited 2013-08-25 13:48:00 by RalfGommers)