Processing big numeric arrays in python

Intro

Hello.

As you know, Python is an interpreted language and it is not very fast at processing data. C, Java or any other compilable language is much faster. However, Python is very popular among data scientists and they use it for processing data. But if you want to reach acceptable performance you MUST use some special libraries which allow you to use the performance benefits of compiled languages.

In this article I will talk about numpy, pandas, xarray, cython, numba. And I will show you a real example of how to use them properly and boost the performance hundreds of times. The source code of the example is available on github, so you can download and check them by yourself.

Task

Definition: Calculate Exponential Moving Averages for all columns of 2000 stocks 20-years daily data series.

Summary

I will give you the benchmark results in advance. See them and then decide if you want to read this whole article or not.

Approach	time	peak RAM	end RAM
Load data
pure python (csv)	1m37s	4GB	4GB
pandas (csv)	12s	1.4GB	1.4GB
pandas (csv, big files)	8s	0.72GB	0.72GB
xarray (netcdf, 1 file)	1.6s	1.2GB	0.65GB
xarray (pickle, 1 file)	1.3s	1.2GB	0.65GB
Calculate EMA
naive	>5m	1.2GB	1.2GB
naive numpy	3m	1.2GB	1.2GB
naive improved	2m	1.2GB	1.2GB
slices	4m	1.2GB	1.2GB
numpy slices	2.4s	1.2GB	1.2GB
cython	0.5s	1.2GB	1.2GB
numba	0.5s(+0.3s)	1.2GB	1.2GB

Task - additional info

Data sample:

date,open,high,low,close,vol,divs,split
1998-01-01,2,3,1,2,23232,0.1,1
1998-01-02,2,3,1,3,32423,0,2
...

Restrictions:

execution time should be less than 10 seconds
your program should consume less than 1.5GB RAM

This task looks simple, but the volume of the data and the restrictions make this task not so easy.

Benchmarking

Shortly about how I measure consumed RAM and execution time.

In order to evaluate the approaches, I will use this command to measure the peak memory and the execution time:

/usr/bin/time -f "time: %E peak memory: %MKb" ...

Also, I am going to use timeout command to limit the execution time because sometimes it can take hours.

timeout "300s" ...

Additionaly, before loading the data, I will clear the FS cache:

sudo bash -c "sync; echo 3 > /proc/sys/vm/drop_caches"

After loading the data, I am going to measure the current memory consumption with memory_profiler:

import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")

When I calculate ema, I will use the time module to measure the execution time excluding the loading time:

import time

t0 = time.time()
...
print('time:', t1 - t0)

Generate the data

This script generates 2000 series.

$ python e01_generate_test_data.py

source code

Notice, that size of these series is about 0.5GB. (You can generate more, but this amount is enough to demonstrate all the ideas.)

Load data to RAM

First, we need to load the data. We’ll try different approaches, and I’ll show you how to reorganize your data and significantly reduce the time and RAM.

Load with pure Python (csv)

First, let’s do it with pure Python.

Code:

import os
import csv
from datetime import datetime

DATA_DIR='../data'

data = dict()
for root, dirs, files in os.walk(DATA_DIR):
    for filename in files:
        asset = filename.split('.')[0]
        file_path = os.path.join(DATA_DIR, filename)
        with open(file_path) as csvfile:
            csvreader = csv.reader(csvfile, delimiter=',')
            first = True
            series = []
            headers = None
            for row in csvreader:
                if first:
                    first = False
                    headers = row
                    continue
                row = [float(row[i]) if i > 0 else datetime.strptime(row[i], "%Y-%m-%d").date()
                       for i in range(len(row))]
                series.append(row)
            data[asset] = series


# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")

Report:

current memory: 4109392.0 Kb
time: 1:37.41 peak memory: 4109392Kb

The execution time is 1m38s and the consumed memory is about 4GB. This is unacceptable. Let’s try the other approaches.

Load with Pandas (csv)

Next, we will use pandas to load the data.

Code:

import os
import pandas as pd

DATA_DIR='../data'

data = dict()
for root, dirs, files in os.walk(DATA_DIR):
    for filename in files:
        asset = filename.split('.')[0]
        file_path = os.path.join(DATA_DIR, filename)
        series = pd.read_csv(file_path, sep=",", index_col="date")
        data[asset] = series

print("loaded")

# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")

Report:

loaded
current memory: 1393812.0 Kb
time: 0:12.20 peak memory: 1393984Kb

The execution time is 12s and the consumed memory is about 1.4GB.

This is much faster, but it still slow.

And, as you see, pandas consumes much less memory than pure Puthon. The reason is that pandas uses internally numpy arrays which based on C-arrays. C-arrays are much more efficient to store numbers than python lists. But the data is still about 3 times larger in RAM than on HDD because pandas creates separate indexes for each file. It makes sense to reorganize the data to reduce the number of files.

Load with pandas (csv, big files)

The data in these files contain the same columns, so we can load all these data and save the every column data for all assets in the separate file.

$ python e03_group_by_column.py

source code

Well, let’s try to load the reorganized data and compare the results:

Code:

import os
import pandas as pd

DATA2_DIR='../data2'

data = dict()
for root, dirs, files in os.walk(DATA2_DIR):
    for filename in files:
        asset = filename.split('.')[0]
        file_path = os.path.join(DATA2_DIR, filename)
        series = pd.read_csv(file_path, sep=",", index_col="date")
        data[asset] = series

# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")

Report:

current memory: 724600.0 Kb
time: 0:08.37 peak memory: 724672Kb

The execution time is 8s and the consumed memory is about 0.72GB.

You can considerably improve the performance if you switch from CSV(text format) to any another binary format(netcdf, pickle, etc).

Load with xarray (netcdf, pickle)

Unfortunately, pandas can work only with 2 dimensions. But xarray(similar library) can work with more than 2 dimensions, so we can join all data to one file. Also it supports the netcdf binary file format (out of box with scipy). We also test the pickle file format.

This script joins all the data to one file and saves it to netcdf and pickle.

$ python e05_convert_to_nc_and_pickle.py

source code

Then we will load the data with netcdf and pickle:

netcdf Code:

import xarray as xr

data = xr.open_dataarray('../data.nc', decode_times=True).compute()

# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")

Report:

current memory: 647624.0 Kb
time: 0:01.67 peak memory: 1195560Kb

Result: 1.7s, RAM 0.65GB, peak ram 1.2GB

pickle Code:

import xarray as xr
import pickle

data = pickle.load(open("../data.pickle", 'rb'))

# measure memory after loading
# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")

Report:

current memory: 647452.0 Kb
time: 0:01.32 peak memory: 1195620Kb

Result: 1.3s, RAM 0.65GB, peak ram 1.2GB

As you see the execution time is less than 2s, the difference is about 0.4s. Also, it consumes more peak memory than pandas in the previous case. It is ok if we take into consideration the further calculations.

If the results are almost equal, I advise you to use netcdf instead of pickle. Pickle is connected very closely to the versions of libraries, so when you update your libraries the data can become unreadable from your code. This is a real headache, try to avoid that.

Honestly, the performance of pandas may be the same if you use binary formats, but it a bit easier to start with xarray+netcdf at this time.

Load data - Conclusions

use the right librarries (numpy,pandas,xarrray) for numeric data
reduce number of files, join small files to big ones
use binary data formats

Calculate EMA

EMA naive

At first let’s try the naive approach.

Code:

import xarray as xr
import numpy as np
import time

data = xr.open_dataarray('../data.nc', decode_times=True).compute()

def calc_ema_xr(prices, n):
    k = 2.0 / (1 + n)
    _k = 1 - k
    ema = xr.zeros_like(prices)
    pe = np.nan
    for i in range(len(prices)):
        e = prices[i]
        if not np.isnan(pe):
            if np.isnan(e):
                e = pe
            else:
                e = k * e + _k * pe
        ema[i] = e
        pe = e
    return ema

ema = xr.zeros_like(data)

t0 = time.time()
for a in data.asset.values.tolist():
    for f in data.field.values.tolist():
        ema.loc[a,f] = calc_ema_xr(data.loc[a,f], 20)
t1 = time.time()

print('time:', t1-t0)

Report:

Command exited with non-zero status 124
time: 5:00.00 peak memory: 1196952Kb

The execution timed out (took more than 5min).

EMA naive numpy

When you read the documentation about xarray, you find that xarray is based on numpy. You can think that it is a good idea to work with the data directly in numpy.

Let’s try this.

Code:

import xarray as xr
import xarray as xr
import numpy as np
import time

data = xr.open_dataarray('../data.nc', decode_times=True).compute()

def calc_ema_np(prices, n):
    k = 2.0 / (1 + n)
    _k = 1 - k
    ema = np.zeros_like(prices)
    pe = np.nan
    for i in range(len(prices)):
        e = prices[i]
        if not np.isnan(pe):
            if np.isnan(e):
                e = pe
            else:
                e = k * e + _k * pe
        ema[i] = e
        pe = e
    return ema

ema = xr.zeros_like(data)

t0 = time.time()
for a in data.asset.values.tolist():
    for f in data.field.values.tolist():
        ema.loc[a,f] = calc_ema_np(data.loc[a,f].values, 20)
t1 = time.time()

print('time:', t1-t0)

Report:

time: 182.26038265228271
time: 3:02.88 peak memory: 1196728Kb

The time is about 3min. Better, but still slow.

EMA naive improved

The main problem is that we read and write the data element by element. xarray(numpy and pandas too) uses system calls to external libraries in order to access to the internal C-array(python can’t work directly). These calls are slow. We should use batching to reduce the number of calls.

The simplest approach is to extract the whole series from the C-array to python list, calculate EMA and write the series back. Let’s check.

Code:

import xarray as xr
import numpy as np
import time

data = xr.open_dataarray('../data.nc', decode_times=True).compute()

def calc_ema_list(prices, n):
    k = 2.0 / (1 + n)
    _k = 1 - k
    ema = []
    pe = np.nan
    for e in prices:
        if not np.isnan(pe):
            if np.isnan(e):
                e = pe
            else:
                e = k * e + _k * pe
        ema.append(e)
        pe = e
    return ema

ema = xr.zeros_like(data)

t0 = time.time()
for a in data.asset.values.tolist():
    for f in data.field.values.tolist():
        ema.loc[a,f] = calc_ema_list(data.loc[a,f].values.tolist(), 20)
t1 = time.time()

print('time:', t1-t0)

Report:

time: 122.7459466457367
time: 2:03.35 peak memory: 1197128Kb

The time is about 2min. Slow.

EMA with slices Another approach is work with the slices of elements and boolean masks. It is also reduces the execution time.

Code:

import xarray as xr
import numpy as np
import time

data = xr.open_dataarray('../data.nc', decode_times=True).compute()

def calc_ema(data, n):
    k = 2.0 / (1 + n)
    _k = 1 - k
    ema = xr.zeros_like(data)
    prev_ema = xr.zeros_like(ema.isel(date=0))
    for t in data.coords['date']:
        cur_data = data.loc[:, t]
        cur_finite = np.isfinite(cur_data)
        prev_finite = np.isfinite(prev_ema)

        ema.loc[:, t] = cur_data
        ema.loc[prev_finite, t] = prev_ema.loc[prev_finite]
        cur_prev_finite = np.logical_and(prev_finite, cur_finite)
        ema.loc[cur_prev_finite, t] = (k * cur_data + _k * prev_ema).loc[cur_prev_finite]

        prev_ema = ema.loc[:, t]
    return ema

t0 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
    ema.loc[:,f,:] = calc_ema(data.loc[:,f,:], 20)
t1 = time.time()
print('time:', t1 - t0)

Report:

time: 240.58265471458435
time: 4:01.14 peak memory: 1275596Kb

The execution time is about 4min. Bad. You can think that this is a wrong way, But slice operations are very fast in numpy. So, You should bypass xarray and work with numpy directly if you want to reach good performance.

EMA with numpy slices

In this approach we are working with numpy slices directly.

Code:

import xarray as xr
import numpy as np
import time

data = xr.open_dataarray('../data.nc', decode_times=True).compute()


def calc_ema_np(data, n):
    k = 2.0 / (1 + n)
    _k = 1 - k
    ema = np.zeros_like(data)
    prev_ema = np.zeros_like(ema[:,0])
    for t in range(data.shape[1]):
        cur_data = data[:, t]
        cur_finite = np.isfinite(cur_data)
        prev_finite = np.isfinite(prev_ema)
        ema[:, t] = cur_data
        ema[prev_finite, t] = prev_ema[prev_finite]
        cur_prev_finite = np.logical_and(prev_finite, cur_finite)
        ema[cur_prev_finite, t] = (k * cur_data + _k * prev_ema)[cur_prev_finite]
        prev_ema = ema[:, t]
    return ema


t0 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
    ema.loc[:,f,:].values = calc_ema_np(data.loc[:,f,:].values, 20)
t1 = time.time()

print('time:', t1 - t0)

Report:

time: 2.398899555206299
time: 0:02.96 peak memory: 1275112Kb

The result is 3s. This is great! We can say, that we solved the task.

Further, I will show some ways how to do it even faster. We compile the code to work directly with C-arrays. It is more complicated, but sometimes it is necessary.

EMA with cython

Let’s try to use cython to make our code faster.

Cython - is a python-like language which may be converted to C, compiled and linked to python.

This ema function implemented in cython:

Cython code:

from libc.math cimport NAN, isnan
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def c_ema(np.ndarray[DTYPE_t, ndim = 2] data, DTYPE_t n):
    cdef DTYPE_t k = 2.0 / (1.0 + n)
    cdef DTYPE_t _k = 1 - k
    cdef int xmax = data.shape[0]
    cdef int ymax = data.shape[1]
    cdef np.ndarray res = np.zeros([xmax, ymax], dtype=DTYPE)
    cdef double[:,:] resd = res # line 17
    cdef DTYPE_t prev_ema = NAN
    cdef DTYPE_t cur_ema = NAN
    for x in range(0, xmax, 1):
        prev_ema = NAN
        for y in range(0, ymax, 1):
            cur_ema = data[x, y]
            if not isnan(prev_ema):
                if isnan(cur_ema):
                    cur_ema = prev_ema
                else:
                    cur_ema = cur_ema * k + prev_ema * _k
            resd[x, y] = cur_ema
            prev_ema = cur_ema
    return res

As you, it is python-like, but it is not python. There are a lot of code which specifies types and conversions. If you make even a minor mistake, your code may start to work 10 times slower or stop working at all.

For example the line 17. This line may look excess, but this is a some magic with types. We force it to use internal memory structure. Without this line the code works 14 times slower.

Also you have to convert and compile the code:

# convert to C
cython -3 cema.pyx
# compile
gcc -shared -pthread -fPIC -fwrapv -Wall -O2 \
-I "$CONDA_PREFIX/lib/python3.6/site-packages/numpy/core/include/" \
-I "$CONDA_PREFIX/include/python3.7m/" \
-o cema.so cema.c

Python code:

import xarray as xr
import time

import cema

data = xr.open_dataarray('../data.nc', decode_times=True).compute()

def cython_ema(data, n):
    nparr = data.values
    nparr = cema.c_ema(nparr, n)
    return xr.DataArray(nparr, dims=[ 'asset', 'date'], coords={'date': data.date, 'asset': data.asset})

t0 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
    ema.loc[:,f,:] = cython_ema(data.loc[:, f, :], 20)
t1 = time.time()
print('time:', t1 - t0)

Report:

time: 0.5077145099639893
time: 0:01.05 peak memory: 1275424Kb

The result is 0.5s. This is 4 times faster than numpy with slices.

Unfortunately, this code looks very tricky, and it is easy to make a mistake. There is a simpler further.

EMA with numba

Numba allows you to compile you python code with JIT (runtime). But it adds some limitations on data types(numpy array and primitive types) and operations which you can use, otherwise it won’t compile.

Let’s rewrite EMA function for numba.

Python code:

import xarray as xr
import numpy as np
from numba import jit
import time

data = xr.open_dataarray('../data.nc', decode_times=True).compute()

@jit(
    nopython=True,
    #parallel=True,
)
def numba_ema(data, n):
    k = 2.0 / (1 + n)
    _k = 1 - k
    ema = np.zeros(data.shape, np.float64)
    for a in range(data.shape[0]):
        pe = np.nan
        for t in range(data.shape[1]):
            e = data[a, t]
            if not np.isnan(pe):
                if np.isnan(e):
                    e = pe
                else:
                    e = k * e + _k * pe
            ema[a, t] = e
            pe = e
    return ema


def calc_ema(data, n):
    nparr = data.values
    nparr = numba_ema(nparr, n)
    return xr.DataArray(nparr, dims=['asset', 'date'],coords={'date': data.coords['date'], 'asset': data.coords['asset']})

t0 = time.time()
# first call - measure compilation time
ema = xr.zeros_like(data)
calc_ema(data[0:20,0,0:1], 20)
t1 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
    ema.loc[:,f,:] = calc_ema(data.loc[:,f,:], 20)
t2 = time.time()
print('time:', t2-t0, t1 - t0, t2 - t1)

Report:

time: 0.8730692863464355 0.32979297637939453 0.543276309967041
time: 0:02.01 peak memory: 1826688Kb

The performance is almost the same to cython, but JIT-compilationtakes additional 0.27s. But code looks more clear, you can remove @jit and it will work with pure python as well. It simplifies the development a lot, so I prefer to use numba instead of cython.

Calculate EMA - conclusions

numpy with slices is fast enough for most cases
work directly with numpy and bypass xarray and pandas
you can compile your code with cython or numba to work directly with underlying C-arrays in RAM
cython is not python and it is more tricky than numba. Numba requires additional time(often small) for JIT compilation.

Processing big numeric arrays in python

Intro

Task

Summary

Task - additional info

Benchmarking

Generate the data

Load data to RAM

Load data - Conclusions

Calculate EMA

Calculate EMA - conclusions

[Author]