how to process numeric arrays with lightning speed

Hello.

As you know, Python is an interpreted language and it is not very fast at processing data. C, Java or any other compilable language is much faster. However, Python is very popular among data scientists and they use it for processing data. But if you want to reach acceptable performance you **MUST** use some *special libraries* which allow you to use the performance benefits of compiled languages.

In this article I will talk about *numpy*, *pandas*, *xarray*, *cython*, *numba*. And I will show you a real example of how to use them properly and boost the performance hundreds of times. The source code of the example is available on github, so you can download and check them by yourself.

**Definition**: *Calculate Exponential Moving Averages
for all columns of 2000 stocks 20-years daily data series.*

I will give you the benchmark results in advance. See them and then decide if you want to read this whole article or not.

Approach |
time |
peak RAM |
end RAM |
---|---|---|---|

Load data |
|||

pure python (csv) | 1m37s |
4GB | 4GB |

pandas (csv) | 12s |
1.4GB | 1.4GB |

pandas (csv, big files) | 8s |
0.72GB | 0.72GB |

xarray (netcdf, 1 file) | 1.6s |
1.2GB | 0.65GB |

xarray (pickle, 1 file) | 1.3s |
1.2GB | 0.65GB |

Calculate EMA |
|||

naive | >5m |
1.2GB | 1.2GB |

naive numpy | 3m |
1.2GB | 1.2GB |

naive improved | 2m |
1.2GB | 1.2GB |

slices | 4m |
1.2GB | 1.2GB |

numpy slices | 2.4s |
1.2GB | 1.2GB |

cython | 0.5s |
1.2GB | 1.2GB |

numba | 0.5s(+0.3s) |
1.2GB | 1.2GB |

**Data sample**:

```
date,open,high,low,close,vol,divs,split
1998-01-01,2,3,1,2,23232,0.1,1
1998-01-02,2,3,1,3,32423,0,2
...
```

**Restrictions**:

*execution time should be less than 10 seconds**your program should consume less than 1.5GB RAM*

This task looks simple, but the volume of the data and the restrictions make this task not so easy.

*Shortly about how I measure consumed RAM and execution time.*

In order to evaluate the approaches, I will use this command to measure the peak memory and the execution time:

```
/usr/bin/time -f "time: %E peak memory: %MKb" ...
```

Also, I am going to use `timeout`

command to limit the execution time because sometimes it can take hours.

```
timeout "300s" ...
```

Additionaly, before loading the data, I will clear the FS cache:

```
sudo bash -c "sync; echo 3 > /proc/sys/vm/drop_caches"
```

After loading the data, I am going to measure the current memory consumption
with `memory_profiler`

:

```
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")
```

When I calculate ema, I will use the `time`

module to measure the execution time
excluding the loading time:

```
import time
t0 = time.time()
...
print('time:', t1 - t0)
```

This script generates 2000 series.

```
$ python e01_generate_test_data.py
```

Notice, that size of these series is about **0.5GB**.
(You can generate more, but this amount is enough to demonstrate all the ideas.)

First, we need to load the data. We’ll try different approaches, and I’ll show you how to reorganize your data and significantly reduce the time and RAM.

**Load with pure Python (csv)**

First, let’s do it with pure Python.

Code:

```
import os
import csv
from datetime import datetime
DATA_DIR='../data'
data = dict()
for root, dirs, files in os.walk(DATA_DIR):
for filename in files:
asset = filename.split('.')[0]
file_path = os.path.join(DATA_DIR, filename)
with open(file_path) as csvfile:
csvreader = csv.reader(csvfile, delimiter=',')
first = True
series = []
headers = None
for row in csvreader:
if first:
first = False
headers = row
continue
row = [float(row[i]) if i > 0 else datetime.strptime(row[i], "%Y-%m-%d").date()
for i in range(len(row))]
series.append(row)
data[asset] = series
# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")
```

Report:

```
current memory: 4109392.0 Kb
time: 1:37.41 peak memory: 4109392Kb
```

The execution time is **1m38s** and the consumed memory is about **4GB**.
This is unacceptable. Let’s try the other approaches.

**Load with Pandas (csv)**

Next, we will use pandas to load the data.

Code:

```
import os
import pandas as pd
DATA_DIR='../data'
data = dict()
for root, dirs, files in os.walk(DATA_DIR):
for filename in files:
asset = filename.split('.')[0]
file_path = os.path.join(DATA_DIR, filename)
series = pd.read_csv(file_path, sep=",", index_col="date")
data[asset] = series
print("loaded")
# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")
```

Report:

```
loaded
current memory: 1393812.0 Kb
time: 0:12.20 peak memory: 1393984Kb
```

The execution time is **12s** and the consumed memory is about **1.4GB**.

This is much faster, but it still slow.

And, as you see, pandas consumes much less memory than pure Puthon. The reason is that pandas uses internally numpy arrays which based on C-arrays. C-arrays are much more efficient to store numbers than python lists. But the data is still about 3 times larger in RAM than on HDD because pandas creates separate indexes for each file. It makes sense to reorganize the data to reduce the number of files.

**Load with pandas (csv, big files)**

The data in these files contain the same columns, so we can load all these data and save the every column data for all assets in the separate file.

```
$ python e03_group_by_column.py
```

Well, let’s try to load the reorganized data and compare the results:

Code:

```
import os
import pandas as pd
DATA2_DIR='../data2'
data = dict()
for root, dirs, files in os.walk(DATA2_DIR):
for filename in files:
asset = filename.split('.')[0]
file_path = os.path.join(DATA2_DIR, filename)
series = pd.read_csv(file_path, sep=",", index_col="date")
data[asset] = series
# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")
```

Report:

```
current memory: 724600.0 Kb
time: 0:08.37 peak memory: 724672Kb
```

The execution time is **8s** and the consumed memory is about **0.72GB**.

*You can considerably improve the performance if you switch from CSV(text format)
to any another binary format(netcdf, pickle, etc).*

**Load with xarray (netcdf, pickle)**

Unfortunately, pandas can work only with 2 dimensions.
But `xarray`

(similar library) can work with more than 2 dimensions, so we can join all data to one file.
Also it supports the netcdf binary file format (out of box with scipy).
We also test the pickle file format.

This script joins all the data to one file and saves it to netcdf and pickle.

```
$ python e05_convert_to_nc_and_pickle.py
```

Then we will load the data with netcdf and pickle:

**netcdf**
Code:

```
import xarray as xr
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")
```

Report:

```
current memory: 647624.0 Kb
time: 0:01.67 peak memory: 1195560Kb
```

Result: **1.7s**, RAM **0.65GB**, peak ram **1.2GB**

**pickle**
Code:

```
import xarray as xr
import pickle
data = pickle.load(open("../data.pickle", 'rb'))
# measure memory after loading
# measure memory after loading
import gc
from memory_profiler import memory_usage
gc.collect()
print("current memory:", memory_usage()[0]*1024, "Kb")
```

Report:

```
current memory: 647452.0 Kb
time: 0:01.32 peak memory: 1195620Kb
```

Result: **1.3s**, RAM **0.65GB**, peak ram **1.2GB**

As you see the execution time is less than 2s, the difference is about 0.4s. Also, it consumes more peak memory than pandas in the previous case. It is ok if we take into consideration the further calculations.

If the results are almost equal, I advise you to use netcdf instead of pickle. Pickle is connected very closely to the versions of libraries, so when you update your libraries the data can become unreadable from your code. This is a real headache, try to avoid that.

*Honestly, the performance of pandas may be the same if you use binary formats,
but it a bit easier to start with xarray+netcdf at this time.*

- use the right librarries (numpy,pandas,xarrray) for numeric data
- reduce number of files, join small files to big ones
- use binary data formats

**EMA naive**

At first let’s try the naive approach.

Code:

```
import xarray as xr
import numpy as np
import time
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
def calc_ema_xr(prices, n):
k = 2.0 / (1 + n)
_k = 1 - k
ema = xr.zeros_like(prices)
pe = np.nan
for i in range(len(prices)):
e = prices[i]
if not np.isnan(pe):
if np.isnan(e):
e = pe
else:
e = k * e + _k * pe
ema[i] = e
pe = e
return ema
ema = xr.zeros_like(data)
t0 = time.time()
for a in data.asset.values.tolist():
for f in data.field.values.tolist():
ema.loc[a,f] = calc_ema_xr(data.loc[a,f], 20)
t1 = time.time()
print('time:', t1-t0)
```

Report:

```
Command exited with non-zero status 124
time: 5:00.00 peak memory: 1196952Kb
```

The execution timed out (took more than 5min).

**EMA naive numpy**

When you read the documentation about xarray, you find that xarray is based on `numpy`

.
You can think that it is a good idea to work with the data directly in numpy.

Let’s try this.

Code:

```
import xarray as xr
import xarray as xr
import numpy as np
import time
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
def calc_ema_np(prices, n):
k = 2.0 / (1 + n)
_k = 1 - k
ema = np.zeros_like(prices)
pe = np.nan
for i in range(len(prices)):
e = prices[i]
if not np.isnan(pe):
if np.isnan(e):
e = pe
else:
e = k * e + _k * pe
ema[i] = e
pe = e
return ema
ema = xr.zeros_like(data)
t0 = time.time()
for a in data.asset.values.tolist():
for f in data.field.values.tolist():
ema.loc[a,f] = calc_ema_np(data.loc[a,f].values, 20)
t1 = time.time()
print('time:', t1-t0)
```

Report:

```
time: 182.26038265228271
time: 3:02.88 peak memory: 1196728Kb
```

The time is about **3min**. Better, but still slow.

**EMA naive improved**

The main problem is that we read and write the data element by element. xarray(numpy and pandas too) uses system calls to external libraries in order to access to the internal C-array(python can’t work directly). These calls are slow. We should use batching to reduce the number of calls.

The simplest approach is to extract the whole series from the C-array to python list, calculate EMA and write the series back. Let’s check.

Code:

```
import xarray as xr
import numpy as np
import time
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
def calc_ema_list(prices, n):
k = 2.0 / (1 + n)
_k = 1 - k
ema = []
pe = np.nan
for e in prices:
if not np.isnan(pe):
if np.isnan(e):
e = pe
else:
e = k * e + _k * pe
ema.append(e)
pe = e
return ema
ema = xr.zeros_like(data)
t0 = time.time()
for a in data.asset.values.tolist():
for f in data.field.values.tolist():
ema.loc[a,f] = calc_ema_list(data.loc[a,f].values.tolist(), 20)
t1 = time.time()
print('time:', t1-t0)
```

Report:

```
time: 122.7459466457367
time: 2:03.35 peak memory: 1197128Kb
```

The time is about **2min**. Slow.

**EMA with slices**
Another approach is work with the slices of elements and boolean masks.
It is also reduces the execution time.

Code:

```
import xarray as xr
import numpy as np
import time
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
def calc_ema(data, n):
k = 2.0 / (1 + n)
_k = 1 - k
ema = xr.zeros_like(data)
prev_ema = xr.zeros_like(ema.isel(date=0))
for t in data.coords['date']:
cur_data = data.loc[:, t]
cur_finite = np.isfinite(cur_data)
prev_finite = np.isfinite(prev_ema)
ema.loc[:, t] = cur_data
ema.loc[prev_finite, t] = prev_ema.loc[prev_finite]
cur_prev_finite = np.logical_and(prev_finite, cur_finite)
ema.loc[cur_prev_finite, t] = (k * cur_data + _k * prev_ema).loc[cur_prev_finite]
prev_ema = ema.loc[:, t]
return ema
t0 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
ema.loc[:,f,:] = calc_ema(data.loc[:,f,:], 20)
t1 = time.time()
print('time:', t1 - t0)
```

Report:

```
time: 240.58265471458435
time: 4:01.14 peak memory: 1275596Kb
```

The execution time is about **4min**. Bad. You can think that this is a wrong way,
But slice operations are very fast in numpy.
So, You should bypass xarray and work with numpy directly
if you want to reach good performance.

**EMA with numpy slices**

In this approach we are working with numpy slices directly.

Code:

```
import xarray as xr
import numpy as np
import time
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
def calc_ema_np(data, n):
k = 2.0 / (1 + n)
_k = 1 - k
ema = np.zeros_like(data)
prev_ema = np.zeros_like(ema[:,0])
for t in range(data.shape[1]):
cur_data = data[:, t]
cur_finite = np.isfinite(cur_data)
prev_finite = np.isfinite(prev_ema)
ema[:, t] = cur_data
ema[prev_finite, t] = prev_ema[prev_finite]
cur_prev_finite = np.logical_and(prev_finite, cur_finite)
ema[cur_prev_finite, t] = (k * cur_data + _k * prev_ema)[cur_prev_finite]
prev_ema = ema[:, t]
return ema
t0 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
ema.loc[:,f,:].values = calc_ema_np(data.loc[:,f,:].values, 20)
t1 = time.time()
print('time:', t1 - t0)
```

Report:

```
time: 2.398899555206299
time: 0:02.96 peak memory: 1275112Kb
```

The result is **3s**. This is great! We can say, that we solved the task.

Further, I will show some ways how to do it even faster. We compile the code to work directly with C-arrays. It is more complicated, but sometimes it is necessary.

**EMA with cython**

Let’s try to use cython to make our code faster.

Cython - is a python-like language which may be converted to C, compiled and linked to python.

This ema function implemented in cython:

Cython code:

```
from libc.math cimport NAN, isnan
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.wraparound(False)
def c_ema(np.ndarray[DTYPE_t, ndim = 2] data, DTYPE_t n):
cdef DTYPE_t k = 2.0 / (1.0 + n)
cdef DTYPE_t _k = 1 - k
cdef int xmax = data.shape[0]
cdef int ymax = data.shape[1]
cdef np.ndarray res = np.zeros([xmax, ymax], dtype=DTYPE)
cdef double[:,:] resd = res # line 17
cdef DTYPE_t prev_ema = NAN
cdef DTYPE_t cur_ema = NAN
for x in range(0, xmax, 1):
prev_ema = NAN
for y in range(0, ymax, 1):
cur_ema = data[x, y]
if not isnan(prev_ema):
if isnan(cur_ema):
cur_ema = prev_ema
else:
cur_ema = cur_ema * k + prev_ema * _k
resd[x, y] = cur_ema
prev_ema = cur_ema
return res
```

As you, it is python-like, but it is not python. There are a lot of code which specifies types and conversions. If you make even a minor mistake, your code may start to work 10 times slower or stop working at all.

For example the line 17. This line may look excess, but this is a some magic with types. We force it to use internal memory structure. Without this line the code works 14 times slower.

Also you have to convert and compile the code:

```
# convert to C
cython -3 cema.pyx
# compile
gcc -shared -pthread -fPIC -fwrapv -Wall -O2 \
-I "$CONDA_PREFIX/lib/python3.6/site-packages/numpy/core/include/" \
-I "$CONDA_PREFIX/include/python3.7m/" \
-o cema.so cema.c
```

Python code:

```
import xarray as xr
import time
import cema
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
def cython_ema(data, n):
nparr = data.values
nparr = cema.c_ema(nparr, n)
return xr.DataArray(nparr, dims=[ 'asset', 'date'], coords={'date': data.date, 'asset': data.asset})
t0 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
ema.loc[:,f,:] = cython_ema(data.loc[:, f, :], 20)
t1 = time.time()
print('time:', t1 - t0)
```

Report:

```
time: 0.5077145099639893
time: 0:01.05 peak memory: 1275424Kb
```

The result is **0.5s**. This is 4 times faster than numpy with slices.

Unfortunately, this code looks very tricky, and it is easy to make a mistake. There is a simpler further.

**EMA with numba**

Numba allows you to compile you python code with JIT (runtime). But it adds some limitations on data types(numpy array and primitive types) and operations which you can use, otherwise it won’t compile.

Let’s rewrite EMA function for numba.

Python code:

```
import xarray as xr
import numpy as np
from numba import jit
import time
data = xr.open_dataarray('../data.nc', decode_times=True).compute()
@jit(
nopython=True,
#parallel=True,
)
def numba_ema(data, n):
k = 2.0 / (1 + n)
_k = 1 - k
ema = np.zeros(data.shape, np.float64)
for a in range(data.shape[0]):
pe = np.nan
for t in range(data.shape[1]):
e = data[a, t]
if not np.isnan(pe):
if np.isnan(e):
e = pe
else:
e = k * e + _k * pe
ema[a, t] = e
pe = e
return ema
def calc_ema(data, n):
nparr = data.values
nparr = numba_ema(nparr, n)
return xr.DataArray(nparr, dims=['asset', 'date'],coords={'date': data.coords['date'], 'asset': data.coords['asset']})
t0 = time.time()
# first call - measure compilation time
ema = xr.zeros_like(data)
calc_ema(data[0:20,0,0:1], 20)
t1 = time.time()
ema = xr.zeros_like(data)
for f in data.field.values.tolist():
ema.loc[:,f,:] = calc_ema(data.loc[:,f,:], 20)
t2 = time.time()
print('time:', t2-t0, t1 - t0, t2 - t1)
```

Report:

```
time: 0.8730692863464355 0.32979297637939453 0.543276309967041
time: 0:02.01 peak memory: 1826688Kb
```

The performance is almost the same to cython,
but JIT-compilationtakes **additional 0.27s**.
But code looks more clear, you can remove `@jit`

and it will work with pure python as well.
It simplifies the development a lot, so I prefer to use numba instead of cython.

- numpy with slices is fast enough for most cases
- work directly with numpy and bypass xarray and pandas
- you can compile your code with cython or numba to work directly with underlying C-arrays in RAM
- cython is not python and it is more tricky than numba. Numba requires additional time(often small) for JIT compilation.