ulab.array -> ulab.ndarray

This was flagged as an error in building circuitpython, since ulab.array doesn't name a type object.
code: Add a docstring for numpy, scipy packages
2021-04-01 15:43:09 -05:00 · 2021-04-01 14:57:02 -05:00 · 2021-01-29 16:40:08 +01:00 · 2021-01-29 16:34:37 +01:00 · 2021-01-29 16:32:56 +01:00 · 2021-01-29 16:31:04 +01:00
151 changed files with 34872 additions and 21152 deletions
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -3,23 +3,32 @@ name: Build CI
 on:
  push:
  pull_request:
+    paths:
+    - 'code/**'
+    - 'tests/**'
+    - '.github/workflows/**'
  release:
    types: [published]
  check_suite:
    types: [rerequested]

 jobs:
-  test:
-    runs-on: ubuntu-16.04
+  micropython:
+    strategy:
+        matrix:
+            os:
+                - ubuntu-16.04
+                - macos-10.14
+    runs-on: ${{ matrix.os }}
    steps:
    - name: Dump GitHub context
      env:
        GITHUB_CONTEXT: ${{ toJson(github) }}
      run: echo "$GITHUB_CONTEXT"
-    - name: Set up Python 3.5
+    - name: Set up Python 3.8
      uses: actions/setup-python@v1
      with:
-        python-version: 3.5
+        python-version: 3.8

    - name: Versions
      run: |
@@ -34,25 +43,42 @@ jobs:
        repository: micropython/micropython
        path: micropython

-    - name: Checkout micropython submodules
-      run: (cd micropython && git submodule update --init)
-
-    - name: Build mpy-cross
-      run: make -C micropython/mpy-cross -j2
-
-    - name: Build micropython unix port
-      run: |
-        make -C micropython/ports/unix -j2 deplibs
-        make -C micropython/ports/unix -j2 USER_C_MODULES=$(readlink -f .)
-
-    - name: Run tests
-      run: env MICROPYTHON_CPYTHON3=python3.5 MICROPY_MICROPYTHON=micropython/ports/unix/micropython micropython/tests/run-tests -d tests
-    - name: Print failure info
-      run: |
-        for exp in *.exp;
-        do testbase=$(basename $exp .exp);
-        echo -e "\nFAILURE $testbase";
-        diff -u $testbase.exp $testbase.out;
-        done
-      if: failure()
+    - name: Run build.sh
+      run: ./build.sh

+#  circuitpython:
+#    strategy:
+#        matrix:
+#            os:
+#                - ubuntu-16.04
+#                - macos-10.14
+#    runs-on: ${{ matrix.os }}
+#    steps:
+#    - name: Dump GitHub context
+#      env:
+#        GITHUB_CONTEXT: ${{ toJson(github) }}
+#      run: echo "$GITHUB_CONTEXT"
+#    - name: Set up Python 3.8
+#      uses: actions/setup-python@v1
+#      with:
+#        python-version: 3.8
+#
+#    - name: Versions
+#      run: |
+#        gcc --version
+#        python3 --version
+#
+#    - name: Checkout ulab
+#      uses: actions/checkout@v1
+#
+#    - name: Install requirements
+#      run: |
+#        if type -path apt-get; then
+#            sudo apt-get install gettext
+#        else
+#            brew install gettext
+#            echo >>$GITHUB_PATH /usr/local/opt/gettext/bin
+#        fi
+#
+#    - name: Run build-cp.sh
+#      run: ./build-cp.sh
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,6 @@
+/micropython
+/circuitpython
+/*.exp
+/*.out
+/docs/manual/build/
+/docs/manual/source/**/*.pyi
--- a/README.md
+++ b/README.md
@@ -1,62 +1,226 @@
-# micropython-ulab
+# ulab

-ulab is a numpy-like array manipulation library for micropython. 
-The module is written in C, defines compact containers for numerical 
-data, and is fast. 
+`ulab` is a `numpy`-like array manipulation library for [micropython](http://micropython.org/) and [CircuitPython](https://circuitpython.org/).
+The module is written in C, defines compact containers for numerical data of one to four
+dimensions, and is fast. The library is a software-only standard `micropython` user module,
+i.e., it has no hardware dependencies, and can be compiled for any platform.
+The `float` implementation of `micropython` (`float`, or `double`) is automatically detected.

-Documentation can be found under https://micropython-ulab.readthedocs.io/en/latest/
-The source for the manual is in https://github.com/v923z/micropython-ulab/blob/master/docs/ulab-manual.ipynb,
-while developer help is in https://github.com/v923z/micropython-ulab/blob/master/docs/ulab.ipynb.
+# Supported functions
+
+
+## ndarray
+
+`ulab` implements `numpy`'s `ndarray` with the `==`, `!=`, `<`, `<=`, `>`, `>=`, `+`, `-`, `/`, `*`, `**`,
+`+=`, `-=`, `*=`, `/=`, `**=` binary operators, and the `len`, `~`, `-`, `+`, `abs` unary operators that
+operate element-wise. Type-aware `ndarray`s can be initialised from any `micropython` iterable, lists of
+iterables via the `array` constructor, or by means of the `arange`, `concatenate`, `diag`, `eye`, 
+`frombuffer`, `full`, `linspace`, `logspace`, `ones`, or `zeros` functions.
+
+`ndarray`s can be iterated on, and have a number of their own methods, such as `flatten`, `itemsize`, `reshape`,
+`shape`, `size`, `strides`, `tobytes`, and `transpose`.
+
+
+## Customising the firmware
+
+In addition to the `ndarray` operators and methods, `ulab` defines a great number of functions that can
+take `ndarray`s or `micropython` iterables as their arguments. Most of the functions have been ported from 
+`numpy`, but several are re-implementations of `scipy` features. For a complete list, see
+[micropython-ulab](https://micropython-ulab.readthedocs.io/en/latest)!
+
+If flash space is a concern, unnecessary functions can be excluded from the compiled firmware with 
+pre-processor switches. In addition, `ulab` also has options for trading execution speed for firmware size. 
+A thorough discussion on how the firmware can be customised can be found in the 
+[corresponding section](https://micropython-ulab.readthedocs.io/en/latest/ulab-intro.html#customising-the-firmware) 
+of the user manual.
+
+It is also possible to extend the library with arbitrary user-defined functions operating on numerical arrays, and add them to the namespace, as explained in the [programming manual](https://micropython-ulab.readthedocs.io/en/latest/ulab-programming.html).
+
+
+## Usage
+
+`ulab` sports a `numpy/scipy`-compatible interface, which makes porting of `CPython` code straightforward. The following
+snippet should run equally well in `micropython`, or on a PC.
+
+```python
+try:
+    from ulab import numpy as np
+    from ulab import scipy as spy
+except ImportError:
+    import numpy as np
+    import scipy as spy
+
+x = np.array([1, 2, 3])
+spy.special.erf(x)
+```
+
+# Finding help
+
+Documentation can be found on [readthedocs](https://readthedocs.org/) under
+[micropython-ulab](https://micropython-ulab.readthedocs.io/en/latest),
+as well as at [circuitpython-ulab](https://circuitpython.readthedocs.io/en/latest/shared-bindings/ulab/__init__.html).
+A number of practical examples are listed in Jeff Epler's excellent
+[circuitpython-ulab](https://learn.adafruit.com/ulab-crunch-numbers-fast-with-circuitpython/overview) overview.
+
+# Benchmarks
+
+Representative numbers on performance can be found under [ulab samples](https://github.com/thiagofe/ulab_samples). 

 # Firmware

-Firmware for pyboard.v.1.1, and PYBD_SF6 is updated once in a while, and can be downloaded 
-from https://github.com/v923z/micropython-ulab/releases.
+Compiled firmware for many hardware platforms can be downloaded from Roberto Colistete's
+gitlab repository: for the [pyboard](https://gitlab.com/rcolistete/micropython-samples/-/tree/master/Pyboard/Firmware/), and
+for [ESP8266](https://gitlab.com/rcolistete/micropython-samples/-/tree/master/ESP8266/Firmware).
+Since a number of features can be set in the firmware (threading, support for SD card, LEDs, user switch etc.), and it is
+impossible to create something that suits everyone, these releases should only be used for
+quick testing of `ulab`. Otherwise, compilation from the source is required with
+the appropriate settings, which are usually defined in the `mpconfigboard.h` file of the port
+in question.
+
+`ulab` is also included in the following compiled `micropython` variants and derivatives:
+
+1. `CircuitPython` for SAMD51 and nRF microcontrollers https://github.com/adafruit/circuitpython
+1. `MicroPython for K210` https://github.com/loboris/MicroPython_K210_LoBo
+1. `MaixPy` https://github.com/sipeed/MaixPy
+1. `OpenMV` https://github.com/openmv/openmv
+1. `pycom` https://pycom.io/

 ## Compiling

-If you want to try the latest version of `ulab`, or your hardware is 
-different to pyboard.v.1.1, or PYBD_SF6, the firmware can be compiled 
+If you want to try the latest version of `ulab` on `micropython` or one of its forks, the firmware can be compiled
 from the source by following these steps:

-First, you have to clone the micropython repository by running 
+### UNIX port

+Simply clone the `ulab` repository with
+
+```bash
+git clone https://github.com/v923z/micropython-ulab.git ulab
 ```
+and then run 
+
+```bash
+./build.sh
+```
+This command will clone `micropython`, build the `unix` port automatically, and run the test scripts. If you want an interactive `unix` session, you can launch the `micropython` executable found in
+
+```bash
+ulab/micropython/ports/unix
+```
+
+### STM-based boards
+
+First, you have to clone the `micropython` repository by running
+
+```bash
 git clone https://github.com/micropython/micropython.git
 ```
 on the command line. This will create a new repository with the name `micropython`. Staying there, clone the `ulab` repository with

-```
+```bash
 git clone https://github.com/v923z/micropython-ulab.git ulab
 ```
+If you don't have the cross-compiler installed, you might want to install it now, for instance on Linux by executing

-Then you have to include `ulab` in the compilation process by editing `mpconfigport.h` of the directory of the port for which you want to compile, so, still on the command line, navigate to `micropython/ports/unix`, or `micropython/ports/stm32`, or whichever port is your favourite, and edit the `mpconfigport.h` file there. All you have to do is add a single line at the end: 
-
-```
-#define MODULE_ULAB_ENABLED (1)
-```
-
-This line will inform the compiler that you want `ulab` in the resulting firmware. If you don't have the cross-compiler installed, your might want to do that now, for instance on Linux by executing 
-
-```
+```bash
 sudo apt-get install gcc-arm-none-eabi
 ```
-If that was successful, you can try to run the make command in the port's directory as 
-```
+
+If this step was successful, you can try to run the `make` command in the port's directory as
+
+```bash
 make BOARD=PYBV11 USER_C_MODULES=../../../ulab all
 ```
 which will prepare the firmware for pyboard.v.1.1. Similarly,
-```
+
+```bash
 make BOARD=PYBD_SF6 USER_C_MODULES=../../../ulab all
 ```
-will compile for the SF6 member of the PYBD series. Provided that you managed to compile the firmware, you would upload that by running
-either
-```
+will compile for the SF6 member of the PYBD series. If your target is `unix`, you don't need to specify the `BOARD` parameter.
+
+Provided that you managed to compile the firmware, you can upload it by running either
+
+```bash
 dfu-util --alt 0 -D firmware.dfu
 ```
 or
-```
+
+```bash
 python pydfu.py -u firmware.dfu
 ```

 If you get stuck somewhere in the process, more detailed instructions can be found under https://github.com/micropython/micropython/wiki/Getting-Started, and https://github.com/micropython/micropython/wiki/Pyboard-Firmware-Update.
+
+
+### ESP32-based boards
+
+```bash
+cd $BUILD_DIR/micropython
+git checkout b137d064e9e0bfebd2a59a9b312935031252e742
+# choose micropython version - note v1.12 is incompatible with ulab
+# and v1.13 is currently broken in some ways (on some platforms) https://github.com/BradenM/micropy-cli/issues/167
+# - the patch is not live yet (should be in 1.14), but is at this commit
+git submodule update --init
+cd $BUILD_DIR/micropython/mpy-cross && make # build cross-compiler (required)
+
+cd $BUILD_DIR/micropython/ports/esp32
+make ESPIDF= # will display supported ESP-IDF commit hashes
+# output should look like:
+# ...
+# Supported git hash (v3.3): 9e70825d1e1cbf7988cf36981774300066580ea7
+# Supported git hash (v4.0) (experimental): 4c81978a3e2220674a432a588292a4c860eef27b
+```
+
+Choose an ESP-IDF version from one of the options printed by the previous command:
+
+```bash
+ESPIDF_VER=9e70825d1e1cbf7988cf36981774300066580ea7
+
+# Download and prepare the SDK
+git clone https://github.com/espressif/esp-idf.git $BUILD_DIR/esp-idf
+cd $BUILD_DIR/esp-idf
+git checkout $ESPIDF_VER
+git submodule update --init --recursive # get idf submodules
+pip install -r ./requirements.txt # install python reqs
+```
+
+Next, install the ESP32 compiler. If using an ESP-IDF version >= 4.x (chosen by `$ESPIDF_VER` above), this can be done by running `. $BUILD_DIR/esp-idf/install.sh`. Otherwise, (for version 3.x) run:
+
+```bash
+cd $BUILD_DIR
+
+# for 64 bit linux
+curl https://dl.espressif.com/dl/xtensa-esp32-elf-linux64-1.22.0-80-g6c4433a-5.2.0.tar.gz | tar xvz
+
+# for 32 bit
+# curl https://dl.espressif.com/dl/xtensa-esp32-elf-linux32-1.22.0-80-g6c4433a-5.2.0.tar.gz | tar xvz
+
+# don't worry about adding to path; we'll specify that later
+
+# also, see https://docs.espressif.com/projects/esp-idf/en/v3.3.2/get-started for more info
+```
+
+We can now clone the `ulab` repository:
+
+```bash
+git clone https://github.com/v923z/micropython-ulab $BUILD_DIR/ulab
+```
+
+Finally, build the firmware:
+
+```bash
+cd $BUILD_DIR/micropython/ports/esp32
+# temporarily add esp32 compiler to path
+export PATH=$BUILD_DIR/xtensa-esp32-elf/bin:$PATH
+export ESPIDF=$BUILD_DIR/esp-idf # req'd by Makefile
+export BOARD=GENERIC # options are dirs in ./boards
+export USER_C_MODULES=$BUILD_DIR/ulab # include ulab in firmware
+
+make submodules && make all
+```
+
+If it compiles without error, you can plug in your ESP32 via USB and then flash it with:
+
+```bash
+make erase && make deploy
+```
--- a/build-cp.sh
+++ b/build-cp.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+set -e
+# POSIX compliant version
+readlinkf_posix() {
+  [ "${1:-}" ] || return 1
+  max_symlinks=40
+  CDPATH='' # to avoid changing to an unexpected directory
+
+  target=$1
+  [ -e "${target%/}" ] || target=${1%"${1##*[!/]}"} # trim trailing slashes
+  [ -d "${target:-/}" ] && target="$target/"
+
+  cd -P . 2>/dev/null || return 1
+  while [ "$max_symlinks" -ge 0 ] && max_symlinks=$((max_symlinks - 1)); do
+    if [ ! "$target" = "${target%/*}" ]; then
+      case $target in
+        /*) cd -P "${target%/*}/" 2>/dev/null || break ;;
+        *) cd -P "./${target%/*}" 2>/dev/null || break ;;
+      esac
+      target=${target##*/}
+    fi
+
+    if [ ! -L "$target" ]; then
+      target="${PWD%/}${target:+/}${target}"
+      printf '%s\n' "${target:-/}"
+      return 0
+    fi
+
+    # `ls -dl` format: "%s %u %s %s %u %s %s -> %s\n",
+    #   <file mode>, <number of links>, <owner name>, <group name>,
+    #   <size>, <date and time>, <pathname of link>, <contents of link>
+    # https://pubs.opengroup.org/onlinepubs/9699919799/utilities/ls.html
+    link=$(ls -dl -- "$target" 2>/dev/null) || break
+    target=${link#*" $target -> "}
+  done
+  return 1
+}
+NPROC=$(python -c 'import multiprocessing; print(multiprocessing.cpu_count())')
+HERE="$(dirname -- "$(readlinkf_posix -- "${0}")" )"
+[ -e circuitpython/py/py.mk ] || (git clone --no-recurse-submodules --depth 100 --branch 6.0.x https://github.com/adafruit/circuitpython && cd circuitpython && git submodule update --init lib/uzlib tools)
+rm -rf circuitpython/extmod/ulab; ln -s "$HERE" circuitpython/extmod/ulab
+make -C circuitpython/mpy-cross -j$NPROC
+sed -e '/MICROPY_PY_UHASHLIB/s/1/0/' < circuitpython/ports/unix/mpconfigport.h > circuitpython/ports/unix/mpconfigport_ulab.h
+# Work around circuitpython#3990
+make -C circuitpython/ports/unix -j$NPROC DEBUG=1 MICROPY_PY_FFI=0 MICROPY_PY_BTREE=0 MICROPY_SSL_AXTLS=0 MICROPY_PY_USSL=0 CFLAGS_EXTRA='-DMP_CONFIGFILE="<mpconfigport_ulab.h>" -Wno-tautological-constant-out-of-range-compare' build/genhdr/qstrdefs.generated.h
+make -k -C circuitpython/ports/unix -j$NPROC DEBUG=1 MICROPY_PY_FFI=0 MICROPY_PY_BTREE=0 MICROPY_SSL_AXTLS=0 MICROPY_PY_USSL=0 CFLAGS_EXTRA='-DMP_CONFIGFILE="<mpconfigport_ulab.h>" -Wno-tautological-constant-out-of-range-compare'
+
+for dir in "circuitpy" "common"
+do
+	if ! env MICROPY_MICROPYTHON=circuitpython/ports/unix/micropython ./run-tests -d tests/"$dir"; then
+		for exp in *.exp; do
+			testbase=$(basename "$exp" .exp)
+			printf '\nFAILURE %s\n' "$testbase"
+			diff -u "$testbase.exp" "$testbase.out"
+		done
+		exit 1
+	fi
+done
+
+#(cd circuitpython && sphinx-build -E -W -b html . _build/html)
--- a/build.sh
+++ b/build.sh
@@ -0,0 +1,56 @@
+#!/bin/sh
+# POSIX compliant version
+readlinkf_posix() {
+  [ "${1:-}" ] || return 1
+  max_symlinks=40
+  CDPATH='' # to avoid changing to an unexpected directory
+
+  target=$1
+  [ -e "${target%/}" ] || target=${1%"${1##*[!/]}"} # trim trailing slashes
+  [ -d "${target:-/}" ] && target="$target/"
+
+  cd -P . 2>/dev/null || return 1
+  while [ "$max_symlinks" -ge 0 ] && max_symlinks=$((max_symlinks - 1)); do
+    if [ ! "$target" = "${target%/*}" ]; then
+      case $target in
+        /*) cd -P "${target%/*}/" 2>/dev/null || break ;;
+        *) cd -P "./${target%/*}" 2>/dev/null || break ;;
+      esac
+      target=${target##*/}
+    fi
+
+    if [ ! -L "$target" ]; then
+      target="${PWD%/}${target:+/}${target}"
+      printf '%s\n' "${target:-/}"
+      return 0
+    fi
+
+    # `ls -dl` format: "%s %u %s %s %u %s %s -> %s\n",
+    #   <file mode>, <number of links>, <owner name>, <group name>,
+    #   <size>, <date and time>, <pathname of link>, <contents of link>
+    # https://pubs.opengroup.org/onlinepubs/9699919799/utilities/ls.html
+    link=$(ls -dl -- "$target" 2>/dev/null) || break
+    target=${link#*" $target -> "}
+  done
+  return 1
+}
+NPROC=`python3 -c 'import multiprocessing; print(multiprocessing.cpu_count())'`
+set -e
+HERE="$(dirname -- "$(readlinkf_posix -- "${0}")" )"
+[ -e micropython/py/py.mk ] || git clone --no-recurse-submodules https://github.com/micropython/micropython
+[ -e micropython/lib/axtls/README ] || (cd micropython && git submodule update --init lib/axtls )
+make -C micropython/mpy-cross -j${NPROC}
+make -C micropython/ports/unix -j${NPROC} axtls
+make -C micropython/ports/unix -j${NPROC} USER_C_MODULES="${HERE}" DEBUG=1 STRIP=: MICROPY_PY_FFI=0 MICROPY_PY_BTREE=0
+
+
+for dir in "numpy" "common"
+do
+	if ! env MICROPY_MICROPYTHON=micropython/ports/unix/micropython ./run-tests -d tests/"$dir"; then
+		for exp in *.exp; do
+			testbase=$(basename "$exp" .exp)
+			printf '\nFAILURE %s\n' "$testbase"
+			diff -u "$testbase.exp" "$testbase.out"
+		done
+	fi
+done
--- a/code/extras.c
+++ b/code/extras.c
@@ -1,33 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project,
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2020 Zoltán Vörös
-*/
-
-#include <math.h>
-#include <stdlib.h>
-#include <string.h>
-#include "py/obj.h"
-#include "py/runtime.h"
-#include "py/misc.h"
-#include "extras.h"
-
-#if ULAB_EXTRAS_MODULE
-
-STATIC const mp_rom_map_elem_t ulab_filter_globals_table[] = {
-    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_extras) },
-};
-
-STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_extras_globals, ulab_extras_globals_table);
-
-mp_obj_module_t ulab_filter_module = {
-    .base = { &mp_type_module },
-    .globals = (mp_obj_dict_t*)&mp_module_ulab_extras_globals,
-};
-
-#endif
--- a/code/fft.c
+++ b/code/fft.c
@@ -1,201 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#include <math.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include "py/runtime.h"
-#include "py/builtin.h"
-#include "py/binary.h"
-#include "py/obj.h"
-#include "py/objarray.h"
-#include "ndarray.h"
-#include "fft.h"
-
-#if ULAB_FFT_MODULE
-
-enum FFT_TYPE {
-    FFT_FFT,
-    FFT_IFFT,
-    FFT_SPECTRUM,
-};
-
-void fft_kernel(mp_float_t *real, mp_float_t *imag, int n, int isign) {
-    // This is basically a modification of four1 from Numerical Recipes
-    // The main difference is that this function takes two arrays, one 
-    // for the real, and one for the imaginary parts. 
-    int j, m, mmax, istep;
-    mp_float_t tempr, tempi;
-    mp_float_t wtemp, wr, wpr, wpi, wi, theta;
-
-    j = 0;
-    for(int i = 0; i < n; i++) {
-        if (j > i) {
-            SWAP(mp_float_t, real[i], real[j]);
-            SWAP(mp_float_t, imag[i], imag[j]);
-        }
-        m = n >> 1;
-        while (j >= m && m > 0) {
-            j -= m;
-            m >>= 1;
-        }
-        j += m;
-    }
-
-    mmax = 1;
-    while (n > mmax) {
-        istep = mmax << 1;
-        theta = -2.0*isign*MP_PI/istep;
-        wtemp = MICROPY_FLOAT_C_FUN(sin)(0.5 * theta);
-        wpr = -2.0 * wtemp * wtemp;
-        wpi = MICROPY_FLOAT_C_FUN(sin)(theta);
-        wr = 1.0;
-        wi = 0.0;
-        for(m = 0; m < mmax; m++) {
-            for(int i = m; i < n; i += istep) {
-                j = i + mmax;
-                tempr = wr * real[j] - wi * imag[j];
-                tempi = wr * imag[j] + wi * real[j];
-                real[j] = real[i] - tempr;
-                imag[j] = imag[i] - tempi;
-                real[i] += tempr;
-                imag[i] += tempi;
-            }
-            wtemp = wr;
-            wr = wr*wpr - wi*wpi + wr;
-            wi = wi*wpr + wtemp*wpi + wi;
-        }
-        mmax = istep;
-    }
-}
-
-mp_obj_t fft_fft_ifft_spectrum(size_t n_args, mp_obj_t arg_re, mp_obj_t arg_im, uint8_t type) {
-    if(!MP_OBJ_IS_TYPE(arg_re, &ulab_ndarray_type)) {
-        mp_raise_NotImplementedError(translate("FFT is defined for ndarrays only"));
-    } 
-    if(n_args == 2) {
-        if(!MP_OBJ_IS_TYPE(arg_im, &ulab_ndarray_type)) {
-            mp_raise_NotImplementedError(translate("FFT is defined for ndarrays only"));
-        }
-    }
-    // Check if input is of length of power of 2
-    ndarray_obj_t *re = MP_OBJ_TO_PTR(arg_re);
-    uint16_t len = re->array->len;
-    if((len & (len-1)) != 0) {
-        mp_raise_ValueError(translate("input array length must be power of 2"));
-    }
-    
-    ndarray_obj_t *out_re = create_new_ndarray(1, len, NDARRAY_FLOAT);
-    mp_float_t *data_re = (mp_float_t *)out_re->array->items;
-    
-    if(re->array->typecode == NDARRAY_FLOAT) { 
-        // By treating this case separately, we can save a bit of time.
-        // I don't know if it is worthwhile, though...
-        memcpy((mp_float_t *)out_re->array->items, (mp_float_t *)re->array->items, re->bytes);
-    } else {
-        for(size_t i=0; i < len; i++) {
-            *data_re++ = ndarray_get_float_value(re->array->items, re->array->typecode, i);
-        }
-        data_re -= len;
-    }
-    ndarray_obj_t *out_im = create_new_ndarray(1, len, NDARRAY_FLOAT);
-    mp_float_t *data_im = (mp_float_t *)out_im->array->items;
-
-    if(n_args == 2) {
-        ndarray_obj_t *im = MP_OBJ_TO_PTR(arg_im);
-        if (re->array->len != im->array->len) {
-            mp_raise_ValueError(translate("real and imaginary parts must be of equal length"));
-        }
-        if(im->array->typecode == NDARRAY_FLOAT) {
-            memcpy((mp_float_t *)out_im->array->items, (mp_float_t *)im->array->items, im->bytes);
-        } else {
-            for(size_t i=0; i < len; i++) {
-               *data_im++ = ndarray_get_float_value(im->array->items, im->array->typecode, i);
-            }
-            data_im -= len;
-        }
-    }
-
-    if((type == FFT_FFT) || (type == FFT_SPECTRUM)) {
-        fft_kernel(data_re, data_im, len, 1);
-        if(type == FFT_SPECTRUM) {
-            for(size_t i=0; i < len; i++) {
-                *data_re = MICROPY_FLOAT_C_FUN(sqrt)(*data_re * *data_re + *data_im * *data_im);
-                data_re++;
-                data_im++;
-            }
-        }
-    } else { // inverse transform
-        fft_kernel(data_re, data_im, len, -1);
-        // TODO: numpy accepts the norm keyword argument
-        for(size_t i=0; i < len; i++) {
-            *data_re++ /= len;
-            *data_im++ /= len;
-        }
-    }
-    if(type == FFT_SPECTRUM) {
-        return MP_OBJ_TO_PTR(out_re);
-    } else {
-        mp_obj_t tuple[2];
-        tuple[0] = out_re;
-        tuple[1] = out_im;
-        return mp_obj_new_tuple(2, tuple);
-    }
-}
-
-mp_obj_t fft_fft(size_t n_args, const mp_obj_t *args) {
-    if(n_args == 2) {
-        return fft_fft_ifft_spectrum(n_args, args[0], args[1], FFT_FFT);
-    } else {
-        return fft_fft_ifft_spectrum(n_args, args[0], mp_const_none, FFT_FFT);        
-    }
-}
-
-MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(fft_fft_obj, 1, 2, fft_fft);
-
-mp_obj_t fft_ifft(size_t n_args, const mp_obj_t *args) {
-    if(n_args == 2) {
-        return fft_fft_ifft_spectrum(n_args, args[0], args[1], FFT_IFFT);
-    } else {
-        return fft_fft_ifft_spectrum(n_args, args[0], mp_const_none, FFT_IFFT);
-    }
-}
-
-MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(fft_ifft_obj, 1, 2, fft_ifft);
-
-mp_obj_t fft_spectrum(size_t n_args, const mp_obj_t *args) {
-    if(n_args == 2) {
-        return fft_fft_ifft_spectrum(n_args, args[0], args[1], FFT_SPECTRUM);
-    } else {
-        return fft_fft_ifft_spectrum(n_args, args[0], mp_const_none, FFT_SPECTRUM);
-    }
-}
-
-MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(fft_spectrum_obj, 1, 2, fft_spectrum);
-
-#if !CIRCUITPY
-STATIC const mp_rom_map_elem_t ulab_fft_globals_table[] = {
-    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_fft) },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_fft), (mp_obj_t)&fft_fft_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_ifft), (mp_obj_t)&fft_ifft_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_spectrum), (mp_obj_t)&fft_spectrum_obj },
-};
-
-STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_fft_globals, ulab_fft_globals_table);
-
-mp_obj_module_t ulab_fft_module = {
-    .base = { &mp_type_module },
-    .globals = (mp_obj_dict_t*)&mp_module_ulab_fft_globals,
-};
-#endif
-
-#endif
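
For readers who want to check what the removed `fft_kernel` computes, the following is a rough pure-Python sketch of the same in-place radix-2 algorithm (bit-reversal permutation followed by butterfly passes with a trigonometric recurrence). It follows the loop structure of the C code above, but it is only an illustration: it works on plain Python lists rather than `ndarray` buffers, takes the length from the input instead of an `n` argument, and rejects empty input, which the C length check does not.

```python
import math


def fft_kernel(real, imag, isign=1):
    """In-place radix-2 FFT on two equal-length lists of floats.

    isign=1 gives the forward transform, isign=-1 the unscaled inverse
    (the caller divides by the length, as the C wrapper does).
    """
    n = len(real)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2")

    # Bit-reversal permutation of the input.
    j = 0
    for i in range(n):
        if j > i:
            real[i], real[j] = real[j], real[i]
            imag[i], imag[j] = imag[j], imag[i]
        m = n >> 1
        while m > 0 and j >= m:
            j -= m
            m >>= 1
        j += m

    # Butterfly passes; the sub-transform length doubles on every pass.
    mmax = 1
    while n > mmax:
        istep = mmax << 1
        theta = -2.0 * isign * math.pi / istep
        wtemp = math.sin(0.5 * theta)
        wpr = -2.0 * wtemp * wtemp
        wpi = math.sin(theta)
        wr, wi = 1.0, 0.0
        for m in range(mmax):
            for i in range(m, n, istep):
                j = i + mmax
                tempr = wr * real[j] - wi * imag[j]
                tempi = wr * imag[j] + wi * real[j]
                real[j] = real[i] - tempr
                imag[j] = imag[i] - tempi
                real[i] += tempr
                imag[i] += tempi
            # Advance the twiddle factor by the recurrence used in the C code.
            wr, wi = wr * wpr - wi * wpi + wr, wi * wpr + wr * wpi + wi
        mmax = istep
```

Calling `fft_kernel(re, im)` and then `fft_kernel(re, im, -1)` on the same lists should return the original data scaled by the length, which mirrors the division by `len` in the inverse branch of `fft_fft_ifft_spectrum`.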
--- a/code/fft.h
+++ b/code/fft.h
@@ -1,31 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#ifndef _FFT_
-#define _FFT_
-#include "ulab.h"
-
-#ifndef MP_PI
-#define MP_PI MICROPY_FLOAT_CONST(3.14159265358979323846)
-#endif
-
-#define SWAP(t, a, b) { t tmp = a; a = b; b = tmp; }
-
-#if ULAB_FFT_MODULE
-
-extern mp_obj_module_t ulab_fft_module;
-
-MP_DECLARE_CONST_FUN_OBJ_VAR_BETWEEN(fft_fft_obj);
-MP_DECLARE_CONST_FUN_OBJ_VAR_BETWEEN(fft_ifft_obj);
-MP_DECLARE_CONST_FUN_OBJ_VAR_BETWEEN(fft_spectrum_obj);
-
-#endif
-#endif
--- a/code/filter.c
+++ b/code/filter.c
@@ -1,101 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project,
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2020 Jeff Epler for Adafruit Industries
-*/
-
-#include <math.h>
-#include <stdlib.h>
-#include <string.h>
-#include "py/obj.h"
-#include "py/runtime.h"
-#include "py/misc.h"
-#include "filter.h"
-
-#if ULAB_FILTER_MODULE
-mp_obj_t filter_convolve(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_a, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_v, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(2, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-
-    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type) || !MP_OBJ_IS_TYPE(args[1].u_obj, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("convolve arguments must be ndarrays"));
-    }
-
-    ndarray_obj_t *a = MP_OBJ_TO_PTR(args[0].u_obj);
-    ndarray_obj_t *c = MP_OBJ_TO_PTR(args[1].u_obj);
-    int len_a = a->array->len;
-    int len_c = c->array->len;
-    // deal with linear arrays only
-    if(a->m*a->n != len_a || c->m*c->n != len_c) {
-        mp_raise_TypeError(translate("convolve arguments must be linear arrays"));
-    }
-    if(len_a == 0 || len_c == 0) {
-        mp_raise_TypeError(translate("convolve arguments must not be empty"));
-    }
-
-    int len = len_a + len_c - 1; // convolve mode "full"
-    ndarray_obj_t *out = create_new_ndarray(1, len, NDARRAY_FLOAT);
-    mp_float_t *outptr = out->array->items;
-    int off = len_c-1;
-
-    if(a->array->typecode == NDARRAY_FLOAT && c->array->typecode == NDARRAY_FLOAT) {
-        mp_float_t* a_items = (mp_float_t*)a->array->items;
-        mp_float_t* c_items = (mp_float_t*)c->array->items;
-        for(int k=-off; k<len-off; k++) {
-            mp_float_t accum = (mp_float_t)0;
-            int top_n = MIN(len_c, len_a - k);
-            int bot_n = MAX(-k, 0);
-            mp_float_t* a_ptr = a_items + bot_n + k;
-            mp_float_t* a_end = a_ptr + (top_n - bot_n);
-            mp_float_t* c_ptr = c_items + len_c - bot_n - 1;
-            for(; a_ptr != a_end;) {
-                accum += *a_ptr++ * *c_ptr--;
-            }
-            *outptr++ = accum;
-        }
-    } else {
-        for(int k=-off; k<len-off; k++) {
-            mp_float_t accum = (mp_float_t)0;
-            int top_n = MIN(len_c, len_a - k);
-            int bot_n = MAX(-k, 0);
-            for(int n=bot_n; n<top_n; n++) {
-                int idx_c = len_c - n - 1;
-                int idx_a = n+k;
-                mp_float_t ai = ndarray_get_float_value(a->array->items, a->array->typecode, idx_a);
-                mp_float_t ci = ndarray_get_float_value(c->array->items, c->array->typecode, idx_c);
-                accum += ai * ci;
-            }
-            *outptr++ = accum;
-        }
-    }
-
-    return out;
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(filter_convolve_obj, 2, filter_convolve);
-
-#if !CIRCUITPY
-STATIC const mp_rom_map_elem_t ulab_filter_globals_table[] = {
-    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_filter) },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_convolve), (mp_obj_t)&filter_convolve_obj },
-};
-
-STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_filter_globals, ulab_filter_globals_table);
-
-mp_obj_module_t ulab_filter_module = {
-    .base = { &mp_type_module },
-    .globals = (mp_obj_dict_t*)&mp_module_ulab_filter_globals,
-};
-#endif
-
-#endif
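
The index arithmetic in the removed `filter_convolve` can be hard to follow from the diff alone. The sketch below restates the generic (non-float) branch in pure Python, keeping the same `off`, `bot_n` and `top_n` bounds. The function name `convolve_full` is hypothetical, the inputs are plain lists rather than `ndarray`s, and empty input raises `ValueError` here whereas the C code raises a `TypeError`.

```python
def convolve_full(a, v):
    """Full linear convolution of two non-empty 1-D sequences."""
    len_a, len_v = len(a), len(v)
    if len_a == 0 or len_v == 0:
        raise ValueError("convolve arguments must not be empty")

    n_out = len_a + len_v - 1  # length of the "full" convolution
    off = len_v - 1            # shift so that k = -off maps to output index 0
    out = []
    for k in range(-off, n_out - off):
        bot_n = max(-k, 0)              # first n for which a[n + k] is in range
        top_n = min(len_v, len_a - k)   # one past the last valid n
        accum = 0.0
        for n in range(bot_n, top_n):
            accum += a[n + k] * v[len_v - n - 1]
        out.append(accum)
    return out
```

For each output element, the indices into `a` and `v` always sum to `k + off`, which is the defining property of a discrete convolution.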
--- a/code/linalg.c
+++ b/code/linalg.c
@@ -1,448 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#include <stdlib.h>
-#include <string.h>
-#include <math.h>
-#include "py/obj.h"
-#include "py/runtime.h"
-#include "py/misc.h"
-#include "linalg.h"
-
-#if ULAB_LINALG_MODULE
-
-mp_obj_t linalg_size(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(1, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-
-    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("size is defined for ndarrays only"));
-    } else {
-        ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(args[0].u_obj);
-        if(args[1].u_obj == mp_const_none) {
-            return mp_obj_new_int(ndarray->array->len);
-        } else if(MP_OBJ_IS_INT(args[1].u_obj)) {
-            uint8_t ax = mp_obj_get_int(args[1].u_obj);
-            if(ax == 0) {
-                if(ndarray->m == 1) {
-                    return mp_obj_new_int(ndarray->n);
-                } else {
-                    return mp_obj_new_int(ndarray->m);                    
-                }
-            } else if(ax == 1) {
-                if(ndarray->m == 1) {
-                    mp_raise_ValueError(translate("tuple index out of range"));
-                } else {
-                    return mp_obj_new_int(ndarray->n);
-                }
-            } else {
-                    mp_raise_ValueError(translate("tuple index out of range"));
-            }
-        } else {
-            mp_raise_TypeError(translate("wrong argument type"));
-        }
-    }
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(linalg_size_obj, 1, linalg_size);
-
-bool linalg_invert_matrix(mp_float_t *data, size_t N) {
-    // returns true, of the inversion was successful, 
-    // false, if the matrix is singular
-    
-    // initially, this is the unit matrix: the contents of this matrix is what 
-    // will be returned after all the transformations
-    mp_float_t *unit = m_new(mp_float_t, N*N);
-
-    mp_float_t elem = 1.0;
-    // initialise the unit matrix
-    memset(unit, 0, sizeof(mp_float_t)*N*N);
-    for(size_t m=0; m < N; m++) {
-        memcpy(&unit[m*(N+1)], &elem, sizeof(mp_float_t));
-    }
-    for(size_t m=0; m < N; m++){
-        // this could be faster with ((c < epsilon) && (c > -epsilon))
-        if(MICROPY_FLOAT_C_FUN(fabs)(data[m*(N+1)]) < epsilon) {
-            m_del(mp_float_t, unit, N*N);
-            return false;
-        }
-        for(size_t n=0; n < N; n++){
-            if(m != n){
-                elem = data[N*n+m] / data[m*(N+1)];
-                for(size_t k=0; k < N; k++){
-                    data[N*n+k] -= elem * data[N*m+k];
-                    unit[N*n+k] -= elem * unit[N*m+k];
-                }
-            }
-        }
-    }
-    for(size_t m=0; m < N; m++){ 
-        elem = data[m*(N+1)];
-        for(size_t n=0; n < N; n++){
-            data[N*m+n] /= elem;
-            unit[N*m+n] /= elem;
-        }
-    }
-    memcpy(data, unit, sizeof(mp_float_t)*N*N);
-    m_del(mp_float_t, unit, N*N);
-    return true;
-}
-
-mp_obj_t linalg_inv(mp_obj_t o_in) {
-    // since inv is not a class method, we have to inspect the input argument first
-    if(!MP_OBJ_IS_TYPE(o_in, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("only ndarrays can be inverted"));
-    }
-    ndarray_obj_t *o = MP_OBJ_TO_PTR(o_in);
-    if(!MP_OBJ_IS_TYPE(o_in, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("only ndarray objects can be inverted"));
-    }
-    if(o->m != o->n) {
-        mp_raise_ValueError(translate("only square matrices can be inverted"));
-    }
-    ndarray_obj_t *inverted = create_new_ndarray(o->m, o->n, NDARRAY_FLOAT);
-    mp_float_t *data = (mp_float_t *)inverted->array->items;
-    mp_obj_t elem;
-    for(size_t m=0; m < o->m; m++) { // rows first
-        for(size_t n=0; n < o->n; n++) { // columns next
-            // this could, perhaps, be done in single line... 
-            // On the other hand, we probably spend little time here
-            elem = mp_binary_get_val_array(o->array->typecode, o->array->items, m*o->n+n);
-            data[m*o->n+n] = (mp_float_t)mp_obj_get_float(elem);
-        }
-    }
-    
-    if(!linalg_invert_matrix(data, o->m)) {
-        // TODO: I am not sure this is needed here. Otherwise, 
-        // how should we free up the unused RAM of inverted?
-        m_del(mp_float_t, inverted->array->items, o->n*o->n);
-        mp_raise_ValueError(translate("input matrix is singular"));
-    }
-    return MP_OBJ_FROM_PTR(inverted);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_1(linalg_inv_obj, linalg_inv);
-
-mp_obj_t linalg_dot(mp_obj_t _m1, mp_obj_t _m2) {
-    // TODO: should the results be upcast?
-    if(!MP_OBJ_IS_TYPE(_m1, &ulab_ndarray_type) || !MP_OBJ_IS_TYPE(_m2, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("arguments must be ndarrays"));
-    }
-    ndarray_obj_t *m1 = MP_OBJ_TO_PTR(_m1);
-    ndarray_obj_t *m2 = MP_OBJ_TO_PTR(_m2);    
-    if(m1->n != m2->m) {
-        mp_raise_ValueError(translate("matrix dimensions do not match"));
-    }
-    // TODO: numpy uses upcasting here
-    ndarray_obj_t *out = create_new_ndarray(m1->m, m2->n, NDARRAY_FLOAT);
-    mp_float_t *outdata = (mp_float_t *)out->array->items;
-    mp_float_t sum, v1, v2;
-    for(size_t i=0; i < m1->m; i++) { // rows of m1
-        for(size_t j=0; j < m2->n; j++) { // columns of m2
-            sum = 0.0;
-            for(size_t k=0; k < m2->m; k++) {
-                // (i, k) * (k, j)
-                v1 = ndarray_get_float_value(m1->array->items, m1->array->typecode, i*m1->n+k);
-                v2 = ndarray_get_float_value(m2->array->items, m2->array->typecode, k*m2->n+j);
-                sum += v1 * v2;
-            }
-            outdata[j*m1->m+i] = sum;
-        }
-    }
-    return MP_OBJ_FROM_PTR(out);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_2(linalg_dot_obj, linalg_dot);
-
-mp_obj_t linalg_zeros_ones(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args, uint8_t kind) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_obj = MP_OBJ_NULL} } ,
-        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = NDARRAY_FLOAT} },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-    
-    uint8_t dtype = args[1].u_int;
-    if(!MP_OBJ_IS_INT(args[0].u_obj) && !MP_OBJ_IS_TYPE(args[0].u_obj, &mp_type_tuple)) {
-        mp_raise_TypeError(translate("input argument must be an integer or a 2-tuple"));
-    }
-    ndarray_obj_t *ndarray = NULL;
-    if(MP_OBJ_IS_INT(args[0].u_obj)) {
-        size_t n = mp_obj_get_int(args[0].u_obj);
-        ndarray = create_new_ndarray(1, n, dtype);
-    } else if(MP_OBJ_IS_TYPE(args[0].u_obj, &mp_type_tuple)) {
-        mp_obj_tuple_t *tuple = MP_OBJ_TO_PTR(args[0].u_obj);
-        if(tuple->len != 2) {
-            mp_raise_TypeError(translate("input argument must be an integer or a 2-tuple"));
-        }
-        ndarray = create_new_ndarray(mp_obj_get_int(tuple->items[0]), 
-                                                  mp_obj_get_int(tuple->items[1]), dtype);
-    }
-    if(kind == 1) {
-        mp_obj_t one = mp_obj_new_int(1);
-        for(size_t i=0; i < ndarray->array->len; i++) {
-            mp_binary_set_val_array(dtype, ndarray->array->items, i, one);
-        }
-    }
-    return MP_OBJ_FROM_PTR(ndarray);
-}
-
-mp_obj_t linalg_zeros(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return linalg_zeros_ones(n_args, pos_args, kw_args, 0);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(linalg_zeros_obj, 0, linalg_zeros);
-
-mp_obj_t linalg_ones(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return linalg_zeros_ones(n_args, pos_args, kw_args, 1);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(linalg_ones_obj, 0, linalg_ones);
-
-mp_obj_t linalg_eye(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_INT, {.u_int = 0} },
-        { MP_QSTR_M, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_k, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = 0} },        
-        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = NDARRAY_FLOAT} },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-
-    size_t n = args[0].u_int, m;
-    int16_t k = args[2].u_int;
-    uint8_t dtype = args[3].u_int;
-    if(args[1].u_rom_obj == mp_const_none) {
-        m = n;
-    } else {
-        m = mp_obj_get_int(args[1].u_rom_obj);
-    }
-    
-    ndarray_obj_t *ndarray = create_new_ndarray(m, n, dtype);
-    mp_obj_t one = mp_obj_new_int(1);
-    size_t i = 0;
-    if((k >= 0) && (k < n)) {
-        while(k < n) {
-            mp_binary_set_val_array(dtype, ndarray->array->items, i*n+k, one);
-            k++;
-            i++;
-        }
-    } else if((k < 0) && (-k < m)) {
-        k = -k;
-        i = 0;
-        while(k < m) {
-            mp_binary_set_val_array(dtype, ndarray->array->items, k*n+i, one);
-            k++;
-            i++;
-        }
-    }
-    return MP_OBJ_FROM_PTR(ndarray);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(linalg_eye_obj, 0, linalg_eye);
-
-mp_obj_t linalg_det(mp_obj_t oin) {
-    if(!MP_OBJ_IS_TYPE(oin, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("function defined for ndarrays only"));
-    }
-    ndarray_obj_t *in = MP_OBJ_TO_PTR(oin);
-    if(in->m != in->n) {
-        mp_raise_ValueError(translate("input must be square matrix"));
-    }
-    
-    mp_float_t *tmp = m_new(mp_float_t, in->n*in->n);
-    for(size_t i=0; i < in->array->len; i++){
-        tmp[i] = ndarray_get_float_value(in->array->items, in->array->typecode, i);
-    }
-    mp_float_t c;
-    for(size_t m=0; m < in->m-1; m++){
-        if(MICROPY_FLOAT_C_FUN(fabs)(tmp[m*(in->n+1)]) < epsilon) {
-            m_del(mp_float_t, tmp, in->n*in->n);
-            return mp_obj_new_float(0.0);
-        }
-        for(size_t n=0; n < in->n; n++){
-            if(m != n) {
-                c = tmp[in->n*n+m] / tmp[m*(in->n+1)];
-                for(size_t k=0; k < in->n; k++){
-                    tmp[in->n*n+k] -= c * tmp[in->n*m+k];
-                }
-            }
-        }
-    }
-    mp_float_t det = 1.0;
-                            
-    for(size_t m=0; m < in->m; m++){ 
-        det *= tmp[m*(in->n+1)];
-    }
-    m_del(mp_float_t, tmp, in->n*in->n);
-    return mp_obj_new_float(det);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_1(linalg_det_obj, linalg_det);
-
-mp_obj_t linalg_eig(mp_obj_t oin) {
-    if(!MP_OBJ_IS_TYPE(oin, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("function defined for ndarrays only"));
-    }
-    ndarray_obj_t *in = MP_OBJ_TO_PTR(oin);
-    if(in->m != in->n) {
-        mp_raise_ValueError(translate("input must be square matrix"));
-    }
-    mp_float_t *array = m_new(mp_float_t, in->array->len);
-    for(size_t i=0; i < in->array->len; i++) {
-        array[i] = ndarray_get_float_value(in->array->items, in->array->typecode, i);
-    }
-    // make sure the matrix is symmetric
-    for(size_t m=0; m < in->m; m++) {
-        for(size_t n=m+1; n < in->n; n++) {
-            // compare entry (m, n) to (n, m)
-            // TODO: this must probably be scaled!
-            if(epsilon < MICROPY_FLOAT_C_FUN(fabs)(array[m*in->n + n] - array[n*in->n + m])) {
-                mp_raise_ValueError(translate("input matrix is asymmetric"));
-            }
-        }
-    }
-    
-    // if we got this far, then the matrix will be symmetric
-    
-    ndarray_obj_t *eigenvectors = create_new_ndarray(in->m, in->n, NDARRAY_FLOAT);
-    mp_float_t *eigvectors = (mp_float_t *)eigenvectors->array->items;
-    // start out with the unit matrix
-    for(size_t m=0; m < in->m; m++) {
-        eigvectors[m*(in->n+1)] = 1.0;
-    }
-    mp_float_t largest, w, t, c, s, tau, aMk, aNk, vm, vn;
-    size_t M, N;
-    size_t iterations = JACOBI_MAX*in->n*in->n;
-    do {
-        iterations--;
-        // find the pivot here
-        M = 0;
-        N = 0;
-        largest = 0.0;
-        for(size_t m=0; m < in->m-1; m++) { // -1: no need to inspect last row
-            for(size_t n=m+1; n < in->n; n++) {
-                w = MICROPY_FLOAT_C_FUN(fabs)(array[m*in->n + n]);
-                if((largest < w) && (epsilon < w)) {
-                    M = m;
-                    N = n;
-                    largest = w;
-                }
-            }
-        }
-        if(M+N == 0) { // all entries are smaller than epsilon, there is not much we can do...
-            break;
-        }
-        // at this point, we have the pivot, and it is the entry (M, N)
-        // now we have to find the rotation angle
-        w = (array[N*in->n + N] - array[M*in->n + M]) / (2.0*array[M*in->n + N]);
-        // The following if/else chooses the smaller absolute value for the tangent 
-        // of the rotation angle. Going with the smaller should be numerically stabler.
-        if(w > 0) {
-            t = MICROPY_FLOAT_C_FUN(sqrt)(w*w + 1.0) - w;
-        } else {
-            t = -1.0*(MICROPY_FLOAT_C_FUN(sqrt)(w*w + 1.0) + w);
-        }
-        s = t / MICROPY_FLOAT_C_FUN(sqrt)(t*t + 1.0); // the sine of the rotation angle
-        c = 1.0 / MICROPY_FLOAT_C_FUN(sqrt)(t*t + 1.0); // the cosine of the rotation angle
-        tau = (1.0-c)/s; // this is equal to the tangent of the half of the rotation angle
-        
-        // at this point, we have the rotation angles, so we can transform the matrix
-        // first the two diagonal elements
-        // a(M, M) = a(M, M) - t*a(M, N)
-        array[M*in->n + M] = array[M*in->n + M] - t * array[M*in->n + N];
-        // a(N, N) = a(N, N) + t*a(M, N)
-        array[N*in->n + N] = array[N*in->n + N] + t * array[M*in->n + N];
-        // after the rotation, the a(M, N), and a(N, M) entries should become zero
-        array[M*in->n + N] = array[N*in->n + M] = 0.0;
-        // then all other elements in the column
-        for(size_t k=0; k < in->m; k++) {
-            if((k == M) || (k == N)) {
-                continue;
-            }
-            aMk = array[M*in->n + k];
-            aNk = array[N*in->n + k];
-            // a(M, k) = a(M, k) - s*(a(N, k) + tau*a(M, k))
-            array[M*in->n + k] -= s*(aNk + tau*aMk);
-            // a(N, k) = a(N, k) + s*(a(M, k) - tau*a(N, k))
-            array[N*in->n + k] += s*(aMk - tau*aNk);
-            // a(k, M) = a(M, k)
-            array[k*in->n + M] = array[M*in->n + k];
-            // a(k, N) = a(N, k)
-            array[k*in->n + N] = array[N*in->n + k];
-        }
-        // now we have to update the eigenvectors
-        // the rotation matrix, R, multiplies from the right
-        // R is the unit matrix, except for the 
-        // R(M,M) = R(N, N) = c
-        // R(N, M) = s
-        // (M, N) = -s
-        // entries. This means that only the Mth, and Nth columns will change
-        for(size_t m=0; m < in->m; m++) {
-            vm = eigvectors[m*in->n+M];
-            vn = eigvectors[m*in->n+N];
-            // the new value of eigvectors(m, M)
-            eigvectors[m*in->n+M] = c * vm - s * vn;
-            // the new value of eigvectors(m, N)
-            eigvectors[m*in->n+N] = s * vm + c * vn;
-        }
-    } while(iterations > 0);
-    
-    if(iterations == 0) { 
-        // the computation did not converge; numpy raises LinAlgError
-        m_del(mp_float_t, array, in->array->len);
-        mp_raise_ValueError(translate("iterations did not converge"));
-    }
-    ndarray_obj_t *eigenvalues = create_new_ndarray(1, in->n, NDARRAY_FLOAT);
-    mp_float_t *eigvalues = (mp_float_t *)eigenvalues->array->items;
-    for(size_t i=0; i < in->n; i++) {
-        eigvalues[i] = array[i*(in->n+1)];
-    }
-    m_del(mp_float_t, array, in->array->len);
-    
-    mp_obj_tuple_t *tuple = MP_OBJ_TO_PTR(mp_obj_new_tuple(2, NULL));
-    tuple->items[0] = MP_OBJ_FROM_PTR(eigenvalues);
-    tuple->items[1] = MP_OBJ_FROM_PTR(eigenvectors);
-    return tuple;
-    return MP_OBJ_FROM_PTR(eigenvalues);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_1(linalg_eig_obj, linalg_eig);
-
-#if !CIRCUITPY
-STATIC const mp_rom_map_elem_t ulab_linalg_globals_table[] = {
-    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_linalg) },
-    { MP_ROM_QSTR(MP_QSTR_size), (mp_obj_t)&linalg_size_obj },
-    { MP_ROM_QSTR(MP_QSTR_inv), (mp_obj_t)&linalg_inv_obj },
-    { MP_ROM_QSTR(MP_QSTR_dot), (mp_obj_t)&linalg_dot_obj },
-    { MP_ROM_QSTR(MP_QSTR_zeros), (mp_obj_t)&linalg_zeros_obj },
-    { MP_ROM_QSTR(MP_QSTR_ones), (mp_obj_t)&linalg_ones_obj },
-    { MP_ROM_QSTR(MP_QSTR_eye), (mp_obj_t)&linalg_eye_obj },
-    { MP_ROM_QSTR(MP_QSTR_det), (mp_obj_t)&linalg_det_obj },
-    { MP_ROM_QSTR(MP_QSTR_eig), (mp_obj_t)&linalg_eig_obj },    
-};
-
-STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_linalg_globals, ulab_linalg_globals_table);
-
-mp_obj_module_t ulab_linalg_module = {
-    .base = { &mp_type_module },
-    .globals = (mp_obj_dict_t*)&mp_module_ulab_linalg_globals,
-};
-#endif
-
-#endif
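Note on the removed `linalg_dot`: it stores each sum at `outdata[j*m1->m+i]`, although `out` is created with `m1->m` rows and `m2->n` columns, and the rest of the file indexes row-major (`m*o->n+n` in `linalg_inv`). Element (i, j) of a row-major result would normally sit at `i*m2->n+j`, so the old index appears to be transposed. The sketch below shows the row-major variant. It is a minimal standalone illustration, not the code this commit adds: `matmul_row_major` is a hypothetical helper, and plain `double` buffers stand in for ulab's `ndarray_obj_t`.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ulab: multiplies a (rows1 x inner) matrix
 * by an (inner x cols2) matrix. All three buffers are dense and row-major. */
void matmul_row_major(const double *a, const double *b, double *out,
                      size_t rows1, size_t inner, size_t cols2) {
    for (size_t i = 0; i < rows1; i++) {            /* rows of a */
        for (size_t j = 0; j < cols2; j++) {        /* columns of b */
            double sum = 0.0;
            for (size_t k = 0; k < inner; k++) {
                /* (i, k) of a times (k, j) of b */
                sum += a[i * inner + k] * b[k * cols2 + j];
            }
            /* element (i, j) of the result: row stride is cols2 */
            out[i * cols2 + j] = sum;
        }
    }
}
```

With a 2x3 and a 3x2 input, the result is 2x2 and the row stride of `out` is 2, which is where the old index and this one differ.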
--- a/code/linalg.h
+++ b/code/linalg.h
@ -1,35 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#ifndef _LINALG_
-#define _LINALG_
-
-#include "ulab.h"
-#include "ndarray.h"
-
-#if MICROPY_FLOAT_IMPL == MICROPY_FLOAT_IMPL_FLOAT
-#define epsilon        1.2e-7
-#elif MICROPY_FLOAT_IMPL == MICROPY_FLOAT_IMPL_DOUBLE
-#define epsilon        2.3e-16
-#endif
-
-#define JACOBI_MAX     20
-
-#if ULAB_LINALG_MODULE || ULAB_POLY_MODULE
-bool linalg_invert_matrix(mp_float_t *, size_t );
-#endif
-
-#if ULAB_LINALG_MODULE
-
-extern mp_obj_module_t ulab_linalg_module;
-
-#endif
-#endif
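Note on `linalg_invert_matrix`, declared in the removed header: the removed implementation is a Gauss-Jordan elimination that tests each diagonal pivot against `epsilon` and never swaps rows. The sketch below mirrors that approach outside MicroPython so that its behaviour can be checked in isolation. It rests on assumptions: `invert_matrix_sketch` and `SKETCH_EPSILON` are hypothetical names, `double` replaces `mp_float_t`, and `calloc`/`free` replace `m_new`/`m_del`.

```c
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_EPSILON 2.3e-16   /* the double-precision value from linalg.h */

/* Hypothetical standalone analogue of linalg_invert_matrix: inverts the
 * row-major N x N matrix in data in place. Returns false if a diagonal
 * pivot is (almost) zero, or if the scratch buffer cannot be allocated.
 * Like the removed routine, it does no row swapping. */
bool invert_matrix_sketch(double *data, size_t N) {
    double *unit = calloc(N * N, sizeof(double));
    if (unit == NULL) {
        return false;
    }
    for (size_t m = 0; m < N; m++) {
        unit[m * (N + 1)] = 1.0;                 /* start from the unit matrix */
    }
    for (size_t m = 0; m < N; m++) {
        if (fabs(data[m * (N + 1)]) < SKETCH_EPSILON) {
            free(unit);
            return false;
        }
        for (size_t n = 0; n < N; n++) {
            if (m == n) {
                continue;
            }
            /* eliminate column m from row n, in both matrices */
            double factor = data[N * n + m] / data[m * (N + 1)];
            for (size_t k = 0; k < N; k++) {
                data[N * n + k] -= factor * data[N * m + k];
                unit[N * n + k] -= factor * unit[N * m + k];
            }
        }
    }
    for (size_t m = 0; m < N; m++) {
        /* data is diagonal now; scale each row of unit by its pivot */
        double pivot = data[m * (N + 1)];
        for (size_t n = 0; n < N; n++) {
            unit[N * m + n] /= pivot;
        }
    }
    memcpy(data, unit, sizeof(double) * N * N);
    free(unit);
    return true;
}
```

Because there is no pivoting, an invertible matrix with a zero on the diagonal, such as {{0, 1}, {1, 0}}, is reported as singular. That is a property of this elimination scheme as sketched here, not a statement about the replacement code under `numpy/linalg/`.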
--- a/code/micropython.mk
+++ b/code/micropython.mk
@ -2,18 +2,29 @@
 USERMODULES_DIR := $(USERMOD_DIR)

 # Add all C files to SRC_USERMOD.
+SRC_USERMOD += $(USERMODULES_DIR)/scipy/optimize/optimize.c
+SRC_USERMOD += $(USERMODULES_DIR)/scipy/signal/signal.c
+SRC_USERMOD += $(USERMODULES_DIR)/scipy/special/special.c
+SRC_USERMOD += $(USERMODULES_DIR)/ndarray_operators.c
+SRC_USERMOD += $(USERMODULES_DIR)/ulab_tools.c
 SRC_USERMOD += $(USERMODULES_DIR)/ndarray.c
-SRC_USERMOD += $(USERMODULES_DIR)/linalg.c
-SRC_USERMOD += $(USERMODULES_DIR)/vectorise.c
-SRC_USERMOD += $(USERMODULES_DIR)/poly.c
-SRC_USERMOD += $(USERMODULES_DIR)/fft.c
-SRC_USERMOD += $(USERMODULES_DIR)/numerical.c
-SRC_USERMOD += $(USERMODULES_DIR)/filter.c
-SRC_USERMOD += $(USERMODULES_DIR)/extras.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/approx/approx.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/compare/compare.c
+SRC_USERMOD += $(USERMODULES_DIR)/ulab_create.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/fft/fft.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/fft/fft_tools.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/filter/filter.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/linalg/linalg.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/linalg/linalg_tools.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/numerical/numerical.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/poly/poly.c
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/vector/vector.c
+SRC_USERMOD += $(USERMODULES_DIR)/user/user.c
+
+SRC_USERMOD += $(USERMODULES_DIR)/numpy/numpy.c
+SRC_USERMOD += $(USERMODULES_DIR)/scipy/scipy.c
 SRC_USERMOD += $(USERMODULES_DIR)/ulab.c

-# We can add our module folder to include paths if needed
-# This is not actually needed in this example.
 CFLAGS_USERMOD += -I$(USERMODULES_DIR)

-CFLAGS_EXTRA = -DMODULE_ULAB_ENABLED=1
+override CFLAGS_EXTRA += -DMODULE_ULAB_ENABLED=1
--- a/code/ndarray.c
+++ b/code/ndarray.c
--- a/code/ndarray.h
+++ b/code/ndarray.h
@ -6,7 +6,8 @@
 *
 * The MIT License (MIT)
 *
- * Copyright (c) 2019-2020 Zoltán Vörös
+ * Copyright (c) 2019-2021 Zoltán Vörös
+ *               2020 Jeff Epler for Adafruit Industries
 */

 #ifndef _NDARRAY_
@ -17,7 +18,14 @@
 #include "py/objstr.h"
 #include "py/objlist.h"

-#define PRINT_MAX  10
+#include "ulab.h"
+
+#ifndef MP_PI
+#define MP_PI MICROPY_FLOAT_CONST(3.14159265358979323846)
+#endif
+#ifndef MP_E
+#define MP_E MICROPY_FLOAT_CONST(2.71828182845904523536)
+#endif

 #if MICROPY_FLOAT_IMPL == MICROPY_FLOAT_IMPL_FLOAT
 #define FLOAT_TYPECODE 'f'
@ -25,15 +33,29 @@
 #define FLOAT_TYPECODE 'd'
 #endif

-#if !CIRCUITPY
-#define translate(x) x
+// this typedef is lifted from objfloat.c, because mp_obj_float_t is not exposed
+typedef struct _mp_obj_float_t {
+    mp_obj_base_t base;
+    mp_float_t value;
+} mp_obj_float_t;
+
+#if CIRCUITPY
+#define mp_obj_is_bool(o) (MP_OBJ_IS_TYPE((o), &mp_type_bool))
+#define mp_obj_is_int(x) (MP_OBJ_IS_INT((x)))
+#else
+#define translate(x) MP_ERROR_TEXT(x)
 #endif

-#define SWAP(t, a, b) { t tmp = a; a = b; b = tmp; }
+#define NDARRAY_NUMERIC   0
+#define NDARRAY_BOOLEAN   1
+
+#define NDARRAY_NDARRAY_TYPE    1
+#define NDARRAY_ITERABLE_TYPE   2

 extern const mp_obj_type_t ulab_ndarray_type;

 enum NDARRAY_TYPE {
+    NDARRAY_BOOL = '?', // this must never be assigned to the dtype!
    NDARRAY_UINT8 = 'B',
    NDARRAY_INT8 = 'b',
    NDARRAY_UINT16 = 'H',
@ -43,23 +65,65 @@ enum NDARRAY_TYPE {

 typedef struct _ndarray_obj_t {
    mp_obj_base_t base;
-    size_t m, n;
+    uint8_t dtype;
+    uint8_t itemsize;
+    uint8_t boolean;
+    uint8_t ndim;
    size_t len;
-    mp_obj_array_t *array;
-    size_t bytes;
+    size_t shape[ULAB_MAX_DIMS];
+    int32_t strides[ULAB_MAX_DIMS];
+    void *array;
 } ndarray_obj_t;

-mp_obj_t mp_obj_new_ndarray_iterator(mp_obj_t , size_t , mp_obj_iter_buf_t *);
+#if ULAB_HAS_DTYPE_OBJECT
+extern const mp_obj_type_t ulab_dtype_type;

-mp_float_t ndarray_get_float_value(void *, uint8_t , size_t );
+typedef struct _dtype_obj_t {
+    mp_obj_base_t base;
+    uint8_t dtype;
+} dtype_obj_t;
+
+void ndarray_dtype_print(const mp_print_t *, mp_obj_t , mp_print_kind_t );
+
+#ifdef CIRCUITPY
+mp_obj_t ndarray_dtype_make_new(const mp_obj_type_t *type, size_t n_args, const mp_obj_t *args, mp_map_t *kw_args);
+#else
+mp_obj_t ndarray_dtype_make_new(const mp_obj_type_t *, size_t , size_t , const mp_obj_t *);
+#endif /* CIRCUITPY */
+#endif /* ULAB_HAS_DTYPE_OBJECT */
+
+mp_obj_t ndarray_new_ndarray_iterator(mp_obj_t , mp_obj_iter_buf_t *);
+
+mp_float_t ndarray_get_float_value(void *, uint8_t );
+mp_float_t ndarray_get_float_index(void *, uint8_t , size_t );
+bool ndarray_object_is_array_like(mp_obj_t );
 void fill_array_iterable(mp_float_t *, mp_obj_t );
+size_t *ndarray_shape_vector(size_t , size_t , size_t , size_t );

-void ndarray_print_row(const mp_print_t *, mp_obj_array_t *, size_t , size_t );
 void ndarray_print(const mp_print_t *, mp_obj_t , mp_print_kind_t );
-void ndarray_assign_elements(mp_obj_array_t *, mp_obj_t , uint8_t , size_t *);
-ndarray_obj_t *create_new_ndarray(size_t , size_t , uint8_t );

-mp_obj_t ndarray_copy(mp_obj_t );
+#if ULAB_HAS_PRINTOPTIONS
+mp_obj_t ndarray_set_printoptions(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(ndarray_set_printoptions_obj);
+
+mp_obj_t ndarray_get_printoptions(void);
+MP_DECLARE_CONST_FUN_OBJ_0(ndarray_get_printoptions_obj);
+#endif
+
+void ndarray_assign_elements(ndarray_obj_t *, mp_obj_t , uint8_t , size_t *);
+size_t *ndarray_contract_shape(ndarray_obj_t *, uint8_t );
+int32_t *ndarray_contract_strides(ndarray_obj_t *, uint8_t );
+
+ndarray_obj_t *ndarray_new_dense_ndarray(uint8_t , size_t *, uint8_t );
+ndarray_obj_t *ndarray_new_ndarray_from_tuple(mp_obj_tuple_t *, uint8_t );
+ndarray_obj_t *ndarray_new_ndarray(uint8_t , size_t *, int32_t *, uint8_t );
+ndarray_obj_t *ndarray_new_linear_array(size_t , uint8_t );
+ndarray_obj_t *ndarray_new_view(ndarray_obj_t *, uint8_t , size_t *, int32_t *, int32_t );
+bool ndarray_is_dense(ndarray_obj_t *);
+ndarray_obj_t *ndarray_copy_view(ndarray_obj_t *);
+void ndarray_copy_array(ndarray_obj_t *, ndarray_obj_t *);
+
+MP_DECLARE_CONST_FUN_OBJ_KW(ndarray_array_constructor_obj);
 #ifdef CIRCUITPY
 mp_obj_t ndarray_make_new(const mp_obj_type_t *type, size_t n_args, const mp_obj_t *args, mp_map_t *kw_args);
 #else
@ -67,79 +131,588 @@ mp_obj_t ndarray_make_new(const mp_obj_type_t *, size_t , size_t , const mp_obj_
 #endif
 mp_obj_t ndarray_subscr(mp_obj_t , mp_obj_t , mp_obj_t );
 mp_obj_t ndarray_getiter(mp_obj_t , mp_obj_iter_buf_t *);
+bool ndarray_can_broadcast(ndarray_obj_t *, ndarray_obj_t *, uint8_t *, size_t *, int32_t *, int32_t *);
+bool ndarray_can_broadcast_inplace(ndarray_obj_t *, ndarray_obj_t *, int32_t *);
 mp_obj_t ndarray_binary_op(mp_binary_op_t , mp_obj_t , mp_obj_t );
 mp_obj_t ndarray_unary_op(mp_unary_op_t , mp_obj_t );

-mp_obj_t ndarray_shape(mp_obj_t );
-mp_obj_t ndarray_size(mp_obj_t );
-mp_obj_t ndarray_itemsize(mp_obj_t );
-mp_obj_t ndarray_flatten(size_t , const mp_obj_t *, mp_map_t *);
+size_t *ndarray_new_coords(uint8_t );
+void ndarray_rewind_array(uint8_t , uint8_t *, size_t *, int32_t *, size_t *);

+// various ndarray methods
+#if NDARRAY_HAS_COPY
+mp_obj_t ndarray_copy(mp_obj_t );
+MP_DECLARE_CONST_FUN_OBJ_1(ndarray_copy_obj);
+#endif
+
+#if NDARRAY_HAS_FLATTEN
+mp_obj_t ndarray_flatten(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(ndarray_flatten_obj);
+#endif
+
+mp_obj_t ndarray_dtype(mp_obj_t );
+mp_obj_t ndarray_itemsize(mp_obj_t );
+mp_obj_t ndarray_size(mp_obj_t );
+mp_obj_t ndarray_shape(mp_obj_t );
+mp_obj_t ndarray_strides(mp_obj_t );
+
+#if NDARRAY_HAS_RESHAPE
 mp_obj_t ndarray_reshape(mp_obj_t , mp_obj_t );
 MP_DECLARE_CONST_FUN_OBJ_2(ndarray_reshape_obj);
+#endif

+#if NDARRAY_HAS_TOBYTES
+mp_obj_t ndarray_tobytes(mp_obj_t );
+MP_DECLARE_CONST_FUN_OBJ_1(ndarray_tobytes_obj);
+#endif
+
+#if NDARRAY_HAS_TRANSPOSE
 mp_obj_t ndarray_transpose(mp_obj_t );
 MP_DECLARE_CONST_FUN_OBJ_1(ndarray_transpose_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_NDINFO
+mp_obj_t ndarray_info(mp_obj_t );
+MP_DECLARE_CONST_FUN_OBJ_1(ndarray_info_obj);
+#endif

-mp_int_t ndarray_get_buffer(mp_obj_t obj, mp_buffer_info_t *bufinfo, mp_uint_t flags);
 //void ndarray_attributes(mp_obj_t , qstr , mp_obj_t *);

+ndarray_obj_t *ndarray_from_mp_obj(mp_obj_t );

-#define CREATE_SINGLE_ITEM(outarray, type, typecode, value) do {\
-    ndarray_obj_t *tmp = create_new_ndarray(1, 1, (typecode));\
-    type *tmparr = (type *)tmp->array->items;\
-    tmparr[0] = (type)(value);\
-    (outarray) = MP_OBJ_FROM_PTR(tmp);\
+
+#define BOOLEAN_ASSIGNMENT_LOOP(type_left, type_right, ndarray, iarray, istride, varray, vstride)\
+    type_left *array = (type_left *)(ndarray)->array;\
+    for(size_t i=0; i < (ndarray)->len; i++) {\
+        if(*(iarray)) {\
+            *array = (type_left)(*((type_right *)(varray)));\
+        }\
+        array += (ndarray)->strides[ULAB_MAX_DIMS - 1] / (ndarray)->itemsize;\
+        (iarray) += (istride);\
+        (varray) += (vstride);\
    } while(0)

-/*  
-    mp_obj_t row = mp_obj_new_list(n, NULL);
-    mp_obj_list_t *row_ptr = MP_OBJ_TO_PTR(row);
-    
-    should work outside the loop, but it doesn't. Go figure! 
-*/
-
-#define RUN_BINARY_LOOP(typecode, type_out, type_left, type_right, ol, or, op) do {\
-    type_left *left = (type_left *)(ol)->array->items;\
-    type_right *right = (type_right *)(or)->array->items;\
-    uint8_t inc = 0;\
-    if((or)->array->len > 1) inc = 1;\
-    if(((op) == MP_BINARY_OP_ADD) || ((op) == MP_BINARY_OP_SUBTRACT) || ((op) == MP_BINARY_OP_MULTIPLY)) {\
-        ndarray_obj_t *out = create_new_ndarray(ol->m, ol->n, typecode);\
-        type_out *(odata) = (type_out *)out->array->items;\
-        if((op) == MP_BINARY_OP_ADD) { for(size_t i=0, j=0; i < (ol)->array->len; i++, j+=inc) odata[i] = left[i] + right[j];}\
-        if((op) == MP_BINARY_OP_SUBTRACT) { for(size_t i=0, j=0; i < (ol)->array->len; i++, j+=inc) odata[i] = left[i] - right[j];}\
-        if((op) == MP_BINARY_OP_MULTIPLY) { for(size_t i=0, j=0; i < (ol)->array->len; i++, j+=inc) odata[i] = left[i] * right[j];}\
-        return MP_OBJ_FROM_PTR(out);\
-    } else if((op) == MP_BINARY_OP_TRUE_DIVIDE) {\
-        ndarray_obj_t *out = create_new_ndarray(ol->m, ol->n, NDARRAY_FLOAT);\
-        mp_float_t *odata = (mp_float_t *)out->array->items;\
-        for(size_t i=0, j=0; i < (ol)->array->len; i++, j+=inc) odata[i] = (mp_float_t)left[i]/(mp_float_t)right[j];\
-        return MP_OBJ_FROM_PTR(out);\
-    } else if(((op) == MP_BINARY_OP_LESS) || ((op) == MP_BINARY_OP_LESS_EQUAL) ||  \
-             ((op) == MP_BINARY_OP_MORE) || ((op) == MP_BINARY_OP_MORE_EQUAL)) {\
-        mp_obj_t out_list = mp_obj_new_list(0, NULL);\
-        size_t m = (ol)->m, n = (ol)->n;\
-        for(size_t i=0, r=0; i < m; i++, r+=inc) {\
-            mp_obj_t row = mp_obj_new_list(n, NULL);\
-            mp_obj_list_t *row_ptr = MP_OBJ_TO_PTR(row);\
-            for(size_t j=0, s=0; j < n; j++, s+=inc) {\
-                row_ptr->items[j] = mp_const_false;\
-                if((op) == MP_BINARY_OP_LESS) {\
-                    if(left[i*n+j] < right[r*n+s]) row_ptr->items[j] = mp_const_true;\
-                } else if((op) == MP_BINARY_OP_LESS_EQUAL) {\
-                    if(left[i*n+j] <= right[r*n+s]) row_ptr->items[j] = mp_const_true;\
-                } else if((op) == MP_BINARY_OP_MORE) {\
-                    if(left[i*n+j] > right[r*n+s]) row_ptr->items[j] = mp_const_true;\
-                } else if((op) == MP_BINARY_OP_MORE_EQUAL) {\
-                    if(left[i*n+j] >= right[r*n+s]) row_ptr->items[j] = mp_const_true;\
-                }\
-            }\
-            if(m == 1) return row;\
-            mp_obj_list_append(out_list, row);\
-        }\
-        return out_list;\
-    }\
+#if ULAB_HAS_FUNCTION_ITERATOR
+#define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t *lcoords = ndarray_new_coords((results)->ndim);\
+    size_t *rcoords = ndarray_new_coords((results)->ndim);\
+    for(size_t i=0; i < (results)->len/(results)->shape[ULAB_MAX_DIMS -1]; i++) {\
+        size_t l = 0;\
+        do {\
+            *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        ndarray_rewind_array((results)->ndim, (larray), (results)->shape, (lstrides), lcoords);\
+        ndarray_rewind_array((results)->ndim, (rarray), (results)->shape, (rstrides), rcoords);\
    } while(0)

+#define INPLACE_LOOP(results, type_left, type_right, larray, rarray, rstrides, OPERATOR)\
+    size_t *lcoords = ndarray_new_coords((results)->ndim);\
+    size_t *rcoords = ndarray_new_coords((results)->ndim);\
+    for(size_t i=0; i < (results)->len/(results)->shape[ULAB_MAX_DIMS -1]; i++) {\
+        size_t l = 0;\
+        do {\
+            *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+            (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        ndarray_rewind_array((results)->ndim, (larray), (results)->shape, (results)->strides, lcoords);\
+        ndarray_rewind_array((results)->ndim, (rarray), (results)->shape, (rstrides), rcoords);\
+    } while(0)
+
+#define EQUALITY_LOOP(results, array, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t *lcoords = ndarray_new_coords((results)->ndim);\
+    size_t *rcoords = ndarray_new_coords((results)->ndim);\
+    for(size_t i=0; i < (results)->len/(results)->shape[ULAB_MAX_DIMS -1]; i++) {\
+        size_t l = 0;\
+        do {\
+            *(array)++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? 1 : 0;\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        ndarray_rewind_array((results)->ndim, (larray), (results)->shape, (lstrides), lcoords);\
+        ndarray_rewind_array((results)->ndim, (rarray), (results)->shape, (rstrides), rcoords);\
+    } while(0)
+
+#define POWER_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t *lcoords = ndarray_new_coords((results)->ndim);\
+    size_t *rcoords = ndarray_new_coords((results)->ndim);\
+    for(size_t i=0; i < (results)->len/(results)->shape[ULAB_MAX_DIMS -1]; i++) {\
+        size_t l = 0;\
+        do {\
+            *array++ = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        ndarray_rewind_array((results)->ndim, (larray), (results)->shape, (lstrides), lcoords);\
+        ndarray_rewind_array((results)->ndim, (rarray), (results)->shape, (rstrides), rcoords);\
+    } while(0)
+
+#else
+
+#if ULAB_MAX_DIMS == 1
+#define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t l = 0;\
+    do {\
+        *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+
+#define INPLACE_LOOP(results, type_left, type_right, larray, rarray, rstrides, OPERATOR)\
+    size_t l = 0;\
+    do {\
+        *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+
+#define EQUALITY_LOOP(results, array, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t l = 0;\
+    do {\
+        *(array)++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? 1 : 0;\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+
+#define POWER_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t l = 0;\
+    do {\
+        *array++ = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+
+#endif /* ULAB_MAX_DIMS == 1 */
+
+#if ULAB_MAX_DIMS == 2
+#define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+
+#define INPLACE_LOOP(results, type_left, type_right, larray, rarray, rstrides, OPERATOR)\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+            (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+
+#define EQUALITY_LOOP(results, array, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *(array)++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? 1 : 0;\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+
+#define POWER_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *array++ = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+
+#endif /* ULAB_MAX_DIMS == 2 */
+
+#if ULAB_MAX_DIMS == 3
+#define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+
+#define INPLACE_LOOP(results, type_left, type_right, larray, rarray, rstrides, OPERATOR)\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+                (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (results)->strides[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+
+#define EQUALITY_LOOP(results, array, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *(array)++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? 1 : 0;\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+
+#define POWER_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *array++ = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+
+#endif /* ULAB_MAX_DIMS == 3 */
+
+#if ULAB_MAX_DIMS == 4
+#define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+                    (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (results)->shape[ULAB_MAX_DIMS - 4]);\
+
+#define INPLACE_LOOP(results, type_left, type_right, larray, rarray, rstrides, OPERATOR)\
+    size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\
+                    (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (results)->strides[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (results)->strides[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (results)->strides[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (results)->shape[ULAB_MAX_DIMS - 4]);\
+
+#define EQUALITY_LOOP(results, array, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *(array)++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? 1 : 0;\
+                    (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (results)->shape[ULAB_MAX_DIMS - 4]);\
+
+#define POWER_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides)\
+    type_out *array = (type_out *)(results)->array;\
+    size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *array++ = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+                    (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (results)->shape[ULAB_MAX_DIMS - 4]);\
+
+#endif /* ULAB_MAX_DIMS == 4 */
+#endif /* ULAB_HAS_FUNCTION_ITERATOR */
+
+
+#if ULAB_MAX_DIMS == 1
+#define ASSIGNMENT_LOOP(results, type_left, type_right, lstrides, rarray, rstrides)\
+    type_left *larray = (type_left *)(results)->array;\
+    size_t l = 0;\
+    do {\
+        *larray = (type_left)(*((type_right *)(rarray)));\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+
+#endif /* ULAB_MAX_DIMS == 1 */
+
+#if ULAB_MAX_DIMS == 2
+#define ASSIGNMENT_LOOP(results, type_left, type_right, lstrides, rarray, rstrides)\
+    type_left *larray = (type_left *)(results)->array;\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *larray = (type_left)(*((type_right *)(rarray)));\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+
+#endif /* ULAB_MAX_DIMS == 2 */
+
+#if ULAB_MAX_DIMS == 3
+#define ASSIGNMENT_LOOP(results, type_left, type_right, lstrides, rarray, rstrides)\
+    type_left *larray = (type_left *)(results)->array;\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *larray = (type_left)(*((type_right *)(rarray)));\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+
+#endif /* ULAB_MAX_DIMS == 3 */
+
+#if ULAB_MAX_DIMS == 4
+#define ASSIGNMENT_LOOP(results, type_left, type_right, lstrides, rarray, rstrides)\
+    type_left *larray = (type_left *)(results)->array;\
+    size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *larray = (type_left)(*((type_right *)(rarray)));\
+                    (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (results)->shape[ULAB_MAX_DIMS - 4]);\
+
+#endif /* ULAB_MAX_DIMS == 4 */
+
 #endif
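
The loop macros above all follow one pattern: advance both operand pointers by the innermost byte stride, then rewind that axis and step along the next outer one. As a rough illustration only, the two-dimensional case can be written as a plain function. This is a sketch under stated assumptions, not code from ulab: the name `strided_add_u8_2d`, its signature, and the fixed `uint8`/`uint16` types are hypothetical, and strides are taken to be in bytes, as the `uint8_t *` operand pointers in the macros suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ulab: adds two uint8 operands element-wise
   into a dense uint16 output, walking each operand by its own byte strides.
   It mirrors the shape of BINARY_LOOP for ULAB_MAX_DIMS == 2. */
static void strided_add_u8_2d(uint16_t *out, const size_t shape[2],
                              const uint8_t *larray, const int32_t lstrides[2],
                              const uint8_t *rarray, const int32_t rstrides[2]) {
    /* do/while executes each body at least once, so empty axes are not handled */
    assert(shape[0] > 0 && shape[1] > 0);
    size_t k = 0;
    do {
        size_t l = 0;
        do {
            *out++ = (uint16_t)(*larray + *rarray);
            larray += lstrides[1];
            rarray += rstrides[1];
            l++;
        } while(l < shape[1]);
        /* rewind the innermost axis, then step along the outer axis */
        larray -= lstrides[1] * (ptrdiff_t)shape[1];
        larray += lstrides[0];
        rarray -= rstrides[1] * (ptrdiff_t)shape[1];
        rarray += rstrides[0];
        k++;
    } while(k < shape[0]);
}
```

Passing an outer stride of 0 for one operand reuses the same row on every outer iteration, which is how a single row can be combined with every row of a two-dimensional operand.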
--- a/code/ndarray_operators.c
+++ b/code/ndarray_operators.c
@ -0,0 +1,807 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+*/
+
+
+#include <math.h>
+
+#include "py/runtime.h"
+#include "py/objtuple.h"
+#include "ndarray.h"
+#include "ndarray_operators.h"
+#include "ulab.h"
+#include "ulab_tools.h"
+
+/*
+    This file contains the actual implementations of the various
+    ndarray operators.
+
+    These are the upcasting rules of the binary operators
+
+    - if one of the operands is a float, the result is always float
+    - operation on identical types preserves type
+
+    uint8 + int8 => int16
+    uint8 + int16 => int16
+    uint8 + uint16 => uint16
+    int8 + int16 => int16
+    int8 + uint16 => uint16
+    uint16 + int16 => float
+*/
+
+#if NDARRAY_HAS_BINARY_OP_EQUAL || NDARRAY_HAS_BINARY_OP_NOT_EQUAL
+mp_obj_t ndarray_binary_equality(ndarray_obj_t *lhs, ndarray_obj_t *rhs,
+                                            uint8_t ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides, mp_binary_op_t op) {
+
+    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT8);
+    results->boolean = 1;
+    uint8_t *array = (uint8_t *)results->array;
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    #if NDARRAY_HAS_BINARY_OP_EQUAL
+    if(op == MP_BINARY_OP_EQUAL) {
+        if(lhs->dtype == NDARRAY_UINT8) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, uint8_t, int8_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint8_t, int16_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, ==);
+            }
+        } else if(lhs->dtype == NDARRAY_INT8) {
+            if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, int8_t, int8_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, int8_t, uint16_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int8_t, int16_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, ==);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        } else if(lhs->dtype == NDARRAY_UINT16) {
+            if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint16_t, int16_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, ==);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        } else if(lhs->dtype == NDARRAY_INT16) {
+            if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int16_t, int16_t, larray, lstrides, rarray, rstrides, ==);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, ==);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        } else if(lhs->dtype == NDARRAY_FLOAT) {
+            if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, ==);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        }
+    }
+    #endif /* NDARRAY_HAS_BINARY_OP_EQUAL */
+
+    #if NDARRAY_HAS_BINARY_OP_NOT_EQUAL
+    if(op == MP_BINARY_OP_NOT_EQUAL) {
+        if(lhs->dtype == NDARRAY_UINT8) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, uint8_t, int8_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint8_t, int16_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, !=);
+            }
+        } else if(lhs->dtype == NDARRAY_INT8) {
+            if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, int8_t, int8_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, int8_t, uint16_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int8_t, int16_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, !=);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        } else if(lhs->dtype == NDARRAY_UINT16) {
+            if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint16_t, int16_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, !=);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        } else if(lhs->dtype == NDARRAY_INT16) {
+            if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int16_t, int16_t, larray, lstrides, rarray, rstrides, !=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, !=);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        } else if(lhs->dtype == NDARRAY_FLOAT) {
+            if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, !=);
+            } else {
+                return ndarray_binary_op(op, rhs, lhs);
+            }
+        }
+    }
+    #endif /* NDARRAY_HAS_BINARY_OP_NOT_EQUAL */
+
+    return MP_OBJ_FROM_PTR(results);
+}
+#endif /* NDARRAY_HAS_BINARY_OP_EQUAL || NDARRAY_HAS_BINARY_OP_NOT_EQUAL */
+
+#if NDARRAY_HAS_BINARY_OP_ADD
+mp_obj_t ndarray_binary_add(ndarray_obj_t *lhs, ndarray_obj_t *rhs,
+                                        uint8_t ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
+
+    ndarray_obj_t *results = NULL;
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    if(lhs->dtype == NDARRAY_UINT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, uint8_t, int8_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, uint8_t, int16_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, +);
+        }
+    } else if(lhs->dtype == NDARRAY_INT8) {
+        if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT8);
+            BINARY_LOOP(results, int8_t, int8_t, int8_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int8_t, uint16_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int8_t, int16_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, +);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_ADD, rhs, lhs);
+        }
+    } else if(lhs->dtype == NDARRAY_UINT16) {
+        if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint16_t, int16_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, +);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_ADD, rhs, lhs);
+        }
+    } else if(lhs->dtype == NDARRAY_INT16) {
+        if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int16_t, int16_t, larray, lstrides, rarray, rstrides, +);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, +);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_ADD, rhs, lhs);
+        }
+    } else if(lhs->dtype == NDARRAY_FLOAT) {
+        if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, +);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_ADD, rhs, lhs);
+        }
+    }
+
+    return MP_OBJ_FROM_PTR(results);
+}
+#endif /* NDARRAY_HAS_BINARY_OP_ADD */
+
+#if NDARRAY_HAS_BINARY_OP_MULTIPLY
+mp_obj_t ndarray_binary_multiply(ndarray_obj_t *lhs, ndarray_obj_t *rhs,
+                                            uint8_t ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
+
+    ndarray_obj_t *results = NULL;
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    if(lhs->dtype == NDARRAY_UINT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, uint8_t, int8_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, uint8_t, int16_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, *);
+        }
+    } else if(lhs->dtype == NDARRAY_INT8) {
+        if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT8);
+            BINARY_LOOP(results, int8_t, int8_t, int8_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int8_t, uint16_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int8_t, int16_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, *);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_MULTIPLY, rhs, lhs);
+        }
+    } else if(lhs->dtype == NDARRAY_UINT16) {
+        if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint16_t, int16_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, *);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_MULTIPLY, rhs, lhs);
+        }
+    } else if(lhs->dtype == NDARRAY_INT16) {
+        if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int16_t, int16_t, larray, lstrides, rarray, rstrides, *);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, *);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_MULTIPLY, rhs, lhs);
+        }
+    } else if(lhs->dtype == NDARRAY_FLOAT) {
+        if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, *);
+        } else {
+            return ndarray_binary_op(MP_BINARY_OP_MULTIPLY, rhs, lhs);
+        }
+    }
+
+    return MP_OBJ_FROM_PTR(results);
+}
+#endif /* NDARRAY_HAS_BINARY_OP_MULTIPLY */
+
+#if NDARRAY_HAS_BINARY_OP_MORE | NDARRAY_HAS_BINARY_OP_MORE_EQUAL | NDARRAY_HAS_BINARY_OP_LESS | NDARRAY_HAS_BINARY_OP_LESS_EQUAL
+mp_obj_t ndarray_binary_more(ndarray_obj_t *lhs, ndarray_obj_t *rhs,
+                                            uint8_t ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides, mp_binary_op_t op) {
+
+    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT8);
+    results->boolean = 1;
+    uint8_t *array = (uint8_t *)results->array;
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    #if NDARRAY_HAS_BINARY_OP_MORE | NDARRAY_HAS_BINARY_OP_LESS
+    if(op == MP_BINARY_OP_MORE) {
+        if(lhs->dtype == NDARRAY_UINT8) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, uint8_t, int8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint8_t, int16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, >);
+            }
+        } else if(lhs->dtype == NDARRAY_INT8) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, int8_t, uint8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, int8_t, int8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, int8_t, uint16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int8_t, int16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, >);
+            }
+        } else if(lhs->dtype == NDARRAY_UINT16) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, uint16_t, uint8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, uint16_t, int8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint16_t, int16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, >);
+            }
+        } else if(lhs->dtype == NDARRAY_INT16) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, int16_t, uint8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, int16_t, int8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, int16_t, uint16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int16_t, int16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, >);
+            }
+        } else if(lhs->dtype == NDARRAY_FLOAT) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, mp_float_t, uint8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, mp_float_t, int8_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, mp_float_t, uint16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, mp_float_t, int16_t, larray, lstrides, rarray, rstrides, >);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, >);
+            }
+        }
+    }
+    #endif /* NDARRAY_HAS_BINARY_OP_MORE | NDARRAY_HAS_BINARY_OP_LESS */
+    #if NDARRAY_HAS_BINARY_OP_MORE_EQUAL | NDARRAY_HAS_BINARY_OP_LESS_EQUAL
+    if(op == MP_BINARY_OP_MORE_EQUAL) {
+        if(lhs->dtype == NDARRAY_UINT8) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, uint8_t, int8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint8_t, int16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, >=);
+            }
+        } else if(lhs->dtype == NDARRAY_INT8) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, int8_t, uint8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, int8_t, int8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, int8_t, uint16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int8_t, int16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, >=);
+            }
+        } else if(lhs->dtype == NDARRAY_UINT16) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, uint16_t, uint8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, uint16_t, int8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, uint16_t, int16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, >=);
+            }
+        } else if(lhs->dtype == NDARRAY_INT16) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, int16_t, uint8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, int16_t, int8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, int16_t, uint16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, int16_t, int16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, >=);
+            }
+        } else if(lhs->dtype == NDARRAY_FLOAT) {
+            if(rhs->dtype == NDARRAY_UINT8) {
+                EQUALITY_LOOP(results, array, mp_float_t, uint8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT8) {
+                EQUALITY_LOOP(results, array, mp_float_t, int8_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_UINT16) {
+                EQUALITY_LOOP(results, array, mp_float_t, uint16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_INT16) {
+                EQUALITY_LOOP(results, array, mp_float_t, int16_t, larray, lstrides, rarray, rstrides, >=);
+            } else if(rhs->dtype == NDARRAY_FLOAT) {
+                EQUALITY_LOOP(results, array, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, >=);
+            }
+        }
+    }
+    #endif /* NDARRAY_HAS_BINARY_OP_MORE_EQUAL | NDARRAY_HAS_BINARY_OP_LESS_EQUAL */
+
+    return MP_OBJ_FROM_PTR(results);
+}
+#endif /* NDARRAY_HAS_BINARY_OP_MORE | NDARRAY_HAS_BINARY_OP_MORE_EQUAL | NDARRAY_HAS_BINARY_OP_LESS | NDARRAY_HAS_BINARY_OP_LESS_EQUAL */
+
+#if NDARRAY_HAS_BINARY_OP_SUBTRACT
+mp_obj_t ndarray_binary_subtract(ndarray_obj_t *lhs, ndarray_obj_t *rhs,
+                                            uint8_t ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
+
+    ndarray_obj_t *results = NULL;
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    if(lhs->dtype == NDARRAY_UINT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT8);
+            BINARY_LOOP(results, uint8_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, uint8_t, int8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, uint8_t, int16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, -);
+        }
+    } else if(lhs->dtype == NDARRAY_INT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int8_t, uint8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT8);
+            BINARY_LOOP(results, int8_t, int8_t, int8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int8_t, uint16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int8_t, int16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, -);
+        }
+    } else if(lhs->dtype == NDARRAY_UINT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint16_t, uint8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint16_t, int8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_UINT16);
+            BINARY_LOOP(results, uint16_t, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint16_t, int16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, -);
+        }
+    } else if(lhs->dtype == NDARRAY_INT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int16_t, uint8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int16_t, int8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, int16_t, uint16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_INT16);
+            BINARY_LOOP(results, int16_t, int16_t, int16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, -);
+        }
+    } else if(lhs->dtype == NDARRAY_FLOAT) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, mp_float_t, uint8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, mp_float_t, int8_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, mp_float_t, uint16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, mp_float_t, int16_t, larray, lstrides, rarray, rstrides, -);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+            BINARY_LOOP(results, mp_float_t, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, -);
+        }
+    }
+
+    return MP_OBJ_FROM_PTR(results);
+}
+#endif /* NDARRAY_HAS_BINARY_OP_SUBTRACT */
+
+#if NDARRAY_HAS_BINARY_OP_TRUE_DIVIDE
+mp_obj_t ndarray_binary_true_divide(ndarray_obj_t *lhs, ndarray_obj_t *rhs,
+                                            uint8_t ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
+
+    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    #if NDARRAY_BINARY_USES_FUN_POINTER
+    mp_float_t (*get_lhs)(void *) = ndarray_get_float_function(lhs->dtype);
+    mp_float_t (*get_rhs)(void *) = ndarray_get_float_function(rhs->dtype);
+
+    uint8_t *array = (uint8_t *)results->array;
+    void (*set_result)(void *, mp_float_t ) = ndarray_set_float_function(NDARRAY_FLOAT);
+
+    // Note that lvalue and rvalue are local variables in the macro itself
+    FUNC_POINTER_LOOP(results, array, get_lhs, get_rhs, larray, lstrides, rarray, rstrides, lvalue/rvalue);
+
+    #else
+    if(lhs->dtype == NDARRAY_UINT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            BINARY_LOOP(results, mp_float_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            BINARY_LOOP(results, mp_float_t, uint8_t, int8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            BINARY_LOOP(results, mp_float_t, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            BINARY_LOOP(results, mp_float_t, uint8_t, int16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            BINARY_LOOP(results, mp_float_t, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, /);
+        }
+    } else if(lhs->dtype == NDARRAY_INT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            BINARY_LOOP(results, mp_float_t, int8_t, uint8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            BINARY_LOOP(results, mp_float_t, int8_t, int8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            BINARY_LOOP(results, mp_float_t, int8_t, uint16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            BINARY_LOOP(results, mp_float_t, int8_t, int16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            BINARY_LOOP(results, mp_float_t, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, /);
+        }
+    } else if(lhs->dtype == NDARRAY_UINT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            BINARY_LOOP(results, mp_float_t, uint16_t, uint8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            BINARY_LOOP(results, mp_float_t, uint16_t, int8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            BINARY_LOOP(results, mp_float_t, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            BINARY_LOOP(results, mp_float_t, uint16_t, int16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            BINARY_LOOP(results, mp_float_t, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, /);
+        }
+    } else if(lhs->dtype == NDARRAY_INT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            BINARY_LOOP(results, mp_float_t, int16_t, uint8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            BINARY_LOOP(results, mp_float_t, int16_t, int8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            BINARY_LOOP(results, mp_float_t, int16_t, uint16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            BINARY_LOOP(results, mp_float_t, int16_t, int16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            BINARY_LOOP(results, mp_float_t, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, /);
+        }
+    } else if(lhs->dtype == NDARRAY_FLOAT) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            BINARY_LOOP(results, mp_float_t, mp_float_t, uint8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            BINARY_LOOP(results, mp_float_t, mp_float_t, int8_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            BINARY_LOOP(results, mp_float_t, mp_float_t, uint16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            BINARY_LOOP(results, mp_float_t, mp_float_t, int16_t, larray, lstrides, rarray, rstrides, /);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            BINARY_LOOP(results, mp_float_t, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, /);
+        }
+    }
+    #endif /* NDARRAY_BINARY_USES_FUN_POINTER */
+
+    return MP_OBJ_FROM_PTR(results);
+}
+#endif /* NDARRAY_HAS_BINARY_OP_TRUE_DIVIDE */
+
+#if NDARRAY_HAS_BINARY_OP_POWER
+mp_obj_t ndarray_binary_power(ndarray_obj_t *lhs, ndarray_obj_t *rhs,
+                                            uint8_t ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
+
+    // Note that numpy upcasts the result to int64 if both inputs are of integer type,
+    // whereas we always return a float array.
+    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    #if NDARRAY_BINARY_USES_FUN_POINTER
+    mp_float_t (*get_lhs)(void *) = ndarray_get_float_function(lhs->dtype);
+    mp_float_t (*get_rhs)(void *) = ndarray_get_float_function(rhs->dtype);
+
+    uint8_t *array = (uint8_t *)results->array;
+    void (*set_result)(void *, mp_float_t ) = ndarray_set_float_function(NDARRAY_FLOAT);
+
+    // Note that lvalue and rvalue are local variables in the macro itself
+    FUNC_POINTER_LOOP(results, array, get_lhs, get_rhs, larray, lstrides, rarray, rstrides, MICROPY_FLOAT_C_FUN(pow)(lvalue, rvalue));
+
+    #else
+    if(lhs->dtype == NDARRAY_UINT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            POWER_LOOP(results, mp_float_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            POWER_LOOP(results, mp_float_t, uint8_t, int8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            POWER_LOOP(results, mp_float_t, uint8_t, uint16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            POWER_LOOP(results, mp_float_t, uint8_t, int16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            POWER_LOOP(results, mp_float_t, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides);
+        }
+    } else if(lhs->dtype == NDARRAY_INT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            POWER_LOOP(results, mp_float_t, int8_t, uint8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            POWER_LOOP(results, mp_float_t, int8_t, int8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            POWER_LOOP(results, mp_float_t, int8_t, uint16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            POWER_LOOP(results, mp_float_t, int8_t, int16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            POWER_LOOP(results, mp_float_t, int8_t, mp_float_t, larray, lstrides, rarray, rstrides);
+        }
+    } else if(lhs->dtype == NDARRAY_UINT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            POWER_LOOP(results, mp_float_t, uint16_t, uint8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            POWER_LOOP(results, mp_float_t, uint16_t, int8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            POWER_LOOP(results, mp_float_t, uint16_t, uint16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            POWER_LOOP(results, mp_float_t, uint16_t, int16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            POWER_LOOP(results, mp_float_t, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides);
+        }
+    } else if(lhs->dtype == NDARRAY_INT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            POWER_LOOP(results, mp_float_t, int16_t, uint8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            POWER_LOOP(results, mp_float_t, int16_t, int8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            POWER_LOOP(results, mp_float_t, int16_t, uint16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            POWER_LOOP(results, mp_float_t, int16_t, int16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            POWER_LOOP(results, mp_float_t, int16_t, mp_float_t, larray, lstrides, rarray, rstrides);
+        }
+    } else if(lhs->dtype == NDARRAY_FLOAT) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            POWER_LOOP(results, mp_float_t, mp_float_t, uint8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            POWER_LOOP(results, mp_float_t, mp_float_t, int8_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            POWER_LOOP(results, mp_float_t, mp_float_t, uint16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            POWER_LOOP(results, mp_float_t, mp_float_t, int16_t, larray, lstrides, rarray, rstrides);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            POWER_LOOP(results, mp_float_t, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides);
+        }
+    }
+    #endif /* NDARRAY_BINARY_USES_FUN_POINTER */
+
+    return MP_OBJ_FROM_PTR(results);
+}
+#endif /* NDARRAY_HAS_BINARY_OP_POWER */
+
+#if NDARRAY_HAS_INPLACE_ADD || NDARRAY_HAS_INPLACE_MULTIPLY || NDARRAY_HAS_INPLACE_SUBTRACT
+mp_obj_t ndarray_inplace_ams(ndarray_obj_t *lhs, ndarray_obj_t *rhs, int32_t *rstrides, uint8_t optype) {
+
+    if((lhs->dtype != NDARRAY_FLOAT) && (rhs->dtype == NDARRAY_FLOAT)) {
+        mp_raise_TypeError(translate("cannot cast output with casting rule"));
+    }
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    #if NDARRAY_HAS_INPLACE_ADD
+    if(optype == MP_BINARY_OP_INPLACE_ADD) {
+        UNWRAP_INPLACE_OPERATOR(lhs, larray, rarray, rstrides, +=);
+    }
+    #endif
+    #if NDARRAY_HAS_INPLACE_MULTIPLY
+    if(optype == MP_BINARY_OP_INPLACE_MULTIPLY) {
+        UNWRAP_INPLACE_OPERATOR(lhs, larray, rarray, rstrides, *=);
+    }
+    #endif
+    #if NDARRAY_HAS_INPLACE_SUBTRACT
+    if(optype == MP_BINARY_OP_INPLACE_SUBTRACT) {
+        UNWRAP_INPLACE_OPERATOR(lhs, larray, rarray, rstrides, -=);
+    }
+    #endif
+
+    return MP_OBJ_FROM_PTR(lhs);
+}
+#endif /* NDARRAY_HAS_INPLACE_ADD || NDARRAY_HAS_INPLACE_MULTIPLY || NDARRAY_HAS_INPLACE_SUBTRACT */
+
+#if NDARRAY_HAS_INPLACE_TRUE_DIVIDE
+mp_obj_t ndarray_inplace_divide(ndarray_obj_t *lhs, ndarray_obj_t *rhs, int32_t *rstrides) {
+
+    if((lhs->dtype != NDARRAY_FLOAT)) {
+        mp_raise_TypeError(translate("results cannot be cast to specified type"));
+    }
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    if(rhs->dtype == NDARRAY_UINT8) {
+        INPLACE_LOOP(lhs, mp_float_t, uint8_t, larray, rarray, rstrides, /=);
+    } else if(rhs->dtype == NDARRAY_INT8) {
+        INPLACE_LOOP(lhs, mp_float_t, int8_t, larray, rarray, rstrides, /=);
+    } else if(rhs->dtype == NDARRAY_UINT16) {
+        INPLACE_LOOP(lhs, mp_float_t, uint16_t, larray, rarray, rstrides, /=);
+    } else if(rhs->dtype == NDARRAY_INT16) {
+        INPLACE_LOOP(lhs, mp_float_t, int16_t, larray, rarray, rstrides, /=);
+    } else if(rhs->dtype == NDARRAY_FLOAT) {
+        INPLACE_LOOP(lhs, mp_float_t, mp_float_t, larray, rarray, rstrides, /=);
+    }
+    return MP_OBJ_FROM_PTR(lhs);
+}
+#endif /* NDARRAY_HAS_INPLACE_TRUE_DIVIDE */
+
+#if NDARRAY_HAS_INPLACE_POWER
+mp_obj_t ndarray_inplace_power(ndarray_obj_t *lhs, ndarray_obj_t *rhs, int32_t *rstrides) {
+
+    if((lhs->dtype != NDARRAY_FLOAT)) {
+        mp_raise_TypeError(translate("results cannot be cast to specified type"));
+    }
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    if(rhs->dtype == NDARRAY_UINT8) {
+        INPLACE_POWER(lhs, mp_float_t, uint8_t, larray, rarray, rstrides);
+    } else if(rhs->dtype == NDARRAY_INT8) {
+        INPLACE_POWER(lhs, mp_float_t, int8_t, larray, rarray, rstrides);
+    } else if(rhs->dtype == NDARRAY_UINT16) {
+        INPLACE_POWER(lhs, mp_float_t, uint16_t, larray, rarray, rstrides);
+    } else if(rhs->dtype == NDARRAY_INT16) {
+        INPLACE_POWER(lhs, mp_float_t, int16_t, larray, rarray, rstrides);
+    } else if(rhs->dtype == NDARRAY_FLOAT) {
+        INPLACE_POWER(lhs, mp_float_t, mp_float_t, larray, rarray, rstrides);
+    }
+    return MP_OBJ_FROM_PTR(lhs);
+}
+#endif /* NDARRAY_HAS_INPLACE_POWER */
--- a/code/ndarray_operators.h
+++ b/code/ndarray_operators.h
@ -0,0 +1,277 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+*/
+
+#include "ndarray.h"
+
+mp_obj_t ndarray_binary_equality(ndarray_obj_t *, ndarray_obj_t *, uint8_t , size_t *,  int32_t *, int32_t *, mp_binary_op_t );
+mp_obj_t ndarray_binary_add(ndarray_obj_t *, ndarray_obj_t *, uint8_t , size_t *, int32_t *, int32_t *);
+mp_obj_t ndarray_binary_multiply(ndarray_obj_t *, ndarray_obj_t *, uint8_t , size_t *, int32_t *, int32_t *);
+mp_obj_t ndarray_binary_more(ndarray_obj_t *, ndarray_obj_t *, uint8_t , size_t *, int32_t *, int32_t *, mp_binary_op_t );
+mp_obj_t ndarray_binary_power(ndarray_obj_t *, ndarray_obj_t *, uint8_t , size_t *, int32_t *, int32_t *);
+mp_obj_t ndarray_binary_subtract(ndarray_obj_t *, ndarray_obj_t *, uint8_t , size_t *, int32_t *, int32_t *);
+mp_obj_t ndarray_binary_true_divide(ndarray_obj_t *, ndarray_obj_t *, uint8_t , size_t *, int32_t *, int32_t *);
+
+mp_obj_t ndarray_inplace_ams(ndarray_obj_t *, ndarray_obj_t *, int32_t *, uint8_t );
+mp_obj_t ndarray_inplace_power(ndarray_obj_t *, ndarray_obj_t *, int32_t *);
+mp_obj_t ndarray_inplace_divide(ndarray_obj_t *, ndarray_obj_t *, int32_t *);
+
+#define UNWRAP_INPLACE_OPERATOR(lhs, larray, rarray, rstrides, OPERATOR)\
+({\
+    if((lhs)->dtype == NDARRAY_UINT8) {\
+        if((rhs)->dtype == NDARRAY_UINT8) {\
+            INPLACE_LOOP((lhs), uint8_t, uint8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_INT8) {\
+            INPLACE_LOOP((lhs), uint8_t, int8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_UINT16) {\
+            INPLACE_LOOP((lhs), uint8_t, uint16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else {\
+            INPLACE_LOOP((lhs), uint8_t, int16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        }\
+    } else if((lhs)->dtype == NDARRAY_INT8) {\
+        if(rhs->dtype == NDARRAY_UINT8) {\
+            INPLACE_LOOP((lhs), int8_t, uint8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_INT8) {\
+            INPLACE_LOOP((lhs), int8_t, int8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_UINT16) {\
+            INPLACE_LOOP((lhs), int8_t, uint16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else {\
+            INPLACE_LOOP((lhs), int8_t, int16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        }\
+    } else if((lhs)->dtype == NDARRAY_UINT16) {\
+        if(rhs->dtype == NDARRAY_UINT8) {\
+            INPLACE_LOOP((lhs), uint16_t, uint8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_INT8) {\
+            INPLACE_LOOP((lhs), uint16_t, int8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_UINT16) {\
+            INPLACE_LOOP((lhs), uint16_t, uint16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else {\
+            INPLACE_LOOP((lhs), uint16_t, int16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        }\
+    } else if((lhs)->dtype == NDARRAY_INT16) {\
+        if(rhs->dtype == NDARRAY_UINT8) {\
+            INPLACE_LOOP((lhs), int16_t, uint8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_INT8) {\
+            INPLACE_LOOP((lhs), int16_t, int8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_UINT16) {\
+            INPLACE_LOOP((lhs), int16_t, uint16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else {\
+            INPLACE_LOOP((lhs), int16_t, int16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        }\
+    } else if((lhs)->dtype == NDARRAY_FLOAT) {\
+        if(rhs->dtype == NDARRAY_UINT8) {\
+            INPLACE_LOOP((lhs), mp_float_t, uint8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_INT8) {\
+            INPLACE_LOOP((lhs), mp_float_t, int8_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_UINT16) {\
+            INPLACE_LOOP((lhs), mp_float_t, uint16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else if(rhs->dtype == NDARRAY_INT16) {\
+            INPLACE_LOOP((lhs), mp_float_t, int16_t, (larray), (rarray), (rstrides), OPERATOR);\
+        } else {\
+            INPLACE_LOOP((lhs), mp_float_t, mp_float_t, (larray), (rarray), (rstrides), OPERATOR);\
+        }\
+    }\
+})
+
+#if ULAB_MAX_DIMS == 1
+#define INPLACE_POWER(results, type_left, type_right, larray, rarray, rstrides)\
+({  size_t l = 0;\
+    do {\
+        *((type_left *)(larray)) = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+})
+
+#define FUNC_POINTER_LOOP(results, array, get_lhs, get_rhs, larray, lstrides, rarray, rstrides, OPERATION)\
+({  size_t l = 0;\
+    do {\
+        mp_float_t lvalue = (get_lhs)((larray));\
+        mp_float_t rvalue = (get_rhs)((rarray));\
+        (set_result)((array), OPERATION);\
+        (array) += (results)->itemsize;\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+})
+#endif /* ULAB_MAX_DIMS == 1 */
+
+#if ULAB_MAX_DIMS == 2
+#define INPLACE_POWER(results, type_left, type_right, larray, rarray, rstrides)\
+({  size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *((type_left *)(larray)) = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+            (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+})
+
+#define FUNC_POINTER_LOOP(results, array, get_lhs, get_rhs, larray, lstrides, rarray, rstrides, OPERATION)\
+({  size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            mp_float_t lvalue = (get_lhs)((larray));\
+            mp_float_t rvalue = (get_rhs)((rarray));\
+            (set_result)((array), OPERATION);\
+            (array) += (results)->itemsize;\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+})
+#endif /* ULAB_MAX_DIMS == 2 */
+
+#if ULAB_MAX_DIMS == 3
+#define INPLACE_POWER(results, type_left, type_right, larray, rarray, rstrides)\
+({  size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *((type_left *)(larray)) = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+                (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (results)->strides[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+})
+
+
+#define FUNC_POINTER_LOOP(results, array, get_lhs, get_rhs, larray, lstrides, rarray, rstrides, OPERATION)\
+({  size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                mp_float_t lvalue = (get_lhs)((larray));\
+                mp_float_t rvalue = (get_rhs)((rarray));\
+                (set_result)((array), OPERATION);\
+                (array) += (results)->itemsize;\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+})
+#endif /* ULAB_MAX_DIMS == 3 */
+
+#if ULAB_MAX_DIMS == 4
+#define INPLACE_POWER(results, type_left, type_right, larray, rarray, rstrides)\
+({  size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *((type_left *)(larray)) = MICROPY_FLOAT_C_FUN(pow)(*((type_left *)(larray)), *((type_right *)(rarray)));\
+                    (larray) += (results)->strides[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (results)->strides[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (results)->strides[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (results)->strides[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (results)->strides[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (results)->shape[ULAB_MAX_DIMS - 4]);\
+})
+
+#define FUNC_POINTER_LOOP(results, array, get_lhs, get_rhs, larray, lstrides, rarray, rstrides, OPERATION)\
+({  size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    mp_float_t lvalue = (get_lhs)((larray));\
+                    mp_float_t rvalue = (get_rhs)((rarray));\
+                    (set_result)((array), OPERATION);\
+                    (array) += (results)->itemsize;\
+                    (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (results)->shape[ULAB_MAX_DIMS - 4]);\
+})
+#endif /* ULAB_MAX_DIMS == 4 */
--- a/code/ndarray_properties.h
+++ b/code/ndarray_properties.h
@ -18,45 +18,72 @@
 #include "py/obj.h"
 #include "py/objarray.h"

+#include "ulab.h"
 #include "ndarray.h"

+#if CIRCUITPY
 typedef struct _mp_obj_property_t {
    mp_obj_base_t base;
    mp_obj_t proxy[3]; // getter, setter, deleter
 } mp_obj_property_t;

-/* v923z: it is not at all clear to me, why this must be declared; it should already be in obj.h */
-typedef struct _mp_obj_none_t {
-    mp_obj_base_t base;
-} mp_obj_none_t;
+#if NDARRAY_HAS_DTYPE
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_get_dtype_obj, ndarray_dtype);
+STATIC const mp_obj_property_t ndarray_dtype_obj = {
+    .base.type = &mp_type_property,
+    .proxy = {(mp_obj_t)&ndarray_get_dtype_obj,
+              mp_const_none,
+              mp_const_none },
+};
+#endif /* NDARRAY_HAS_DTYPE */

-const mp_obj_type_t mp_type_NoneType;
-const mp_obj_none_t mp_const_none_obj = {{&mp_type_NoneType}};
-
-MP_DEFINE_CONST_FUN_OBJ_1(ndarray_get_shape_obj, ndarray_shape);
-MP_DEFINE_CONST_FUN_OBJ_1(ndarray_get_size_obj, ndarray_size);
+#if NDARRAY_HAS_ITEMSIZE
 MP_DEFINE_CONST_FUN_OBJ_1(ndarray_get_itemsize_obj, ndarray_itemsize);
-MP_DEFINE_CONST_FUN_OBJ_KW(ndarray_flatten_obj, 1, ndarray_flatten);
-
-STATIC const mp_obj_property_t ndarray_shape_obj = {
-    .base.type = &mp_type_property,
-    .proxy = {(mp_obj_t)&ndarray_get_shape_obj,
-              (mp_obj_t)&mp_const_none_obj,
-              (mp_obj_t)&mp_const_none_obj},
-};
-
-STATIC const mp_obj_property_t ndarray_size_obj = {
-    .base.type = &mp_type_property,
-    .proxy = {(mp_obj_t)&ndarray_get_size_obj,
-              (mp_obj_t)&mp_const_none_obj,
-              (mp_obj_t)&mp_const_none_obj},
-};
-
 STATIC const mp_obj_property_t ndarray_itemsize_obj = {
    .base.type = &mp_type_property,
    .proxy = {(mp_obj_t)&ndarray_get_itemsize_obj,
-              (mp_obj_t)&mp_const_none_obj,
-              (mp_obj_t)&mp_const_none_obj},
+              mp_const_none,
+              mp_const_none },
 };
+#endif /* NDARRAY_HAS_ITEMSIZE */

+#if NDARRAY_HAS_SHAPE
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_get_shape_obj, ndarray_shape);
+STATIC const mp_obj_property_t ndarray_shape_obj = {
+    .base.type = &mp_type_property,
+    .proxy = {(mp_obj_t)&ndarray_get_shape_obj,
+              mp_const_none,
+              mp_const_none },
+};
+#endif /* NDARRAY_HAS_SHAPE */
+
+#if NDARRAY_HAS_SIZE
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_get_size_obj, ndarray_size);
+STATIC const mp_obj_property_t ndarray_size_obj = {
+    .base.type = &mp_type_property,
+    .proxy = {(mp_obj_t)&ndarray_get_size_obj,
+              mp_const_none,
+              mp_const_none },
+};
+#endif /* NDARRAY_HAS_SIZE */
+
+#if NDARRAY_HAS_STRIDES
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_get_strides_obj, ndarray_strides);
+STATIC const mp_obj_property_t ndarray_strides_obj = {
+    .base.type = &mp_type_property,
+    .proxy = {(mp_obj_t)&ndarray_get_strides_obj,
+              mp_const_none,
+              mp_const_none },
+};
+#endif /* NDARRAY_HAS_STRIDES */
+
+#else
+
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_dtype_obj, ndarray_dtype);
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_itemsize_obj, ndarray_itemsize);
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_shape_obj, ndarray_shape);
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_size_obj, ndarray_size);
+MP_DEFINE_CONST_FUN_OBJ_1(ndarray_strides_obj, ndarray_strides);
+
+#endif /* CIRCUITPY */
 #endif
--- a/code/numerical.c
+++ b/code/numerical.c
@ -1,758 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#include <math.h>
-#include <stdlib.h>
-#include <string.h>
-#include "py/obj.h"
-#include "py/objint.h"
-#include "py/runtime.h"
-#include "py/builtin.h"
-#include "py/misc.h"
-#include "numerical.h"
-
-#if ULAB_NUMERICAL_MODULE
-
-enum NUMERICAL_FUNCTION_TYPE {
-    NUMERICAL_MIN,
-    NUMERICAL_MAX,
-    NUMERICAL_ARGMIN,
-    NUMERICAL_ARGMAX,
-    NUMERICAL_SUM,
-    NUMERICAL_MEAN,
-    NUMERICAL_STD,
-};
-
-mp_obj_t numerical_linspace(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_num, MP_ARG_INT, {.u_int = 50} },
-        { MP_QSTR_endpoint, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_true} },
-        { MP_QSTR_retstep, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_false} },
-        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = NDARRAY_FLOAT} },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(2, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-
-    uint16_t len = args[2].u_int;
-    if(len < 2) {
-        mp_raise_ValueError(translate("number of points must be at least 2"));
-    }
-    mp_float_t value, step;
-    value = mp_obj_get_float(args[0].u_obj);
-    uint8_t typecode = args[5].u_int;
-    if(args[3].u_obj == mp_const_true) step = (mp_obj_get_float(args[1].u_obj)-value)/(len-1);
-    else step = (mp_obj_get_float(args[1].u_obj)-value)/len;
-    ndarray_obj_t *ndarray = create_new_ndarray(1, len, typecode);
-    if(typecode == NDARRAY_UINT8) {
-        uint8_t *array = (uint8_t *)ndarray->array->items;
-        for(size_t i=0; i < len; i++, value += step) array[i] = (uint8_t)value;
-    } else if(typecode == NDARRAY_INT8) {
-        int8_t *array = (int8_t *)ndarray->array->items;
-        for(size_t i=0; i < len; i++, value += step) array[i] = (int8_t)value;
-    } else if(typecode == NDARRAY_UINT16) {
-        uint16_t *array = (uint16_t *)ndarray->array->items;
-        for(size_t i=0; i < len; i++, value += step) array[i] = (uint16_t)value;
-    } else if(typecode == NDARRAY_INT16) {
-        int16_t *array = (int16_t *)ndarray->array->items;
-        for(size_t i=0; i < len; i++, value += step) array[i] = (int16_t)value;
-    } else {
-        mp_float_t *array = (mp_float_t *)ndarray->array->items;
-        for(size_t i=0; i < len; i++, value += step) array[i] = value;
-    }
-    if(args[4].u_obj == mp_const_false) {
-        return MP_OBJ_FROM_PTR(ndarray);
-    } else {
-        mp_obj_t tuple[2];
-        tuple[0] = ndarray;
-        tuple[1] = mp_obj_new_float(step);
-        return mp_obj_new_tuple(2, tuple);
-    }
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_linspace_obj, 2, numerical_linspace);
-
-void axis_sorter(ndarray_obj_t *ndarray, mp_obj_t axis, size_t *m, size_t *n, size_t *N, 
-                 size_t *increment, size_t *len, size_t *start_inc) {
-    if(axis == mp_const_none) { // flatten the array
-        *m = 1;
-        *n = 1;
-        *len = ndarray->array->len;
-        *N = 1;
-        *increment = 1;
-        *start_inc = ndarray->array->len;
-    } else if((mp_obj_get_int(axis) == 1)) { // along the horizontal axis
-        *m = ndarray->m;
-        *n = 1;
-        *len = ndarray->n;
-        *N = ndarray->m;
-        *increment = 1;
-        *start_inc = ndarray->n;
-    } else { // along vertical axis
-        *m = 1;
-        *n = ndarray->n;
-        *len = ndarray->m;
-        *N = ndarray->n;
-        *increment = ndarray->n;
-        *start_inc = 1;
-    }    
-}
-
-mp_obj_t numerical_sum_mean_std_iterable(mp_obj_t oin, uint8_t optype, size_t ddof) {
-    mp_float_t value, sum = 0.0, sq_sum = 0.0;
-    mp_obj_iter_buf_t iter_buf;
-    mp_obj_t item, iterable = mp_getiter(oin, &iter_buf);
-    mp_int_t len = mp_obj_get_int(mp_obj_len(oin));
-    while ((item = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION) {
-        value = mp_obj_get_float(item);
-        sum += value;
-    }
-    if(optype ==  NUMERICAL_SUM) {
-        return mp_obj_new_float(sum);
-    } else if(optype == NUMERICAL_MEAN) {
-        return mp_obj_new_float(sum/len);
-    } else { // this should be the case of the standard deviation
-        // TODO: note that we could get away with a single pass, if we used the Weldorf algorithm
-        // That should save a fair amount of time, because we would have to extract the values only once
-        iterable = mp_getiter(oin, &iter_buf);
-        sum /= len; // this is now the mean!
-        while ((item = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION) {
-            value = mp_obj_get_float(item) - sum;
-            sq_sum += value * value;
-        }
-        return mp_obj_new_float(MICROPY_FLOAT_C_FUN(sqrt)(sq_sum/(len-ddof)));
-    }
-}
-
-STATIC mp_obj_t numerical_sum_mean_ndarray(ndarray_obj_t *ndarray, mp_obj_t axis, uint8_t optype) {
-    size_t m, n, increment, start, start_inc, N, len; 
-    axis_sorter(ndarray, axis, &m, &n, &N, &increment, &len, &start_inc);
-    ndarray_obj_t *results = create_new_ndarray(m, n, NDARRAY_FLOAT);
-    mp_float_t sum, sq_sum;
-    mp_float_t *farray = (mp_float_t *)results->array->items;
-    for(size_t j=0; j < N; j++) { // result index
-        start = j * start_inc;
-        sum = sq_sum = 0.0;
-        if(ndarray->array->typecode == NDARRAY_UINT8) {
-            RUN_SUM(ndarray, uint8_t, optype, len, start, increment);
-        } else if(ndarray->array->typecode == NDARRAY_INT8) {
-            RUN_SUM(ndarray, int8_t, optype, len, start, increment);
-        } else if(ndarray->array->typecode == NDARRAY_UINT16) {
-            RUN_SUM(ndarray, uint16_t, optype, len, start, increment);
-        } else if(ndarray->array->typecode == NDARRAY_INT16) {
-            RUN_SUM(ndarray, int16_t, optype, len, start, increment);
-        } else { // this will be mp_float_t, no need to check
-            RUN_SUM(ndarray, mp_float_t, optype, len, start, increment);
-        }
-        if(optype == NUMERICAL_SUM) {
-            farray[j] = sum;
-        } else { // this is the case of the mean
-            farray[j] = sum / len;
-        }
-    }
-    if(results->array->len == 1) {
-        return mp_obj_new_float(farray[0]);
-    }
-    return MP_OBJ_FROM_PTR(results);
-}
-
-mp_obj_t numerical_std_ndarray(ndarray_obj_t *ndarray, mp_obj_t axis, size_t ddof) {
-    size_t m, n, increment, start, start_inc, N, len; 
-    mp_float_t sum, sum_sq;
-    
-    axis_sorter(ndarray, axis, &m, &n, &N, &increment, &len, &start_inc);
-    if(ddof > len) {
-        mp_raise_ValueError(translate("ddof must be smaller than length of data set"));
-    }
-    ndarray_obj_t *results = create_new_ndarray(m, n, NDARRAY_FLOAT);
-    mp_float_t *farray = (mp_float_t *)results->array->items;
-    for(size_t j=0; j < N; j++) { // result index
-        start = j * start_inc;
-        sum = 0.0;
-        sum_sq = 0.0;
-        if(ndarray->array->typecode == NDARRAY_UINT8) {
-            RUN_STD(ndarray, uint8_t, len, start, increment);
-        } else if(ndarray->array->typecode == NDARRAY_INT8) {
-            RUN_STD(ndarray, int8_t, len, start, increment);
-        } else if(ndarray->array->typecode == NDARRAY_UINT16) {
-            RUN_STD(ndarray, uint16_t, len, start, increment);
-        } else if(ndarray->array->typecode == NDARRAY_INT16) {
-            RUN_STD(ndarray, int16_t, len, start, increment);
-        } else { // this will be mp_float_t, no need to check
-            RUN_STD(ndarray, mp_float_t, len, start, increment);
-        }
-        farray[j] = MICROPY_FLOAT_C_FUN(sqrt)(sum_sq/(len - ddof));
-    }
-    if(results->array->len == 1) {
-        return mp_obj_new_float(farray[0]);
-    }
-    return MP_OBJ_FROM_PTR(results);
-}
-
-mp_obj_t numerical_argmin_argmax_iterable(mp_obj_t oin, mp_obj_t axis, uint8_t optype) {
-    size_t idx = 0, best_idx = 0;
-    mp_obj_iter_buf_t iter_buf;
-    mp_obj_t iterable = mp_getiter(oin, &iter_buf);
-    mp_obj_t best_obj = MP_OBJ_NULL;
-    mp_obj_t item;
-    mp_uint_t op = MP_BINARY_OP_LESS;
-    if((optype == NUMERICAL_ARGMAX) || (optype == NUMERICAL_MAX)) op = MP_BINARY_OP_MORE;
-    while ((item = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION) {
-        if ((best_obj == MP_OBJ_NULL) || (mp_binary_op(op, item, best_obj) == mp_const_true)) {
-            best_obj = item;
-            best_idx = idx;
-        }
-        idx++;
-    }
-    if((optype == NUMERICAL_ARGMIN) || (optype == NUMERICAL_ARGMAX)) {
-        return MP_OBJ_NEW_SMALL_INT(best_idx);
-    } else {
-        return best_obj;
-    }    
-}
-
-mp_obj_t numerical_argmin_argmax_ndarray(ndarray_obj_t *ndarray, mp_obj_t axis, uint8_t optype) {
-    size_t m, n, increment, start, start_inc, N, len;
-    axis_sorter(ndarray, axis, &m, &n, &N, &increment, &len, &start_inc);
-    ndarray_obj_t *results;
-    if((optype == NUMERICAL_ARGMIN) || (optype == NUMERICAL_ARGMAX)) {
-        // we could save some RAM by taking NDARRAY_UINT8, if the dimensions 
-        // are smaller than 256, but the code would become more involved
-        // (we would also need extra flash space)
-        results = create_new_ndarray(m, n, NDARRAY_UINT16);
-    } else {
-        results = create_new_ndarray(m, n, ndarray->array->typecode);
-    }
-    
-    for(size_t j=0; j < N; j++) { // result index
-        start = j * start_inc;
-        if((ndarray->array->typecode == NDARRAY_UINT8) || (ndarray->array->typecode == NDARRAY_INT8)) {
-            if((optype == NUMERICAL_MAX) || (optype == NUMERICAL_MIN)) {
-                RUN_ARGMIN(ndarray, results, uint8_t, uint8_t, len, start, increment, optype, j);
-            } else {
-                RUN_ARGMIN(ndarray, results, uint8_t, uint16_t, len, start, increment, optype, j);                
-            }
-        } else if((ndarray->array->typecode == NDARRAY_UINT16) || (ndarray->array->typecode == NDARRAY_INT16)) {
-            RUN_ARGMIN(ndarray, results, uint16_t, uint16_t, len, start, increment, optype, j);
-        } else {
-            if((optype == NUMERICAL_MAX) || (optype == NUMERICAL_MIN)) {
-                RUN_ARGMIN(ndarray, results, mp_float_t, mp_float_t, len, start, increment, optype, j);
-            } else {
-                RUN_ARGMIN(ndarray, results, mp_float_t, uint16_t, len, start, increment, optype, j);                
-            }
-        }
-    }
-    return MP_OBJ_FROM_PTR(results);
-}
-
-STATIC mp_obj_t numerical_function(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args, uint8_t optype) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none} } ,
-        { MP_QSTR_axis, MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-    
-    mp_obj_t oin = args[0].u_obj;
-    mp_obj_t axis = args[1].u_obj;
-    if((axis != mp_const_none) && (mp_obj_get_int(axis) != 0) && (mp_obj_get_int(axis) != 1)) {
-        // this seems to pass with False, and True...
-        mp_raise_ValueError(translate("axis must be None, 0, or 1"));
-    }
-    
-    if(MP_OBJ_IS_TYPE(oin, &mp_type_tuple) || MP_OBJ_IS_TYPE(oin, &mp_type_list) || 
-        MP_OBJ_IS_TYPE(oin, &mp_type_range)) {
-        switch(optype) {
-            case NUMERICAL_MIN:
-            case NUMERICAL_ARGMIN:
-            case NUMERICAL_MAX:
-            case NUMERICAL_ARGMAX:
-                return numerical_argmin_argmax_iterable(oin, axis, optype);
-            case NUMERICAL_SUM:
-            case NUMERICAL_MEAN:
-                return numerical_sum_mean_std_iterable(oin, optype, 0);
-            default: // we should never reach this point, but whatever
-                return mp_const_none;
-        }
-    } else if(MP_OBJ_IS_TYPE(oin, &ulab_ndarray_type)) {
-        ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(oin);
-        switch(optype) {
-            case NUMERICAL_MIN:
-            case NUMERICAL_MAX:
-            case NUMERICAL_ARGMIN:
-            case NUMERICAL_ARGMAX:
-                return numerical_argmin_argmax_ndarray(ndarray, axis, optype);
-            case NUMERICAL_SUM:
-            case NUMERICAL_MEAN:
-                return numerical_sum_mean_ndarray(ndarray, axis, optype);
-            default:
-                mp_raise_NotImplementedError(translate("operation is not implemented on ndarrays"));
-        }
-    } else {
-        mp_raise_TypeError(translate("input must be tuple, list, range, or ndarray"));
-    }
-    return mp_const_none;
-}
-
-mp_obj_t numerical_min(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return numerical_function(n_args, pos_args, kw_args, NUMERICAL_MIN);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_min_obj, 1, numerical_min);
-
-mp_obj_t numerical_max(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return numerical_function(n_args, pos_args, kw_args, NUMERICAL_MAX);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_max_obj, 1, numerical_max);
-
-mp_obj_t numerical_argmin(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return numerical_function(n_args, pos_args, kw_args, NUMERICAL_ARGMIN);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_argmin_obj, 1, numerical_argmin);
-
-mp_obj_t numerical_argmax(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return numerical_function(n_args, pos_args, kw_args, NUMERICAL_ARGMAX);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_argmax_obj, 1, numerical_argmax);
-
-mp_obj_t numerical_sum(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return numerical_function(n_args, pos_args, kw_args, NUMERICAL_SUM);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_sum_obj, 1, numerical_sum);
-
-mp_obj_t numerical_mean(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    return numerical_function(n_args, pos_args, kw_args, NUMERICAL_MEAN);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_mean_obj, 1, numerical_mean);
-
-mp_obj_t numerical_std(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } } ,
-        { MP_QSTR_axis, MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_ddof, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = 0} },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-    
-    mp_obj_t oin = args[0].u_obj;
-    mp_obj_t axis = args[1].u_obj;
-    size_t ddof = args[2].u_int;
-    if((axis != mp_const_none) && (mp_obj_get_int(axis) != 0) && (mp_obj_get_int(axis) != 1)) {
-        // this seems to pass with False, and True...
-        mp_raise_ValueError(translate("axis must be None, 0, or 1"));
-    }
-    if(MP_OBJ_IS_TYPE(oin, &mp_type_tuple) || MP_OBJ_IS_TYPE(oin, &mp_type_list) || MP_OBJ_IS_TYPE(oin, &mp_type_range)) {
-        return numerical_sum_mean_std_iterable(oin, NUMERICAL_STD, ddof);
-    } else if(MP_OBJ_IS_TYPE(oin, &ulab_ndarray_type)) {
-        ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(oin);
-        return numerical_std_ndarray(ndarray, axis, ddof);
-    } else {
-        mp_raise_TypeError(translate("input must be tuple, list, range, or ndarray"));
-    }
-    return mp_const_none;
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_std_obj, 1, numerical_std);
-
-mp_obj_t numerical_roll(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none  } },
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(2, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-    
-    mp_obj_t oin = args[0].u_obj;
-    int16_t shift = mp_obj_get_int(args[1].u_obj);
-    if((args[2].u_obj != mp_const_none) && 
-           (mp_obj_get_int(args[2].u_obj) != 0) && 
-           (mp_obj_get_int(args[2].u_obj) != 1)) {
-        mp_raise_ValueError(translate("axis must be None, 0, or 1"));
-    }
-
-    ndarray_obj_t *in = MP_OBJ_TO_PTR(oin);
-    uint8_t _sizeof = mp_binary_get_size('@', in->array->typecode, NULL);
-    size_t len;
-    int16_t _shift;
-    uint8_t *array = (uint8_t *)in->array->items;
-    // TODO: transpose the matrix, if axis == 0. Though, that is hard on the RAM...
-    if(shift < 0) {
-        _shift = -shift;
-    } else {
-        _shift = shift;
-    }
-    if((args[2].u_obj == mp_const_none) || (mp_obj_get_int(args[2].u_obj) == 1)) { // shift horizontally
-        uint16_t M;
-        if(args[2].u_obj == mp_const_none) {
-            len = in->array->len;
-            M = 1;
-        } else {
-            len = in->n;
-            M = in->m;
-        }
-        _shift = _shift % len;
-        if(shift < 0) _shift = len - _shift;
-        // TODO: if(shift > len/2), we should move in the opposite direction. That would save RAM
-        _shift *= _sizeof;
-        uint8_t *tmp = m_new(uint8_t, _shift);
-        for(size_t m=0; m < M; m++) {
-            memmove(tmp, &array[m*len*_sizeof], _shift);
-            memmove(&array[m*len*_sizeof], &array[m*len*_sizeof+_shift], len*_sizeof-_shift);
-            memmove(&array[(m+1)*len*_sizeof-_shift], tmp, _shift);
-        }
-        m_del(uint8_t, tmp, _shift);
-        return mp_const_none;
-    } else {
-        len = in->m;
-        // temporary buffer
-        uint8_t *_data = m_new(uint8_t, _sizeof*len);
-        
-        _shift = _shift % len;
-        if(shift < 0) _shift = len - _shift;
-        _shift *= _sizeof;
-        uint8_t *tmp = m_new(uint8_t, _shift);
-
-        for(size_t n=0; n < in->n; n++) {
-            for(size_t m=0; m < len; m++) {
-                // this loop should fill up the temporary buffer
-                memmove(&_data[m*_sizeof], &array[(m*in->n+n)*_sizeof], _sizeof);
-            }
-            // now, the actual shift
-            memmove(tmp, _data, _shift);
-            memmove(_data, &_data[_shift], len*_sizeof-_shift);
-            memmove(&_data[len*_sizeof-_shift], tmp, _shift);
-            for(size_t m=0; m < len; m++) {
-                // this loop should dump the content of the temporary buffer into data
-                memmove(&array[(m*in->n+n)*_sizeof], &_data[m*_sizeof], _sizeof);
-            }            
-        }
-        m_del(uint8_t, tmp, _shift);
-        m_del(uint8_t, _data, _sizeof*len);
-        return mp_const_none;
-    }
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_roll_obj, 2, numerical_roll);
-
-mp_obj_t numerical_flip(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(1, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-    
-    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("flip argument must be an ndarray"));
-    }
-    if((args[1].u_obj != mp_const_none) && 
-           (mp_obj_get_int(args[1].u_obj) != 0) && 
-           (mp_obj_get_int(args[1].u_obj) != 1)) {
-        mp_raise_ValueError(translate("axis must be None, 0, or 1"));
-    }
-
-    ndarray_obj_t *in = MP_OBJ_TO_PTR(args[0].u_obj);
-    mp_obj_t oout = ndarray_copy(args[0].u_obj);
-    ndarray_obj_t *out = MP_OBJ_TO_PTR(oout);
-    uint8_t _sizeof = mp_binary_get_size('@', in->array->typecode, NULL);
-    uint8_t *array_in = (uint8_t *)in->array->items;
-    uint8_t *array_out = (uint8_t *)out->array->items;    
-    size_t len;
-    if((args[1].u_obj == mp_const_none) || (mp_obj_get_int(args[1].u_obj) == 1)) { // flip horizontally
-        uint16_t M = in->m;
-        len = in->n;
-        if(args[1].u_obj == mp_const_none) { // flip flattened array
-            len = in->array->len;
-            M = 1;
-        }
-        for(size_t m=0; m < M; m++) {
-            for(size_t n=0; n < len; n++) {
-                memcpy(array_out+_sizeof*(m*len+n), array_in+_sizeof*((m+1)*len-n-1), _sizeof);
-            }
-        }
-    } else { // flip vertically
-        for(size_t m=0; m < in->m; m++) {
-            for(size_t n=0; n < in->n; n++) {
-                memcpy(array_out+_sizeof*(m*in->n+n), array_in+_sizeof*((in->m-m-1)*in->n+n), _sizeof);
-            }
-        }
-    }
-    return out;
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_flip_obj, 1, numerical_flip);
-
-mp_obj_t numerical_diff(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_n, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = 1 } },
-        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = -1 } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(1, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-    
-    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("diff argument must be an ndarray"));
-    }
-    
-    ndarray_obj_t *in = MP_OBJ_TO_PTR(args[0].u_obj);
-    size_t increment, N, M;
-    if((args[2].u_int == -1) || (args[2].u_int == 1)) { // differentiate along the horizontal axis
-        increment = 1;
-    } else if(args[2].u_int == 0) { // differentiate along vertical axis
-        increment = in->n;
-    } else {
-        mp_raise_ValueError(translate("axis must be -1, 0, or 1"));
-    }
-    if((args[1].u_int < 0) || (args[1].u_int > 9)) {
-        mp_raise_ValueError(translate("n must be between 0, and 9"));
-    }
-    uint8_t n = args[1].u_int;
-    int8_t *stencil = m_new(int8_t, n+1);
-    stencil[0] = 1;
-    for(uint8_t i=1; i < n+1; i++) {
-        stencil[i] = -stencil[i-1]*(n-i+1)/i;
-    }
-
-    ndarray_obj_t *out;
-    
-    if(increment == 1) { // differentiate along the horizontal axis 
-        if(n >= in->n) {
-            out = create_new_ndarray(in->m, 0, in->array->typecode);
-            m_del(int8_t, stencil, n+1); // matches m_new(int8_t, n+1) above
-            return MP_OBJ_FROM_PTR(out);
-        }
-        N = in->n - n;
-        M = in->m;
-    } else { // differentiate along vertical axis
-        if(n >= in->m) {
-            out = create_new_ndarray(0, in->n, in->array->typecode);
-            m_del(int8_t, stencil, n+1); // matches m_new(int8_t, n+1) above
-            return MP_OBJ_FROM_PTR(out);
-        }
-        M = in->m - n;
-        N = in->n;
-    }
-    out = create_new_ndarray(M, N, in->array->typecode);
-    if(in->array->typecode == NDARRAY_UINT8) {
-        CALCULATE_DIFF(in, out, uint8_t, M, N, in->n, increment);
-    } else if(in->array->typecode == NDARRAY_INT8) {
-        CALCULATE_DIFF(in, out, int8_t, M, N, in->n, increment);
-    }  else if(in->array->typecode == NDARRAY_UINT16) {
-        CALCULATE_DIFF(in, out, uint16_t, M, N, in->n, increment);
-    } else if(in->array->typecode == NDARRAY_INT16) {
-        CALCULATE_DIFF(in, out, int16_t, M, N, in->n, increment);
-    } else {
-        CALCULATE_DIFF(in, out, mp_float_t, M, N, in->n, increment);
-    }
-    m_del(int8_t, stencil, n+1); // matches m_new(int8_t, n+1) above
-    return MP_OBJ_FROM_PTR(out);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_diff_obj, 1, numerical_diff);
-
-mp_obj_t numerical_sort_helper(mp_obj_t oin, mp_obj_t axis, uint8_t inplace) {
-    if(!MP_OBJ_IS_TYPE(oin, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("sort argument must be an ndarray"));
-    }
-
-    ndarray_obj_t *ndarray;
-    mp_obj_t out;
-    if(inplace == 1) {
-        ndarray = MP_OBJ_TO_PTR(oin);
-    } else {
-        out = ndarray_copy(oin);
-        ndarray = MP_OBJ_TO_PTR(out);
-    }
-    size_t increment, start_inc, end, N;
-    if(axis == mp_const_none) { // flatten the array
-        ndarray->m = 1;
-        ndarray->n = ndarray->array->len;
-        increment = 1;
-        start_inc = ndarray->n;
-        end = ndarray->n;
-        N = ndarray->n;
-    } else if((mp_obj_get_int(axis) == -1) || 
-              (mp_obj_get_int(axis) == 1)) { // sort along the horizontal axis
-        increment = 1;
-        start_inc = ndarray->n;
-        end = ndarray->array->len;
-        N = ndarray->n;
-    } else if(mp_obj_get_int(axis) == 0) { // sort along vertical axis
-        increment = ndarray->n;
-        start_inc = 1;
-        end = ndarray->m;
-        N = ndarray->m;
-    } else {
-        mp_raise_ValueError(translate("axis must be -1, 0, None, or 1"));
-    }
-    
-    size_t q, k, p, c;
-
-    for(size_t start=0; start < end; start+=start_inc) {
-        q = N; 
-        k = (q >> 1);
-        if((ndarray->array->typecode == NDARRAY_UINT8) || (ndarray->array->typecode == NDARRAY_INT8)) {
-            HEAPSORT(uint8_t, ndarray);
-        } else if((ndarray->array->typecode == NDARRAY_UINT16) || (ndarray->array->typecode == NDARRAY_INT16)) {
-            HEAPSORT(uint16_t, ndarray);
-        } else {
-            HEAPSORT(mp_float_t, ndarray);
-        }
-    }
-    if(inplace == 1) {
-        return mp_const_none;
-    } else {
-        return out;
-    }
-}
-
-// numpy function
-mp_obj_t numerical_sort(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_int = -1 } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(1, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-
-    return numerical_sort_helper(args[0].u_obj, args[1].u_obj, 0);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_sort_obj, 1, numerical_sort);
-
-// method of an ndarray
-mp_obj_t numerical_sort_inplace(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_int = -1 } },
-    };
-
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(1, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-
-    return numerical_sort_helper(args[0].u_obj, args[1].u_obj, 1);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_sort_inplace_obj, 1, numerical_sort_inplace);
-
-mp_obj_t numerical_argsort(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
-    static const mp_arg_t allowed_args[] = {
-        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
-        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_int = -1 } },
-    };
-    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
-    mp_arg_parse_all(1, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
-    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type)) {
-        mp_raise_TypeError(translate("argsort argument must be an ndarray"));
-    }
-
-    ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(args[0].u_obj);
-    size_t increment, start_inc, end, N, m, n;
-    if(args[1].u_obj == mp_const_none) { // flatten the array
-        m = 1;
-        n = ndarray->array->len;
-        ndarray->m = m;
-        ndarray->n = n;
-        increment = 1;
-        start_inc = ndarray->n;
-        end = ndarray->n;
-        N = n;
-    } else if((mp_obj_get_int(args[1].u_obj) == -1) || 
-              (mp_obj_get_int(args[1].u_obj) == 1)) { // sort along the horizontal axis
-        m = ndarray->m;
-        n = ndarray->n;
-        increment = 1;
-        start_inc = n;
-        end = ndarray->array->len;
-        N = n;
-    } else if(mp_obj_get_int(args[1].u_obj) == 0) { // sort along vertical axis
-        m = ndarray->m;
-        n = ndarray->n;
-        increment = n;
-        start_inc = 1;
-        end = m;
-        N = m;
-    } else {
-        mp_raise_ValueError(translate("axis must be -1, 0, None, or 1"));
-    }
-
-    // at the expense of flash, we could save RAM by creating 
-    // an NDARRAY_UINT16 ndarray only, if needed, otherwise, NDARRAY_UINT8
-    ndarray_obj_t *indices = create_new_ndarray(m, n, NDARRAY_UINT16);
-    uint16_t *index_array = (uint16_t *)indices->array->items;
-    // initialise the index array
-    // if array is flat: 0 to indices->n
-    // if sorting vertically, identical indices are arranged row-wise
-    // if sorting horizontally, identical indices are arranged column-wise
-    for(uint16_t start=0; start < end; start+=start_inc) {
-        for(uint16_t s=0; s < N; s++) {
-            index_array[start+s*increment] = s;
-        }
-    }
-
-    size_t q, k, p, c;
-    for(size_t start=0; start < end; start+=start_inc) {
-        q = N; 
-        k = (q >> 1);
-        if((ndarray->array->typecode == NDARRAY_UINT8) || (ndarray->array->typecode == NDARRAY_INT8)) {
-            HEAP_ARGSORT(uint8_t, ndarray, index_array);
-        } else if((ndarray->array->typecode == NDARRAY_UINT16) || (ndarray->array->typecode == NDARRAY_INT16)) {
-            HEAP_ARGSORT(uint16_t, ndarray, index_array);
-        } else {
-            HEAP_ARGSORT(mp_float_t, ndarray, index_array);
-        }
-    }
-    return MP_OBJ_FROM_PTR(indices);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_KW(numerical_argsort_obj, 1, numerical_argsort);
-
-#if !CIRCUITPY
-STATIC const mp_rom_map_elem_t ulab_numerical_globals_table[] = {
-    { MP_OBJ_NEW_QSTR(MP_QSTR_linspace), (mp_obj_t)&numerical_linspace_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_sum), (mp_obj_t)&numerical_sum_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_mean), (mp_obj_t)&numerical_mean_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_std), (mp_obj_t)&numerical_std_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_min), (mp_obj_t)&numerical_min_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_max), (mp_obj_t)&numerical_max_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_argmin), (mp_obj_t)&numerical_argmin_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_argmax), (mp_obj_t)&numerical_argmax_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_roll), (mp_obj_t)&numerical_roll_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_flip), (mp_obj_t)&numerical_flip_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_diff), (mp_obj_t)&numerical_diff_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_sort), (mp_obj_t)&numerical_sort_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_argsort), (mp_obj_t)&numerical_argsort_obj },    
-};
-
-STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_numerical_globals, ulab_numerical_globals_table);
-
-mp_obj_module_t ulab_numerical_module = {
-    .base = { &mp_type_module },
-    .globals = (mp_obj_dict_t*)&mp_module_ulab_numerical_globals,
-};
-#endif
-
-#endif
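
Note on the removed `numerical_std_ndarray` path: the `RUN_STD` macro walks one row or column of the flat buffer using `start` and `increment`, computes the mean in a first pass, accumulates the squared deviations in a second pass, and the caller then divides by `len - ddof` before taking the square root. The following is a minimal standalone sketch of that two-pass scheme, not the ulab implementation: the function name `strided_variance`, the use of `double` instead of `mp_float_t`, and the `assert` guard against `len <= ddof` are assumptions made for illustration. It returns the variance; the standard deviation is its square root.

```c
#include <assert.h>
#include <stddef.h>

/* Two-pass variance over len elements of a flat buffer, visiting
 * data[start], data[start + increment], data[start + 2*increment], ...
 * ddof is the "delta degrees of freedom": the divisor is (len - ddof).
 * Hypothetical helper, written for illustration only. */
static double strided_variance(const double *data, size_t start,
                               size_t increment, size_t len, size_t ddof) {
    assert(len > ddof); /* assumption: avoid dividing by zero or wrapping */

    /* first pass: mean of the selected elements */
    double mean = 0.0;
    for (size_t j = 0; j < len; j++) {
        mean += data[start + j * increment];
    }
    mean /= (double)len;

    /* second pass: sum of squared deviations from the mean */
    double sum_sq = 0.0;
    for (size_t j = 0; j < len; j++) {
        double value = data[start + j * increment] - mean;
        sum_sq += value * value;
    }
    return sum_sq / (double)(len - ddof);
}
```

For a row-major 2x3 matrix, a column is selected with `increment` equal to the row length (3) and a row with `increment` equal to 1, which mirrors how the removed code distinguishes the two axes.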
--- a/code/numerical.h
+++ b/code/numerical.h
@ -1,167 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#ifndef _NUMERICAL_
-#define _NUMERICAL_
-
-#include "ulab.h"
-#include "ndarray.h"
-
-#if ULAB_NUMERICAL_MODULE
-
-extern mp_obj_module_t ulab_numerical_module;
-
-// TODO: implement minimum/maximum, and cumsum
-//mp_obj_t numerical_minimum(mp_obj_t , mp_obj_t );
-//mp_obj_t numerical_maximum(mp_obj_t , mp_obj_t );
-//mp_obj_t numerical_cumsum(size_t , const mp_obj_t *, mp_map_t *);
-
-#define RUN_ARGMIN(in, out, typein, typeout, len, start, increment, op, pos) do {\
-    typein *array = (typein *)(in)->array->items;\
-    typeout *outarray = (typeout *)(out)->array->items;\
-    size_t best_index = 0;\
-    if(((op) == NUMERICAL_MAX) || ((op) == NUMERICAL_ARGMAX)) {\
-        for(size_t i=1; i < (len); i++) {\
-            if(array[(start)+i*(increment)] > array[(start)+best_index*(increment)]) best_index = i;\
-        }\
-        if((op) == NUMERICAL_MAX) outarray[(pos)] = array[(start)+best_index*(increment)];\
-        else outarray[(pos)] = best_index;\
-    } else{\
-        for(size_t i=1; i < (len); i++) {\
-            if(array[(start)+i*(increment)] < array[(start)+best_index*(increment)]) best_index = i;\
-        }\
-        if((op) == NUMERICAL_MIN) outarray[(pos)] = array[(start)+best_index*(increment)];\
-        else outarray[(pos)] = best_index;\
-    }\
-} while(0)
-
-#define RUN_SUM(ndarray, type, optype, len, start, increment) do {\
-    type *array = (type *)(ndarray)->array->items;\
-    type value;\
-    for(size_t j=0; j < (len); j++) {\
-        value = array[(start)+j*(increment)];\
-        sum += value;\
-    }\
-} while(0)
-
-#define RUN_STD(ndarray, type, len, start, increment) do {\
-    type *array = (type *)(ndarray)->array->items;\
-    mp_float_t value;\
-    for(size_t j=0; j < (len); j++) {\
-        sum += array[(start)+j*(increment)];\
-    }\
-    sum /= (len);\
-    for(size_t j=0; j < (len); j++) {\
-        value = (array[(start)+j*(increment)] - sum);\
-        sum_sq += value * value;\
-    }\
-} while(0)
-
-#define CALCULATE_DIFF(in, out, type, M, N, inn, increment) do {\
-    type *source = (type *)(in)->array->items;\
-    type *target = (type *)(out)->array->items;\
-    for(size_t i=0; i < (M); i++) {\
-        for(size_t j=0; j < (N); j++) {\
-            for(uint8_t k=0; k < n+1; k++) {\
-                target[i*(N)+j] -= stencil[k]*source[i*(inn)+j+k*(increment)];\
-            }\
-        }\
-    }\
-} while(0)
-
-#define HEAPSORT(type, ndarray) do {\
-    type *array = (type *)(ndarray)->array->items;\
-    type tmp;\
-    for (;;) {\
-        if (k > 0) {\
-            tmp = array[start+(--k)*increment];\
-        } else {\
-            q--;\
-            if(q == 0) {\
-                break;\
-            }\
-            tmp = array[start+q*increment];\
-            array[start+q*increment] = array[start];\
-        }\
-        p = k;\
-        c = k + k + 1;\
-        while (c < q) {\
-            if((c + 1 < q)  &&  (array[start+(c+1)*increment] > array[start+c*increment])) {\
-                c++;\
-            }\
-            if(array[start+c*increment] > tmp) {\
-                array[start+p*increment] = array[start+c*increment];\
-                p = c;\
-                c = p + p + 1;\
-            } else {\
-                break;\
-            }\
-        }\
-        array[start+p*increment] = tmp;\
-    }\
-} while(0)
-
-// This is pretty similar to HEAPSORT above; perhaps, the two could be combined somehow
-// On the other hand, since this is a macro, it doesn't really matter
-// Keep in mind that initially, index_array[start+s*increment] = s
-#define HEAP_ARGSORT(type, ndarray, index_array) do {\
-    type *array = (type *)(ndarray)->array->items;\
-    type tmp;\
-    uint16_t itmp;\
-    for (;;) {\
-        if (k > 0) {\
-            k--;\
-            tmp = array[start+index_array[start+k*increment]*increment];\
-            itmp = index_array[start+k*increment];\
-        } else {\
-            q--;\
-            if(q == 0) {\
-                break;\
-            }\
-            tmp = array[start+index_array[start+q*increment]*increment];\
-            itmp = index_array[start+q*increment];\
-            index_array[start+q*increment] = index_array[start];\
-        }\
-        p = k;\
-        c = k + k + 1;\
-        while (c < q) {\
-            if((c + 1 < q)  &&  (array[start+index_array[start+(c+1)*increment]*increment] > array[start+index_array[start+c*increment]*increment])) {\
-                c++;\
-            }\
-            if(array[start+index_array[start+c*increment]*increment] > tmp) {\
-                index_array[start+p*increment] = index_array[start+c*increment];\
-                p = c;\
-                c = p + p + 1;\
-            } else {\
-                break;\
-            }\
-        }\
-        index_array[start+p*increment] = itmp;\
-    }\
-} while(0)
-
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_linspace_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_min_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_max_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_argmin_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_argmax_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_sum_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_mean_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_std_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_roll_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_flip_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_diff_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_sort_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_sort_inplace_obj);
-MP_DECLARE_CONST_FUN_OBJ_KW(numerical_argsort_obj);
-
-#endif
-#endif
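
Note on the removed `HEAPSORT` macro: it sorts, in place, the `N` elements found at `start`, `start + increment`, `start + 2*increment`, and so on, so the same loop serves rows (`increment = 1`) and columns (`increment` = row length). The sketch below restates that loop as an ordinary function so the control flow is easier to follow. It is an illustration under stated assumptions, not the ulab code: the name `strided_heapsort_i16` is hypothetical, the element type is fixed to `int16_t` so that comparisons use the signed type of the data, and an early return for `N < 2` is added because the macro's `q--` would wrap around for an empty range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place ascending heapsort of the N elements
 * array[start], array[start + increment], ..., array[start + (N-1)*increment].
 * Hypothetical function form of the HEAPSORT macro, for illustration. */
static void strided_heapsort_i16(int16_t *array, size_t start,
                                 size_t increment, size_t N) {
    if (N < 2) {
        return; /* assumption: nothing to sort, and avoids q-- wrapping */
    }
    size_t q = N;       /* current heap size */
    size_t k = N >> 1;  /* index used while building the heap */
    size_t p, c;        /* parent and child positions during sift-down */
    int16_t tmp;

    for (;;) {
        if (k > 0) {
            /* heap construction: sift down the next internal node */
            tmp = array[start + (--k) * increment];
        } else {
            /* extraction: move the current maximum to the end of the heap */
            q--;
            if (q == 0) {
                break;
            }
            tmp = array[start + q * increment];
            array[start + q * increment] = array[start];
        }
        p = k;
        c = k + k + 1;
        while (c < q) {
            /* pick the larger of the two children */
            if ((c + 1 < q) &&
                (array[start + (c + 1) * increment] > array[start + c * increment])) {
                c++;
            }
            if (array[start + c * increment] > tmp) {
                array[start + p * increment] = array[start + c * increment];
                p = c;
                c = p + p + 1;
            } else {
                break;
            }
        }
        array[start + p * increment] = tmp;
    }
}
```

Sorting column 0 of a row-major 2x3 matrix would be `strided_heapsort_i16(a, 0, 3, 2)`, while `strided_heapsort_i16(a, 0, 1, 6)` sorts the flattened buffer.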
--- a/code/numpy/approx/approx.c
+++ b/code/numpy/approx/approx.c
@ -0,0 +1,222 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ *               2020 Diego Elio Pettenò
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+#include "py/obj.h"
+#include "py/runtime.h"
+#include "py/misc.h"
+
+#include "../../ulab.h"
+#include "../../ulab_tools.h"
+#include "approx.h"
+
+//| """Numerical approximation methods"""
+//|
+
+const mp_obj_float_t approx_trapz_dx = {{&mp_type_float}, MICROPY_FLOAT_CONST(1.0)};
+
+#if ULAB_NUMPY_HAS_INTERP
+//| def interp(
+//|     x: ulab.ndarray,
+//|     xp: ulab.ndarray,
+//|     fp: ulab.ndarray,
+//|     *,
+//|     left: Optional[float] = None,
+//|     right: Optional[float] = None
+//| ) -> ulab.ndarray:
+//|     """
+//|     :param ulab.ndarray x: The x-coordinates at which to evaluate the interpolated values.
+//|     :param ulab.ndarray xp: The x-coordinates of the data points, must be increasing
+//|     :param ulab.ndarray fp: The y-coordinates of the data points, same length as xp
+//|     :param left: Value to return for ``x < xp[0]``, default is ``fp[0]``.
+//|     :param right: Value to return for ``x > xp[-1]``, default is ``fp[-1]``.
+//|
+//|     Returns the one-dimensional piecewise linear interpolant to a function with given discrete data points (xp, fp), evaluated at x."""
+//|     ...
+//|
+
+STATIC mp_obj_t approx_interp(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_left, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none} },
+        { MP_QSTR_right, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none} },
+    };
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    // TODO: numpy allows generic iterables
+    ndarray_obj_t *x = ndarray_from_mp_obj(args[0].u_obj);
+    ndarray_obj_t *xp = ndarray_from_mp_obj(args[1].u_obj); // xp must hold an increasing sequence of independent values
+    ndarray_obj_t *fp = ndarray_from_mp_obj(args[2].u_obj);
+    if((xp->ndim != 1) || (fp->ndim != 1) || (xp->len < 2) || (fp->len < 2) || (xp->len != fp->len)) {
+        mp_raise_ValueError(translate("interp is defined for 1D arrays of equal length"));
+    }
+
+    ndarray_obj_t *y = ndarray_new_linear_array(x->len, NDARRAY_FLOAT);
+    mp_float_t left_value, right_value;
+    uint8_t *xparray = (uint8_t *)xp->array;
+
+    mp_float_t xp_left = ndarray_get_float_value(xparray, xp->dtype);
+    xparray += (xp->len-1) * xp->strides[ULAB_MAX_DIMS - 1];
+    mp_float_t xp_right = ndarray_get_float_value(xparray, xp->dtype);
+
+    uint8_t *fparray = (uint8_t *)fp->array;
+
+    if(args[3].u_obj == mp_const_none) {
+        left_value = ndarray_get_float_value(fparray, fp->dtype);
+    } else {
+        left_value = mp_obj_get_float(args[3].u_obj);
+    }
+    if(args[4].u_obj == mp_const_none) {
+        fparray += (fp->len-1) * fp->strides[ULAB_MAX_DIMS - 1];
+        right_value = ndarray_get_float_value(fparray, fp->dtype);
+    } else {
+        right_value = mp_obj_get_float(args[4].u_obj);
+    }
+
+    xparray = xp->array;
+    fparray = fp->array;
+
+    uint8_t *xarray = (uint8_t *)x->array;
+    mp_float_t *yarray = (mp_float_t *)y->array;
+    uint8_t *temp;
+
+    for(size_t i=0; i < x->len; i++, yarray++) {
+        mp_float_t x_value = ndarray_get_float_value(xarray, x->dtype);
+        xarray += x->strides[ULAB_MAX_DIMS - 1];
+        if(x_value < xp_left) {
+            *yarray = left_value;
+        } else if(x_value > xp_right) {
+            *yarray = right_value;
+        } else { // do the binary search here
+            mp_float_t xp_left_, xp_right_;
+            mp_float_t fp_left, fp_right;
+            size_t left_index = 0, right_index = xp->len - 1, middle_index;
+            while(right_index - left_index > 1) {
+                middle_index = left_index + (right_index - left_index) / 2;
+                temp = xparray + middle_index * xp->strides[ULAB_MAX_DIMS - 1];
+                mp_float_t xp_middle = ndarray_get_float_value(temp, xp->dtype);
+                if(x_value <= xp_middle) {
+                    right_index = middle_index;
+                } else {
+                    left_index = middle_index;
+                }
+            }
+            temp = xparray + left_index * xp->strides[ULAB_MAX_DIMS - 1];
+            xp_left_ = ndarray_get_float_value(temp, xp->dtype);
+
+            temp = xparray + right_index * xp->strides[ULAB_MAX_DIMS - 1];
+            xp_right_ = ndarray_get_float_value(temp, xp->dtype);
+
+            temp = fparray + left_index * fp->strides[ULAB_MAX_DIMS - 1];
+            fp_left = ndarray_get_float_value(temp, fp->dtype);
+
+            temp = fparray + right_index * fp->strides[ULAB_MAX_DIMS - 1];
+            fp_right = ndarray_get_float_value(temp, fp->dtype);
+
+            *yarray = fp_left + (x_value - xp_left_) * (fp_right - fp_left) / (xp_right_ - xp_left_);
+        }
+    }
+    return MP_OBJ_FROM_PTR(y);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(approx_interp_obj, 2, approx_interp);
+#endif
+
+#if ULAB_NUMPY_HAS_TRAPZ
+//| def trapz(y: ulab.ndarray, x: Optional[ulab.ndarray] = None, dx: float = 1.0) -> float:
+//|     """
+//|     :param 1D ulab.ndarray y: the values of the dependent variable
+//|     :param 1D ulab.ndarray x: optional, the coordinates of the independent variable. Defaults to uniformly spaced values.
+//|     :param float dx: the spacing between sample points, if x=None
+//|
+//|     Returns the integral of y(x) using the trapezoidal rule.
+//|     """
+//|     ...
+//|
+
+STATIC mp_obj_t approx_trapz(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_x, MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_dx, MP_ARG_OBJ, {.u_rom_obj = MP_ROM_PTR(&approx_trapz_dx)} },
+    };
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    ndarray_obj_t *y = ndarray_from_mp_obj(args[0].u_obj);
+    ndarray_obj_t *x;
+    mp_float_t mean = MICROPY_FLOAT_CONST(0.0);
+    if(y->len < 2) {
+        return mp_obj_new_float(mean);
+    }
+    if((y->ndim != 1)) {
+        mp_raise_ValueError(translate("trapz is defined for 1D arrays"));
+    }
+
+    mp_float_t (*funcy)(void *) = ndarray_get_float_function(y->dtype);
+    uint8_t *yarray = (uint8_t *)y->array;
+
+    size_t count = 1;
+    mp_float_t y1, y2, m;
+
+    if(args[1].u_obj != mp_const_none) {
+        x = ndarray_from_mp_obj(args[1].u_obj); // x must hold an increasing sequence of independent values
+        if((x->ndim != 1) || (y->len != x->len)) {
+            mp_raise_ValueError(translate("trapz is defined for 1D arrays of equal length"));
+        }
+
+        mp_float_t (*funcx)(void *) = ndarray_get_float_function(x->dtype);
+        uint8_t *xarray = (uint8_t *)x->array;
+        mp_float_t x1, x2;
+
+        y1 = funcy(yarray);
+        yarray += y->strides[ULAB_MAX_DIMS - 1];
+        x1 = funcx(xarray);
+        xarray += x->strides[ULAB_MAX_DIMS - 1];
+
+        for(size_t i=1; i < y->len; i++) {
+            y2 = funcy(yarray);
+            yarray += y->strides[ULAB_MAX_DIMS - 1];
+            x2 = funcx(xarray);
+            xarray += x->strides[ULAB_MAX_DIMS - 1];
+            mp_float_t value = (x2 - x1) * (y2 + y1);
+            m = mean + (value - mean) / (mp_float_t)count;
+            mean = m;
+            x1 = x2;
+            y1 = y2;
+            count++;
+        }
+    } else {
+        mp_float_t dx = mp_obj_get_float(args[2].u_obj);
+        y1 = funcy(yarray);
+        yarray += y->strides[ULAB_MAX_DIMS - 1];
+
+        for(size_t i=1; i < y->len; i++, yarray += y->strides[ULAB_MAX_DIMS - 1]) {
+            y2 = funcy(yarray); // read through the stride, so that non-dense views are handled, too
+            mp_float_t value = (y2 + y1);
+            m = mean + (value - mean) / (mp_float_t)count;
+            mean = m;
+            y1 = y2;
+            count++;
+        }
+        mean *= dx;
+    }
+    return mp_obj_new_float(MICROPY_FLOAT_CONST(0.5)*mean*(y->len-1));
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(approx_trapz_obj, 1, approx_trapz);
+#endif
--- a/code/numpy/approx/approx.h
+++ b/code/numpy/approx/approx.h
@ -0,0 +1,29 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+*/
+
+#ifndef _APPROX_
+#define _APPROX_
+
+#include "../../ulab.h"
+#include "../../ndarray.h"
+
+#define     APPROX_EPS          MICROPY_FLOAT_CONST(1.0e-4)
+#define     APPROX_NONZDELTA    MICROPY_FLOAT_CONST(0.05)
+#define     APPROX_ZDELTA       MICROPY_FLOAT_CONST(0.00025)
+#define     APPROX_ALPHA        MICROPY_FLOAT_CONST(1.0)
+#define     APPROX_BETA         MICROPY_FLOAT_CONST(2.0)
+#define     APPROX_GAMMA        MICROPY_FLOAT_CONST(0.5)
+#define     APPROX_DELTA        MICROPY_FLOAT_CONST(0.5)
+
+MP_DECLARE_CONST_FUN_OBJ_KW(approx_interp_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(approx_trapz_obj);
+
+#endif  /* _APPROX_ */
--- a/code/numpy/compare/compare.c
+++ b/code/numpy/compare/compare.c
@ -0,0 +1,205 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ *               2020 Jeff Epler for Adafruit Industries
+*/
+
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+#include "py/obj.h"
+#include "py/runtime.h"
+#include "py/misc.h"
+
+#include "../../ulab.h"
+#include "../../ndarray_operators.h"
+#include "compare.h"
+
+static mp_obj_t compare_function(mp_obj_t x1, mp_obj_t x2, uint8_t op) {
+    ndarray_obj_t *lhs = ndarray_from_mp_obj(x1);
+    ndarray_obj_t *rhs = ndarray_from_mp_obj(x2);
+    uint8_t ndim = 0;
+    size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+    int32_t *lstrides = m_new(int32_t, ULAB_MAX_DIMS);
+    int32_t *rstrides = m_new(int32_t, ULAB_MAX_DIMS);
+    if(!ndarray_can_broadcast(lhs, rhs, &ndim, shape, lstrides, rstrides)) {
+        m_del(size_t, shape, ULAB_MAX_DIMS);
+        m_del(int32_t, lstrides, ULAB_MAX_DIMS);
+        m_del(int32_t, rstrides, ULAB_MAX_DIMS);
+        mp_raise_ValueError(translate("operands could not be broadcast together"));
+    }
+
+    uint8_t *larray = (uint8_t *)lhs->array;
+    uint8_t *rarray = (uint8_t *)rhs->array;
+
+    if(op == COMPARE_EQUAL) {
+        return ndarray_binary_equality(lhs, rhs, ndim, shape, lstrides, rstrides, MP_BINARY_OP_EQUAL);
+    } else if(op == COMPARE_NOT_EQUAL) {
+        return ndarray_binary_equality(lhs, rhs, ndim, shape, lstrides, rstrides, MP_BINARY_OP_NOT_EQUAL);
+    }
+    // These are the upcasting rules
+    // float always becomes float
+    // operation on identical types preserves type
+    // uint8 + int8 => int16
+    // uint8 + int16 => int16
+    // uint8 + uint16 => uint16
+    // int8 + int16 => int16
+    // int8 + uint16 => uint16
+    // uint16 + int16 => float
+    // The parameters of RUN_COMPARE_LOOP are
+    // typecode of result, type_out, type_left, type_right, lhs operand, rhs operand, operator
+    if(lhs->dtype == NDARRAY_UINT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            RUN_COMPARE_LOOP(NDARRAY_UINT8, uint8_t, uint8_t, uint8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            RUN_COMPARE_LOOP(NDARRAY_INT16, int16_t, uint8_t, int8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            RUN_COMPARE_LOOP(NDARRAY_UINT16, uint16_t, uint8_t, uint16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            RUN_COMPARE_LOOP(NDARRAY_INT16, int16_t, uint8_t, int16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, uint8_t, mp_float_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        }
+    } else if(lhs->dtype == NDARRAY_INT8) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            RUN_COMPARE_LOOP(NDARRAY_INT16, int16_t, int8_t, uint8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            RUN_COMPARE_LOOP(NDARRAY_INT8, int8_t, int8_t, int8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            RUN_COMPARE_LOOP(NDARRAY_UINT16, uint16_t, int8_t, uint16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            RUN_COMPARE_LOOP(NDARRAY_INT16, int16_t, int8_t, int16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, int8_t, mp_float_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        }
+    } else if(lhs->dtype == NDARRAY_UINT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            RUN_COMPARE_LOOP(NDARRAY_UINT16, uint16_t, uint16_t, uint8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            RUN_COMPARE_LOOP(NDARRAY_UINT16, uint16_t, uint16_t, int8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            RUN_COMPARE_LOOP(NDARRAY_UINT16, uint16_t, uint16_t, uint16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, uint16_t, int16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, uint16_t, mp_float_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        }
+    } else if(lhs->dtype == NDARRAY_INT16) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            RUN_COMPARE_LOOP(NDARRAY_INT16, int16_t, int16_t, uint8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            RUN_COMPARE_LOOP(NDARRAY_INT16, int16_t, int16_t, int8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, int16_t, uint16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            RUN_COMPARE_LOOP(NDARRAY_INT16, int16_t, int16_t, int16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, int16_t, mp_float_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        }
+    } else if(lhs->dtype == NDARRAY_FLOAT) {
+        if(rhs->dtype == NDARRAY_UINT8) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, mp_float_t, uint8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT8) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, mp_float_t, int8_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_UINT16) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, mp_float_t, uint16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_INT16) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, mp_float_t, int16_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        } else if(rhs->dtype == NDARRAY_FLOAT) {
+            RUN_COMPARE_LOOP(NDARRAY_FLOAT, mp_float_t, mp_float_t, mp_float_t, larray, lstrides, rarray, rstrides, ndim, shape, op);
+        }
+    }
+    return mp_const_none; // we should never reach this point
+}
+
+static mp_obj_t compare_equal_helper(mp_obj_t x1, mp_obj_t x2, uint8_t comptype) {
+    // scalar comparisons should return a single object of mp_obj_t type
+    mp_obj_t result = compare_function(x1, x2, comptype);
+    if((MP_OBJ_IS_INT(x1) || mp_obj_is_float(x1)) && (MP_OBJ_IS_INT(x2) || mp_obj_is_float(x2))) {
+        mp_obj_iter_buf_t iter_buf;
+        mp_obj_t iterable = mp_getiter(result, &iter_buf);
+        mp_obj_t item = mp_iternext(iterable);
+        return item;
+    }
+    return result;
+}
+
+#if ULAB_NUMPY_HAS_CLIP
+
+mp_obj_t compare_clip(mp_obj_t x1, mp_obj_t x2, mp_obj_t x3) {
+    // Note: this function could be made faster by implementing a single-loop comparison in
+    // RUN_COMPARE_LOOP. However, that would add around 2 kB of compile size, while we
+    // would not gain a factor of two in speed, since the two comparisons should still be
+    // evaluated. In contrast, calling the function twice adds only 140 bytes to the firmware
+    if(mp_obj_is_int(x1) || mp_obj_is_float(x1)) {
+        mp_float_t v1 = mp_obj_get_float(x1);
+        mp_float_t v2 = mp_obj_get_float(x2);
+        mp_float_t v3 = mp_obj_get_float(x3);
+        if(v1 < v2) {
+            return x2;
+        } else if(v1 > v3) {
+            return x3;
+        } else {
+            return x1;
+        }
+    } else { // assume ndarrays
+        return compare_function(x2, compare_function(x1, x3, COMPARE_MINIMUM), COMPARE_MAXIMUM);
+    }
+}
+
+MP_DEFINE_CONST_FUN_OBJ_3(compare_clip_obj, compare_clip);
+#endif
+
+#if ULAB_NUMPY_HAS_EQUAL
+
+mp_obj_t compare_equal(mp_obj_t x1, mp_obj_t x2) {
+    return compare_equal_helper(x1, x2, COMPARE_EQUAL);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_2(compare_equal_obj, compare_equal);
+#endif
+
+#if ULAB_NUMPY_HAS_NOTEQUAL
+
+mp_obj_t compare_not_equal(mp_obj_t x1, mp_obj_t x2) {
+    return compare_equal_helper(x1, x2, COMPARE_NOT_EQUAL);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_2(compare_not_equal_obj, compare_not_equal);
+#endif
+
+#if ULAB_NUMPY_HAS_MAXIMUM
+
+mp_obj_t compare_maximum(mp_obj_t x1, mp_obj_t x2) {
+    // extra round, so that we can return maximum(3, 4) properly
+    mp_obj_t result = compare_function(x1, x2, COMPARE_MAXIMUM);
+    if((MP_OBJ_IS_INT(x1) || mp_obj_is_float(x1)) && (MP_OBJ_IS_INT(x2) || mp_obj_is_float(x2))) {
+        ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(result);
+        return mp_binary_get_val_array(ndarray->dtype, ndarray->array, 0);
+    }
+    return result;
+}
+
+MP_DEFINE_CONST_FUN_OBJ_2(compare_maximum_obj, compare_maximum);
+#endif
+
+#if ULAB_NUMPY_HAS_MINIMUM
+
+mp_obj_t compare_minimum(mp_obj_t x1, mp_obj_t x2) {
+    // extra round, so that we can return minimum(3, 4) properly
+    mp_obj_t result = compare_function(x1, x2, COMPARE_MINIMUM);
+    if((MP_OBJ_IS_INT(x1) || mp_obj_is_float(x1)) && (MP_OBJ_IS_INT(x2) || mp_obj_is_float(x2))) {
+        ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(result);
+        return mp_binary_get_val_array(ndarray->dtype, ndarray->array, 0);
+    }
+    return result;
+}
+
+MP_DEFINE_CONST_FUN_OBJ_2(compare_minimum_obj, compare_minimum);
+#endif
--- a/code/numpy/compare/compare.h
+++ b/code/numpy/compare/compare.h
@ -0,0 +1,147 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+*/
+
+#ifndef _COMPARE_
+#define _COMPARE_
+
+#include "../../ulab.h"
+#include "../../ndarray.h"
+
+enum COMPARE_FUNCTION_TYPE {
+    COMPARE_EQUAL,
+    COMPARE_NOT_EQUAL,
+    COMPARE_MINIMUM,
+    COMPARE_MAXIMUM,
+    COMPARE_CLIP,
+};
+
+MP_DECLARE_CONST_FUN_OBJ_2(compare_equal_obj);
+MP_DECLARE_CONST_FUN_OBJ_2(compare_not_equal_obj);
+MP_DECLARE_CONST_FUN_OBJ_2(compare_minimum_obj);
+MP_DECLARE_CONST_FUN_OBJ_2(compare_maximum_obj);
+MP_DECLARE_CONST_FUN_OBJ_3(compare_clip_obj);
+
+#if ULAB_MAX_DIMS == 1
+#define COMPARE_LOOP(results, array, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t l = 0;\
+    do {\
+        *((type_out *)(array)) = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? (type_out)(*((type_left *)(larray))) : (type_out)(*((type_right *)(rarray)));\
+        (array) += (results)->strides[ULAB_MAX_DIMS - 1];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l <  results->shape[ULAB_MAX_DIMS - 1]);\
+    return MP_OBJ_FROM_PTR(results);\
+
+#endif // ULAB_MAX_DIMS == 1
+
+#if ULAB_MAX_DIMS == 2
+#define COMPARE_LOOP(results, array, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *((type_out *)(array)) = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? (type_out)(*((type_left *)(larray))) : (type_out)(*((type_right *)(rarray)));\
+            (array) += (results)->strides[ULAB_MAX_DIMS - 1];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l <  results->shape[ULAB_MAX_DIMS - 1]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k <  results->shape[ULAB_MAX_DIMS - 2]);\
+    return MP_OBJ_FROM_PTR(results);\
+
+#endif // ULAB_MAX_DIMS == 2
+
+#if ULAB_MAX_DIMS == 3
+#define COMPARE_LOOP(results, array, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *((type_out *)(array)) = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? (type_out)(*((type_left *)(larray))) : (type_out)(*((type_right *)(rarray)));\
+                (array) += (results)->strides[ULAB_MAX_DIMS - 1];\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l <  results->shape[ULAB_MAX_DIMS - 1]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k <  results->shape[ULAB_MAX_DIMS - 2]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j <  results->shape[ULAB_MAX_DIMS - 3]);\
+    return MP_OBJ_FROM_PTR(results);\
+
+#endif // ULAB_MAX_DIMS == 3
+
+#if ULAB_MAX_DIMS == 4
+#define COMPARE_LOOP(results, array, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\
+    size_t i = 0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *((type_out *)(array)) = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray)) ? (type_out)(*((type_left *)(larray))) : (type_out)(*((type_right *)(rarray)));\
+                    (array) += (results)->strides[ULAB_MAX_DIMS - 1];\
+                    (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\
+                    (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l <  results->shape[ULAB_MAX_DIMS - 1]);\
+                (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];\
+                (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];\
+                (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k <  results->shape[ULAB_MAX_DIMS - 2]);\
+            (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];\
+            (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];\
+            (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j <  results->shape[ULAB_MAX_DIMS - 3]);\
+        (larray) -= (lstrides)[ULAB_MAX_DIMS - 3] * results->shape[ULAB_MAX_DIMS-3];\
+        (larray) += (lstrides)[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 3] * results->shape[ULAB_MAX_DIMS-3];\
+        (rarray) += (rstrides)[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i <  results->shape[ULAB_MAX_DIMS - 4]);\
+    return MP_OBJ_FROM_PTR(results);\
+
+#endif // ULAB_MAX_DIMS == 4
+
+#define RUN_COMPARE_LOOP(dtype, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, ndim, shape, op) do {\
+    ndarray_obj_t *results = ndarray_new_dense_ndarray((ndim), (shape), (dtype));\
+    uint8_t *array = (uint8_t *)results->array;\
+    if((op) == COMPARE_MINIMUM) {\
+        COMPARE_LOOP(results, array, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, <);\
+    }\
+    if((op) == COMPARE_MAXIMUM) {\
+        COMPARE_LOOP(results, array, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, >);\
+    }\
+} while(0)
+
+#endif
--- a/code/numpy/fft/fft.c
+++ b/code/numpy/fft/fft.c
@ -0,0 +1,82 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "py/runtime.h"
+#include "py/builtin.h"
+#include "py/binary.h"
+#include "py/obj.h"
+#include "py/objarray.h"
+
+#include "fft.h"
+
+//| """Frequency-domain functions"""
+//|
+
+
+//| def fft(r: ulab.ndarray, c: Optional[ulab.ndarray] = None) -> Tuple[ulab.ndarray, ulab.ndarray]:
+//|     """
+//|     :param ulab.ndarray r: A 1-dimensional array of values whose length is a power of 2
+//|     :param ulab.ndarray c: An optional 1-dimensional array of values whose length is a power of 2, giving the imaginary part of the value
+//|     :return tuple (r, c): The real and imaginary parts of the FFT
+//|
+//|     Perform a Fast Fourier Transform from the time domain into the frequency domain
+//|
+//|     See also ~ulab.extras.spectrum, which computes the magnitude of the fft,
+//|     rather than separately returning its real and imaginary parts."""
+//|     ...
+//|
+static mp_obj_t fft_fft(size_t n_args, const mp_obj_t *args) {
+    if(n_args == 2) {
+        return fft_fft_ifft_spectrogram(n_args, args[0], args[1], FFT_FFT);
+    } else {
+        return fft_fft_ifft_spectrogram(n_args, args[0], mp_const_none, FFT_FFT);
+    }
+}
+
+MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(fft_fft_obj, 1, 2, fft_fft);
+
+//| def ifft(r: ulab.ndarray, c: Optional[ulab.ndarray] = None) -> Tuple[ulab.ndarray, ulab.ndarray]:
+//|     """
+//|     :param ulab.ndarray r: A 1-dimensional array of values whose length is a power of 2
+//|     :param ulab.ndarray c: An optional 1-dimensional array of values whose length is a power of 2, giving the imaginary part of the value
+//|     :return tuple (r, c): The real and imaginary parts of the inverse FFT
+//|
+//|     Perform an Inverse Fast Fourier Transform from the frequency domain into the time domain"""
+//|     ...
+//|
+
+static mp_obj_t fft_ifft(size_t n_args, const mp_obj_t *args) {
+    if(n_args == 2) {
+        return fft_fft_ifft_spectrogram(n_args, args[0], args[1], FFT_IFFT);
+    } else {
+        return fft_fft_ifft_spectrogram(n_args, args[0], mp_const_none, FFT_IFFT);
+    }
+}
+
+MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(fft_ifft_obj, 1, 2, fft_ifft);
+
+STATIC const mp_rom_map_elem_t ulab_fft_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_fft) },
+    { MP_OBJ_NEW_QSTR(MP_QSTR_fft), (mp_obj_t)&fft_fft_obj },
+    { MP_OBJ_NEW_QSTR(MP_QSTR_ifft), (mp_obj_t)&fft_ifft_obj },
+};
+
+STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_fft_globals, ulab_fft_globals_table);
+
+mp_obj_module_t ulab_fft_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_fft_globals,
+};
--- a/code/numpy/fft/fft.h
+++ b/code/numpy/fft/fft.h
@ -0,0 +1,24 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#ifndef _FFT_
+#define _FFT_
+
+#include "../../ulab.h"
+#include "../../ulab_tools.h"
+#include "../../ndarray.h"
+#include "fft_tools.h"
+
+extern mp_obj_module_t ulab_fft_module;
+
+MP_DECLARE_CONST_FUN_OBJ_VAR_BETWEEN(fft_fft_obj);
+MP_DECLARE_CONST_FUN_OBJ_VAR_BETWEEN(fft_ifft_obj);
+#endif
--- a/code/numpy/fft/fft_tools.c
+++ b/code/numpy/fft/fft_tools.c
@ -0,0 +1,165 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#include <math.h>
+#include "py/runtime.h"
+
+#include "../../ndarray.h"
+#include "../../ulab_tools.h"
+#include "fft_tools.h"
+
+#ifndef MP_PI
+#define MP_PI MICROPY_FLOAT_CONST(3.14159265358979323846)
+#endif
+#ifndef MP_E
+#define MP_E MICROPY_FLOAT_CONST(2.71828182845904523536)
+#endif
+
+/*
+ * The following function takes two arrays, namely, the real and imaginary
+ * parts of a complex array, and calculates the Fourier transform in place.
+ *
+ * The function is basically a modification of four1 from Numerical Recipes,
+ * has no dependencies beyond micropython itself (for the definition of mp_float_t),
+ * and can be used independent of ulab.
+ */
+
+void fft_kernel(mp_float_t *real, mp_float_t *imag, size_t n, int isign) {
+    size_t j, m, mmax, istep;
+    mp_float_t tempr, tempi;
+    mp_float_t wtemp, wr, wpr, wpi, wi, theta;
+
+    j = 0;
+    for(size_t i = 0; i < n; i++) {
+        if (j > i) {
+            SWAP(mp_float_t, real[i], real[j]);
+            SWAP(mp_float_t, imag[i], imag[j]);
+        }
+        m = n >> 1;
+        while (j >= m && m > 0) {
+            j -= m;
+            m >>= 1;
+        }
+        j += m;
+    }
+
+    mmax = 1;
+    while (n > mmax) {
+        istep = mmax << 1;
+        theta = MICROPY_FLOAT_CONST(-2.0)*isign*MP_PI/istep;
+        wtemp = MICROPY_FLOAT_C_FUN(sin)(MICROPY_FLOAT_CONST(0.5) * theta);
+        wpr = MICROPY_FLOAT_CONST(-2.0) * wtemp * wtemp;
+        wpi = MICROPY_FLOAT_C_FUN(sin)(theta);
+        wr = MICROPY_FLOAT_CONST(1.0);
+        wi = MICROPY_FLOAT_CONST(0.0);
+        for(m = 0; m < mmax; m++) {
+            for(size_t i = m; i < n; i += istep) {
+                j = i + mmax;
+                tempr = wr * real[j] - wi * imag[j];
+                tempi = wr * imag[j] + wi * real[j];
+                real[j] = real[i] - tempr;
+                imag[j] = imag[i] - tempi;
+                real[i] += tempr;
+                imag[i] += tempi;
+            }
+            wtemp = wr;
+            wr = wr*wpr - wi*wpi + wr;
+            wi = wi*wpr + wtemp*wpi + wi;
+        }
+        mmax = istep;
+    }
+}
+
+/*
+ * The following function is a helper interface to the python side.
+ * It has been factored out from fft.c, so that the same argument parsing
+ * routine can be called from scipy.signal.spectrogram.
+ */
+
+mp_obj_t fft_fft_ifft_spectrogram(size_t n_args, mp_obj_t arg_re, mp_obj_t arg_im, uint8_t type) {
+    if(!MP_OBJ_IS_TYPE(arg_re, &ulab_ndarray_type)) {
+        mp_raise_NotImplementedError(translate("FFT is defined for ndarrays only"));
+    }
+    if(n_args == 2) {
+        if(!MP_OBJ_IS_TYPE(arg_im, &ulab_ndarray_type)) {
+            mp_raise_NotImplementedError(translate("FFT is defined for ndarrays only"));
+        }
+    }
+    ndarray_obj_t *re = MP_OBJ_TO_PTR(arg_re);
+    #if ULAB_MAX_DIMS > 1
+    if(re->ndim != 1) {
+        mp_raise_TypeError(translate("FFT is implemented for linear arrays only"));
+    }
+    #endif
+    size_t len = re->len;
+    // Check if input is of length of power of 2
+    if((len & (len-1)) != 0) {
+        mp_raise_ValueError(translate("input array length must be power of 2"));
+    }
+
+    ndarray_obj_t *out_re = ndarray_new_linear_array(len, NDARRAY_FLOAT);
+    mp_float_t *data_re = (mp_float_t *)out_re->array;
+
+    uint8_t *array = (uint8_t *)re->array;
+    mp_float_t (*func)(void *) = ndarray_get_float_function(re->dtype);
+
+    for(size_t i=0; i < len; i++) {
+        *data_re++ = func(array);
+        array += re->strides[ULAB_MAX_DIMS - 1];
+    }
+    data_re -= len;
+    ndarray_obj_t *out_im = ndarray_new_linear_array(len, NDARRAY_FLOAT);
+    mp_float_t *data_im = (mp_float_t *)out_im->array;
+
+    if(n_args == 2) {
+        ndarray_obj_t *im = MP_OBJ_TO_PTR(arg_im);
+        #if ULAB_MAX_DIMS > 1
+        if(im->ndim != 1) {
+            mp_raise_TypeError(translate("FFT is implemented for linear arrays only"));
+        }
+        #endif
+        if (re->len != im->len) {
+            mp_raise_ValueError(translate("real and imaginary parts must be of equal length"));
+        }
+        array = (uint8_t *)im->array;
+        func = ndarray_get_float_function(im->dtype);
+        for(size_t i=0; i < len; i++) {
+           *data_im++ = func(array);
+           array += im->strides[ULAB_MAX_DIMS - 1];
+        }
+        data_im -= len;
+    }
+
+    if((type == FFT_FFT) || (type == FFT_SPECTROGRAM)) {
+        fft_kernel(data_re, data_im, len, 1);
+        if(type == FFT_SPECTROGRAM) {
+            for(size_t i=0; i < len; i++) {
+                *data_re = MICROPY_FLOAT_C_FUN(sqrt)(*data_re * *data_re + *data_im * *data_im);
+                data_re++;
+                data_im++;
+            }
+        }
+    } else { // inverse transform
+        fft_kernel(data_re, data_im, len, -1);
+        // TODO: numpy accepts the norm keyword argument
+        for(size_t i=0; i < len; i++) {
+            *data_re++ /= len;
+            *data_im++ /= len;
+        }
+    }
+    if(type == FFT_SPECTROGRAM) {
+        return MP_OBJ_FROM_PTR(out_re);
+    } else {
+        mp_obj_t tuple[2];
+        tuple[0] = MP_OBJ_FROM_PTR(out_re);
+        tuple[1] = MP_OBJ_FROM_PTR(out_im);
+        return mp_obj_new_tuple(2, tuple);
+    }
+}
--- a/code/numpy/fft/fft_tools.h
+++ b/code/numpy/fft/fft_tools.h
@ -0,0 +1,23 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#ifndef _FFT_TOOLS_
+#define _FFT_TOOLS_
+
+enum FFT_TYPE {
+    FFT_FFT,
+    FFT_IFFT,
+    FFT_SPECTROGRAM,
+};
+
+void fft_kernel(mp_float_t *, mp_float_t *, size_t , int );
+mp_obj_t fft_fft_ifft_spectrogram(size_t , mp_obj_t , mp_obj_t , uint8_t );
+
+#endif /* _FFT_TOOLS_ */
--- a/code/numpy/filter/filter.c
+++ b/code/numpy/filter/filter.c
@ -0,0 +1,84 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020-2021 Zoltán Vörös
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+#include "py/obj.h"
+#include "py/runtime.h"
+#include "py/misc.h"
+
+#include "../../ulab.h"
+#include "../../scipy/signal/signal.h"
+#include "filter.h"
+
+#if ULAB_NUMPY_HAS_CONVOLVE
+
+mp_obj_t filter_convolve(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_a, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_v, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type) || !MP_OBJ_IS_TYPE(args[1].u_obj, &ulab_ndarray_type)) {
+        mp_raise_TypeError(translate("convolve arguments must be ndarrays"));
+    }
+
+    ndarray_obj_t *a = MP_OBJ_TO_PTR(args[0].u_obj);
+    ndarray_obj_t *c = MP_OBJ_TO_PTR(args[1].u_obj);
+    // deal with linear arrays only
+    #if ULAB_MAX_DIMS > 1
+    if((a->ndim != 1) || (c->ndim != 1)) {
+        mp_raise_TypeError(translate("convolve arguments must be linear arrays"));
+    }
+    #endif
+    size_t len_a = a->len;
+    size_t len_c = c->len;
+    if(len_a == 0 || len_c == 0) {
+        mp_raise_TypeError(translate("convolve arguments must not be empty"));
+    }
+
+    int len = len_a + len_c - 1; // convolve mode "full"
+    ndarray_obj_t *out = ndarray_new_linear_array(len, NDARRAY_FLOAT);
+    mp_float_t *outptr = (mp_float_t *)out->array;
+    uint8_t *aarray = (uint8_t *)a->array;
+    uint8_t *carray = (uint8_t *)c->array;
+    
+    int32_t off = len_c - 1;
+    int32_t as = a->strides[ULAB_MAX_DIMS - 1] / a->itemsize;
+    int32_t cs = c->strides[ULAB_MAX_DIMS - 1] / c->itemsize;
+
+    for(int32_t k=-off; k < len-off; k++) {
+        mp_float_t accum = (mp_float_t)0.0;
+        int32_t top_n = MIN(len_c, len_a - k);
+        int32_t bot_n = MAX(-k, 0);
+        for(int32_t n=bot_n; n < top_n; n++) {
+            int32_t idx_c = (len_c - n - 1) * cs;
+            int32_t idx_a = (n + k) * as;
+            mp_float_t ai = ndarray_get_float_index(aarray, a->dtype, idx_a);
+            mp_float_t ci = ndarray_get_float_index(carray, c->dtype, idx_c);
+            accum += ai * ci;
+        }
+        *outptr++ = accum;
+    }
+
+    return MP_OBJ_FROM_PTR(out);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(filter_convolve_obj, 2, filter_convolve);
+
+#endif
--- a/code/numpy/filter/filter.h
+++ b/code/numpy/filter/filter.h
@ -7,19 +7,14 @@
 * The MIT License (MIT)
 *
 * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2020-2021 Zoltán Vörös
 */

 #ifndef _FILTER_
 #define _FILTER_

-#include "ulab.h"
-#include "ndarray.h"
-
-#if ULAB_FILTER_MODULE
-
-extern mp_obj_module_t ulab_filter_module;
+#include "../../ulab.h"
+#include "../../ndarray.h"

 MP_DECLARE_CONST_FUN_OBJ_KW(filter_convolve_obj);
-
-#endif
 #endif
--- a/code/numpy/linalg/linalg.c
+++ b/code/numpy/linalg/linalg.c
@ -0,0 +1,450 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020 Roberto Colistete Jr.
+ *               2020 Taku Fukada
+ *
+*/
+
+#include <stdlib.h>
+#include <string.h>
+#include <math.h>
+#include "py/obj.h"
+#include "py/runtime.h"
+#include "py/misc.h"
+
+#include "../../ulab.h"
+#include "../../ulab_tools.h"
+#include "linalg.h"
+
+#if ULAB_NUMPY_HAS_LINALG_MODULE
+//| """Linear algebra functions"""
+//|
+
+#if ULAB_MAX_DIMS > 1
+static ndarray_obj_t *linalg_object_is_square(mp_obj_t obj) {
+    // Returns an ndarray, if the object is a square ndarray,
+    // raises the appropriate exception otherwise
+    if(!MP_OBJ_IS_TYPE(obj, &ulab_ndarray_type)) {
+        mp_raise_TypeError(translate("size is defined for ndarrays only"));
+    }
+    ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(obj);
+    if((ndarray->shape[ULAB_MAX_DIMS - 1] != ndarray->shape[ULAB_MAX_DIMS - 2]) || (ndarray->ndim != 2)) {
+        mp_raise_ValueError(translate("input must be square matrix"));
+    }
+    return ndarray;
+}
+#endif
+
+#if ULAB_MAX_DIMS > 1
+//| def cholesky(A: ulab.ndarray) -> ulab.ndarray:
+//|     """
+//|     :param ~ulab.ndarray A: a positive definite, symmetric square matrix
+//|     :return ~ulab.ndarray L: a square root matrix in the lower triangular form
+//|     :raises ValueError: If the input does not fulfill the necessary conditions
+//|
+//|     The returned matrix satisfies the equation A = LL*"""
+//|     ...
+//|
+
+static mp_obj_t linalg_cholesky(mp_obj_t oin) {
+    ndarray_obj_t *ndarray = linalg_object_is_square(oin);
+    ndarray_obj_t *L = ndarray_new_dense_ndarray(2, ndarray_shape_vector(0, 0, ndarray->shape[ULAB_MAX_DIMS - 1], ndarray->shape[ULAB_MAX_DIMS - 1]), NDARRAY_FLOAT);
+    mp_float_t *Larray = (mp_float_t *)L->array;
+
+    size_t N = ndarray->shape[ULAB_MAX_DIMS - 1];
+    uint8_t *array = (uint8_t *)ndarray->array;
+    mp_float_t (*func)(void *) = ndarray_get_float_function(ndarray->dtype);
+
+    for(size_t m=0; m < N; m++) { // rows
+        for(size_t n=0; n < N; n++) { // columns
+            *Larray++ = func(array);
+            array += ndarray->strides[ULAB_MAX_DIMS - 1];
+        }
+        array -= ndarray->strides[ULAB_MAX_DIMS - 1] * N;
+        array += ndarray->strides[ULAB_MAX_DIMS - 2];
+    }
+    Larray -= N*N;
+    // make sure the matrix is symmetric
+    for(size_t m=0; m < N; m++) { // rows
+        for(size_t n=m+1; n < N; n++) { // columns
+            // compare entry (m, n) to (n, m)
+            if(LINALG_EPSILON < MICROPY_FLOAT_C_FUN(fabs)(Larray[m * N + n] - Larray[n * N + m])) {
+                mp_raise_ValueError(translate("input matrix is asymmetric"));
+            }
+        }
+    }
+
+    // this is actually not needed, but Cholesky in numpy returns the lower triangular matrix
+    for(size_t i=0; i < N; i++) { // rows
+        for(size_t j=i+1; j < N; j++) { // columns
+            Larray[i*N + j] = MICROPY_FLOAT_CONST(0.0);
+        }
+    }
+    mp_float_t sum = 0.0;
+    for(size_t i=0; i < N; i++) { // rows
+        for(size_t j=0; j <= i; j++) { // columns
+            sum = Larray[i * N + j];
+            for(size_t k=0; k < j; k++) {
+                sum -= Larray[i * N + k] * Larray[j * N + k];
+            }
+            if(i == j) {
+                if(sum <= MICROPY_FLOAT_CONST(0.0)) {
+                    mp_raise_ValueError(translate("matrix is not positive definite"));
+                } else {
+                    Larray[i * N + i] = MICROPY_FLOAT_C_FUN(sqrt)(sum);
+                }
+            } else {
+                Larray[i * N + j] = sum / Larray[j * N + j];
+            }
+        }
+    }
+    return MP_OBJ_FROM_PTR(L);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(linalg_cholesky_obj, linalg_cholesky);
+
+//| def det(m: ulab.ndarray) -> float:
+//|     """
+//|     :param ~ulab.ndarray m: a square matrix
+//|     :return float: The determinant of the matrix
+//|
+//|     Computes the determinant of a square matrix"""
+//|     ...
+//|
+
+static mp_obj_t linalg_det(mp_obj_t oin) {
+    ndarray_obj_t *ndarray = linalg_object_is_square(oin);
+    uint8_t *array = (uint8_t *)ndarray->array;
+    size_t N = ndarray->shape[ULAB_MAX_DIMS - 1];
+    mp_float_t *tmp = m_new(mp_float_t, N * N);
+    for(size_t m=0; m < N; m++) { // rows
+        for(size_t n=0; n < N; n++) { // columns
+            *tmp++ = ndarray_get_float_value(array, ndarray->dtype);
+            array += ndarray->strides[ULAB_MAX_DIMS - 1];
+        }
+        array -= ndarray->strides[ULAB_MAX_DIMS - 1] * N;
+        array += ndarray->strides[ULAB_MAX_DIMS - 2];
+    }
+
+    // re-wind the pointer
+    tmp -= N*N;
+
+    mp_float_t c;
+    mp_float_t det_sign = 1.0;
+
+    for(size_t m=0; m < N-1; m++){
+        if(MICROPY_FLOAT_C_FUN(fabs)(tmp[m * (N+1)]) < LINALG_EPSILON) {
+            size_t m1 = m + 1;
+            for(; m1 < N; m1++) {
+                if(!(MICROPY_FLOAT_C_FUN(fabs)(tmp[m1*N+m]) < LINALG_EPSILON)) {
+                     //look for a line to swap
+                    for(size_t m2=0; m2 < N; m2++) {
+                        mp_float_t swapVal = tmp[m*N+m2];
+                        tmp[m*N+m2] = tmp[m1*N+m2];
+                        tmp[m1*N+m2] = swapVal;
+                    }
+                    det_sign = -det_sign;
+                    break;
+                }
+            }
+            if (m1 >= N) {
+                m_del(mp_float_t, tmp, N * N);
+                return mp_obj_new_float(0.0);
+            }
+        }
+        for(size_t n=0; n < N; n++) {
+            if(m != n) {
+                c = tmp[N * n + m] / tmp[m * (N+1)];
+                for(size_t k=0; k < N; k++){
+                    tmp[N * n + k] -= c * tmp[N * m + k];
+                }
+            }
+        }
+    }
+    mp_float_t det = det_sign;
+
+    for(size_t m=0; m < N; m++){
+        det *= tmp[m * (N+1)];
+    }
+    m_del(mp_float_t, tmp, N * N);
+    return mp_obj_new_float(det);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(linalg_det_obj, linalg_det);
+
+#endif
+
+//| def dot(m1: ulab.ndarray, m2: ulab.ndarray) -> Union[ulab.ndarray, float]:
+//|    """
+//|    :param ~ulab.ndarray m1: a matrix, or a vector
+//|    :param ~ulab.ndarray m2: a matrix, or a vector
+//|
+//|    Computes the product of two matrices, or two vectors. In the latter case, the inner product is returned."""
+//|    ...
+//|
+
+static mp_obj_t linalg_dot(mp_obj_t _m1, mp_obj_t _m2) {
+    // TODO: should the results be upcast?
+    // This implements 2D operations only!
+    if(!MP_OBJ_IS_TYPE(_m1, &ulab_ndarray_type) || !MP_OBJ_IS_TYPE(_m2, &ulab_ndarray_type)) {
+        mp_raise_TypeError(translate("arguments must be ndarrays"));
+    }
+    ndarray_obj_t *m1 = MP_OBJ_TO_PTR(_m1);
+    ndarray_obj_t *m2 = MP_OBJ_TO_PTR(_m2);
+
+    #if ULAB_MAX_DIMS > 1
+    if ((m1->ndim == 1) && (m2->ndim == 1)) {
+    #endif
+        // 2 vectors
+        if (m1->len != m2->len) {
+            mp_raise_ValueError(translate("vectors must have same lengths"));
+        }
+        mp_float_t dot = 0.0;
+        uint8_t *array1 = (uint8_t *)m1->array;
+        uint8_t *array2 = (uint8_t *)m2->array;
+        for (size_t i=0; i < m1->len; i++) {
+            dot += ndarray_get_float_value(array1, m1->dtype)*ndarray_get_float_value(array2, m2->dtype);
+            array1 += m1->strides[ULAB_MAX_DIMS - 1];
+            array2 += m2->strides[ULAB_MAX_DIMS - 1];
+        }
+        return mp_obj_new_float(dot);
+    #if ULAB_MAX_DIMS > 1
+    } else {
+        // 2 matrices
+        if(m1->shape[ULAB_MAX_DIMS - 1] != m2->shape[ULAB_MAX_DIMS - 2]) {
+            mp_raise_ValueError(translate("matrix dimensions do not match"));
+        }
+        size_t *shape = ndarray_shape_vector(0, 0, m1->shape[ULAB_MAX_DIMS - 2], m2->shape[ULAB_MAX_DIMS - 1]);
+        ndarray_obj_t *out = ndarray_new_dense_ndarray(2, shape, NDARRAY_FLOAT);
+        mp_float_t *outdata = (mp_float_t *)out->array;
+        for(size_t i=0; i < m1->shape[ULAB_MAX_DIMS - 2]; i++) { // rows of m1
+            for(size_t j=0; j < m2->shape[ULAB_MAX_DIMS - 1]; j++) { // columns of m2
+                mp_float_t sum = 0.0, v1, v2;
+                for(size_t k=0; k < m2->shape[ULAB_MAX_DIMS - 2]; k++) {
+                    // (i, k) * (k, j)
+                    size_t pos1 = i*m1->shape[ULAB_MAX_DIMS - 1]+k;
+                    size_t pos2 = k*m2->shape[ULAB_MAX_DIMS - 1]+j;
+                    v1 = ndarray_get_float_index(m1->array, m1->dtype, pos1);
+                    v2 = ndarray_get_float_index(m2->array, m2->dtype, pos2);
+                    sum += v1 * v2;
+                }
+                *outdata++ = sum;
+            }
+        }
+        return MP_OBJ_FROM_PTR(out);
+    }
+    #endif
+}
+
+MP_DEFINE_CONST_FUN_OBJ_2(linalg_dot_obj, linalg_dot);
+
+#if ULAB_MAX_DIMS > 1
+//| def eig(m: ulab.ndarray) -> Tuple[ulab.ndarray, ulab.ndarray]:
+//|     """
+//|     :param m: a square matrix
+//|     :return tuple (eigenvalues, eigenvectors):
+//|
+//|     Computes the eigenvalues and eigenvectors of a square matrix"""
+//|     ...
+//|
+
+static mp_obj_t linalg_eig(mp_obj_t oin) {
+    ndarray_obj_t *in = linalg_object_is_square(oin);
+    uint8_t *iarray = (uint8_t *)in->array;
+    size_t S = in->shape[ULAB_MAX_DIMS - 1];
+    mp_float_t *array = m_new(mp_float_t, S*S);
+    for(size_t i=0; i < S; i++) { // rows
+        for(size_t j=0; j < S; j++) { // columns
+            *array++ = ndarray_get_float_value(iarray, in->dtype);
+            iarray += in->strides[ULAB_MAX_DIMS - 1];
+        }
+        iarray -= in->strides[ULAB_MAX_DIMS - 1] * S;
+        iarray += in->strides[ULAB_MAX_DIMS - 2];
+    }
+    array -= S * S;
+    // make sure the matrix is symmetric
+    for(size_t m=0; m < S; m++) {
+        for(size_t n=m+1; n < S; n++) {
+            // compare entry (m, n) to (n, m)
+            // TODO: this must probably be scaled!
+            if(LINALG_EPSILON < MICROPY_FLOAT_C_FUN(fabs)(array[m * S + n] - array[n * S + m])) {
+                mp_raise_ValueError(translate("input matrix is asymmetric"));
+            }
+        }
+    }
+
+    // if we got this far, then the matrix is symmetric
+
+    ndarray_obj_t *eigenvectors = ndarray_new_dense_ndarray(2, ndarray_shape_vector(0, 0, S, S), NDARRAY_FLOAT);
+    mp_float_t *eigvectors = (mp_float_t *)eigenvectors->array;
+
+    size_t iterations = linalg_jacobi_rotations(array, eigvectors, S);
+
+    if(iterations == 0) {
+        // the computation did not converge; numpy raises LinAlgError
+        m_del(mp_float_t, array, in->len);
+        mp_raise_ValueError(translate("iterations did not converge"));
+    }
+    ndarray_obj_t *eigenvalues = ndarray_new_linear_array(S, NDARRAY_FLOAT);
+    mp_float_t *eigvalues = (mp_float_t *)eigenvalues->array;
+    for(size_t i=0; i < S; i++) {
+        eigvalues[i] = array[i * (S + 1)];
+    }
+    m_del(mp_float_t, array, in->len);
+
+    mp_obj_tuple_t *tuple = MP_OBJ_TO_PTR(mp_obj_new_tuple(2, NULL));
+    tuple->items[0] = MP_OBJ_FROM_PTR(eigenvalues);
+    tuple->items[1] = MP_OBJ_FROM_PTR(eigenvectors);
+    return MP_OBJ_FROM_PTR(tuple);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(linalg_eig_obj, linalg_eig);
+
+//| def inv(m: ulab.ndarray) -> ulab.ndarray:
+//|     """
+//|     :param ~ulab.ndarray m: a square matrix
+//|     :return: The inverse of the matrix, if it exists
+//|     :raises ValueError: if the matrix is not invertible
+//|
+//|     Computes the inverse of a square matrix"""
+//|     ...
+//|
+static mp_obj_t linalg_inv(mp_obj_t o_in) {
+    ndarray_obj_t *ndarray = linalg_object_is_square(o_in);
+    uint8_t *array = (uint8_t *)ndarray->array;
+    size_t N = ndarray->shape[ULAB_MAX_DIMS - 1];
+    ndarray_obj_t *inverted = ndarray_new_dense_ndarray(2, ndarray_shape_vector(0, 0, N, N), NDARRAY_FLOAT);
+    mp_float_t *iarray = (mp_float_t *)inverted->array;
+
+    mp_float_t (*func)(void *) = ndarray_get_float_function(ndarray->dtype);
+
+    for(size_t i=0; i < N; i++) { // rows
+        for(size_t j=0; j < N; j++) { // columns
+            *iarray++ = func(array);
+            array += ndarray->strides[ULAB_MAX_DIMS - 1];
+        }
+        array -= ndarray->strides[ULAB_MAX_DIMS - 1] * N;
+        array += ndarray->strides[ULAB_MAX_DIMS - 2];
+    }
+    // re-wind the pointer
+    iarray -= N*N;
+
+    if(!linalg_invert_matrix(iarray, N)) {
+        mp_raise_ValueError(translate("input matrix is singular"));
+    }
+    return MP_OBJ_FROM_PTR(inverted);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(linalg_inv_obj, linalg_inv);
+#endif
+
+//| def norm(x: ulab.ndarray) -> float:
+//|    """
+//|    :param ~ulab.ndarray x: a vector or a matrix
+//|
+//|    Computes the 2-norm of a vector or the Frobenius norm of a matrix, i.e., ``sqrt(sum(x*x))``, but without the RAM overhead of the intermediate array."""
+//|    ...
+//|
+
+static mp_obj_t linalg_norm(mp_obj_t _x) {
+    if (!MP_OBJ_IS_TYPE(_x, &ulab_ndarray_type)) {
+        mp_raise_TypeError(translate("argument must be ndarray"));
+    }
+    ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(_x);
+    if((ndarray->ndim != 1) && (ndarray->ndim != 2)) {
+        mp_raise_ValueError(translate("norm is defined for 1D and 2D arrays"));
+    }
+    mp_float_t dot = 0.0;
+    uint8_t *array = (uint8_t *)ndarray->array;
+
+    mp_float_t (*func)(void *) = ndarray_get_float_function(ndarray->dtype);
+
+    size_t k = 0;
+    do {
+        size_t l = 0;
+        do {
+            mp_float_t v = func(array);
+            array += ndarray->strides[ULAB_MAX_DIMS - 1];
+            dot += v*v;
+            l++;
+        } while(l < ndarray->shape[ULAB_MAX_DIMS - 1]);
+        array -= ndarray->strides[ULAB_MAX_DIMS - 1] * ndarray->shape[ULAB_MAX_DIMS - 1];
+        array += ndarray->strides[ULAB_MAX_DIMS - 2];
+        k++;
+    } while(k < ndarray->shape[ULAB_MAX_DIMS - 2]);
+    return mp_obj_new_float(MICROPY_FLOAT_C_FUN(sqrt)(dot));
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(linalg_norm_obj, linalg_norm);
+
+#if ULAB_MAX_DIMS > 1
+#if ULAB_LINALG_HAS_TRACE
+
+//| def trace(m: ulab.ndarray) -> float:
+//|     """
+//|     :param m: a square matrix
+//|
+//|     Compute the trace of the matrix, the sum of its diagonal elements."""
+//|     ...
+//|
+
+static mp_obj_t linalg_trace(mp_obj_t oin) {
+    ndarray_obj_t *ndarray = linalg_object_is_square(oin);
+    mp_float_t trace = 0.0;
+    for(size_t i=0; i < ndarray->shape[ULAB_MAX_DIMS - 1]; i++) {
+        int32_t pos = i * (ndarray->strides[ULAB_MAX_DIMS - 1] + ndarray->strides[ULAB_MAX_DIMS - 2]);
+        trace += ndarray_get_float_index(ndarray->array, ndarray->dtype, pos/ndarray->itemsize);
+    }
+    if(ndarray->dtype == NDARRAY_FLOAT) {
+        return mp_obj_new_float(trace);
+    }
+    return mp_obj_new_int_from_float(trace);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(linalg_trace_obj, linalg_trace);
+#endif
+#endif
+
+STATIC const mp_rom_map_elem_t ulab_linalg_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_linalg) },
+    #if ULAB_MAX_DIMS > 1
+    #if ULAB_LINALG_HAS_CHOLESKY
+    { MP_ROM_QSTR(MP_QSTR_cholesky), (mp_obj_t)&linalg_cholesky_obj },
+    #endif
+    #if ULAB_LINALG_HAS_DET
+    { MP_ROM_QSTR(MP_QSTR_det), (mp_obj_t)&linalg_det_obj },
+    #endif
+    #if ULAB_LINALG_HAS_EIG
+    { MP_ROM_QSTR(MP_QSTR_eig), (mp_obj_t)&linalg_eig_obj },
+    #endif
+    #if ULAB_LINALG_HAS_INV
+    { MP_ROM_QSTR(MP_QSTR_inv), (mp_obj_t)&linalg_inv_obj },
+    #endif
+    #if ULAB_LINALG_HAS_TRACE
+    { MP_ROM_QSTR(MP_QSTR_trace), (mp_obj_t)&linalg_trace_obj },
+    #endif
+    #endif
+    #if ULAB_LINALG_HAS_DOT
+    { MP_ROM_QSTR(MP_QSTR_dot), (mp_obj_t)&linalg_dot_obj },
+    #endif
+    #if ULAB_LINALG_HAS_NORM
+    { MP_ROM_QSTR(MP_QSTR_norm), (mp_obj_t)&linalg_norm_obj },
+    #endif
+};
+
+STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_linalg_globals, ulab_linalg_globals_table);
+
+mp_obj_module_t ulab_linalg_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_linalg_globals,
+};
+
+#endif
--- a/code/numpy/linalg/linalg.h
+++ b/code/numpy/linalg/linalg.h
@ -0,0 +1,28 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#ifndef _LINALG_
+#define _LINALG_
+
+#include "../../ulab.h"
+#include "../../ndarray.h"
+#include "linalg_tools.h"
+
+extern mp_obj_module_t ulab_linalg_module;
+
+MP_DECLARE_CONST_FUN_OBJ_1(linalg_cholesky_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(linalg_det_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(linalg_eig_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(linalg_inv_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(linalg_trace_obj);
+MP_DECLARE_CONST_FUN_OBJ_2(linalg_dot_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(linalg_norm_obj);
+#endif
--- a/code/numpy/linalg/linalg_tools.c
+++ b/code/numpy/linalg/linalg_tools.c
@ -0,0 +1,171 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#include <math.h>
+#include <string.h>
+#include "py/runtime.h"
+
+#include "linalg_tools.h"
+
+/* 
+ * The following function inverts a matrix, whose entries are given in the input array 
+ * The function has no dependencies beyond micropython itself (for the definition of mp_float_t),
+ * and can be used independent of ulab.
+ */
+
+bool linalg_invert_matrix(mp_float_t *data, size_t N) {
+    // returns true, if the inversion was successful,
+    // false, if the matrix is singular
+
+    // initially, this is the unit matrix: the contents of this matrix are what
+    // will be returned after all the transformations
+    mp_float_t *unit = m_new(mp_float_t, N*N);
+    mp_float_t elem = 1.0;
+    // initialise the unit matrix
+    memset(unit, 0, sizeof(mp_float_t)*N*N);
+    for(size_t m=0; m < N; m++) {
+        memcpy(&unit[m * (N+1)], &elem, sizeof(mp_float_t));
+    }
+    for(size_t m=0; m < N; m++){
+        // this could be faster with ((c < epsilon) && (c > -epsilon))
+        if(MICROPY_FLOAT_C_FUN(fabs)(data[m * (N+1)]) < LINALG_EPSILON) {
+            //look for a line to swap
+            size_t m1 = m + 1;
+            for(; m1 < N; m1++) {
+                if(!(MICROPY_FLOAT_C_FUN(fabs)(data[m1*N + m]) < LINALG_EPSILON)) {
+                    for(size_t m2=0; m2 < N; m2++) {
+                        mp_float_t swapVal = data[m*N+m2];
+                        data[m*N+m2] = data[m1*N+m2];
+                        data[m1*N+m2] = swapVal;
+                        swapVal = unit[m*N+m2];
+                        unit[m*N+m2] = unit[m1*N+m2];
+                        unit[m1*N+m2] = swapVal;
+                    }
+                    break;
+                }
+            }
+            if (m1 >= N) {
+                m_del(mp_float_t, unit, N*N);
+                return false;
+            }
+        }
+        for(size_t n=0; n < N; n++) {
+            if(m != n){
+                elem = data[N * n + m] / data[m * (N+1)];
+                for(size_t k=0; k < N; k++) {
+                    data[N * n + k] -= elem * data[N * m + k];
+                    unit[N * n + k] -= elem * unit[N * m + k];
+                }
+            }
+        }
+    }
+    for(size_t m=0; m < N; m++) {
+        elem = data[m * (N+1)];
+        for(size_t n=0; n < N; n++) {
+            data[N * m + n] /= elem;
+            unit[N * m + n] /= elem;
+        }
+    }
+    memcpy(data, unit, sizeof(mp_float_t)*N*N);
+    m_del(mp_float_t, unit, N * N);
+    return true;
+}
+
+/* 
+ * The following function calculates the eigenvalues and eigenvectors of a symmetric 
+ * real matrix, whose entries are given in the input array. 
+ * The function has no dependencies beyond micropython itself (for the definition of mp_float_t),
+ * and can be used independent of ulab.
+ */
+
+size_t linalg_jacobi_rotations(mp_float_t *array, mp_float_t *eigvectors, size_t S) {
+    // eigvectors should be a 0-array; start out with the unit matrix
+    for(size_t m=0; m < S; m++) {
+        eigvectors[m * (S+1)] = 1.0;
+    }
+    mp_float_t largest, w, t, c, s, tau, aMk, aNk, vm, vn;
+    size_t M, N;
+    size_t iterations = JACOBI_MAX * S * S;
+    do {
+        iterations--;
+        // find the pivot here
+        M = 0;
+        N = 0;
+        largest = 0.0;
+        for(size_t m=0; m < S-1; m++) { // -1: no need to inspect last row
+            for(size_t n=m+1; n < S; n++) {
+                w = MICROPY_FLOAT_C_FUN(fabs)(array[m * S + n]);
+                if((largest < w) && (LINALG_EPSILON < w)) {
+                    M = m;
+                    N = n;
+                    largest = w;
+                }
+            }
+        }
+        if(M + N == 0) { // all entries are smaller than epsilon, there is not much we can do...
+            break;
+        }
+        // at this point, we have the pivot, and it is the entry (M, N)
+        // now we have to find the rotation angle
+        w = (array[N * S + N] - array[M * S + M]) / (MICROPY_FLOAT_CONST(2.0)*array[M * S + N]);
+        // The following if/else chooses the smaller absolute value for the tangent
+        // of the rotation angle. Going with the smaller should be more numerically stable.
+        if(w > 0) {
+            t = MICROPY_FLOAT_C_FUN(sqrt)(w*w + MICROPY_FLOAT_CONST(1.0)) - w;
+        } else {
+            t = MICROPY_FLOAT_CONST(-1.0)*(MICROPY_FLOAT_C_FUN(sqrt)(w*w + MICROPY_FLOAT_CONST(1.0)) + w);
+        }
+        s = t / MICROPY_FLOAT_C_FUN(sqrt)(t*t + MICROPY_FLOAT_CONST(1.0)); // the sine of the rotation angle
+        c = MICROPY_FLOAT_CONST(1.0) / MICROPY_FLOAT_C_FUN(sqrt)(t*t + MICROPY_FLOAT_CONST(1.0)); // the cosine of the rotation angle
+        tau = (MICROPY_FLOAT_CONST(1.0)-c)/s; // this is equal to the tangent of half the rotation angle
+
+        // at this point, we have the rotation angles, so we can transform the matrix
+        // first the two diagonal elements
+        // a(M, M) = a(M, M) - t*a(M, N)
+        array[M * S + M] = array[M * S + M] - t * array[M * S + N];
+        // a(N, N) = a(N, N) + t*a(M, N)
+        array[N * S + N] = array[N * S + N] + t * array[M * S + N];
+        // after the rotation, the a(M, N), and a(N, M) entries should become zero
+        array[M * S + N] = array[N * S + M] = MICROPY_FLOAT_CONST(0.0);
+        // then all other elements in the column
+        for(size_t k=0; k < S; k++) {
+            if((k == M) || (k == N)) {
+                continue;
+            }
+            aMk = array[M * S + k];
+            aNk = array[N * S + k];
+            // a(M, k) = a(M, k) - s*(a(N, k) + tau*a(M, k))
+            array[M * S + k] -= s * (aNk + tau * aMk);
+            // a(N, k) = a(N, k) + s*(a(M, k) - tau*a(N, k))
+            array[N * S + k] += s * (aMk - tau * aNk);
+            // a(k, M) = a(M, k)
+            array[k * S + M] = array[M * S + k];
+            // a(k, N) = a(N, k)
+            array[k * S + N] = array[N * S + k];
+        }
+        // now we have to update the eigenvectors
+        // the rotation matrix, R, multiplies from the right
+        // R is the unit matrix, except for the
+        // R(M,M) = R(N, N) = c
+        // R(N, M) = -s
+        // R(M, N) = s
+        // entries. This means that only the Mth, and Nth columns will change
+        for(size_t m=0; m < S; m++) {
+            vm = eigvectors[m * S + M];
+            vn = eigvectors[m * S + N];
+            // the new value of eigvectors(m, M)
+            eigvectors[m * S + M] = c * vm - s * vn;
+            // the new value of eigvectors(m, N)
+            eigvectors[m * S + N] = s * vm + c * vn;
+        }
+    } while(iterations > 0);
+    
+    return iterations;
+}
--- a/code/numpy/linalg/linalg_tools.h
+++ b/code/numpy/linalg/linalg_tools.h
@ -0,0 +1,28 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#ifndef _TOOLS_TOOLS_
+#define _TOOLS_TOOLS_
+
+#ifndef LINALG_EPSILON
+#if MICROPY_FLOAT_IMPL == MICROPY_FLOAT_IMPL_FLOAT
+#define LINALG_EPSILON      MICROPY_FLOAT_CONST(1.2e-7)
+#elif MICROPY_FLOAT_IMPL == MICROPY_FLOAT_IMPL_DOUBLE
+#define LINALG_EPSILON      MICROPY_FLOAT_CONST(2.3e-16)
+#endif
+#endif /* LINALG_EPSILON */
+
+#define JACOBI_MAX     20
+
+bool linalg_invert_matrix(mp_float_t *, size_t );
+size_t linalg_jacobi_rotations(mp_float_t *, mp_float_t *, size_t );
+
+#endif /* _TOOLS_TOOLS_ */
+
--- a/code/numpy/numerical/numerical.c
+++ b/code/numpy/numerical/numerical.c
--- a/code/numpy/numerical/numerical.h
+++ b/code/numpy/numerical/numerical.h
@ -0,0 +1,587 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#ifndef _NUMERICAL_
+#define _NUMERICAL_
+
+#include "../../ulab.h"
+#include "../../ndarray.h"
+
+// TODO: implement cumsum
+//mp_obj_t numerical_cumsum(size_t , const mp_obj_t *, mp_map_t *);
+
+#define RUN_ARGMIN1(ndarray, type, array, results, rarray, index, op)\
+({\
+    uint16_t best_index = 0;\
+    type best_value = *((type *)(array));\
+    if(((op) == NUMERICAL_MAX) || ((op) == NUMERICAL_ARGMAX)) {\
+        for(uint16_t i=0; i < (ndarray)->shape[(index)]; i++) {\
+            if(*((type *)(array)) > best_value) {\
+                best_index = i;\
+                best_value = *((type *)(array));\
+            }\
+            (array) += (ndarray)->strides[(index)];\
+        }\
+    } else {\
+        for(uint16_t i=0; i < (ndarray)->shape[(index)]; i++) {\
+            if(*((type *)(array)) < best_value) {\
+                best_index = i;\
+                best_value = *((type *)(array));\
+            }\
+            (array) += (ndarray)->strides[(index)];\
+        }\
+    }\
+    if(((op) == NUMERICAL_ARGMAX) || ((op) == NUMERICAL_ARGMIN)) {\
+        memcpy((rarray), &best_index, (results)->itemsize);\
+    } else {\
+        memcpy((rarray), &best_value, (results)->itemsize);\
+    }\
+    (rarray) += (results)->itemsize;\
+})
+
+#define RUN_SUM1(ndarray, type, array, results, rarray, index)\
+({\
+    type sum = 0;\
+    for(size_t i=0; i < (ndarray)->shape[(index)]; i++) {\
+        sum += *((type *)(array));\
+        (array) += (ndarray)->strides[(index)];\
+    }\
+    memcpy((rarray), &sum, (results)->itemsize);\
+    (rarray) += (results)->itemsize;\
+})
+
+// The mean could be calculated by simply dividing the sum by
+// the number of elements, but that method is numerically unstable
+#define RUN_MEAN1(ndarray, type, array, results, r, index)\
+({\
+    mp_float_t M, m;\
+    M = m = (mp_float_t)(*(type *)(array));\
+    for(size_t i=1; i < (ndarray)->shape[(index)]; i++) {\
+        (array) += (ndarray)->strides[(index)];\
+        mp_float_t value = (mp_float_t)(*(type *)(array));\
+        m = M + (value - M) / (mp_float_t)(i+1);\
+        M = m;\
+    }\
+    (array) += (ndarray)->strides[(index)];\
+    *(r)++ = M;\
+})
+
+// Instead of the straightforward implementation of the definition,
+// we take the numerically stable Welford algorithm here
+// https://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
+#define RUN_STD1(ndarray, type, array, results, r, index, div)\
+({\
+    mp_float_t M = 0.0, m = 0.0, S = 0.0, s = 0.0;\
+    for(size_t i=0; i < (ndarray)->shape[(index)]; i++) {\
+        mp_float_t value = (mp_float_t)(*(type *)(array));\
+        m = M + (value - M) / (mp_float_t)(i+1);\
+        s = S + (value - M) * (value - m);\
+        M = m;\
+        S = s;\
+        (array) += (ndarray)->strides[(index)];\
+    }\
+    *(r)++ = MICROPY_FLOAT_C_FUN(sqrt)(s / (div));\
+})
+
+#define RUN_DIFF1(ndarray, type, array, results, rarray, index, stencil, N)\
+({\
+    for(size_t i=0; i < (results)->shape[ULAB_MAX_DIMS - 1]; i++) {\
+        type sum = 0;\
+        uint8_t *source = (array);\
+        for(uint8_t d=0; d < (N)+1; d++) {\
+            sum -= (stencil)[d] * *((type *)source);\
+            source += (ndarray)->strides[(index)];\
+        }\
+        (array) += (ndarray)->strides[ULAB_MAX_DIMS - 1];\
+        *(type *)(rarray) = sum;\
+        (rarray) += (results)->itemsize;\
+    }\
+})
+
+#define HEAPSORT1(type, array, increment, N)\
+({\
+    type *_array = (type *)array;\
+    type tmp;\
+    size_t c, q = (N), p, r = (N) >> 1;\
+    for (;;) {\
+        if (r > 0) {\
+            tmp = _array[(--r)*(increment)];\
+        } else {\
+            q--;\
+            if(q == 0) {\
+                break;\
+            }\
+            tmp = _array[q*(increment)];\
+            _array[q*(increment)] = _array[0];\
+        }\
+        p = r;\
+        c = r + r + 1;\
+        while (c < q) {\
+            if((c + 1 < q)  &&  (_array[(c+1)*(increment)] > _array[c*(increment)])) {\
+                c++;\
+            }\
+            if(_array[c*(increment)] > tmp) {\
+                _array[p*(increment)] = _array[c*(increment)];\
+                p = c;\
+                c = p + p + 1;\
+            } else {\
+                break;\
+            }\
+        }\
+        _array[p*(increment)] = tmp;\
+    }\
+})
+
+#define HEAP_ARGSORT1(type, array, increment, N, iarray, iincrement)\
+({\
+    type *_array = (type *)(array);\
+    type tmp;\
+    uint16_t itmp, c, q = (N), p, r = (N) >> 1;\
+    for (;;) {\
+        if (r > 0) {\
+            r--;\
+            itmp = (iarray)[r*(iincrement)];\
+            tmp = _array[itmp*(increment)];\
+        } else {\
+            q--;\
+            if(q == 0) {\
+                break;\
+            }\
+            itmp = (iarray)[q*(iincrement)];\
+            tmp = _array[itmp*(increment)];\
+            (iarray)[q*(iincrement)] = (iarray)[0];\
+        }\
+        p = r;\
+        c = r + r + 1;\
+        while (c < q) {\
+            if((c + 1 < q)  &&  (_array[(iarray)[(c+1)*(iincrement)]*(increment)] > _array[(iarray)[c*(iincrement)]*(increment)])) {\
+                c++;\
+            }\
+            if(_array[(iarray)[c*(iincrement)]*(increment)] > tmp) {\
+                (iarray)[p*(iincrement)] = (iarray)[c*(iincrement)];\
+                p = c;\
+                c = p + p + 1;\
+            } else {\
+                break;\
+            }\
+        }\
+        (iarray)[p*(iincrement)] = itmp;\
+    }\
+})
+
+#if ULAB_MAX_DIMS == 1
+#define RUN_SUM(ndarray, type, array, results, rarray, shape, strides, index) do {\
+    RUN_SUM1((ndarray), type, (array), (results), (rarray), (index));\
+} while(0)
+
+#define RUN_MEAN(ndarray, type, array, results, r, shape, strides, index) do {\
+    RUN_MEAN1((ndarray), type, (array), (results), (r), (index));\
+} while(0)
+
+#define RUN_STD(ndarray, type, array, results, r, shape, strides, index, div) do {\
+    RUN_STD1((ndarray), type, (array), (results), (r), (index), (div));\
+} while(0)
+
+#define RUN_ARGMIN(ndarray, type, array, results, rarray, shape, strides, index, op) do {\
+    RUN_ARGMIN1((ndarray), type, (array), (results), (rarray), (index), (op));\
+} while(0)
+
+#define RUN_DIFF(ndarray, type, array, results, rarray, shape, strides, index, stencil, N) do {\
+    RUN_DIFF1((ndarray), type, (array), (results), (rarray), (index), (stencil), (N));\
+} while(0)
+
+#define HEAPSORT(ndarray, type, array, shape, strides, index, increment, N) do {\
+    HEAPSORT1(type, (array), (increment), (N));\
+} while(0)
+
+#define HEAP_ARGSORT(ndarray, type, array, shape, strides, index, increment, N, iarray, istrides, iincrement) do {\
+    HEAP_ARGSORT1(type, (array), (increment), (N), (iarray), (iincrement));\
+} while(0)
+
+#endif
+
+#if ULAB_MAX_DIMS == 2
+#define RUN_SUM(ndarray, type, array, results, rarray, shape, strides, index) do {\
+    size_t l = 0;\
+    do {\
+        RUN_SUM1((ndarray), type, (array), (results), (rarray), (index));\
+        (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+        (array) += (strides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+} while(0)
+
+#define RUN_MEAN(ndarray, type, array, results, r, shape, strides, index) do {\
+    size_t l = 0;\
+    do {\
+        RUN_MEAN1((ndarray), type, (array), (results), (r), (index));\
+        (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+        (array) += (strides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+} while(0)
+
+#define RUN_STD(ndarray, type, array, results, r, shape, strides, index, div) do {\
+    size_t l = 0;\
+    do {\
+        RUN_STD1((ndarray), type, (array), (results), (r), (index), (div));\
+        (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+        (array) += (strides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+} while(0)
+
+#define RUN_ARGMIN(ndarray, type, array, results, rarray, shape, strides, index, op) do {\
+    size_t l = 0;\
+    do {\
+        RUN_ARGMIN1((ndarray), type, (array), (results), (rarray), (index), (op));\
+        (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+        (array) += (strides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+} while(0)
+
+#define RUN_DIFF(ndarray, type, array, results, rarray, shape, strides, index, stencil, N) do {\
+    size_t l = 0;\
+    do {\
+        RUN_DIFF1((ndarray), type, (array), (results), (rarray), (index), (stencil), (N));\
+        (array) -= (ndarray)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS - 1];\
+        (array) += (ndarray)->strides[ULAB_MAX_DIMS - 2];\
+        (rarray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS - 1];\
+        (rarray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+        l++;\
+    } while(l < (results)->shape[ULAB_MAX_DIMS - 2]);\
+} while(0)
+
+#define HEAPSORT(ndarray, type, array, shape, strides, index, increment, N) do {\
+    size_t l = 0;\
+    do {\
+        HEAPSORT1(type, (array), (increment), (N));\
+        (array) += (strides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+} while(0)
+
+#define HEAP_ARGSORT(ndarray, type, array, shape, strides, index, increment, N, iarray, istrides, iincrement) do {\
+    size_t l = 0;\
+    do {\
+        HEAP_ARGSORT1(type, (array), (increment), (N), (iarray), (iincrement));\
+        (array) += (strides)[ULAB_MAX_DIMS - 1];\
+        (iarray) += (istrides)[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+} while(0)
+
+#endif
+
+#if ULAB_MAX_DIMS == 3
+#define RUN_SUM(ndarray, type, array, results, rarray, shape, strides, index) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            RUN_SUM1((ndarray), type, (array), (results), (rarray), (index));\
+            (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+            (array) += (strides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+        (array) += (strides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+} while(0)
+
+#define RUN_MEAN(ndarray, type, array, results, r, shape, strides, index) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            RUN_MEAN1((ndarray), type, (array), (results), (r), (index));\
+            (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+            (array) += (strides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+        (array) += (strides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+} while(0)
+
+#define RUN_STD(ndarray, type, array, results, r, shape, strides, index, div) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            RUN_STD1((ndarray), type, (array), (results), (r), (index), (div));\
+            (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+            (array) += (strides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+        (array) += (strides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+} while(0)
+
+#define RUN_ARGMIN(ndarray, type, array, results, rarray, shape, strides, index, op) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            RUN_ARGMIN1((ndarray), type, (array), (results), (rarray), (index), (op));\
+            (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+            (array) += (strides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+        (array) += (strides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+} while(0)
+
+#define RUN_DIFF(ndarray, type, array, results, rarray, shape, strides, index, stencil, N) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            RUN_DIFF1((ndarray), type, (array), (results), (rarray), (index), (stencil), (N));\
+            (array) -= (ndarray)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS - 1];\
+            (array) += (ndarray)->strides[ULAB_MAX_DIMS - 2];\
+            (rarray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS - 1];\
+            (rarray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+            l++;\
+        } while(l < (shape)[ULAB_MAX_DIMS - 2]);\
+        (array) -= (ndarray)->strides[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS-2];\
+        (array) += (ndarray)->strides[ULAB_MAX_DIMS - 3];\
+        (rarray) -= (results)->strides[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS - 2];\
+        (rarray) += (results)->strides[ULAB_MAX_DIMS - 3];\
+        k++;\
+    } while(k < (shape)[ULAB_MAX_DIMS - 3]);\
+} while(0)
+
+#define HEAPSORT(ndarray, type, array, shape, strides, index, increment, N) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            HEAPSORT1(type, (array), (increment), (N));\
+            (array) += (strides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+        (array) += (strides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+} while(0)
+
+#define HEAP_ARGSORT(ndarray, type, array, shape, strides, index, increment, N, iarray, istrides, iincrement) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            HEAP_ARGSORT1(type, (array), (increment), (N), (iarray), (iincrement));\
+            (array) += (strides)[ULAB_MAX_DIMS - 1];\
+            (iarray) += (istrides)[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+        (iarray) -= (istrides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+        (iarray) += (istrides)[ULAB_MAX_DIMS - 2];\
+        (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+        (array) += (strides)[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+} while(0)
+
+#endif
+
+#if ULAB_MAX_DIMS == 4
+#define RUN_SUM(ndarray, type, array, results, rarray, shape, strides, index) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                RUN_SUM1((ndarray), type, (array), (results), (rarray), (index));\
+                (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+                (array) += (strides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+            (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+            (array) += (strides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+        (array) += (strides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (shape)[ULAB_MAX_DIMS - 3]);\
+} while(0)
+
+#define RUN_MEAN(ndarray, type, array, results, r, shape, strides, index) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                RUN_MEAN1((ndarray), type, (array), (results), (r), (index));\
+                (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+                (array) += (strides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+            (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+            (array) += (strides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+        (array) += (strides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (shape)[ULAB_MAX_DIMS - 3]);\
+} while(0)
+
+#define RUN_STD(ndarray, type, array, results, r, shape, strides, index, div) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                RUN_STD1((ndarray), type, (array), (results), (r), (index), (div));\
+                (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+                (array) += (strides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+            (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+            (array) += (strides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+        (array) += (strides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (shape)[ULAB_MAX_DIMS - 3]);\
+} while(0)
+
+#define RUN_ARGMIN(ndarray, type, array, results, rarray, shape, strides, index, op) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                RUN_ARGMIN1((ndarray), type, (array), (results), (rarray), (index), (op));\
+                (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+                (array) += (strides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+            (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+            (array) += (strides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+        (array) += (strides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (shape)[ULAB_MAX_DIMS - 3]);\
+} while(0)
+
+#define RUN_DIFF(ndarray, type, array, results, rarray, shape, strides, index, stencil, N) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                RUN_DIFF1((ndarray), type, (array), (results), (rarray), (index), (stencil), (N));\
+                (array) -= (ndarray)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS - 1];\
+                (array) += (ndarray)->strides[ULAB_MAX_DIMS - 2];\
+                (rarray) -= (results)->strides[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS - 1];\
+                (rarray) += (results)->strides[ULAB_MAX_DIMS - 2];\
+                l++;\
+            } while(l < (shape)[ULAB_MAX_DIMS - 2]);\
+            (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+            (array) += (strides)[ULAB_MAX_DIMS - 3];\
+            (rarray) -= (results)->strides[ULAB_MAX_DIMS - 2] * (results)->shape[ULAB_MAX_DIMS - 2];\
+            (rarray) += (results)->strides[ULAB_MAX_DIMS - 3];\
+            k++;\
+        } while(k < (shape)[ULAB_MAX_DIMS - 3]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 3] * (shape)[ULAB_MAX_DIMS-3];\
+        (array) += (strides)[ULAB_MAX_DIMS - 4];\
+        (rarray) -= (results)->strides[ULAB_MAX_DIMS - 3] * (results)->shape[ULAB_MAX_DIMS - 3];\
+        (rarray) += (results)->strides[ULAB_MAX_DIMS - 4];\
+        j++;\
+    } while(j < (shape)[ULAB_MAX_DIMS - 4]);\
+} while(0)
+
+#define HEAPSORT(ndarray, type, array, shape, strides, index, increment, N) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                HEAPSORT1(type, (array), (increment), (N));\
+                (array) += (strides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+            (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+            (array) += (strides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+        (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+        (array) += (strides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (shape)[ULAB_MAX_DIMS - 3]);\
+} while(0)
+
+#define HEAP_ARGSORT(ndarray, type, array, shape, strides, index, increment, N, iarray, istrides, iincrement) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                HEAP_ARGSORT1(type, (array), (increment), (N), (iarray), (iincrement));\
+                (array) += (strides)[ULAB_MAX_DIMS - 1];\
+                (iarray) += (istrides)[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+            (iarray) -= (istrides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+            (iarray) += (istrides)[ULAB_MAX_DIMS - 2];\
+            (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS-1];\
+            (array) += (strides)[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+        (iarray) -= (istrides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+        (iarray) += (istrides)[ULAB_MAX_DIMS - 3];\
+        (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\
+        (array) += (strides)[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (shape)[ULAB_MAX_DIMS - 3]);\
+} while(0)
+
+#endif
+
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_argmax_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_argmin_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_argsort_obj);
+MP_DECLARE_CONST_FUN_OBJ_2(numerical_cross_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_diff_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_flip_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_max_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_mean_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_median_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_min_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_roll_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_std_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_sum_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_sort_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(numerical_sort_inplace_obj);
+
+#endif
--- a/code/numpy/numpy.c
+++ b/code/numpy/numpy.c
@ -0,0 +1,280 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020-2021 Zoltán Vörös
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include <string.h>
+#include "py/runtime.h"
+
+#include "numpy.h"
+#include "../ulab_create.h"
+#include "approx/approx.h"
+#include "compare/compare.h"
+#include "fft/fft.h"
+#include "filter/filter.h"
+#include "linalg/linalg.h"
+#include "numerical/numerical.h"
+#include "poly/poly.h"
+#include "vector/vector.h"
+
+//| """Compatibility layer for numpy"""
+//|
+
+// math constants
+#if ULAB_NUMPY_HAS_E
+mp_obj_float_t ulab_const_float_e_obj = {{&mp_type_float}, MP_E};
+#endif
+
+#if ULAB_NUMPY_HAS_INF
+mp_obj_float_t numpy_const_float_inf_obj = {{&mp_type_float}, (mp_float_t)INFINITY};
+#endif
+
+#if ULAB_NUMPY_HAS_NAN
+mp_obj_float_t numpy_const_float_nan_obj = {{&mp_type_float}, (mp_float_t)NAN};
+#endif
+
+#if ULAB_NUMPY_HAS_PI
+mp_obj_float_t ulab_const_float_pi_obj = {{&mp_type_float}, MP_PI};
+#endif
+
+static const mp_rom_map_elem_t ulab_numpy_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_numpy) },
+    { MP_OBJ_NEW_QSTR(MP_QSTR_ndarray), (mp_obj_t)&ulab_ndarray_type },
+    { MP_OBJ_NEW_QSTR(MP_QSTR_array), MP_ROM_PTR(&ndarray_array_constructor_obj) },
+    #if ULAB_NUMPY_HAS_FROMBUFFER
+        { MP_ROM_QSTR(MP_QSTR_frombuffer), MP_ROM_PTR(&create_frombuffer_obj) },
+    #endif
+    // math constants
+    #if ULAB_NUMPY_HAS_E
+        { MP_ROM_QSTR(MP_QSTR_e), MP_ROM_PTR(&ulab_const_float_e_obj) },
+    #endif
+    #if ULAB_NUMPY_HAS_INF
+        { MP_ROM_QSTR(MP_QSTR_inf), MP_ROM_PTR(&numpy_const_float_inf_obj) },
+    #endif
+    #if ULAB_NUMPY_HAS_NAN
+        { MP_ROM_QSTR(MP_QSTR_nan), MP_ROM_PTR(&numpy_const_float_nan_obj) },
+    #endif
+    #if ULAB_NUMPY_HAS_PI
+        { MP_ROM_QSTR(MP_QSTR_pi), MP_ROM_PTR(&ulab_const_float_pi_obj) },
+    #endif
+    // class constants, always included
+    { MP_ROM_QSTR(MP_QSTR_bool), MP_ROM_INT(NDARRAY_BOOL) },
+    { MP_ROM_QSTR(MP_QSTR_uint8), MP_ROM_INT(NDARRAY_UINT8) },
+    { MP_ROM_QSTR(MP_QSTR_int8), MP_ROM_INT(NDARRAY_INT8) },
+    { MP_ROM_QSTR(MP_QSTR_uint16), MP_ROM_INT(NDARRAY_UINT16) },
+    { MP_ROM_QSTR(MP_QSTR_int16), MP_ROM_INT(NDARRAY_INT16) },
+    { MP_ROM_QSTR(MP_QSTR_float), MP_ROM_INT(NDARRAY_FLOAT) },
+    // modules of numpy
+    #if ULAB_NUMPY_HAS_FFT_MODULE
+        { MP_ROM_QSTR(MP_QSTR_fft), MP_ROM_PTR(&ulab_fft_module) },
+    #endif
+    #if ULAB_NUMPY_HAS_LINALG_MODULE
+        { MP_ROM_QSTR(MP_QSTR_linalg), MP_ROM_PTR(&ulab_linalg_module) },
+    #endif
+    #if ULAB_HAS_PRINTOPTIONS
+        { MP_ROM_QSTR(MP_QSTR_set_printoptions), (mp_obj_t)&ndarray_set_printoptions_obj },
+        { MP_ROM_QSTR(MP_QSTR_get_printoptions), (mp_obj_t)&ndarray_get_printoptions_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_NDINFO
+        { MP_ROM_QSTR(MP_QSTR_ndinfo), (mp_obj_t)&ndarray_info_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ARANGE
+        { MP_ROM_QSTR(MP_QSTR_arange), (mp_obj_t)&create_arange_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_CONCATENATE
+        { MP_ROM_QSTR(MP_QSTR_concatenate), (mp_obj_t)&create_concatenate_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_DIAG
+        { MP_ROM_QSTR(MP_QSTR_diag), (mp_obj_t)&create_diag_obj },
+    #endif
+    #if ULAB_MAX_DIMS > 1
+        #if ULAB_NUMPY_HAS_EYE
+            { MP_ROM_QSTR(MP_QSTR_eye), (mp_obj_t)&create_eye_obj },
+        #endif
+    #endif /* ULAB_MAX_DIMS */
+    // functions of the approx sub-module
+    #if ULAB_NUMPY_HAS_INTERP
+        { MP_OBJ_NEW_QSTR(MP_QSTR_interp), (mp_obj_t)&approx_interp_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_TRAPZ
+        { MP_OBJ_NEW_QSTR(MP_QSTR_trapz), (mp_obj_t)&approx_trapz_obj },
+    #endif
+    // functions of the create sub-module
+    #if ULAB_NUMPY_HAS_FULL
+        { MP_ROM_QSTR(MP_QSTR_full), (mp_obj_t)&create_full_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_LINSPACE
+        { MP_ROM_QSTR(MP_QSTR_linspace), (mp_obj_t)&create_linspace_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_LOGSPACE
+        { MP_ROM_QSTR(MP_QSTR_logspace), (mp_obj_t)&create_logspace_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ONES
+        { MP_ROM_QSTR(MP_QSTR_ones), (mp_obj_t)&create_ones_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ZEROS
+        { MP_ROM_QSTR(MP_QSTR_zeros), (mp_obj_t)&create_zeros_obj },
+    #endif
+    // functions of the compare sub-module
+    #if ULAB_NUMPY_HAS_CLIP
+        { MP_OBJ_NEW_QSTR(MP_QSTR_clip), (mp_obj_t)&compare_clip_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_EQUAL
+        { MP_OBJ_NEW_QSTR(MP_QSTR_equal), (mp_obj_t)&compare_equal_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_NOTEQUAL
+        { MP_OBJ_NEW_QSTR(MP_QSTR_not_equal), (mp_obj_t)&compare_not_equal_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_MAXIMUM
+        { MP_OBJ_NEW_QSTR(MP_QSTR_maximum), (mp_obj_t)&compare_maximum_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_MINIMUM
+        { MP_OBJ_NEW_QSTR(MP_QSTR_minimum), (mp_obj_t)&compare_minimum_obj },
+    #endif
+    // functions of the filter sub-module
+    #if ULAB_NUMPY_HAS_CONVOLVE
+        { MP_OBJ_NEW_QSTR(MP_QSTR_convolve), (mp_obj_t)&filter_convolve_obj },
+    #endif
+    // functions of the numerical sub-module
+    #if ULAB_NUMPY_HAS_ARGMINMAX
+        { MP_OBJ_NEW_QSTR(MP_QSTR_argmax), (mp_obj_t)&numerical_argmax_obj },
+        { MP_OBJ_NEW_QSTR(MP_QSTR_argmin), (mp_obj_t)&numerical_argmin_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ARGSORT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_argsort), (mp_obj_t)&numerical_argsort_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_CROSS
+        { MP_OBJ_NEW_QSTR(MP_QSTR_cross), (mp_obj_t)&numerical_cross_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_DIFF
+        { MP_OBJ_NEW_QSTR(MP_QSTR_diff), (mp_obj_t)&numerical_diff_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_FLIP
+        { MP_OBJ_NEW_QSTR(MP_QSTR_flip), (mp_obj_t)&numerical_flip_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_MINMAX
+        { MP_OBJ_NEW_QSTR(MP_QSTR_max), (mp_obj_t)&numerical_max_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_MEAN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_mean), (mp_obj_t)&numerical_mean_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_MEDIAN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_median), (mp_obj_t)&numerical_median_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_MINMAX
+        { MP_OBJ_NEW_QSTR(MP_QSTR_min), (mp_obj_t)&numerical_min_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ROLL
+        { MP_OBJ_NEW_QSTR(MP_QSTR_roll), (mp_obj_t)&numerical_roll_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_SORT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_sort), (mp_obj_t)&numerical_sort_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_STD
+        { MP_OBJ_NEW_QSTR(MP_QSTR_std), (mp_obj_t)&numerical_std_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_SUM
+        { MP_OBJ_NEW_QSTR(MP_QSTR_sum), (mp_obj_t)&numerical_sum_obj },
+    #endif
+    // functions of the poly sub-module
+    #if ULAB_NUMPY_HAS_POLYFIT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_polyfit), (mp_obj_t)&poly_polyfit_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_POLYVAL
+        { MP_OBJ_NEW_QSTR(MP_QSTR_polyval), (mp_obj_t)&poly_polyval_obj },
+    #endif
+    // functions of the vector sub-module
+    #if ULAB_NUMPY_HAS_ACOS
+        { MP_OBJ_NEW_QSTR(MP_QSTR_acos), (mp_obj_t)&vectorise_acos_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ACOSH
+        { MP_OBJ_NEW_QSTR(MP_QSTR_acosh), (mp_obj_t)&vectorise_acosh_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ARCTAN2
+        { MP_OBJ_NEW_QSTR(MP_QSTR_arctan2), (mp_obj_t)&vectorise_arctan2_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_AROUND
+        { MP_OBJ_NEW_QSTR(MP_QSTR_around), (mp_obj_t)&vectorise_around_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ASIN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_asin), (mp_obj_t)&vectorise_asin_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ASINH
+        { MP_OBJ_NEW_QSTR(MP_QSTR_asinh), (mp_obj_t)&vectorise_asinh_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ATAN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_atan), (mp_obj_t)&vectorise_atan_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_ATANH
+        { MP_OBJ_NEW_QSTR(MP_QSTR_atanh), (mp_obj_t)&vectorise_atanh_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_CEIL
+        { MP_OBJ_NEW_QSTR(MP_QSTR_ceil), (mp_obj_t)&vectorise_ceil_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_COS
+        { MP_OBJ_NEW_QSTR(MP_QSTR_cos), (mp_obj_t)&vectorise_cos_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_COSH
+        { MP_OBJ_NEW_QSTR(MP_QSTR_cosh), (mp_obj_t)&vectorise_cosh_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_DEGREES
+        { MP_OBJ_NEW_QSTR(MP_QSTR_degrees), (mp_obj_t)&vectorise_degrees_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_EXP
+        { MP_OBJ_NEW_QSTR(MP_QSTR_exp), (mp_obj_t)&vectorise_exp_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_EXPM1
+        { MP_OBJ_NEW_QSTR(MP_QSTR_expm1), (mp_obj_t)&vectorise_expm1_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_FLOOR
+        { MP_OBJ_NEW_QSTR(MP_QSTR_floor), (mp_obj_t)&vectorise_floor_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_LOG
+        { MP_OBJ_NEW_QSTR(MP_QSTR_log), (mp_obj_t)&vectorise_log_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_LOG10
+        { MP_OBJ_NEW_QSTR(MP_QSTR_log10), (mp_obj_t)&vectorise_log10_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_LOG2
+        { MP_OBJ_NEW_QSTR(MP_QSTR_log2), (mp_obj_t)&vectorise_log2_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_RADIANS
+        { MP_OBJ_NEW_QSTR(MP_QSTR_radians), (mp_obj_t)&vectorise_radians_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_SIN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_sin), (mp_obj_t)&vectorise_sin_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_SINH
+        { MP_OBJ_NEW_QSTR(MP_QSTR_sinh), (mp_obj_t)&vectorise_sinh_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_SQRT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_sqrt), (mp_obj_t)&vectorise_sqrt_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_TAN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_tan), (mp_obj_t)&vectorise_tan_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_TANH
+        { MP_OBJ_NEW_QSTR(MP_QSTR_tanh), (mp_obj_t)&vectorise_tanh_obj },
+    #endif
+    #if ULAB_NUMPY_HAS_VECTORIZE
+        { MP_OBJ_NEW_QSTR(MP_QSTR_vectorize), (mp_obj_t)&vectorise_vectorize_obj },
+    #endif
+
+};
+
+static MP_DEFINE_CONST_DICT(mp_module_ulab_numpy_globals, ulab_numpy_globals_table);
+
+mp_obj_module_t ulab_numpy_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_numpy_globals,
+};
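The globals table in `numpy.c` above includes each entry only when the matching `ULAB_NUMPY_HAS_*` switch is set, so disabled functions take no space in the module dictionary. The pattern can be sketched in plain C without the MicroPython headers. Everything in this sketch is hypothetical and chosen for illustration only: the `DEMO_HAS_*` flags, the `demo_entry_t` type, and the `demo_lookup` function are not part of ulab or MicroPython.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical feature switches, standing in for ULAB_NUMPY_HAS_*. */
#define DEMO_HAS_PI 1
#define DEMO_HAS_E  0

/* Hypothetical table entry: a name mapped to a constant value. */
typedef struct {
    const char *name;
    double value;
} demo_entry_t;

/* Entries guarded by a disabled switch are never compiled in. */
static const demo_entry_t demo_table[] = {
    { "answer", 42.0 },
    #if DEMO_HAS_PI
        { "pi", 3.141592653589793 },
    #endif
    #if DEMO_HAS_E
        { "e", 2.718281828459045 },
    #endif
};

/* Linear search by name; returns NULL when the entry was compiled out
 * or never existed. */
static const demo_entry_t *demo_lookup(const char *name) {
    size_t n = sizeof(demo_table) / sizeof(demo_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(demo_table[i].name, name) == 0) {
            return &demo_table[i];
        }
    }
    return NULL;
}
```

With the flags as set here, looking up `"pi"` finds an entry while looking up `"e"` returns `NULL`, which mirrors how a name would simply be missing from the module when its switch is off.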
--- a/code/numpy/numpy.h
+++ b/code/numpy/numpy.h
@ -0,0 +1,21 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ *               
+*/
+
+#ifndef _NUMPY_
+#define _NUMPY_
+
+#include "ulab.h"
+#include "ndarray.h"
+
+extern mp_obj_module_t ulab_numpy_module;
+
+#endif /* _NUMPY_ */
--- a/code/numpy/poly/poly.c
+++ b/code/numpy/poly/poly.c
@ -6,100 +6,35 @@
 *
 * The MIT License (MIT)
 *
- * Copyright (c) 2019-2020 Zoltán Vörös
+ * Copyright (c) 2019-2021 Zoltán Vörös
+ *               2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020 Taku Fukada
 */

 #include "py/obj.h"
 #include "py/runtime.h"
 #include "py/objarray.h"
-#include "ndarray.h"
-#include "linalg.h"
+
+#include "../../ulab.h"
+#include "../linalg/linalg_tools.h"
+#include "../../ulab_tools.h"
 #include "poly.h"

-#if ULAB_POLY_MODULE
-bool object_is_nditerable(mp_obj_t o_in) {
-    if(MP_OBJ_IS_TYPE(o_in, &ulab_ndarray_type) || 
-      MP_OBJ_IS_TYPE(o_in, &mp_type_tuple) || 
-      MP_OBJ_IS_TYPE(o_in, &mp_type_list) || 
-      MP_OBJ_IS_TYPE(o_in, &mp_type_range)) {
-        return true;
-    }
-    return false;
-}
-
-size_t get_nditerable_len(mp_obj_t o_in) {
-    if(MP_OBJ_IS_TYPE(o_in, &ulab_ndarray_type)) {
-        ndarray_obj_t *in = MP_OBJ_TO_PTR(o_in);
-        return in->array->len;
-    } else {
-        return (size_t)mp_obj_get_int(mp_obj_len_maybe(o_in));
-    }
-}
-
-mp_obj_t poly_polyval(mp_obj_t o_p, mp_obj_t o_x) {
-    // TODO: return immediately, if o_p is not an iterable
-    // TODO: there is a bug here: matrices won't work, 
-    // because there is a single iteration loop
-    size_t m, n;
-    if(MP_OBJ_IS_TYPE(o_x, &ulab_ndarray_type)) {
-        ndarray_obj_t *ndx = MP_OBJ_TO_PTR(o_x);
-        m = ndx->m;
-        n = ndx->n;
-    } else {
-        mp_obj_array_t *ix = MP_OBJ_TO_PTR(o_x);
-        m = 1;
-        n = ix->len;
-    }
-    // polynomials are going to be of type float, except, when both 
-    // the coefficients and the independent variable are integers
-    ndarray_obj_t *out = create_new_ndarray(m, n, NDARRAY_FLOAT);
-    mp_obj_iter_buf_t x_buf;
-    mp_obj_t x_item, x_iterable = mp_getiter(o_x, &x_buf);
-
-    mp_obj_iter_buf_t p_buf;
-    mp_obj_t p_item, p_iterable;
-
-    mp_float_t x, y;
-    mp_float_t *outf = (mp_float_t *)out->array->items;
-    uint8_t plen = mp_obj_get_int(mp_obj_len_maybe(o_p));
-    mp_float_t *p = m_new(mp_float_t, plen);
-    p_iterable = mp_getiter(o_p, &p_buf);
-    uint16_t i = 0;    
-    while((p_item = mp_iternext(p_iterable)) != MP_OBJ_STOP_ITERATION) {
-        p[i] = mp_obj_get_float(p_item);
-        i++;
-    }
-    i = 0;
-    while ((x_item = mp_iternext(x_iterable)) != MP_OBJ_STOP_ITERATION) {
-        x = mp_obj_get_float(x_item);
-        y = p[0];
-        for(uint8_t j=0; j < plen-1; j++) {
-            y *= x;
-            y += p[j+1];
-        }
-        outf[i++] = y;
-    }
-    m_del(mp_float_t, p, plen);
-    return MP_OBJ_FROM_PTR(out);
-}
-
-MP_DEFINE_CONST_FUN_OBJ_2(poly_polyval_obj, poly_polyval);
+#if ULAB_NUMPY_HAS_POLYFIT

 mp_obj_t poly_polyfit(size_t n_args, const mp_obj_t *args) {
-    if((n_args != 2) && (n_args != 3)) {
-        mp_raise_ValueError(translate("number of arguments must be 2, or 3"));
-    }
-    if(!object_is_nditerable(args[0])) {
+    if(!ndarray_object_is_array_like(args[0])) {
        mp_raise_ValueError(translate("input data must be an iterable"));
    }
-    uint16_t lenx = 0, leny = 0;
+    size_t lenx = 0, leny = 0;
    uint8_t deg = 0;
    mp_float_t *x, *XT, *y, *prod;

    if(n_args == 2) { // only the y values are supplied
        // TODO: this is actually not enough: the first argument can very well be a matrix, 
        // in which case we are between the rock and a hard place
-        leny = (uint16_t)mp_obj_get_int(mp_obj_len_maybe(args[0]));
+        leny = (size_t)mp_obj_get_int(mp_obj_len_maybe(args[0]));
        deg = (uint8_t)mp_obj_get_int(args[1]);
        if(leny < deg) {
            mp_raise_ValueError(translate("more degrees of freedom than data points"));
@ -111,9 +46,12 @@ mp_obj_t poly_polyfit(size_t  n_args, const mp_obj_t *args) {
        }
        y = m_new(mp_float_t, leny);
        fill_array_iterable(y, args[0]);
-    } else if(n_args == 3) {
-        lenx = (uint16_t)mp_obj_get_int(mp_obj_len_maybe(args[0]));
-        leny = (uint16_t)mp_obj_get_int(mp_obj_len_maybe(args[0]));
+    } else /* n_args == 3 */ {
+        if(!ndarray_object_is_array_like(args[1])) {
+            mp_raise_ValueError(translate("input data must be an iterable"));
+        }
+        lenx = (size_t)mp_obj_get_int(mp_obj_len_maybe(args[0]));
+        leny = (size_t)mp_obj_get_int(mp_obj_len_maybe(args[1]));
        if(lenx != leny) {
            mp_raise_ValueError(translate("input vectors must be of equal length"));
        }
@ -130,7 +68,7 @@ mp_obj_t poly_polyfit(size_t  n_args, const mp_obj_t *args) {
    // one could probably express X as a function of XT, 
    // and thereby save RAM, because X is used only in the product
    XT = m_new(mp_float_t, (deg+1)*leny); // XT is a matrix of shape (deg+1, len) (rows, columns)
-    for(uint8_t i=0; i < leny; i++) { // column index
+    for(size_t i=0; i < leny; i++) { // column index
        XT[i+0*lenx] = 1.0; // top row
        for(uint8_t j=1; j < deg+1; j++) { // row index
            XT[i+j*leny] = XT[i+(j-1)*leny]*x[i];
@ -139,8 +77,8 @@ mp_obj_t poly_polyfit(size_t  n_args, const mp_obj_t *args) {
    
    prod = m_new(mp_float_t, (deg+1)*(deg+1)); // the product matrix is of shape (deg+1, deg+1)
    mp_float_t sum;
-    for(uint16_t i=0; i < deg+1; i++) { // column index
-        for(uint16_t j=0; j < deg+1; j++) { // row index
+    for(uint8_t i=0; i < deg+1; i++) { // column index
+        for(uint8_t j=0; j < deg+1; j++) { // row index
            sum = 0.0;
            for(size_t k=0; k < lenx; k++) {
                // (j, k) * (k, i) 
@ -163,9 +101,9 @@ mp_obj_t poly_polyfit(size_t  n_args, const mp_obj_t *args) {
    } 
    // at this point, we have the inverse of X^T * X
    // y is a column vector; x is free now, we can use it for storing intermediate values
-    for(uint16_t i=0; i < deg+1; i++) { // row index
+    for(uint8_t i=0; i < deg+1; i++) { // row index
        sum = 0.0;
-        for(uint16_t j=0; j < lenx; j++) { // column index
+        for(size_t j=0; j < lenx; j++) { // column index
            sum += XT[i*lenx+j]*y[j];
        }
        x[i] = sum;
@ -173,15 +111,15 @@ mp_obj_t poly_polyfit(size_t  n_args, const mp_obj_t *args) {
    // XT is no longer needed
    m_del(mp_float_t, XT, (deg+1)*leny);
    
-    ndarray_obj_t *beta = create_new_ndarray(deg+1, 1, NDARRAY_FLOAT);
-    mp_float_t *betav = (mp_float_t *)beta->array->items;
+    ndarray_obj_t *beta = ndarray_new_linear_array(deg+1, NDARRAY_FLOAT);
+    mp_float_t *betav = (mp_float_t *)beta->array;
    // x[0..(deg+1)] contains now the product X^T * y; we can get rid of y
    m_del(float, y, leny);
    
    // now, we calculate beta, i.e., we apply prod = (X^T * X)^(-1) on x = X^T * y; x is a column vector now
-    for(uint16_t i=0; i < deg+1; i++) {
+    for(uint8_t i=0; i < deg+1; i++) {
        sum = 0.0;
-        for(uint16_t j=0; j < deg+1; j++) {
+        for(uint8_t j=0; j < deg+1; j++) {
            sum += prod[i*(deg+1)+j]*x[j];
        }
        betav[i] = sum;
@ -196,20 +134,99 @@ mp_obj_t poly_polyfit(size_t  n_args, const mp_obj_t *args) {
 }

 MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(poly_polyfit_obj, 2, 3, poly_polyfit);
-
-#if !CIRCUITPY
-STATIC const mp_rom_map_elem_t ulab_poly_globals_table[] = {
-    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_poly) },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_polyval), (mp_obj_t)&poly_polyval_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_polyfit), (mp_obj_t)&poly_polyfit_obj },
-};
-
-STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_poly_globals, ulab_poly_globals_table);
-
-mp_obj_module_t ulab_poly_module = {
-    .base = { &mp_type_module },
-    .globals = (mp_obj_dict_t*)&mp_module_ulab_poly_globals,
-};
 #endif

+#if ULAB_NUMPY_HAS_POLYVAL
+
+mp_obj_t poly_polyval(mp_obj_t o_p, mp_obj_t o_x) {
+    if(!ndarray_object_is_array_like(o_p) || !ndarray_object_is_array_like(o_x)) {
+        mp_raise_TypeError(translate("inputs are not iterable"));
+    }
+    // p had better be a one-dimensional standard iterable
+    uint8_t plen = mp_obj_get_int(mp_obj_len_maybe(o_p));
+    mp_float_t *p = m_new(mp_float_t, plen);
+    mp_obj_iter_buf_t p_buf;
+    mp_obj_t p_item, p_iterable = mp_getiter(o_p, &p_buf);
+    uint8_t i = 0;    
+    while((p_item = mp_iternext(p_iterable)) != MP_OBJ_STOP_ITERATION) {
+        p[i] = mp_obj_get_float(p_item);
+        i++;
+    }
+
+    // the result is always of type float, irrespective of the types of
+    // the coefficients and the independent variable
+    ndarray_obj_t *ndarray;
+    if(MP_OBJ_IS_TYPE(o_x, &ulab_ndarray_type)) {
+        ndarray_obj_t *source = MP_OBJ_TO_PTR(o_x);
+        uint8_t *sarray = (uint8_t *)source->array;
+        ndarray = ndarray_new_dense_ndarray(source->ndim, source->shape, NDARRAY_FLOAT);
+        mp_float_t *array = (mp_float_t *)ndarray->array;
+        
+        mp_float_t (*func)(void *) = ndarray_get_float_function(source->dtype);
+
+        // TODO: these loops are really nothing but a reimplementation of
+        // ITERATE_VECTOR from vectorise.c. We could pass a function pointer here
+        #if ULAB_MAX_DIMS > 3
+        size_t i = 0;
+        do {
+        #endif
+            #if ULAB_MAX_DIMS > 2
+            size_t j = 0;
+            do {
+            #endif
+                #if ULAB_MAX_DIMS > 1
+                size_t k = 0;
+                do {
+                #endif
+                    size_t l = 0;
+                    do {
+                        mp_float_t y = p[0];
+                        mp_float_t _x = func(sarray);
+                        for(uint8_t m=0; m < plen-1; m++) {
+                            y *= _x;
+                            y += p[m+1];
+                        }
+                        *array++ = y;
+                        sarray += source->strides[ULAB_MAX_DIMS - 1];
+                        l++;
+                    } while(l < source->shape[ULAB_MAX_DIMS - 1]);
+                #if ULAB_MAX_DIMS > 1
+                    sarray -= source->strides[ULAB_MAX_DIMS - 1] * source->shape[ULAB_MAX_DIMS-1];
+                    sarray += source->strides[ULAB_MAX_DIMS - 2];
+                    k++;
+                } while(k < source->shape[ULAB_MAX_DIMS - 2]);
+                #endif
+            #if ULAB_MAX_DIMS > 2
+                sarray -= source->strides[ULAB_MAX_DIMS - 2] * source->shape[ULAB_MAX_DIMS-2];
+                sarray += source->strides[ULAB_MAX_DIMS - 3];
+                j++;
+            } while(j < source->shape[ULAB_MAX_DIMS - 3]);
+            #endif
+        #if ULAB_MAX_DIMS > 3
+            sarray -= source->strides[ULAB_MAX_DIMS - 3] * source->shape[ULAB_MAX_DIMS-3];
+            sarray += source->strides[ULAB_MAX_DIMS - 4];
+            i++;
+        } while(i < source->shape[ULAB_MAX_DIMS - 4]);
+        #endif        
+    } else {
+        // o_x had better be a one-dimensional standard iterable
+        ndarray = ndarray_new_linear_array(mp_obj_get_int(mp_obj_len_maybe(o_x)), NDARRAY_FLOAT);
+        mp_float_t *array = (mp_float_t *)ndarray->array;
+        mp_obj_iter_buf_t x_buf;
+        mp_obj_t x_item, x_iterable = mp_getiter(o_x, &x_buf);
+        while ((x_item = mp_iternext(x_iterable)) != MP_OBJ_STOP_ITERATION) {
+            mp_float_t _x = mp_obj_get_float(x_item);
+            mp_float_t y = p[0];
+            for(uint8_t j=0; j < plen-1; j++) {
+                y *= _x;
+                y += p[j+1];
+            }
+            *array++ = y;
+        }
+    }
+    m_del(mp_float_t, p, plen);
+    return MP_OBJ_FROM_PTR(ndarray);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_2(poly_polyval_obj, poly_polyval);
 #endif
--- a/code/numpy/poly/poly.h
+++ b/code/numpy/poly/poly.h
@ -6,20 +6,16 @@
 *
 * The MIT License (MIT)
 *
- * Copyright (c) 2019-2020 Zoltán Vörös
+ * Copyright (c) 2019-2021 Zoltán Vörös
 */

 #ifndef _POLY_
 #define _POLY_

-#include "ulab.h"
+#include "../../ulab.h"
+#include "../../ndarray.h"

-#if ULAB_POLY_MODULE
-
-extern mp_obj_module_t ulab_poly_module;
-
-MP_DECLARE_CONST_FUN_OBJ_2(poly_polyval_obj);
 MP_DECLARE_CONST_FUN_OBJ_VAR_BETWEEN(poly_polyfit_obj);
+MP_DECLARE_CONST_FUN_OBJ_2(poly_polyval_obj);

 #endif
-#endif
--- a/code/numpy/vector/vector.c
+++ b/code/numpy/vector/vector.c
@ -0,0 +1,643 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+ *               2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "py/runtime.h"
+#include "py/binary.h"
+#include "py/obj.h"
+#include "py/objarray.h"
+
+#include "../../ulab.h"
+#include "../../ulab_tools.h"
+#include "vector.h"
+
+//| """Element-by-element functions
+//|
+//| These functions can operate on numbers, 1-D iterables, and arrays of 1 to 4 dimensions by
+//| applying the function to every element in the array.  This is typically
+//| much more efficient than expressing the same operation as a Python loop."""
+//|
+//| from ulab import _DType, _ArrayLike
+//|
+
+static mp_obj_t vectorise_generic_vector(mp_obj_t o_in, mp_float_t (*f)(mp_float_t)) {
+    // Return a single float if o_in is a scalar (int or float)
+    if(mp_obj_is_float(o_in) || MP_OBJ_IS_INT(o_in)) {
+        return mp_obj_new_float(f(mp_obj_get_float(o_in)));
+    }
+    if(MP_OBJ_IS_TYPE(o_in, &ulab_ndarray_type)) {
+        ndarray_obj_t *source = MP_OBJ_TO_PTR(o_in);
+        uint8_t *sarray = (uint8_t *)source->array;
+        ndarray_obj_t *ndarray = ndarray_new_dense_ndarray(source->ndim, source->shape, NDARRAY_FLOAT);
+        mp_float_t *array = (mp_float_t *)ndarray->array;
+        
+        #if ULAB_VECTORISE_USES_FUN_POINTER
+        
+            mp_float_t (*func)(void *) = ndarray_get_float_function(source->dtype);
+            
+            #if ULAB_MAX_DIMS > 3
+            size_t i = 0;
+            do {
+            #endif
+                #if ULAB_MAX_DIMS > 2
+                size_t j = 0;
+                do {
+                #endif
+                    #if ULAB_MAX_DIMS > 1
+                    size_t k = 0;
+                    do {
+                    #endif
+                        size_t l = 0;
+                        do {
+                            mp_float_t value = func(sarray);
+                            *array++ = f(value);
+                            sarray += source->strides[ULAB_MAX_DIMS - 1];
+                            l++;
+                        } while(l < source->shape[ULAB_MAX_DIMS - 1]);
+                    #if ULAB_MAX_DIMS > 1
+                        sarray -= source->strides[ULAB_MAX_DIMS - 1] * source->shape[ULAB_MAX_DIMS-1];
+                        sarray += source->strides[ULAB_MAX_DIMS - 2];
+                        k++;
+                    } while(k < source->shape[ULAB_MAX_DIMS - 2]);
+                    #endif /* ULAB_MAX_DIMS > 1 */
+                #if ULAB_MAX_DIMS > 2
+                    sarray -= source->strides[ULAB_MAX_DIMS - 2] * source->shape[ULAB_MAX_DIMS-2];
+                    sarray += source->strides[ULAB_MAX_DIMS - 3];
+                    j++;
+                } while(j < source->shape[ULAB_MAX_DIMS - 3]);
+                #endif /* ULAB_MAX_DIMS > 2 */
+            #if ULAB_MAX_DIMS > 3
+                sarray -= source->strides[ULAB_MAX_DIMS - 3] * source->shape[ULAB_MAX_DIMS-3];
+                sarray += source->strides[ULAB_MAX_DIMS - 4];
+                i++;
+            } while(i < source->shape[ULAB_MAX_DIMS - 4]);
+            #endif /* ULAB_MAX_DIMS > 3 */
+        #else
+        if(source->dtype == NDARRAY_UINT8) {
+            ITERATE_VECTOR(uint8_t, array, source, sarray);
+        } else if(source->dtype == NDARRAY_INT8) {
+            ITERATE_VECTOR(int8_t, array, source, sarray);
+        } else if(source->dtype == NDARRAY_UINT16) {
+            ITERATE_VECTOR(uint16_t, array, source, sarray);
+        } else if(source->dtype == NDARRAY_INT16) {
+            ITERATE_VECTOR(int16_t, array, source, sarray);
+        } else {
+            ITERATE_VECTOR(mp_float_t, array, source, sarray);
+        }
+        #endif /* ULAB_VECTORISE_USES_FUN_POINTER */
+        
+        return MP_OBJ_FROM_PTR(ndarray);
+    } else if(MP_OBJ_IS_TYPE(o_in, &mp_type_tuple) || MP_OBJ_IS_TYPE(o_in, &mp_type_list) ||
+        MP_OBJ_IS_TYPE(o_in, &mp_type_range)) { // i.e., the input is a generic iterable
+        // query the length through the generic protocol: tuples, lists, and ranges
+        // do not share the memory layout of mp_obj_array_t
+        size_t len = (size_t)mp_obj_get_int(mp_obj_len_maybe(o_in));
+        ndarray_obj_t *out = ndarray_new_linear_array(len, NDARRAY_FLOAT);
+        mp_float_t *array = (mp_float_t *)out->array;
+        mp_obj_iter_buf_t iter_buf;
+        mp_obj_t item, iterable = mp_getiter(o_in, &iter_buf);
+        while ((item = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION) {
+            mp_float_t x = mp_obj_get_float(item);
+            *array++ = f(x);
+        }
+        return MP_OBJ_FROM_PTR(out);
+    }
+    return mp_const_none;
+}
+
+#if ULAB_NUMPY_HAS_ACOS
+//| def acos(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the inverse cosine function"""
+//|    ...
+//|
+
+MATH_FUN_1(acos, acos);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_acos_obj, vectorise_acos);
+#endif
+
+#if ULAB_NUMPY_HAS_ACOSH
+//| def acosh(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the inverse hyperbolic cosine function"""
+//|    ...
+//|
+
+MATH_FUN_1(acosh, acosh);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_acosh_obj, vectorise_acosh);
+#endif
+
+#if ULAB_NUMPY_HAS_ASIN
+//| def asin(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the inverse sine function"""
+//|    ...
+//|
+
+MATH_FUN_1(asin, asin);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_asin_obj, vectorise_asin);
+#endif
+
+#if ULAB_NUMPY_HAS_ASINH
+//| def asinh(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the inverse hyperbolic sine function"""
+//|    ...
+//|
+
+MATH_FUN_1(asinh, asinh);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_asinh_obj, vectorise_asinh);
+#endif
+
+#if ULAB_NUMPY_HAS_AROUND
+//| def around(a: _ArrayLike, *, decimals: int = 0) -> ulab.ndarray:
+//|    """Returns a new float array in which each element is rounded to
+//|       ``decimals`` places."""
+//|    ...
+//|
+
+mp_obj_t vectorise_around(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none} },
+        { MP_QSTR_decimals, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = 0 } }
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type)) {
+        mp_raise_TypeError(translate("first argument must be an ndarray"));
+    }
+    int8_t n = args[1].u_int;
+    mp_float_t mul = MICROPY_FLOAT_C_FUN(pow)(10.0, n);
+    ndarray_obj_t *source = MP_OBJ_TO_PTR(args[0].u_obj);
+    ndarray_obj_t *ndarray = ndarray_new_dense_ndarray(source->ndim, source->shape, NDARRAY_FLOAT);
+    mp_float_t *narray = (mp_float_t *)ndarray->array;
+    uint8_t *sarray = (uint8_t *)source->array;
+
+    mp_float_t (*func)(void *) = ndarray_get_float_function(source->dtype);
+
+    #if ULAB_MAX_DIMS > 3
+    size_t i = 0;
+    do {
+    #endif
+        #if ULAB_MAX_DIMS > 2
+        size_t j = 0;
+        do {
+        #endif
+            #if ULAB_MAX_DIMS > 1
+            size_t k = 0;
+            do {
+            #endif
+                size_t l = 0;
+                do {
+                    mp_float_t f = func(sarray);
+                    *narray++ = MICROPY_FLOAT_C_FUN(round)(f * mul) / mul;
+                    sarray += source->strides[ULAB_MAX_DIMS - 1];
+                    l++;
+                } while(l < source->shape[ULAB_MAX_DIMS - 1]);
+            #if ULAB_MAX_DIMS > 1
+                sarray -= source->strides[ULAB_MAX_DIMS - 1] * source->shape[ULAB_MAX_DIMS-1];
+                sarray += source->strides[ULAB_MAX_DIMS - 2];
+                k++;
+            } while(k < source->shape[ULAB_MAX_DIMS - 2]);
+            #endif
+        #if ULAB_MAX_DIMS > 2
+            sarray -= source->strides[ULAB_MAX_DIMS - 2] * source->shape[ULAB_MAX_DIMS-2];
+            sarray += source->strides[ULAB_MAX_DIMS - 3];
+            j++;
+        } while(j < source->shape[ULAB_MAX_DIMS - 3]);
+        #endif
+    #if ULAB_MAX_DIMS > 3
+        sarray -= source->strides[ULAB_MAX_DIMS - 3] * source->shape[ULAB_MAX_DIMS-3];
+        sarray += source->strides[ULAB_MAX_DIMS - 4];
+        i++;
+    } while(i < source->shape[ULAB_MAX_DIMS - 4]);
+    #endif
+    return MP_OBJ_FROM_PTR(ndarray);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(vectorise_around_obj, 1, vectorise_around);
+#endif
+
+#if ULAB_NUMPY_HAS_ATAN
+//| def atan(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the inverse tangent function; the return values are in the
+//|       range [-pi/2,pi/2]."""
+//|    ...
+//|
+
+MATH_FUN_1(atan, atan);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_atan_obj, vectorise_atan);
+#endif
+
+#if ULAB_NUMPY_HAS_ARCTAN2
+//| def arctan2(ya: _ArrayLike, xa: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the inverse tangent function of y/x; the return values are in
+//|       the range [-pi, pi]."""
+//|    ...
+//|
+
+mp_obj_t vectorise_arctan2(mp_obj_t y, mp_obj_t x) {
+    ndarray_obj_t *ndarray_x = ndarray_from_mp_obj(x);
+    ndarray_obj_t *ndarray_y = ndarray_from_mp_obj(y);
+
+    uint8_t ndim = 0;
+    size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+    int32_t *xstrides = m_new(int32_t, ULAB_MAX_DIMS);
+    int32_t *ystrides = m_new(int32_t, ULAB_MAX_DIMS);
+    if(!ndarray_can_broadcast(ndarray_x, ndarray_y, &ndim, shape, xstrides, ystrides)) {
+        // free the scratch arrays first: mp_raise_ValueError does not return
+        m_del(size_t, shape, ULAB_MAX_DIMS);
+        m_del(int32_t, xstrides, ULAB_MAX_DIMS);
+        m_del(int32_t, ystrides, ULAB_MAX_DIMS);
+        mp_raise_ValueError(translate("operands could not be broadcast together"));
+    }
+
+    uint8_t *xarray = (uint8_t *)ndarray_x->array;
+    uint8_t *yarray = (uint8_t *)ndarray_y->array;
+
+    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+    mp_float_t *rarray = (mp_float_t *)results->array;
+
+    mp_float_t (*funcx)(void *) = ndarray_get_float_function(ndarray_x->dtype);
+    mp_float_t (*funcy)(void *) = ndarray_get_float_function(ndarray_y->dtype);
+
+    #if ULAB_MAX_DIMS > 3
+    size_t i = 0;
+    do {
+    #endif
+        #if ULAB_MAX_DIMS > 2
+        size_t j = 0;
+        do {
+        #endif
+            #if ULAB_MAX_DIMS > 1
+            size_t k = 0;
+            do {
+            #endif
+                size_t l = 0;
+                do {
+                    mp_float_t _x = funcx(xarray);
+                    mp_float_t _y = funcy(yarray);
+                    *rarray++ = MICROPY_FLOAT_C_FUN(atan2)(_y, _x);
+                    xarray += xstrides[ULAB_MAX_DIMS - 1];
+                    yarray += ystrides[ULAB_MAX_DIMS - 1];
+                    l++;
+                } while(l < results->shape[ULAB_MAX_DIMS - 1]);
+            #if ULAB_MAX_DIMS > 1
+                xarray -= xstrides[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];
+                xarray += xstrides[ULAB_MAX_DIMS - 2];
+                yarray -= ystrides[ULAB_MAX_DIMS - 1] * results->shape[ULAB_MAX_DIMS-1];
+                yarray += ystrides[ULAB_MAX_DIMS - 2];
+                k++;
+            } while(k < results->shape[ULAB_MAX_DIMS - 2]);
+            #endif
+        #if ULAB_MAX_DIMS > 2
+            xarray -= xstrides[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];
+            xarray += xstrides[ULAB_MAX_DIMS - 3];
+            yarray -= ystrides[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];
+            yarray += ystrides[ULAB_MAX_DIMS - 3];
+            j++;
+        } while(j < results->shape[ULAB_MAX_DIMS - 3]);
+        #endif
+    #if ULAB_MAX_DIMS > 3
+        xarray -= xstrides[ULAB_MAX_DIMS - 3] * results->shape[ULAB_MAX_DIMS-3];
+        xarray += xstrides[ULAB_MAX_DIMS - 4];
+        yarray -= ystrides[ULAB_MAX_DIMS - 3] * results->shape[ULAB_MAX_DIMS-3];
+        yarray += ystrides[ULAB_MAX_DIMS - 4];
+        i++;
+    } while(i < results->shape[ULAB_MAX_DIMS - 4]);
+    #endif
+
+    return MP_OBJ_FROM_PTR(results);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_2(vectorise_arctan2_obj, vectorise_arctan2);
+#endif /* ULAB_NUMPY_HAS_ARCTAN2 */
+
+#if ULAB_NUMPY_HAS_ATANH
+//| def atanh(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the inverse hyperbolic tangent function"""
+//|    ...
+//|
+
+MATH_FUN_1(atanh, atanh);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_atanh_obj, vectorise_atanh);
+#endif
+
+#if ULAB_NUMPY_HAS_CEIL
+//| def ceil(a: _ArrayLike) -> ulab.ndarray:
+//|    """Rounds numbers up to the next whole number"""
+//|    ...
+//|
+
+MATH_FUN_1(ceil, ceil);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_ceil_obj, vectorise_ceil);
+#endif
+
+#if ULAB_NUMPY_HAS_COS
+//| def cos(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the cosine function"""
+//|    ...
+//|
+
+MATH_FUN_1(cos, cos);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_cos_obj, vectorise_cos);
+#endif
+
+#if ULAB_NUMPY_HAS_COSH
+//| def cosh(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the hyperbolic cosine function"""
+//|    ...
+//|
+
+MATH_FUN_1(cosh, cosh);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_cosh_obj, vectorise_cosh);
+#endif
+
+#if ULAB_NUMPY_HAS_DEGREES
+//| def degrees(a: _ArrayLike) -> ulab.ndarray:
+//|    """Converts angles from radians to degrees"""
+//|    ...
+//|
+
+static mp_float_t vectorise_degrees_(mp_float_t value) {
+    return value * MICROPY_FLOAT_CONST(180.0) / MP_PI;
+}
+
+static mp_obj_t vectorise_degrees(mp_obj_t x_obj) {
+    return vectorise_generic_vector(x_obj, vectorise_degrees_);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_degrees_obj, vectorise_degrees);
+#endif
+
+#if ULAB_SCIPY_SPECIAL_HAS_ERF
+//| def erf(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the error function, which has applications in statistics"""
+//|    ...
+//|
+
+MATH_FUN_1(erf, erf);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_erf_obj, vectorise_erf);
+#endif
+
+#if ULAB_SCIPY_SPECIAL_HAS_ERFC
+//| def erfc(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the complementary error function, which has applications in statistics"""
+//|    ...
+//|
+
+MATH_FUN_1(erfc, erfc);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_erfc_obj, vectorise_erfc);
+#endif
+
+#if ULAB_NUMPY_HAS_EXP
+//| def exp(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the exponential function."""
+//|    ...
+//|
+
+MATH_FUN_1(exp, exp);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_exp_obj, vectorise_exp);
+#endif
+
+#if ULAB_NUMPY_HAS_EXPM1
+//| def expm1(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes $e^x-1$.  In certain applications, using this function preserves numeric accuracy better than the `exp` function."""
+//|    ...
+//|
+
+MATH_FUN_1(expm1, expm1);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_expm1_obj, vectorise_expm1);
+#endif
+
+#if ULAB_NUMPY_HAS_FLOOR
+//| def floor(a: _ArrayLike) -> ulab.ndarray:
+//|    """Rounds numbers down to the previous whole number"""
+//|    ...
+//|
+
+MATH_FUN_1(floor, floor);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_floor_obj, vectorise_floor);
+#endif
+
+#if ULAB_SCIPY_SPECIAL_HAS_GAMMA
+//| def gamma(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the gamma function"""
+//|    ...
+//|
+
+MATH_FUN_1(gamma, tgamma);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_gamma_obj, vectorise_gamma);
+#endif
+
+#if ULAB_SCIPY_SPECIAL_HAS_GAMMALN
+//| def lgamma(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the natural log of the gamma function"""
+//|    ...
+//|
+
+MATH_FUN_1(lgamma, lgamma);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_lgamma_obj, vectorise_lgamma);
+#endif
+
+#if ULAB_NUMPY_HAS_LOG
+//| def log(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the natural log"""
+//|    ...
+//|
+
+MATH_FUN_1(log, log);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_log_obj, vectorise_log);
+#endif
+
+#if ULAB_NUMPY_HAS_LOG10
+//| def log10(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the log base 10"""
+//|    ...
+//|
+
+MATH_FUN_1(log10, log10);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_log10_obj, vectorise_log10);
+#endif
+
+#if ULAB_NUMPY_HAS_LOG2
+//| def log2(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the log base 2"""
+//|    ...
+//|
+
+MATH_FUN_1(log2, log2);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_log2_obj, vectorise_log2);
+#endif
+
+#if ULAB_NUMPY_HAS_RADIANS
+//| def radians(a: _ArrayLike) -> ulab.ndarray:
+//|    """Converts angles from degrees to radians"""
+//|    ...
+//|
+
+static mp_float_t vectorise_radians_(mp_float_t value) {
+    return value * MP_PI / MICROPY_FLOAT_CONST(180.0);
+}
+
+static mp_obj_t vectorise_radians(mp_obj_t x_obj) {
+    return vectorise_generic_vector(x_obj, vectorise_radians_);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_radians_obj, vectorise_radians);
+#endif
+
+#if ULAB_NUMPY_HAS_SIN
+//| def sin(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the sine function"""
+//|    ...
+//|
+
+MATH_FUN_1(sin, sin);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_sin_obj, vectorise_sin);
+#endif
+
+#if ULAB_NUMPY_HAS_SINH
+//| def sinh(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the hyperbolic sine"""
+//|    ...
+//|
+
+MATH_FUN_1(sinh, sinh);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_sinh_obj, vectorise_sinh);
+#endif
+
+#if ULAB_NUMPY_HAS_SQRT
+//| def sqrt(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the square root"""
+//|    ...
+//|
+
+MATH_FUN_1(sqrt, sqrt);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_sqrt_obj, vectorise_sqrt);
+#endif
+
+#if ULAB_NUMPY_HAS_TAN
+//| def tan(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the tangent"""
+//|    ...
+//|
+
+MATH_FUN_1(tan, tan);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_tan_obj, vectorise_tan);
+#endif
+
+#if ULAB_NUMPY_HAS_TANH
+//| def tanh(a: _ArrayLike) -> ulab.ndarray:
+//|    """Computes the hyperbolic tangent"""
+//|    ...
+
+MATH_FUN_1(tanh, tanh);
+MP_DEFINE_CONST_FUN_OBJ_1(vectorise_tanh_obj, vectorise_tanh);
+#endif
+
+#if ULAB_NUMPY_HAS_VECTORIZE
+static mp_obj_t vectorise_vectorized_function_call(mp_obj_t self_in, size_t n_args, size_t n_kw, const mp_obj_t *args) {
+    (void) n_args;
+    (void) n_kw;
+    vectorized_function_obj_t *self = MP_OBJ_TO_PTR(self_in);
+    mp_obj_t avalue[1];
+    mp_obj_t fvalue;
+    if(MP_OBJ_IS_TYPE(args[0], &ulab_ndarray_type)) {
+        ndarray_obj_t *source = MP_OBJ_TO_PTR(args[0]);
+        ndarray_obj_t *ndarray = ndarray_new_dense_ndarray(source->ndim, source->shape, self->otypes);
+        for(size_t i=0; i < source->len; i++) {
+            avalue[0] = mp_binary_get_val_array(source->dtype, source->array, i);
+            fvalue = self->type->call(self->fun, 1, 0, avalue);
+            mp_binary_set_val_array(self->otypes, ndarray->array, i, fvalue);
+        }
+        return MP_OBJ_FROM_PTR(ndarray);
+    } else if(MP_OBJ_IS_TYPE(args[0], &mp_type_tuple) || MP_OBJ_IS_TYPE(args[0], &mp_type_list) ||
+        MP_OBJ_IS_TYPE(args[0], &mp_type_range)) { // i.e., the input is a generic iterable
+        size_t len = (size_t)mp_obj_get_int(mp_obj_len_maybe(args[0]));
+        ndarray_obj_t *ndarray = ndarray_new_linear_array(len, self->otypes);
+        mp_obj_iter_buf_t iter_buf;
+        mp_obj_t iterable = mp_getiter(args[0], &iter_buf);
+        size_t i=0;
+        while ((avalue[0] = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION) {
+            fvalue = self->type->call(self->fun, 1, 0, avalue);
+            mp_binary_set_val_array(self->otypes, ndarray->array, i, fvalue);
+            i++;
+        }
+        return MP_OBJ_FROM_PTR(ndarray);
+    } else if(mp_obj_is_int(args[0]) || mp_obj_is_float(args[0])) {
+        ndarray_obj_t *ndarray = ndarray_new_linear_array(1, self->otypes);
+        fvalue = self->type->call(self->fun, 1, 0, args);
+        mp_binary_set_val_array(self->otypes, ndarray->array, 0, fvalue);
+        return MP_OBJ_FROM_PTR(ndarray);
+    } else {
+        mp_raise_ValueError(translate("wrong input type"));
+    }
+    return mp_const_none;
+}
+
+const mp_obj_type_t vectorise_function_type = {
+    { &mp_type_type },
+    .name = MP_QSTR_,
+    .call = vectorise_vectorized_function_call,
+};
+
+//| def vectorize(
+//|     f: Union[Callable[[int], float], Callable[[float], float]],
+//|     *,
+//|     otypes: Optional[_DType] = None
+//| ) -> Callable[[_ArrayLike], ulab.ndarray]:
+//|    """
+//|    :param callable f: The function to wrap
+//|    :param otypes: List of array types that may be returned by the function.  None is interpreted to mean the return value is float.
+//|
+//|    Wrap a Python function ``f`` so that it can be applied to arrays.
+//|    The callable must return only values of the types specified by ``otypes``, or the result is undefined."""
+//|    ...
+//|
+
+static mp_obj_t vectorise_vectorize(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none} },
+        { MP_QSTR_otypes, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none} }
+    };
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+    const mp_obj_type_t *type = mp_obj_get_type(args[0].u_obj);
+    if(type->call == NULL) {
+        mp_raise_TypeError(translate("first argument must be a callable"));
+    }
+    mp_obj_t _otypes = args[1].u_obj;
+    uint8_t otypes = NDARRAY_FLOAT;
+    if(_otypes == mp_const_none) {
+        // TODO: is this what numpy does?
+        otypes = NDARRAY_FLOAT;
+    } else if(mp_obj_is_int(_otypes)) {
+        otypes = mp_obj_get_int(_otypes);
+        if(otypes != NDARRAY_FLOAT && otypes != NDARRAY_UINT8 && otypes != NDARRAY_INT8 &&
+            otypes != NDARRAY_UINT16 && otypes != NDARRAY_INT16) {
+                mp_raise_ValueError(translate("wrong output type"));
+        }
+    }
+    else {
+        mp_raise_ValueError(translate("wrong output type"));
+    }
+    vectorized_function_obj_t *function = m_new_obj(vectorized_function_obj_t);
+    function->base.type = &vectorise_function_type;
+    function->otypes = otypes;
+    function->fun = args[0].u_obj;
+    function->type = type;
+    return MP_OBJ_FROM_PTR(function);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(vectorise_vectorize_obj, 1, vectorise_vectorize);
+#endif
--- a/code/numpy/vector/vector.h
+++ b/code/numpy/vector/vector.h
@ -0,0 +1,156 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2019-2021 Zoltán Vörös
+*/
+
+#ifndef _VECTOR_
+#define _VECTOR_
+
+#include "../../ulab.h"
+#include "../../ndarray.h"
+
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_acos_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_acosh_obj);
+MP_DECLARE_CONST_FUN_OBJ_2(vectorise_arctan2_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(vectorise_around_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_asin_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_asinh_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_atan_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_atanh_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_ceil_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_cos_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_cosh_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_degrees_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_erf_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_erfc_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_exp_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_expm1_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_floor_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_gamma_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_lgamma_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_log_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_log10_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_log2_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_radians_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_sin_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_sinh_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_sqrt_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_tan_obj);
+MP_DECLARE_CONST_FUN_OBJ_1(vectorise_tanh_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(vectorise_vectorize_obj);
+
+typedef struct _vectorized_function_obj_t {
+    mp_obj_base_t base;
+    uint8_t otypes;
+    mp_obj_t fun;
+    const mp_obj_type_t *type;
+} vectorized_function_obj_t;
+
+#if ULAB_HAS_FUNCTION_ITERATOR
+#define ITERATE_VECTOR(type, array, source, sarray)\
+({\
+    size_t *scoords = ndarray_new_coords((source)->ndim);\
+    for(size_t i=0; i < (source)->len/(source)->shape[ULAB_MAX_DIMS -1]; i++) {\
+        for(size_t l=0; l < (source)->shape[ULAB_MAX_DIMS - 1]; l++) {\
+            *(array)++ = f(*((type *)(sarray)));\
+            (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];\
+        }\
+        ndarray_rewind_array((source)->ndim, sarray, (source)->shape, (source)->strides, scoords);\
+    }\
+})
+
+#else
+
+#if ULAB_MAX_DIMS == 4
+#define ITERATE_VECTOR(type, array, source, sarray) do {\
+    size_t i=0;\
+    do {\
+        size_t j = 0;\
+        do {\
+            size_t k = 0;\
+            do {\
+                size_t l = 0;\
+                do {\
+                    *(array)++ = f(*((type *)(sarray)));\
+                    (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];\
+                    l++;\
+                } while(l < (source)->shape[ULAB_MAX_DIMS-1]);\
+                (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];\
+                (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];\
+                k++;\
+            } while(k < (source)->shape[ULAB_MAX_DIMS-2]);\
+            (sarray) -= (source)->strides[ULAB_MAX_DIMS - 2] * (source)->shape[ULAB_MAX_DIMS-2];\
+            (sarray) += (source)->strides[ULAB_MAX_DIMS - 3];\
+            j++;\
+        } while(j < (source)->shape[ULAB_MAX_DIMS-3]);\
+        (sarray) -= (source)->strides[ULAB_MAX_DIMS - 3] * (source)->shape[ULAB_MAX_DIMS-3];\
+        (sarray) += (source)->strides[ULAB_MAX_DIMS - 4];\
+        i++;\
+    } while(i < (source)->shape[ULAB_MAX_DIMS-4]);\
+} while(0)
+#endif /* ULAB_MAX_DIMS == 4 */
+
+#if ULAB_MAX_DIMS == 3
+#define ITERATE_VECTOR(type, array, source, sarray) do {\
+    size_t j = 0;\
+    do {\
+        size_t k = 0;\
+        do {\
+            size_t l = 0;\
+            do {\
+                *(array)++ = f(*((type *)(sarray)));\
+                (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];\
+                l++;\
+            } while(l < (source)->shape[ULAB_MAX_DIMS-1]);\
+            (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];\
+            (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];\
+            k++;\
+        } while(k < (source)->shape[ULAB_MAX_DIMS-2]);\
+        (sarray) -= (source)->strides[ULAB_MAX_DIMS - 2] * (source)->shape[ULAB_MAX_DIMS-2];\
+        (sarray) += (source)->strides[ULAB_MAX_DIMS - 3];\
+        j++;\
+    } while(j < (source)->shape[ULAB_MAX_DIMS-3]);\
+} while(0)
+#endif /* ULAB_MAX_DIMS == 3 */
+
+#if ULAB_MAX_DIMS == 2
+#define ITERATE_VECTOR(type, array, source, sarray) do {\
+    size_t k = 0;\
+    do {\
+        size_t l = 0;\
+        do {\
+            *(array)++ = f(*((type *)(sarray)));\
+            (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];\
+            l++;\
+        } while(l < (source)->shape[ULAB_MAX_DIMS-1]);\
+        (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];\
+        (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];\
+        k++;\
+    } while(k < (source)->shape[ULAB_MAX_DIMS-2]);\
+} while(0)
+#endif /* ULAB_MAX_DIMS == 2 */
+
+#if ULAB_MAX_DIMS == 1
+#define ITERATE_VECTOR(type, array, source, sarray) do {\
+    size_t l = 0;\
+    do {\
+        *(array)++ = f(*((type *)(sarray)));\
+        (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];\
+        l++;\
+    } while(l < (source)->shape[ULAB_MAX_DIMS-1]);\
+} while(0)
+#endif /* ULAB_MAX_DIMS == 1 */
+#endif /* ULAB_HAS_FUNCTION_ITERATOR */
+
+#define MATH_FUN_1(py_name, c_name) \
+    static mp_obj_t vectorise_ ## py_name(mp_obj_t x_obj) { \
+        return vectorise_generic_vector(x_obj, MICROPY_FLOAT_C_FUN(c_name)); \
+}
+
+#endif /* _VECTOR_ */
--- a/code/scipy/optimize/optimize.c
+++ b/code/scipy/optimize/optimize.c
@ -0,0 +1,414 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020-2021 Zoltán Vörös
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include "py/obj.h"
+#include "py/runtime.h"
+#include "py/misc.h"
+
+#include "../../ndarray.h"
+#include "../../ulab.h"
+#include "../../ulab_tools.h"
+#include "optimize.h"
+
+const mp_obj_float_t xtolerance = {{&mp_type_float}, MICROPY_FLOAT_CONST(2.4e-7)};
+const mp_obj_float_t rtolerance = {{&mp_type_float}, MICROPY_FLOAT_CONST(0.0)};
+
+static mp_float_t optimize_python_call(const mp_obj_type_t *type, mp_obj_t fun, mp_float_t x, mp_obj_t *fargs, uint8_t nparams) {
+    // Helper function for calculating the value of f(x, a, b, c, ...),
+    // where f is defined in python. Takes a float, returns a float.
+    // The array of mp_obj_t type must be supplied, as must the number of parameters (a, b, c...) in nparams
+    fargs[0] = mp_obj_new_float(x);
+    return mp_obj_get_float(type->call(fun, nparams+1, 0, fargs));
+}
+
+#if ULAB_SCIPY_OPTIMIZE_HAS_BISECT
+//| def bisect(
+//|     fun: Callable[[float], float],
+//|     a: float,
+//|     b: float,
+//|     *,
+//|     xtol: float = 2.4e-7,
+//|     maxiter: int = 100
+//| ) -> float:
+//|     """
+//|     :param callable fun: The function to bisect
+//|     :param float a: The left side of the interval
+//|     :param float b: The right side of the interval
+//|     :param float xtol: The tolerance value
+//|     :param int maxiter: The maximum number of iterations to perform
+//|
+//|     Find a solution (zero) of the function ``f(x)`` on the interval
+//|     (``a``..``b``) using the bisection method.  The result is accurate to within
+//|     ``xtol`` unless more than ``maxiter`` steps are required."""
+//|     ...
+//|
+
+STATIC mp_obj_t optimize_bisect(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    // Simple bisection routine
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_xtol, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = MP_ROM_PTR(&xtolerance)} },
+        { MP_QSTR_maxiter, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = 100} },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    mp_obj_t fun = args[0].u_obj;
+    const mp_obj_type_t *type = mp_obj_get_type(fun);
+    if(type->call == NULL) {
+        mp_raise_TypeError(translate("first argument must be a function"));
+    }
+    mp_float_t xtol = mp_obj_get_float(args[3].u_obj);
+    mp_obj_t *fargs = m_new(mp_obj_t, 1);
+    mp_float_t left, right;
+    mp_float_t x_mid;
+    mp_float_t a = mp_obj_get_float(args[1].u_obj);
+    mp_float_t b = mp_obj_get_float(args[2].u_obj);
+    left = optimize_python_call(type, fun, a, fargs, 0);
+    right = optimize_python_call(type, fun, b, fargs, 0);
+    if(left * right > 0) {
+        mp_raise_ValueError(translate("function has the same sign at the ends of interval"));
+    }
+    mp_float_t rtb = left < MICROPY_FLOAT_CONST(0.0) ? a : b;
+    mp_float_t dx = left < MICROPY_FLOAT_CONST(0.0) ? b - a : a - b;
+    if(args[4].u_int <= 0) {
+        mp_raise_ValueError(translate("maxiter should be > 0"));
+    }
+    for(uint16_t i=0; i < args[4].u_int; i++) {
+        dx *= MICROPY_FLOAT_CONST(0.5);
+        x_mid = rtb + dx;
+        if(optimize_python_call(type, fun, x_mid, fargs, 0) < MICROPY_FLOAT_CONST(0.0)) {
+            rtb = x_mid;
+        }
+        if(MICROPY_FLOAT_C_FUN(fabs)(dx) < xtol) break;
+    }
+    return mp_obj_new_float(rtb);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(optimize_bisect_obj, 3, optimize_bisect);
+#endif
+
+#if ULAB_SCIPY_OPTIMIZE_HAS_FMIN
+//| def fmin(
+//|     fun: Callable[[float], float],
+//|     x0: float,
+//|     *,
+//|     xatol: float = 2.4e-7,
+//|     fatol: float = 2.4e-7,
+//|     maxiter: int = 200
+//| ) -> float:
+//|     """
+//|     :param callable fun: The function to minimize
+//|     :param float x0: The initial x value
+//|     :param float xatol: The absolute tolerance in ``x``
+//|     :param float fatol: The absolute tolerance in ``f(x)``
+//|
+//|     Find a minimum of the function ``f(x)`` using the downhill simplex method.
+//|     The located ``x`` is within ``xatol`` of the actual minimum, and ``f(x)``
+//|     is within ``fatol`` of the actual minimum unless more than ``maxiter``
+//|     steps are required."""
+//|     ...
+//|
+
+STATIC mp_obj_t optimize_fmin(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    // downhill simplex method in 1D
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_xatol, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = MP_ROM_PTR(&xtolerance)} },
+        { MP_QSTR_fatol, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = MP_ROM_PTR(&xtolerance)} },
+        { MP_QSTR_maxiter, MP_ARG_KW_ONLY | MP_ARG_INT, {.u_int = 200} },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    mp_obj_t fun = args[0].u_obj;
+    const mp_obj_type_t *type = mp_obj_get_type(fun);
+    if(type->call == NULL) {
+        mp_raise_TypeError(translate("first argument must be a function"));
+    }
+
+    // parameters controlling convergence conditions
+    mp_float_t xatol = mp_obj_get_float(args[2].u_obj);
+    mp_float_t fatol = mp_obj_get_float(args[3].u_obj);
+    if(args[4].u_int <= 0) {
+        mp_raise_ValueError(translate("maxiter must be > 0"));
+    }
+    uint16_t maxiter = (uint16_t)args[4].u_int;
+
+    mp_float_t x0 = mp_obj_get_float(args[1].u_obj);
+    mp_float_t x1 = x0 != MICROPY_FLOAT_CONST(0.0) ? (MICROPY_FLOAT_CONST(1.0) + OPTIMIZE_NONZDELTA) * x0 : OPTIMIZE_ZDELTA;
+    mp_obj_t *fargs = m_new(mp_obj_t, 1);
+    mp_float_t f0 = optimize_python_call(type, fun, x0, fargs, 0);
+    mp_float_t f1 = optimize_python_call(type, fun, x1, fargs, 0);
+    if(f1 < f0) {
+        SWAP(mp_float_t, x0, x1);
+        SWAP(mp_float_t, f0, f1);
+    }
+    for(uint16_t i=0; i < maxiter; i++) {
+        uint8_t shrink = 0;
+        f0 = optimize_python_call(type, fun, x0, fargs, 0);
+        f1 = optimize_python_call(type, fun, x1, fargs, 0);
+
+        // reflection
+        mp_float_t xr = (MICROPY_FLOAT_CONST(1.0) + OPTIMIZE_ALPHA) * x0 - OPTIMIZE_ALPHA * x1;
+        mp_float_t fr = optimize_python_call(type, fun, xr, fargs, 0);
+        if(fr < f0) { // expansion
+            mp_float_t xe = (1 + OPTIMIZE_ALPHA * OPTIMIZE_BETA) * x0 - OPTIMIZE_ALPHA * OPTIMIZE_BETA * x1;
+            mp_float_t fe = optimize_python_call(type, fun, xe, fargs, 0);
+            if(fe < fr) {
+                x1 = xe;
+                f1 = fe;
+            } else {
+                x1 = xr;
+                f1 = fr;
+            }
+        } else {
+            if(fr < f1) { // contraction
+                mp_float_t xc = (1 + OPTIMIZE_GAMMA * OPTIMIZE_ALPHA) * x0 - OPTIMIZE_GAMMA * OPTIMIZE_ALPHA * x1;
+                mp_float_t fc = optimize_python_call(type, fun, xc, fargs, 0);
+                if(fc < fr) {
+                    x1 = xc;
+                    f1 = fc;
+                } else {
+                    shrink = 1;
+                }
+            } else { // inside contraction
+                mp_float_t xc = (MICROPY_FLOAT_CONST(1.0) - OPTIMIZE_GAMMA) * x0 + OPTIMIZE_GAMMA * x1;
+                mp_float_t fc = optimize_python_call(type, fun, xc, fargs, 0);
+                if(fc < f1) {
+                    x1 = xc;
+                    f1 = fc;
+                } else {
+                    shrink = 1;
+                }
+            }
+            if(shrink == 1) {
+                x1 = x0 + OPTIMIZE_DELTA * (x1 - x0);
+                f1 = optimize_python_call(type, fun, x1, fargs, 0);
+            }
+            if((MICROPY_FLOAT_C_FUN(fabs)(f1 - f0) < fatol) ||
+                (MICROPY_FLOAT_C_FUN(fabs)(x1 - x0) < xatol)) {
+                break;
+            }
+            if(f1 < f0) {
+                SWAP(mp_float_t, x0, x1);
+                SWAP(mp_float_t, f0, f1);
+            }
+        }
+    }
+    return mp_obj_new_float(x0);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(optimize_fmin_obj, 2, optimize_fmin);
+#endif
+
+#if ULAB_SCIPY_OPTIMIZE_HAS_CURVE_FIT
+static void optimize_jacobi(const mp_obj_type_t *type, mp_obj_t fun, mp_float_t *x, mp_float_t *y, uint16_t len, mp_float_t *params, uint8_t nparams, mp_float_t *jacobi, mp_float_t *grad) {
+    /* Calculates the Jacobian and the gradient of the cost function
+     *
+     * The entries in the Jacobian are
+     * J(m, n) = de_m/da_n,
+     *
+     * where
+     *
+     * e_m = (f(x_m, a1, a2, ...) - y_m)/sigma_m is the error at x_m,
+     *
+     * and
+     *
+     * a1, a2, ..., a_n are the free parameters
+     */
+    mp_obj_t *fargs0 = m_new(mp_obj_t, nparams+1);
+    mp_obj_t *fargs1 = m_new(mp_obj_t, nparams+1);
+    for(uint8_t p=0; p < nparams; p++) {
+        fargs0[p+1] = mp_obj_new_float(params[p]);
+        fargs1[p+1] = mp_obj_new_float(params[p]);
+    }
+    for(uint8_t p=0; p < nparams; p++) {
+        mp_float_t da = params[p] != MICROPY_FLOAT_CONST(0.0) ? (MICROPY_FLOAT_CONST(1.0) + OPTIMIZE_NONZDELTA) * params[p] : OPTIMIZE_ZDELTA;
+        fargs1[p+1] = mp_obj_new_float(params[p] + da);
+        grad[p] = MICROPY_FLOAT_CONST(0.0);
+        for(uint16_t i=0; i < len; i++) {
+            mp_float_t f0 = optimize_python_call(type, fun, x[i], fargs0, nparams);
+            mp_float_t f1 = optimize_python_call(type, fun, x[i], fargs1, nparams);
+            jacobi[i*nparams+p] = (f1 - f0) / da;
+            grad[p] += (f0 - y[i]) * jacobi[i*nparams+p];
+        }
+        fargs1[p+1] = fargs0[p+1]; // set back to the original value
+    }
+}
+
+static void optimize_delta(mp_float_t *jacobi, mp_float_t *grad, uint16_t len, uint8_t nparams, mp_float_t lambda) {
+    //
+}
+
+mp_obj_t optimize_curve_fit(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    // Levenberg-Marquardt non-linear fit
+    // The implementation follows the introductory discussion in Mark Transtrum's paper, https://arxiv.org/abs/1201.5885
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_p0, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_xatol, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = MP_ROM_PTR(&xtolerance)} },
+        { MP_QSTR_fatol, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = MP_ROM_PTR(&xtolerance)} },
+        { MP_QSTR_maxiter, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none} },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    mp_obj_t fun = args[0].u_obj;
+    const mp_obj_type_t *type = mp_obj_get_type(fun);
+    if(type->call == NULL) {
+        mp_raise_TypeError(translate("first argument must be a function"));
+    }
+
+    mp_obj_t x_obj = args[1].u_obj;
+    mp_obj_t y_obj = args[2].u_obj;
+    mp_obj_t p0_obj = args[3].u_obj;
+    if(!ndarray_object_is_array_like(x_obj) || !ndarray_object_is_array_like(y_obj)) {
+        mp_raise_TypeError(translate("data must be iterable"));
+    }
+    if(!ndarray_object_is_nditerable(p0_obj)) {
+        mp_raise_TypeError(translate("initial values must be iterable"));
+    }
+    size_t len = (size_t)mp_obj_get_int(mp_obj_len_maybe(x_obj));
+    uint8_t lenp = (uint8_t)mp_obj_get_int(mp_obj_len_maybe(p0_obj));
+    if(len != (size_t)mp_obj_get_int(mp_obj_len_maybe(y_obj))) {
+        mp_raise_ValueError(translate("data must be of equal length"));
+    }
+
+    mp_float_t *x = m_new(mp_float_t, len);
+    fill_array_iterable(x, x_obj);
+    mp_float_t *y = m_new(mp_float_t, len);
+    fill_array_iterable(y, y_obj);
+    mp_float_t *p0 = m_new(mp_float_t, lenp);
+    fill_array_iterable(p0, p0_obj);
+    mp_float_t *grad = m_new(mp_float_t, len);
+    mp_float_t *jacobi = m_new(mp_float_t, len*len);
+    mp_obj_t *fargs = m_new(mp_obj_t, lenp+1);
+
+    m_del(mp_float_t, p0, lenp);
+    // parameters controlling convergence conditions
+    //mp_float_t xatol = mp_obj_get_float(args[2].u_obj);
+    //mp_float_t fatol = mp_obj_get_float(args[3].u_obj);
+
+    // this has finite binary representation; we will multiply/divide by 4
+    //mp_float_t lambda = 0.0078125;
+
+    //linalg_invert_matrix(mp_float_t *data, size_t N)
+
+    m_del(mp_float_t, x, len);
+    m_del(mp_float_t, y, len);
+    m_del(mp_float_t, grad, len);
+    m_del(mp_float_t, jacobi, len*len);
+    m_del(mp_obj_t, fargs, lenp+1);
+    return mp_const_none;
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(optimize_curve_fit_obj, 2, optimize_curve_fit);
+#endif
+
+#if ULAB_SCIPY_OPTIMIZE_HAS_NEWTON
+//| def newton(
+//|     fun: Callable[[float], float],
+//|     x0: float,
+//|     *,
+//|     tol: float = 2.4e-7,
+//|     rtol: float = 0.0,
+//|     maxiter: int = 50
+//| ) -> float:
+//|     """
+//|     :param callable fun: The function whose root is sought
+//|     :param float x0: The initial x value
+//|     :param float tol: The absolute tolerance value
+//|     :param float rtol: The relative tolerance value
+//|     :param int maxiter: The maximum number of iterations to perform
+//|
+//|     Find a solution (zero) of the function ``f(x)`` using Newton's Method.
+//|     The result is accurate to within ``tol + rtol * |x|`` unless more than
+//|     ``maxiter`` steps are required."""
+//|     ...
+//|
+
+static mp_obj_t optimize_newton(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    // this is actually the secant method, as the first derivative of the function
+    // is not accepted as an argument. The function whose root we want to solve for
+    // must depend on a single variable without parameters, i.e., f(x)
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_tol, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = MP_ROM_PTR(&xtolerance) } },
+        { MP_QSTR_rtol, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = MP_ROM_PTR(&rtolerance) } },
+        { MP_QSTR_maxiter, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = 50 } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    mp_obj_t fun = args[0].u_obj;
+    const mp_obj_type_t *type = mp_obj_get_type(fun);
+    if(type->call == NULL) {
+        mp_raise_TypeError(translate("first argument must be a function"));
+    }
+    mp_float_t x = mp_obj_get_float(args[1].u_obj);
+    mp_float_t tol = mp_obj_get_float(args[2].u_obj);
+    mp_float_t rtol = mp_obj_get_float(args[3].u_obj);
+    mp_float_t dx, df, fx;
+    dx = x > MICROPY_FLOAT_CONST(0.0) ? OPTIMIZE_EPS * x : -OPTIMIZE_EPS * x;
+    mp_obj_t *fargs = m_new(mp_obj_t, 1);
+    if(args[4].u_int <= 0) {
+        mp_raise_ValueError(translate("maxiter must be > 0"));
+    }
+    for(uint16_t i=0; i < args[4].u_int; i++) {
+        fx = optimize_python_call(type, fun, x, fargs, 0);
+        df = (optimize_python_call(type, fun, x + dx, fargs, 0) - fx) / dx;
+        dx = fx / df;
+        x -= dx;
+        if(MICROPY_FLOAT_C_FUN(fabs)(dx) < (tol + rtol * MICROPY_FLOAT_C_FUN(fabs)(x))) break;
+    }
+    return mp_obj_new_float(x);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(optimize_newton_obj, 2, optimize_newton);
+#endif
+
+static const mp_rom_map_elem_t ulab_scipy_optimize_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_optimize) },
+    #if ULAB_SCIPY_OPTIMIZE_HAS_BISECT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_bisect), (mp_obj_t)&optimize_bisect_obj },
+    #endif
+    #if ULAB_SCIPY_OPTIMIZE_HAS_CURVE_FIT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_curve_fit), (mp_obj_t)&optimize_curve_fit_obj },
+    #endif
+    #if ULAB_SCIPY_OPTIMIZE_HAS_FMIN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_fmin), (mp_obj_t)&optimize_fmin_obj },
+    #endif
+    #if ULAB_SCIPY_OPTIMIZE_HAS_NEWTON
+        { MP_OBJ_NEW_QSTR(MP_QSTR_newton), (mp_obj_t)&optimize_newton_obj },
+    #endif
+};
+
+static MP_DEFINE_CONST_DICT(mp_module_ulab_scipy_optimize_globals, ulab_scipy_optimize_globals_table);
+
+mp_obj_module_t ulab_scipy_optimize_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_scipy_optimize_globals,
+};
--- a/code/scipy/optimize/optimize.h
+++ b/code/scipy/optimize/optimize.h
@ -0,0 +1,33 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ *               
+*/
+
+#ifndef _SCIPY_OPTIMIZE_
+#define _SCIPY_OPTIMIZE_
+
+#include "../../ulab_tools.h"
+
+#define     OPTIMIZE_EPS          MICROPY_FLOAT_CONST(1.0e-4)
+#define     OPTIMIZE_NONZDELTA    MICROPY_FLOAT_CONST(0.05)
+#define     OPTIMIZE_ZDELTA       MICROPY_FLOAT_CONST(0.00025)
+#define     OPTIMIZE_ALPHA        MICROPY_FLOAT_CONST(1.0)
+#define     OPTIMIZE_BETA         MICROPY_FLOAT_CONST(2.0)
+#define     OPTIMIZE_GAMMA        MICROPY_FLOAT_CONST(0.5)
+#define     OPTIMIZE_DELTA        MICROPY_FLOAT_CONST(0.5)
+
+extern mp_obj_module_t ulab_scipy_optimize_module;
+
+MP_DECLARE_CONST_FUN_OBJ_KW(optimize_bisect_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(optimize_curve_fit_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(optimize_fmin_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(optimize_newton_obj);
+
+#endif /* _SCIPY_OPTIMIZE_ */
--- a/code/scipy/scipy.c
+++ b/code/scipy/scipy.c
@ -0,0 +1,47 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020-2021 Zoltán Vörös
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include "py/runtime.h"
+
+#include "../ulab.h"
+#include "optimize/optimize.h"
+#include "signal/signal.h"
+#include "special/special.h"
+
+#if ULAB_HAS_SCIPY
+
+//| """Compatibility layer for scipy"""
+//|
+
+static const mp_rom_map_elem_t ulab_scipy_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_scipy) },
+    #if ULAB_SCIPY_HAS_OPTIMIZE_MODULE
+        { MP_ROM_QSTR(MP_QSTR_optimize), MP_ROM_PTR(&ulab_scipy_optimize_module) },
+    #endif
+    #if ULAB_SCIPY_HAS_SIGNAL_MODULE
+        { MP_ROM_QSTR(MP_QSTR_signal), MP_ROM_PTR(&ulab_scipy_signal_module) },
+    #endif
+    #if ULAB_SCIPY_HAS_SPECIAL_MODULE
+        { MP_ROM_QSTR(MP_QSTR_special), MP_ROM_PTR(&ulab_scipy_special_module) },
+    #endif
+};
+
+static MP_DEFINE_CONST_DICT(mp_module_ulab_scipy_globals, ulab_scipy_globals_table);
+
+mp_obj_module_t ulab_scipy_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_scipy_globals,
+};
+#endif
--- a/code/scipy/scipy.h
+++ b/code/scipy/scipy.h
@ -0,0 +1,21 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ *               
+*/
+
+#ifndef _SCIPY_
+#define _SCIPY_
+
+#include "ulab.h"
+#include "ndarray.h"
+
+extern mp_obj_module_t ulab_scipy_module;
+
+#endif /* _SCIPY_ */
--- a/code/scipy/signal/signal.c
+++ b/code/scipy/signal/signal.c
@ -0,0 +1,153 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020-2021 Zoltán Vörös
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include <string.h>
+#include "py/runtime.h"
+
+#include "../../ulab.h"
+#include "../../ndarray.h"
+#include "../../numpy/fft/fft_tools.h"
+
+#if ULAB_SCIPY_SIGNAL_HAS_SPECTROGRAM
+//| def spectrogram(r: ulab.ndarray) -> ulab.ndarray:
+//|     """
+//|     :param ulab.ndarray r: A 1-dimensional array of values whose size is a power of 2
+//|
+//|     Computes the spectrum of the input signal.  This is the absolute value of the (complex-valued) fft of the signal.
+//|     This function is similar to scipy's ``scipy.signal.spectrogram``."""
+//|     ...
+//|
+
+mp_obj_t signal_spectrogram(size_t n_args, const mp_obj_t *args) {
+    if(n_args == 2) {
+        return fft_fft_ifft_spectrogram(n_args, args[0], args[1], FFT_SPECTROGRAM);
+    } else {
+        return fft_fft_ifft_spectrogram(n_args, args[0], mp_const_none, FFT_SPECTROGRAM);
+    }
+}
+
+MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(signal_spectrogram_obj, 1, 2, signal_spectrogram);
+#endif /* ULAB_SCIPY_SIGNAL_HAS_SPECTROGRAM */
+
+#if ULAB_SCIPY_SIGNAL_HAS_SOSFILT
+static void signal_sosfilt_array(mp_float_t *x, const mp_float_t *coeffs, mp_float_t *zf, const size_t len) {
+    for(size_t i=0; i < len; i++) {
+        mp_float_t xn = *x;
+        *x = coeffs[0] * xn + zf[0];
+        zf[0] = zf[1] + coeffs[1] * xn - coeffs[4] * *x;
+        zf[1] = coeffs[2] * xn - coeffs[5] * *x;
+        x++;
+    }
+    x -= len;
+}
+
+mp_obj_t signal_sosfilt(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_sos, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_x, MP_ARG_REQUIRED | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+        { MP_QSTR_zi, MP_ARG_KW_ONLY | MP_ARG_OBJ, {.u_rom_obj = mp_const_none } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    if(!ndarray_object_is_array_like(args[0].u_obj) || !ndarray_object_is_array_like(args[1].u_obj)) {
+        mp_raise_TypeError(translate("sosfilt requires iterable arguments"));
+    }
+    size_t lenx = (size_t)mp_obj_get_int(mp_obj_len_maybe(args[1].u_obj));
+    ndarray_obj_t *y = ndarray_new_linear_array(lenx, NDARRAY_FLOAT);
+    mp_float_t *yarray = (mp_float_t *)y->array;
+    mp_float_t coeffs[6];
+    if(MP_OBJ_IS_TYPE(args[1].u_obj, &ulab_ndarray_type)) {
+        ndarray_obj_t *inarray = MP_OBJ_TO_PTR(args[1].u_obj);
+        #if ULAB_MAX_DIMS > 1
+        if(inarray->ndim > 1) {
+            mp_raise_ValueError(translate("input must be one-dimensional"));
+        }
+        #endif
+        uint8_t *iarray = (uint8_t *)inarray->array;
+        for(size_t i=0; i < lenx; i++) {
+            *yarray++ = ndarray_get_float_value(iarray, inarray->dtype);
+            iarray += inarray->strides[ULAB_MAX_DIMS - 1];
+        }
+        yarray -= lenx;
+    } else {
+        fill_array_iterable(yarray, args[1].u_obj);
+    }
+
+    mp_obj_iter_buf_t iter_buf;
+    mp_obj_t item, iterable = mp_getiter(args[0].u_obj, &iter_buf);
+    size_t lensos = (size_t)mp_obj_get_int(mp_obj_len_maybe(args[0].u_obj));
+
+    size_t *shape = ndarray_shape_vector(0, 0, lensos, 2);
+    ndarray_obj_t *zf = ndarray_new_dense_ndarray(2, shape, NDARRAY_FLOAT);
+    mp_float_t *zf_array = (mp_float_t *)zf->array;
+
+    if(args[2].u_obj != mp_const_none) {
+        if(!MP_OBJ_IS_TYPE(args[2].u_obj, &ulab_ndarray_type)) {
+            mp_raise_TypeError(translate("zi must be an ndarray"));
+        } else {
+            ndarray_obj_t *zi = MP_OBJ_TO_PTR(args[2].u_obj);
+            if((zi->shape[ULAB_MAX_DIMS - 2] != lensos) || (zi->shape[ULAB_MAX_DIMS - 1] != 2)) {
+                mp_raise_ValueError(translate("zi must be of shape (n_section, 2)"));
+            }
+            if(zi->dtype != NDARRAY_FLOAT) {
+                mp_raise_ValueError(translate("zi must be of float type"));
+            }
+            // TODO: this won't work with sparse arrays
+            memcpy(zf_array, zi->array, 2*lensos*sizeof(mp_float_t));
+        }
+    }
+    while((item = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION) {
+        if(mp_obj_get_int(mp_obj_len_maybe(item)) != 6) {
+            mp_raise_ValueError(translate("sos array must be of shape (n_section, 6)"));
+        } else {
+            fill_array_iterable(coeffs, item);
+            if(coeffs[3] != MICROPY_FLOAT_CONST(1.0)) {
+                mp_raise_ValueError(translate("sos[:, 3] should be all ones"));
+            }
+            signal_sosfilt_array(yarray, coeffs, zf_array, lenx);
+            zf_array += 2;
+        }
+    }
+    if(args[2].u_obj == mp_const_none) {
+        return MP_OBJ_FROM_PTR(y);
+    } else {
+        mp_obj_tuple_t *tuple = MP_OBJ_TO_PTR(mp_obj_new_tuple(2, NULL));
+        tuple->items[0] = MP_OBJ_FROM_PTR(y);
+        tuple->items[1] = MP_OBJ_FROM_PTR(zf);
+        return tuple;
+    }
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(signal_sosfilt_obj, 2, signal_sosfilt);
+#endif /* ULAB_SCIPY_SIGNAL_HAS_SOSFILT */
+
+static const mp_rom_map_elem_t ulab_scipy_signal_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_signal) },
+    #if ULAB_SCIPY_SIGNAL_HAS_SPECTROGRAM
+        { MP_OBJ_NEW_QSTR(MP_QSTR_spectrogram), (mp_obj_t)&signal_spectrogram_obj },
+    #endif
+    #if ULAB_SCIPY_SIGNAL_HAS_SOSFILT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_sosfilt), (mp_obj_t)&signal_sosfilt_obj },
+    #endif
+};
+
+static MP_DEFINE_CONST_DICT(mp_module_ulab_scipy_signal_globals, ulab_scipy_signal_globals_table);
+
+mp_obj_module_t ulab_scipy_signal_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_scipy_signal_globals,
+};
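Review note on `sosfilt`: the wrapper above checks that every row of `sos` holds six coefficients with `sos[:, 3]` equal to one, and that `zi` has shape (n_section, 2), then hands each section to `signal_sosfilt_array`, whose body is not visible in this part of the diff. The following is a minimal sketch of what cascaded second-order-section filtering usually looks like, assuming the transposed direct form II update. The names `biquad_df2t` and `sosfilt_sketch` are hypothetical, and `double` stands in for `mp_float_t`; this is not necessarily how ulab implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Filter x in place through one second-order section.
 * coeffs = {b0, b1, b2, a0, a1, a2} with a0 == 1 (the same layout the wrapper
 * checks for); state holds the two delay values of the section. */
void biquad_df2t(double *x, const double *coeffs, double *state, size_t len) {
    assert(coeffs[3] == 1.0);
    for(size_t i = 0; i < len; i++) {
        double in = x[i];
        double out = coeffs[0] * in + state[0];
        state[0] = coeffs[1] * in - coeffs[4] * out + state[1];
        state[1] = coeffs[2] * in - coeffs[5] * out;
        x[i] = out;
    }
}

/* Run the signal through n_section sections in sequence. sos is a flattened
 * (n_section, 6) array, zf a flattened (n_section, 2) array of delay values
 * that is updated in place, so it ends up holding the final filter state. */
void sosfilt_sketch(double *x, size_t len, const double *sos, double *zf, size_t n_section) {
    for(size_t s = 0; s < n_section; s++) {
        biquad_df2t(x, sos + 6 * s, zf + 2 * s, len);
    }
}
```

The output of one section is the input of the next, which is why the wrapper advances `zf_array` by two after each row of `sos`.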
--- a/code/scipy/signal/signal.h
+++ b/code/scipy/signal/signal.h
@ -0,0 +1,24 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ *
+*/
+
+#ifndef _SCIPY_SIGNAL_
+#define _SCIPY_SIGNAL_
+
+#include "ulab.h"
+#include "ndarray.h"
+
+extern mp_obj_module_t ulab_scipy_signal_module;
+
+MP_DECLARE_CONST_FUN_OBJ_VAR_BETWEEN(signal_spectrogram_obj);
+MP_DECLARE_CONST_FUN_OBJ_KW(signal_sosfilt_obj);
+
+#endif /* _SCIPY_SIGNAL_ */
--- a/code/scipy/special/special.c
+++ b/code/scipy/special/special.c
@ -0,0 +1,42 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2020 Scott Shawcroft for Adafruit Industries
+ *               2020-2021 Zoltán Vörös
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include "py/runtime.h"
+
+#include "../../ulab.h"
+#include "../../numpy/vector/vector.h"
+
+static const mp_rom_map_elem_t ulab_scipy_special_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_special) },
+    #if ULAB_SCIPY_SPECIAL_HAS_ERF
+        { MP_OBJ_NEW_QSTR(MP_QSTR_erf), (mp_obj_t)&vectorise_erf_obj },
+    #endif
+    #if ULAB_SCIPY_SPECIAL_HAS_ERFC
+        { MP_OBJ_NEW_QSTR(MP_QSTR_erfc), (mp_obj_t)&vectorise_erfc_obj },
+    #endif
+    #if ULAB_SCIPY_SPECIAL_HAS_GAMMA
+        { MP_OBJ_NEW_QSTR(MP_QSTR_gamma), (mp_obj_t)&vectorise_gamma_obj },
+    #endif
+    #if ULAB_SCIPY_SPECIAL_HAS_GAMMALN
+        { MP_OBJ_NEW_QSTR(MP_QSTR_gammaln), (mp_obj_t)&vectorise_lgamma_obj },
+    #endif
+};
+
+static MP_DEFINE_CONST_DICT(mp_module_ulab_scipy_special_globals, ulab_scipy_special_globals_table);
+
+mp_obj_module_t ulab_scipy_special_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_scipy_special_globals,
+};
--- a/code/scipy/special/special.h
+++ b/code/scipy/special/special.h
@ -0,0 +1,21 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ *               
+*/
+
+#ifndef _SCIPY_SPECIAL_
+#define _SCIPY_SPECIAL_
+
+#include "ulab.h"
+#include "ndarray.h"
+
+extern mp_obj_module_t ulab_scipy_special_module;
+
+#endif /* _SCIPY_SPECIAL_ */
--- a/code/ulab.c
+++ b/code/ulab.c
@ -6,7 +6,8 @@
 *
 * The MIT License (MIT)
 *
- * Copyright (c) 2019-2020 Zoltán Vörös
+ * Copyright (c) 2019-2021 Zoltán Vörös
+ *               2020 Jeff Epler for Adafruit Industries
 */

 #include <math.h>
@ -19,75 +20,117 @@
 #include "py/objarray.h"

 #include "ulab.h"
+#include "ulab_create.h"
 #include "ndarray.h"
 #include "ndarray_properties.h"
-#include "linalg.h"
-#include "vectorise.h"
-#include "poly.h"
-#include "fft.h"
-#include "filter.h"
-#include "numerical.h"
-#include "extras.h"

-STATIC MP_DEFINE_STR_OBJ(ulab_version_obj, "0.34.0");
+#include "numpy/numpy.h"
+#include "scipy/scipy.h"
+#include "numpy/fft/fft.h"
+#include "numpy/linalg/linalg.h"
+// TODO: we should get rid of this; array.sort depends on it
+#include "numpy/numerical/numerical.h"
+
+#include "user/user.h"
+
+#define ULAB_VERSION 2.1.5
+#define xstr(s) str(s)
+#define str(s) #s
+#define ULAB_VERSION_STRING xstr(ULAB_VERSION) xstr(-) xstr(ULAB_MAX_DIMS) xstr(D)
+
+STATIC MP_DEFINE_STR_OBJ(ulab_version_obj, ULAB_VERSION_STRING);
+

 STATIC const mp_rom_map_elem_t ulab_ndarray_locals_dict_table[] = {
-    { MP_ROM_QSTR(MP_QSTR_flatten), MP_ROM_PTR(&ndarray_flatten_obj) },
+    // these are the methods and properties of an ndarray
+    #if ULAB_MAX_DIMS > 1
+        #if NDARRAY_HAS_RESHAPE
            { MP_ROM_QSTR(MP_QSTR_reshape), MP_ROM_PTR(&ndarray_reshape_obj) },
+        #endif
+        #if NDARRAY_HAS_TRANSPOSE
            { MP_ROM_QSTR(MP_QSTR_transpose), MP_ROM_PTR(&ndarray_transpose_obj) },
-    { MP_ROM_QSTR(MP_QSTR_shape), MP_ROM_PTR(&ndarray_shape_obj) },
-    { MP_ROM_QSTR(MP_QSTR_size), MP_ROM_PTR(&ndarray_size_obj) },
+        #endif
+    #endif
+    #if NDARRAY_HAS_COPY
+        { MP_ROM_QSTR(MP_QSTR_copy), MP_ROM_PTR(&ndarray_copy_obj) },
+    #endif
+    #if NDARRAY_HAS_DTYPE
+        { MP_ROM_QSTR(MP_QSTR_dtype), MP_ROM_PTR(&ndarray_dtype_obj) },
+    #endif
+    #if NDARRAY_HAS_FLATTEN
+        { MP_ROM_QSTR(MP_QSTR_flatten), MP_ROM_PTR(&ndarray_flatten_obj) },
+    #endif
+    #if NDARRAY_HAS_ITEMSIZE
        { MP_ROM_QSTR(MP_QSTR_itemsize), MP_ROM_PTR(&ndarray_itemsize_obj) },
-//    { MP_ROM_QSTR(MP_QSTR_sort), MP_ROM_PTR(&numerical_sort_inplace_obj) },
+    #endif
+    #if NDARRAY_HAS_SHAPE
+        { MP_ROM_QSTR(MP_QSTR_shape), MP_ROM_PTR(&ndarray_shape_obj) },
+    #endif
+    #if NDARRAY_HAS_SIZE
+        { MP_ROM_QSTR(MP_QSTR_size), MP_ROM_PTR(&ndarray_size_obj) },
+    #endif
+    #if NDARRAY_HAS_STRIDES
+        { MP_ROM_QSTR(MP_QSTR_strides), MP_ROM_PTR(&ndarray_strides_obj) },
+    #endif
+    #if NDARRAY_HAS_TOBYTES
+        { MP_ROM_QSTR(MP_QSTR_tobytes), MP_ROM_PTR(&ndarray_tobytes_obj) },
+    #endif
+    #if NDARRAY_HAS_SORT
+        { MP_ROM_QSTR(MP_QSTR_sort), MP_ROM_PTR(&numerical_sort_inplace_obj) },
+    #endif
 };

 STATIC MP_DEFINE_CONST_DICT(ulab_ndarray_locals_dict, ulab_ndarray_locals_dict_table);

 const mp_obj_type_t ulab_ndarray_type = {
    { &mp_type_type },
+#if defined(MP_TYPE_FLAG_EQ_CHECKS_OTHER_TYPE) && defined(MP_TYPE_FLAG_EQ_HAS_NEQ_TEST)
+    .flags = MP_TYPE_FLAG_EQ_CHECKS_OTHER_TYPE | MP_TYPE_FLAG_EQ_HAS_NEQ_TEST,
+#endif
    .name = MP_QSTR_ndarray,
    .print = ndarray_print,
    .make_new = ndarray_make_new,
+    #if NDARRAY_IS_SLICEABLE
    .subscr = ndarray_subscr,
+    #endif
+    #if NDARRAY_IS_ITERABLE
    .getiter = ndarray_getiter,
+    #endif
+    #if NDARRAY_HAS_UNARY_OPS
    .unary_op = ndarray_unary_op,
+    #endif
+    #if NDARRAY_HAS_BINARY_OPS
    .binary_op = ndarray_binary_op,
-    .buffer_p = { .get_buffer = ndarray_get_buffer, },
+    #endif
    .locals_dict = (mp_obj_dict_t*)&ulab_ndarray_locals_dict,
 };

-#if !CIRCUITPY
+#if ULAB_HAS_DTYPE_OBJECT
+const mp_obj_type_t ulab_dtype_type = {
+    { &mp_type_type },
+    .name = MP_QSTR_dtype,
+    .print = ndarray_dtype_print,
+    .make_new = ndarray_dtype_make_new,
+};
+#endif
+
 STATIC const mp_map_elem_t ulab_globals_table[] = {
    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_ulab) },
    { MP_ROM_QSTR(MP_QSTR___version__), MP_ROM_PTR(&ulab_version_obj) },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_array), (mp_obj_t)&ulab_ndarray_type },
-    #if ULAB_LINALG_MODULE
-    { MP_ROM_QSTR(MP_QSTR_linalg), MP_ROM_PTR(&ulab_linalg_module) },
+    #if ULAB_HAS_DTYPE_OBJECT
+        { MP_OBJ_NEW_QSTR(MP_QSTR_dtype), (mp_obj_t)&ulab_dtype_type },
+    #else
+        #if NDARRAY_HAS_DTYPE
+        { MP_OBJ_NEW_QSTR(MP_QSTR_dtype), (mp_obj_t)&ndarray_dtype_obj },
+        #endif /* NDARRAY_HAS_DTYPE */
+    #endif /* ULAB_HAS_DTYPE_OBJECT */
+        { MP_ROM_QSTR(MP_QSTR_numpy), MP_ROM_PTR(&ulab_numpy_module) },
+    #if ULAB_HAS_SCIPY
+        { MP_ROM_QSTR(MP_QSTR_scipy), MP_ROM_PTR(&ulab_scipy_module) },
    #endif
-    #if ULAB_VECTORISE_MODULE
-    { MP_ROM_QSTR(MP_QSTR_vector), MP_ROM_PTR(&ulab_vectorise_module) },
+    #if ULAB_HAS_USER_MODULE
+        { MP_ROM_QSTR(MP_QSTR_user), MP_ROM_PTR(&ulab_user_module) },
    #endif
-    #if ULAB_NUMERICAL_MODULE
-    { MP_ROM_QSTR(MP_QSTR_numerical), MP_ROM_PTR(&ulab_numerical_module) },
-    #endif
-    #if ULAB_POLY_MODULE
-    { MP_ROM_QSTR(MP_QSTR_poly), MP_ROM_PTR(&ulab_poly_module) },
-    #endif
-    #if ULAB_FFT_MODULE
-    { MP_ROM_QSTR(MP_QSTR_fft), MP_ROM_PTR(&ulab_fft_module) },
-    #endif
-    #if ULAB_FILTER_MODULE
-    { MP_ROM_QSTR(MP_QSTR_filter), MP_ROM_PTR(&ulab_filter_module) },
-    #endif
-    #if ULAB_EXTRAS_MODULE
-    { MP_ROM_QSTR(MP_QSTR_extras), MP_ROM_PTR(&ulab_extras_module) },
-    #endif
-    // class constants
-    { MP_ROM_QSTR(MP_QSTR_uint8), MP_ROM_INT(NDARRAY_UINT8) },
-    { MP_ROM_QSTR(MP_QSTR_int8), MP_ROM_INT(NDARRAY_INT8) },
-    { MP_ROM_QSTR(MP_QSTR_uint16), MP_ROM_INT(NDARRAY_UINT16) },
-    { MP_ROM_QSTR(MP_QSTR_int16), MP_ROM_INT(NDARRAY_INT16) },
-    { MP_ROM_QSTR(MP_QSTR_float), MP_ROM_INT(NDARRAY_FLOAT) },
 };

 STATIC MP_DEFINE_CONST_DICT (
@ -95,10 +138,13 @@ STATIC MP_DEFINE_CONST_DICT (
    ulab_globals_table
 );

-mp_obj_module_t ulab_user_cmodule = {
+#ifdef OPENMV
+const struct _mp_obj_module_t ulab_user_cmodule = {
+#else
+const mp_obj_module_t ulab_user_cmodule = {
+#endif
    .base = { &mp_type_module },
    .globals = (mp_obj_dict_t*)&mp_module_ulab_globals,
 };

 MP_REGISTER_MODULE(MP_QSTR_ulab, ulab_user_cmodule, MODULE_ULAB_ENABLED);
-#endif
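Review note on the version string: `ULAB_VERSION_STRING` relies on two-level stringification and on the compiler joining adjacent string literals. The sketch below reproduces the pattern with hypothetical `DEMO_` names so that it compiles on its own. Because `ULAB_MAX_DIMS` is defined as a bare `2` in ulab.h, the result contains `2`; a parenthesised `(2)` would be stringified together with its parentheses.

```c
#include <assert.h>

#define DEMO_VERSION 2.1.5
#define DEMO_MAX_DIMS 2

/* xstr expands its argument first, then str turns the result into a literal */
#define xstr(s) str(s)
#define str(s) #s

/* four adjacent literals, "2.1.5" "-" "2" "D", are joined by the compiler */
#define DEMO_VERSION_STRING xstr(DEMO_VERSION) xstr(-) xstr(DEMO_MAX_DIMS) xstr(D)

/* eight characters plus the terminating NUL */
static_assert(sizeof(DEMO_VERSION_STRING) == 9, "unexpected version string length");

const char *demo_version(void) {
    return DEMO_VERSION_STRING;
}
```

Calling `str(DEMO_VERSION)` directly would yield the macro name rather than its value, which is why the extra `xstr` level is needed.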
--- a/code/ulab.h
+++ b/code/ulab.h
@ -6,31 +6,593 @@
 *
 * The MIT License (MIT)
 *
- * Copyright (c) 2019-2020 Zoltán Vörös
+ * Copyright (c) 2019-2021 Zoltán Vörös
 */

 #ifndef __ULAB__
 #define __ULAB__

-// vectorise (all functions) takes approx. 3 kB of flash space
-#define ULAB_VECTORISE_MODULE (1)

-// linalg adds around 6 kB
-#define ULAB_LINALG_MODULE (1)

-// poly is approx. 2.5 kB
-#define ULAB_POLY_MODULE (1)
+// The pre-processor constants in this file determine how ulab behaves:
+//
+// - how many dimensions ulab can handle
+// - which functions are included in the compiled firmware
+// - whether the python syntax is numpy-like, or modular
+// - whether arrays can be sliced and iterated over
+// - which binary/unary operators are supported
+//
+// A considerable amount of flash space can be saved by removing (setting
+// the corresponding constants to 0) the unnecessary functions and features.

-// numerical is about 12 kB
-#define ULAB_NUMERICAL_MODULE (1)
+// Values guarded by #ifndef below can be overridden by your own config file as
+// make -DULAB_CONFIG_FILE="my_ulab_config.h"
+#if defined(ULAB_CONFIG_FILE)
+#include ULAB_CONFIG_FILE
+#endif

-// FFT costs about 2 kB of flash space
-#define ULAB_FFT_MODULE (1)

-// the filter module takes about 1 kB of flash space
-#define ULAB_FILTER_MODULE (1)
+// Determines, whether scipy is defined in ulab. The sub-modules and functions
+// of scipy have to be defined separately
+#ifndef ULAB_HAS_SCIPY
+#define ULAB_HAS_SCIPY                      (1)
+#endif

-// user-defined modules
-#define ULAB_EXTRAS_MODULE (0)
+// The maximum number of dimensions the firmware should be able to support
+// Possible values lie between 1 and 4, inclusive
+#define ULAB_MAX_DIMS                       2
+
+// By setting this constant to 1, iteration over array dimensions will be implemented
+// as a function (ndarray_rewind_array), instead of writing out the loops in macros
+// This reduces firmware size at the expense of speed
+#define ULAB_HAS_FUNCTION_ITERATOR          (0)
+
+// If NDARRAY_IS_ITERABLE is 1, the ndarray object defines its own iterator function
+// Setting this option to 0 saves approx. 250 bytes of flash space
+#ifndef NDARRAY_IS_ITERABLE
+#define NDARRAY_IS_ITERABLE                 (1)
+#endif
+
+// Slicing can be switched off by setting this variable to 0
+#ifndef NDARRAY_IS_SLICEABLE
+#define NDARRAY_IS_SLICEABLE                (1)
+#endif
+
+// The default threshold for pretty printing. These variables can be overwritten
+// at run-time via the set_printoptions() function
+#ifndef ULAB_HAS_PRINTOPTIONS
+#define ULAB_HAS_PRINTOPTIONS               (1)
+#endif
+#define NDARRAY_PRINT_THRESHOLD             10
+#define NDARRAY_PRINT_EDGEITEMS             3
+
+// determines, whether the dtype is an object, or simply a character
+// the object implementation is numpythonic, but requires more space
+#ifndef ULAB_HAS_DTYPE_OBJECT
+#define ULAB_HAS_DTYPE_OBJECT               (0)
+#endif
+
+// the ndarray binary operators
+#ifndef NDARRAY_HAS_BINARY_OPS
+#define NDARRAY_HAS_BINARY_OPS              (1)
+#endif
+
+// Firmware size can be reduced at the expense of speed by using function
+// pointers in iterations. For each operator, the function pointer saves around
+// 2 kB in the two-dimensional case, and around 4 kB in the four-dimensional case.
+
+#ifndef NDARRAY_BINARY_USES_FUN_POINTER
+#define NDARRAY_BINARY_USES_FUN_POINTER     (0)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_ADD
+#define NDARRAY_HAS_BINARY_OP_ADD           (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_EQUAL
+#define NDARRAY_HAS_BINARY_OP_EQUAL         (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_LESS
+#define NDARRAY_HAS_BINARY_OP_LESS          (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_LESS_EQUAL
+#define NDARRAY_HAS_BINARY_OP_LESS_EQUAL    (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_MORE
+#define NDARRAY_HAS_BINARY_OP_MORE          (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_MORE_EQUAL
+#define NDARRAY_HAS_BINARY_OP_MORE_EQUAL    (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_MULTIPLY
+#define NDARRAY_HAS_BINARY_OP_MULTIPLY      (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_NOT_EQUAL
+#define NDARRAY_HAS_BINARY_OP_NOT_EQUAL     (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_POWER
+#define NDARRAY_HAS_BINARY_OP_POWER         (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_SUBTRACT
+#define NDARRAY_HAS_BINARY_OP_SUBTRACT      (1)
+#endif
+
+#ifndef NDARRAY_HAS_BINARY_OP_TRUE_DIVIDE
+#define NDARRAY_HAS_BINARY_OP_TRUE_DIVIDE   (1)
+#endif
+
+#ifndef NDARRAY_HAS_INPLACE_OPS
+#define NDARRAY_HAS_INPLACE_OPS             (1)
+#endif
+
+#ifndef NDARRAY_HAS_INPLACE_ADD
+#define NDARRAY_HAS_INPLACE_ADD             (1)
+#endif
+
+#ifndef NDARRAY_HAS_INPLACE_MULTIPLY
+#define NDARRAY_HAS_INPLACE_MULTIPLY        (1)
+#endif
+
+#ifndef NDARRAY_HAS_INPLACE_POWER
+#define NDARRAY_HAS_INPLACE_POWER           (1)
+#endif
+
+#ifndef NDARRAY_HAS_INPLACE_SUBTRACT
+#define NDARRAY_HAS_INPLACE_SUBTRACT        (1)
+#endif
+
+#ifndef NDARRAY_HAS_INPLACE_TRUE_DIVIDE
+#define NDARRAY_HAS_INPLACE_TRUE_DIVIDE     (1)
+#endif
+
+// the ndarray unary operators
+#ifndef NDARRAY_HAS_UNARY_OPS
+#define NDARRAY_HAS_UNARY_OPS               (1)
+#endif
+
+#ifndef NDARRAY_HAS_UNARY_OP_ABS
+#define NDARRAY_HAS_UNARY_OP_ABS            (1)
+#endif
+
+#ifndef NDARRAY_HAS_UNARY_OP_INVERT
+#define NDARRAY_HAS_UNARY_OP_INVERT         (1)
+#endif
+
+#ifndef NDARRAY_HAS_UNARY_OP_LEN
+#define NDARRAY_HAS_UNARY_OP_LEN            (1)
+#endif
+
+#ifndef NDARRAY_HAS_UNARY_OP_NEGATIVE
+#define NDARRAY_HAS_UNARY_OP_NEGATIVE       (1)
+#endif
+
+#ifndef NDARRAY_HAS_UNARY_OP_POSITIVE
+#define NDARRAY_HAS_UNARY_OP_POSITIVE       (1)
+#endif
+
+
+// determines, which ndarray methods are available
+#ifndef NDARRAY_HAS_COPY
+#define NDARRAY_HAS_COPY                (1)
+#endif
+
+#ifndef NDARRAY_HAS_DTYPE
+#define NDARRAY_HAS_DTYPE               (1)
+#endif
+
+#ifndef NDARRAY_HAS_FLATTEN
+#define NDARRAY_HAS_FLATTEN             (1)
+#endif
+
+#ifndef NDARRAY_HAS_ITEMSIZE
+#define NDARRAY_HAS_ITEMSIZE            (1)
+#endif
+
+#ifndef NDARRAY_HAS_RESHAPE
+#define NDARRAY_HAS_RESHAPE             (1)
+#endif
+
+#ifndef NDARRAY_HAS_SHAPE
+#define NDARRAY_HAS_SHAPE               (1)
+#endif
+
+#ifndef NDARRAY_HAS_SIZE
+#define NDARRAY_HAS_SIZE                (1)
+#endif
+
+#ifndef NDARRAY_HAS_SORT
+#define NDARRAY_HAS_SORT                (1)
+#endif
+
+#ifndef NDARRAY_HAS_STRIDES
+#define NDARRAY_HAS_STRIDES             (1)
+#endif
+
+#ifndef NDARRAY_HAS_TOBYTES
+#define NDARRAY_HAS_TOBYTES             (1)
+#endif
+
+#ifndef NDARRAY_HAS_TRANSPOSE
+#define NDARRAY_HAS_TRANSPOSE           (1)
+#endif
+
+// Firmware size can be reduced at the expense of speed by using a function
+// pointer in iterations. Setting ULAB_VECTORISE_USES_FUN_POINTER to 1 saves
+// around 800 bytes in the four-dimensional case, and around 200 in two dimensions.
+#ifndef ULAB_VECTORISE_USES_FUN_POINTER
+#define ULAB_VECTORISE_USES_FUN_POINTER (1)
+#endif
+
+// determines, whether e is defined in ulab.numpy itself
+#ifndef ULAB_NUMPY_HAS_E
+#define ULAB_NUMPY_HAS_E                (1)
+#endif
+
+// ulab defines infinity as a class constant in ulab.numpy
+#ifndef ULAB_NUMPY_HAS_INF
+#define ULAB_NUMPY_HAS_INF              (1)
+#endif
+
+// ulab defines NaN as a class constant in ulab.numpy
+#ifndef ULAB_NUMPY_HAS_NAN
+#define ULAB_NUMPY_HAS_NAN              (1)
+#endif
+
+// determines, whether pi is defined in ulab.numpy itself
+#ifndef ULAB_NUMPY_HAS_PI
+#define ULAB_NUMPY_HAS_PI               (1)
+#endif
+
+// determines, whether the ndinfo function is available
+#ifndef ULAB_NUMPY_HAS_NDINFO
+#define ULAB_NUMPY_HAS_NDINFO           (1)
+#endif
+
+// frombuffer adds 600 bytes to the firmware
+#ifndef ULAB_NUMPY_HAS_FROMBUFFER
+#define ULAB_NUMPY_HAS_FROMBUFFER       (1)
+#endif
+
+// functions that create an array
+#ifndef ULAB_NUMPY_HAS_ARANGE
+#define ULAB_NUMPY_HAS_ARANGE           (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_CONCATENATE
+#define ULAB_NUMPY_HAS_CONCATENATE      (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_DIAG
+#define ULAB_NUMPY_HAS_DIAG             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_EYE
+#define ULAB_NUMPY_HAS_EYE              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_FULL
+#define ULAB_NUMPY_HAS_FULL             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_LINSPACE
+#define ULAB_NUMPY_HAS_LINSPACE         (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_LOGSPACE
+#define ULAB_NUMPY_HAS_LOGSPACE         (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ONES
+#define ULAB_NUMPY_HAS_ONES             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ZEROS
+#define ULAB_NUMPY_HAS_ZEROS            (1)
+#endif
+
+// functions that compare arrays
+#ifndef ULAB_NUMPY_HAS_CLIP
+#define ULAB_NUMPY_HAS_CLIP             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_EQUAL
+#define ULAB_NUMPY_HAS_EQUAL            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_NOTEQUAL
+#define ULAB_NUMPY_HAS_NOTEQUAL         (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_MAXIMUM
+#define ULAB_NUMPY_HAS_MAXIMUM          (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_MINIMUM
+#define ULAB_NUMPY_HAS_MINIMUM          (1)
+#endif
+
+// the linalg module; functions of the linalg module still have
+// to be defined separately
+#ifndef ULAB_NUMPY_HAS_LINALG_MODULE
+#define ULAB_NUMPY_HAS_LINALG_MODULE    (1)
+#endif
+
+#ifndef ULAB_LINALG_HAS_CHOLESKY
+#define ULAB_LINALG_HAS_CHOLESKY        (1)
+#endif
+
+#ifndef ULAB_LINALG_HAS_DET
+#define ULAB_LINALG_HAS_DET             (1)
+#endif
+
+#ifndef ULAB_LINALG_HAS_DOT
+#define ULAB_LINALG_HAS_DOT             (1)
+#endif
+
+#ifndef ULAB_LINALG_HAS_EIG
+#define ULAB_LINALG_HAS_EIG             (1)
+#endif
+
+#ifndef ULAB_LINALG_HAS_INV
+#define ULAB_LINALG_HAS_INV             (1)
+#endif
+
+#ifndef ULAB_LINALG_HAS_NORM
+#define ULAB_LINALG_HAS_NORM            (1)
+#endif
+
+#ifndef ULAB_LINALG_HAS_TRACE
+#define ULAB_LINALG_HAS_TRACE           (1)
+#endif
+
+// the FFT module; functions of the fft module still have
+// to be defined separately
+#ifndef ULAB_NUMPY_HAS_FFT_MODULE
+#define ULAB_NUMPY_HAS_FFT_MODULE       (1)
+#endif
+
+#ifndef ULAB_FFT_HAS_FFT
+#define ULAB_FFT_HAS_FFT                (1)
+#endif
+
+#ifndef ULAB_FFT_HAS_IFFT
+#define ULAB_FFT_HAS_IFFT               (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ARGMINMAX
+#define ULAB_NUMPY_HAS_ARGMINMAX        (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ARGSORT
+#define ULAB_NUMPY_HAS_ARGSORT          (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_CONVOLVE
+#define ULAB_NUMPY_HAS_CONVOLVE         (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_CROSS
+#define ULAB_NUMPY_HAS_CROSS            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_DIFF
+#define ULAB_NUMPY_HAS_DIFF             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_FLIP
+#define ULAB_NUMPY_HAS_FLIP             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_INTERP
+#define ULAB_NUMPY_HAS_INTERP           (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_MEAN
+#define ULAB_NUMPY_HAS_MEAN             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_MEDIAN
+#define ULAB_NUMPY_HAS_MEDIAN           (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_MINMAX
+#define ULAB_NUMPY_HAS_MINMAX           (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_POLYFIT
+#define ULAB_NUMPY_HAS_POLYFIT          (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_POLYVAL
+#define ULAB_NUMPY_HAS_POLYVAL          (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ROLL
+#define ULAB_NUMPY_HAS_ROLL             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_SORT
+#define ULAB_NUMPY_HAS_SORT             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_STD
+#define ULAB_NUMPY_HAS_STD              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_SUM
+#define ULAB_NUMPY_HAS_SUM              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_TRAPZ
+#define ULAB_NUMPY_HAS_TRAPZ            (1)
+#endif
+
+// vectorised versions of the functions of the math python module, with
+// the exception of the functions listed in scipy.special
+#ifndef ULAB_NUMPY_HAS_ACOS
+#define ULAB_NUMPY_HAS_ACOS             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ACOSH
+#define ULAB_NUMPY_HAS_ACOSH            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ARCTAN2
+#define ULAB_NUMPY_HAS_ARCTAN2          (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_AROUND
+#define ULAB_NUMPY_HAS_AROUND           (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ASIN
+#define ULAB_NUMPY_HAS_ASIN             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ASINH
+#define ULAB_NUMPY_HAS_ASINH            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ATAN
+#define ULAB_NUMPY_HAS_ATAN             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_ATANH
+#define ULAB_NUMPY_HAS_ATANH            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_CEIL
+#define ULAB_NUMPY_HAS_CEIL             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_COS
+#define ULAB_NUMPY_HAS_COS              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_COSH
+#define ULAB_NUMPY_HAS_COSH             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_DEGREES
+#define ULAB_NUMPY_HAS_DEGREES          (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_EXP
+#define ULAB_NUMPY_HAS_EXP              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_EXPM1
+#define ULAB_NUMPY_HAS_EXPM1            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_FLOOR
+#define ULAB_NUMPY_HAS_FLOOR            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_LOG
+#define ULAB_NUMPY_HAS_LOG              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_LOG10
+#define ULAB_NUMPY_HAS_LOG10            (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_LOG2
+#define ULAB_NUMPY_HAS_LOG2             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_RADIANS
+#define ULAB_NUMPY_HAS_RADIANS          (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_SIN
+#define ULAB_NUMPY_HAS_SIN              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_SINH
+#define ULAB_NUMPY_HAS_SINH             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_SQRT
+#define ULAB_NUMPY_HAS_SQRT             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_TAN
+#define ULAB_NUMPY_HAS_TAN              (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_TANH
+#define ULAB_NUMPY_HAS_TANH             (1)
+#endif
+
+#ifndef ULAB_NUMPY_HAS_VECTORIZE
+#define ULAB_NUMPY_HAS_VECTORIZE        (1)
+#endif
+
+#ifndef ULAB_SCIPY_HAS_SIGNAL_MODULE
+#define ULAB_SCIPY_HAS_SIGNAL_MODULE        (1)
+#endif
+
+#ifndef ULAB_SCIPY_SIGNAL_HAS_SPECTROGRAM
+#define ULAB_SCIPY_SIGNAL_HAS_SPECTROGRAM   (1)
+#endif
+
+#ifndef ULAB_SCIPY_SIGNAL_HAS_SOSFILT
+#define ULAB_SCIPY_SIGNAL_HAS_SOSFILT       (1)
+#endif
+
+#ifndef ULAB_SCIPY_HAS_OPTIMIZE_MODULE
+#define ULAB_SCIPY_HAS_OPTIMIZE_MODULE      (1)
+#endif
+
+#ifndef ULAB_SCIPY_OPTIMIZE_HAS_BISECT
+#define ULAB_SCIPY_OPTIMIZE_HAS_BISECT      (1)
+#endif
+
+#ifndef ULAB_SCIPY_OPTIMIZE_HAS_CURVE_FIT
+#define ULAB_SCIPY_OPTIMIZE_HAS_CURVE_FIT   (0) // not fully implemented
+#endif
+
+#ifndef ULAB_SCIPY_OPTIMIZE_HAS_FMIN
+#define ULAB_SCIPY_OPTIMIZE_HAS_FMIN        (1)
+#endif
+
+#ifndef ULAB_SCIPY_OPTIMIZE_HAS_NEWTON
+#define ULAB_SCIPY_OPTIMIZE_HAS_NEWTON      (1)
+#endif
+
+#ifndef ULAB_SCIPY_HAS_SPECIAL_MODULE
+#define ULAB_SCIPY_HAS_SPECIAL_MODULE       (1)
+#endif
+
+#ifndef ULAB_SCIPY_SPECIAL_HAS_ERF
+#define ULAB_SCIPY_SPECIAL_HAS_ERF          (1)
+#endif
+
+#ifndef ULAB_SCIPY_SPECIAL_HAS_ERFC
+#define ULAB_SCIPY_SPECIAL_HAS_ERFC         (1)
+#endif
+
+#ifndef ULAB_SCIPY_SPECIAL_HAS_GAMMA
+#define ULAB_SCIPY_SPECIAL_HAS_GAMMA        (1)
+#endif
+
+#ifndef ULAB_SCIPY_SPECIAL_HAS_GAMMALN
+#define ULAB_SCIPY_SPECIAL_HAS_GAMMALN      (1)
+#endif
+
+// user-defined module; source of the module and
+// its sub-modules should be placed in code/user/
+#ifndef ULAB_HAS_USER_MODULE
+#define ULAB_HAS_USER_MODULE                (0)
+#endif

 #endif
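Review note on the configuration mechanism: most constants in ulab.h are now wrapped in `#ifndef`, so a definition that the preprocessor sees earlier, for example from the file named by `ULAB_CONFIG_FILE`, takes precedence over the default. The sketch below shows the pattern in a single file; the first block stands in for a hypothetical `my_ulab_config.h`, and `demo_enabled_features` exists only to make the effect observable.

```c
#include <assert.h>

/* --- stand-in for a hypothetical my_ulab_config.h --- */
#define ULAB_HAS_SCIPY      (0)
#define NDARRAY_HAS_SORT    (0)

/* --- stand-in for the guarded defaults in ulab.h --- */
#ifndef ULAB_HAS_SCIPY
#define ULAB_HAS_SCIPY      (1)
#endif
#ifndef NDARRAY_HAS_SORT
#define NDARRAY_HAS_SORT    (1)
#endif
#ifndef NDARRAY_HAS_COPY
#define NDARRAY_HAS_COPY    (1)
#endif

/* the two overridden values win, the untouched one keeps its default */
static_assert(ULAB_HAS_SCIPY == 0, "config value should override the default");
static_assert(NDARRAY_HAS_COPY == 1, "unset value should fall back to the default");

int demo_enabled_features(void) {
    return ULAB_HAS_SCIPY + NDARRAY_HAS_SORT + NDARRAY_HAS_COPY;
}
```

In this diff, `ULAB_MAX_DIMS`, `ULAB_HAS_FUNCTION_ITERATOR`, `NDARRAY_PRINT_THRESHOLD` and `NDARRAY_PRINT_EDGEITEMS` are defined without a guard, so giving them a different value in the config file would lead to a macro redefinition diagnostic rather than an override.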
--- a/code/ulab_create.c
+++ b/code/ulab_create.c
@ -0,0 +1,682 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2019-2021 Zoltán Vörös
+ *               2020 Taku Fukada
+*/
+
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "py/obj.h"
+#include "py/runtime.h"
+
+#include "ulab.h"
+#include "ulab_create.h"
+
+#if ULAB_NUMPY_HAS_ONES | ULAB_NUMPY_HAS_ZEROS | ULAB_NUMPY_HAS_FULL
+static mp_obj_t create_zeros_ones_full(mp_obj_t oshape, uint8_t dtype, mp_obj_t value) {
+    if(!MP_OBJ_IS_INT(oshape) && !MP_OBJ_IS_TYPE(oshape, &mp_type_tuple) && !MP_OBJ_IS_TYPE(oshape, &mp_type_list)) {
+        mp_raise_TypeError(translate("input argument must be an integer, a tuple, or a list"));
+    }
+    ndarray_obj_t *ndarray = NULL;
+    if(MP_OBJ_IS_INT(oshape)) {
+        size_t n = mp_obj_get_int(oshape);
+        ndarray = ndarray_new_linear_array(n, dtype);
+    } else if(MP_OBJ_IS_TYPE(oshape, &mp_type_tuple) || MP_OBJ_IS_TYPE(oshape, &mp_type_list)) {
+        uint8_t len = (uint8_t)mp_obj_get_int(mp_obj_len_maybe(oshape));
+        if(len > ULAB_MAX_DIMS) {
+            mp_raise_TypeError(translate("too many dimensions"));
+        }
+        size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+        memset(shape, 0, ULAB_MAX_DIMS * sizeof(size_t));
+        size_t i = 0;
+        mp_obj_iter_buf_t iter_buf;
+        mp_obj_t item, iterable = mp_getiter(oshape, &iter_buf);
+        while((item = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION){
+            shape[ULAB_MAX_DIMS - len + i] = (size_t)mp_obj_get_int(item);
+            i++;
+        }
+        ndarray = ndarray_new_dense_ndarray(len, shape, dtype);
+    }
+    if(value != mp_const_none) {
+        for(size_t i=0; i < ndarray->len; i++) {
+            mp_binary_set_val_array(dtype, ndarray->array, i, value);
+        }
+    }
+    // if zeros calls the function, we don't have to do anything
+    return MP_OBJ_FROM_PTR(ndarray);
+}
+#endif
+
+#if ULAB_NUMPY_HAS_ARANGE | ULAB_NUMPY_HAS_LINSPACE
+static ndarray_obj_t *create_linspace_arange(mp_float_t start, mp_float_t step, size_t len, uint8_t dtype) {
+    mp_float_t value = start;
+
+    ndarray_obj_t *ndarray = ndarray_new_linear_array(len, dtype);
+    if(dtype == NDARRAY_UINT8) {
+        uint8_t *array = (uint8_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value += step) *array++ = (uint8_t)value;
+    } else if(dtype == NDARRAY_INT8) {
+        int8_t *array = (int8_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value += step) *array++ = (int8_t)value;
+    } else if(dtype == NDARRAY_UINT16) {
+        uint16_t *array = (uint16_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value += step) *array++ = (uint16_t)value;
+    } else if(dtype == NDARRAY_INT16) {
+        int16_t *array = (int16_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value += step) *array++ = (int16_t)value;
+    } else {
+        mp_float_t *array = (mp_float_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value += step) *array++ = value;
+    }
+    return ndarray;
+}
+#endif
+
+#if ULAB_NUMPY_HAS_ARANGE
+//| @overload
+//| def arange(stop: _float, step: _float = 1, *, dtype: _DType = ulab.float) -> ulab.ndarray: ...
+//| @overload
+//| def arange(start: _float, stop: _float, step: _float = 1, *, dtype: _DType = ulab.float) -> ulab.ndarray:
+//|     """
+//|     .. param: start
+//|       First value in the array, optional, defaults to 0
+//|     .. param: stop
+//|       End of the interval; this value is not included in the array
+//|     .. param: step
+//|       Difference between consecutive elements, optional, defaults to 1.0
+//|     .. param: dtype
+//|       Type of values in the array
+//|
+//|     Return a new 1-D array with elements ranging from ``start`` up to, but not including, ``stop``, with step size ``step``."""
+//|     ...
+//|
+
+mp_obj_t create_arange(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+    uint8_t dtype = NDARRAY_FLOAT;
+    mp_float_t start, stop, step;
+    if(n_args == 1) {
+        start = 0.0;
+        stop = mp_obj_get_float(args[0].u_obj);
+        step = 1.0;
+        if(mp_obj_is_int(args[0].u_obj)) dtype = NDARRAY_INT16;
+    } else if(n_args == 2) {
+        start = mp_obj_get_float(args[0].u_obj);
+        stop = mp_obj_get_float(args[1].u_obj);
+        step = 1.0;
+        if(mp_obj_is_int(args[0].u_obj) && mp_obj_is_int(args[1].u_obj)) dtype = NDARRAY_INT16;
+    } else if(n_args == 3) {
+        start = mp_obj_get_float(args[0].u_obj);
+        stop = mp_obj_get_float(args[1].u_obj);
+        step = mp_obj_get_float(args[2].u_obj);
+        if(mp_obj_is_int(args[0].u_obj) && mp_obj_is_int(args[1].u_obj) && mp_obj_is_int(args[2].u_obj)) dtype = NDARRAY_INT16;
+    } else {
+        mp_raise_TypeError(translate("wrong number of arguments"));
+    }
+    if((MICROPY_FLOAT_C_FUN(fabs)(stop) > 32768) || (MICROPY_FLOAT_C_FUN(fabs)(start) > 32768) || (MICROPY_FLOAT_C_FUN(fabs)(step) > 32768)) {
+        dtype = NDARRAY_FLOAT;
+    }
+    if(args[3].u_obj != mp_const_none) {
+        dtype = (uint8_t)mp_obj_get_int(args[3].u_obj);
+    }
+    ndarray_obj_t *ndarray;
+    if((stop - start)/step < 0) {
+        ndarray = ndarray_new_linear_array(0, dtype);
+    } else {
+        size_t len = (size_t)(MICROPY_FLOAT_C_FUN(ceil)((stop - start)/step));
+        ndarray = create_linspace_arange(start, step, len, dtype);
+    }
+    return MP_OBJ_FROM_PTR(ndarray);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_arange_obj, 1, create_arange);
+#endif
+
+#if ULAB_NUMPY_HAS_CONCATENATE
+//| def concatenate(arrays: Tuple[ulab.ndarray], *, axis: int = 0) -> ulab.ndarray:
+//|     """
+//|     .. param: arrays
+//|       tuple of ndarrays
+//|     .. param: axis
+//|       axis along which the arrays will be joined
+//|
+//|     Join a sequence of arrays along an existing axis."""
+//|     ...
+//|
+
+mp_obj_t create_concatenate(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_axis, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = 0 } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &mp_type_tuple)) {
+        mp_raise_TypeError(translate("first argument must be a tuple of ndarrays"));
+    }
+    int8_t axis = (int8_t)args[1].u_int;
+    size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+    memset(shape, 0, sizeof(size_t)*ULAB_MAX_DIMS);
+    mp_obj_tuple_t *ndarrays = MP_OBJ_TO_PTR(args[0].u_obj);
+
+    // first check, whether the arrays are compatible
+    ndarray_obj_t *_ndarray = MP_OBJ_TO_PTR(ndarrays->items[0]);
+    uint8_t dtype = _ndarray->dtype;
+    uint8_t ndim = _ndarray->ndim;
+    if(axis < 0) {
+        axis += ndim;
+    }
+    if((axis < 0) || (axis >= ndim)) {
+        mp_raise_ValueError(translate("wrong axis specified"));
+    }
+    // shift axis
+    axis = ULAB_MAX_DIMS - ndim + axis;
+    for(uint8_t j=0; j < ULAB_MAX_DIMS; j++) {
+        shape[j] = _ndarray->shape[j];
+    }
+
+    for(uint8_t i=1; i < ndarrays->len; i++) {
+        _ndarray = MP_OBJ_TO_PTR(ndarrays->items[i]);
+        // check, whether the arrays are compatible
+        if((dtype != _ndarray->dtype) || (ndim != _ndarray->ndim)) {
+            mp_raise_ValueError(translate("input arrays are not compatible"));
+        }
+        for(uint8_t j=0; j < ULAB_MAX_DIMS; j++) {
+            if(j == axis) {
+                shape[j] += _ndarray->shape[j];
+            } else {
+                if(shape[j] != _ndarray->shape[j]) {
+                    mp_raise_ValueError(translate("input arrays are not compatible"));
+                }
+            }
+        }
+    }
+
+    ndarray_obj_t *target = ndarray_new_dense_ndarray(ndim, shape, dtype);
+    uint8_t *tpos = (uint8_t *)target->array;
+    uint8_t *tarray;
+
+    for(uint8_t p=0; p < ndarrays->len; p++) {
+        // reset the pointer along the axis
+        ndarray_obj_t *source = MP_OBJ_TO_PTR(ndarrays->items[p]);
+        uint8_t *sarray = (uint8_t *)source->array;
+        tarray = tpos;
+
+        #if ULAB_MAX_DIMS > 3
+        size_t i = 0;
+        do {
+        #endif
+            #if ULAB_MAX_DIMS > 2
+            size_t j = 0;
+            do {
+            #endif
+                #if ULAB_MAX_DIMS > 1
+                size_t k = 0;
+                do {
+                #endif
+                    size_t l = 0;
+                    do {
+                        memcpy(tarray, sarray, source->itemsize);
+                        tarray += target->strides[ULAB_MAX_DIMS - 1];
+                        sarray += source->strides[ULAB_MAX_DIMS - 1];
+                        l++;
+                    } while(l < source->shape[ULAB_MAX_DIMS - 1]);
+                #if ULAB_MAX_DIMS > 1
+                    tarray -= target->strides[ULAB_MAX_DIMS - 1] * source->shape[ULAB_MAX_DIMS-1];
+                    tarray += target->strides[ULAB_MAX_DIMS - 2];
+                    sarray -= source->strides[ULAB_MAX_DIMS - 1] * source->shape[ULAB_MAX_DIMS-1];
+                    sarray += source->strides[ULAB_MAX_DIMS - 2];
+                    k++;
+                } while(k < source->shape[ULAB_MAX_DIMS - 2]);
+                #endif
+            #if ULAB_MAX_DIMS > 2
+                tarray -= target->strides[ULAB_MAX_DIMS - 2] * source->shape[ULAB_MAX_DIMS-2];
+                tarray += target->strides[ULAB_MAX_DIMS - 3];
+                sarray -= source->strides[ULAB_MAX_DIMS - 2] * source->shape[ULAB_MAX_DIMS-2];
+                sarray += source->strides[ULAB_MAX_DIMS - 3];
+                j++;
+            } while(j < source->shape[ULAB_MAX_DIMS - 3]);
+            #endif
+        #if ULAB_MAX_DIMS > 3
+            tarray -= target->strides[ULAB_MAX_DIMS - 3] * source->shape[ULAB_MAX_DIMS-3];
+            tarray += target->strides[ULAB_MAX_DIMS - 4];
+            sarray -= source->strides[ULAB_MAX_DIMS - 3] * source->shape[ULAB_MAX_DIMS-3];
+            sarray += source->strides[ULAB_MAX_DIMS - 4];
+            i++;
+        } while(i < source->shape[ULAB_MAX_DIMS - 4]);
+        #endif
+        if(p < ndarrays->len - 1) {
+            tpos += target->strides[axis] * source->shape[axis];
+        }
+    }
+    return MP_OBJ_FROM_PTR(target);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_concatenate_obj, 1, create_concatenate);
+#endif
+
+#if ULAB_NUMPY_HAS_DIAG
+//| def diag(a: ulab.ndarray, *, k: int = 0) -> ulab.ndarray:
+//|     """
+//|     .. param: a
+//|       an ndarray
+//|     .. param: k
+//|       Offset of the diagonal from the main diagonal. Can be positive or negative.
+//|
+//|     Return specified diagonals."""
+//|     ...
+//|
+mp_obj_t create_diag(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_k, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = 0 } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    if(!MP_OBJ_IS_TYPE(args[0].u_obj, &ulab_ndarray_type)) {
+        mp_raise_TypeError(translate("input must be an ndarray"));
+    }
+    ndarray_obj_t *source = MP_OBJ_TO_PTR(args[0].u_obj);
+    if(source->ndim == 1) { // return a rank-2 tensor with the prescribed diagonal
+        ndarray_obj_t *target = ndarray_new_dense_ndarray(2, ndarray_shape_vector(0, 0, source->len, source->len), source->dtype);
+        uint8_t *sarray = (uint8_t *)source->array;
+        uint8_t *tarray = (uint8_t *)target->array;
+        for(size_t i=0; i < source->len; i++) {
+            memcpy(tarray, sarray, source->itemsize);
+            sarray += source->strides[ULAB_MAX_DIMS - 1];
+            tarray += (source->len + 1) * target->itemsize;
+        }
+        return MP_OBJ_FROM_PTR(target);
+    }
+    if(source->ndim > 2) {
+        mp_raise_TypeError(translate("input must be a tensor of rank 2"));
+    }
+    int32_t k = args[1].u_int;
+    size_t len = 0;
+    uint8_t *sarray = (uint8_t *)source->array;
+    if(k < 0) { // move the pointer "vertically"
+        if(-k < (int32_t)source->shape[ULAB_MAX_DIMS - 2]) {
+            sarray -= k * source->strides[ULAB_MAX_DIMS - 2];
+            len = MIN(source->shape[ULAB_MAX_DIMS - 2] + k, source->shape[ULAB_MAX_DIMS - 1]);
+        }
+    } else { // move the pointer "horizontally"
+        if(k < (int32_t)source->shape[ULAB_MAX_DIMS - 1]) {
+            sarray += k * source->strides[ULAB_MAX_DIMS - 1];
+            len = MIN(source->shape[ULAB_MAX_DIMS - 1] - k, source->shape[ULAB_MAX_DIMS - 2]);
+        }
+    }
+
+    if(len == 0) {
+        mp_raise_ValueError(translate("offset is too large"));
+    }
+
+    ndarray_obj_t *target = ndarray_new_linear_array(len, source->dtype);
+    uint8_t *tarray = (uint8_t *)target->array;
+
+    for(size_t i=0; i < len; i++) {
+        memcpy(tarray, sarray, source->itemsize);
+        sarray += source->strides[ULAB_MAX_DIMS - 2];
+        sarray += source->strides[ULAB_MAX_DIMS - 1];
+        tarray += source->itemsize;
+    }
+    return MP_OBJ_FROM_PTR(target);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_diag_obj, 1, create_diag);
+#endif /* ULAB_NUMPY_HAS_DIAG */
+
+#if ULAB_MAX_DIMS > 1
+#if ULAB_NUMPY_HAS_EYE
+//| def eye(size: int, *, M: Optional[int] = None, k: int = 0, dtype: _DType = ulab.float) -> ulab.ndarray:
+//|     """Return a new 2-D array with ``size`` rows (and ``M`` columns, if given), with
+//|        the elements on the ``k``-th diagonal set to 1 and the other elements set to 0."""
+//|     ...
+//|
+
+mp_obj_t create_eye(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_INT, { .u_int = 0 } },
+        { MP_QSTR_M, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_k, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = 0 } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = NDARRAY_FLOAT } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    size_t n = args[0].u_int, m;
+    size_t k = args[2].u_int > 0 ? (size_t)args[2].u_int : (size_t)(-args[2].u_int);
+    uint8_t dtype = args[3].u_int;
+    if(args[1].u_rom_obj == mp_const_none) {
+        m = n;
+    } else {
+        m = mp_obj_get_int(args[1].u_rom_obj);
+    }
+    ndarray_obj_t *ndarray = ndarray_new_dense_ndarray(2, ndarray_shape_vector(0, 0, n, m), dtype);
+    mp_obj_t one = mp_obj_new_int(1);
+    size_t i = 0;
+    if((args[2].u_int >= 0)) {
+        while(k < m) {
+            mp_binary_set_val_array(dtype, ndarray->array, i*m+k, one);
+            k++;
+            i++;
+        }
+    } else {
+        while(k < n) {
+            mp_binary_set_val_array(dtype, ndarray->array, k*m+i, one);
+            k++;
+            i++;
+        }
+    }
+    return MP_OBJ_FROM_PTR(ndarray);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_eye_obj, 1, create_eye);
+#endif /* ULAB_NUMPY_HAS_EYE */
+#endif /* ULAB_MAX_DIMS > 1 */
+
+#if ULAB_NUMPY_HAS_FULL
+//| def full(shape: Union[int, Tuple[int, ...]], fill_value: Union[_float, _bool], *, dtype: _DType = ulab.float) -> ulab.ndarray:
+//|    """
+//|    .. param: shape
+//|       Shape of the array, either an integer (for a 1-D array) or a tuple of integers (for tensors of higher rank)
+//|    .. param: fill_value
+//|       scalar, the value with which the array is filled
+//|    .. param: dtype
+//|       Type of values in the array
+//|
+//|    Return a new array of the given shape with all elements set to ``fill_value``."""
+//|    ...
+//|
+
+mp_obj_t create_full(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_obj = MP_OBJ_NULL } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_obj = MP_OBJ_NULL } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = NDARRAY_FLOAT } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    uint8_t dtype = args[2].u_int;
+
+    return create_zeros_ones_full(args[0].u_obj, dtype, args[1].u_obj);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_full_obj, 0, create_full);
+#endif
+
+
+#if ULAB_NUMPY_HAS_LINSPACE
+//| def linspace(
+//|     start: _float,
+//|     stop: _float,
+//|     *,
+//|     dtype: _DType = ulab.float,
+//|     num: int = 50,
+//|     endpoint: _bool = True,
+//|     retstep: _bool = False
+//| ) -> ulab.ndarray:
+//|     """
+//|     .. param: start
+//|       First value in the array
+//|     .. param: stop
+//|       Final value in the array
+//|     .. param int: num
+//|       Count of values in the array.
+//|     .. param: dtype
+//|       Type of values in the array
+//|     .. param bool: endpoint
+//|       Whether the ``stop`` value is included.  Note that even when
+//|       endpoint=True, the exact ``stop`` value may not be included due to the
+//|       inaccuracy of floating point arithmetic.
+//|     .. param bool: retstep
+//|       If True, return (`samples`, `step`), where `step` is the spacing between samples.
+//|
+//|     Return a new 1-D array with ``num`` elements ranging from ``start`` to ``stop`` linearly."""
+//|     ...
+//|
+
+mp_obj_t create_linspace(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_num, MP_ARG_INT, { .u_int = 50 } },
+        { MP_QSTR_endpoint, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = mp_const_true } },
+        { MP_QSTR_retstep, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = mp_const_false } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = NDARRAY_FLOAT } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    if(args[2].u_int < 2) {
+        mp_raise_ValueError(translate("number of points must be at least 2"));
+    }
+    size_t len = (size_t)args[2].u_int;
+    mp_float_t start, step;
+    start = mp_obj_get_float(args[0].u_obj);
+    uint8_t typecode = args[5].u_int;
+    if(args[3].u_obj == mp_const_true) step = (mp_obj_get_float(args[1].u_obj)-start)/(len-1);
+    else step = (mp_obj_get_float(args[1].u_obj)-start)/len;
+    ndarray_obj_t *ndarray = create_linspace_arange(start, step, len, typecode);
+    if(args[4].u_obj == mp_const_false) {
+        return MP_OBJ_FROM_PTR(ndarray);
+    } else {
+        mp_obj_t tuple[2];
+        tuple[0] = MP_OBJ_FROM_PTR(ndarray);
+        tuple[1] = mp_obj_new_float(step);
+        return mp_obj_new_tuple(2, tuple);
+    }
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_linspace_obj, 2, create_linspace);
+#endif
+
+#if ULAB_NUMPY_HAS_LOGSPACE
+//| def logspace(
+//|     start: _float,
+//|     stop: _float,
+//|     *,
+//|     dtype: _DType = ulab.float,
+//|     num: int = 50,
+//|     endpoint: _bool = True,
+//|     base: _float = 10.0
+//| ) -> ulab.ndarray:
+//|     """
+//|     .. param: start
+//|       First value in the array
+//|     .. param: stop
+//|       Final value in the array
+//|     .. param int: num
+//|       Count of values in the array. Defaults to 50.
+//|     .. param: base
+//|       The base of the log space. The step size between the elements in
+//|       ``ln(samples) / ln(base)`` (or ``log_base(samples)``) is uniform. Defaults to 10.0.
+//|     .. param: dtype
+//|       Type of values in the array
+//|     .. param bool: endpoint
+//|       Whether the ``stop`` value is included.  Note that even when
+//|       endpoint=True, the exact ``stop`` value may not be included due to the
+//|       inaccuracy of floating point arithmetic. Defaults to True.
+//|
+//|     Return a new 1-D array with ``num`` evenly spaced elements on a log scale.
+//|     The sequence starts at ``base ** start``, and ends with ``base ** stop``."""
+//|     ...
+//|
+
+const mp_obj_float_t create_float_const_ten = {{&mp_type_float}, MICROPY_FLOAT_CONST(10.0)};
+
+mp_obj_t create_logspace(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_num, MP_ARG_INT, { .u_int = 50 } },
+        { MP_QSTR_base, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = MP_ROM_PTR(&create_float_const_ten) } },
+        { MP_QSTR_endpoint, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = mp_const_true } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = NDARRAY_FLOAT } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    if(args[2].u_int < 2) {
+        mp_raise_ValueError(translate("number of points must be at least 2"));
+    }
+    size_t len = (size_t)args[2].u_int;
+    mp_float_t start, step, quotient;
+    start = mp_obj_get_float(args[0].u_obj);
+    uint8_t dtype = args[5].u_int;
+    mp_float_t base = mp_obj_get_float(args[3].u_obj);
+    if(args[4].u_obj == mp_const_true) step = (mp_obj_get_float(args[1].u_obj)-start)/(len-1);
+    else step = (mp_obj_get_float(args[1].u_obj)-start)/len;
+    quotient = MICROPY_FLOAT_C_FUN(pow)(base, step);
+    ndarray_obj_t *ndarray = ndarray_new_linear_array(len, dtype);
+    mp_float_t value = MICROPY_FLOAT_C_FUN(pow)(base, start);
+    if(dtype == NDARRAY_UINT8) {
+        uint8_t *array = (uint8_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value *= quotient) *array++ = (uint8_t)value;
+    } else if(dtype == NDARRAY_INT8) {
+        int8_t *array = (int8_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value *= quotient) *array++ = (int8_t)value;
+    } else if(dtype == NDARRAY_UINT16) {
+        uint16_t *array = (uint16_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value *= quotient) *array++ = (uint16_t)value;
+    } else if(dtype == NDARRAY_INT16) {
+        int16_t *array = (int16_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value *= quotient) *array++ = (int16_t)value;
+    } else {
+        mp_float_t *array = (mp_float_t *)ndarray->array;
+        for(size_t i=0; i < len; i++, value *= quotient) *array++ = value;
+    }
+    return MP_OBJ_FROM_PTR(ndarray);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_logspace_obj, 2, create_logspace);
+#endif
+
+#if ULAB_NUMPY_HAS_ONES
+//| def ones(shape: Union[int, Tuple[int, ...]], *, dtype: _DType = ulab.float) -> ulab.ndarray:
+//|    """
+//|    .. param: shape
+//|       Shape of the array, either an integer (for a 1-D array) or a tuple of integers (for tensors of higher rank)
+//|    .. param: dtype
+//|       Type of values in the array
+//|
+//|    Return a new array of the given shape with all elements set to 1."""
+//|    ...
+//|
+
+mp_obj_t create_ones(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_obj = MP_OBJ_NULL } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = NDARRAY_FLOAT } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    uint8_t dtype = args[1].u_int;
+    mp_obj_t one = mp_obj_new_int(1);
+    return create_zeros_ones_full(args[0].u_obj, dtype, one);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_ones_obj, 0, create_ones);
+#endif
+
+#if ULAB_NUMPY_HAS_ZEROS
+//| def zeros(shape: Union[int, Tuple[int, ...]], *, dtype: _DType = ulab.float) -> ulab.ndarray:
+//|    """
+//|    .. param: shape
+//|       Shape of the array, either an integer (for a 1-D array) or a tuple of integers (for tensors of higher rank)
+//|    .. param: dtype
+//|       Type of values in the array
+//|
+//|    Return a new array of the given shape with all elements set to 0."""
+//|    ...
+//|
+
+mp_obj_t create_zeros(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_obj = MP_OBJ_NULL } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_INT, { .u_int = NDARRAY_FLOAT } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    uint8_t dtype = args[1].u_int;
+    return create_zeros_ones_full(args[0].u_obj, dtype, mp_const_none);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_zeros_obj, 0, create_zeros);
+#endif
+
+#if ULAB_NUMPY_HAS_FROMBUFFER
+mp_obj_t create_frombuffer(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
+    static const mp_arg_t allowed_args[] = {
+        { MP_QSTR_, MP_ARG_REQUIRED | MP_ARG_OBJ, { .u_rom_obj = mp_const_none } },
+        { MP_QSTR_dtype, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = MP_ROM_INT(NDARRAY_FLOAT) } },
+        { MP_QSTR_count, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = MP_ROM_INT(-1) } },
+        { MP_QSTR_offset, MP_ARG_KW_ONLY | MP_ARG_OBJ, { .u_rom_obj = MP_ROM_INT(0) } },
+    };
+
+    mp_arg_val_t args[MP_ARRAY_SIZE(allowed_args)];
+    mp_arg_parse_all(n_args, pos_args, kw_args, MP_ARRAY_SIZE(allowed_args), allowed_args, args);
+
+    uint8_t dtype = mp_obj_get_int(args[1].u_obj);
+    size_t offset = mp_obj_get_int(args[3].u_obj);
+
+    mp_buffer_info_t bufinfo;
+    if(mp_get_buffer(args[0].u_obj, &bufinfo, MP_BUFFER_READ)) {
+        size_t sz = 1;
+        if(dtype != NDARRAY_BOOL) { // mp_binary_get_size doesn't work with Booleans
+            sz = mp_binary_get_size('@', dtype, NULL);
+        }
+        if(bufinfo.len < offset) {
+            mp_raise_ValueError(translate("offset must be non-negative and no greater than buffer length"));
+        }
+        size_t len = (bufinfo.len - offset) / sz;
+        if((len * sz) != (bufinfo.len - offset)) {
+            mp_raise_ValueError(translate("buffer size must be a multiple of element size"));
+        }
+        if(mp_obj_get_int(args[2].u_obj) > 0) {
+            size_t count = mp_obj_get_int(args[2].u_obj);
+            if(len < count) {
+                mp_raise_ValueError(translate("buffer is smaller than requested size"));
+            } else {
+                len = count;
+            }
+        }
+        ndarray_obj_t *ndarray = ndarray_new_linear_array(len, dtype);
+        uint8_t *array = (uint8_t *)ndarray->array;
+        uint8_t *buffer = bufinfo.buf;
+        memcpy(array, buffer + offset, len * sz);
+        return MP_OBJ_FROM_PTR(ndarray);
+    }
+    return mp_const_none;
+}
+
+MP_DEFINE_CONST_FUN_OBJ_KW(create_frombuffer_obj, 1, create_frombuffer);
+#endif
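
The length and step computations in `create_arange` and `create_linspace` above can be sketched in isolation. This is a minimal standalone illustration, not the ulab implementation: the helper names `sketch_linspace_step` and `sketch_arange_len` are hypothetical, `double` stands in for `mp_float_t`, and a non-zero `step` is assumed (the code above does not guard against `step == 0` either).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of ulab): spacing between linspace samples.
 * With endpoint set, `stop` is the last sample, so the interval is split
 * into num - 1 pieces; otherwise it is split into num pieces. */
static double sketch_linspace_step(double start, double stop, size_t num, int endpoint) {
    if (endpoint) {
        return (stop - start) / (double)(num - 1);
    }
    return (stop - start) / (double)num;
}

/* Hypothetical helper (not part of ulab): number of elements that
 * arange(start, stop, step) would produce. Assumes step != 0. */
static size_t sketch_arange_len(double start, double stop, double step) {
    double span = (stop - start) / step;
    if (span < 0) {
        return 0; /* step points away from stop: empty array */
    }
    /* round up; equivalent to ceil(span) for non-negative span */
    size_t n = (size_t)span;
    if ((double)n < span) {
        n++;
    }
    return n;
}
```

Under these assumptions, `sketch_arange_len(0, 10, 3)` yields 4 (the elements 0, 3, 6, 9), and `sketch_linspace_step(0, 1, 5, 1)` yields 0.25.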
--- a/code/ulab_create.h
+++ b/code/ulab_create.h
@ -0,0 +1,70 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020 Jeff Epler for Adafruit Industries
+ *               2019-2021 Zoltán Vörös
+*/
+
+#ifndef _CREATE_
+#define _CREATE_
+
+#include "ulab.h"
+#include "ndarray.h"
+
+#if ULAB_NUMPY_HAS_ARANGE
+mp_obj_t create_arange(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_arange_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_CONCATENATE
+mp_obj_t create_concatenate(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_concatenate_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_DIAG
+mp_obj_t create_diag(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_diag_obj);
+#endif
+
+#if ULAB_MAX_DIMS > 1
+#if ULAB_NUMPY_HAS_EYE
+mp_obj_t create_eye(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_eye_obj);
+#endif
+#endif
+
+#if ULAB_NUMPY_HAS_FULL
+mp_obj_t create_full(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_full_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_LINSPACE
+mp_obj_t create_linspace(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_linspace_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_LOGSPACE
+mp_obj_t create_logspace(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_logspace_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_ONES
+mp_obj_t create_ones(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_ones_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_ZEROS
+mp_obj_t create_zeros(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_zeros_obj);
+#endif
+
+#if ULAB_NUMPY_HAS_FROMBUFFER
+mp_obj_t create_frombuffer(size_t , const mp_obj_t *, mp_map_t *);
+MP_DECLARE_CONST_FUN_OBJ_KW(create_frombuffer_obj);
+#endif
+
+#endif
--- a/code/ulab_tools.c
+++ b/code/ulab_tools.c
@ -0,0 +1,160 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+ */
+
+
+
+#include "py/runtime.h"
+
+#include "ulab.h"
+#include "ndarray.h"
+#include "ulab_tools.h"
+
+// The following five functions return a float from a void pointer
+// The value in question is expected to be located at the address the pointer refers to
+
+mp_float_t ndarray_get_float_uint8(void *data) {
+    // Returns a float value from an uint8_t type
+    return (mp_float_t)(*(uint8_t *)data);
+}
+
+mp_float_t ndarray_get_float_int8(void *data) {
+    // Returns a float value from an int8_t type
+    return (mp_float_t)(*(int8_t *)data);
+}
+
+mp_float_t ndarray_get_float_uint16(void *data) {
+    // Returns a float value from an uint16_t type
+    return (mp_float_t)(*(uint16_t *)data);
+}
+
+mp_float_t ndarray_get_float_int16(void *data) {
+    // Returns a float value from an int16_t type
+    return (mp_float_t)(*(int16_t *)data);
+}
+
+
+mp_float_t ndarray_get_float_float(void *data) {
+    // Returns a float value from an mp_float_t type
+    return *((mp_float_t *)data);
+}
+
+// returns a single function pointer, depending on the dtype
+void *ndarray_get_float_function(uint8_t dtype) {
+    if(dtype == NDARRAY_UINT8) {
+        return ndarray_get_float_uint8;
+    } else if(dtype == NDARRAY_INT8) {
+        return ndarray_get_float_int8;
+    } else if(dtype == NDARRAY_UINT16) {
+        return ndarray_get_float_uint16;
+    } else if(dtype == NDARRAY_INT16) {
+        return ndarray_get_float_int16;
+    } else {
+        return ndarray_get_float_float;
+    }
+}
+
+mp_float_t ndarray_get_float_index(void *data, uint8_t dtype, size_t index) {
+    // returns a single float value from an array located at index
+    if(dtype == NDARRAY_UINT8) {
+        return (mp_float_t)((uint8_t *)data)[index];
+    } else if(dtype == NDARRAY_INT8) {
+        return (mp_float_t)((int8_t *)data)[index];
+    } else if(dtype == NDARRAY_UINT16) {
+        return (mp_float_t)((uint16_t *)data)[index];
+    } else if(dtype == NDARRAY_INT16) {
+        return (mp_float_t)((int16_t *)data)[index];
+    } else {
+        return (mp_float_t)((mp_float_t *)data)[index];
+    }
+}
+
+mp_float_t ndarray_get_float_value(void *data, uint8_t dtype) {
+    // Returns a float value from an arbitrary data type
+    // The value in question is supposed to be located at the head of the pointer
+    if(dtype == NDARRAY_UINT8) {
+        return (mp_float_t)(*(uint8_t *)data);
+    } else if(dtype == NDARRAY_INT8) {
+        return (mp_float_t)(*(int8_t *)data);
+    } else if(dtype == NDARRAY_UINT16) {
+        return (mp_float_t)(*(uint16_t *)data);
+    } else if(dtype == NDARRAY_INT16) {
+        return (mp_float_t)(*(int16_t *)data);
+    } else {
+        return *((mp_float_t *)data);
+    }
+}
+
+#if NDARRAY_BINARY_USES_FUN_POINTER
+uint8_t ndarray_upcast_dtype(uint8_t ldtype, uint8_t rdtype) {
+    // returns a single character (the dtype code) that corresponds to the upcasting rules
+    // - if one of the operands is a float, the result is always float
+    // - operation on identical types preserves type
+    //
+    // uint8 + int8 => int16
+    // uint8 + int16 => int16
+    // uint8 + uint16 => uint16
+    // int8 + int16 => int16
+    // int8 + uint16 => uint16
+    // uint16 + int16 => float
+
+    if(ldtype == rdtype) {
+        // if the two dtypes are equal, the result is also of that type
+        return ldtype;
+    } else if(((ldtype == NDARRAY_UINT8) && (rdtype == NDARRAY_INT8)) ||
+            ((ldtype == NDARRAY_INT8) && (rdtype == NDARRAY_UINT8)) ||
+            ((ldtype == NDARRAY_UINT8) && (rdtype == NDARRAY_INT16)) ||
+            ((ldtype == NDARRAY_INT16) && (rdtype == NDARRAY_UINT8)) ||
+            ((ldtype == NDARRAY_INT8) && (rdtype == NDARRAY_INT16)) ||
+            ((ldtype == NDARRAY_INT16) && (rdtype == NDARRAY_INT8))) {
+        return NDARRAY_INT16;
+    } else if(((ldtype == NDARRAY_UINT8) && (rdtype == NDARRAY_UINT16)) ||
+            ((ldtype == NDARRAY_UINT16) && (rdtype == NDARRAY_UINT8)) ||
+            ((ldtype == NDARRAY_INT8) && (rdtype == NDARRAY_UINT16)) ||
+            ((ldtype == NDARRAY_UINT16) && (rdtype == NDARRAY_INT8))) {
+        return NDARRAY_UINT16;
+    }
+    return NDARRAY_FLOAT;
+}
+
+void ndarray_set_float_uint8(void *data, mp_float_t datum) {
+    *((uint8_t *)data) = (uint8_t)datum;
+}
+
+void ndarray_set_float_int8(void *data, mp_float_t datum) {
+    *((int8_t *)data) = (int8_t)datum;
+}
+
+void ndarray_set_float_uint16(void *data, mp_float_t datum) {
+    *((uint16_t *)data) = (uint16_t)datum;
+}
+
+void ndarray_set_float_int16(void *data, mp_float_t datum) {
+    *((int16_t *)data) = (int16_t)datum;
+}
+
+void ndarray_set_float_float(void *data, mp_float_t datum) {
+    *((mp_float_t *)data) = datum;
+}
+
+// returns a single function pointer, depending on the dtype
+void *ndarray_set_float_function(uint8_t dtype) {
+    if(dtype == NDARRAY_UINT8) {
+        return ndarray_set_float_uint8;
+    } else if(dtype == NDARRAY_INT8) {
+        return ndarray_set_float_int8;
+    } else if(dtype == NDARRAY_UINT16) {
+        return ndarray_set_float_uint16;
+    } else if(dtype == NDARRAY_INT16) {
+        return ndarray_set_float_int16;
+    } else {
+        return ndarray_set_float_float;
+    }
+}
+#endif /* NDARRAY_BINARY_USES_FUN_POINTER */
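
The upcasting table documented in `ndarray_upcast_dtype` above can be written more compactly. The following is a standalone sketch under stated assumptions, not a drop-in replacement: the `sketch_dtype` enum and `sketch_upcast` function are hypothetical, and the enum values are placeholders rather than the actual `NDARRAY_*` codes.

```c
#include <assert.h>

/* Hypothetical stand-ins for the NDARRAY_* dtype codes. */
enum sketch_dtype { SK_UINT8, SK_INT8, SK_UINT16, SK_INT16, SK_FLOAT };

/* Hypothetical helper: result dtype of a binary operation, following the
 * rules listed in the comment block of ndarray_upcast_dtype. */
static enum sketch_dtype sketch_upcast(enum sketch_dtype l, enum sketch_dtype r) {
    if (l == r) {
        return l; /* identical types are preserved */
    }
    if (l == SK_FLOAT || r == SK_FLOAT) {
        return SK_FLOAT; /* a float operand always wins */
    }
    if ((l == SK_UINT16 && r == SK_INT16) || (l == SK_INT16 && r == SK_UINT16)) {
        return SK_FLOAT; /* neither 16-bit type can hold the other's range */
    }
    if (l == SK_UINT16 || r == SK_UINT16) {
        return SK_UINT16; /* uint8 or int8 combined with uint16 */
    }
    return SK_INT16; /* remaining mixes of uint8, int8, int16 */
}
```

The ordering of the checks matters: the `uint16`/`int16` pair has to be handled before the generic `uint16` branch, otherwise it would be reported as `uint16` instead of float.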
--- a/code/ulab_tools.h
+++ b/code/ulab_tools.h
@ -0,0 +1,26 @@
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+*/
+
+#ifndef _TOOLS_
+#define _TOOLS_
+
+#define SWAP(t, a, b) { t tmp = a; a = b; b = tmp; }
+
+mp_float_t ndarray_get_float_uint8(void *);
+mp_float_t ndarray_get_float_int8(void *);
+mp_float_t ndarray_get_float_uint16(void *);
+mp_float_t ndarray_get_float_int16(void *);
+mp_float_t ndarray_get_float_float(void *);
+void *ndarray_get_float_function(uint8_t );
+
+uint8_t ndarray_upcast_dtype(uint8_t , uint8_t );
+void *ndarray_set_float_function(uint8_t );
+
+#endif
--- a/code/user/user.c
+++ b/code/user/user.c
@ -0,0 +1,95 @@
+
+/*
+ * This file is part of the micropython-ulab project,
+ *
+ * https://github.com/v923z/micropython-ulab
+ *
+ * The MIT License (MIT)
+ *
+ * Copyright (c) 2020-2021 Zoltán Vörös
+*/
+
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+#include "py/obj.h"
+#include "py/runtime.h"
+#include "py/misc.h"
+#include "user.h"
+
+#if ULAB_HAS_USER_MODULE
+
+//| """This module should hold arbitrary user-defined functions."""
+//|
+
+static mp_obj_t user_square(mp_obj_t arg) {
+    // the function takes a single dense ndarray, and calculates the
+    // element-wise square of its entries
+
+    // raise a TypeError exception, if the input is not an ndarray
+    if(!MP_OBJ_IS_TYPE(arg, &ulab_ndarray_type)) {
+        mp_raise_TypeError(translate("input must be an ndarray"));
+    }
+    ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(arg);
+
+    // make sure that the input is a dense array
+    if(!ndarray_is_dense(ndarray)) {
+        mp_raise_TypeError(translate("input must be a dense ndarray"));
+    }
+
+    // if the input is a dense array, create `results` with the same number of
+    // dimensions, shape, and dtype
+    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndarray->ndim, ndarray->shape, ndarray->dtype);
+
+    // since in a dense array the iteration over the elements is trivial, we
+    // can cast the data arrays ndarray->array and results->array to the actual type
+    if(ndarray->dtype == NDARRAY_UINT8) {
+        uint8_t *array = (uint8_t *)ndarray->array;
+        uint8_t *rarray = (uint8_t *)results->array;
+        for(size_t i=0; i < ndarray->len; i++, array++) {
+            *rarray++ = (*array) * (*array);
+        }
+    } else if(ndarray->dtype == NDARRAY_INT8) {
+        int8_t *array = (int8_t *)ndarray->array;
+        int8_t *rarray = (int8_t *)results->array;
+        for(size_t i=0; i < ndarray->len; i++, array++) {
+            *rarray++ = (*array) * (*array);
+        }
+    } else if(ndarray->dtype == NDARRAY_UINT16) {
+        uint16_t *array = (uint16_t *)ndarray->array;
+        uint16_t *rarray = (uint16_t *)results->array;
+        for(size_t i=0; i < ndarray->len; i++, array++) {
+            *rarray++ = (*array) * (*array);
+        }
+    } else if(ndarray->dtype == NDARRAY_INT16) {
+        int16_t *array = (int16_t *)ndarray->array;
+        int16_t *rarray = (int16_t *)results->array;
+        for(size_t i=0; i < ndarray->len; i++, array++) {
+            *rarray++ = (*array) * (*array);
+        }
+    } else { // if we end up here, the dtype is NDARRAY_FLOAT
+        mp_float_t *array = (mp_float_t *)ndarray->array;
+        mp_float_t *rarray = (mp_float_t *)results->array;
+        for(size_t i=0; i < ndarray->len; i++, array++) {
+            *rarray++ = (*array) * (*array);
+        }
+    }
+    // at the end, return a micropython object
+    return MP_OBJ_FROM_PTR(results);
+}
+
+MP_DEFINE_CONST_FUN_OBJ_1(user_square_obj, user_square);
+
+static const mp_rom_map_elem_t ulab_user_globals_table[] = {
+    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_user) },
+    { MP_OBJ_NEW_QSTR(MP_QSTR_square), (mp_obj_t)&user_square_obj },
+};
+
+static MP_DEFINE_CONST_DICT(mp_module_ulab_user_globals, ulab_user_globals_table);
+
+mp_obj_module_t ulab_user_module = {
+    .base = { &mp_type_module },
+    .globals = (mp_obj_dict_t*)&mp_module_ulab_user_globals,
+};
+
+#endif
--- a/code/user/user.h
+++ b/code/user/user.h
@ -6,18 +6,15 @@
 *
 * The MIT License (MIT)
 *
- * Copyright (c) 2020 Zoltán Vörös
+ * Copyright (c) 2020-2021 Zoltán Vörös
 */

-#ifndef _EXTRA_
-#define _EXTRA_
+#ifndef _USER_
+#define _USER_

 #include "ulab.h"
 #include "ndarray.h"

-#if ULAB_EXTRAS_MODULE
-
-mp_obj_module_t ulab_extras_module;
+extern mp_obj_module_t ulab_user_module;

 #endif
-#endif
--- a/code/vectorise.c
+++ b/code/vectorise.c
@ -1,174 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#include <math.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include "py/runtime.h"
-#include "py/binary.h"
-#include "py/obj.h"
-#include "py/objarray.h"
-#include "vectorise.h"
-
-#ifndef MP_PI
-#define MP_PI MICROPY_FLOAT_CONST(3.14159265358979323846)
-#endif
-    
-#if ULAB_VECTORISE_MODULE
-mp_obj_t vectorise_generic_vector(mp_obj_t o_in, mp_float_t (*f)(mp_float_t)) {
-    // Return a single value, if o_in is not iterable
-    if(mp_obj_is_float(o_in) || MP_OBJ_IS_INT(o_in)) {
-            return mp_obj_new_float(f(mp_obj_get_float(o_in)));
-    }
-    mp_float_t x;
-    if(MP_OBJ_IS_TYPE(o_in, &ulab_ndarray_type)) {
-        ndarray_obj_t *source = MP_OBJ_TO_PTR(o_in);
-        ndarray_obj_t *ndarray = create_new_ndarray(source->m, source->n, NDARRAY_FLOAT);
-        mp_float_t *dataout = (mp_float_t *)ndarray->array->items;
-        if(source->array->typecode == NDARRAY_UINT8) {
-            ITERATE_VECTOR(uint8_t, source, dataout);
-        } else if(source->array->typecode == NDARRAY_INT8) {
-            ITERATE_VECTOR(int8_t, source, dataout);
-        } else if(source->array->typecode == NDARRAY_UINT16) {
-            ITERATE_VECTOR(uint16_t, source, dataout);
-        } else if(source->array->typecode == NDARRAY_INT16) {
-            ITERATE_VECTOR(int16_t, source, dataout);
-        } else {
-            ITERATE_VECTOR(mp_float_t, source, dataout);
-        }
-        return MP_OBJ_FROM_PTR(ndarray);
-    } else if(MP_OBJ_IS_TYPE(o_in, &mp_type_tuple) || MP_OBJ_IS_TYPE(o_in, &mp_type_list) || 
-        MP_OBJ_IS_TYPE(o_in, &mp_type_range)) { // i.e., the input is a generic iterable
-            mp_obj_array_t *o = MP_OBJ_TO_PTR(o_in);
-            ndarray_obj_t *out = create_new_ndarray(1, o->len, NDARRAY_FLOAT);
-            mp_float_t *dataout = (mp_float_t *)out->array->items;
-            mp_obj_iter_buf_t iter_buf;
-            mp_obj_t item, iterable = mp_getiter(o_in, &iter_buf);
-            size_t i=0;
-            while ((item = mp_iternext(iterable)) != MP_OBJ_STOP_ITERATION) {
-                x = mp_obj_get_float(item);
-                dataout[i++] = f(x);
-            }
-        return MP_OBJ_FROM_PTR(out);
-    }
-    return mp_const_none;
-}
-
-
-MATH_FUN_1(acos, acos);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_acos_obj, vectorise_acos);
-
-MATH_FUN_1(acosh, acosh);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_acosh_obj, vectorise_acosh);
-
-MATH_FUN_1(asin, asin);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_asin_obj, vectorise_asin);
-
-MATH_FUN_1(asinh, asinh);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_asinh_obj, vectorise_asinh);
-
-MATH_FUN_1(atan, atan);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_atan_obj, vectorise_atan);
-
-MATH_FUN_1(atanh, atanh);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_atanh_obj, vectorise_atanh);
-
-MATH_FUN_1(ceil, ceil);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_ceil_obj, vectorise_ceil);
-
-MATH_FUN_1(cos, cos);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_cos_obj, vectorise_cos);
-
-MATH_FUN_1(cosh, cosh);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_cosh_obj, vectorise_cosh);
-
-MATH_FUN_1(erf, erf);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_erf_obj, vectorise_erf);
-
-MATH_FUN_1(erfc, erfc);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_erfc_obj, vectorise_erfc);
-
-MATH_FUN_1(exp, exp);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_exp_obj, vectorise_exp);
-
-MATH_FUN_1(expm1, expm1);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_expm1_obj, vectorise_expm1);
-
-MATH_FUN_1(floor, floor);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_floor_obj, vectorise_floor);
-
-MATH_FUN_1(gamma, tgamma);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_gamma_obj, vectorise_gamma);
-
-MATH_FUN_1(lgamma, lgamma);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_lgamma_obj, vectorise_lgamma);
-
-MATH_FUN_1(log, log);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_log_obj, vectorise_log);
-
-MATH_FUN_1(log10, log10);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_log10_obj, vectorise_log10);
-
-MATH_FUN_1(log2, log2);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_log2_obj, vectorise_log2);
-
-MATH_FUN_1(sin, sin);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_sin_obj, vectorise_sin);
-
-MATH_FUN_1(sinh, sinh);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_sinh_obj, vectorise_sinh);
-
-MATH_FUN_1(sqrt, sqrt);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_sqrt_obj, vectorise_sqrt);
-
-MATH_FUN_1(tan, tan);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_tan_obj, vectorise_tan);
-
-MATH_FUN_1(tanh, tanh);
-MP_DEFINE_CONST_FUN_OBJ_1(vectorise_tanh_obj, vectorise_tanh);
-
-#if !CIRCUITPY
-STATIC const mp_rom_map_elem_t ulab_vectorise_globals_table[] = {
-    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_vector) },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_acos), (mp_obj_t)&vectorise_acos_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_acosh), (mp_obj_t)&vectorise_acosh_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_asin), (mp_obj_t)&vectorise_asin_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_asinh), (mp_obj_t)&vectorise_asinh_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_atan), (mp_obj_t)&vectorise_atan_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_atanh), (mp_obj_t)&vectorise_atanh_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_ceil), (mp_obj_t)&vectorise_ceil_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_cos), (mp_obj_t)&vectorise_cos_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_erf), (mp_obj_t)&vectorise_erf_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_erfc), (mp_obj_t)&vectorise_erfc_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_exp), (mp_obj_t)&vectorise_exp_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_expm1), (mp_obj_t)&vectorise_expm1_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_floor), (mp_obj_t)&vectorise_floor_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_gamma), (mp_obj_t)&vectorise_gamma_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_lgamma), (mp_obj_t)&vectorise_lgamma_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_log), (mp_obj_t)&vectorise_log_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_log10), (mp_obj_t)&vectorise_log10_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_log2), (mp_obj_t)&vectorise_log2_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_sin), (mp_obj_t)&vectorise_sin_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_sinh), (mp_obj_t)&vectorise_sinh_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_sqrt), (mp_obj_t)&vectorise_sqrt_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_tan), (mp_obj_t)&vectorise_tan_obj },
-    { MP_OBJ_NEW_QSTR(MP_QSTR_tanh), (mp_obj_t)&vectorise_tanh_obj },
-};
-
-STATIC MP_DEFINE_CONST_DICT(mp_module_ulab_vectorise_globals, ulab_vectorise_globals_table);
-
-mp_obj_module_t ulab_vectorise_module = {
-    .base = { &mp_type_module },
-    .globals = (mp_obj_dict_t*)&mp_module_ulab_vectorise_globals,
-};
-#endif
-
-#endif
--- a/code/vectorise.h
+++ b/code/vectorise.h
@ -1,35 +0,0 @@
-
-/*
- * This file is part of the micropython-ulab project, 
- *
- * https://github.com/v923z/micropython-ulab
- *
- * The MIT License (MIT)
- *
- * Copyright (c) 2019-2020 Zoltán Vörös
-*/
-
-#ifndef _VECTORISE_
-#define _VECTORISE_
-
-#include "ulab.h"
-#include "ndarray.h"
-
-#if ULAB_VECTORISE_MODULE
-
-mp_obj_module_t ulab_vectorise_module;
-
-#define ITERATE_VECTOR(type, source, out) do {\
-    type *input = (type *)(source)->array->items;\
-    for(size_t i=0; i < (source)->array->len; i++) {\
-                (out)[i] = f(input[i]);\
-    }\
-} while(0)
-
-#define MATH_FUN_1(py_name, c_name) \
-    mp_obj_t vectorise_ ## py_name(mp_obj_t x_obj) { \
-        return vectorise_generic_vector(x_obj, MICROPY_FLOAT_C_FUN(c_name)); \
-}
-    
-#endif
-#endif
--- a/docs/manual/Makefile
+++ b/docs/manual/Makefile
@ -14,6 +14,10 @@ help:

 .PHONY: help Makefile

+clean:
+	rm -rf "$(BUILDDIR)"
+
+
 # Catch-all target: route all unknown targets to Sphinx using the new
 # "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
 %: Makefile
--- a/docs/manual/make.bat
+++ b/docs/manual/make.bat
@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=source
+set BUILDDIR=build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.http://sphinx-doc.org/
+	exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd
--- a/docs/manual/source/conf.py
+++ b/docs/manual/source/conf.py
@ -10,19 +10,24 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
-# import os
+import os
 # import sys
 # sys.path.insert(0, os.path.abspath('.'))

+#import sphinx_rtd_theme
+
+from sphinx.transforms import SphinxTransform
+from docutils import nodes
+from sphinx import addnodes

 # -- Project information -----------------------------------------------------

-project = 'micropython-ulab'
-copyright = '2019, Zoltán Vörös'
+project = 'The ulab book'
+copyright = '2019-2021, Zoltán Vörös and contributors'
 author = 'Zoltán Vörös'

 # The full version, including alpha/beta/rc tags
-release = '0.32'
+release = '2.1.2'


 # -- General configuration ---------------------------------------------------
@ -42,18 +47,46 @@ templates_path = ['_templates']
 exclude_patterns = []


-# -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  See the documentation for
-# a list of builtin themes.
-#
-html_theme = 'sphinx_rtd_theme'
-
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
 # so a file named "default.css" will overwrite the builtin "default.css".
 html_static_path = ['_static']

+latex_maketitle = r'''
+\begin{titlepage}
+\begin{flushright}
+\Huge\textbf{The $\mu$lab book}
+\vskip 0.5em
+\LARGE
+\textbf{Release %s}
+\vskip 5em
+\huge\textbf{Zoltán Vörös}
+\end{flushright}
+\begin{flushright}
+\LARGE
+\vskip 2em
+with contributions by
+\vskip 2em
+\textbf{Roberto Colistete Jr.}
+\vskip 0.2em
+\textbf{Jeff Epler}
+\vskip 0.2em
+\textbf{Taku Fukada}
+\vskip 0.2em
+\textbf{Diego Elio Pettenò}
+\vskip 0.2em
+\textbf{Scott Shawcroft}
+\vskip 5em
+\today
+\end{flushright}
+\end{titlepage}
+'''%release
+
+latex_elements = {
+    'maketitle': latex_maketitle
+}
+
+
 master_doc = 'index'

 author=u'Zoltán Vörös'
@ -61,7 +94,19 @@ copyright=author
 language='en'

 latex_documents = [
-(master_doc, 'ulab-manual.tex', 'Micropython ulab documentation', 
+(master_doc, 'the-ulab-book.tex', r'The $\mu$lab book',
 'Zoltán Vörös', 'manual'),
 ]

+# Read the docs theme
+on_rtd = os.environ.get('READTHEDOCS', None) == 'True'
+if not on_rtd:
+    try:
+        import sphinx_rtd_theme
+        html_theme = 'sphinx_rtd_theme'
+        html_theme_path = [sphinx_rtd_theme.get_html_theme_path(), '.']
+    except ImportError:
+        html_theme = 'default'
+        html_theme_path = ['.']
+else:
+    html_theme_path = ['.']
--- a/docs/manual/source/index.rst
+++ b/docs/manual/source/index.rst
@ -1,16 +1,31 @@
+
 .. ulab-manual documentation master file, created by
   sphinx-quickstart on Sat Oct 19 12:48:00 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

-Welcome to micropython-ulab's documentation!
+Welcome to the ulab book!
 =======================================

 .. toctree::
   :maxdepth: 2
-   :caption: Contents:
+   :caption: Introduction

-   ulab
+   ulab-intro
+
+.. toctree::
+   :maxdepth: 2
+   :caption: User's guide:
+
+   ulab-ndarray
+   numpy-functions
+   numpy-universal
+   numpy-fft
+   numpy-linalg
+   scipy-optimize
+   scipy-signal
+   scipy-special
+   ulab-programming

 Indices and tables
 ==================
--- a/docs/manual/source/numpy-fft.rst
+++ b/docs/manual/source/numpy-fft.rst
@ -0,0 +1,163 @@
+
+Fourier transforms
+==================
+
+Functions related to Fourier transforms can be called by prepending them
+with ``numpy.fft.``. The module defines the following two functions:
+
+1. `numpy.fft.fft <#fft>`__
+2. `numpy.fft.ifft <#ifft>`__
+
+``numpy``:
+https://docs.scipy.org/doc/numpy/reference/generated/numpy.fft.ifft.html
+
+fft
+---
+
+Since ``ulab``\ ’s ``ndarray`` does not support complex numbers, the
+invocation of the Fourier transform differs from that in ``numpy``. In
+``numpy``, you can simply pass an array or iterable to the function, and
+it will be treated as a complex array:
+
+.. code::
+
+    # code to be run in CPython
+    
+    from numpy import fft
+    
+    fft.fft([1, 2, 3, 4, 1, 2, 3, 4])
+
+
+
+.. parsed-literal::
+
+    array([20.+0.j,  0.+0.j, -4.+4.j,  0.+0.j, -4.+0.j,  0.+0.j, -4.-4.j,
+            0.+0.j])
+
+
+
+**WARNING:** The array returned is also complex, i.e., the real and
+imaginary components are cast together. In ``ulab``, the real and
+imaginary parts are treated separately: you have to pass two
+``ndarray``\ s to the function, although the second argument is
+optional, in which case the imaginary part is assumed to be zero.
+
+**WARNING:** The function, as opposed to ``numpy``, returns a 2-tuple,
+whose elements are two ``ndarray``\ s, holding the real and imaginary
+parts of the transform separately.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    x = np.linspace(0, 10, num=1024)
+    y = np.sin(x)
+    z = np.zeros(len(x))
+    
+    a, b = np.fft.fft(x)
+    print('real part:\t', a)
+    print('\nimaginary part:\t', b)
+    
+    c, d = np.fft.fft(x, z)
+    print('\nreal part:\t', c)
+    print('\nimaginary part:\t', d)
+
+.. parsed-literal::
+
+    real part:	 array([5119.996, -5.004663, -5.004798, ..., -5.005482, -5.005643, -5.006577], dtype=float)
+    
+    imaginary part:	 array([0.0, 1631.333, 815.659, ..., -543.764, -815.6588, -1631.333], dtype=float)
+    
+    real part:	 array([5119.996, -5.004663, -5.004798, ..., -5.005482, -5.005643, -5.006577], dtype=float)
+    
+    imaginary part:	 array([0.0, 1631.333, 815.659, ..., -543.764, -815.6588, -1631.333], dtype=float)
+    
+
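The calling convention described above (real and imaginary parts passed in separately, a 2-tuple of real arrays returned) can be illustrated without ``ulab``. The following is a minimal sketch in plain Python, assuming a naive O(N^2) discrete Fourier transform; ``dft_pair`` is a hypothetical helper name, not part of ``ulab`` or ``numpy``, and it does not reflect how ``ulab`` computes the transform internally.

```python
import math

def dft_pair(real, imag=None):
    """Naive DFT: takes the real and (optional) imaginary parts as lists and
    returns a 2-tuple (real part, imaginary part) of the transform."""
    n = len(real)
    if imag is None:
        imag = [0.0] * n  # a missing imaginary part is assumed to be zero
    out_re, out_im = [], []
    for k in range(n):
        re = im = 0.0
        for j in range(n):
            angle = -2.0 * math.pi * k * j / n
            c, s = math.cos(angle), math.sin(angle)
            # (real[j] + i*imag[j]) * (c + i*s)
            re += real[j] * c - imag[j] * s
            im += real[j] * s + imag[j] * c
        out_re.append(re)
        out_im.append(im)
    return out_re, out_im

a, b = dft_pair([1, 2, 3, 4, 1, 2, 3, 4])
# pair up the two parts to recover complex numbers, as numpy would show them
spectrum = [complex(round(re, 9), round(im, 9)) for re, im in zip(a, b)]
print(spectrum)
```

Zipping the two returned lists together, as in the last lines, is one way to compare such a result against the complex array printed by ``numpy``.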
+
+ifft
+----
+
+The above-mentioned rules apply to the inverse Fourier transform. The
+inverse is also normalised by ``N``, the number of elements, as is
+customary in ``numpy``. With the normalisation, we can ascertain that
+the inverse of the transform is equal to the original array.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    x = np.linspace(0, 10, num=1024)
+    y = np.sin(x)
+    
+    a, b = np.fft.fft(y)
+    
+    print('original vector:\t', y)
+    
+    y, z = np.fft.ifft(a, b)
+    # the real part should be equal to y
+    print('\nreal part of inverse:\t', y)
+    # the imaginary part should be equal to zero
+    print('\nimaginary part of inverse:\t', z)
+
+.. parsed-literal::
+
+    original vector:	 array([0.0, 0.009775016, 0.0195491, ..., -0.5275068, -0.5357859, -0.5440139], dtype=float)
+    
+    real part of inverse:	 array([-2.980232e-08, 0.0097754, 0.0195494, ..., -0.5275064, -0.5357857, -0.5440133], dtype=float)
+    
+    imaginary part of inverse:	 array([-2.980232e-08, -1.451171e-07, 3.693752e-08, ..., 6.44871e-08, 9.34986e-08, 2.18336e-07], dtype=float)
+    
+
+
+Note that unlike in ``numpy``, the length of the array on which the
+Fourier transform is carried out must be a power of 2. If this is not
+the case, the function raises a ``ValueError`` exception.
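
One common way of meeting the power-of-2 requirement is to zero-pad the input before the transform. The sketch below is plain Python and only prepares the data; ``is_power_of_two``, ``next_power_of_two`` and ``pad_to_power_of_two`` are hypothetical helper names, not ``ulab`` functions. Keep in mind that padding changes the length of the transform, and therefore the frequency spacing of the result.

```python
def is_power_of_two(n):
    """True if n is 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0

def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

def pad_to_power_of_two(values):
    """Return a copy of values, extended with zeros to a power-of-2 length."""
    values = list(values)
    target = next_power_of_two(len(values))
    return values + [0.0] * (target - len(values))

samples = [0.1 * i for i in range(1000)]
padded = pad_to_power_of_two(samples)
print(len(samples), is_power_of_two(len(samples)))
print(len(padded), is_power_of_two(len(padded)))
```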
+
+Computation and storage costs
+-----------------------------
+
+RAM
+~~~
+
+The FFT routine of ``ulab`` calculates the transform in place. This
+means that beyond reserving space for the two ``ndarray``\ s that will
+be returned (the computation uses these two as intermediate storage
+space), only a handful of temporary variables, all floats or 32-bit
+integers, are required.
+
+Speed of FFTs
+~~~~~~~~~~~~~
+
+A comment on the speed: a 1024-point transform implemented in python
+would cost around 90 ms, and 13 ms in assembly, if the code runs on the
+pyboard, v.1.1. You can gain a factor of four by moving to the D series; see
+https://github.com/peterhinch/micropython-fourier/blob/master/README.md#8-performance
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    x = np.linspace(0, 10, num=1024)
+    y = np.sin(x)
+    
+    @timeit
+    def np_fft(y):
+        return np.fft.fft(y)
+    
+    a, b = np_fft(y)
+
+.. parsed-literal::
+
+    execution time:  1985  us
+    
+
+
+The C implementation runs in less than 2 ms on the pyboard (we have just
+measured that), and has been reported to run in under 0.8 ms on the D
+series board. That is an improvement of at least a factor of four.
--- a/docs/manual/source/numpy-functions.rst
+++ b/docs/manual/source/numpy-functions.rst
--- a/docs/manual/source/numpy-linalg.rst
+++ b/docs/manual/source/numpy-linalg.rst
@ -0,0 +1,446 @@
+
+Linalg
+======
+
+Functions in the ``linalg`` module can be called by prepending them with
+``numpy.linalg.``. The module defines the following seven functions:
+
+1. `numpy.linalg.cholesky <#cholesky>`__
+2. `numpy.linalg.det <#det>`__
+3. `numpy.linalg.dot <#dot>`__
+4. `numpy.linalg.eig <#eig>`__
+5. `numpy.linalg.inv <#inv>`__
+6. `numpy.linalg.norm <#norm>`__
+7. `numpy.linalg.trace <#trace>`__
+
+cholesky
+--------
+
+``numpy``:
+https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.linalg.cholesky.html
+
+The ``cholesky`` function takes a positive definite, symmetric square
+matrix as its single argument, and returns the *square root matrix* in
+lower triangular form. If the input argument does not fulfill the
+positivity or symmetry condition, a ``ValueError`` is raised.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([[25, 15, -5], [15, 18,  0], [-5,  0, 11]])
+    print('a: ', a)
+    print('\n' + '='*20 + '\nCholesky decomposition\n', np.linalg.cholesky(a))
+
+.. parsed-literal::
+
+    a:  array([[25.0, 15.0, -5.0],
+    	 [15.0, 18.0, 0.0],
+    	 [-5.0, 0.0, 11.0]], dtype=float)
+    
+    ====================
+    Cholesky decomposition
+     array([[5.0, 0.0, 0.0],
+    	 [3.0, 3.0, 0.0],
+    	 [-1.0, 1.0, 3.0]], dtype=float)
+    
+    
+
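The meaning of *square root matrix* can be checked by hand: multiplying the lower triangular result by its own transpose should give back the original matrix. The sketch below does this in plain Python with the numbers printed above; ``matmul`` and ``transpose`` are hypothetical helpers written for this illustration only.

```python
def matmul(x, y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*x)]

a = [[25, 15, -5], [15, 18, 0], [-5, 0, 11]]
# the lower triangular factor, copied from the output above
lower = [[5.0, 0.0, 0.0], [3.0, 3.0, 0.0], [-1.0, 1.0, 3.0]]

product = matmul(lower, transpose(lower))
print(product)
print(product == a)
```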
+
+det
+---
+
+``numpy``:
+https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.det.html
+
+The ``det`` function takes a square matrix as its single argument, and
+calculates the determinant. The calculation is based on successive
+elimination of the matrix elements, and the return value is a float,
+even if the input array was of integer type.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([[1, 2], [3, 4]], dtype=np.uint8)
+    print(np.linalg.det(a))
+
+.. parsed-literal::
+
+    -2.0
+    
+
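To make "successive elimination" concrete, here is a minimal sketch of the general technique in plain Python: Gaussian elimination with partial pivoting, where the determinant is the product of the pivots, with a sign change for every row swap. This illustrates the idea only; it is not a transcription of the C code in ``ulab``, and ``det_by_elimination`` is a hypothetical name.

```python
def det_by_elimination(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in matrix]  # work on a float copy
    n = len(m)
    det = 1.0
    for col in range(n):
        # pick the row with the largest entry in this column as the pivot
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0  # the whole column is zero: the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # swapping two rows flips the sign
        det *= m[col][col]
        # eliminate the entries below the pivot
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det

print(det_by_elimination([[1, 2], [3, 4]]))
```

As in ``ulab``, the result is a float even for integer input; small rounding errors in the last digits are to be expected.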
+
+Benchmark
+~~~~~~~~~
+
+Since the routine for calculating the determinant is pretty much the
+same as for finding the `inverse of a matrix <#inv>`__, the execution
+times are similar:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    @timeit
+    def matrix_det(m):
+        return np.linalg.det(m)
+    
+    m = np.array([[1, 2, 3, 4, 5, 6, 7, 8], [0, 5, 6, 4, 5, 6, 4, 5], 
+                  [0, 0, 9, 7, 8, 9, 7, 8], [0, 0, 0, 10, 11, 12, 11, 12], 
+                 [0, 0, 0, 0, 4, 6, 7, 8], [0, 0, 0, 0, 0, 5, 6, 7], 
+                 [0, 0, 0, 0, 0, 0, 7, 6], [0, 0, 0, 0, 0, 0, 0, 2]])
+    
+    matrix_det(m)
+
+.. parsed-literal::
+
+    execution time:  294  us
+    
+
+
+dot
+---
+
+``numpy``:
+https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html
+
+**WARNING:** ``numpy`` applies upcasting rules for the multiplication of
+matrices, while ``ulab`` simply returns a float matrix.
+
+Once you can invert a matrix, you might want to know whether the
+inversion is correct. You can simply take the original matrix and its
+inverse, and multiply them by calling the ``dot`` function, which takes
+the two matrices as its arguments. If the matrix dimensions do not
+match, the function raises a ``ValueError``. The result of the
+multiplication is expected to be the unit matrix, which is demonstrated
+below.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    m = np.array([[1, 2, 3], [4, 5, 6], [7, 10, 9]], dtype=np.uint8)
+    n = np.linalg.inv(m)
+    print("m:\n", m)
+    print("\nm^-1:\n", n)
+    # this should be the unit matrix
+    print("\nm*m^-1:\n", np.linalg.dot(m, n))
+
+.. parsed-literal::
+
+    m:
+     array([[1, 2, 3],
+    	 [4, 5, 6],
+    	 [7, 10, 9]], dtype=uint8)
+    
+    m^-1:
+     array([[-1.25, 1.0, -0.25],
+    	 [0.5, -1.0, 0.5],
+    	 [0.4166667, 0.3333334, -0.25]], dtype=float)
+    
+    m*m^-1:
+     array([[1.0, 2.384186e-07, -1.490116e-07],
+    	 [-2.980232e-07, 1.000001, -4.172325e-07],
+    	 [-3.278255e-07, 1.311302e-06, 0.9999992]], dtype=float)
+    
+
+
+Note that for matrix multiplication you don’t necessarily need square
+matrices; it is enough if their dimensions are compatible (i.e., the
+left-hand-side matrix has as many columns as the right-hand-side
+matrix has rows):
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    m = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.uint8)
+    n = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=np.uint8)
+    print(m)
+    print(n)
+    print(np.linalg.dot(m, n))
+
+.. parsed-literal::
+
+    array([[1, 2, 3, 4],
+    	 [5, 6, 7, 8]], dtype=uint8)
+    array([[1, 2],
+    	 [3, 4],
+    	 [5, 6],
+    	 [7, 8]], dtype=uint8)
+    array([[50.0, 60.0],
+    	 [114.0, 140.0]], dtype=float)
+    
+    
+
+
+eig
+---
+
+``numpy``:
+https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eig.html
+
+The ``eig`` function calculates the eigenvalues and the eigenvectors of
+a real, symmetric square matrix. If the matrix is not symmetric, a
+``ValueError`` will be raised. The function takes a single argument, and
+returns a tuple with the eigenvalues, and eigenvectors. With the help of
+the eigenvectors, amongst other things, you can implement sophisticated
+stabilisation routines for robots.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([[1, 2, 1, 4], [2, 5, 3, 5], [1, 3, 6, 1], [4, 5, 1, 7]], dtype=np.uint8)
+    x, y = np.linalg.eig(a)
+    print('eigenvectors of a:\n', y)
+    print('\neigenvalues of a:\n', x)
+
+.. parsed-literal::
+
+    eigenvectors of a:
+     array([[0.8151560042509081, -0.4499411232970823, -0.1644660242574522, 0.3256141906686505],
+           [0.2211334179893007, 0.7846992598235538, 0.08372081379922657, 0.5730077734355189],
+           [-0.1340114162071679, -0.3100776411558949, 0.8742786816656, 0.3486109343758527],
+           [-0.5183258053659028, -0.292663481927148, -0.4489749870391468, 0.6664142156731531]], dtype=float)
+    
+    eigenvalues of a:
+     array([-1.165288365404889, 0.8029365530314914, 5.585625756072663, 13.77672605630074], dtype=float)
+    
+    
+
+
+The same matrix diagonalised with ``numpy`` yields:
+
+.. code::
+
+    # code to be run in CPython
+    
+    import numpy as np
+    
+    a = np.array([[1, 2, 1, 4], [2, 5, 3, 5], [1, 3, 6, 1], [4, 5, 1, 7]], dtype=np.uint8)
+    x, y = np.linalg.eig(a)
+    print('eigenvectors of a:\n', y)
+    print('\neigenvalues of a:\n', x)
+
+.. parsed-literal::
+
+    eigenvectors of a:
+     [[ 0.32561419  0.815156    0.44994112 -0.16446602]
+     [ 0.57300777  0.22113342 -0.78469926  0.08372081]
+     [ 0.34861093 -0.13401142  0.31007764  0.87427868]
+     [ 0.66641421 -0.51832581  0.29266348 -0.44897499]]
+    
+    eigenvalues of a:
+     [13.77672606 -1.16528837  0.80293655  5.58562576]
+
+
+When comparing results, we should keep two things in mind:
+
+1. the eigenvalues and eigenvectors are not necessarily sorted in the
+   same way
+2. an eigenvector can be multiplied by an arbitrary non-zero scalar, and
+   it is still an eigenvector with the same eigenvalue. This is why all
+   signs of the eigenvector belonging to 0.80 are flipped in ``ulab``
+   with respect to ``numpy``. This difference, however, is of
+   absolutely no consequence.
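
The second point is easy to verify numerically. The sketch below uses a small symmetric matrix chosen for this illustration (not the 4x4 matrix above) and checks in plain Python that both an eigenvector and its sign-flipped copy satisfy the eigenvalue equation; ``matvec`` and ``is_eigenpair`` are hypothetical helper names.

```python
import math

def matvec(m, v):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def is_eigenpair(m, value, vector, tol=1e-12):
    """True if m * vector equals value * vector within the tolerance."""
    mv = matvec(m, vector)
    return all(abs(x - value * y) < tol for x, y in zip(mv, vector))

m = [[2.0, 1.0], [1.0, 2.0]]               # small symmetric example matrix
v = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # normalised eigenvector, eigenvalue 3
flipped = [-c for c in v]                  # the same vector with all signs flipped

print(is_eigenpair(m, 3.0, v))
print(is_eigenpair(m, 3.0, flipped))
```

When comparing eigenvectors from two libraries, it is therefore safer to compare them up to an overall sign than to compare them entry by entry.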
+
+Computation expenses
+~~~~~~~~~~~~~~~~~~~~
+
+Since the function is based on `Givens
+rotations <https://en.wikipedia.org/wiki/Givens_rotation>`__ and runs
+till convergence is achieved, or till the maximum number of allowed
+rotations is exhausted, there is no universal estimate for the time
+required to find the eigenvalues. However, an order of magnitude can, at
+least, be guessed based on the measurement below:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    @timeit
+    def matrix_eig(a):
+        return np.linalg.eig(a)
+    
+    a = np.array([[1, 2, 1, 4], [2, 5, 3, 5], [1, 3, 6, 1], [4, 5, 1, 7]], dtype=np.uint8)
+    
+    matrix_eig(a)
+
+.. parsed-literal::
+
+    execution time:  111  us
+    
+
+
+inv
+---
+
+``numpy``:
+https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.linalg.inv.html
+
+A square matrix, provided that it is not singular, can be inverted by
+calling the ``inv`` function that takes a single argument. The inversion
+is based on successive elimination of elements in the lower left
+triangle, and raises a ``ValueError`` exception, if the matrix turns out
+to be singular (i.e., one of the diagonal entries is zero).
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    m = np.array([[1, 2, 3, 4], [4, 5, 6, 4], [7, 8.6, 9, 4], [3, 4, 5, 6]])
+    
+    print(np.linalg.inv(m))
+
+.. parsed-literal::
+
+    array([[-2.166666666666667, 1.500000000000001, -0.8333333333333337, 1.0],
+           [1.666666666666667, -3.333333333333335, 1.666666666666668, -0.0],
+           [0.1666666666666666, 2.166666666666668, -0.8333333333333337, -1.0],
+           [-0.1666666666666667, -0.3333333333333333, 0.0, 0.5]], dtype=float64)
+    
+    
+
+
+Computation expenses
+~~~~~~~~~~~~~~~~~~~~
+
+Note that inverting a matrix requires approximately twice as many
+floats (RAM) as the number of entries in the original matrix. Because
+the inversion is based on elimination, the number of operations grows
+roughly as the cube of the linear size of the matrix. Here are a
+couple of numbers:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    @timeit
+    def invert_matrix(m):
+        return np.linalg.inv(m)
+    
+    m = np.array([[1, 2,], [4, 5]])
+    print('2 by 2 matrix:')
+    invert_matrix(m)
+    
+    m = np.array([[1, 2, 3, 4], [4, 5, 6, 4], [7, 8.6, 9, 4], [3, 4, 5, 6]])
+    print('\n4 by 4 matrix:')
+    invert_matrix(m)
+    
+    m = np.array([[1, 2, 3, 4, 5, 6, 7, 8], [0, 5, 6, 4, 5, 6, 4, 5], 
+                  [0, 0, 9, 7, 8, 9, 7, 8], [0, 0, 0, 10, 11, 12, 11, 12], 
+                 [0, 0, 0, 0, 4, 6, 7, 8], [0, 0, 0, 0, 0, 5, 6, 7], 
+                 [0, 0, 0, 0, 0, 0, 7, 6], [0, 0, 0, 0, 0, 0, 0, 2]])
+    print('\n8 by 8 matrix:')
+    invert_matrix(m)
+
+.. parsed-literal::
+
+    2 by 2 matrix:
+    execution time:  65  us
+    
+    4 by 4 matrix:
+    execution time:  105  us
+    
+    8 by 8 matrix:
+    execution time:  299  us
+    
+
+
+The above-mentioned scaling is not obeyed strictly. The reason for the
+discrepancy is that the function call is still the same for all three
+cases: the input must be inspected, the output array must be created,
+and so on.
+
+norm
+----
+
+``numpy``:
+https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
+
+The function takes a vector or matrix without options, and returns its
+2-norm, i.e., the square root of the sum of the squares of the elements.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([1, 2, 3, 4, 5])
+    b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
+    
+    print('norm of a:', np.linalg.norm(a))
+    print('norm of b:', np.linalg.norm(b))
+
+.. parsed-literal::
+
+    norm of a: 7.416198487095663
+    norm of b: 16.88194301613414
+    
+    
+
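The definition quoted above (the square root of the sum of the squares of the elements) can be written out in a few lines of plain Python. This is a sketch of the formula only, with ``norm2`` a hypothetical helper name; for a matrix, the quantity defined this way is also known as the Frobenius norm.

```python
import math

def norm2(values):
    """Square root of the sum of squares of a list or a list of lists."""
    flat = []
    for item in values:
        if isinstance(item, (list, tuple)):
            flat.extend(item)   # a row of a matrix
        else:
            flat.append(item)   # an element of a vector
    return math.sqrt(sum(x * x for x in flat))

print(norm2([1, 2, 3, 4, 5]))                      # sqrt(55)
print(norm2([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))    # sqrt(285)
```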
+
+trace
+-----
+
+``numpy``:
+https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.linalg.trace.html
+
+The ``trace`` function returns the sum of the diagonal elements of a
+square matrix. If the input argument is not a square matrix, an
+exception will be raised.
+
+The scalar so returned will inherit the type of the input array, i.e.,
+integer arrays have integer trace, and floating point arrays a floating
+point trace.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([[25, 15, -5], [15, 18,  0], [-5,  0, 11]], dtype=np.int8)
+    print('a: ', a)
+    print('\ntrace of a: ', np.linalg.trace(a))
+    
+    b = np.array([[25, 15, -5], [15, 18,  0], [-5,  0, 11]], dtype=np.float)
+    
+    print('='*20 + '\nb: ', b)
+    print('\ntrace of b: ', np.linalg.trace(b))
+
+.. parsed-literal::
+
+    a:  array([[25, 15, -5],
+    	 [15, 18, 0],
+    	 [-5, 0, 11]], dtype=int8)
+    
+    trace of a:  54
+    ====================
+    b:  array([[25.0, 15.0, -5.0],
+    	 [15.0, 18.0, 0.0],
+    	 [-5.0, 0.0, 11.0]], dtype=float)
+    
+    trace of b:  54.0
+    
+    
+
--- a/docs/manual/source/numpy-universal.rst
+++ b/docs/manual/source/numpy-universal.rst
@ -0,0 +1,416 @@
+
+Universal functions
+===================
+
+Standard mathematical functions can be calculated on any scalar,
+scalar-valued iterable (ranges, lists, tuples containing numbers), and
+on ``ndarray``\ s without having to change the call signature. In all
+cases the functions return a new ``ndarray`` of typecode ``float``
+(since these functions usually generate float values, anyway). The
+functions execute faster with ``ndarray`` arguments than with iterables,
+because the values of the input vector can be extracted faster.
+
+At present, the following functions are supported:
+
+``acos``, ``acosh``, ``arctan2``, ``around``, ``asin``, ``asinh``,
+``atan``, ``atanh``, ``ceil``, ``cos``, ``degrees``, ``exp``,
+``expm1``, ``floor``, ``log``, ``log10``, ``log2``, ``radians``,
+``sin``, ``sinh``, ``sqrt``, ``tan``, ``tanh``.
+
+These functions are applied element-wise to the arguments, thus, e.g.,
+the exponential of a matrix cannot be calculated in this way.
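+
+To make the distinction concrete, the following plain ``python`` sketch
+compares the element-wise exponential with a matrix exponential on
+nested lists. The helper names and the truncated Taylor series are
+assumptions chosen for the illustration; neither is part of ``ulab``.
+
```python
import math

def elementwise_exp(matrix):
    # what a universal function does: exp() of every entry separately
    return [[math.exp(x) for x in row] for row in matrix]

def matmul(a, b):
    # plain matrix product of two square matrices
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_exp(matrix, terms=20):
    # truncated Taylor series I + A + A^2/2! + ... (illustration only)
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, matrix)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result

zero = [[0.0, 0.0], [0.0, 0.0]]
print(elementwise_exp(zero))   # every entry becomes exp(0)
print(matrix_exp(zero))        # the identity matrix
```
+
+For the zero matrix, the element-wise result consists of ones only,
+while the matrix exponential is the identity matrix.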
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = range(9)
+    b = np.array(a)
+    
+    # works with ranges, lists, tuples etc.
+    print('a:\t', a)
+    print('exp(a):\t', np.exp(a))
+    
+    # with 1D arrays
+    print('\n=============\nb:\n', b)
+    print('exp(b):\n', np.exp(b))
+    
+    # as well as with matrices
+    c = np.array(range(9)).reshape((3, 3))
+    print('\n=============\nc:\n', c)
+    print('exp(c):\n', np.exp(c))
+
+.. parsed-literal::
+
+    a:	 range(0, 9)
+    exp(a):	 array([1.0, 2.718281828459045, 7.38905609893065, 20.08553692318767, 54.59815003314424, 148.4131591025766, 403.4287934927351, 1096.633158428459, 2980.957987041728], dtype=float64)
+    
+    =============
+    b:
+     array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=float64)
+    exp(b):
+     array([1.0, 2.718281828459045, 7.38905609893065, 20.08553692318767, 54.59815003314424, 148.4131591025766, 403.4287934927351, 1096.633158428459, 2980.957987041728], dtype=float64)
+    
+    =============
+    c:
+     array([[0.0, 1.0, 2.0],
+           [3.0, 4.0, 5.0],
+           [6.0, 7.0, 8.0]], dtype=float64)
+    exp(c):
+     array([[1.0, 2.718281828459045, 7.38905609893065],
+           [20.08553692318767, 54.59815003314424, 148.4131591025766],
+           [403.4287934927351, 1096.633158428459, 2980.957987041728]], dtype=float64)
+    
+    
+
+
+Computation expenses
+--------------------
+
+The overhead for calculating with micropython iterables is quite
+significant: for the 1000 samples below, the difference is more than 800
+microseconds, because internally the function has to create the
+``ndarray`` for the output, has to fetch the iterable’s items of unknown
+type, and then convert them to floats. All these steps are skipped for
+``ndarray``\ s, because these pieces of information are already known.
+
+Doing the same with a ``list`` comprehension takes roughly 25 times
+longer than with the ``ndarray``, and the gap would grow further if we
+converted the resulting list to an ``ndarray``.
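+
+The ``timeit`` decorator used in the benchmarks is not defined in this
+section. A minimal sketch of such a decorator is given here, assuming
+that it prints the elapsed time and passes the result of the wrapped
+function on. The sketch relies on the standard ``time`` module, so that
+it also runs on a PC; on a microcontroller, ``utime.ticks_us()`` and
+``utime.ticks_diff()`` would be the natural replacement.
+
```python
import time

def timeit(f):
    """Print the execution time of f in microseconds and pass on its result."""
    def new_func(*args, **kwargs):
        start = time.perf_counter()
        result = f(*args, **kwargs)
        elapsed_us = int((time.perf_counter() - start) * 1e6)
        print('execution time: ', elapsed_us, ' us')
        return result
    return new_func

@timeit
def timed_sum(iterable):
    return sum(iterable)

total = timed_sum(range(1000))
```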
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    import math
+    
+    a = [0]*1000
+    b = np.array(a)
+    
+    @timeit
+    def timed_vector(iterable):
+        return np.exp(iterable)
+    
+    @timeit
+    def timed_list(iterable):
+        return [math.exp(i) for i in iterable]
+    
+    print('iterating over ndarray in ulab')
+    timed_vector(b)
+    
+    print('\niterating over list in ulab')
+    timed_vector(a)
+    
+    print('\niterating over list in python')
+    timed_list(a)
+
+.. parsed-literal::
+
+    iterating over ndarray in ulab
+    execution time:  441  us
+    
+    iterating over list in ulab
+    execution time:  1266  us
+    
+    iterating over list in python
+    execution time:  11379  us
+    
+
+
+arctan2
+-------
+
+``numpy``:
+https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.arctan2.html
+
+The two-argument inverse tangent function is also part of the ``vector``
+sub-module. The function implements broadcasting as discussed in the
+section on ``ndarray``\ s. Scalars (``micropython`` integers or floats)
+are also allowed.
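+
+The treatment of scalar arguments can be pictured with the following
+plain ``python`` sketch for flat sequences. The helper is hypothetical
+and covers only the scalar-versus-sequence case, not the general
+broadcasting rules of ``ndarray``\ s.
+
```python
import math

def arctan2(y, x):
    """Element-wise atan2; either argument may be a scalar or a flat sequence."""
    y_is_scalar = isinstance(y, (int, float))
    x_is_scalar = isinstance(x, (int, float))
    if y_is_scalar and x_is_scalar:
        return [math.atan2(y, x)]
    if y_is_scalar:
        x = list(x)
        y = [y] * len(x)      # stretch the scalar to the length of x
    elif x_is_scalar:
        y = list(y)
        x = [x] * len(y)      # stretch the scalar to the length of y
    else:
        x, y = list(x), list(y)
        if len(x) != len(y):
            raise ValueError('operands could not be broadcast together')
    return [math.atan2(yi, xi) for yi, xi in zip(y, x)]

a = [1, 2.2, 33.33, 444.444]
print(arctan2(a, 1.0))
print(arctan2(1.0, a))
print(arctan2(a, a))
```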
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([1, 2.2, 33.33, 444.444])
+    print('a:\n', a)
+    print('\narctan2(a, 1.0)\n', np.arctan2(a, 1.0))
+    print('\narctan2(1.0, a)\n', np.arctan2(1.0, a))
+    print('\narctan2(a, a): \n', np.arctan2(a, a))
+
+.. parsed-literal::
+
+    a:
+     array([1.0, 2.2, 33.33, 444.444], dtype=float64)
+    
+    arctan2(a, 1.0)
+     array([0.7853981633974483, 1.14416883366802, 1.5408023243361, 1.568546328341769], dtype=float64)
+    
+    arctan2(1.0, a)
+     array([0.7853981633974483, 0.426627493126876, 0.02999400245879636, 0.002249998453127392], dtype=float64)
+    
+    arctan2(a, a): 
+     array([0.7853981633974483, 0.7853981633974483, 0.7853981633974483, 0.7853981633974483], dtype=float64)
+    
+    
+
+
+around
+------
+
+``numpy``:
+https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.around.html
+
+``numpy``\ ’s ``around`` function can also be found in the ``vector``
+sub-module. The function implements the ``decimals`` keyword argument
+with default value ``0``. The first argument must be an ``ndarray``. If
+this is not the case, the function raises a ``TypeError`` exception.
+Note that ``numpy`` accepts general iterables. The ``out`` keyword
+argument known from ``numpy`` is not accepted. The function always
+returns an ndarray of type ``mp_float_t``.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([1, 2.2, 33.33, 444.444])
+    print('a:\t\t', a)
+    print('\ndecimals = 0\t', np.around(a, decimals=0))
+    print('\ndecimals = 1\t', np.around(a, decimals=1))
+    print('\ndecimals = -1\t', np.around(a, decimals=-1))
+
+.. parsed-literal::
+
+    a:		 array([1.0, 2.2, 33.33, 444.444], dtype=float64)
+    
+    decimals = 0	 array([1.0, 2.0, 33.0, 444.0], dtype=float64)
+    
+    decimals = 1	 array([1.0, 2.2, 33.3, 444.4], dtype=float64)
+    
+    decimals = -1	 array([0.0, 0.0, 30.0, 440.0], dtype=float64)
+    
+    
+
+
+Vectorising generic python functions
+------------------------------------
+
+``numpy``:
+https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html
+
+The examples above use factory functions. In fact, they are nothing but
+the vectorised versions of the standard mathematical functions.
+User-defined ``python`` functions can also be vectorised with the help of
+``vectorize``. This function takes a positional argument, namely, the
+``python`` function that you want to vectorise, and a non-mandatory
+keyword argument, ``otypes``, which determines the ``dtype`` of the
+output array. The ``otypes`` must be ``None`` (default), or any of the
+``dtypes`` defined in ``ulab``. With ``None``, the output is
+automatically turned into a float array.
+
+The return value of ``vectorize`` is a ``micropython`` object that can
+be called as a standard function, but which now accepts either a scalar,
+an ``ndarray``, or a generic ``micropython`` iterable as its sole
+argument. Note that the function that is to be vectorised must have a
+single argument.
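+
+The calling convention can be mimicked in a few lines of plain
+``python``. The sketch is an illustration under the assumption of float
+output (the ``otypes=None`` case); it returns a ``list`` instead of an
+``ndarray``, and it is not the C implementation found in ``ulab``.
+
```python
def vectorize(f):
    """Return a wrapper that applies f to a scalar or to every item of an iterable."""
    def vectorized(arg):
        if isinstance(arg, (int, float)):
            arg = [arg]        # a scalar is treated as a one-element sequence
        # the output is always converted to float, as with otypes=None
        return [float(f(item)) for item in arg]
    return vectorized

def f(x):
    return x*x

vf = vectorize(f)
print(vf(44.0))
print(vf([2, 3, 4]))
print(vf(range(3)))
```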
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    def f(x):
+        return x*x
+    
+    vf = np.vectorize(f)
+    
+    # calling with a scalar
+    print('{:20}'.format('f on a scalar: '), vf(44.0))
+    
+    # calling with an ndarray
+    a = np.array([1, 2, 3, 4])
+    print('{:20}'.format('f on an ndarray: '), vf(a))
+    
+    # calling with a list
+    print('{:20}'.format('f on a list: '), vf([2, 3, 4]))
+
+.. parsed-literal::
+
+    f on a scalar:       array([1936.0], dtype=float64)
+    f on an ndarray:     array([1.0, 4.0, 9.0, 16.0], dtype=float64)
+    f on a list:         array([4.0, 9.0, 16.0], dtype=float64)
+    
+    
+
+
+As mentioned, the ``dtype`` of the resulting ``ndarray`` can be
+specified via the ``otypes`` keyword. The value is bound to the function
+object that ``vectorize`` returns, therefore, if the same function is to
+be vectorised with different output types, then for each type a new
+function object must be created.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    l = [1, 2, 3, 4]
+    def f(x):
+        return x*x
+    
+    vf1 = np.vectorize(f, otypes=np.uint8)
+    vf2 = np.vectorize(f, otypes=np.float)
+    
+    print('{:20}'.format('output is uint8: '), vf1(l))
+    print('{:20}'.format('output is float: '), vf2(l))
+
+.. parsed-literal::
+
+    output is uint8:     array([1, 4, 9, 16], dtype=uint8)
+    output is float:     array([1.0, 4.0, 9.0, 16.0], dtype=float64)
+    
+    
+
+
+The ``otypes`` keyword argument cannot be used for type coercion: if the
+function evaluates to a float, but ``otypes`` would dictate an integer
+type, an exception will be raised:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    int_list = [1, 2, 3, 4]
+    float_list = [1.0, 2.0, 3.0, 4.0]
+    def f(x):
+        return x*x
+    
+    vf = np.vectorize(f, otypes=np.uint8)
+    
+    print('{:20}'.format('integer list: '), vf(int_list))
+    # this will raise a TypeError exception
+    print(vf(float_list))
+
+.. parsed-literal::
+
+    integer list:        array([1, 4, 9, 16], dtype=uint8)
+    
+    Traceback (most recent call last):
+      File "/dev/shm/micropython.py", line 14, in <module>
+    TypeError: can't convert float to int
+    
+
+
+Benchmarks
+~~~~~~~~~~
+
+It should be pointed out that the ``vectorize`` function produces the
+pseudo-vectorised version of the ``python`` function that is fed into
+it, i.e., on the C level, the same ``python`` function is called, with
+the all-encompassing ``mp_obj_t`` type arguments, and all that happens
+is that the ``for`` loop in ``[f(i) for i in iterable]`` runs purely in
+C. Since type checking and type conversion in ``f()`` are expensive, the
+speed-up is not as spectacular as when iterating over an ``ndarray``
+with a factory function: in the measurements below, the vectorised
+function is approximately 30% faster than a ``list`` comprehension
+returning a native ``python`` type (a ``list``), and roughly 40% faster
+if the conversion of that list to an ``ndarray`` is also counted.
+
+The following code snippet calculates the squares of 1000 numbers with
+the vectorised function (which returns an ``ndarray``), with ``list``
+comprehension, and with ``list`` comprehension followed by conversion to
+an ``ndarray``. For comparison, the execution time is also measured for
+the case when the squares are calculated entirely in ``ulab``.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    def f(x):
+        return x*x
+    
+    vf = np.vectorize(f)
+    
+    @timeit
+    def timed_vectorised_square(iterable):
+        return vf(iterable)
+    
+    @timeit
+    def timed_python_square(iterable):
+        return [f(i) for i in iterable]
+    
+    @timeit
+    def timed_ndarray_square(iterable):
+        return np.array([f(i) for i in iterable])
+    
+    @timeit
+    def timed_ulab_square(ndarray):
+        return ndarray**2
+    
+    print('vectorised function')
+    squares = timed_vectorised_square(range(1000))
+    
+    print('\nlist comprehension')
+    squares = timed_python_square(range(1000))
+    
+    print('\nlist comprehension + ndarray conversion')
+    squares = timed_ndarray_square(range(1000))
+    
+    print('\nsquaring an ndarray entirely in ulab')
+    a = np.array(range(1000))
+    squares = timed_ulab_square(a)
+
+.. parsed-literal::
+
+    vectorised function
+    execution time:  7237  us
+    
+    list comprehension
+    execution time:  10248  us
+    
+    list comprehension + ndarray conversion
+    execution time:  12562  us
+    
+    squaring an ndarray entirely in ulab
+    execution time:  560  us
+    
+
+
+From the comparisons above, it is obvious that ``python`` functions
+should only be vectorised when the same effect cannot be achieved in
+``ulab`` alone. However, although the time savings are not significant,
+there is still a good reason for caring about vectorised functions.
+Namely, user-defined ``python`` functions become universal, i.e., they
+can accept generic iterables as well as ``ndarray``\ s as their
+arguments. A vectorised function is still a one-liner, resulting in
+transparent and elegant code.
+
+A final comment on this subject: the ``f(x)`` that we defined is a
+*generic* ``python`` function. This means that it is not required that
+it just crunches some numbers. It has to return a number object, but it
+can still access the hardware in the meantime. So, e.g.,
+
+.. code:: python
+
+   import pyb
+
+   led = pyb.LED(2)
+
+   def f(x):
+       if x < 100:
+           led.toggle()
+       return x*x
+
+is perfectly valid code.
--- a/docs/manual/source/scipy-optimize.rst
+++ b/docs/manual/source/scipy-optimize.rst
@ -0,0 +1,173 @@
+
+Optimize
+========
+
+Functions in the ``optimize`` module can be called by prefixing them with
+``scipy.optimize.``. The module defines the following three functions:
+
+1. `scipy.optimize.bisect <#bisect>`__
+2. `scipy.optimize.fmin <#fmin>`__
+3. `scipy.optimize.newton <#newton>`__
+
+Note that routines that work with user-defined functions still have to
+call the underlying ``python`` code, and therefore, gains in speed are
+not as significant as with other vectorised operations. As a rule of
+thumb, a factor of two can be expected, when compared to an optimised
+``python`` implementation.
+
+bisect
+------
+
+``scipy``:
+https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.bisect.html
+
+``bisect`` finds the root of a function of one variable using a simple
+bisection routine. It takes three positional arguments: the function
+itself, and two starting points. The function must have opposite signs
+at the starting points. The return value is the position of the root.
+
+Two keyword arguments, ``xtol``, and ``maxiter`` can be supplied to
+control the accuracy, and the number of bisections, respectively.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import scipy as spy
+        
+    def f(x):
+        return x*x - 1
+    
+    print(spy.optimize.bisect(f, 0, 4))
+    
+    print('only 8 bisections: ',  spy.optimize.bisect(f, 0, 4, maxiter=8))
+    
+    print('with 0.1 accuracy: ',  spy.optimize.bisect(f, 0, 4, xtol=0.1))
+
+.. parsed-literal::
+
+    0.9999997615814209
+    only 8 bisections:  0.984375
+    with 0.1 accuracy:  0.9375
+    
+    
+
+
+Performance
+~~~~~~~~~~~
+
+Since the ``bisect`` routine calls user-defined ``python`` functions,
+the speed gain is only about a factor of two, if compared to a purely
+``python`` implementation.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import scipy as spy
+    
+    def f(x):
+        return (x-1)*(x-1) - 2.0
+    
+    def bisect(f, a, b, xtol=2.4e-7, maxiter=100):
+        if f(a) * f(b) > 0:
+            raise ValueError
+    
+        rtb = a if f(a) < 0.0 else b
+        dx = b - a if f(a) < 0.0 else a - b
+        for i in range(maxiter):
+            dx *= 0.5
+            x_mid = rtb + dx
+            mid_value = f(x_mid)
+            if mid_value < 0:
+                rtb = x_mid
+            if abs(dx) < xtol:
+                break
+    
+        return rtb
+    
+    @timeit
+    def bisect_scipy(f, a, b):
+        return spy.optimize.bisect(f, a, b)
+    
+    @timeit
+    def bisect_timed(f, a, b):
+        return bisect(f, a, b)
+    
+    print('bisect running in python')
+    bisect_timed(f, 3, 2)
+    
+    print('bisect running in C')
+    bisect_scipy(f, 3, 2)
+
+.. parsed-literal::
+
+    bisect running in python
+    execution time:  1270  us
+    bisect running in C
+    execution time:  642  us
+    
+
+
+fmin
+----
+
+``scipy``:
+https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin.html
+
+The ``fmin`` function finds the position of the minimum of a
+user-defined function by using the downhill simplex method. It requires
+two positional arguments, the function, and the initial value. Three
+keyword arguments, ``xatol``, ``fatol``, and ``maxiter`` stipulate
+conditions for stopping.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import scipy as spy
+    
+    def f(x):
+        return (x-1)**2 - 1
+    
+    print(spy.optimize.fmin(f, 3.0))
+    print(spy.optimize.fmin(f, 3.0, xatol=0.1))
+
+.. parsed-literal::
+
+    0.9996093749999952
+    1.199999999999996
+    
+    
+
+
+newton
+------
+
+``scipy``:
+https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.newton.html
+
+``newton`` finds a zero of a real, user-defined function using the
+Newton-Raphson (or secant or Halley’s) method. The routine requires two
+positional arguments, the function, and the initial value. Three keyword
+arguments can be supplied to control the iteration. These are the
+absolute and relative tolerances ``tol``, and ``rtol``, respectively,
+and the number of iterations before stopping, ``maxiter``. The function
+returns a single scalar, the position of the root.
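+
+Since only the function itself is passed to the routine, without its
+derivative, the iteration can be pictured as the secant variant. The
+following plain ``python`` sketch shows that variant. The helper name,
+the choice of the second starting point, and the exact form of the
+stopping criterion are assumptions made for the illustration, and may
+differ from what ``ulab`` does internally.
+
```python
def newton_secant(f, x0, tol=1.48e-8, rtol=0.0, maxiter=50):
    """Find a zero of f with the secant method, starting from x0."""
    x_prev = x0
    # assumption: the second point is a small perturbation of the first
    x_curr = x0 * (1 + 1e-4) + (1e-4 if x0 >= 0 else -1e-4)
    f_prev, f_curr = f(x_prev), f(x_curr)
    for _ in range(maxiter):
        if f_curr == f_prev:
            break                      # flat secant line, cannot proceed
        x_next = x_curr - f_curr * (x_curr - x_prev) / (f_curr - f_prev)
        if abs(x_next - x_curr) <= tol + rtol * abs(x_next):
            return x_next              # step is within the tolerances
        x_prev, f_prev = x_curr, f_curr
        x_curr, f_curr = x_next, f(x_next)
    return x_curr

def f(x):
    return x*x*x - 2.0

print(newton_secant(f, 3.0))
```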
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import scipy as spy
+        
+    def f(x):
+        return x*x*x - 2.0
+    
+    print(spy.optimize.newton(f, 3., tol=0.001, rtol=0.01))
+
+.. parsed-literal::
+
+    1.260135727246117
+    
+    
+
--- a/docs/manual/source/scipy-signal.rst
+++ b/docs/manual/source/scipy-signal.rst
@ -0,0 +1,135 @@
+
+Signal
+======
+
+Functions in the ``signal`` module can be called by prefixing them with
+``scipy.signal.``. The module defines the following two functions:
+
+1. `scipy.signal.sosfilt <#sosfilt>`__
+2. `scipy.signal.spectrogram <#spectrogram>`__
+
+sosfilt
+-------
+
+``scipy``:
+https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.sosfilt.html
+
+Filter data along one dimension using cascaded second-order sections.
+
+The function takes two positional arguments, ``sos``, the filter
+sections of length 6, and the one-dimensional, uniformly sampled data
+set to be filtered. It returns the filtered data, or the filtered data
+and the final filter delays, if the ``zi`` keyword argument is supplied.
+The keyword argument must be a float ``ndarray`` of shape
+``(n_sections, 2)``. If ``zi`` is not passed to the function, the
+initial values are assumed to be 0.
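+
+What the cascade does to the data can be followed in the plain
+``python`` sketch below, which works on lists instead of ``ndarray``\ s.
+It assumes that each section is given as ``[b0, b1, b2, a0, a1, a2]``
+with ``a0 == 1``, and it uses the transposed direct form II recursion.
+The function is an illustration only, not the ``ulab`` implementation.
+
```python
def sosfilt(sos, x, zi=None):
    """Filter x through cascaded second-order sections, one section at a time."""
    y = [float(value) for value in x]
    # one pair of delay values per section; all zeros unless zi is given
    if zi is None:
        state = [[0.0, 0.0] for _ in sos]
    else:
        state = [[float(z0), float(z1)] for z0, z1 in zi]
    for (b0, b1, b2, a0, a1, a2), z in zip(sos, state):
        # assumption: the section is normalised, i.e., a0 == 1
        for n, x_n in enumerate(y):
            y[n] = b0 * x_n + z[0]
            z[0] = b1 * x_n - a1 * y[n] + z[1]
            z[1] = b2 * x_n - a2 * y[n]
    # the final delays are returned only if initial delays were supplied
    return y if zi is None else (y, state)

sos = [[1, 2, 3, 1, 5, 6], [1, 2, 3, 1, 5, 6]]
print(sosfilt(sos, range(10)))

y, zf = sosfilt(sos, range(10), zi=[[1, 2], [3, 4]])
print(y)
print(zf)
```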
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    from ulab import scipy as spy
+    
+    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
+    sos = [[1, 2, 3, 1, 5, 6], [1, 2, 3, 1, 5, 6]]
+    y = spy.signal.sosfilt(sos, x)
+    print('y: ', y)
+
+.. parsed-literal::
+
+    y:  array([0.0, 1.0, -4.0, 24.0, -104.0, 440.0, -1728.0, 6532.000000000001, -23848.0, 84864.0], dtype=float)
+    
+    
+
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    from ulab import scipy as spy
+    
+    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
+    sos = [[1, 2, 3, 1, 5, 6], [1, 2, 3, 1, 5, 6]]
+    # initial conditions of the filter
+    zi = np.array([[1, 2], [3, 4]])
+    
+    y, zf = spy.signal.sosfilt(sos, x, zi=zi)
+    print('y: ', y)
+    print('\n' + '='*40 + '\nzf: ', zf)
+
+.. parsed-literal::
+
+    y:  array([4.0, -16.0, 63.00000000000001, -227.0, 802.9999999999999, -2751.0, 9271.000000000001, -30775.0, 101067.0, -328991.0000000001], dtype=float)
+    
+    ========================================
+    zf:  array([[37242.0, 74835.0],
+    	 [1026187.0, 1936542.0]], dtype=float)
+    
+    
+
+
+spectrogram
+-----------
+
+In addition to the Fourier transform and its inverse, ``ulab`` also
+sports a function called ``spectrogram``, which returns the absolute
+value of the Fourier transform. This could be used to find the dominant
+spectral component in a time series. The arguments are treated in the
+same way as in ``fft``, and ``ifft``.
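+
+For short signals, the meaning of the returned values can be checked
+against a naive discrete Fourier transform written in plain ``python``.
+The sketch below is hypothetical and of quadratic complexity, so it
+serves only as a reference for the absolute value of the transform, and
+not as a replacement for the fast routine.
+
```python
import cmath

def spectrogram(signal):
    """Absolute value of the discrete Fourier transform (naive O(N^2) version)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coefficient = sum(value * cmath.exp(-2j * cmath.pi * k * m / n)
                          for m, value in enumerate(signal))
        spectrum.append(abs(coefficient))
    return spectrum

# a constant signal has a single component at zero frequency
print(spectrogram([1.0, 1.0, 1.0, 1.0]))
# one period of a sampled sine wave shows up in bins 1 and 3
print(spectrogram([0.0, 1.0, 0.0, -1.0]))
```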
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    from ulab import scipy as spy
+    
+    x = np.linspace(0, 10, num=1024)
+    y = np.sin(x)
+    
+    a = spy.signal.spectrogram(y)
+    
+    print('original vector:\t', y)
+    print('\nspectrum:\t', a)
+
+.. parsed-literal::
+
+    original vector:	 array([0.0, 0.009775015390171337, 0.01954909674625918, ..., -0.5275140569487312, -0.5357931822978732, -0.5440211108893639], dtype=float64)
+    
+    spectrum:	 array([187.8635087634579, 315.3112063607119, 347.8814873399374, ..., 84.45888934298905, 347.8814873399374, 315.3112063607118], dtype=float64)
+    
+    
+
+
+As such, ``spectrogram`` is really just a shorthand for
+``np.sqrt(a*a + b*b)``:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    from ulab import scipy as spy
+    
+    x = np.linspace(0, 10, num=1024)
+    y = np.sin(x)
+    
+    a, b = np.fft.fft(y)
+    
+    print('\nspectrum calculated the hard way:\t', np.sqrt(a*a + b*b))
+    
+    a = spy.signal.spectrogram(y)
+    
+    print('\nspectrum calculated the lazy way:\t', a)
+
+.. parsed-literal::
+
+    
+    spectrum calculated the hard way:	 array([187.8635087634579, 315.3112063607119, 347.8814873399374, ..., 84.45888934298905, 347.8814873399374, 315.3112063607118], dtype=float64)
+    
+    spectrum calculated the lazy way:	 array([187.8635087634579, 315.3112063607119, 347.8814873399374, ..., 84.45888934298905, 347.8814873399374, 315.3112063607118], dtype=float64)
+    
+    
+
--- a/docs/manual/source/scipy-special.rst
+++ b/docs/manual/source/scipy-special.rst
@ -0,0 +1,44 @@
+
+Special functions
+=================
+
+``scipy``\ ’s ``special`` module defines several functions that behave
+as do the standard mathematical functions of ``numpy``, i.e., they
+can be called on any scalar, scalar-valued iterable (ranges, lists,
+tuples containing numbers), and on ``ndarray``\ s without having to
+change the call signature. In all cases the functions return a new
+``ndarray`` of typecode ``float`` (since these functions usually
+generate float values, anyway).
+
+At present, ``ulab``\ ’s ``special`` module contains the following
+functions:
+
+``erf``, ``erfc``, ``gamma``, and ``gammaln``, and they can be called by
+prefixing them with ``scipy.special.``.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    from ulab import scipy as spy
+    
+    a = range(9)
+    b = np.array(a)
+    
+    print('a: ', a)
+    print(spy.special.erf(a))
+    
+    print('\nb: ', b)
+    print(spy.special.erfc(b))
+
+.. parsed-literal::
+
+    a:  range(0, 9)
+    array([0.0, 0.8427007929497149, 0.9953222650189527, 0.9999779095030014, 0.9999999845827421, 1.0, 1.0, 1.0, 1.0], dtype=float64)
+    
+    b:  array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=float64)
+    array([1.0, 0.1572992070502851, 0.004677734981047265, 2.209049699858544e-05, 1.541725790028002e-08, 1.537459794428035e-12, 2.151973671249892e-17, 4.183825607779414e-23, 1.122429717298293e-29], dtype=float64)
+    
+    
+
--- a/docs/manual/source/ulab-intro.rst
+++ b/docs/manual/source/ulab-intro.rst
@ -0,0 +1,589 @@
+
+Introduction
+============
+
+Enter ulab
+----------
+
+``ulab`` is a ``numpy``-like module for ``micropython`` and its
+derivatives, meant to simplify and speed up common mathematical
+operations on arrays. ``ulab`` implements a small subset of ``numpy``
+and ``scipy``. The functions were chosen such that they might be useful
+in the context of a microcontroller. However, the project is a living
+one, and suggestions for new features are always welcome.
+
+This document discusses how you can use the library, starting from
+building your own firmware, through questions like what affects the
+firmware size, what are the trade-offs, and what are the most important
+differences to ``numpy`` and ``scipy``, respectively. The document is
+organised as follows:
+
+The chapter after this one helps you with firmware customisation.
+
+The third chapter gives a very concise summary of the ``ulab`` functions
+and array methods. This chapter can be used as a quick reference.
+
+The chapters after that are an in-depth review of most functions. Here
+you can find usage examples, benchmarks, as well as a thorough
+discussion of such concepts as broadcasting, and views versus copies.
+
+The final chapter of this book can be regarded as the programming
+manual. The inner working of ``ulab`` is dissected here, and you will
+also find hints as to how to implement your own ``numpy``-compatible
+functions.
+
+Purpose
+-------
+
+Of course, the first question that one has to answer is, why on Earth
+one would need a fast math library on a microcontroller. After all, it
+is not expected that heavy number crunching is going to take place on
+bare metal. It is not meant to. On a PC, the main reason for writing
+fast code is the sheer amount of data that one wants to process. On a
+microcontroller, the data volume is probably small, but it might lead to
+catastrophic system failure, if these data are not processed in time,
+because the microcontroller is supposed to interact with the outside
+world in a timely fashion. In fact, this latter objective was the
+initiator of this project: I needed the Fourier transform of a signal
+coming from the ADC of the ``pyboard``, and all available options were
+simply too slow.
+
+In addition to speed, another issue that one has to keep in mind when
+working with embedded systems is the amount of available RAM: I believe,
+everything here could be implemented in pure ``python`` with relatively
+little effort (in fact, there are a couple of ``python``-only
+implementations of ``numpy`` functions out there), but the price we
+would have to pay for that is not only speed, but RAM, too. ``python``
+code, if it is not frozen and compiled into the firmware, has to be
+compiled at runtime, which is not exactly a cheap process. On top of
+that, if numbers are stored in a list or tuple, which would be the
+high-level container, then they occupy 8 bytes, no matter, whether they
+are all smaller than 100, or larger than one hundred million. This is
+obviously a waste of resources in an environment, where resources are
+scarce.
+
+Finally, there is a reason for using ``micropython`` in the first place.
+Namely, that a microcontroller can be programmed in a very elegant, and
+*pythonic* way. But if it is so, why should we not extend this idea to
+other tasks and concepts that might come up in this context? If there
+was no other reason than this *elegance*, I would find that convincing
+enough.
+
+Based on the above-mentioned considerations, all functions in ``ulab``
+are implemented in a way that
+
+1. conforms to ``numpy`` as much as possible,
+2. is as frugal with RAM as possible,
+3. and yet, fast. Much faster than pure python. Think of speed-ups by a
+   factor of 30-50!
+
+The main points of ``ulab`` are
+
+-  compact, iterable and sliceable containers of numerical data in one to
+   four dimensions. These containers support all the relevant unary and
+   binary operators (e.g., ``len``, ==, +, \*, etc.)
+-  vectorised computations on ``micropython`` iterables and numerical
+   arrays (in ``numpy``-speak, universal functions)
+-  computing statistical properties (mean, standard deviation etc.) on
+   arrays
+-  basic linear algebra routines (matrix inversion, multiplication,
+   reshaping, transposition, determinant, and eigenvalues, Cholesky
+   decomposition and so on)
+-  polynomial fits to numerical data, and evaluation of polynomials
+-  fast Fourier transforms
+-  filtering of data (convolution and second-order filters)
+-  function minimisation, fitting, and numerical approximation routines
+
+``ulab`` implements close to a hundred functions and array methods. At
+the time of writing this manual (for version 2.1.0), the library adds
+approximately 120 kB of extra compiled code to the ``micropython``
+(pyboard.v.11) firmware. However, if you are short of flash space, you
+can easily shave tens of kB off the firmware. In fact, if only a small
+subset of functions is needed, you can get away with less than 10 kB
+of flash space. See the section on `customising
+ulab <#Customising-the-firmware>`__.
+
+Resources and legal matters
+---------------------------
+
+The source code of the module can be found under
+https://github.com/v923z/micropython-ulab/tree/master/code, while the
+source of this user manual is under
+https://github.com/v923z/micropython-ulab/tree/master/docs.
+
+The MIT licence applies to all material.
+
+Friendly request
+----------------
+
+If you use ``ulab``, and bump into a bug, or think that a particular
+function is missing, or its behaviour does not conform to ``numpy``,
+please, raise a `ulab
+issue <https://github.com/v923z/micropython-ulab/issues>`__ on github,
+so that the community can profit from your experiences.
+
+Even better, if you find the project to be useful, and think that it
+could be made better, faster, tighter, and shinier, please, consider
+contributing, and issue a pull request with the implementation of your
+improvements and new features. ``ulab`` can only become successful, if
+it offers what the community needs.
+
+These last comments apply to the documentation, too. If, in your
+opinion, the documentation is obscure, misleading, or not detailed
+enough, please, let us know, so that *we* can fix it.
+
+Differences between micropython-ulab and circuitpython-ulab
+-----------------------------------------------------------
+
+``ulab`` was originally developed for ``micropython``, but has
+since been integrated into a number of its flavours. Most of these
+flavours are simply forks of ``micropython`` itself, with some
+additional functionality. One of the notable exceptions is
+``circuitpython``, which has slightly diverged at the core level, and
+this has some minor consequences. Some of these concern the C
+implementation details only, which all have been sorted out with the
+generous and enthusiastic support of Jeff Epler from `Adafruit
+Industries <http://www.adafruit.com>`__.
+
+There are, however, a couple of instances, where the two environments
+differ at the python level in how the class properties can be accessed.
+We will point out the differences and possible workarounds at the
+relevant places in this document.
+
+Customising the firmware
+========================
+
+As mentioned above, ``ulab`` has considerably grown since its
+conception, which also means that it might no longer fit on the
+microcontroller of your choice. There are, however, a couple of ways of
+customising the firmware, and thereby reducing its size.
+
+All ``ulab`` options are listed in a single header file,
+`ulab.h <https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h>`__,
+which contains pre-processor flags for each feature that can be
+fine-tuned. The first couple of lines of the file look like this
+
+.. code:: c
+
+   // The pre-processor constants in this file determine how ulab behaves:
+   //
+   // - how many dimensions ulab can handle
+   // - which functions are included in the compiled firmware
+   // - whether the python syntax is numpy-like, or modular
+   // - whether arrays can be sliced and iterated over
+   // - which binary/unary operators are supported
+   //
+   // A considerable amount of flash space can be saved by removing (setting
+   // the corresponding constants to 0) the unnecessary functions and features.
+
+   // Determines, whether scipy is defined in ulab. The sub-modules and functions
+   // of scipy have to be defined separately
+   #define ULAB_HAS_SCIPY                      (1)
+
+   // The maximum number of dimensions the firmware should be able to support
+   // Possible values lie between 1, and 4, inclusive
+   #define ULAB_MAX_DIMS                       2
+
+   // By setting this constant to 1, iteration over array dimensions will be implemented
+   // as a function (ndarray_rewind_array), instead of writing out the loops in macros
+   // This reduces firmware size at the expense of speed
+   #define ULAB_HAS_FUNCTION_ITERATOR          (0)
+
+   // If NDARRAY_IS_ITERABLE is 1, the ndarray object defines its own iterator function
+   // This option saves approx. 250 bytes of flash space
+   #define NDARRAY_IS_ITERABLE                 (1)
+
+   // Slicing can be switched off by setting this variable to 0
+   #define NDARRAY_IS_SLICEABLE                (1)
+
+   // The default threshold for pretty printing. These variables can be overwritten
+   // at run-time via the set_printoptions() function
+   #define ULAB_HAS_PRINTOPTIONS               (1)
+   #define NDARRAY_PRINT_THRESHOLD             10
+   #define NDARRAY_PRINT_EDGEITEMS             3
+
+   // determines, whether the dtype is an object, or simply a character
+   // the object implementation is numpythonic, but requires more space
+   #define ULAB_HAS_DTYPE_OBJECT               (0)
+
+   // the ndarray binary operators
+   #define NDARRAY_HAS_BINARY_OPS              (1)
+
+   // Firmware size can be reduced at the expense of speed by using function
+   // pointers in iterations. For each operator, the function pointer saves around
+   // 2 kB in the two-dimensional case, and around 4 kB in the four-dimensional case.
+
+   #define NDARRAY_BINARY_USES_FUN_POINTER     (0)
+
+   #define NDARRAY_HAS_BINARY_OP_ADD           (1)
+   #define NDARRAY_HAS_BINARY_OP_EQUAL         (1)
+   #define NDARRAY_HAS_BINARY_OP_LESS          (1)
+   #define NDARRAY_HAS_BINARY_OP_LESS_EQUAL    (1)
+   #define NDARRAY_HAS_BINARY_OP_MORE          (1)
+   #define NDARRAY_HAS_BINARY_OP_MORE_EQUAL    (1)
+   #define NDARRAY_HAS_BINARY_OP_MULTIPLY      (1)
+   #define NDARRAY_HAS_BINARY_OP_NOT_EQUAL     (1)
+   #define NDARRAY_HAS_BINARY_OP_POWER         (1)
+   #define NDARRAY_HAS_BINARY_OP_SUBTRACT      (1)
+   #define NDARRAY_HAS_BINARY_OP_TRUE_DIVIDE   (1)
+   ...     
+
+The meaning of flags with names ``_HAS_`` should be obvious, so we will
+just explain the other options.
+
+To see how much you can gain by un-setting the functions that you do not
+need, here are some pointers. In four dimensions, including all
+functions adds around 120 kB to the ``micropython`` firmware. On the
+other hand, if you are interested in Fourier transforms only, and strip
+everything else, you get away with less than 5 kB extra.
+
+Compatibility with numpy
+------------------------
+
+The functions implemented in ``ulab`` are organised in three sub-modules
+at the C level, namely, ``numpy``, ``scipy``, and ``user``. This
+modularity is elevated to ``python``, meaning that in order to use
+functions that are part of ``numpy``, you have to import ``numpy`` as
+
+.. code:: python
+
+   from ulab import numpy as np
+
+   x = np.array([4, 5, 6])
+   p = np.array([1, 2, 3])
+   np.polyval(p, x)
+
+There are a couple of exceptions to this rule, namely ``fft``, and
+``linalg``, which are sub-modules even in ``numpy``, thus you have to
+write them out as
+
+.. code:: python
+
+   from ulab import numpy as np
+
+   A = np.array([1, 2, 3, 4]).reshape((2, 2))
+   np.linalg.trace(A)
+
+Some of the functions in ``ulab`` are re-implementations of ``scipy``
+functions, and they are to be imported as
+
+.. code:: python
+
+   from ulab import numpy as np
+   from ulab import scipy as spy
+
+
+   x = np.array([1, 2, 3])
+   spy.special.erf(x)
+
+``numpy``-compatibility has an enormous benefit: namely, by
+``try``\ ing to ``import``, we can guarantee that the same, unmodified
+code runs in ``CPython``, as in ``micropython``. The following snippet
+is platform-independent, thus, the ``python`` code can be tested and
+debugged on a computer before loading it onto the microcontroller.
+
+.. code:: python
+
+
+   try:
+       from ulab import numpy as np
+       from ulab import scipy as spy
+   except ImportError:
+       import numpy as np
+       import scipy as spy
+       
+   x = np.array([1, 2, 3])
+   spy.special.erf(x)    
+
+The impact of dimensionality
+----------------------------
+
+Reducing the number of dimensions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``ulab`` supports tensors of rank four, but this is expensive in terms
+of flash: with all available functions and options, the library adds
+around 100 kB to the firmware. However, if such high dimensions are not
+required, significant reductions in size can be achieved by changing the
+value of
+
+.. code:: c
+
+   #define ULAB_MAX_DIMS                   2
+
+Two dimensions cost a bit more than half of four, while you can get away
+with around 20 kB of flash in one dimension, because all those functions
+that don’t make sense (e.g., matrix inversion, eigenvalues etc.) are
+automatically stripped from the firmware.
+
+Using the function iterator
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In higher dimensions, the firmware size increases, because each
+dimension (axis) adds another level of nested loops. An example of this
+is the macro of the binary operator in three dimensions
+
+.. code:: c
+
+   #define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)
+       type_out *array = (type_out *)results->array;
+       size_t j = 0;
+       do {
+           size_t k = 0;
+           do {
+               size_t l = 0;
+               do {
+                   *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));
+                   (larray) += (lstrides)[ULAB_MAX_DIMS - 1];
+                   (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];
+                   l++;
+               } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);
+               (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];
+               (larray) += (lstrides)[ULAB_MAX_DIMS - 2];
+               (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];
+               (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];
+               k++;
+           } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);
+           (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];
+           (larray) += (lstrides)[ULAB_MAX_DIMS - 3];
+           (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];
+           (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];
+           j++;
+       } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);
+
+In order to reduce firmware size, it *might* make sense in higher
+dimensions to make use of the function iterator by setting the
+
+.. code:: c
+
+   #define ULAB_HAS_FUNCTION_ITERATOR      (1)
+
+constant to 1. This allows the compiler to call the
+``ndarray_rewind_array`` function, so that it doesn’t have to unwrap the
+loops for ``k``, and ``j``. Instead of the macro above, we now have
+
+.. code:: c
+
+   #define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)
+       type_out *array = (type_out *)(results)->array;
+       size_t *lcoords = ndarray_new_coords((results)->ndim);
+       size_t *rcoords = ndarray_new_coords((results)->ndim);
+       for(size_t i=0; i < (results)->len/(results)->shape[ULAB_MAX_DIMS -1]; i++) {
+           size_t l = 0;
+           do {
+               *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));
+               (larray) += (lstrides)[ULAB_MAX_DIMS - 1];
+               (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];
+               l++;
+           } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);
+           ndarray_rewind_array((results)->ndim, larray, (results)->shape, lstrides, lcoords);
+           ndarray_rewind_array((results)->ndim, rarray, (results)->shape, rstrides, rcoords);
+       } while(0)
+
+Since the ``ndarray_rewind_array`` function is implemented only once, a
+lot of space can be saved. Obviously, function calls cost time, thus
+such trade-offs must be evaluated for each application. The gain also
+depends on which functions and features you include. Operators and
+functions that involve two arrays are expensive, because at the C level,
+the number of cases that must be handled scales with the square of the
+number of data types. As an example, the innocent-looking expression
+
+.. code:: python
+
+
+   from ulab import numpy as np
+
+   a = np.array([1, 2, 3])
+   b = np.array([4, 5, 6])
+
+   c = a + b
+
+requires 25 loops in C, because the ``dtypes`` of both ``a``, and ``b``
+can assume 5 different values, and the addition has to be resolved for
+all possible cases. Hint: each binary operator costs between 3 and 4 kB
+in two dimensions.
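+
+The count of 25 can be reproduced with a few lines of ``python``. The
+snippet below is an illustration only, and runs without ``ulab``; the
+``dtype`` names are taken from the output of ``dir(np)`` shown in this
+document, while the variable names ``dtypes`` and ``pairs`` are ours.
+
+.. code:: python
+
+   # the five numerical dtypes that an ndarray can hold
+   dtypes = ['uint8', 'int8', 'uint16', 'int16', 'float']
+
+   # each (left, right) combination needs its own loop at the C level
+   pairs = [(left, right) for left in dtypes for right in dtypes]
+   print(len(pairs))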
+
+The ulab version string
+-----------------------
+
+As is customary with ``python`` packages, information on the package
+version can be found by querying the ``__version__`` string.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    import ulab
+    
+    print('you are running ulab version', ulab.__version__)
+
+.. parsed-literal::
+
+    you are running ulab version 2.1.0-2D
+    
+    
+
+
+The first three numbers indicate the major, minor, and sub-minor
+versions of ``ulab`` (defined by the ``ULAB_VERSION`` constant in
+`ulab.c <https://github.com/v923z/micropython-ulab/blob/master/code/ulab.c>`__).
+We usually change the minor version, whenever a new function is added to
+the code, and the sub-minor version will be incremented, if a bug fix is
+implemented.
+
+``2D`` tells us that the particular firmware supports tensors of rank 2
+(defined by ``ULAB_MAX_DIMS`` in
+`ulab.h <https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h>`__).
+
+If you find a bug, please, include the version string in your report!
+
+Should you need the numerical value of ``ULAB_MAX_DIMS``, you can get it
+from the version string in the following way:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    import ulab
+    
+    version = ulab.__version__
+    version_dims = version.split('-')[1]
+    version_num = int(version_dims.replace('D', ''))
+    
+    print('version string: ', version)
+    print('version dimensions: ', version_dims)
+    print('numerical value of dimensions: ', version_num)
+
+.. parsed-literal::
+
+    version string:  2.1.0-2D
+    version dimensions:  2D
+    numerical value of dimensions:  2
+    
+    
+
+
+Finding out what your firmware supports
+---------------------------------------
+
+``ulab`` implements a number of array operators and functions, but this
+does not mean that all of these functions and methods are actually
+compiled into the firmware. You can fine-tune your firmware by
+setting/unsetting any of the ``_HAS_`` constants in
+`ulab.h <https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h>`__.
+
+Functions included in the firmware
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The version string will not tell you everything about your firmware,
+because the supported functions and sub-modules can still arbitrarily be
+included or excluded. One way of finding out what is compiled into the
+firmware is calling ``dir`` with ``ulab`` as its argument.
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    from ulab import scipy as spy
+    
+    
+    print('===== constants, functions, and modules of numpy =====\n\n', dir(np))
+    
+    # since fft and linalg are sub-modules, print them separately
+    print('\nfunctions included in the fft module:\n', dir(np.fft))
+    print('\nfunctions included in the linalg module:\n', dir(np.linalg))
+    
+    print('\n\n===== modules of scipy =====\n\n', dir(spy))
+    print('\nfunctions included in the optimize module:\n', dir(spy.optimize))
+    print('\nfunctions included in the signal module:\n', dir(spy.signal))
+    print('\nfunctions included in the special module:\n', dir(spy.special))
+
+.. parsed-literal::
+
+    ===== constants, functions, and modules of numpy =====
+    
+     ['__class__', '__name__', 'bool', 'sort', 'sum', 'acos', 'acosh', 'arange', 'arctan2', 'argmax', 'argmin', 'argsort', 'around', 'array', 'asin', 'asinh', 'atan', 'atanh', 'ceil', 'clip', 'concatenate', 'convolve', 'cos', 'cosh', 'cross', 'degrees', 'diag', 'diff', 'e', 'equal', 'exp', 'expm1', 'eye', 'fft', 'flip', 'float', 'floor', 'frombuffer', 'full', 'get_printoptions', 'inf', 'int16', 'int8', 'interp', 'linalg', 'linspace', 'log', 'log10', 'log2', 'logspace', 'max', 'maximum', 'mean', 'median', 'min', 'minimum', 'nan', 'ndinfo', 'not_equal', 'ones', 'pi', 'polyfit', 'polyval', 'radians', 'roll', 'set_printoptions', 'sin', 'sinh', 'sqrt', 'std', 'tan', 'tanh', 'trapz', 'uint16', 'uint8', 'vectorize', 'zeros']
+    
+    functions included in the fft module:
+     ['__class__', '__name__', 'fft', 'ifft']
+    
+    functions included in the linalg module:
+     ['__class__', '__name__', 'cholesky', 'det', 'dot', 'eig', 'inv', 'norm', 'trace']
+    
+    
+    ===== modules of scipy =====
+    
+     ['__class__', '__name__', 'optimize', 'signal', 'special']
+    
+    functions included in the optimize module:
+     ['__class__', '__name__', 'bisect', 'fmin', 'newton']
+    
+    functions included in the signal module:
+     ['__class__', '__name__', 'sosfilt', 'spectrogram']
+    
+    functions included in the special module:
+     ['__class__', '__name__', 'erf', 'erfc', 'gamma', 'gammaln']
+    
+    
+
+
+Methods included in the firmware
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``dir`` function applied to the module or its sub-modules gives
+information on what the module and sub-modules include, but is not
+enough to find out which methods the ``ndarray`` class supports. We can
+list the methods by calling ``dir`` with the ``array`` object itself:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    print(dir(np.array))
+
+.. parsed-literal::
+
+    ['__class__', '__name__', 'copy', 'sort', '__bases__', '__dict__', 'dtype', 'flatten', 'itemsize', 'reshape', 'shape', 'size', 'strides', 'tobytes', 'transpose']
+    
+    
+
+
+Operators included in the firmware
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A list of operators cannot be generated as shown above. If you really
+need to find out, whether, e.g., the ``**`` operator is supported by the
+firmware, you have to ``try`` it:
+
+.. code::
+        
+    # code to be run in micropython
+    
+    from ulab import numpy as np
+    
+    a = np.array([1, 2, 3])
+    b = np.array([4, 5, 6])
+    
+    try:
+        print(a ** b)
+    except Exception as e:
+        print('operator is not supported: ', e)
+
+.. parsed-literal::
+
+    operator is not supported:  unsupported types for __pow__: 'ndarray', 'ndarray'
+    
+    
+
+
+The exception above would be raised, if the firmware was compiled with
+the
+
+.. code:: c
+
+   #define NDARRAY_HAS_BINARY_OP_POWER         (0)
+
+definition.
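+
+If several operators have to be checked, the ``try`` can be wrapped in a
+small helper. The following is a sketch only: the function name
+``supports`` is hypothetical, and is not part of ``ulab``, and the
+lambdas merely spell out the operators that we want to probe. Plain
+lists stand in for ``ndarray``\ s here, so that the snippet runs on any
+``python`` interpreter.
+
+.. code:: python
+
+   def supports(operation, a, b):
+       # returns True, if operation(a, b) can be evaluated without an exception
+       try:
+           operation(a, b)
+           return True
+       except Exception:
+           return False
+
+   # replace the lists by ndarrays in order to probe your firmware
+   a, b = [1, 2, 3], [4, 5, 6]
+   print('+ :', supports(lambda x, y: x + y, a, b))
+   print('**:', supports(lambda x, y: x ** y, a, b))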
--- a/docs/manual/source/ulab-ndarray.rst
+++ b/docs/manual/source/ulab-ndarray.rst
--- a/docs/manual/source/ulab-programming.rst
+++ b/docs/manual/source/ulab-programming.rst
@ -0,0 +1,911 @@
+
+Programming ulab
+================
+
+Earlier we have seen, how ``ulab``\ ’s functions and methods can be
+accessed in ``micropython``. This last section of the book explains, how
+these functions are implemented. By the end of this chapter, not only
+would you be able to extend ``ulab``, and write your own
+``numpy``-compatible functions, but through a deeper understanding of
+the inner workings of the functions, you would also be able to see what
+the trade-offs are at the ``python`` level.
+
+Code organisation
+-----------------
+
+As mentioned earlier, the ``python`` functions are organised into
+sub-modules at the C level. The C sub-modules can be found in
+``./ulab/code/``.
+
+The ``ndarray`` object
+----------------------
+
+General comments
+~~~~~~~~~~~~~~~~
+
+``ndarrays`` are efficient containers of numerical data of the same type
+(i.e., signed/unsigned chars, signed/unsigned integers or
+``mp_float_t``\ s, which, depending on the platform, are either C
+``float``\ s, or C ``double``\ s). Beyond storing the actual data in the
+void pointer ``*array``, the type definition has seven additional
+members (on top of the ``base`` type). Namely, the ``dtype``, which
+tells us, how the bytes are to be interpreted; the ``itemsize``, which
+stores the size of a single entry in the array; ``boolean``, an unsigned
+integer, which determines, whether the array is to be treated as a set
+of Booleans, or as numerical data; ``ndim``, the number of dimensions
+(``uint8_t``); ``len``, the length of the array (the number of entries);
+the ``shape`` (an array of ``size_t``); and the ``strides`` (an array of
+``int32_t``). The length is simply the product of the numbers in
+``shape``.
+
+The type definition is as follows:
+
+.. code:: c
+
+   typedef struct _ndarray_obj_t {
+       mp_obj_base_t base;
+       uint8_t dtype;
+       uint8_t itemsize;
+       uint8_t boolean;
+       uint8_t ndim;
+       size_t len;
+       size_t shape[ULAB_MAX_DIMS];
+       int32_t strides[ULAB_MAX_DIMS];
+       void *array;
+   } ndarray_obj_t;
+
+Memory layout
+~~~~~~~~~~~~~
+
+The values of an ``ndarray`` are stored in a contiguous segment in the
+RAM. The ``ndarray`` can be dense, meaning that all numbers in the
+linear memory segment belong to a linear combination of coordinates, and
+it can also be sparse, i.e., some elements of the linear storage space
+will be skipped, when the elements of the tensor are traversed.
+
+In the RAM, the position of the item
+:math:`M(n_1, n_2, ..., n_{k-1}, n_k)` in a dense tensor of rank
+:math:`k` is given by the linear combination
+
+.. math::
+
+   P(n_1, n_2, ..., n_{k-1}, n_k) = n_1 s_1 + n_2 s_2 + ... + n_{k-1}s_{k-1} + n_ks_k = \sum_{i=1}^{k}n_is_i
+
+where :math:`s_i` are the strides of the tensor, defined as
+
+.. math::
+
+   s_i = \prod_{j=i+1}^k l_j
+
+where :math:`l_j` is the length of the tensor along the :math:`j`\ th axis.
+When the tensor is sparse (e.g., when the tensor is sliced), the strides
+along a particular axis will be multiplied by a non-zero integer. If
+this integer is different to :math:`\pm 1`, the linear combination above
+cannot access all elements in the RAM, i.e., some numbers will be
+skipped. Note that :math:`|s_1| > |s_2| > ... > |s_{k-1}| > |s_k|`, even
+if the tensor is sparse. For dense tensors, the statement follows
+directly from the definition of :math:`s_i`, since
+:math:`s_i/s_{i+1} = l_{i+1}`. For sparse tensors, it holds, because a
+slice cannot have a step larger than the shape along that axis.
+
+When creating a *view*, we simply re-calculate the ``strides``, and
+re-set the ``*array`` pointer.
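+
+As an illustration of the two formulas of this section, the strides of a
+dense tensor and the position of an element can be calculated as shown
+below. This is a sketch only: the function names ``dense_strides`` and
+``position`` are ours, and the strides are measured in units of
+elements, not bytes.
+
+.. code:: python
+
+   def dense_strides(shape):
+       # s_i is the product of the lengths of all axes to the right of axis i
+       strides = [1] * len(shape)
+       for i in range(len(shape) - 2, -1, -1):
+           strides[i] = strides[i + 1] * shape[i + 1]
+       return strides
+
+   def position(coords, strides):
+       # P(n_1, ..., n_k) is the sum of the products n_i * s_i
+       return sum(n * s for n, s in zip(coords, strides))
+
+   strides = dense_strides([2, 3, 4])
+   print(strides)                        # [12, 4, 1]
+   print(position([1, 2, 3], strides))   # 23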
+
+Iterating over elements of a tensor
+-----------------------------------
+
+The ``shape`` and ``strides`` members of the array tell us how we have
+to move our pointer, when we want to read out the numbers. For technical
+reasons that will become clear later, the numbers in ``shape`` and in
+``strides`` are aligned to the right, and begin on the right hand side,
+i.e., if the number of possible dimensions is ``ULAB_MAX_DIMS``, then
+``shape[ULAB_MAX_DIMS-1]`` is the length of the last axis,
+``shape[ULAB_MAX_DIMS-2]`` is the length of the last but one axis, and
+so on. If the number of actual dimensions, ``ndim < ULAB_MAX_DIMS``, the
+first ``ULAB_MAX_DIMS - ndim`` entries in ``shape`` and ``strides`` will
+be equal to zero, but they could, in fact, be assigned any value,
+because these will never be accessed in an operation.
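+
+As a small illustration of the right-aligned storage, the helper below
+pads a shape with zeros on the left. The name ``right_align`` is
+hypothetical, and ``max_dims`` stands for ``ULAB_MAX_DIMS``.
+
+.. code:: python
+
+   def right_align(values, max_dims=4):
+       # the last entry always describes the last axis
+       return [0] * (max_dims - len(values)) + list(values)
+
+   shape = right_align([3, 5])
+   print(shape)       # [0, 0, 3, 5]
+   print(shape[-1])   # 5, the length of the last axis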
+
+With this definition of the strides, the linear combination in
+:math:`P(n_1, n_2, ..., n_{k-1}, n_k)` is a one-to-one mapping from the
+space of tensor coordinates, :math:`(n_1, n_2, ..., n_{k-1}, n_k)`, to
+the coordinate in the linear array,
+:math:`n_1s_1 + n_2s_2 + ... + n_{k-1}s_{k-1} + n_ks_k`, i.e., no two
+distinct sets of coordinates will result in the same position in the
+linear array.
+
+Since the ``strides`` are given in terms of bytes, when we iterate over
+an array, the void data pointer is usually cast to ``uint8_t``, and the
+values are converted using the proper data type stored in
+``ndarray->dtype``. However, there might be cases, when it makes perfect
+sense to cast ``*array`` to a different type, in which case the
+``strides`` have to be re-scaled by the value of ``ndarray->itemsize``.
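+
+The snippet below mimics this in ``python``: it reads the elements of a
+2-by-3 array of 16-bit integers from a raw byte buffer, using strides
+given in bytes. It is a sketch only; ``read_item`` is a hypothetical
+helper, and the ``struct`` module plays the role of the cast in C.
+
+.. code:: python
+
+   import struct
+
+   # six int16 values in native byte order, i.e., a dense array of shape (2, 3)
+   buffer = struct.pack('6h', 1, 2, 3, 4, 5, 6)
+   itemsize = struct.calcsize('h')
+   strides = (3 * itemsize, itemsize)    # strides in bytes
+
+   def read_item(row, column):
+       # the byte offset is the linear combination of coordinates and strides
+       offset = row * strides[0] + column * strides[1]
+       return struct.unpack_from('h', buffer, offset)[0]
+
+   print(read_item(1, 2))                    # 6
+   # strides in units of elements are obtained by dividing by the itemsize
+   print([s // itemsize for s in strides])   # [3, 1]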
+
+Iterating using the unwrapped loops
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following macro definition is taken from
+`vector.h <https://github.com/v923z/micropython-ulab/blob/master/code/numpy/vector/vector.h>`__,
+and demonstrates, how we can iterate over a single array in four
+dimensions.
+
+.. code:: c
+
+   #define ITERATE_VECTOR(type, array, source, sarray) do {
+       size_t i=0;
+       do {
+           size_t j = 0;
+           do {
+               size_t k = 0;
+               do {
+                   size_t l = 0;
+                   do {
+                       *(array)++ = f(*((type *)(sarray)));
+                       (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];
+                       l++;
+                   } while(l < (source)->shape[ULAB_MAX_DIMS-1]);
+                   (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];
+                   (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];
+                   k++;
+               } while(k < (source)->shape[ULAB_MAX_DIMS-2]);
+               (sarray) -= (source)->strides[ULAB_MAX_DIMS - 2] * (source)->shape[ULAB_MAX_DIMS-2];
+               (sarray) += (source)->strides[ULAB_MAX_DIMS - 3];
+               j++;
+           } while(j < (source)->shape[ULAB_MAX_DIMS-3]);
+           (sarray) -= (source)->strides[ULAB_MAX_DIMS - 3] * (source)->shape[ULAB_MAX_DIMS-3];
+           (sarray) += (source)->strides[ULAB_MAX_DIMS - 4];
+           i++;
+       } while(i < (source)->shape[ULAB_MAX_DIMS-4]);
+   } while(0)
+
+We start with the innermost loop, the one running over ``l``. ``array`` is
+already of type ``mp_float_t``, while the source array, ``sarray``, has
+been cast to ``uint8_t`` in the calling function. The numbers contained
+in ``sarray`` have to be read out in the proper type dictated by
+``ndarray->dtype``. This is what happens in the statement
+``*((type *)(sarray))``, and this number is then fed into the function
+``f``. Vectorised mathematical functions produce *dense* arrays, and for
+this reason, we can simply advance the ``array`` pointer.
+
+The advancing of the ``sarray`` pointer is a bit more involved: first,
+in the innermost loop, we simply move forward by the amount given by the
+last stride, which is ``(source)->strides[ULAB_MAX_DIMS - 1]``, because
+the ``shape`` and the ``strides`` are aligned to the right. We move the
+pointer as many times as given by ``(source)->shape[ULAB_MAX_DIMS-1]``,
+which is the length of the very last axis. Hence the structure of
+the loop
+
+.. code:: c
+
+       size_t l = 0;
+       do {
+           ...
+           l++;
+       } while(l < (source)->shape[ULAB_MAX_DIMS-1]);
+
+Once we have exhausted the last axis, we have to re-wind the pointer,
+and advance it by an amount given by the last but one stride. Keep in
+mind that in the innermost loop we moved our pointer
+``(source)->shape[ULAB_MAX_DIMS-1]`` times by
+``(source)->strides[ULAB_MAX_DIMS - 1]``, i.e., we re-wind it by moving
+it backwards by
+``(source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1]``.
+In the next step, we move forward by
+``(source)->strides[ULAB_MAX_DIMS - 2]``, which is the last but one
+stride.
+
+.. code:: c
+
+       (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];
+       (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];
+
+This pattern must be repeated for each axis of the array, and this is
+how we arrive at the four nested loops listed above.
+
+Re-winding arrays by means of a function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In addition to un-wrapping the iteration loops by means of macros, there
+is another way of traversing all elements of a tensor: we note that,
+since :math:`|s_1| > |s_2| > ... > |s_{k-1}| > |s_k|`,
+:math:`P(n_1, n_2, ..., n_{k-1}, n_k)` changes most slowly in the last
+coordinate. Hence, if we start from the very beginning, (:math:`n_i = 0`
+for all :math:`i`), and walk along the linear RAM segment, we increment
+the value of :math:`n_k` as long as :math:`n_k < l_k`. Once
+:math:`n_k = l_k`, we have to reset :math:`n_k` to 0, and increment
+:math:`n_{k-1}` by one. After each such round, :math:`n_{k-1}` will be
+incremented by one, as long as :math:`n_{k-1} < l_{k-1}`. Once
+:math:`n_{k-1} = l_{k-1}`, we reset both :math:`n_k`, and
+:math:`n_{k-1}` to 0, and increment :math:`n_{k-2}` by one.
+
+Rewinding the arrays in this way is implemented in the function
+``ndarray_rewind_array`` in
+`ndarray.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c>`__.
+
+.. code:: c
+
+   void ndarray_rewind_array(uint8_t ndim, uint8_t *array, size_t *shape, int32_t *strides, size_t *coords) {
+       // resets the data pointer of a single array, whenever an axis is full
+       // since we always iterate over the very last axis, we have to keep track of
+       // the last ndim-2 axes only
+       array -= shape[ULAB_MAX_DIMS - 1] * strides[ULAB_MAX_DIMS - 1];
+       array += strides[ULAB_MAX_DIMS - 2];
+       for(uint8_t i=1; i < ndim-1; i++) {
+           coords[ULAB_MAX_DIMS - 1 - i] += 1;
+           if(coords[ULAB_MAX_DIMS - 1 - i] == shape[ULAB_MAX_DIMS - 1 - i]) { // we are at a dimension boundary
+               array -= shape[ULAB_MAX_DIMS - 1 - i] * strides[ULAB_MAX_DIMS - 1 - i];
+               array += strides[ULAB_MAX_DIMS - 2 - i];
+               coords[ULAB_MAX_DIMS - 1 - i] = 0;
+               coords[ULAB_MAX_DIMS - 2 - i] += 1;
+           } else { // coordinates can change only, if the last coordinate changes
+               return;
+           }
+       }
+   }
+
+and the function would be called as in the snippet below. Note that the
+innermost loop is factored out, so that we can save the ``if(...)``
+statement for the last axis.
+
+.. code:: c
+
+       size_t *coords = ndarray_new_coords(results->ndim);
+       for(size_t i=0; i < results->len/results->shape[ULAB_MAX_DIMS -1]; i++) {
+           size_t l = 0;
+           do {
+               ...
+               l++;
+           } while(l < results->shape[ULAB_MAX_DIMS - 1]);
+           ndarray_rewind_array(results->ndim, array, results->shape, strides, coords);
+       } while(0)
+
+The advantage of this method is that the implementation is independent
+of the number of dimensions: the iteration requires more or less the
+same flash space for 2 dimensions as for 22. However, the price we have
+to pay for this convenience is the extra function call.
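+
+The bookkeeping of the coordinates is perhaps easier to follow in
+``python``. The generator below is a simplified sketch of the idea, and
+not a translation of ``ndarray_rewind_array``: the name
+``iterate_offsets`` is hypothetical, the shape and strides are plain
+lists that are not padded to ``ULAB_MAX_DIMS``, and the strides are
+given in units of elements.
+
+.. code:: python
+
+   def iterate_offsets(shape, strides):
+       # yields the linear offset of each element, last axis changing fastest
+       ndim = len(shape)
+       coords = [0] * ndim
+       offset = 0
+       total = 1
+       for length in shape:
+           total *= length
+       for _ in range(total):
+           yield offset
+           # advance the last coordinate, and carry over at dimension boundaries
+           for axis in range(ndim - 1, -1, -1):
+               coords[axis] += 1
+               offset += strides[axis]
+               if coords[axis] < shape[axis]:
+                   break
+               # the axis is exhausted: re-wind it, and move on to the next one
+               offset -= shape[axis] * strides[axis]
+               coords[axis] = 0
+
+   # a dense 2-by-3 array, and its transposed view of shape (3, 2)
+   print(list(iterate_offsets([2, 3], [3, 1])))   # [0, 1, 2, 3, 4, 5]
+   print(list(iterate_offsets([3, 2], [1, 3])))   # [0, 3, 1, 4, 2, 5]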
+
+Iterating over two ndarrays simultaneously: broadcasting
+--------------------------------------------------------
+
+Whenever we invoke a binary operator, call a function with two arguments
+of ``ndarray`` type, or assign something to an ``ndarray``, we have to
+iterate over two views at the same time. The task is trivial, if the two
+``ndarray``\ s in question have the same shape (but not necessarily the
+same set of strides), because in this case, we can still iterate in the
+same loop. All that happens is that we move two data pointers in sync.
+
+The problem becomes a bit more involved, when the shapes of the two
+``ndarray``\ s are not identical. For such cases, ``numpy`` defines
+so-called broadcasting, which boils down to two rules.
+
+1. The shape of the tensor with lower rank has to be prepended with
+   axes of size 1 till the two ranks become equal.
+2. Along all axes the two tensors should have the same size, or one of
+   the sizes must be 1.
+
+If, after applying the first rule the second is not satisfied, the two
+``ndarray``\ s cannot be broadcast together.
+
+Now, let us suppose that we have two compatible ``ndarray``\ s, i.e.,
+after applying the first rule, the second is satisfied. How do we
+iterate over the elements in the tensors?
+
+We should recall, what exactly we do, when iterating over a single
+array: normally, we move the data pointer by the last stride, except,
+when we arrive at a dimension boundary (when the last axis is
+exhausted). At that point, we move the pointer by an amount dictated by
+the strides. And this is the key: *dictated by the strides*. Now, if we
+have two arrays that are originally not compatible, we define new
+strides for them, and use these in the iteration. With that, we are back
+to the case, where we had two compatible arrays.
+
+Now, let us look at the second broadcasting rule: if the two arrays have
+the same size, we take both ``ndarray``\ s’ strides along that axis. If,
+on the other hand, one of the ``ndarray``\ s is of length 1 along one of
+its axes, we set the corresponding strides to 0. This will ensure that
+the data pointer is not moved, when we iterate over both ``ndarray``\ s
+at the same time.
+
+Thus, in order to implement broadcasting, we first have to check,
+whether the two above-mentioned rules can be satisfied, and if so, we
+have to find the two new sets of strides.
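+
+The two steps can be sketched in ``python`` as follows. This is an
+illustration under simplifying assumptions, and not the C
+implementation: the function name ``broadcast`` is ours, and the shapes
+and strides are plain lists that are not padded to ``ULAB_MAX_DIMS``.
+
+.. code:: python
+
+   def broadcast(lshape, lstrides, rshape, rstrides):
+       # returns the common shape and the two new sets of strides,
+       # or None, if the two shapes cannot be broadcast together
+       ndim = max(len(lshape), len(rshape))
+       # rule 1: prepend axes of length 1 (with stride 0) to the shorter shape
+       lstrides = [0] * (ndim - len(lshape)) + list(lstrides)
+       lshape = [1] * (ndim - len(lshape)) + list(lshape)
+       rstrides = [0] * (ndim - len(rshape)) + list(rstrides)
+       rshape = [1] * (ndim - len(rshape)) + list(rshape)
+       shape = []
+       for i in range(ndim):
+           # rule 2: the lengths must be equal, or one of them must be 1
+           if lshape[i] != rshape[i] and lshape[i] != 1 and rshape[i] != 1:
+               return None
+           shape.append(rshape[i] if lshape[i] == 1 else lshape[i])
+           # an axis of length 1 is not advanced: its stride is set to 0
+           if lshape[i] == 1:
+               lstrides[i] = 0
+           if rshape[i] == 1:
+               rstrides[i] = 0
+       return shape, lstrides, rstrides
+
+   # a (2, 3) array combined with a vector of length 3, and of length 2
+   print(broadcast([2, 3], [3, 1], [3], [1]))   # ([2, 3], [3, 1], [0, 1])
+   print(broadcast([2, 3], [3, 1], [2], [1]))   # None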
+
+The ``ndarray_can_broadcast`` function from
+`ndarray.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c>`__
+takes two ``ndarray``\ s, and returns ``true``, if the two arrays can be
+broadcast together. At the same time, it also calculates new strides for
+the two arrays, so that they can be iterated over at the same time.
+
+.. code:: c
+
+   bool ndarray_can_broadcast(ndarray_obj_t *lhs, ndarray_obj_t *rhs, uint8_t *ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
+       // returns True or False, depending on, whether the two arrays can be broadcast together
+       // numpy's broadcasting rules are as follows:
+       //
+       // 1. the two shapes are either equal
+       // 2. one of the shapes is 1
+       memset(lstrides, 0, sizeof(int32_t)*ULAB_MAX_DIMS);
+       memset(rstrides, 0, sizeof(int32_t)*ULAB_MAX_DIMS);
+       lstrides[ULAB_MAX_DIMS - 1] = lhs->strides[ULAB_MAX_DIMS - 1];
+       rstrides[ULAB_MAX_DIMS - 1] = rhs->strides[ULAB_MAX_DIMS - 1];
+       for(uint8_t i=ULAB_MAX_DIMS; i > 0; i--) {
+           if((lhs->shape[i-1] == rhs->shape[i-1]) || (lhs->shape[i-1] == 0) || (lhs->shape[i-1] == 1) ||
+           (rhs->shape[i-1] == 0) || (rhs->shape[i-1] == 1)) {
+               shape[i-1] = MAX(lhs->shape[i-1], rhs->shape[i-1]);
+               if(shape[i-1] > 0) (*ndim)++;
+               if(lhs->shape[i-1] < 2) {
+                   lstrides[i-1] = 0;
+               } else {
+                   lstrides[i-1] = lhs->strides[i-1];
+               }
+               if(rhs->shape[i-1] < 2) {
+                   rstrides[i-1] = 0;
+               } else {
+                   rstrides[i-1] = rhs->strides[i-1];
+               }
+           } else {
+               return false;
+           }
+       }
+       return true;
+   }
+
+A good example of how the function would be called can be found in
+`vector.c <https://github.com/v923z/micropython-ulab/blob/master/code/numpy/vector/vector.c>`__,
+in the ``vectorise_arctan2`` function:
+
+.. code:: c
+
+   mp_obj_t vectorise_arctan2(mp_obj_t y, mp_obj_t x) {
+       ...
+       uint8_t ndim = 0;
+       size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+       int32_t *xstrides = m_new(int32_t, ULAB_MAX_DIMS);
+       int32_t *ystrides = m_new(int32_t, ULAB_MAX_DIMS);
+       if(!ndarray_can_broadcast(ndarray_x, ndarray_y, &ndim, shape, xstrides, ystrides)) {
+           // release the helper arrays first: mp_raise_ValueError does not return
+           m_del(size_t, shape, ULAB_MAX_DIMS);
+           m_del(int32_t, xstrides, ULAB_MAX_DIMS);
+           m_del(int32_t, ystrides, ULAB_MAX_DIMS);
+           mp_raise_ValueError(translate("operands could not be broadcast together"));
+       }
+
+       uint8_t *xarray = (uint8_t *)ndarray_x->array;
+       uint8_t *yarray = (uint8_t *)ndarray_y->array;
+       
+       ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);
+       mp_float_t *rarray = (mp_float_t *)results->array;
+       ...
+
+After the new strides have been calculated, the iteration loop is
+identical to what we discussed in the previous section.
+
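+The stride bookkeeping of ``ndarray_can_broadcast`` can also be tried out
+without ``micropython``. The following is a minimal sketch, not ``ulab``
+code: ``SKETCH_MAX_DIMS``, ``sketch_array_t``, and
+``sketch_can_broadcast`` are hypothetical names that mirror only the
+``shape`` and ``strides`` members of the ``ndarray``.
+
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DIMS 4  /* stand-in for ULAB_MAX_DIMS (assumption) */

/* Hypothetical, minimal stand-in for ndarray_obj_t: shape and strides are
   right-aligned, and unused leading axes have length 0. */
typedef struct {
    size_t shape[SKETCH_MAX_DIMS];
    int32_t strides[SKETCH_MAX_DIMS];
} sketch_array_t;

/* Returns true, if the two shapes are compatible, and fills in the common
   shape, the number of dimensions, and the strides to iterate with. */
static bool sketch_can_broadcast(const sketch_array_t *lhs, const sketch_array_t *rhs,
        uint8_t *ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {
    *ndim = 0;
    for(uint8_t i = SKETCH_MAX_DIMS; i > 0; i--) {
        size_t l = lhs->shape[i-1], r = rhs->shape[i-1];
        /* two axes clash only if both are longer than 1, and they differ */
        if((l != r) && (l > 1) && (r > 1)) {
            return false;
        }
        shape[i-1] = (l > r) ? l : r;
        if(shape[i-1] > 0) (*ndim)++;
        /* an axis of length 0 or 1 is repeated: its stride becomes 0 */
        lstrides[i-1] = (l < 2) ? 0 : lhs->strides[i-1];
        rstrides[i-1] = (r < 2) ? 0 : rhs->strides[i-1];
    }
    return true;
}
```
+
+For a (3, 1) and a (1, 5) array, the sketch reports the common shape
+(3, 5), and sets the stride of each length-1 axis to 0, so that the
+single item along that axis is re-used in every iteration.
+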
+Contracting an ``ndarray``
+--------------------------
+
+There are many operations that reduce the number of dimensions of an
+``ndarray`` by 1, i.e., that remove an axis from the tensor. The drill
+is the same as before, with the exception that first we have to remove
+the ``strides`` and ``shape`` entries that correspond to the axis along
+which we intend to contract. The ``numerical_reduce_axes`` function from
+`numerical.c <https://github.com/v923z/micropython-ulab/blob/master/code/numerical/numerical.c>`__
+does that.
+
+.. code:: c
+
+   static void numerical_reduce_axes(ndarray_obj_t *ndarray, int8_t axis, size_t *shape, int32_t *strides) {
+       // removes the values corresponding to a single axis from the shape and strides array
+       uint8_t index = ULAB_MAX_DIMS - ndarray->ndim + axis;
+       if((ndarray->ndim == 1) && (axis == 0)) {
+           index = 0;
+           shape[ULAB_MAX_DIMS - 1] = 0;
+           return;
+       }
+       for(uint8_t i = ULAB_MAX_DIMS - 1; i > 0; i--) {
+           if(i > index) {
+               shape[i] = ndarray->shape[i];
+               strides[i] = ndarray->strides[i];
+           } else {
+               shape[i] = ndarray->shape[i-1];
+               strides[i] = ndarray->strides[i-1];
+           }
+       }
+   }
+
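+The shifting of the ``shape`` and ``strides`` entries can be followed in
+isolation in the sketch below. It is a stand-alone illustration under
+the assumption of right-aligned arrays of length ``SKETCH_MAX_DIMS``;
+``sketch_reduce_axis`` is a hypothetical name, and it is not the
+function from ``numerical.c``.
+
```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DIMS 4  /* stand-in for ULAB_MAX_DIMS (assumption) */

/* Removes `axis` from a right-aligned shape/strides pair of `ndim`
   dimensions. The remaining entries are shifted towards the end, and the
   slot that becomes free at the front is set to 0. */
static void sketch_reduce_axis(const size_t *shape, const int32_t *strides,
        uint8_t ndim, uint8_t axis, size_t *rshape, int32_t *rstrides) {
    uint8_t index = (uint8_t)(SKETCH_MAX_DIMS - ndim + axis);
    rshape[0] = 0;
    rstrides[0] = 0;
    for(uint8_t i = SKETCH_MAX_DIMS - 1; i > 0; i--) {
        if(i > index) {
            /* axes behind the removed one keep their position */
            rshape[i] = shape[i];
            rstrides[i] = strides[i];
        } else {
            /* axes in front of the removed one move one slot to the right */
            rshape[i] = shape[i-1];
            rstrides[i] = strides[i-1];
        }
    }
}
```
+
+With a dense ``float`` array of ``shape`` (3, 4, 5), removing axis 1
+leaves the ``shape`` (3, 5) and the ``strides`` (80, 4), stored in the
+last two slots.
+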
+Once the reduced ``strides`` and ``shape`` are known, we place the axis
+in question in the innermost loop, and wrap it in the loops over the
+remaining axes, whose lengths and increments are stored in the ``shape``
+and ``strides`` arrays. The
+``RUN_STD`` macro from
+`numerical.h <https://github.com/v923z/micropython-ulab/blob/master/code/numpy/numerical/numerical.h>`__
+is a good example. The macro is expanded in the
+``numerical_sum_mean_std_ndarray`` function.
+
+.. code:: c
+
+   static mp_obj_t numerical_sum_mean_std_ndarray(ndarray_obj_t *ndarray, mp_obj_t axis, uint8_t optype, size_t ddof) {
+       uint8_t *array = (uint8_t *)ndarray->array;
+       size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+       memset(shape, 0, sizeof(size_t)*ULAB_MAX_DIMS);
+       int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);
+       memset(strides, 0, sizeof(int32_t)*ULAB_MAX_DIMS);
+
+       int8_t ax = mp_obj_get_int(axis);
+       if(ax < 0) ax += ndarray->ndim;
+       if((ax < 0) || (ax > ndarray->ndim - 1)) {
+           mp_raise_ValueError(translate("index out of range"));
+       }
+       numerical_reduce_axes(ndarray, ax, shape, strides);
+       uint8_t index = ULAB_MAX_DIMS - ndarray->ndim + ax;
+       ndarray_obj_t *results = NULL;
+       uint8_t *rarray = NULL;
+       ...
+
+Here is the macro for the three-dimensional case:
+
+.. code:: c
+
+   #define RUN_STD(ndarray, type, array, results, r, shape, strides, index, div) do {\
+       size_t k = 0;\
+       do {\
+           size_t l = 0;\
+           do {\
+               RUN_STD1((ndarray), type, (array), (results), (r), (index), (div));\
+               (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\
+               (array) += (strides)[ULAB_MAX_DIMS - 1];\
+               l++;\
+           } while(l < (shape)[ULAB_MAX_DIMS - 1]);\
+           (array) -= (strides)[ULAB_MAX_DIMS - 1] * (shape)[ULAB_MAX_DIMS - 1];\
+           (array) += (strides)[ULAB_MAX_DIMS - 2];\
+           k++;\
+       } while(k < (shape)[ULAB_MAX_DIMS - 2]);\
+   } while(0)
+
+In ``RUN_STD``, we simply move our pointers; the calculation itself
+happens in the ``RUN_STD1`` macro below. (Note that this is the
+implementation of the numerically stable Welford algorithm.)
+
+.. code:: c
+
+   #define RUN_STD1(ndarray, type, array, results, r, index, div)\
+   ({\
+       mp_float_t M, m, S = 0.0, s = 0.0;\
+       M = m = (mp_float_t)(*(type *)(array));\
+       for(size_t i=1; i < (ndarray)->shape[(index)]; i++) {\
+           (array) += (ndarray)->strides[(index)];\
+           mp_float_t value = (mp_float_t)(*(type *)(array));\
+           m = M + (value - M) / (mp_float_t)(i + 1);\
+           s = S + (value - M) * (value - m);\
+           M = m;\
+           S = s;\
+       }\
+       (array) += (ndarray)->strides[(index)];\
+       *(r)++ = MICROPY_FLOAT_C_FUN(sqrt)((ndarray)->shape[(index)] * s / (div));\
+   })
+
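+For reference, Welford's one-pass update can also be written as a plain
+function. The sketch below is an illustration only: the name
+``sketch_variance_int16`` and its signature are hypothetical, the
+``dtype`` is assumed to be ``int16``, and the function returns the
+variance, whose square root is the standard deviation.
+
```c
#include <stddef.h>
#include <stdint.h>

/* Variance of n int16 items, read through a byte pointer that is advanced
   by `stride` bytes per item, using Welford's one-pass update.
   `ddof` is the delta degrees of freedom; the divisor is n - ddof. */
static double sketch_variance_int16(const uint8_t *array, size_t n, int32_t stride, size_t ddof) {
    if(n <= ddof) return 0.0;
    double mean = 0.0;    /* running mean */
    double sum_sq = 0.0;  /* running sum of squared deviations */
    for(size_t i = 0; i < n; i++) {
        double value = (double)(*(const int16_t *)array);
        double delta = value - mean;
        mean += delta / (double)(i + 1);
        sum_sq += delta * (value - mean);
        array += stride;
    }
    return sum_sq / (double)(n - ddof);
}
```
+
+Since the items are visited exactly once, and the pointer is moved by
+the stride of the contracted axis, the same loop body works for dense
+arrays and for views.
+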
+Upcasting
+---------
+
+When in an operation the ``dtype``\ s of two arrays are different, the
+result’s ``dtype`` will be decided by the following upcasting rules:
+
+1. Operations with two ``ndarray``\ s of the same ``dtype`` preserve
+   their ``dtype``, even when the results overflow.
+
+2. If either of the operands is a float, the result automatically
+   becomes a float.
+
+3. Otherwise
+
+   -  ``uint8`` + ``int8`` => ``int16``
+
+   -  ``uint8`` + ``int16`` => ``int16``
+
+   -  ``uint8`` + ``uint16`` => ``uint16``
+
+   -  ``int8`` + ``int16`` => ``int16``
+
+   -  ``int8`` + ``uint16`` => ``uint16`` (in numpy, the result is an
+      ``int32``)
+
+   -  ``uint16`` + ``int16`` => ``float`` (in numpy, the result is an
+      ``int32``)
+
+4. When one operand of a binary operation is a generic scalar
+   ``micropython`` variable, i.e., ``mp_obj_int``, or ``mp_obj_float``,
+   it will be converted to a linear array of length 1, and with the
+   smallest ``dtype`` that can accommodate the variable in question.
+   After that, the broadcasting rules apply, as described in the section
+   `Iterating over two ndarrays simultaneously:
+   broadcasting <#Iterating_over_two_ndarrays_simultaneously:_broadcasting>`__.
+
+Upcasting is resolved in place, wherever it is required. Notable
+examples can be found in
+`ndarray_operators.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray_operators.c>`__.
+
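+The rules listed above can be condensed into a small lookup function.
+This is only a sketch of the table: ``sketch_upcast`` and the ``SK_``
+constants are hypothetical names, and ``ulab`` itself resolves upcasting
+in place, inside the operators.
+
```c
#include <stdint.h>

/* dtype codes as used in the text: B, b, H, h, and f */
enum {
    SK_UINT8 = 'B',
    SK_INT8 = 'b',
    SK_UINT16 = 'H',
    SK_INT16 = 'h',
    SK_FLOAT = 'f'
};

/* Returns the dtype of the result of a binary operation. */
static uint8_t sketch_upcast(uint8_t ldtype, uint8_t rdtype) {
    /* rule 1: identical dtypes are preserved */
    if(ldtype == rdtype) return ldtype;
    /* rule 2: a float operand turns the result into a float */
    if((ldtype == SK_FLOAT) || (rdtype == SK_FLOAT)) return SK_FLOAT;

    /* rule 3: mixed integer dtypes */
    int has_uint16 = (ldtype == SK_UINT16) || (rdtype == SK_UINT16);
    int has_int16 = (ldtype == SK_INT16) || (rdtype == SK_INT16);
    if(has_uint16 && has_int16) return SK_FLOAT;  /* uint16 + int16 */
    if(has_uint16) return SK_UINT16;              /* uint8 + uint16, int8 + uint16 */
    return SK_INT16;                              /* uint8 + int8, uint8 + int16, int8 + int16 */
}
```
+
+The function is symmetric in its arguments, so ``sketch_upcast('B', 'h')``
+and ``sketch_upcast('h', 'B')`` both yield ``'h'``.
+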
+Slicing and indexing
+--------------------
+
+An ``ndarray`` can be indexed with three types of objects: integer
+scalars, slices, and another ``ndarray``, whose elements are either
+integer scalars, or Booleans. Since slice and integer indices can be
+thought of as modifications of the ``strides``, these indices return a
+view of the ``ndarray``. This statement does not hold for ``ndarray``
+indices, and therefore, they return a copy of the array.
+
+Extending ulab
+--------------
+
+The ``user`` module is disabled by default, as can be seen from the last
+couple of lines of
+`ulab.h <https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h>`__:
+
+.. code:: c
+
+   // user-defined module
+   #ifndef ULAB_USER_MODULE
+   #define ULAB_USER_MODULE                (0)
+   #endif
+
+The module contains a very simple function, ``user_dummy``, and this
+function is bound to the module itself. In other words, even if the
+module is enabled, one has to ``import`` it explicitly:
+
+.. code:: python
+
+
+   import ulab
+   from ulab import user
+
+   user.dummy_function(2.5)
+
+which should just return 5.0. Even if ``numpy``-compatibility is
+required (i.e., if most functions are bound at the top level to ``ulab``
+directly), having to ``import`` the module has a great advantage.
+Namely, only the
+`user.h <https://github.com/v923z/micropython-ulab/blob/master/code/user/user.h>`__
+and
+`user.c <https://github.com/v923z/micropython-ulab/blob/master/code/user/user.c>`__
+files have to be modified, thus it should be relatively straightforward
+to update your local copy from
+`github <https://github.com/v923z/micropython-ulab/blob/master/>`__.
+
+Now, let us see how we can add a more meaningful function.
+
+Creating a new ndarray
+----------------------
+
+In the `General comments <#General_comments>`__ section we have seen
+the type definition of an ``ndarray``. This structure can be generated
+by means of a couple of functions listed in
+`ndarray.c <https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c>`__.
+
+ndarray_new_ndarray
+~~~~~~~~~~~~~~~~~~~
+
+The ``ndarray_new_ndarray`` function is called by all other
+array-generating functions. It takes the number of dimensions, ``ndim``,
+a ``uint8_t``, the ``shape``, a pointer to ``size_t``, the ``strides``,
+a pointer to ``int32_t``, and ``dtype``, another ``uint8_t`` as its
+arguments, and returns a new array with all entries initialised to 0.
+
+Assuming that ``ULAB_MAX_DIMS > 2``, a new array of dimension 3,
+of ``shape`` (3, 4, 5), of ``strides`` (1000, 200, 10), and ``dtype``
+``uint16_t`` can be generated by the following instructions
+
+.. code:: c
+
+   size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+   shape[ULAB_MAX_DIMS - 1] = 5;
+   shape[ULAB_MAX_DIMS - 2] = 4;
+   shape[ULAB_MAX_DIMS - 3] = 3;
+
+   int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);
+   strides[ULAB_MAX_DIMS - 1] = 10;
+   strides[ULAB_MAX_DIMS - 2] = 200;
+   strides[ULAB_MAX_DIMS - 3] = 1000;
+
+   ndarray_obj_t *new_ndarray = ndarray_new_ndarray(3, shape, strides, NDARRAY_UINT16);
+
+ndarray_new_dense_ndarray
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This function simply calculates the ``strides`` from the ``shape``, and
+calls ``ndarray_new_ndarray``. Assuming that ``ULAB_MAX_DIMS > 2``, a
+new dense array of dimension 3, of ``shape`` (3, 4, 5), and ``dtype``
+``mp_float_t`` can be generated by the following instructions
+
+.. code:: c
+
+   size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+   shape[ULAB_MAX_DIMS - 1] = 5;
+   shape[ULAB_MAX_DIMS - 2] = 4;
+   shape[ULAB_MAX_DIMS - 3] = 3;
+
+   ndarray_obj_t *new_ndarray = ndarray_new_dense_ndarray(3, shape, NDARRAY_FLOAT);
+
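+How the ``strides`` of a dense array follow from its ``shape`` can be
+sketched as below. The code is an illustration under the assumption of
+right-aligned arrays of length ``SKETCH_MAX_DIMS`` and ``strides``
+measured in bytes; ``sketch_dense_strides`` is a hypothetical helper,
+and it is not claimed to be the way ``ndarray_new_dense_ndarray`` is
+implemented.
+
```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DIMS 4  /* stand-in for ULAB_MAX_DIMS (assumption) */

/* Fills `strides` (in bytes) for a dense, right-aligned array of `ndim`
   dimensions (ndim <= SKETCH_MAX_DIMS). Unused leading slots are set to 0. */
static void sketch_dense_strides(const size_t *shape, uint8_t ndim, uint8_t itemsize, int32_t *strides) {
    for(uint8_t i = 0; i < SKETCH_MAX_DIMS; i++) {
        strides[i] = 0;
    }
    /* the last axis is contiguous: its stride is the size of a single item */
    int32_t stride = (int32_t)itemsize;
    for(uint8_t i = SKETCH_MAX_DIMS; i > SKETCH_MAX_DIMS - ndim; i--) {
        strides[i-1] = stride;
        /* each axis spans the full extent of all axes to its right */
        stride *= (int32_t)shape[i-1];
    }
}
```
+
+For the ``shape`` (3, 4, 5) and an item size of 4 bytes, the resulting
+``strides`` are (80, 20, 4).
+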
+ndarray_new_linear_array
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the number of dimensions of a linear array is known (1), the
+``ndarray_new_linear_array`` function takes only the ``length``, a
+``size_t``, and the ``dtype``, an ``uint8_t``. Internally,
+``ndarray_new_linear_array`` generates the ``shape`` array, and calls
+``ndarray_new_dense_ndarray`` with ``ndim = 1``.
+
+A linear array of length 100, and ``dtype`` ``uint8`` could be created
+by the function call
+
+.. code:: c
+
+   ndarray_obj_t *new_ndarray = ndarray_new_linear_array(100, NDARRAY_UINT8);
+
+ndarray_new_ndarray_from_tuple
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This function takes a ``tuple``, which should hold the lengths of the
+axes (in other words, the ``shape``), and the ``dtype``, and internally
+calls ``ndarray_new_dense_ndarray``. A new ``ndarray`` can be
+generated by calling
+
+.. code:: c
+
+   ndarray_obj_t *new_ndarray = ndarray_new_ndarray_from_tuple(shape, NDARRAY_FLOAT);
+
+where ``shape`` is a tuple.
+
+ndarray_new_view
+~~~~~~~~~~~~~~~~
+
+This function creates a *view*, and takes the source, an ``ndarray``, the
+number of dimensions, an ``uint8_t``, the ``shape``, a pointer to
+``size_t``, the ``strides``, a pointer to ``int32_t``, and the offset,
+an ``int32_t`` as arguments. The offset is the number of bytes by which
+the void ``array`` pointer is shifted. E.g., the ``python`` statement
+
+.. code:: python
+
+   a = np.array([0, 1, 2, 3, 4, 5], dtype=np.uint8)
+   b = a[1::2]
+
+produces the array
+
+.. code:: python
+
+   array([1, 3, 5], dtype=uint8)
+
+which holds its data at position ``x0 + 1``, if ``a``\ ’s pointer is at
+``x0``. In this particular case, the offset is 1.
+
+The array ``b`` from the example above could be generated as
+
+.. code:: c
+
+   size_t *shape = m_new(size_t, ULAB_MAX_DIMS);
+   shape[ULAB_MAX_DIMS - 1] = 3;
+
+   int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);
+   strides[ULAB_MAX_DIMS - 1] = 2;
+
+   int32_t offset = 1;
+   uint8_t ndim = 1;
+
+   ndarray_obj_t *new_ndarray = ndarray_new_view(ndarray_a, ndim, shape, strides, offset);
+
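+The relation between a slice and the ``offset``, ``strides``, and
+``shape`` of the resulting view can be sketched without ``micropython``
+as follows. ``sketch_view_t``, ``sketch_slice``, and ``sketch_view_get``
+are hypothetical names, the source is assumed to be a one-dimensional
+``uint8`` array, and only positive steps are handled.
+
```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t offset;  /* bytes between the source pointer and the first item */
    int32_t stride;  /* bytes between two consecutive items of the view */
    size_t len;      /* number of items in the view */
} sketch_view_t;

/* Translates source[start::step] into view parameters. `n` is the length
   of the source, and `stride` is its stride in bytes. */
static sketch_view_t sketch_slice(size_t n, size_t start, size_t step, int32_t stride) {
    sketch_view_t view = {0, stride, 0};
    if((step == 0) || (start >= n)) {
        return view;  /* empty view */
    }
    view.offset = (int32_t)start * stride;
    view.stride = (int32_t)step * stride;
    view.len = (n - start + step - 1) / step;  /* rounded up */
    return view;
}

/* Reads the i-th item of the view; no data are copied. */
static uint8_t sketch_view_get(const uint8_t *source, const sketch_view_t *view, size_t i) {
    return source[view->offset + (int32_t)i * view->stride];
}
```
+
+For ``a[1::2]`` with six ``uint8`` items, the sketch yields an offset of
+1, a stride of 2, and a length of 3, in agreement with the values that
+are passed to ``ndarray_new_view`` above.
+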
+ndarray_copy_array
+~~~~~~~~~~~~~~~~~~
+
+The ``ndarray_copy_array`` function can be used for copying the contents
+of an array. Note that the target array has to be created beforehand.
+E.g., a one-to-one copy can be obtained by
+
+.. code:: c
+
+   ndarray_obj_t *new_ndarray = ndarray_new_ndarray(source->ndim, source->shape, source->strides, source->dtype);
+   ndarray_copy_array(source, new_ndarray);
+
+Note that the function cannot be used for forcing type conversion, i.e.,
+the input and output types must be identical, because the function
+simply calls the ``memcpy`` function. On the other hand, the input and
+output ``strides`` do not necessarily have to be equal.
+
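+Why the ``strides`` of the source and the target may differ becomes
+clear, if the copy is carried out item by item. The two-dimensional
+sketch below illustrates the idea only; ``sketch_copy_2d`` is a
+hypothetical function, and it makes no claim about the internals of
+``ndarray_copy_array``.
+
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies a two-dimensional array item by item. The source and the target
   share the shape and the item size, but each is walked with its own
   strides (in bytes). */
static void sketch_copy_2d(const uint8_t *source, const int32_t *sstrides,
        uint8_t *target, const int32_t *tstrides, const size_t *shape, size_t itemsize) {
    for(size_t i = 0; i < shape[0]; i++) {
        for(size_t j = 0; j < shape[1]; j++) {
            const uint8_t *s = source + (int32_t)i * sstrides[0] + (int32_t)j * sstrides[1];
            uint8_t *t = target + (int32_t)i * tstrides[0] + (int32_t)j * tstrides[1];
            /* the bytes are moved verbatim, hence no type conversion is possible */
            memcpy(t, s, itemsize);
        }
    }
}
```
+
+Copying every second column of a dense 3-by-4 ``int16`` array into a
+dense 3-by-2 target means reading with the ``strides`` (8, 4), and
+writing with the ``strides`` (4, 2).
+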
+ndarray_copy_view
+~~~~~~~~~~~~~~~~~
+
+The ``ndarray_obj_t *new_ndarray = ...`` instruction can be saved by
+calling the ``ndarray_copy_view`` function with the single ``source``
+argument.
+
+Accessing data in the ndarray
+-----------------------------
+
+Having seen how arrays can be generated and copied, it is time to look
+at how the data in an ``ndarray`` can be accessed and modified.
+
+For starters, let us suppose that the object in question comes from the
+user (i.e., via the ``micropython`` interface). First, we have to
+acquire a pointer to the ``ndarray`` by calling
+
+.. code:: c
+
+   ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(object_in);
+
+If it is not clear whether the object is an ``ndarray`` (e.g., if we
+want to write a function that can take ``ndarray``\ s, and other
+iterables as its argument), we find this out by evaluating
+
+.. code:: c
+
+   MP_OBJ_IS_TYPE(object_in, &ulab_ndarray_type)
+
+which evaluates to ``true``, if the object is an ``ndarray``. Once the
+pointer is at our disposal, we can get a pointer to the underlying
+numerical array as discussed earlier, i.e.,
+
+.. code:: c
+
+   uint8_t *array = (uint8_t *)ndarray->array;
+
+If you need to find out the ``dtype`` of the array, you can get it by
+accessing the ``dtype`` member of the ``ndarray``, i.e.,
+
+.. code:: c
+
+   ndarray->dtype
+
+should be equal to ``B``, ``b``, ``H``, ``h``, or ``f`` (``d``, if
+``mp_float_t`` is a double). The size of a single item is stored in the
+``itemsize`` member. This number should be equal to 1, if the ``dtype``
+is ``B``, or ``b``, 2, if the ``dtype`` is ``H``, or ``h``, 4, if the
+``dtype`` is ``f``, and 8 for ``d``.
+
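+The correspondence between the ``dtype`` codes and the item sizes can be
+summarised in a few lines. ``sketch_itemsize`` is a hypothetical helper
+for illustration; in ``ulab``, the value is simply read from the
+``itemsize`` member.
+
```c
#include <stddef.h>
#include <stdint.h>

/* Returns the size of a single item in bytes for the dtype codes listed
   in the text, and 0 for an unknown code. */
static size_t sketch_itemsize(uint8_t dtype) {
    switch(dtype) {
        case 'B':
        case 'b':
            return sizeof(uint8_t);
        case 'H':
        case 'h':
            return sizeof(uint16_t);
        case 'f':
            return sizeof(float);
        case 'd':
            return sizeof(double);
        default:
            return 0;
    }
}
```
+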
+Boilerplate
+-----------
+
+In the next section, we will construct a function that generates the
+element-wise square of a dense array, and raises a ``TypeError``
+exception otherwise. Dense arrays can easily be iterated over, since we
+do not have to care about the ``shape`` and the ``strides``. If the
+array is not dense, the section `Iterating over elements of a
+tensor <#Iterating-over-elements-of-a-tensor>`__ should contain hints as
+to how the iteration can be implemented.
+
+The function is listed under
+`user.c <https://github.com/v923z/micropython-ulab/tree/master/code/user/>`__.
+The ``user`` module is bound to ``ulab`` in
+`ulab.c <https://github.com/v923z/micropython-ulab/tree/master/code/ulab.c>`__
+in the lines
+
+.. code:: c
+
+       #if ULAB_USER_MODULE
+           { MP_ROM_QSTR(MP_QSTR_user), MP_ROM_PTR(&ulab_user_module) },
+       #endif
+
+which assumes that at the very end of
+`ulab.h <https://github.com/v923z/micropython-ulab/tree/master/code/ulab.h>`__
+the
+
+.. code:: c
+
+   // user-defined module
+   #ifndef ULAB_USER_MODULE
+   #define ULAB_USER_MODULE                (1)
+   #endif
+
+constant has been set to 1. After compilation, you can call a particular
+``user`` function in ``python`` by importing the module first, i.e.,
+
+.. code:: python
+
+   from ulab import numpy as np
+   from ulab import user
+
+   user.some_function(...)
+
+This separation of user-defined functions from the rest of the code
+ensures that the integrity of the main module and all its functions is
+always preserved. Even in case of a catastrophic failure, you can
+exclude the ``user`` module, and start over.
+
+And now the function:
+
+.. code:: c
+
+   static mp_obj_t user_square(mp_obj_t arg) {
+       // the function takes a single dense ndarray, and calculates the 
+       // element-wise square of its entries
+       
+       // raise a TypeError exception, if the input is not an ndarray
+       if(!MP_OBJ_IS_TYPE(arg, &ulab_ndarray_type)) {
+           mp_raise_TypeError(translate("input must be an ndarray"));
+       }
+       ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(arg);
+       
+       // make sure that the input is a dense array
+       if(!ndarray_is_dense(ndarray)) {
+           mp_raise_TypeError(translate("input must be a dense ndarray"));
+       }
+       
+       // if the input is a dense array, create `results` with the same number of 
+       // dimensions, shape, and dtype
+       ndarray_obj_t *results = ndarray_new_dense_ndarray(ndarray->ndim, ndarray->shape, ndarray->dtype);
+       
+       // since in a dense array the iteration over the elements is trivial, we 
+       // can cast the data arrays ndarray->array and results->array to the actual type
+       if(ndarray->dtype == NDARRAY_UINT8) {
+           uint8_t *array = (uint8_t *)ndarray->array;
+           uint8_t *rarray = (uint8_t *)results->array;
+           for(size_t i=0; i < ndarray->len; i++, array++) {
+               *rarray++ = (*array) * (*array);
+           }
+       } else if(ndarray->dtype == NDARRAY_INT8) {
+           int8_t *array = (int8_t *)ndarray->array;
+           int8_t *rarray = (int8_t *)results->array;
+           for(size_t i=0; i < ndarray->len; i++, array++) {
+               *rarray++ = (*array) * (*array);
+           }
+       } else if(ndarray->dtype == NDARRAY_UINT16) {
+           uint16_t *array = (uint16_t *)ndarray->array;
+           uint16_t *rarray = (uint16_t *)results->array;
+           for(size_t i=0; i < ndarray->len; i++, array++) {
+               *rarray++ = (*array) * (*array);
+           }
+       } else if(ndarray->dtype == NDARRAY_INT16) {
+           int16_t *array = (int16_t *)ndarray->array;
+           int16_t *rarray = (int16_t *)results->array;
+           for(size_t i=0; i < ndarray->len; i++, array++) {
+               *rarray++ = (*array) * (*array);
+           }
+       } else { // if we end up here, the dtype is NDARRAY_FLOAT
+           mp_float_t *array = (mp_float_t *)ndarray->array;
+           mp_float_t *rarray = (mp_float_t *)results->array;
+           for(size_t i=0; i < ndarray->len; i++, array++) {
+               *rarray++ = (*array) * (*array);
+           }        
+       }
+       // at the end, return a micropython object
+       return MP_OBJ_FROM_PTR(results);
+   }
+
+To summarise, the steps for *implementing* a function are
+
+1. If necessary, inspect the type of the input object, which is always an
+   ``mp_obj_t`` object
+2. If the input is an ``ndarray_obj_t``, acquire a pointer to it by
+   calling ``ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(arg);``
+3. Create a new array, or modify the existing one; get a pointer to the
+   data by calling ``uint8_t *array = (uint8_t *)ndarray->array;``, or
+   something equivalent
+4. Once the new data have been calculated, return a ``micropython``
+   object by calling ``MP_OBJ_FROM_PTR(...)``.
+
+The listing above contains the implementation of the function, but as
+such, it cannot be called from ``python``: it still has to be bound to
+the name space. We do this by first defining a function object:
+
+.. code:: c
+
+   MP_DEFINE_CONST_FUN_OBJ_1(user_square_obj, user_square);
+
+``micropython`` defines a number of ``MP_DEFINE_CONST_FUN_OBJ_N`` macros
+in
+`obj.h <https://github.com/micropython/micropython/blob/master/py/obj.h>`__.
+``N`` is always the number of arguments the function takes. We had a
+function definition ``static mp_obj_t user_square(mp_obj_t arg)``, i.e.,
+we dealt with a single argument.
+
+Finally, we have to bind this function object in the globals table of
+the ``user`` module:
+
+.. code:: c
+
+   STATIC const mp_rom_map_elem_t ulab_user_globals_table[] = {
+       { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_user) },
+       { MP_OBJ_NEW_QSTR(MP_QSTR_square), (mp_obj_t)&user_square_obj },
+   };
+
+Thus, the three steps required for the definition of a user-defined
+function are
+
+1. The low-level implementation of the function itself
+2. The definition of a function object by calling
+   ``MP_DEFINE_CONST_FUN_OBJ_N()``
+3. Binding this function object to the namespace in the
+   ``ulab_user_globals_table[]``
--- a/docs/manual/source/ulab.rst
+++ b/docs/manual/source/ulab.rst
--- a/docs/numpy-fft.ipynb
+++ b/docs/numpy-fft.ipynb
@ -0,0 +1,514 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-01T09:27:13.438054Z",
+     "start_time": "2020-05-01T09:27:13.191491Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-08-03T18:32:45.342280Z",
+     "start_time": "2020-08-03T18:32:45.338442Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-07-23T20:31:25.296014Z",
+     "start_time": "2020-07-23T20:31:25.265937Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Fourier transforms\n",
+    "\n",
+    "Functions related to Fourier transforms can be called by prepending them with `numpy.fft.`. The module defines the following two functions:\n",
+    "\n",
+    "1. [numpy.fft.fft](#fft)\n",
+    "1. [numpy.fft.ifft](#ifft)\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.fft.ifft.html\n",
+    "\n",
+    "## fft\n",
+    "\n",
+    "Since `ulab`'s `ndarray` does not support complex numbers, the invocation of the Fourier transform differs from that in `numpy`. In `numpy`, you can simply pass an array or iterable to the function, and it will be treated as a complex array:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 341,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-17T17:33:38.487729Z",
+     "start_time": "2019-10-17T17:33:38.473515Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([20.+0.j,  0.+0.j, -4.+4.j,  0.+0.j, -4.+0.j,  0.+0.j, -4.-4.j,\n",
+       "        0.+0.j])"
+      ]
+     },
+     "execution_count": 341,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "fft.fft([1, 2, 3, 4, 1, 2, 3, 4])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**WARNING:** The array returned is also complex, i.e., the real and imaginary components are cast together. In `ulab`, the real and imaginary parts are treated separately: you have to pass two `ndarray`s to the function, although the second argument is optional, in which case the imaginary part is assumed to be zero.\n",
+    "\n",
+    "**WARNING:** The function, as opposed to `numpy`, returns a 2-tuple, whose elements are two `ndarray`s, holding the real and imaginary parts of the transform separately. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 114,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-02-16T18:38:07.294862Z",
+     "start_time": "2020-02-16T18:38:07.233842Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "real part:\t array([5119.996, -5.004663, -5.004798, ..., -5.005482, -5.005643, -5.006577], dtype=float)\r\n",
+      "\r\n",
+      "imaginary part:\t array([0.0, 1631.333, 815.659, ..., -543.764, -815.6588, -1631.333], dtype=float)\r\n",
+      "\r\n",
+      "real part:\t array([5119.996, -5.004663, -5.004798, ..., -5.005482, -5.005643, -5.006577], dtype=float)\r\n",
+      "\r\n",
+      "imaginary part:\t array([0.0, 1631.333, 815.659, ..., -543.764, -815.6588, -1631.333], dtype=float)\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "x = np.linspace(0, 10, num=1024)\n",
+    "y = np.sin(x)\n",
+    "z = np.zeros(len(x))\n",
+    "\n",
+    "a, b = np.fft.fft(x)\n",
+    "print('real part:\\t', a)\n",
+    "print('\\nimaginary part:\\t', b)\n",
+    "\n",
+    "c, d = np.fft.fft(x, z)\n",
+    "print('\\nreal part:\\t', c)\n",
+    "print('\\nimaginary part:\\t', d)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ifft\n",
+    "\n",
+    "The above-mentioned rules apply to the inverse Fourier transform. The inverse is also normalised by `N`, the number of elements, as is customary in `numpy`. With the normalisation, we can ascertain that the inverse of the transform is equal to the original array."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 459,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-19T13:08:17.647416Z",
+     "start_time": "2019-10-19T13:08:17.597456Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "original vector:\t array([0.0, 0.009775016, 0.0195491, ..., -0.5275068, -0.5357859, -0.5440139], dtype=float)\n",
+      "\n",
+      "real part of inverse:\t array([-2.980232e-08, 0.0097754, 0.0195494, ..., -0.5275064, -0.5357857, -0.5440133], dtype=float)\n",
+      "\n",
+      "imaginary part of inverse:\t array([-2.980232e-08, -1.451171e-07, 3.693752e-08, ..., 6.44871e-08, 9.34986e-08, 2.18336e-07], dtype=float)\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "x = np.linspace(0, 10, num=1024)\n",
+    "y = np.sin(x)\n",
+    "\n",
+    "a, b = np.fft.fft(y)\n",
+    "\n",
+    "print('original vector:\\t', y)\n",
+    "\n",
+    "c, d = np.fft.ifft(a, b)\n",
+    "# the real part should be equal to y\n",
+    "print('\\nreal part of inverse:\\t', c)\n",
+    "# the imaginary part should be equal to zero\n",
+    "print('\\nimaginary part of inverse:\\t', d)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that unlike in `numpy`, the length of the array on which the Fourier transform is carried out must be a power of 2. If this is not the case, the function raises a `ValueError` exception."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Computation and storage costs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### RAM\n",
+    "\n",
+    "The FFT routine of `ulab` calculates the transform in place. This means that beyond reserving space for the two `ndarray`s that will be returned (the computation uses these two as intermediate storage space), only a handful of temporary variables, all floats or 32-bit integers, are required. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Speed of FFTs\n",
+    "\n",
+    "For comparison, on the pyboard v.1.1 a 1024-point transform takes around 90 ms when implemented in python, and around 13 ms in assembly. Moving to the D series gains a further factor of four: \n",
+    "https://github.com/peterhinch/micropython-fourier/blob/master/README.md#8-performance. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 494,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-19T13:25:40.540913Z",
+     "start_time": "2019-10-19T13:25:40.509598Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "execution time:  1985  us\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "# timeit is assumed to have been defined on the board in an earlier cell\n",
+    "\n",
+    "x = np.linspace(0, 10, num=1024)\n",
+    "y = np.sin(x)\n",
+    "\n",
+    "@timeit\n",
+    "def np_fft(y):\n",
+    "    return np.fft.fft(y)\n",
+    "\n",
+    "a, b = np_fft(y)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The C implementation runs in less than 2 ms on the pyboard (we have just measured that), and has been reported to run in under 0.8 ms on the D series board. Compared to the assembly figures quoted above, that is an improvement of at least a factor of four. "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/numpy-functions.ipynb
+++ b/docs/numpy-functions.ipynb
--- a/docs/numpy-linalg.ipynb
+++ b/docs/numpy-linalg.ipynb
@ -0,0 +1,896 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T06:16:40.844266Z",
+     "start_time": "2021-01-13T06:16:39.992092Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T06:16:40.857076Z",
+     "start_time": "2021-01-13T06:16:40.852721Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T06:16:40.947944Z",
+     "start_time": "2021-01-13T06:16:40.865720Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except OSError:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Linalg\n",
+    "\n",
+    "Functions in the `linalg` module can be called by prefixing them with `numpy.linalg.`. The module defines the following seven functions:\n",
+    "\n",
+    "1. [numpy.linalg.cholesky](#cholesky)\n",
+    "1. [numpy.linalg.det](#det)\n",
+    "1. [numpy.linalg.dot](#dot)\n",
+    "1. [numpy.linalg.eig](#eig)\n",
+    "1. [numpy.linalg.inv](#inv)\n",
+    "1. [numpy.linalg.norm](#norm)\n",
+    "1. [numpy.linalg.trace](#trace)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## cholesky\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.linalg.cholesky.html\n",
+    "\n",
+    "The `cholesky` function takes a positive definite, symmetric square matrix as its single argument, and returns its *square root matrix*, i.e., the lower triangular matrix `L`, whose product with its own transpose is equal to the input. If the input argument does not fulfill the positivity or symmetry condition, a `ValueError` is raised."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-03-10T19:25:21.754166Z",
+     "start_time": "2020-03-10T19:25:21.740726Z"
+    },
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:  array([[25.0, 15.0, -5.0],\n",
+      "\t [15.0, 18.0, 0.0],\n",
+      "\t [-5.0, 0.0, 11.0]], dtype=float)\n",
+      "\n",
+      "====================\n",
+      "Cholesky decomposition\n",
+      " array([[5.0, 0.0, 0.0],\n",
+      "\t [3.0, 3.0, 0.0],\n",
+      "\t [-1.0, 1.0, 3.0]], dtype=float)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([[25, 15, -5], [15, 18,  0], [-5,  0, 11]])\n",
+    "print('a: ', a)\n",
+    "print('\\n' + '='*20 + '\\nCholesky decomposition\\n', np.linalg.cholesky(a))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## det\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.det.html\n",
+    "\n",
+    "The `det` function takes a square matrix as its single argument, and calculates the determinant. The calculation is based on successive elimination of the matrix elements, and the return value is a float, even if the input array was of integer type."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 495,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-19T13:27:24.246995Z",
+     "start_time": "2019-10-19T13:27:24.228698Z"
+    },
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "-2.0\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([[1, 2], [3, 4]], dtype=np.uint8)\n",
+    "print(np.linalg.det(a))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Benchmark\n",
+    "\n",
+    "Since the routine for calculating the determinant is pretty much the same as for finding the [inverse of a matrix](#inv), the execution times are similar:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 557,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-20T07:14:59.778987Z",
+     "start_time": "2019-10-20T07:14:59.740021Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "execution time:  294  us\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "@timeit\n",
+    "def matrix_det(m):\n",
+    "    return np.linalg.det(m)\n",
+    "\n",
+    "m = np.array([[1, 2, 3, 4, 5, 6, 7, 8], [0, 5, 6, 4, 5, 6, 4, 5], \n",
+    "              [0, 0, 9, 7, 8, 9, 7, 8], [0, 0, 0, 10, 11, 12, 11, 12], \n",
+    "             [0, 0, 0, 0, 4, 6, 7, 8], [0, 0, 0, 0, 0, 5, 6, 7], \n",
+    "             [0, 0, 0, 0, 0, 0, 7, 6], [0, 0, 0, 0, 0, 0, 0, 2]])\n",
+    "\n",
+    "matrix_det(m)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## dot\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html\n",
+    "\n",
+    "\n",
+    "**WARNING:** numpy applies upcasting rules for the multiplication of matrices, while `ulab` simply returns a float matrix. \n",
+    "\n",
+    "Once you can invert a matrix, you might want to check whether the inversion is correct. You can take the original matrix and its inverse, and multiply them by calling the `dot` function, which takes the two matrices as its arguments. If the matrix dimensions do not match, the function raises a `ValueError`. The product is expected to be the unit matrix, up to rounding errors, as demonstrated below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 556,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-20T07:13:30.102776Z",
+     "start_time": "2019-10-20T07:13:30.073704Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "m:\n",
+      " array([[1, 2, 3],\n",
+      "\t [4, 5, 6],\n",
+      "\t [7, 10, 9]], dtype=uint8)\n",
+      "\n",
+      "m^-1:\n",
+      " array([[-1.25, 1.0, -0.25],\n",
+      "\t [0.5, -1.0, 0.5],\n",
+      "\t [0.4166667, 0.3333334, -0.25]], dtype=float)\n",
+      "\n",
+      "m*m^-1:\n",
+      " array([[1.0, 2.384186e-07, -1.490116e-07],\n",
+      "\t [-2.980232e-07, 1.000001, -4.172325e-07],\n",
+      "\t [-3.278255e-07, 1.311302e-06, 0.9999992]], dtype=float)\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "m = np.array([[1, 2, 3], [4, 5, 6], [7, 10, 9]], dtype=np.uint8)\n",
+    "n = np.linalg.inv(m)\n",
+    "print(\"m:\\n\", m)\n",
+    "print(\"\\nm^-1:\\n\", n)\n",
+    "# this should be the unit matrix\n",
+    "print(\"\\nm*m^-1:\\n\", np.linalg.dot(m, n))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that matrix multiplication does not require square matrices. It is enough if the dimensions are compatible, i.e., the left-hand-side matrix has as many columns as the right-hand-side matrix has rows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-10T17:33:17.921324Z",
+     "start_time": "2019-10-10T17:33:17.900587Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "array([[1, 2, 3, 4],\n",
+      "\t [5, 6, 7, 8]], dtype=uint8)\n",
+      "array([[1, 2],\n",
+      "\t [3, 4],\n",
+      "\t [5, 6],\n",
+      "\t [7, 8]], dtype=uint8)\n",
+      "array([[50.0, 60.0],\n",
+      "\t [114.0, 140.0]], dtype=float)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "m = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.uint8)\n",
+    "n = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=np.uint8)\n",
+    "print(m)\n",
+    "print(n)\n",
+    "print(np.linalg.dot(m, n))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## eig\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eig.html\n",
+    "\n",
+    "The `eig` function calculates the eigenvalues and the eigenvectors of a real, symmetric square matrix. If the matrix is not symmetric, a `ValueError` will be raised. The function takes a single argument, and returns a tuple with the eigenvalues, and eigenvectors. With the help of the eigenvectors, amongst other things, you can implement sophisticated stabilisation routines for robots."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-03T20:25:26.952290Z",
+     "start_time": "2020-11-03T20:25:26.930184Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "eigenvectors of a:\n",
+      " array([[0.8151560042509081, -0.4499411232970823, -0.1644660242574522, 0.3256141906686505],\n",
+      "       [0.2211334179893007, 0.7846992598235538, 0.08372081379922657, 0.5730077734355189],\n",
+      "       [-0.1340114162071679, -0.3100776411558949, 0.8742786816656, 0.3486109343758527],\n",
+      "       [-0.5183258053659028, -0.292663481927148, -0.4489749870391468, 0.6664142156731531]], dtype=float)\n",
+      "\n",
+      "eigenvalues of a:\n",
+      " array([-1.165288365404889, 0.8029365530314914, 5.585625756072663, 13.77672605630074], dtype=float)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([[1, 2, 1, 4], [2, 5, 3, 5], [1, 3, 6, 1], [4, 5, 1, 7]], dtype=np.uint8)\n",
+    "x, y = np.linalg.eig(a)\n",
+    "print('eigenvectors of a:\\n', y)\n",
+    "print('\\neigenvalues of a:\\n', x)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The same matrix diagonalised with `numpy` yields:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-03T20:13:27.236159Z",
+     "start_time": "2020-11-03T20:13:27.069967Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "eigenvectors of a:\n",
+      " [[ 0.32561419  0.815156    0.44994112 -0.16446602]\n",
+      " [ 0.57300777  0.22113342 -0.78469926  0.08372081]\n",
+      " [ 0.34861093 -0.13401142  0.31007764  0.87427868]\n",
+      " [ 0.66641421 -0.51832581  0.29266348 -0.44897499]]\n",
+      "\n",
+      "eigenvalues of a:\n",
+      " [13.77672606 -1.16528837  0.80293655  5.58562576]\n"
+     ]
+    }
+   ],
+   "source": [
+    "a = array([[1, 2, 1, 4], [2, 5, 3, 5], [1, 3, 6, 1], [4, 5, 1, 7]], dtype=np.uint8)\n",
+    "x, y = eig(a)\n",
+    "print('eigenvectors of a:\\n', y)\n",
+    "print('\\neigenvalues of a:\\n', x)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When comparing results, we should keep two things in mind: \n",
+    "\n",
+    "1. the eigenvalues and eigenvectors are not necessarily sorted in the same way\n",
+    "2. an eigenvector can be multiplied by an arbitrary non-zero scalar, and it is still an eigenvector with the same eigenvalue. This is why all signs of the eigenvector belonging to 0.80 are flipped in `ulab` with respect to `numpy`. This difference, however, is of absolutely no consequence. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Computation expenses\n",
+    "\n",
+    "Since the function is based on [Givens rotations](https://en.wikipedia.org/wiki/Givens_rotation) and runs till convergence is achieved, or till the maximum number of allowed rotations is exhausted, there is no universal estimate for the time required to find the eigenvalues. However, an order of magnitude can, at least, be guessed based on the measurement below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 559,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-20T07:18:52.520515Z",
+     "start_time": "2019-10-20T07:18:52.499653Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "execution time:  111  us\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "@timeit\n",
+    "def matrix_eig(a):\n",
+    "    return np.linalg.eig(a)\n",
+    "\n",
+    "a = np.array([[1, 2, 1, 4], [2, 5, 3, 5], [1, 3, 6, 1], [4, 5, 1, 7]], dtype=np.uint8)\n",
+    "\n",
+    "matrix_eig(a)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## inv\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.linalg.inv.html\n",
+    "\n",
+    "A square matrix, provided that it is not singular, can be inverted by calling the `inv` function, which takes a single argument. The inversion is based on successive elimination of elements in the lower left triangle, and raises a `ValueError` exception if the matrix turns out to be singular (i.e., one of the diagonal entries becomes zero in the course of the elimination)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T06:17:13.053816Z",
+     "start_time": "2021-01-13T06:17:13.038403Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "array([[-2.166666666666667, 1.500000000000001, -0.8333333333333337, 1.0],\n",
+      "       [1.666666666666667, -3.333333333333335, 1.666666666666668, -0.0],\n",
+      "       [0.1666666666666666, 2.166666666666668, -0.8333333333333337, -1.0],\n",
+      "       [-0.1666666666666667, -0.3333333333333333, 0.0, 0.5]], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "m = np.array([[1, 2, 3, 4], [4, 5, 6, 4], [7, 8.6, 9, 4], [3, 4, 5, 6]])\n",
+    "\n",
+    "print(np.linalg.inv(m))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Computation expenses\n",
+    "\n",
+    "Note that inverting an `N`-by-`N` matrix requires RAM for approximately twice as many floats as the number of entries in the original matrix, while the number of operations in the elimination grows as the third power of `N`. Here are a couple of numbers: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 552,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-20T07:10:39.190734Z",
+     "start_time": "2019-10-20T07:10:39.138872Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2 by 2 matrix:\n",
+      "execution time:  65  us\n",
+      "\n",
+      "4 by 4 matrix:\n",
+      "execution time:  105  us\n",
+      "\n",
+      "8 by 8 matrix:\n",
+      "execution time:  299  us\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "@timeit\n",
+    "def invert_matrix(m):\n",
+    "    return np.linalg.inv(m)\n",
+    "\n",
+    "m = np.array([[1, 2,], [4, 5]])\n",
+    "print('2 by 2 matrix:')\n",
+    "invert_matrix(m)\n",
+    "\n",
+    "m = np.array([[1, 2, 3, 4], [4, 5, 6, 4], [7, 8.6, 9, 4], [3, 4, 5, 6]])\n",
+    "print('\\n4 by 4 matrix:')\n",
+    "invert_matrix(m)\n",
+    "\n",
+    "m = np.array([[1, 2, 3, 4, 5, 6, 7, 8], [0, 5, 6, 4, 5, 6, 4, 5], \n",
+    "              [0, 0, 9, 7, 8, 9, 7, 8], [0, 0, 0, 10, 11, 12, 11, 12], \n",
+    "             [0, 0, 0, 0, 4, 6, 7, 8], [0, 0, 0, 0, 0, 5, 6, 7], \n",
+    "             [0, 0, 0, 0, 0, 0, 7, 6], [0, 0, 0, 0, 0, 0, 0, 2]])\n",
+    "print('\\n8 by 8 matrix:')\n",
+    "invert_matrix(m)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The above-mentioned scaling is not obeyed strictly. The reason for the discrepancy is that the fixed overhead of the function call is the same in all three cases: the input must be inspected, the output array must be created, and so on. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## norm\n",
+    "\n",
+    "`numpy`: https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html\n",
+    "\n",
+    "The function takes a vector or matrix without options, and returns the square root of the sum of the squares of the elements, i.e., the 2-norm of a vector, and the Frobenius norm of a matrix."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-07-23T20:41:10.341349Z",
+     "start_time": "2020-07-23T20:41:10.327624Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "norm of a: 7.416198487095663\n",
+      "norm of b: 16.88194301613414\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([1, 2, 3, 4, 5])\n",
+    "b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n",
+    "\n",
+    "print('norm of a:', np.linalg.norm(a))\n",
+    "print('norm of b:', np.linalg.norm(b))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## trace\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.linalg.trace.html\n",
+    "\n",
+    "The `trace` function returns the sum of the diagonal elements of a square matrix. If the input argument is not a square matrix, an exception will be raised.\n",
+    "\n",
+    "The scalar so returned will inherit the type of the input array, i.e., integer arrays have integer trace, and floating point arrays a floating point trace."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:  array([[25, 15, -5],\n",
+      "\t [15, 18, 0],\n",
+      "\t [-5, 0, 11]], dtype=int8)\n",
+      "\n",
+      "trace of a:  54\n",
+      "====================\n",
+      "b:  array([[25.0, 15.0, -5.0],\n",
+      "\t [15.0, 18.0, 0.0],\n",
+      "\t [-5.0, 0.0, 11.0]], dtype=float)\n",
+      "\n",
+      "trace of b:  54.0\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([[25, 15, -5], [15, 18,  0], [-5,  0, 11]], dtype=np.int8)\n",
+    "print('a: ', a)\n",
+    "print('\\ntrace of a: ', np.linalg.trace(a))\n",
+    "\n",
+    "b = np.array([[25, 15, -5], [15, 18,  0], [-5,  0, 11]], dtype=np.float)\n",
+    "\n",
+    "print('='*20 + '\\nb: ', b)\n",
+    "print('\\ntrace of b: ', np.linalg.trace(b))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/numpy-universal.ipynb
+++ b/docs/numpy-universal.ipynb
@ -0,0 +1,779 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T18:54:58.722373Z",
+     "start_time": "2021-01-13T18:54:57.178438Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T18:55:01.909310Z",
+     "start_time": "2021-01-13T18:55:01.903634Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T18:55:02.434518Z",
+     "start_time": "2021-01-13T18:55:02.382296Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            out, err = proc.communicate() # drain both pipes together, so a full stderr buffer cannot block the child\n",
+    "            print(out.decode(\"utf-8\") + '\\n' + err.decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Universal functions\n",
+    "\n",
+    "Standard mathematical functions can be calculated on any scalar, scalar-valued iterable (ranges, lists, tuples containing numbers), and on `ndarray`s without having to change the call signature. In all cases, the functions return a new `ndarray` of typecode `float` (these functions usually produce float values anyway). The functions execute faster with `ndarray` arguments than with iterables, because the values of the input vector can be extracted faster. \n",
+    "\n",
+    "At present, the following functions are supported:\n",
+    "\n",
+    "`acos`, `acosh`, `arctan2`, `around`, `asin`, `asinh`, `atan`, `atanh`, `ceil`, `cos`, `degrees`, `exp`, `expm1`, `floor`, `log`, `log10`, `log2`, `radians`, `sin`, `sinh`, `sqrt`, `tan`, `tanh`.\n",
+    "\n",
+    "These functions are applied element-wise to their arguments. Consequently, the matrix exponential, for example, cannot be calculated in this way: `exp` applied to a matrix returns the exponential of each entry."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T19:11:07.579601Z",
+     "start_time": "2021-01-13T19:11:07.554672Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:\t range(0, 9)\n",
+      "exp(a):\t array([1.0, 2.718281828459045, 7.38905609893065, 20.08553692318767, 54.59815003314424, 148.4131591025766, 403.4287934927351, 1096.633158428459, 2980.957987041728], dtype=float64)\n",
+      "\n",
+      "=============\n",
+      "b:\n",
+      " array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=float64)\n",
+      "exp(b):\n",
+      " array([1.0, 2.718281828459045, 7.38905609893065, 20.08553692318767, 54.59815003314424, 148.4131591025766, 403.4287934927351, 1096.633158428459, 2980.957987041728], dtype=float64)\n",
+      "\n",
+      "=============\n",
+      "c:\n",
+      " array([[0.0, 1.0, 2.0],\n",
+      "       [3.0, 4.0, 5.0],\n",
+      "       [6.0, 7.0, 8.0]], dtype=float64)\n",
+      "exp(c):\n",
+      " array([[1.0, 2.718281828459045, 7.38905609893065],\n",
+      "       [20.08553692318767, 54.59815003314424, 148.4131591025766],\n",
+      "       [403.4287934927351, 1096.633158428459, 2980.957987041728]], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = range(9)\n",
+    "b = np.array(a)\n",
+    "\n",
+    "# works with ranges, lists, tuples etc.\n",
+    "print('a:\\t', a)\n",
+    "print('exp(a):\\t', np.exp(a))\n",
+    "\n",
+    "# with 1D arrays\n",
+    "print('\\n=============\\nb:\\n', b)\n",
+    "print('exp(b):\\n', np.exp(b))\n",
+    "\n",
+    "# as well as with matrices\n",
+    "c = np.array(range(9)).reshape((3, 3))\n",
+    "print('\\n=============\\nc:\\n', c)\n",
+    "print('exp(c):\\n', np.exp(c))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Computation expenses\n",
+    "\n",
+    "The overhead of calculating with `micropython` iterables is significant: for the 1000 samples below, the difference is more than 800 microseconds (1266 us versus 441 us), because internally the function has to create the `ndarray` for the output, has to fetch the iterable's items of unknown type, and then has to convert them to floats. All these steps are skipped for `ndarray`s, because these pieces of information are already known. \n",
+    "\n",
+    "Doing the same with a `list` comprehension takes roughly 26 times as long as with the `ndarray` (11379 us versus 441 us), and the gap would widen further if we converted the resulting list to an `ndarray`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:45.696282Z",
+     "start_time": "2020-05-07T07:35:45.629909Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "iterating over ndarray in ulab\r\n",
+      "execution time:  441  us\r\n",
+      "\r\n",
+      "iterating over list in ulab\r\n",
+      "execution time:  1266  us\r\n",
+      "\r\n",
+      "iterating over list in python\r\n",
+      "execution time:  11379  us\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "import math\n",
+    "\n",
+    "a = [0]*1000\n",
+    "b = np.array(a)\n",
+    "\n",
+    "@timeit\n",
+    "def timed_vector(iterable):\n",
+    "    return np.exp(iterable)\n",
+    "\n",
+    "@timeit\n",
+    "def timed_list(iterable):\n",
+    "    return [math.exp(i) for i in iterable]\n",
+    "\n",
+    "print('iterating over ndarray in ulab')\n",
+    "timed_vector(b)\n",
+    "\n",
+    "print('\\niterating over list in ulab')\n",
+    "timed_vector(a)\n",
+    "\n",
+    "print('\\niterating over list in python')\n",
+    "timed_list(a)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## arctan2\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.arctan2.html\n",
+    "\n",
+    "The two-argument inverse tangent function is also part of the `vector` sub-module. The function implements broadcasting as discussed in the section on `ndarray`s. Scalars (`micropython` integers or floats) are also allowed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T19:15:08.215912Z",
+     "start_time": "2021-01-13T19:15:08.189806Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:\n",
+      " array([1.0, 2.2, 33.33, 444.444], dtype=float64)\n",
+      "\n",
+      "arctan2(a, 1.0)\n",
+      " array([0.7853981633974483, 1.14416883366802, 1.5408023243361, 1.568546328341769], dtype=float64)\n",
+      "\n",
+      "arctan2(1.0, a)\n",
+      " array([0.7853981633974483, 0.426627493126876, 0.02999400245879636, 0.002249998453127392], dtype=float64)\n",
+      "\n",
+      "arctan2(a, a): \n",
+      " array([0.7853981633974483, 0.7853981633974483, 0.7853981633974483, 0.7853981633974483], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([1, 2.2, 33.33, 444.444])\n",
+    "print('a:\\n', a)\n",
+    "print('\\narctan2(a, 1.0)\\n', np.arctan2(a, 1.0))\n",
+    "print('\\narctan2(1.0, a)\\n', np.arctan2(1.0, a))\n",
+    "print('\\narctan2(a, a): \\n', np.arctan2(a, a))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## around\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy-1.17.0/reference/generated/numpy.around.html\n",
+    "\n",
+    "`numpy`'s `around` function can also be found in the `vector` sub-module. The function implements the `decimals` keyword argument, with a default value of `0`. The first argument must be an `ndarray`; otherwise, the function raises a `TypeError` exception (note that `numpy` itself accepts general iterables). The `out` keyword argument known from `numpy` is not accepted. The function always returns an `ndarray` of type `mp_float_t`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T19:19:46.728823Z",
+     "start_time": "2021-01-13T19:19:46.703348Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:\t\t array([1.0, 2.2, 33.33, 444.444], dtype=float64)\n",
+      "\n",
+      "decimals = 0\t array([1.0, 2.0, 33.0, 444.0], dtype=float64)\n",
+      "\n",
+      "decimals = 1\t array([1.0, 2.2, 33.3, 444.4], dtype=float64)\n",
+      "\n",
+      "decimals = -1\t array([0.0, 0.0, 30.0, 440.0], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([1, 2.2, 33.33, 444.444])\n",
+    "print('a:\\t\\t', a)\n",
+    "print('\\ndecimals = 0\\t', np.around(a, decimals=0))\n",
+    "print('\\ndecimals = 1\\t', np.around(a, decimals=1))\n",
+    "print('\\ndecimals = -1\\t', np.around(a, decimals=-1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Vectorising generic python functions\n",
+    "\n",
+    "`numpy`: https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html\n",
+    "\n",
+    "The examples above use factory functions. In fact, they are nothing but the vectorised versions of the standard mathematical functions. User-defined `python` functions can also be vectorised with the help of `vectorize`. This function takes one positional argument, the `python` function that you want to vectorise, and an optional keyword argument, `otypes`, which determines the `dtype` of the output array. `otypes` must be `None` (default), or any of the `dtypes` defined in `ulab`. With `None`, the output is automatically turned into a float array. \n",
+    "\n",
+    "The return value of `vectorize` is a `micropython` object that can be called as a standard function, but which now accepts either a scalar, an `ndarray`, or a generic `micropython` iterable as its sole argument. Note that the function that is to be vectorised must take exactly one argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T19:16:55.709617Z",
+     "start_time": "2021-01-13T19:16:55.688222Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "f on a scalar:       array([1936.0], dtype=float64)\n",
+      "f on an ndarray:     array([1.0, 4.0, 9.0, 16.0], dtype=float64)\n",
+      "f on a list:         array([4.0, 9.0, 16.0], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def f(x):\n",
+    "    return x*x\n",
+    "\n",
+    "vf = np.vectorize(f)\n",
+    "\n",
+    "# calling with a scalar\n",
+    "print('{:20}'.format('f on a scalar: '), vf(44.0))\n",
+    "\n",
+    "# calling with an ndarray\n",
+    "a = np.array([1, 2, 3, 4])\n",
+    "print('{:20}'.format('f on an ndarray: '), vf(a))\n",
+    "\n",
+    "# calling with a list\n",
+    "print('{:20}'.format('f on a list: '), vf([2, 3, 4]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As mentioned, the `dtype` of the resulting `ndarray` can be specified via the `otypes` keyword. The value is bound to the function object that `vectorize` returns; therefore, if the same function is to be vectorised with different output types, a new function object must be created for each type."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T19:19:36.090837Z",
+     "start_time": "2021-01-13T19:19:36.069088Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "output is uint8:     array([1, 4, 9, 16], dtype=uint8)\n",
+      "output is float:     array([1.0, 4.0, 9.0, 16.0], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "l = [1, 2, 3, 4]\n",
+    "def f(x):\n",
+    "    return x*x\n",
+    "\n",
+    "vf1 = np.vectorize(f, otypes=np.uint8)\n",
+    "vf2 = np.vectorize(f, otypes=np.float)\n",
+    "\n",
+    "print('{:20}'.format('output is uint8: '), vf1(l))\n",
+    "print('{:20}'.format('output is float: '), vf2(l))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `otypes` keyword argument cannot be used for type coercion: if the function evaluates to a float, but `otypes` dictates an integer type, a `TypeError` exception is raised:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-06T22:21:43.616220Z",
+     "start_time": "2020-05-06T22:21:43.601280Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "integer list:        array([1, 4, 9, 16], dtype=uint8)\n",
+      "\n",
+      "Traceback (most recent call last):\n",
+      "  File \"/dev/shm/micropython.py\", line 14, in <module>\n",
+      "TypeError: can't convert float to int\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "int_list = [1, 2, 3, 4]\n",
+    "float_list = [1.0, 2.0, 3.0, 4.0]\n",
+    "def f(x):\n",
+    "    return x*x\n",
+    "\n",
+    "vf = np.vectorize(f, otypes=np.uint8)\n",
+    "\n",
+    "print('{:20}'.format('integer list: '), vf(int_list))\n",
+    "# this will raise a TypeError exception\n",
+    "print(vf(float_list))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Benchmarks\n",
+    "\n",
+    "It should be pointed out that the `vectorize` function produces the pseudo-vectorised version of the `python` function that is fed into it, i.e., on the C level, the same `python` function is called, with the all-encompassing `mp_obj_t` type arguments, and all that happens is that the `for` loop in `[f(i) for i in iterable]` runs purely in C. Since type checking and type conversion in `f()` are expensive, the speed-up is not as spectacular as when iterating over an `ndarray` with a factory function: in the measurement below, the vectorised function needs approximately 30% less time than a `list` comprehension that returns a native `python` type (e.g., `list`), and approximately 40% less time (a speed-up factor of about 1.7), if conversion to an `ndarray` is also counted.\n",
+    "\n",
+    "The following code snippet calculates the squares of 1000 numbers with the vectorised function (which returns an `ndarray`), with `list` comprehension, and with `list` comprehension followed by conversion to an `ndarray`. For comparison, the execution time is also measured for the case when the squares are calculated entirely in `ulab`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:32:20.048553Z",
+     "start_time": "2020-05-07T07:32:19.951851Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "vectorised function\r\n",
+      "execution time:  7237  us\r\n",
+      "\r\n",
+      "list comprehension\r\n",
+      "execution time:  10248  us\r\n",
+      "\r\n",
+      "list comprehension + ndarray conversion\r\n",
+      "execution time:  12562  us\r\n",
+      "\r\n",
+      "squaring an ndarray entirely in ulab\r\n",
+      "execution time:  560  us\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def f(x):\n",
+    "    return x*x\n",
+    "\n",
+    "vf = np.vectorize(f)\n",
+    "\n",
+    "@timeit\n",
+    "def timed_vectorised_square(iterable):\n",
+    "    return vf(iterable)\n",
+    "\n",
+    "@timeit\n",
+    "def timed_python_square(iterable):\n",
+    "    return [f(i) for i in iterable]\n",
+    "\n",
+    "@timeit\n",
+    "def timed_ndarray_square(iterable):\n",
+    "    return np.array([f(i) for i in iterable])\n",
+    "\n",
+    "@timeit\n",
+    "def timed_ulab_square(ndarray):\n",
+    "    return ndarray**2\n",
+    "\n",
+    "print('vectorised function')\n",
+    "squares = timed_vectorised_square(range(1000))\n",
+    "\n",
+    "print('\\nlist comprehension')\n",
+    "squares = timed_python_square(range(1000))\n",
+    "\n",
+    "print('\\nlist comprehension + ndarray conversion')\n",
+    "squares = timed_ndarray_square(range(1000))\n",
+    "\n",
+    "print('\\nsquaring an ndarray entirely in ulab')\n",
+    "a = np.array(range(1000))\n",
+    "squares = timed_ulab_square(a)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the comparisons above, it is clear that `python` functions should only be vectorised when the same effect cannot be achieved in `ulab` alone. However, although the time savings are modest, there is still a good reason for caring about vectorised functions. Namely, user-defined `python` functions become universal, i.e., they can accept generic iterables as well as `ndarray`s as their arguments. A vectorised function is still a one-liner, resulting in transparent and elegant code.\n",
+    "\n",
+    "A final comment on this subject: the `f(x)` that we defined is a *generic* `python` function. It does not have to be a pure number-crunching routine: it must return a number object, but it can still access the hardware in the meantime. So, e.g., \n",
+    "\n",
+    "```python\n",
+    "import pyb\n",
+    "led = pyb.LED(2)\n",
+    "\n",
+    "def f(x):\n",
+    "    if x < 100:\n",
+    "        led.toggle()\n",
+    "    return x*x\n",
+    "```\n",
+    "\n",
+    "is perfectly valid code."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/scipy-optimize.ipynb
+++ b/docs/scipy-optimize.ipynb
@ -0,0 +1,515 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:50:51.417613Z",
+     "start_time": "2021-01-08T12:50:51.208257Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:50:52.581876Z",
+     "start_time": "2021-01-08T12:50:52.567901Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:50:53.516712Z",
+     "start_time": "2021-01-08T12:50:53.454984Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            out, err = proc.communicate() # drain both pipes together, so a full stderr buffer cannot block the child\n",
+    "            print(out.decode(\"utf-8\") + '\\n' + err.decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "import ulab as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Optimize\n",
+    "\n",
+    "Functions in the `optimize` module can be called by prefixing them with `scipy.optimize.`. The module defines the following three functions:\n",
+    "\n",
+    "1. [scipy.optimize.bisect](#bisect)\n",
+    "1. [scipy.optimize.fmin](#fmin)\n",
+    "1. [scipy.optimize.newton](#newton)\n",
+    "\n",
+    "Note that routines that work with user-defined functions still have to call the underlying `python` code, so the gains in speed are not as significant as with other vectorised operations. As a rule of thumb, a speed-up of about a factor of two can be expected, when compared to an optimised `python` implementation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## bisect \n",
+    "\n",
+    "`scipy`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.bisect.html\n",
+    "\n",
+    "`bisect` finds the root of a function of one variable using a simple bisection routine. It takes three positional arguments: the function itself, and two starting points. The function must have opposite signs\n",
+    "at the starting points. The return value is the position of the root.\n",
+    "\n",
+    "Two keyword arguments, `xtol`, and `maxiter` can be supplied to control the accuracy, and the number of bisections, respectively."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:58:28.444300Z",
+     "start_time": "2021-01-08T12:58:28.421989Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.9999997615814209\n",
+      "only 8 bisections:  0.984375\n",
+      "with 0.1 accuracy:  0.9375\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "    \n",
+    "def f(x):\n",
+    "    return x*x - 1\n",
+    "\n",
+    "print(spy.optimize.bisect(f, 0, 4))\n",
+    "\n",
+    "print('only 8 bisections: ',  spy.optimize.bisect(f, 0, 4, maxiter=8))\n",
+    "\n",
+    "print('with 0.1 accuracy: ',  spy.optimize.bisect(f, 0, 4, xtol=0.1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Performance\n",
+    "\n",
+    "Since the `bisect` routine calls user-defined `python` functions, the speed gain is only about a factor of two compared to a pure `python` implementation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:08:24.750562Z",
+     "start_time": "2020-05-19T19:08:24.682959Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bisect running in python\r\n",
+      "execution time:  1270  us\r\n",
+      "bisect running in C\r\n",
+      "execution time:  642  us\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "def f(x):\n",
+    "    return (x-1)*(x-1) - 2.0\n",
+    "\n",
+    "def bisect(f, a, b, xtol=2.4e-7, maxiter=100):\n",
+    "    if f(a) * f(b) > 0:\n",
+    "        raise ValueError\n",
+    "\n",
+    "    rtb = a if f(a) < 0.0 else b\n",
+    "    dx = b - a if f(a) < 0.0 else a - b\n",
+    "    for i in range(maxiter):\n",
+    "        dx *= 0.5\n",
+    "        x_mid = rtb + dx\n",
+    "        mid_value = f(x_mid)\n",
+    "        if mid_value < 0:\n",
+    "            rtb = x_mid\n",
+    "        if abs(dx) < xtol:\n",
+    "            break\n",
+    "\n",
+    "    return rtb\n",
+    "\n",
+    "@timeit\n",
+    "def bisect_scipy(f, a, b):\n",
+    "    return spy.optimize.bisect(f, a, b)\n",
+    "\n",
+    "@timeit\n",
+    "def bisect_timed(f, a, b):\n",
+    "    return bisect(f, a, b)\n",
+    "\n",
+    "print('bisect running in python')\n",
+    "bisect_timed(f, 3, 2)\n",
+    "\n",
+    "print('bisect running in C')\n",
+    "bisect_scipy(f, 3, 2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## fmin\n",
+    "\n",
+    "`scipy`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin.html\n",
+    "\n",
+    "The `fmin` function finds the position of the minimum of a user-defined function by using the downhill simplex method. It requires two positional arguments, the function, and the initial value. Three keyword arguments, `xatol`, `fatol`, and `maxiter`, stipulate the conditions for stopping."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:00:26.729947Z",
+     "start_time": "2021-01-08T13:00:26.702748Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.9996093749999952\n",
+      "1.199999999999996\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "def f(x):\n",
+    "    return (x-1)**2 - 1\n",
+    "\n",
+    "print(spy.optimize.fmin(f, 3.0))\n",
+    "print(spy.optimize.fmin(f, 3.0, xatol=0.1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## newton\n",
+    "\n",
+    "`scipy`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.newton.html\n",
+    "\n",
+    "`newton` finds a zero of a real, user-defined function using the Newton-Raphson (or secant or Halley’s) method. The routine requires two positional arguments, the function, and the initial value. Three keyword\n",
+    "arguments can be supplied to control the iteration. These are the absolute and relative tolerances `tol`, and `rtol`, respectively, and the number of iterations before stopping, `maxiter`. The function returns a single scalar, the position of the root."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:56:35.139958Z",
+     "start_time": "2021-01-08T12:56:35.119712Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.260135727246117\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "    \n",
+    "def f(x):\n",
+    "    return x*x*x - 2.0\n",
+    "\n",
+    "print(spy.optimize.newton(f, 3., tol=0.001, rtol=0.01))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/scipy-signal.ipynb
+++ b/docs/scipy-signal.ipynb
@ -0,0 +1,482 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-12T16:11:12.111639Z",
+     "start_time": "2021-01-12T16:11:11.914041Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-12T16:11:13.416714Z",
+     "start_time": "2021-01-12T16:11:13.404067Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-12T16:11:13.920842Z",
+     "start_time": "2021-01-12T16:11:13.863737Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "import ulab as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Signal\n",
+    "\n",
+    "Functions in the `signal` module can be called by prefixing them with `scipy.signal.`. The module defines the following two functions:\n",
+    "\n",
+    "1. [scipy.signal.sosfilt](#sosfilt)\n",
+    "1. [scipy.signal.spectrogram](#spectrogram)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## sosfilt\n",
+    "\n",
+    "`scipy`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.sosfilt.html \n",
+    "\n",
+    "Filter data along one dimension using cascaded second-order sections.\n",
+    "\n",
+    "The function takes two positional arguments, `sos`, the filter segments of length 6, and the one-dimensional, uniformly sampled data set to be filtered. It returns the filtered data, or the filtered data and the final filter delays, if the `zi` keyword argument is supplied. The keyword argument must be a float `ndarray` of shape `(n_sections, 2)`. If `zi` is not passed to the function, the initial values are assumed to be 0."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-06-19T20:24:10.529668Z",
+     "start_time": "2020-06-19T20:24:10.520389Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "y:  array([0.0, 1.0, -4.0, 24.0, -104.0, 440.0, -1728.0, 6532.000000000001, -23848.0, 84864.0], dtype=float)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
+    "sos = [[1, 2, 3, 1, 5, 6], [1, 2, 3, 1, 5, 6]]\n",
+    "y = spy.signal.sosfilt(sos, x)\n",
+    "print('y: ', y)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-06-19T20:27:39.508508Z",
+     "start_time": "2020-06-19T20:27:39.498256Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "y:  array([4.0, -16.0, 63.00000000000001, -227.0, 802.9999999999999, -2751.0, 9271.000000000001, -30775.0, 101067.0, -328991.0000000001], dtype=float)\n",
+      "\n",
+      "========================================\n",
+      "zf:  array([[37242.0, 74835.0],\n",
+      "\t [1026187.0, 1936542.0]], dtype=float)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
+    "sos = [[1, 2, 3, 1, 5, 6], [1, 2, 3, 1, 5, 6]]\n",
+    "# initial conditions of the filter\n",
+    "zi = np.array([[1, 2], [3, 4]])\n",
+    "\n",
+    "y, zf = spy.signal.sosfilt(sos, x, zi=zi)\n",
+    "print('y: ', y)\n",
+    "print('\\n' + '='*40 + '\\nzf: ', zf)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## spectrogram\n",
+    "\n",
+    "In addition to the Fourier transform and its inverse, `ulab` also sports a function called `spectrogram`, which returns the absolute value of the Fourier transform. This could be used to find the dominant spectral component in a time series. The arguments are treated in the same way as in `fft`, and `ifft`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-12T16:12:06.573408Z",
+     "start_time": "2021-01-12T16:12:06.560558Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "original vector:\t array([0.0, 0.009775015390171337, 0.01954909674625918, ..., -0.5275140569487312, -0.5357931822978732, -0.5440211108893639], dtype=float64)\n",
+      "\n",
+      "spectrum:\t array([187.8635087634579, 315.3112063607119, 347.8814873399374, ..., 84.45888934298905, 347.8814873399374, 315.3112063607118], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "x = np.linspace(0, 10, num=1024)\n",
+    "y = np.sin(x)\n",
+    "\n",
+    "a = spy.signal.spectrogram(y)\n",
+    "\n",
+    "print('original vector:\\t', y)\n",
+    "print('\\nspectrum:\\t', a)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As such, `spectrogram` is really just a shorthand for `np.sqrt(a*a + b*b)`, where `a` and `b` are the real and imaginary parts returned by `np.fft.fft`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-12T16:13:36.726662Z",
+     "start_time": "2021-01-12T16:13:36.705036Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "spectrum calculated the hard way:\t array([187.8635087634579, 315.3112063607119, 347.8814873399374, ..., 84.45888934298905, 347.8814873399374, 315.3112063607118], dtype=float64)\n",
+      "\n",
+      "spectrum calculated the lazy way:\t array([187.8635087634579, 315.3112063607119, 347.8814873399374, ..., 84.45888934298905, 347.8814873399374, 315.3112063607118], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "x = np.linspace(0, 10, num=1024)\n",
+    "y = np.sin(x)\n",
+    "\n",
+    "a, b = np.fft.fft(y)\n",
+    "\n",
+    "print('\\nspectrum calculated the hard way:\\t', np.sqrt(a*a + b*b))\n",
+    "\n",
+    "a = spy.signal.spectrogram(y)\n",
+    "\n",
+    "print('\\nspectrum calculated the lazy way:\\t', a)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/scipy-special.ipynb
+++ b/docs/scipy-special.ipynb
@ -0,0 +1,344 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T18:54:58.722373Z",
+     "start_time": "2021-01-13T18:54:57.178438Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T18:57:41.555892Z",
+     "start_time": "2021-01-13T18:57:41.551121Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T18:57:42.313231Z",
+     "start_time": "2021-01-13T18:57:42.288402Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Special functions\n",
+    "\n",
+    "`scipy`'s `special` module defines several functions that behave like the standard mathematical functions of `numpy`, i.e., they can be called on any scalar, scalar-valued iterable (ranges, lists, tuples containing numbers), and on `ndarray`s without having to change the call signature. In all cases the functions return a new `ndarray` of typecode `float` (since these functions usually generate float values anyway).\n",
+    "\n",
+    "At present, `ulab`'s `special` module contains the following functions:\n",
+    "\n",
+    "`erf`, `erfc`, `gamma`, and `gammaln`. They can be called by prefixing their names with `scipy.special.`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T19:06:54.640444Z",
+     "start_time": "2021-01-13T19:06:54.623467Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:  range(0, 9)\n",
+      "array([0.0, 0.8427007929497149, 0.9953222650189527, 0.9999779095030014, 0.9999999845827421, 1.0, 1.0, 1.0, 1.0], dtype=float64)\n",
+      "\n",
+      "b:  array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=float64)\n",
+      "array([1.0, 0.1572992070502851, 0.004677734981047265, 2.209049699858544e-05, 1.541725790028002e-08, 1.537459794428035e-12, 2.151973671249892e-17, 4.183825607779414e-23, 1.122429717298293e-29], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "a = range(9)\n",
+    "b = np.array(a)\n",
+    "\n",
+    "print('a: ', a)\n",
+    "print(spy.special.erf(a))\n",
+    "\n",
+    "print('\\nb: ', b)\n",
+    "print(spy.special.erfc(b))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/source/ulab.rst
+++ b/docs/source/ulab.rst
--- a/docs/ulab-approx.ipynb
+++ b/docs/ulab-approx.ipynb
@ -0,0 +1,613 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:50:51.417613Z",
+     "start_time": "2021-01-08T12:50:51.208257Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:50:52.581876Z",
+     "start_time": "2021-01-08T12:50:52.567901Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:50:53.516712Z",
+     "start_time": "2021-01-08T12:50:53.454984Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Approximation methods\n",
+    "\n",
+    "`ulab` implements five functions that can be used for interpolating and integrating sampled data, and for finding roots and minima of arbitrary `python` functions in one dimension. Two of these functions, `interp` and `trapz`, are defined in `numpy`, while the other three (`newton`, `bisect`, and `fmin`) are part of `scipy`'s `optimize` module.\n",
+    "\n",
+    "Note that routines that work with user-defined functions still have to call the underlying `python` code, and therefore, gains in speed are not as significant as with other vectorised operations. As a rule of thumb, a factor of two can be expected, when compared to an optimised `python` implementation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## interp\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/numpy.interp\n",
+    "\n",
+    "The `interp` function returns the linearly interpolated values of a one-dimensional numerical array. It requires three positional arguments: `x`, the points at which the interpolated values are evaluated, `xp`, the array\n",
+    "of the independent data variable, and `fp`, the array of the dependent values of the data. `xp` must be a monotonically increasing sequence of numbers.\n",
+    "\n",
+    "Two keyword arguments, `left` and `right`, can also be supplied; these determine the return values for `x < xp[0]` and `x > xp[-1]`, respectively. If they are not supplied, `left` and `right` default to `fp[0]` and `fp[-1]`, respectively."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:54:58.895801Z",
+     "start_time": "2021-01-08T12:54:58.869338Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "array([0.8, 1.8, 2.8, 3.8, 4.8], dtype=float64)\n",
+      "array([1.0, 1.8, 2.8, 4.6, 5.0], dtype=float64)\n",
+      "array([0.0, 1.8, 2.8, 4.6, 5.0], dtype=float64)\n",
+      "array([1.0, 1.8, 2.8, 4.6, 10.0], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "x = np.array([1, 2, 3, 4, 5]) - 0.2\n",
+    "xp = np.array([1, 2, 3, 4])\n",
+    "fp = np.array([1, 2, 3, 5])\n",
+    "\n",
+    "print(x)\n",
+    "print(np.interp(x, xp, fp))\n",
+    "print(np.interp(x, xp, fp, left=0.0))\n",
+    "print(np.interp(x, xp, fp, right=10.0))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## newton\n",
+    "\n",
+    "`scipy`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.newton.html\n",
+    "\n",
+    "`newton` finds a zero of a real, user-defined function using the Newton-Raphson (or secant or Halley’s) method. The routine requires two positional arguments, the function, and the initial value. Three keyword\n",
+    "arguments can be supplied to control the iteration. These are the absolute and relative tolerances, `tol` and `rtol`, respectively, and the number of iterations before stopping, `maxiter`. The function returns a single scalar, the position of the root."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:56:35.139958Z",
+     "start_time": "2021-01-08T12:56:35.119712Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.260135727246117\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "    \n",
+    "def f(x):\n",
+    "    return x*x*x - 2.0\n",
+    "\n",
+    "print(spy.optimize.newton(f, 3., tol=0.001, rtol=0.01))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## bisect \n",
+    "\n",
+    "`scipy`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.bisect.html\n",
+    "\n",
+    "`bisect` finds the root of a function of one variable using a simple bisection routine. It takes three positional arguments, the function itself, and two starting points. The function must have opposite signs\n",
+    "at the starting points. The function returns the position of the root.\n",
+    "\n",
+    "Two keyword arguments, `xtol`, and `maxiter` can be supplied to control the accuracy, and the number of bisections, respectively."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:58:28.444300Z",
+     "start_time": "2021-01-08T12:58:28.421989Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.9999997615814209\n",
+      "only 8 bisections:  0.984375\n",
+      "with 0.1 accuracy:  0.9375\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "    \n",
+    "def f(x):\n",
+    "    return x*x - 1\n",
+    "\n",
+    "print(spy.optimize.bisect(f, 0, 4))\n",
+    "\n",
+    "print('only 8 bisections: ',  spy.optimize.bisect(f, 0, 4, maxiter=8))\n",
+    "\n",
+    "print('with 0.1 accuracy: ',  spy.optimize.bisect(f, 0, 4, xtol=0.1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Performance\n",
+    "\n",
+    "Since the `bisect` routine calls user-defined `python` functions, the speed gain is only about a factor of two compared to a pure `python` implementation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:08:24.750562Z",
+     "start_time": "2020-05-19T19:08:24.682959Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bisect running in python\r\n",
+      "execution time:  1270  us\r\n",
+      "bisect running in C\r\n",
+      "execution time:  642  us\r\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "def f(x):\n",
+    "    return (x-1)*(x-1) - 2.0\n",
+    "\n",
+    "def bisect(f, a, b, xtol=2.4e-7, maxiter=100):\n",
+    "    if f(a) * f(b) > 0:\n",
+    "        raise ValueError\n",
+    "\n",
+    "    rtb = a if f(a) < 0.0 else b\n",
+    "    dx = b - a if f(a) < 0.0 else a - b\n",
+    "    for i in range(maxiter):\n",
+    "        dx *= 0.5\n",
+    "        x_mid = rtb + dx\n",
+    "        mid_value = f(x_mid)\n",
+    "        if mid_value < 0:\n",
+    "            rtb = x_mid\n",
+    "        if abs(dx) < xtol:\n",
+    "            break\n",
+    "\n",
+    "    return rtb\n",
+    "\n",
+    "@timeit\n",
+    "def bisect_scipy(f, a, b):\n",
+    "    return spy.optimize.bisect(f, a, b)\n",
+    "\n",
+    "@timeit\n",
+    "def bisect_timed(f, a, b):\n",
+    "    return bisect(f, a, b)\n",
+    "\n",
+    "print('bisect running in python')\n",
+    "bisect_timed(f, 3, 2)\n",
+    "\n",
+    "print('bisect running in C')\n",
+    "bisect_scipy(f, 3, 2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## fmin\n",
+    "\n",
+    "`scipy`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin.html\n",
+    "\n",
+    "The `fmin` function finds the position of the minimum of a user-defined function by using the downhill simplex method. It requires two positional arguments, the function and the initial value. Three keyword arguments, `xatol`, `fatol`, and `maxiter`, stipulate the conditions for stopping."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:00:26.729947Z",
+     "start_time": "2021-01-08T13:00:26.702748Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.9996093749999952\n",
+      "1.199999999999996\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "def f(x):\n",
+    "    return (x-1)**2 - 1\n",
+    "\n",
+    "print(spy.optimize.fmin(f, 3.0))\n",
+    "print(spy.optimize.fmin(f, 3.0, xatol=0.1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## trapz\n",
+    "\n",
+    "`numpy`: https://numpy.org/doc/stable/reference/generated/numpy.trapz.html\n",
+    "\n",
+    "The function takes one or two one-dimensional `ndarray`s, and integrates the dependent values (`y`) using the trapezoidal rule. If the independent variable (`x`) is given, that is taken as the sample points corresponding to `y`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:01:29.515166Z",
+     "start_time": "2021-01-08T13:01:29.494285Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "x:  array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], dtype=float64)\n",
+      "y:  array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0], dtype=float64)\n",
+      "============================\n",
+      "integral of y:  244.5\n",
+      "integral of y at x:  244.5\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "x = np.linspace(0, 9, num=10)\n",
+    "y = x*x\n",
+    "\n",
+    "print('x: ',  x)\n",
+    "print('y: ',  y)\n",
+    "print('============================')\n",
+    "print('integral of y: ', np.trapz(y))\n",
+    "print('integral of y at x: ', np.trapz(y, x=x))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/ulab-change-log.md
+++ b/docs/ulab-change-log.md
@ -1,3 +1,464 @@
+Fri, 29 Jan 2021
+
+version 2.1.5
+
+    fixed error, when calculating standard deviation of iterables
+
+Wed, 27 Jan 2021
+
+version 2.1.4
+
+    arrays can now be initialised from nested iterables
+
+Thu, 21 Jan 2021
+
+version 2.1.3
+
+    added ifndef/endif wrappers in ulab.h
+
+Fri, 15 Jan 2021
+
+version 2.1.2
+
+    fixed small error in frombuffer
+
+Thu, 14 Jan 2021
+
+version 2.1.1
+
+    fixed bad error in diff
+
+Thu, 26 Nov 2020
+
+version 2.1.0
+
+    implemented frombuffer
+
+Tue, 24 Nov 2020
+
+version 2.0.0
+
+    implemented numpy/scipy compatibility
+
+Tue, 24 Nov 2020
+
+version 1.6.0
+
+    added Boolean initialisation option
+
+Mon, 23 Nov 2020
+
+version 1.5.1
+
+    fixed nan definition
+
+version 1.5.0
+
+    added nan/inf class level constants
+
+version 1.4.10
+
+    fixed sosfilt
+
+version 1.4.9
+
+    added in-place sort
+
+version 1.4.8
+
+    fixed convolve
+
+version 1.4.7
+
+    fixed iteration loop in norm
+
+Fri, 20 Nov 2020
+
+version 1.4.6
+
+    fixed interp
+
+Thu, 19 Nov 2020
+
+version 1.4.5
+
+    eliminated fatal micropython error in ndarray_init_helper
+
+version 1.4.4
+
+    fixed min, max
+
+version 1.4.3
+
+    fixed full, zeros, ones
+
+version 1.4.2
+
+    fixed dtype
+
+Wed, 18 Nov 2020
+
+version 1.4.1
+
+    fixed std
+
+version 1.4.0
+
+    removed size from linalg
+
+version 1.3.8
+
+    fixed trapz
+
+Tue, 17 Nov 2020
+
+version 1.3.7
+
+    fixed in-place power, in-place divide, roll
+
+Mon, 16 Nov 2020
+
+version 1.3.6
+
+    fixed eye
+
+Mon, 16 Nov 2020
+
+version 1.3.5
+
+    fixed trace
+
+Mon, 16 Nov 2020
+
+version 1.3.4
+
+    fixed clip
+
+Mon, 16 Nov 2020
+
+version 1.3.3
+
+    added function pointer option to some binary operators
+
+Fri, 13 Nov 2020
+
+version 1.3.2
+
+    implemented function pointer option in vectorise
+
+Thu, 12 Nov 2020
+
+version 1.3.1
+
+    factored out some of the math functions in re-usable form
+
+Wed, 11 Nov 2020
+
+version 1.3.0
+
+    added dtype function/method/property
+
+Wed, 11 Nov 2020
+
+version 1.2.8
+
+    improved the accuracy of sum for float types
+
+Wed, 11 Nov 2020
+
+version 1.2.7
+
+    fixed transpose
+    improved the accuracy of trapz
+
+Tue, 10 Nov 2020
+
+version 1.2.6
+
+    fixed slicing
+
+Mon, 9 Nov 2020
+
+version 1.2.5
+
+    fixed array casting glitch in make_new_core
+
+Mon, 9 Nov 2020
+
+version 1.2.4
+
+    sum/mean/std can flatten the arrays now
+
+Tue, 3 Nov 2020
+
+version 1.2.1
+
+    fixed pointer issue in eig, and corrected the docs
+
+Tue, 3 Nov 2020
+
+version 1.2.0
+
+    added median function
+
+Tue, 3 Nov 2020
+
+version 1.1.4
+
+    fixed norm and shape
+
+Mon, 2 Nov 2020
+
+version 1.1.3
+
+    fixed small glitch in diagonal, and ndarray_make_new_core
+
+Sun, 1 Nov 2020
+
+version 1.1.1
+
+    fixed compilation error for 4D
+
+Sat, 31 Oct 2020
+
+version 1.1.0
+
+    added the diagonal function
+
+Fri, 30 Oct 2020
+
+version 1.0.0
+
+    added :
+        support for tensors of rank 4
+        proper broadcasting
+        views
+        .tobytes()
+        concatenate
+        cross
+        full
+        logspace
+        in-place operators
+
+Sat, 25 Oct 2020
+
+version 0.54.5
+
+    wrong type in slices raises TypeError exception
+
+Fri, 23 Oct 2020
+
+version 0.54.4
+
+    fixed indexing error in slices
+
+Mon, 17 Aug 2020
+
+version 0.54.3
+
+    fixed small error in linalg
+
+Mon, 03 Aug 2020
+
+version 0.54.2
+
+    argsort throws an error, if the array is longer than 65535
+
+Wed, 29 Jul 2020
+
+version 0.54.1
+
+    changed to size_t for the length of arrays
+
+Thu, 23 Jul 2020
+
+version 0.54.0
+
+    added norm to linalg
+
+Wed, 22 Jul 2020
+
+version 0.53.2
+
+    added circuitpython documentation stubs to the source files
+
+Wed, 22 Jul 2020
+
+version 0.53.1
+
+    fixed arange with negative steps
+
+Mon, 20 Jul 2020
+
+version 0.53.0
+
+    added arange to create.c
+
+Thu, 16 Jul 2020
+
+version 0.52.0
+
+    added trapz to approx
+
+Mon, 29 Jun 2020
+
+version 0.51.1
+
+    fixed argmin/argmax issue
+
+Fri, 19 Jun 2020
+
+version 0.51.0
+
+    add sosfilt to the filter sub-module
+
+Fri, 12 Jun 2020
+
+version 0.50.2
+
+    fixes compilation error in openmv
+
+Mon, 1 Jun 2020
+
+version 0.50.1
+
+    fixes error in numerical max/min
+
+Mon, 18 May 2020
+
+version 0.50.0
+
+    move interp to the approx sub-module
+
+Wed, 06 May 2020
+
+version 0.46.0
+
+    add curve_fit to the approx sub-module
+
+version 0.44.0
+
+    add approx sub-module with newton, fmin, and bisect functions
+
+Thu, 30 Apr 2020
+
+version 0.44.0
+
+    add approx sub-module with newton, fmin, and bisect functions
+
+Tue, 19 May 2020
+
+version 0.46.1
+
+    fixed bad error in binary_op
+
+Wed, 6 May 2020
+
+version 0.46
+
+    added vectorisation of python functions
+
+Sat, 2 May 2020
+
+version 0.45.0
+
+    add equal/not_equal to the compare module
+
+Tue, 21 Apr 2020
+
+version 0.42.0
+
+    add minimum/maximum/clip functions
+
+Mon, 20 Apr 2020
+
+version 0.41.6
+
+    argument handling improvement in polyfit
+
+Mon, 20 Apr 2020
+
+version 0.41.5
+
+    fix compilation errors due to https://github.com/micropython/micropython/commit/30840ebc9925bb8ef025dbc2d5982b1bfeb75f1b
+
+Sat, 18 Apr 2020
+
+version 0.41.4
+
+    fix compilation error on hardware ports
+
+Tue, 14 Apr 2020
+
+version 0.41.3
+
+    fix indexing error in dot function
+
+Thu, 9 Apr 2020
+
+version 0.41.2
+
+    fix transpose function
+
+Tue, 7 Apr 2020
+
+version 0.41.2
+
+    fix discrepancy in argmin/argmax behaviour
+
+Tue, 7 Apr 2020
+
+version 0.41.1
+
+    fix error in argsort
+
+Sat, 4 Apr 2020
+
+version 0.41.0
+
+    implemented == and != binary operators
+
+Fri, 3 Apr 2020
+
+version 0.40.0
+
+    added trace to linalg
+
+Thu, 2 Apr 2020
+
+version 0.39.0
+
+    added the ** operator, and operand swapping in binary operators
+
+Thu, 2 Apr 2020
+
+version 0.38.1
+
+    added fast option, when initialising from ndarray_properties
+
+Thu, 12 Mar 2020
+
+version 0.38.0
+
+    added initialisation from ndarray, and the around function
+
+Tue, 10 Mar 2020
+
+version 0.37.0
+
+    added Cholesky decomposition to linalg.c
+
+Thu, 27 Feb 2020
+
+version 0.36.0
+
+    moved zeros, ones, eye and linspace into separate module (they are still bound at the top level)
+
+Thu, 27 Feb 2020
+
+version 0.35.0
+
+    Move zeros, ones back into top level ulab module

 Tue, 18 Feb 2020

--- a/docs/ulab-compare.ipynb
+++ b/docs/ulab-compare.ipynb
@ -0,0 +1,467 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:02:42.934528Z",
+     "start_time": "2021-01-08T13:02:42.720862Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:02:44.890094Z",
+     "start_time": "2021-01-08T13:02:44.878787Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:06:20.583308Z",
+     "start_time": "2021-01-08T13:06:20.525830Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "import ulab as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Comparison of arrays"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## equal, not_equal\n",
+    "\n",
+    "`numpy`: https://numpy.org/doc/stable/reference/generated/numpy.equal.html\n",
+    "\n",
+    "`numpy`: https://numpy.org/doc/stable/reference/generated/numpy.not_equal.html\n",
+    "\n",
+    "In `micropython`, arrays or scalars can be compared with the `==`, `!=`, `<`, `>`, `<=`, or `>=` binary operators. In `circuitpython`, `==` and `!=` will produce unexpected results. To avoid this discrepancy, and to maintain compatibility with `numpy`, `ulab` implements the `equal` and `not_equal` functions, which return the same results irrespective of the `python` implementation.\n",
+    "\n",
+    "These two functions take two `ndarray`s or scalars as their arguments. No keyword arguments are implemented."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T14:22:13.990898Z",
+     "start_time": "2021-01-08T14:22:13.941896Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:  array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=float64)\n",
+      "b:  array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], dtype=float64)\n",
+      "\n",
+      "a == b:  array([True, False, False, False, False, False, False, False, False], dtype=bool)\n",
+      "a != b:  array([False, True, True, True, True, True, True, True, True], dtype=bool)\n",
+      "a == 2:  array([False, False, True, False, False, False, False, False, False], dtype=bool)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array(range(9))\n",
+    "b = np.zeros(9)\n",
+    "\n",
+    "print('a: ', a)\n",
+    "print('b: ', b)\n",
+    "print('\\na == b: ', np.equal(a, b))\n",
+    "print('a != b: ', np.not_equal(a, b))\n",
+    "\n",
+    "# comparison with scalars\n",
+    "print('a == 2: ', np.equal(a, 2))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## minimum\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.minimum.html\n",
+    "\n",
+    "Returns the element-wise minimum of two arrays, two scalars, or an array and a scalar. If the arrays are of different `dtype`, the output is upcast as in [Binary operators](#Binary-operators). If both inputs are scalars, a scalar is returned. Only positional arguments are implemented.\n",
+    "\n",
+    "## maximum\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.maximum.html\n",
+    "\n",
+    "Returns the element-wise maximum of two arrays, two scalars, or an array and a scalar. If the arrays are of different `dtype`, the output is upcast as in [Binary operators](#Binary-operators). If both inputs are scalars, a scalar is returned. Only positional arguments are implemented."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:21:17.151280Z",
+     "start_time": "2021-01-08T13:21:17.123768Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "minimum of a, and b:\n",
+      "array([1.0, 2.0, 3.0, 2.0, 1.0], dtype=float64)\n",
+      "\n",
+      "maximum of a, and b:\n",
+      "array([5.0, 4.0, 3.0, 4.0, 5.0], dtype=float64)\n",
+      "\n",
+      "maximum of 1, and 5.5:\n",
+      "5.5\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([1, 2, 3, 4, 5], dtype=np.uint8)\n",
+    "b = np.array([5, 4, 3, 2, 1], dtype=np.float)\n",
+    "print('minimum of a, and b:')\n",
+    "print(np.minimum(a, b))\n",
+    "\n",
+    "print('\\nmaximum of a, and b:')\n",
+    "print(np.maximum(a, b))\n",
+    "\n",
+    "print('\\nmaximum of 1, and 5.5:')\n",
+    "print(np.maximum(1, 5.5))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## clip\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.clip.html\n",
+    "\n",
+    "Clips an array, i.e., values that are outside of an interval are clipped to the interval edges. The function is equivalent to `maximum(a_min, minimum(a, a_max))`, and broadcasting takes place exactly as in [minimum](#minimum). If the arrays are of different `dtype`, the output is upcast as in [Binary operators](#Binary-operators)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T13:22:14.147310Z",
+     "start_time": "2021-01-08T13:22:14.123961Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a:\t\t array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=uint8)\n",
+      "clipped:\t array([3, 3, 3, 3, 4, 5, 6, 7, 7], dtype=uint8)\n",
+      "\n",
+      "a:\t\t array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=uint8)\n",
+      "b:\t\t array([3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0], dtype=float64)\n",
+      "clipped:\t array([3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.0], dtype=float64)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array(range(9), dtype=np.uint8)\n",
+    "print('a:\\t\\t', a)\n",
+    "print('clipped:\\t', np.clip(a, 3, 7))\n",
+    "\n",
+    "b = 3 * np.ones(len(a), dtype=np.float)\n",
+    "print('\\na:\\t\\t', a)\n",
+    "print('b:\\t\\t', b)\n",
+    "print('clipped:\\t', np.clip(a, b, 7))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/ulab-convert.ipynb
+++ b/docs/ulab-convert.ipynb
@ -0,0 +1,497 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-01T09:27:13.438054Z",
+     "start_time": "2020-05-01T09:27:13.191491Z"
+    }
+   },
+   "source": [
+    "# conf.py"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-15T13:53:42.464150Z",
+     "start_time": "2021-01-15T13:53:42.449894Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Overwriting manual/source/conf.py\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile manual/source/conf.py\n",
+    "# Configuration file for the Sphinx documentation builder.\n",
+    "#\n",
+    "# This file only contains a selection of the most common options. For a full\n",
+    "# list see the documentation:\n",
+    "# http://www.sphinx-doc.org/en/master/config\n",
+    "\n",
+    "# -- Path setup --------------------------------------------------------------\n",
+    "\n",
+    "# If extensions (or modules to document with autodoc) are in another directory,\n",
+    "# add these directories to sys.path here. If the directory is relative to the\n",
+    "# documentation root, use os.path.abspath to make it absolute, like shown here.\n",
+    "#\n",
+    "import os\n",
+    "# import sys\n",
+    "# sys.path.insert(0, os.path.abspath('.'))\n",
+    "\n",
+    "#import sphinx_rtd_theme\n",
+    "\n",
+    "from sphinx.transforms import SphinxTransform\n",
+    "from docutils import nodes\n",
+    "from sphinx import addnodes\n",
+    "\n",
+    "# -- Project information -----------------------------------------------------\n",
+    "\n",
+    "project = 'The ulab book'\n",
+    "copyright = '2019-2021, Zoltán Vörös and contributors'\n",
+    "author = 'Zoltán Vörös'\n",
+    "\n",
+    "# The full version, including alpha/beta/rc tags\n",
+    "release = '2.1.2'\n",
+    "\n",
+    "\n",
+    "# -- General configuration ---------------------------------------------------\n",
+    "\n",
+    "# Add any Sphinx extension module names here, as strings. They can be\n",
+    "# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n",
+    "# ones.\n",
+    "extensions = [\n",
+    "]\n",
+    "\n",
+    "# Add any paths that contain templates here, relative to this directory.\n",
+    "templates_path = ['_templates']\n",
+    "\n",
+    "# List of patterns, relative to source directory, that match files and\n",
+    "# directories to ignore when looking for source files.\n",
+    "# This pattern also affects html_static_path and html_extra_path.\n",
+    "exclude_patterns = []\n",
+    "\n",
+    "\n",
+    "# Add any paths that contain custom static files (such as style sheets) here,\n",
+    "# relative to this directory. They are copied after the builtin static files,\n",
+    "# so a file named \"default.css\" will overwrite the builtin \"default.css\".\n",
+    "html_static_path = ['_static']\n",
+    "\n",
+    "latex_maketitle = r'''\n",
+    "\\begin{titlepage}\n",
+    "\\begin{flushright}\n",
+    "\\Huge\\textbf{The $\\mu$lab book}\n",
+    "\\vskip 0.5em\n",
+    "\\LARGE\n",
+    "\\textbf{Release %s}\n",
+    "\\vskip 5em\n",
+    "\\huge\\textbf{Zoltán Vörös}\n",
+    "\\end{flushright}\n",
+    "\\begin{flushright}\n",
+    "\\LARGE\n",
+    "\\vskip 2em\n",
+    "with contributions by\n",
+    "\\vskip 2em\n",
+    "\\textbf{Roberto Colistete Jr.}\n",
+    "\\vskip 0.2em\n",
+    "\\textbf{Jeff Epler}\n",
+    "\\vskip 0.2em\n",
+    "\\textbf{Taku Fukada}\n",
+    "\\vskip 0.2em\n",
+    "\\textbf{Diego Elio Pettenò}\n",
+    "\\vskip 0.2em\n",
+    "\\textbf{Scott Shawcroft}\n",
+    "\\vskip 5em\n",
+    "\\today\n",
+    "\\end{flushright}\n",
+    "\\end{titlepage}\n",
+    "'''%release\n",
+    "\n",
+    "latex_elements = {\n",
+    "    'maketitle': latex_maketitle\n",
+    "}\n",
+    "\n",
+    "\n",
+    "master_doc = 'index'\n",
+    "\n",
+    "author=u'Zoltán Vörös'\n",
+    "copyright=author\n",
+    "language='en'\n",
+    "\n",
+    "latex_documents = [\n",
+    "(master_doc, 'the-ulab-book.tex', 'The $\\mu$lab book',\n",
+    "'Zoltán Vörös', 'manual'),\n",
+    "]\n",
+    "\n",
+    "# Read the docs theme\n",
+    "on_rtd = os.environ.get('READTHEDOCS', None) == 'True'\n",
+    "if not on_rtd:\n",
+    "    try:\n",
+    "        import sphinx_rtd_theme\n",
+    "        html_theme = 'sphinx_rtd_theme'\n",
+    "        html_theme_path = [sphinx_rtd_theme.get_html_theme_path(), '.']\n",
+    "    except ImportError:\n",
+    "        html_theme = 'default'\n",
+    "        html_theme_path = ['.']\n",
+    "else:\n",
+    "    html_theme_path = ['.']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-15T14:18:49.025168Z",
+     "start_time": "2021-01-15T14:18:49.015858Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Overwriting manual/source/index.rst\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile manual/source/index.rst\n",
+    "\n",
+    ".. ulab-manual documentation master file, created by\n",
+    "   sphinx-quickstart on Sat Oct 19 12:48:00 2019.\n",
+    "   You can adapt this file completely to your liking, but it should at least\n",
+    "   contain the root `toctree` directive.\n",
+    "\n",
+    "Welcome to the ulab book!\n",
+    "=======================================\n",
+    "\n",
+    ".. toctree::\n",
+    "   :maxdepth: 2\n",
+    "   :caption: Introduction\n",
+    "\n",
+    "   ulab-intro\n",
+    "\n",
+    ".. toctree::\n",
+    "   :maxdepth: 2\n",
+    "   :caption: User's guide:\n",
+    "\n",
+    "   ulab-ndarray\n",
+    "   numpy-functions\n",
+    "   numpy-universal\n",
+    "   numpy-fft\n",
+    "   numpy-linalg\n",
+    "   scipy-optimize\n",
+    "   scipy-signal\n",
+    "   scipy-special\n",
+    "   ulab-programming\n",
+    "\n",
+    "Indices and tables\n",
+    "==================\n",
+    "\n",
+    "* :ref:`genindex`\n",
+    "* :ref:`modindex`\n",
+    "* :ref:`search`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Notebook conversion"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-15T14:09:47.022621Z",
+     "start_time": "2021-01-15T14:09:46.985214Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import nbformat as nb\n",
+    "import nbformat.v4.nbbase as nb4\n",
+    "from nbconvert import RSTExporter\n",
+    "\n",
+    "from jinja2 import FileSystemLoader\n",
+    "rstexporter = RSTExporter(\n",
+    "    extra_loaders=[FileSystemLoader('./templates')],\n",
+    "    template_file = './templates/manual.tpl'\n",
+    ")\n",
+    "\n",
+    "def convert_notebook(fn):\n",
+    "    source = nb.read(fn+'.ipynb', nb.NO_CONVERT)\n",
+    "    notebook = nb4.new_notebook()\n",
+    "    notebook.cells = []\n",
+    "    append_cell = False\n",
+    "    for cell in source['cells']:\n",
+    "        if append_cell:\n",
+    "            notebook.cells.append(cell)\n",
+    "        else:\n",
+    "            if cell.cell_type == 'markdown':\n",
+    "                if cell.source == '__END_OF_DEFS__':\n",
+    "                    append_cell = True\n",
+    "                    \n",
+    "    (rst, resources) = rstexporter.from_notebook_node(notebook)\n",
+    "    with open('./manual/source/' + fn + '.rst', 'w') as fout:\n",
+    "        # it's a bit odd, but even an empty notebook is converted into a \"None\" string\n",
+    "        rst = rst[len('None'):] if rst.startswith('None') else rst\n",
+    "        fout.write(rst)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "ExecuteTime": {
+     "start_time": "2021-01-15T14:38:15.993Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "files = ['ulab-intro',\n",
+    "         'ulab-ndarray',\n",
+    "         'numpy-functions', \n",
+    "         'numpy-universal',\n",
+    "         'numpy-fft',\n",
+    "         'numpy-linalg',\n",
+    "         'scipy-optimize',\n",
+    "         'scipy-signal',\n",
+    "         'scipy-special',\n",
+    "         'ulab-programming']\n",
+    "\n",
+    "for file in files:\n",
+    "    convert_notebook(file)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Template"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-10-30T19:04:50.295563Z",
+     "start_time": "2020-10-30T19:04:50.227535Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Overwriting ./templates/manual.tpl\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile ./templates/manual.tpl\n",
+    "\n",
+    "{%- extends 'display_priority.tpl' -%}\n",
+    "\n",
+    "\n",
+    "{% block in_prompt %}\n",
+    "{% endblock in_prompt %}\n",
+    "\n",
+    "{% block output_prompt %}\n",
+    "{% endblock output_prompt %}\n",
+    "\n",
+    "{% block input scoped%}\n",
+    "\n",
+    "{%- if cell.source.split('\\n')[0].startswith('%%micropython') -%}\n",
+    ".. code::\n",
+    "        \n",
+    "{{ '\\n'.join(['# code to be run in micropython'] + cell.source.strip().split('\\n')[1:]) | indent}}\n",
+    "\n",
+    "{%- else -%}\n",
+    ".. code::\n",
+    "\n",
+    "{{ '\\n'.join(['# code to be run in CPython\\n'] + cell.source.strip().split('\\n')) | indent}}\n",
+    "{%- endif -%}\n",
+    "{% endblock input %}\n",
+    "\n",
+    "{% block error %}\n",
+    "::\n",
+    "\n",
+    "{{ super() }}\n",
+    "{% endblock error %}\n",
+    "\n",
+    "{% block traceback_line %}\n",
+    "{{ line | indent | strip_ansi }}\n",
+    "{% endblock traceback_line %}\n",
+    "\n",
+    "{% block execute_result %}\n",
+    "{% block data_priority scoped %}\n",
+    "{{ super() }}\n",
+    "{% endblock %}\n",
+    "{% endblock execute_result %}\n",
+    "\n",
+    "{% block stream %}\n",
+    ".. parsed-literal::\n",
+    "\n",
+    "{{ output.text | indent }}\n",
+    "{% endblock stream %}\n",
+    "\n",
+    "{% block data_svg %}\n",
+    ".. image:: {{ output.metadata.filenames['image/svg+xml'] | urlencode }}\n",
+    "{% endblock data_svg %}\n",
+    "\n",
+    "{% block data_png %}\n",
+    ".. image:: {{ output.metadata.filenames['image/png'] | urlencode }}\n",
+    "{%- set width=output | get_metadata('width', 'image/png') -%}\n",
+    "{%- if width is not none %}\n",
+    "   :width: {{ width }}px\n",
+    "{%- endif %}\n",
+    "{%- set height=output | get_metadata('height', 'image/png') -%}\n",
+    "{%- if height is not none %}\n",
+    "   :height: {{ height }}px\n",
+    "{%- endif %}\n",
+    "{% endblock data_png %}\n",
+    "\n",
+    "{% block data_jpg %}\n",
+    ".. image:: {{ output.metadata.filenames['image/jpeg'] | urlencode }}\n",
+    "{%- set width=output | get_metadata('width', 'image/jpeg') -%}\n",
+    "{%- if width is not none %}\n",
+    "   :width: {{ width }}px\n",
+    "{%- endif %}\n",
+    "{%- set height=output | get_metadata('height', 'image/jpeg') -%}\n",
+    "{%- if height is not none %}\n",
+    "   :height: {{ height }}px\n",
+    "{%- endif %}\n",
+    "{% endblock data_jpg %}\n",
+    "\n",
+    "{% block data_markdown %}\n",
+    "{{ output.data['text/markdown'] | convert_pandoc(\"markdown\", \"rst\") }}\n",
+    "{% endblock data_markdown %}\n",
+    "\n",
+    "{% block data_latex %}\n",
+    ".. math::\n",
+    "\n",
+    "{{ output.data['text/latex'] | strip_dollars | indent }}\n",
+    "{% endblock data_latex %}\n",
+    "\n",
+    "{% block data_text scoped %}\n",
+    ".. parsed-literal::\n",
+    "\n",
+    "{{ output.data['text/plain'] | indent }}\n",
+    "{% endblock data_text %}\n",
+    "\n",
+    "{% block data_html scoped %}\n",
+    ".. raw:: html\n",
+    "\n",
+    "{{ output.data['text/html'] | indent }}\n",
+    "{% endblock data_html %}\n",
+    "\n",
+    "{% block markdowncell scoped %}\n",
+    "{{ cell.source | convert_pandoc(\"markdown\", \"rst\") }}\n",
+    "{% endblock markdowncell %}\n",
+    "\n",
+    "{%- block rawcell scoped -%}\n",
+    "{%- if cell.metadata.get('raw_mimetype', '').lower() in resources.get('raw_mimetypes', ['']) %}\n",
+    "{{cell.source}}\n",
+    "{% endif -%}\n",
+    "{%- endblock rawcell -%}\n",
+    "\n",
+    "{% block headingcell scoped %}\n",
+    "{{ (\"#\" * cell.level + cell.source) | replace('\\n', ' ') | convert_pandoc(\"markdown\", \"rst\") }}\n",
+    "{% endblock headingcell %}\n",
+    "\n",
+    "{% block unknowncell scoped %}\n",
+    "unknown type  {{cell.type}}\n",
+    "{% endblock unknowncell %}\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/ulab-intro.ipynb
+++ b/docs/ulab-intro.ipynb
@ -0,0 +1,846 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:07:55.382930Z",
+     "start_time": "2021-01-08T12:07:46.895325Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Matplotlib is building the font cache; this may take a moment.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:07:56.746059Z",
+     "start_time": "2021-01-08T12:07:56.737187Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:08:00.405800Z",
+     "start_time": "2021-01-08T12:08:00.382869Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Enter ulab\n",
+    "\n",
+    "`ulab` is a `numpy`-like module for `micropython` and its derivatives, meant to simplify and speed up common mathematical operations on arrays. `ulab` implements a small subset of `numpy` and `scipy`. The functions were chosen such that they might be useful in the context of a microcontroller. However, the project is a living one, and suggestions for new features are always welcome. \n",
+    "\n",
+    "This document discusses how you can use the library, starting from building your own firmware, through questions like what affects the firmware size, what are the trade-offs, and what are the most important differences to `numpy` and `scipy`, respectively. The document is organised as follows:\n",
+    "\n",
+    "The chapter after this one helps you with firmware customisation.\n",
+    "\n",
+    "The third chapter gives a very concise summary of the `ulab` functions and array methods. This chapter can be used as a quick reference.\n",
+    "\n",
+    "The chapters after that are an in-depth review of most functions. Here you can find usage examples, benchmarks, as well as a thorough discussion of such concepts as broadcasting, and views versus copies. \n",
+    "\n",
+    "The final chapter of this book can be regarded as the programming manual. The inner working of `ulab` is dissected here, and you will also find hints as to how to implement your own `numpy`-compatible functions.\n",
+    "\n",
+    "\n",
+    "## Purpose\n",
+    "\n",
+    "Of course, the first question that one has to answer is why on Earth one would need a fast math library on a microcontroller. After all, it is not expected that heavy number crunching is going to take place on bare metal, and it is not meant to. On a PC, the main reason for writing fast code is the sheer amount of data that one wants to process. On a microcontroller, the data volume is probably small, but it might lead to catastrophic system failure if these data are not processed in time, because the microcontroller is supposed to interact with the outside world in a timely fashion. In fact, this latter requirement is what started this project: I needed the Fourier transform of a signal coming from the ADC of the `pyboard`, and all available options were simply too slow. \n",
+    "\n",
+    "In addition to speed, another issue that one has to keep in mind when working with embedded systems is the amount of available RAM: I believe everything here could be implemented in pure `python` with relatively little effort (in fact, there are a couple of `python`-only implementations of `numpy` functions out there), but the price we would have to pay for that is not only speed, but RAM, too. `python` code, if it is not frozen and compiled into the firmware, has to be compiled at runtime, which is not exactly a cheap process. On top of that, if numbers are stored in a list or tuple, which would be the high-level container, then they occupy 8 bytes each, no matter whether they are all smaller than 100, or larger than one hundred million. This is obviously a waste of resources in an environment where resources are scarce. \n",
+    "\n",
+    "Finally, there is a reason for using `micropython` in the first place. Namely, that a microcontroller can be programmed in a very elegant, and *pythonic* way. But if it is so, why should we not extend this idea to other tasks and concepts that might come up in this context? If there was no other reason than this *elegance*, I would find that convincing enough.\n",
+    "\n",
+    "Based on the above-mentioned considerations, all functions in `ulab` are implemented in a way that \n",
+    "\n",
+    "1. conforms to `numpy` as much as possible,\n",
+    "2. is as frugal with RAM as possible,\n",
+    "3. and is nevertheless fast: much faster than pure python. Think of speed-ups by a factor of 30-50!\n",
+    "\n",
+    "The main points of `ulab` are \n",
+    "\n",
+    "- compact, iterable and sliceable containers of numerical data in one to four dimensions. These containers support all the relevant unary and binary operators (e.g., `len`, ==, +, *, etc.)\n",
+    "- vectorised computations on `micropython` iterables and numerical arrays (in `numpy`-speak, universal functions)\n",
+    "- computing statistical properties (mean, standard deviation etc.) on arrays\n",
+    "- basic linear algebra routines (matrix inversion, multiplication, reshaping, transposition, determinant, eigenvalues, Cholesky decomposition, and so on)\n",
+    "- polynomial fits to numerical data, and evaluation of polynomials\n",
+    "- fast Fourier transforms\n",
+    "- filtering of data (convolution and second-order filters)\n",
+    "- function minimisation, fitting, and numerical approximation routines\n",
+    "\n",
+    "`ulab` implements close to a hundred functions and array methods. At the time of writing this manual (for version 2.1.0), the library adds approximately 120 kB of extra compiled code to the `micropython` (pyboard v1.1) firmware. However, if you are short on flash space, you can easily shave tens of kB off the firmware. In fact, if only a small sub-set of functions is needed, you can get away with less than 10 kB of flash space. See the section on [customising ulab](#Customising-the-firmware).\n",
+    "\n",
+    "## Resources and legal matters\n",
+    "\n",
+    "The source code of the module can be found under https://github.com/v923z/micropython-ulab/tree/master/code, while the source of this user manual is under https://github.com/v923z/micropython-ulab/tree/master/docs.\n",
+    "\n",
+    "The MIT licence applies to all material. \n",
+    "\n",
+    "## Friendly request\n",
+    "\n",
+    "If you use `ulab`, and bump into a bug, or think that a particular function is missing, or its behaviour does not conform to `numpy`, please raise a [ulab issue](https://github.com/v923z/micropython-ulab/issues) on GitHub, so that the community can profit from your experiences. \n",
+    "\n",
+    "Even better, if you find the project to be useful, and think that it could be made better, faster, tighter, and shinier, please, consider contributing, and issue a pull request with the implementation of your improvements and new features. `ulab` can only become successful, if it offers what the community needs.\n",
+    "\n",
+    "These last comments apply to the documentation, too. If, in your opinion, the documentation is obscure, misleading, or not detailed enough, please, let us know, so that *we* can fix it.\n",
+    "\n",
+    "## Differences between micropython-ulab and circuitpython-ulab\n",
+    "\n",
+    "`ulab` was originally developed for `micropython`, but has since been integrated into a number of its flavours. Most of these flavours are simply forks of `micropython` itself, with some additional functionality. One of the notable exceptions is `circuitpython`, which has slightly diverged at the core level, and this has some minor consequences. Some of these concern the C implementation details only, all of which have been sorted out with the generous and enthusiastic support of Jeff Epler from [Adafruit Industries](http://www.adafruit.com).\n",
+    "\n",
+    "There are, however, a couple of instances, where the two environments differ at the python level in how the class properties can be accessed. We will point out the differences and possible workarounds at the relevant places in this document."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Customising the firmware\n",
+    "\n",
+    "\n",
+    "As mentioned above, `ulab` has considerably grown since its conception, which also means that it might no longer fit on the microcontroller of your choice. There are, however, a couple of ways of customising the firmware, and thereby reducing its size. \n",
+    "\n",
+    "All `ulab` options are listed in a single header file, [ulab.h](https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h), which contains pre-processor flags for each feature that can be fine-tuned. The first couple of lines of the file look like this\n",
+    "\n",
+    "```c\n",
+    "// The pre-processor constants in this file determine how ulab behaves:\n",
+    "//\n",
+    "// - how many dimensions ulab can handle\n",
+    "// - which functions are included in the compiled firmware\n",
+    "// - whether the python syntax is numpy-like, or modular\n",
+    "// - whether arrays can be sliced and iterated over\n",
+    "// - which binary/unary operators are supported\n",
+    "//\n",
+    "// A considerable amount of flash space can be saved by removing (setting\n",
+    "// the corresponding constants to 0) the unnecessary functions and features.\n",
+    "\n",
+    "// Determines, whether scipy is defined in ulab. The sub-modules and functions\n",
+    "// of scipy have to be defined separately\n",
+    "#define ULAB_HAS_SCIPY                      (1)\n",
+    "\n",
+    "// The maximum number of dimensions the firmware should be able to support\n",
+    "// Possible values lie between 1, and 4, inclusive\n",
+    "#define ULAB_MAX_DIMS                       2\n",
+    "\n",
+    "// By setting this constant to 1, iteration over array dimensions will be implemented\n",
+    "// as a function (ndarray_rewind_array), instead of writing out the loops in macros\n",
+    "// This reduces firmware size at the expense of speed\n",
+    "#define ULAB_HAS_FUNCTION_ITERATOR          (0)\n",
+    "\n",
+    "// If NDARRAY_IS_ITERABLE is 1, the ndarray object defines its own iterator function\n",
+    "// This option saves approx. 250 bytes of flash space\n",
+    "#define NDARRAY_IS_ITERABLE                 (1)\n",
+    "\n",
+    "// Slicing can be switched off by setting this variable to 0\n",
+    "#define NDARRAY_IS_SLICEABLE                (1)\n",
+    "\n",
+    "// The default threshold for pretty printing. These variables can be overwritten\n",
+    "// at run-time via the set_printoptions() function\n",
+    "#define ULAB_HAS_PRINTOPTIONS               (1)\n",
+    "#define NDARRAY_PRINT_THRESHOLD             10\n",
+    "#define NDARRAY_PRINT_EDGEITEMS             3\n",
+    "\n",
+    "// determines, whether the dtype is an object, or simply a character\n",
+    "// the object implementation is numpythonic, but requires more space\n",
+    "#define ULAB_HAS_DTYPE_OBJECT               (0)\n",
+    "\n",
+    "// the ndarray binary operators\n",
+    "#define NDARRAY_HAS_BINARY_OPS              (1)\n",
+    "\n",
+    "// Firmware size can be reduced at the expense of speed by using function\n",
+    "// pointers in iterations. For each operator, the function pointer saves around\n",
+    "// 2 kB in the two-dimensional case, and around 4 kB in the four-dimensional case.\n",
+    "\n",
+    "#define NDARRAY_BINARY_USES_FUN_POINTER     (0)\n",
+    "\n",
+    "#define NDARRAY_HAS_BINARY_OP_ADD           (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_EQUAL         (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_LESS          (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_LESS_EQUAL    (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_MORE          (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_MORE_EQUAL    (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_MULTIPLY      (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_NOT_EQUAL     (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_POWER         (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_SUBTRACT      (1)\n",
+    "#define NDARRAY_HAS_BINARY_OP_TRUE_DIVIDE   (1)\n",
+    "...     \n",
+    "```\n",
+    "\n",
+    "The meaning of the flags with `_HAS_` in their names should be obvious, so we will just explain the other options. \n",
+    "\n",
+    "To see how much you can gain by un-setting the functions that you do not need, here are some pointers. In four dimensions, including all functions adds around 120 kB to the `micropython` firmware. On the other hand, if you are interested in Fourier transforms only, and strip everything else, you get away with less than 5 kB extra. \n",
+    "\n",
+    "## Compatibility with numpy\n",
+    "\n",
+    "The functions implemented in `ulab` are organised in three sub-modules at the C level, namely, `numpy`, `scipy`, and `user`. This modularity is elevated to `python`, meaning that in order to use functions that are part of `numpy`, you have to import `numpy` as\n",
+    "\n",
+    "```python\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "x = np.array([4, 5, 6])\n",
+    "p = np.array([1, 2, 3])\n",
+    "np.polyval(p, x)\n",
+    "```\n",
+    "\n",
+    "There are a couple of exceptions to this rule, namely `fft`, and `linalg`, which are sub-modules even in `numpy`, thus you have to write them out as \n",
+    "\n",
+    "```python\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "A = np.array([1, 2, 3, 4]).reshape((2, 2))\n",
+    "np.linalg.trace(A)\n",
+    "```\n",
+    "\n",
+    "Some of the functions in `ulab` are re-implementations of `scipy` functions, and they are to be imported as \n",
+    "\n",
+    "```python\n",
+    "from ulab import numpy as np\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "\n",
+    "x = np.array([1, 2, 3])\n",
+    "spy.special.erf(x)\n",
+    "```\n",
+    "\n",
+    "`numpy`-compatibility has an enormous benefit: by `try`ing to `import`, we can guarantee that the same, unmodified code runs in `CPython` as in `micropython`. The following snippet is platform-independent, thus the `python` code can be tested and debugged on a computer before loading it onto the microcontroller.\n",
+    "\n",
+    "```python\n",
+    "\n",
+    "try:\n",
+    "    from ulab import numpy as np\n",
+    "    from ulab import scipy as spy\n",
+    "except ImportError:\n",
+    "    import numpy as np\n",
+    "    import scipy as spy\n",
+    "    \n",
+    "x = np.array([1, 2, 3])\n",
+    "spy.special.erf(x)    \n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The impact of dimensionality\n",
+    "\n",
+    "### Reducing the number of dimensions\n",
+    "\n",
+    "`ulab` supports tensors of rank four, but this is expensive in terms of flash: with all available functions and options, the library adds around 100 kB to the firmware. However, if such high dimensions are not required, significant reductions in size can be achieved by changing the value of \n",
+    "\n",
+    "```c\n",
+    "#define ULAB_MAX_DIMS                   2\n",
+    "```\n",
+    "\n",
+    "Two dimensions cost a bit more than half of four, while you can get away with around 20 kB of flash in one dimension, because all those functions that don't make sense (e.g., matrix inversion, eigenvalues etc.) are automatically stripped from the firmware.\n",
+    "\n",
+    "### Using the function iterator\n",
+    "\n",
+    "In higher dimensions, the firmware size increases, because each dimension (axis) adds another level of nested loops. An example of this is the macro of the binary operator in three dimensions\n",
+    "\n",
+    "```c\n",
+    "#define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\n",
+    "    type_out *array = (type_out *)results->array;\n",
+    "    size_t j = 0;\n",
+    "    do {\n",
+    "        size_t k = 0;\n",
+    "        do {\n",
+    "            size_t l = 0;\n",
+    "            do {\n",
+    "                *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\n",
+    "                (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\n",
+    "                (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\n",
+    "                l++;\n",
+    "            } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\n",
+    "            (larray) -= (lstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\n",
+    "            (larray) += (lstrides)[ULAB_MAX_DIMS - 2];\n",
+    "            (rarray) -= (rstrides)[ULAB_MAX_DIMS - 1] * (results)->shape[ULAB_MAX_DIMS-1];\n",
+    "            (rarray) += (rstrides)[ULAB_MAX_DIMS - 2];\n",
+    "            k++;\n",
+    "        } while(k < (results)->shape[ULAB_MAX_DIMS - 2]);\n",
+    "        (larray) -= (lstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];\n",
+    "        (larray) += (lstrides)[ULAB_MAX_DIMS - 3];\n",
+    "        (rarray) -= (rstrides)[ULAB_MAX_DIMS - 2] * results->shape[ULAB_MAX_DIMS-2];\n",
+    "        (rarray) += (rstrides)[ULAB_MAX_DIMS - 3];\n",
+    "        j++;\n",
+    "    } while(j < (results)->shape[ULAB_MAX_DIMS - 3]);\n",
+    "```\n",
+    "\n",
+    "In order to reduce firmware size, it *might* make sense in higher dimensions to make use of the function iterator by setting the \n",
+    "\n",
+    "```c\n",
+    "#define ULAB_HAS_FUNCTION_ITERATOR      (1)\n",
+    "```\n",
+    "\n",
+    "constant to 1. The macro then calls the `ndarray_rewind_array` function, so that the nested loops for `k`, and `j` do not have to be written out. Instead of the macro above, we now have \n",
+    "\n",
+    "```c\n",
+    "#define BINARY_LOOP(results, type_out, type_left, type_right, larray, lstrides, rarray, rstrides, OPERATOR)\n",
+    "    type_out *array = (type_out *)(results)->array;\n",
+    "    size_t *lcoords = ndarray_new_coords((results)->ndim);\n",
+    "    size_t *rcoords = ndarray_new_coords((results)->ndim);\n",
+    "    for(size_t i=0; i < (results)->len/(results)->shape[ULAB_MAX_DIMS -1]; i++) {\n",
+    "        size_t l = 0;\n",
+    "        do {\n",
+    "            *array++ = *((type_left *)(larray)) OPERATOR *((type_right *)(rarray));\n",
+    "            (larray) += (lstrides)[ULAB_MAX_DIMS - 1];\n",
+    "            (rarray) += (rstrides)[ULAB_MAX_DIMS - 1];\n",
+    "            l++;\n",
+    "        } while(l < (results)->shape[ULAB_MAX_DIMS - 1]);\n",
+    "        ndarray_rewind_array((results)->ndim, larray, (results)->shape, lstrides, lcoords);\n",
+    "        ndarray_rewind_array((results)->ndim, rarray, (results)->shape, rstrides, rcoords);\n",
+    "    } while(0)\n",
+    "```\n",
+    "\n",
+    "Since the `ndarray_rewind_array` function is implemented only once, a lot of space can be saved. Obviously, function calls cost time, thus such trade-offs must be evaluated for each application. The gain also depends on which functions and features you include. Operators and functions that involve two arrays are expensive, because at the C level, the number of cases that must be handled scales with the square of the number of data types. As an example, the innocent-looking expression\n",
+    "\n",
+    "```python\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([1, 2, 3])\n",
+    "b = np.array([4, 5, 6])\n",
+    "\n",
+    "c = a + b\n",
+    "```\n",
+    "requires 25 loops in C, because the `dtypes` of both `a`, and `b` can assume 5 different values, and the addition has to be resolved for all possible cases. Hint: each binary operator costs between 3 and 4 kB in two dimensions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The ulab version string\n",
+    "\n",
+    "As is customary with `python` packages, information on the package version can be found by querying the `__version__` string. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-12T06:25:27.328061Z",
+     "start_time": "2021-01-12T06:25:27.308199Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "you are running ulab version 2.1.0-2D\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "import ulab\n",
+    "\n",
+    "print('you are running ulab version', ulab.__version__)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The first three numbers indicate the major, minor, and sub-minor versions of `ulab` (defined by the `ULAB_VERSION` constant in [ulab.c](https://github.com/v923z/micropython-ulab/blob/master/code/ulab.c)). We usually change the minor version, whenever a new function is added to the code, and the sub-minor version will be incremented, if a bug fix is implemented. \n",
+    "\n",
+    "`2D` tells us that the particular firmware supports tensors of rank 2 (defined by `ULAB_MAX_DIMS` in [ulab.h](https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h)). \n",
+    "\n",
+    "If you find a bug, please, include the version string in your report!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Should you need the numerical value of `ULAB_MAX_DIMS`, you can get it from the version string in the following way:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-13T06:00:00.616473Z",
+     "start_time": "2021-01-13T06:00:00.602787Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "version string:  2.1.0-2D\n",
+      "version dimensions:  2D\n",
+      "numerical value of dimensions:  2\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "import ulab\n",
+    "\n",
+    "version = ulab.__version__\n",
+    "version_dims = version.split('-')[1]\n",
+    "version_num = int(version_dims.replace('D', ''))\n",
+    "\n",
+    "print('version string: ', version)\n",
+    "print('version dimensions: ', version_dims)\n",
+    "print('numerical value of dimensions: ', version_num)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Finding out what your firmware supports\n",
+    "\n",
+    "`ulab` implements a number of array operators and functions, but this does not mean that all of these functions and methods are actually compiled into the firmware. You can fine-tune your firmware by setting/unsetting any of the `_HAS_` constants in [ulab.h](https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h). \n",
+    "\n",
+    "### Functions included in the firmware\n",
+    "\n",
+    "The version string will not tell you everything about your firmware, because the supported functions and sub-modules can still arbitrarily be included or excluded. One way of finding out what is compiled into the firmware is calling `dir` with the `numpy` or `scipy` module of `ulab`, or one of their sub-modules, as its argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:47:37.963507Z",
+     "start_time": "2021-01-08T12:47:37.936641Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "===== constants, functions, and modules of numpy =====\n",
+      "\n",
+      " ['__class__', '__name__', 'bool', 'sort', 'sum', 'acos', 'acosh', 'arange', 'arctan2', 'argmax', 'argmin', 'argsort', 'around', 'array', 'asin', 'asinh', 'atan', 'atanh', 'ceil', 'clip', 'concatenate', 'convolve', 'cos', 'cosh', 'cross', 'degrees', 'diag', 'diff', 'e', 'equal', 'exp', 'expm1', 'eye', 'fft', 'flip', 'float', 'floor', 'frombuffer', 'full', 'get_printoptions', 'inf', 'int16', 'int8', 'interp', 'linalg', 'linspace', 'log', 'log10', 'log2', 'logspace', 'max', 'maximum', 'mean', 'median', 'min', 'minimum', 'nan', 'ndinfo', 'not_equal', 'ones', 'pi', 'polyfit', 'polyval', 'radians', 'roll', 'set_printoptions', 'sin', 'sinh', 'sqrt', 'std', 'tan', 'tanh', 'trapz', 'uint16', 'uint8', 'vectorize', 'zeros']\n",
+      "\n",
+      "functions included in the fft module:\n",
+      " ['__class__', '__name__', 'fft', 'ifft']\n",
+      "\n",
+      "functions included in the linalg module:\n",
+      " ['__class__', '__name__', 'cholesky', 'det', 'dot', 'eig', 'inv', 'norm', 'trace']\n",
+      "\n",
+      "\n",
+      "===== modules of scipy =====\n",
+      "\n",
+      " ['__class__', '__name__', 'optimize', 'signal', 'special']\n",
+      "\n",
+      "functions included in the optimize module:\n",
+      " ['__class__', '__name__', 'bisect', 'fmin', 'newton']\n",
+      "\n",
+      "functions included in the signal module:\n",
+      " ['__class__', '__name__', 'sosfilt', 'spectrogram']\n",
+      "\n",
+      "functions included in the special module:\n",
+      " ['__class__', '__name__', 'erf', 'erfc', 'gamma', 'gammaln']\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "from ulab import scipy as spy\n",
+    "\n",
+    "\n",
+    "print('===== constants, functions, and modules of numpy =====\\n\\n', dir(np))\n",
+    "\n",
+    "# since fft and linalg are sub-modules, print them separately\n",
+    "print('\\nfunctions included in the fft module:\\n', dir(np.fft))\n",
+    "print('\\nfunctions included in the linalg module:\\n', dir(np.linalg))\n",
+    "\n",
+    "print('\\n\\n===== modules of scipy =====\\n\\n', dir(spy))\n",
+    "print('\\nfunctions included in the optimize module:\\n', dir(spy.optimize))\n",
+    "print('\\nfunctions included in the signal module:\\n', dir(spy.signal))\n",
+    "print('\\nfunctions included in the special module:\\n', dir(spy.special))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Methods included in the firmware\n",
+    "\n",
+    "The `dir` function applied to the module or its sub-modules gives information on what the module and sub-modules include, but is not enough to find out which methods the `ndarray` class supports. We can list the methods by calling `dir` with the `array` object itself:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:48:17.927709Z",
+     "start_time": "2021-01-08T12:48:17.903132Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['__class__', '__name__', 'copy', 'sort', '__bases__', '__dict__', 'dtype', 'flatten', 'itemsize', 'reshape', 'shape', 'size', 'strides', 'tobytes', 'transpose']\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "print(dir(np.array))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Operators included in the firmware\n",
+    "\n",
+    "A list of operators cannot be generated as shown above. If you really need to find out, whether, e.g., the `**` operator is supported by the firmware, you have to `try` it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2021-01-08T12:49:59.902054Z",
+     "start_time": "2021-01-08T12:49:59.875760Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "operator is not supported:  unsupported types for __pow__: 'ndarray', 'ndarray'\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "from ulab import numpy as np\n",
+    "\n",
+    "a = np.array([1, 2, 3])\n",
+    "b = np.array([4, 5, 6])\n",
+    "\n",
+    "try:\n",
+    "    print(a ** b)\n",
+    "except Exception as e:\n",
+    "    print('operator is not supported: ', e)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The exception above would be raised, if the firmware was compiled with the \n",
+    "\n",
+    "```c\n",
+    "#define NDARRAY_HAS_BINARY_OP_POWER         (0)\n",
+    "```\n",
+    "\n",
+    "definition."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/ulab-manual.ipynb
+++ b/docs/ulab-manual.ipynb
--- a/docs/ulab-ndarray.ipynb
+++ b/docs/ulab-ndarray.ipynb
--- a/docs/ulab-numerical.ipynb
+++ b/docs/ulab-numerical.ipynb
--- a/docs/ulab-poly.ipynb
+++ b/docs/ulab-poly.ipynb
@ -0,0 +1,454 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-01T09:27:13.438054Z",
+     "start_time": "2020-05-01T09:27:13.191491Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notebook magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-08-03T18:32:45.342280Z",
+     "start_time": "2020-08-03T18:32:45.338442Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.magic import Magics, magics_class, line_cell_magic\n",
+    "from IPython.core.magic import cell_magic, register_cell_magic, register_line_magic\n",
+    "from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring\n",
+    "import subprocess\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-07-23T20:31:25.296014Z",
+     "start_time": "2020-07-23T20:31:25.265937Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "@magics_class\n",
+    "class PyboardMagic(Magics):\n",
+    "    @cell_magic\n",
+    "    @magic_arguments()\n",
+    "    @argument('-skip')\n",
+    "    @argument('-unix')\n",
+    "    @argument('-pyboard')\n",
+    "    @argument('-file')\n",
+    "    @argument('-data')\n",
+    "    @argument('-time')\n",
+    "    @argument('-memory')\n",
+    "    def micropython(self, line='', cell=None):\n",
+    "        args = parse_argstring(self.micropython, line)\n",
+    "        if args.skip: # doesn't care about the cell's content\n",
+    "            print('skipped execution')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.unix: # tests the code on the unix port. Note that this works on unix only\n",
+    "            with open('/dev/shm/micropython.py', 'w') as fout:\n",
+    "                fout.write(cell)\n",
+    "            proc = subprocess.Popen([\"../../micropython/ports/unix/micropython\", \"/dev/shm/micropython.py\"], \n",
+    "                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
+    "            print(proc.stdout.read().decode(\"utf-8\"))\n",
+    "            print(proc.stderr.read().decode(\"utf-8\"))\n",
+    "            return None\n",
+    "        if args.file: # can be used to copy the cell content onto the pyboard's flash\n",
+    "            spaces = \"    \"\n",
+    "            try:\n",
+    "                with open(args.file, 'w') as fout:\n",
+    "                    fout.write(cell.replace('\\t', spaces))\n",
+    "                    print('written cell to {}'.format(args.file))\n",
+    "            except:\n",
+    "                print('Failed to write to disc!')\n",
+    "            return None # do not parse the rest\n",
+    "        if args.data: # can be used to load data from the pyboard directly into kernel space\n",
+    "            message = pyb.exec(cell)\n",
+    "            if len(message) == 0:\n",
+    "                print('pyboard >>>')\n",
+    "            else:\n",
+    "                print(message.decode('utf-8'))\n",
+    "                # register new variable in user namespace\n",
+    "                self.shell.user_ns[args.data] = string_to_matrix(message.decode(\"utf-8\"))\n",
+    "        \n",
+    "        if args.time: # measures the time of executions\n",
+    "            pyb.exec('import utime')\n",
+    "            message = pyb.exec('t = utime.ticks_us()\\n' + cell + '\\ndelta = utime.ticks_diff(utime.ticks_us(), t)' + \n",
+    "                               \"\\nprint('execution time: {:d} us'.format(delta))\")\n",
+    "            print(message.decode('utf-8'))\n",
+    "        \n",
+    "        if args.memory: # prints out memory information \n",
+    "            message = pyb.exec('from micropython import mem_info\\nprint(mem_info())\\n')\n",
+    "            print(\"memory before execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(\">>> \", message.decode('utf-8'))\n",
+    "            message = pyb.exec('print(mem_info())')\n",
+    "            print(\"memory after execution:\\n========================\\n\", message.decode('utf-8'))\n",
+    "\n",
+    "        if args.pyboard:\n",
+    "            message = pyb.exec(cell)\n",
+    "            print(message.decode('utf-8'))\n",
+    "\n",
+    "ip = get_ipython()\n",
+    "ip.register_magics(PyboardMagic)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## pyboard"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:35.126401Z",
+     "start_time": "2020-05-07T07:35:35.105824Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pyboard\n",
+    "pyb = pyboard.Pyboard('/dev/ttyACM0')\n",
+    "pyb.enter_raw_repl()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-19T19:11:18.145548Z",
+     "start_time": "2020-05-19T19:11:18.137468Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pyb.exit_raw_repl()\n",
+    "pyb.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-05-07T07:35:38.725924Z",
+     "start_time": "2020-05-07T07:35:38.645488Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import utime\n",
+    "import ulab as np\n",
+    "\n",
+    "def timeit(n=1000):\n",
+    "    def wrapper(f, *args, **kwargs):\n",
+    "        func_name = str(f).split(' ')[1]\n",
+    "        def new_func(*args, **kwargs):\n",
+    "            run_times = np.zeros(n, dtype=np.uint16)\n",
+    "            for i in range(n):\n",
+    "                t = utime.ticks_us()\n",
+    "                result = f(*args, **kwargs)\n",
+    "                run_times[i] = utime.ticks_diff(utime.ticks_us(), t)\n",
+    "            print('{}() execution times based on {} cycles'.format(func_name, n))\n",
+    "            print('\\tbest: %d us'%np.min(run_times))\n",
+    "            print('\\tworst: %d us'%np.max(run_times))\n",
+    "            print('\\taverage: %d us'%np.mean(run_times))\n",
+    "            print('\\tdeviation: +/-%.3f us'%np.std(run_times))            \n",
+    "            return result\n",
+    "        return new_func\n",
+    "    return wrapper\n",
+    "\n",
+    "def timeit(f, *args, **kwargs):\n",
+    "    func_name = str(f).split(' ')[1]\n",
+    "    def new_func(*args, **kwargs):\n",
+    "        t = utime.ticks_us()\n",
+    "        result = f(*args, **kwargs)\n",
+    "        print('execution time: ', utime.ticks_diff(utime.ticks_us(), t), ' us')\n",
+    "        return result\n",
+    "    return new_func"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Polynomials\n",
+    "\n",
+    "Functions in the polynomial sub-module can be invoked by importing the module first."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## polyval\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyval.html\n",
+    "\n",
+    "`polyval` takes two arguments, both arrays or other iterables."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 187,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-11-01T12:53:22.448303Z",
+     "start_time": "2019-11-01T12:53:22.435176Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "coefficients:  [1, 1, 1, 0]\n",
+      "independent values:  [0, 1, 2, 3, 4]\n",
+      "\n",
+      "values of p(x):  array([0.0, 3.0, 14.0, 39.0, 84.0], dtype=float)\n",
+      "\n",
+      "ndarray (a):  array([0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)\n",
+      "value of p(a):  array([0.0, 3.0, 14.0, 39.0, 84.0], dtype=float)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "import ulab as np\n",
+    "from ulab import poly\n",
+    "\n",
+    "p = [1, 1, 1, 0]\n",
+    "x = [0, 1, 2, 3, 4]\n",
+    "print('coefficients: ', p)\n",
+    "print('independent values: ', x)\n",
+    "print('\\nvalues of p(x): ', poly.polyval(p, x))\n",
+    "\n",
+    "# the same works with one-dimensional ndarrays\n",
+    "a = np.array(x)\n",
+    "print('\\nndarray (a): ', a)\n",
+    "print('value of p(a): ', poly.polyval(p, a))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## polyfit\n",
+    "\n",
+    "`numpy`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html\n",
+    "\n",
+    "`polyfit` takes two or three arguments. The last one is the degree of the polynomial that will be fitted, and the last but one is an array or iterable with the `y` (dependent) values. The first one, an array or iterable with the `x` (independent) values, can be dropped; in that case, `x` will be generated in the function, assuming uniform sampling. \n",
+    "\n",
+    "If the lengths of `x` and `y` are not the same, the function raises a `ValueError`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 189,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-11-01T12:54:08.326802Z",
+     "start_time": "2019-11-01T12:54:08.311182Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "independent values:\t array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=float)\n",
+      "dependent values:\t array([9.0, 4.0, 1.0, 0.0, 1.0, 4.0, 9.0], dtype=float)\n",
+      "fitted values:\t\t array([1.0, -6.0, 9.000000000000004], dtype=float)\n",
+      "\n",
+      "dependent values:\t array([9.0, 4.0, 1.0, 0.0, 1.0, 4.0, 9.0], dtype=float)\n",
+      "fitted values:\t\t array([1.0, -6.0, 9.000000000000004], dtype=float)\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -unix 1\n",
+    "\n",
+    "import ulab as np\n",
+    "from ulab import poly\n",
+    "\n",
+    "x = np.array([0, 1, 2, 3, 4, 5, 6])\n",
+    "y = np.array([9, 4, 1, 0, 1, 4, 9])\n",
+    "print('independent values:\\t', x)\n",
+    "print('dependent values:\\t', y)\n",
+    "print('fitted values:\\t\\t', poly.polyfit(x, y, 2))\n",
+    "\n",
+    "# the same with missing x\n",
+    "print('\\ndependent values:\\t', y)\n",
+    "print('fitted values:\\t\\t', poly.polyfit(y, 2))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Execution time\n",
+    "\n",
+    "`polyfit` is based on the inversion of a matrix (there is more on the background in  https://en.wikipedia.org/wiki/Polynomial_regression), and it requires the intermediate storage of `2*N*(deg+1)` floats, where `N` is the number of entries in the input array, and `deg` is the fit's degree. The additional computation costs of the matrix inversion discussed in [inv](#inv) also apply. The example from above needs around 150 microseconds to return:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 560,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2019-10-20T07:24:39.002243Z",
+     "start_time": "2019-10-20T07:24:38.978687Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "execution time:  153  us\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%micropython -pyboard 1\n",
+    "\n",
+    "import ulab as np\n",
+    "from ulab import poly\n",
+    "\n",
+    "@timeit\n",
+    "def time_polyfit(x, y, n):\n",
+    "    return poly.polyfit(x, y, n)\n",
+    "\n",
+    "x = np.array([0, 1, 2, 3, 4, 5, 6])\n",
+    "y = np.array([9, 4, 1, 0, 1, 4, 9])\n",
+    "\n",
+    "time_polyfit(x, y, 2)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "382.797px"
+   },
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/ulab-programming.ipynb
+++ b/docs/ulab-programming.ipynb
@ -0,0 +1,798 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-10-25T21:25:53.804315Z",
+     "start_time": "2020-10-25T21:25:43.765649Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Populating the interactive namespace from numpy and matplotlib\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__END_OF_DEFS__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Programming ulab\n",
+    "\n",
+    "Earlier we have seen how `ulab`'s functions and methods can be accessed in `micropython`. This last section of the book explains how these functions are implemented. By the end of this chapter, not only will you be able to extend `ulab` and write your own `numpy`-compatible functions, but through a deeper understanding of the inner workings of the functions, you will also be able to see what the trade-offs are at the `python` level.\n",
+    "\n",
+    "\n",
+    "## Code organisation\n",
+    "\n",
+    "As mentioned earlier, the `python` functions are organised into sub-modules at the C level. The C sub-modules can be found in `./ulab/code/`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The `ndarray` object"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### General comments\n",
+    "\n",
+    "`ndarrays` are efficient containers of numerical data of the same type (i.e., signed/unsigned chars, signed/unsigned integers or `mp_float_t`s, which, depending on the platform, are either C `float`s, or C `double`s). Beyond storing the actual data in the void pointer `*array`, the type definition has seven additional members (on top of the `base` type): the `dtype`, which tells us how the bytes are to be interpreted; the `itemsize`, which stores the size of a single entry in the array; `boolean`, an unsigned integer, which determines whether the array is to be treated as a set of Booleans or as numerical data; `ndim`, the number of dimensions (`uint8_t`); `len`, the length of the array (the number of entries); the `shape` (an array of `size_t`); and the `strides` (an array of `int32_t`). The length is simply the product of the numbers in `shape`.\n",
+    "\n",
+    "The type definition is as follows:\n",
+    "\n",
+    "```c\n",
+    "typedef struct _ndarray_obj_t {\n",
+    "    mp_obj_base_t base;\n",
+    "    uint8_t dtype;\n",
+    "    uint8_t itemsize;\n",
+    "    uint8_t boolean;\n",
+    "    uint8_t ndim;\n",
+    "    size_t len;\n",
+    "    size_t shape[ULAB_MAX_DIMS];\n",
+    "    int32_t strides[ULAB_MAX_DIMS];\n",
+    "    void *array;\n",
+    "} ndarray_obj_t;\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Memory layout\n",
+    "\n",
+    "The values of an `ndarray` are stored in a contiguous segment in the RAM. The `ndarray` can be dense, meaning that all numbers in the linear memory segment belong to a linear combination of coordinates, and it can also be sparse, i.e., some elements of the linear storage space will be skipped when the elements of the tensor are traversed. \n",
+    "\n",
+    "In the RAM, the position of the item $M(n_1, n_2, ..., n_{k-1}, n_k)$ in a dense tensor of rank $k$ is given by the linear combination \n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(n_1, n_2, ..., n_{k-1}, n_k) = n_1 s_1 + n_2 s_2 + ... + n_{k-1}s_{k-1} + n_ks_k = \\sum_{i=1}^{k}n_is_i\n",
+    "\\end{equation}\n",
+    "where $s_i$ are the strides of the tensor, defined as \n",
+    "\n",
+    "\\begin{equation}\n",
+    "s_i = \\prod_{j=i+1}^k l_j\n",
+    "\\end{equation}\n",
+    "\n",
+    "where $l_j$ is the length of the tensor along the $j$th axis. When the tensor is sparse (e.g., when the tensor is sliced), the strides along a particular axis will be multiplied by a non-zero integer. If this integer is different to $\\pm 1$, the linear combination above cannot access all elements in the RAM, i.e., some numbers will be skipped. Note that $|s_1| > |s_2| > ... > |s_{k-1}| > |s_k|$, even if the tensor is sparse. The statement is trivial for dense tensors, and it follows from the definition of $s_i$: for dense tensors, $s_i/s_{i+1} = l_{i+1}$. For sparse tensors, a slice cannot have a step larger than the shape along that axis. \n",
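+    "\n",
+    "As a minimal worked illustration (the shape below is a hypothetical example, not taken from the `ulab` sources), consider a dense tensor of rank $k = 3$ with lengths $(l_1, l_2, l_3) = (2, 3, 4)$. Counting positions in units of items, and taking the empty product for $s_3$ to be 1, the definition above gives\n",
+    "\n",
+    "\\begin{equation}\n",
+    "s_1 = l_2 l_3 = 12, \\qquad s_2 = l_3 = 4, \\qquad s_3 = 1\n",
+    "\\end{equation}\n",
+    "\n",
+    "so that the last item of the tensor sits at\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(1, 2, 3) = 1\\cdot 12 + 2\\cdot 4 + 3\\cdot 1 = 23\n",
+    "\\end{equation}\n",
+    "\n",
+    "which is indeed the last of the $2\\cdot 3\\cdot 4 = 24$ positions, if counting starts at 0. \n",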
+    "\n",
+    "When creating a *view*, we simply re-calculate the `strides`, and re-set the `*array` pointer.\n",
+    "\n",
+    "## Iterating over elements of a tensor\n",
+    "\n",
+    "The `shape` and `strides` members of the array tell us how we have to move our pointer, when we want to read out the numbers. For technical reasons that will become clear later, the numbers in `shape` and in `strides` are aligned to the right, and begin on the right hand side, i.e., if the number of possible dimensions is `ULAB_MAX_DIMS`, then `shape[ULAB_MAX_DIMS-1]` is the length of the last axis, `shape[ULAB_MAX_DIMS-2]` is the length of the last but one axis, and so on. If the number of actual dimensions, `ndim < ULAB_MAX_DIMS`, the first `ULAB_MAX_DIMS - ndim` entries in `shape` and `strides` will be equal to zero, but they could, in fact, be assigned any value, because these will never be accessed in an operation.\n",
+    "\n",
+    "With this definition of the strides, the linear combination in $P(n_1, n_2, ..., n_{k-1}, n_k)$ is a one-to-one mapping between the space of tensor coordinates, $(n_1, n_2, ..., n_{k-1}, n_k)$, and the coordinate in the linear array, $n_1s_1 + n_2s_2 + ... + n_{k-1}s_{k-1} + n_ks_k$, i.e., no two distinct sets of coordinates will result in the same position in the linear array. \n",
+    "\n",
+    "Since the `strides` are given in terms of bytes, when we iterate over an array, the void data pointer is usually cast to `uint8_t *`, and the values are converted using the proper data type stored in `ndarray->dtype`. However, there might be cases when it makes perfect sense to cast `*array` to a different type, in which case the `strides` have to be re-scaled by the value of `ndarray->itemsize`.\n",
+    "\n",
+    "### Iterating using the unwrapped loops\n",
+    "\n",
+    "The following macro definition is taken from [vector.h](https://github.com/v923z/micropython-ulab/blob/master/code/numpy/vector/vector.h), and demonstrates, how we can iterate over a single array in four dimensions. \n",
+    "\n",
+    "```c\n",
+    "#define ITERATE_VECTOR(type, array, source, sarray) do {\n",
+    "    size_t i=0;\n",
+    "    do {\n",
+    "        size_t j = 0;\n",
+    "        do {\n",
+    "            size_t k = 0;\n",
+    "            do {\n",
+    "                size_t l = 0;\n",
+    "                do {\n",
+    "                    *(array)++ = f(*((type *)(sarray)));\n",
+    "                    (sarray) += (source)->strides[ULAB_MAX_DIMS - 1];\n",
+    "                    l++;\n",
+    "                } while(l < (source)->shape[ULAB_MAX_DIMS-1]);\n",
+    "                (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];\n",
+    "                (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];\n",
+    "                k++;\n",
+    "            } while(k < (source)->shape[ULAB_MAX_DIMS-2]);\n",
+    "            (sarray) -= (source)->strides[ULAB_MAX_DIMS - 2] * (source)->shape[ULAB_MAX_DIMS-2];\n",
+    "            (sarray) += (source)->strides[ULAB_MAX_DIMS - 3];\n",
+    "            j++;\n",
+    "        } while(j < (source)->shape[ULAB_MAX_DIMS-3]);\n",
+    "        (sarray) -= (source)->strides[ULAB_MAX_DIMS - 3] * (source)->shape[ULAB_MAX_DIMS-3];\n",
+    "        (sarray) += (source)->strides[ULAB_MAX_DIMS - 4];\n",
+    "        i++;\n",
+    "    } while(i < (source)->shape[ULAB_MAX_DIMS-4]);\n",
+    "} while(0)\n",
+    "```\n",
+    "\n",
+    "We start with the innermost loop, the one running over `l`. `array` is already of type `mp_float_t`, while the source array, `sarray`, has been cast to `uint8_t` in the calling function. The numbers contained in `sarray` have to be read out in the proper type dictated by `ndarray->dtype`. This is what happens in the statement `*((type *)(sarray))`, and this number is then fed into the function `f`. Vectorised mathematical functions produce *dense* arrays, and for this reason, we can simply advance the `array` pointer. \n",
+    "\n",
+    "The advancing of the `sarray` pointer is a bit more involved: first, in the innermost loop, we simply move forward by the amount given by the last stride, which is `(source)->strides[ULAB_MAX_DIMS - 1]`, because the `shape` and the `strides` are aligned to the right. We move the pointer as many times as given by `(source)->shape[ULAB_MAX_DIMS-1]`, which is the length of the very last axis. Hence the structure of the loop\n",
+    "\n",
+    "```c\n",
+    "    size_t l = 0;\n",
+    "    do {\n",
+    "        ...\n",
+    "        l++;\n",
+    "    } while(l < (source)->shape[ULAB_MAX_DIMS-1]);\n",
+    "\n",
+    "```\n",
+    "Once we have exhausted the last axis, we have to re-wind the pointer, and advance it by an amount given by the last but one stride. Keep in mind that in the innermost loop we moved our pointer `(source)->shape[ULAB_MAX_DIMS-1]` times by `(source)->strides[ULAB_MAX_DIMS - 1]`, i.e., we re-wind it by moving it backwards by `(source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1]`. In the next step, we move forward by `(source)->strides[ULAB_MAX_DIMS - 2]`, which is the last but one stride. \n",
+    "\n",
+    "\n",
+    "```c\n",
+    "    (sarray) -= (source)->strides[ULAB_MAX_DIMS - 1] * (source)->shape[ULAB_MAX_DIMS-1];\n",
+    "    (sarray) += (source)->strides[ULAB_MAX_DIMS - 2];\n",
+    "\n",
+    "```\n",
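+    "\n",
+    "To see the arithmetic at work, take, as an assumed example, a dense source of shape $(2, 3)$ with an item size of 1 byte, so that the last two strides are 3 and 1. Starting from offset 0, the innermost loop and the two statements above move the pointer as\n",
+    "\n",
+    "\\begin{equation}\n",
+    "0 \\xrightarrow{3 \\times (+1)} 3 \\xrightarrow{-1\\cdot 3} 0 \\xrightarrow{+3} 3\n",
+    "\\end{equation}\n",
+    "\n",
+    "i.e., after the re-winding, the pointer sits at the first item of the second row. \n",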
+    "\n",
+    "This pattern must be repeated for each axis of the array, and this is how we arrive at the four nested loops listed above."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Re-winding arrays by means of a function\n",
+    "\n",
+    "\n",
+    "In addition to un-wrapping the iteration loops by means of macros, there is another way of traversing all elements of a tensor: we note that, since $|s_1| > |s_2| > ... > |s_{k-1}| > |s_k|$, $P(n_1, n_2, ..., n_{k-1}, n_k)$ changes most slowly in the last coordinate. Hence, if we start from the very beginning, ($n_i = 0$ for all $i$), and walk along the linear RAM segment, we increment the value of $n_k$ as long as $n_k < l_k$. Once $n_k = l_k$, we have to reset $n_k$ to 0, and increment $n_{k-1}$ by one. After each such round, $n_{k-1}$ will be incremented by one, as long as $n_{k-1} < l_{k-1}$. Once $n_{k-1} = l_{k-1}$, we reset both $n_k$, and $n_{k-1}$ to 0, and increment $n_{k-2}$ by one. \n",
+    "\n",
+    "Rewinding the arrays in this way is implemented in the function `ndarray_rewind_array` in [ndarray.c](https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c). \n",
+    "\n",
+    "```c\n",
+    "void ndarray_rewind_array(uint8_t ndim, uint8_t *array, size_t *shape, int32_t *strides, size_t *coords) {\n",
+    "    // resets the data pointer of a single array, whenever an axis is full\n",
+    "    // since we always iterate over the very last axis, we have to keep track of\n",
+    "    // the last ndim-2 axes only\n",
+    "    array -= shape[ULAB_MAX_DIMS - 1] * strides[ULAB_MAX_DIMS - 1];\n",
+    "    array += strides[ULAB_MAX_DIMS - 2];\n",
+    "    for(uint8_t i=1; i < ndim-1; i++) {\n",
+    "        coords[ULAB_MAX_DIMS - 1 - i] += 1;\n",
+    "        if(coords[ULAB_MAX_DIMS - 1 - i] == shape[ULAB_MAX_DIMS - 1 - i]) { // we are at a dimension boundary\n",
+    "            array -= shape[ULAB_MAX_DIMS - 1 - i] * strides[ULAB_MAX_DIMS - 1 - i];\n",
+    "            array += strides[ULAB_MAX_DIMS - 2 - i];\n",
+    "            coords[ULAB_MAX_DIMS - 1 - i] = 0;\n",
+    "            coords[ULAB_MAX_DIMS - 2 - i] += 1;\n",
+    "        } else { // coordinates can change only, if the last coordinate changes\n",
+    "            return;\n",
+    "        }\n",
+    "    }\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "and the function would be called as in the snippet below. Note that the innermost loop is factored out, so that we can save the `if(...)` statement for the last axis.\n",
+    "\n",
+    "```c\n",
+    "    size_t *coords = ndarray_new_coords(results->ndim);\n",
+    "    for(size_t i=0; i < results->len/results->shape[ULAB_MAX_DIMS -1]; i++) {\n",
+    "        size_t l = 0;\n",
+    "        do {\n",
+    "            ...\n",
+    "            l++;\n",
+    "        } while(l < results->shape[ULAB_MAX_DIMS - 1]);\n",
+    "        ndarray_rewind_array(results->ndim, array, results->shape, strides, coords);\n",
+    "    }\n",
+    "\n",
+    "```\n",
+    "\n",
+    "The advantage of this method is that the implementation is independent of the number of dimensions: the iteration requires more or less the same flash space for 2 dimensions as for 22. However, the price we have to pay for this convenience is the extra function call."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Iterating over two ndarrays simultaneously: broadcasting\n",
+    "\n",
+    "Whenever we invoke a binary operator, call a function with two arguments of `ndarray` type, or assign something to an `ndarray`, we have to iterate over two views at the same time. The task is trivial, if the two `ndarray`s in question have the same shape (but not necessarily the same set of strides), because in this case, we can still iterate in the same loop. All that happens is that we move two data pointers in sync.\n",
+    "\n",
+    "The problem becomes a bit more involved when the shapes of the two `ndarray`s are not identical. For such cases, `numpy` defines so-called broadcasting, which boils down to two rules. \n",
+    "\n",
+    "1. The shape of the tensor with lower rank has to be prepended with axes of size 1 until the two ranks become equal.\n",
+    "2. Along all axes the two tensors should have the same size, or one of the sizes must be 1. \n",
+    "\n",
+    "If, after applying the first rule the second is not satisfied, the two `ndarray`s cannot be broadcast together. \n",
+    "\n",
+    "Now, let us suppose that we have two compatible `ndarray`s, i.e., after applying the first rule, the second is satisfied. How do we iterate over the elements in the tensors? \n",
+    "\n",
+    "We should recall, what exactly we do, when iterating over a single array: normally, we move the data pointer by the last stride, except, when we arrive at a dimension boundary (when the last axis is exhausted). At that point, we move the pointer by an amount dictated by the strides. And this is the key: *dictated by the strides*. Now, if we have two arrays that are originally not compatible, we define new strides for them, and use these in the iteration. With that, we are back to the case, where we had two compatible arrays. \n",
+    "\n",
+    "Now, let us look at the second broadcasting rule: if the two arrays have the same size, we take both `ndarray`s' strides along that axis. If, on the other hand, one of the `ndarray`s is of length 1 along one of its axes, we set the corresponding strides to 0. This will ensure that the data pointer is not moved when we iterate over both `ndarray`s at the same time. \n",
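+    "\n",
+    "As a small worked example under assumed shapes, take a dense left hand side of shape $(2, 3)$ and a dense right hand side of shape $(3)$, with strides counted in units of items. The first rule turns the second shape into $(1, 3)$, and the second rule is then satisfied along both axes, so the result has shape $(2, 3)$. The strides used in the iteration become\n",
+    "\n",
+    "\\begin{equation}\n",
+    "s^{lhs} = (3, 1), \\qquad s^{rhs} = (0, 1)\n",
+    "\\end{equation}\n",
+    "\n",
+    "and the positions read at the coordinate $(n_1, n_2)$ are\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P^{lhs}(n_1, n_2) = 3n_1 + n_2, \\qquad P^{rhs}(n_1, n_2) = 0\\cdot n_1 + n_2 = n_2\n",
+    "\\end{equation}\n",
+    "\n",
+    "so both rows of the left hand side are combined with the same three items of the right hand side. \n",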
+    "\n",
+    "Thus, in order to implement broadcasting, we first have to check, whether the two above-mentioned rules can be satisfied, and if so, we have to find the two new sets strides. \n",
+    "\n",
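+    "The two rules, and the zeroing of the strides can be illustrated with a short `python` sketch. It is not a transcription of the `C` implementation: the helper name `broadcast_strides` is made up for this example, shapes and strides are plain tuples, and the strides are assumed to be counted in elements rather than bytes.\n",
+    "\n",
+    "```python\n",
+    "def broadcast_strides(lshape, lstrides, rshape, rstrides):\n",
+    "    # hypothetical helper: returns (shape, lstrides, rstrides), or None,\n",
+    "    # if the two shapes cannot be broadcast together\n",
+    "    ndim = max(len(lshape), len(rshape))\n",
+    "    # rule 1: prepend axes of size 1 (with stride 0) to the shorter shape\n",
+    "    lshape = (1,) * (ndim - len(lshape)) + tuple(lshape)\n",
+    "    rshape = (1,) * (ndim - len(rshape)) + tuple(rshape)\n",
+    "    lstrides = (0,) * (ndim - len(lstrides)) + tuple(lstrides)\n",
+    "    rstrides = (0,) * (ndim - len(rstrides)) + tuple(rstrides)\n",
+    "    shape, new_l, new_r = [], [], []\n",
+    "    for ls, rs, lst, rst in zip(lshape, rshape, lstrides, rstrides):\n",
+    "        # rule 2: the sizes must be equal, or one of them must be 1\n",
+    "        if ls != rs and ls != 1 and rs != 1:\n",
+    "            return None\n",
+    "        shape.append(rs if ls == 1 else ls)\n",
+    "        # a stride of 0 keeps the data pointer fixed along a size-1 axis\n",
+    "        new_l.append(0 if ls == 1 else lst)\n",
+    "        new_r.append(0 if rs == 1 else rst)\n",
+    "    return tuple(shape), tuple(new_l), tuple(new_r)\n",
+    "\n",
+    "\n",
+    "# a (3, 4) array combined with a (4,) array\n",
+    "result = broadcast_strides((3, 4), (4, 1), (4,), (1,))\n",
+    "print(result)\n",
+    "```\n",
+    "\n",
+    "For the `(3, 4)` and `(4,)` shapes, the second operand receives the strides `(0, 1)`, i.e., its single row is re-used for each of the three rows of the first operand.\n",
+    "\n",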
+    "The `ndarray_can_broadcast` function from [ndarray.c](https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c) takes two `ndarray`s, and returns `true`, if the two arrays can be broadcast together. At the same time, it also calculates new strides for the two arrays, so that they can be iterated over at the same time. \n",
+    "\n",
+    "```c\n",
+    "bool ndarray_can_broadcast(ndarray_obj_t *lhs, ndarray_obj_t *rhs, uint8_t *ndim, size_t *shape, int32_t *lstrides, int32_t *rstrides) {\n",
+    "    // returns True or False, depending on, whether the two arrays can be broadcast together\n",
+    "    // numpy's broadcasting rules are as follows:\n",
+    "    //\n",
+    "    // 1. the two shapes are either equal\n",
+    "    // 2. one of the shapes is 1\n",
+    "    memset(lstrides, 0, sizeof(int32_t)*ULAB_MAX_DIMS);\n",
+    "    memset(rstrides, 0, sizeof(int32_t)*ULAB_MAX_DIMS);\n",
+    "    lstrides[ULAB_MAX_DIMS - 1] = lhs->strides[ULAB_MAX_DIMS - 1];\n",
+    "    rstrides[ULAB_MAX_DIMS - 1] = rhs->strides[ULAB_MAX_DIMS - 1];\n",
+    "    for(uint8_t i=ULAB_MAX_DIMS; i > 0; i--) {\n",
+    "        if((lhs->shape[i-1] == rhs->shape[i-1]) || (lhs->shape[i-1] == 0) || (lhs->shape[i-1] == 1) ||\n",
+    "        (rhs->shape[i-1] == 0) || (rhs->shape[i-1] == 1)) {\n",
+    "            shape[i-1] = MAX(lhs->shape[i-1], rhs->shape[i-1]);\n",
+    "            if(shape[i-1] > 0) (*ndim)++;\n",
+    "            if(lhs->shape[i-1] < 2) {\n",
+    "                lstrides[i-1] = 0;\n",
+    "            } else {\n",
+    "                lstrides[i-1] = lhs->strides[i-1];\n",
+    "            }\n",
+    "            if(rhs->shape[i-1] < 2) {\n",
+    "                rstrides[i-1] = 0;\n",
+    "            } else {\n",
+    "                rstrides[i-1] = rhs->strides[i-1];\n",
+    "            }\n",
+    "        } else {\n",
+    "            return false;\n",
+    "        }\n",
+    "    }\n",
+    "    return true;\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "A good example of how the function would be called can be found in [vector.c](https://github.com/v923z/micropython-ulab/blob/master/code/numpy/vector/vector.c), in the `vectorise_arctan2` function:\n",
+    "\n",
+    "```c\n",
+    "mp_obj_t vectorise_arctan2(mp_obj_t y, mp_obj_t x) {\n",
+    "    ...\n",
+    "    uint8_t ndim = 0;\n",
+    "    size_t *shape = m_new(size_t, ULAB_MAX_DIMS);\n",
+    "    int32_t *xstrides = m_new(int32_t, ULAB_MAX_DIMS);\n",
+    "    int32_t *ystrides = m_new(int32_t, ULAB_MAX_DIMS);\n",
+    "    if(!ndarray_can_broadcast(ndarray_x, ndarray_y, &ndim, shape, xstrides, ystrides)) {\n",
+    "        // free the scratch arrays first: mp_raise_ValueError does not return\n",
+    "        m_del(size_t, shape, ULAB_MAX_DIMS);\n",
+    "        m_del(int32_t, xstrides, ULAB_MAX_DIMS);\n",
+    "        m_del(int32_t, ystrides, ULAB_MAX_DIMS);\n",
+    "        mp_raise_ValueError(translate(\"operands could not be broadcast together\"));\n",
+    "    }\n",
+    "\n",
+    "    uint8_t *xarray = (uint8_t *)ndarray_x->array;\n",
+    "    uint8_t *yarray = (uint8_t *)ndarray_y->array;\n",
+    "    \n",
+    "    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndim, shape, NDARRAY_FLOAT);\n",
+    "    mp_float_t *rarray = (mp_float_t *)results->array;\n",
+    "    ...\n",
+    "```\n",
+    "\n",
+    "After the new strides have been calculated, the iteration loop is identical to what we discussed in the previous section."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Contracting an `ndarray`\n",
+    "\n",
+    "\n",
+    "There are many operations that reduce the number of dimensions of an `ndarray` by 1, i.e., that remove an axis from the tensor. The drill is the same as before, with the exception that first we have to remove the `strides` and `shape` entries that correspond to the axis along which we intend to contract. The `numerical_reduce_axes` function from [numerical.c](https://github.com/v923z/micropython-ulab/blob/master/code/numerical/numerical.c) does that. \n",
+    "\n",
+    "\n",
+    "```c\n",
+    "static void numerical_reduce_axes(ndarray_obj_t *ndarray, int8_t axis, size_t *shape, int32_t *strides) {\n",
+    "    // removes the values corresponding to a single axis from the shape and strides array\n",
+    "    uint8_t index = ULAB_MAX_DIMS - ndarray->ndim + axis;\n",
+    "    if((ndarray->ndim == 1) && (axis == 0)) {\n",
+    "        index = 0;\n",
+    "        shape[ULAB_MAX_DIMS - 1] = 0;\n",
+    "        return;\n",
+    "    }\n",
+    "    for(uint8_t i = ULAB_MAX_DIMS - 1; i > 0; i--) {\n",
+    "        if(i > index) {\n",
+    "            shape[i] = ndarray->shape[i];\n",
+    "            strides[i] = ndarray->strides[i];\n",
+    "        } else {\n",
+    "            shape[i] = ndarray->shape[i-1];\n",
+    "            strides[i] = ndarray->strides[i-1];\n",
+    "        }\n",
+    "    }\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "Once the reduced `strides` and `shape` are known, we place the axis in question in the innermost loop, and wrap it in the loops over the remaining axes, whose lengths and steps are stored in the reduced `shape` and `strides` arrays. The `RUN_STD` macro from [numerical.h](https://github.com/v923z/micropython-ulab/blob/master/code/numpy/numerical/numerical.h) is a good example. The macro is expanded in the `numerical_sum_mean_std_ndarray` function. \n",
+    "\n",
+    "\n",
+    "```c\n",
+    "static mp_obj_t numerical_sum_mean_std_ndarray(ndarray_obj_t *ndarray, mp_obj_t axis, uint8_t optype, size_t ddof) {\n",
+    "    uint8_t *array = (uint8_t *)ndarray->array;\n",
+    "    size_t *shape = m_new(size_t, ULAB_MAX_DIMS);\n",
+    "    memset(shape, 0, sizeof(size_t)*ULAB_MAX_DIMS);\n",
+    "    int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);\n",
+    "    memset(strides, 0, sizeof(uint32_t)*ULAB_MAX_DIMS);\n",
+    "\n",
+    "    int8_t ax = mp_obj_get_int(axis);\n",
+    "    if(ax < 0) ax += ndarray->ndim;\n",
+    "    if((ax < 0) || (ax > ndarray->ndim - 1)) {\n",
+    "        mp_raise_ValueError(translate(\"index out of range\"));\n",
+    "    }\n",
+    "    numerical_reduce_axes(ndarray, ax, shape, strides);\n",
+    "    uint8_t index = ULAB_MAX_DIMS - ndarray->ndim + ax;\n",
+    "    ndarray_obj_t *results = NULL;\n",
+    "    uint8_t *rarray = NULL;\n",
+    "    ...\n",
+    "\n",
+    "```\n",
+    "Here is the macro for the three-dimensional case: \n",
+    "\n",
+    "```c\n",
+    "#define RUN_STD(ndarray, type, array, results, r, shape, strides, index, div) do {\n",
+    "    size_t k = 0;\n",
+    "    do {\n",
+    "        size_t l = 0;\n",
+    "        do {\n",
+    "            RUN_STD1((ndarray), type, (array), (results), (r), (index), (div));\n",
+    "            (array) -= (ndarray)->strides[(index)] * (ndarray)->shape[(index)];\n",
+    "            (array) += (strides)[ULAB_MAX_DIMS - 1];\n",
+    "            l++;\n",
+    "        } while(l < (shape)[ULAB_MAX_DIMS - 1]);\n",
+    "        (array) -= (strides)[ULAB_MAX_DIMS - 2] * (shape)[ULAB_MAX_DIMS-2];\n",
+    "        (array) += (strides)[ULAB_MAX_DIMS - 3];\n",
+    "        k++;\n",
+    "    } while(k < (shape)[ULAB_MAX_DIMS - 2]);\n",
+    "} while(0)\n",
+    "```\n",
+    "In `RUN_STD`, we simply move our pointers; the calculation itself happens in the `RUN_STD1` macro below. (Note that this is the implementation of the numerically stable Welford algorithm.)\n",
+    "\n",
+    "```c\n",
+    "#define RUN_STD1(ndarray, type, array, results, r, index, div)\n",
+    "({\n",
+    "    mp_float_t M, m, S = 0.0, s = 0.0;\n",
+    "    M = m = *(mp_float_t *)((type *)(array));\n",
+    "    for(size_t i=1; i < (ndarray)->shape[(index)]; i++) {\n",
+    "        (array) += (ndarray)->strides[(index)];\n",
+    "        mp_float_t value = *(mp_float_t *)((type *)(array));\n",
+    "        m = M + (value - M) / (mp_float_t)i;\n",
+    "        s = S + (value - M) * (value - m);\n",
+    "        M = m;\n",
+    "        S = s;\n",
+    "    }\n",
+    "    (array) += (ndarray)->strides[(index)];\n",
+    "    *(r)++ = MICROPY_FLOAT_C_FUN(sqrt)((ndarray)->shape[(index)] * s / (div));\n",
+    "})\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Upcasting\n",
+    "\n",
+    "When in an operation the `dtype`s of two arrays are different, the result's `dtype` will be decided by the following upcasting rules:\n",
+    "\n",
+    "1. Operations with two `ndarray`s of the same `dtype` preserve their `dtype`, even when the results overflow.\n",
+    "\n",
+    "2. If either of the operands is a float, the result automatically becomes a float.\n",
+    "\n",
+    "3. Otherwise:\n",
+    "\n",
+    "    - `uint8` + `int8` => `int16`\n",
+    "    - `uint8` + `int16` => `int16`\n",
+    "    - `uint8` + `uint16` => `uint16`\n",
+    "    \n",
+    "    - `int8` + `int16` => `int16`\n",
+    "    - `int8` + `uint16` => `uint16` (in numpy, the result is an `int32`)\n",
+    "\n",
+    "    - `uint16` + `int16` => `float` (in numpy, the result is an `int32`)\n",
+    "    \n",
+    "4. When one operand of a binary operation is a generic scalar `micropython` variable, i.e., `mp_obj_int`, or `mp_obj_float`, it will be converted to a linear array of length 1, and with the smallest `dtype` that can accommodate the variable in question. After that, the broadcasting rules apply, as described in the section [Iterating over two ndarrays simultaneously: broadcasting](#Iterating_over_two_ndarrays_simultaneously:_broadcasting).\n",
+    "\n",
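+    "The rules listed above can be summarised in a small lookup table. The following `python` sketch is only an illustration of the rules, not of the `C` code: `result_dtype` is a hypothetical helper, the `dtype`s are represented by plain strings, and scalar operands (rule 4) are not covered.\n",
+    "\n",
+    "```python\n",
+    "# mixed integer pairs, as listed in rule 3\n",
+    "UPCAST = {\n",
+    "    frozenset(('uint8', 'int8')): 'int16',\n",
+    "    frozenset(('uint8', 'int16')): 'int16',\n",
+    "    frozenset(('uint8', 'uint16')): 'uint16',\n",
+    "    frozenset(('int8', 'int16')): 'int16',\n",
+    "    frozenset(('int8', 'uint16')): 'uint16',\n",
+    "    frozenset(('uint16', 'int16')): 'float',\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def result_dtype(left, right):\n",
+    "    # rule 1: identical dtypes are preserved\n",
+    "    if left == right:\n",
+    "        return left\n",
+    "    # rule 2: a float operand turns the result into a float\n",
+    "    if 'float' in (left, right):\n",
+    "        return 'float'\n",
+    "    # rule 3: look up the mixed integer pair; the order does not matter\n",
+    "    return UPCAST[frozenset((left, right))]\n",
+    "\n",
+    "\n",
+    "print(result_dtype('uint8', 'int8'))\n",
+    "print(result_dtype('int16', 'uint16'))\n",
+    "```\n",
+    "\n",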
+    "Upcasting is resolved in place, wherever it is required. Notable examples can be found in [ndarray_operators.c](https://github.com/v923z/micropython-ulab/blob/master/code/ndarray_operators.c)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Slicing and indexing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "An `ndarray` can be indexed with three types of objects: integer scalars, slices, and another `ndarray`, whose elements are either integer scalars, or Booleans. Since slice and integer indices can be thought of as modifications of the `strides`, these indices return a view of the `ndarray`. This statement does not hold for `ndarray` indices, and therefore, they return a copy of the array."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Extending ulab\n",
+    "\n",
+    "The `user` module is disabled by default, as can be seen from the last couple of lines of [ulab.h](https://github.com/v923z/micropython-ulab/blob/master/code/ulab.h):\n",
+    "\n",
+    "```c\n",
+    "// user-defined module\n",
+    "#ifndef ULAB_USER_MODULE\n",
+    "#define ULAB_USER_MODULE                (0)\n",
+    "#endif\n",
+    "```\n",
+    "\n",
+    "The module contains a very simple function, `user_dummy`, and this function is bound to the module itself. In other words, even if the module is enabled, one has to `import` it explicitly:\n",
+    "\n",
+    "```python\n",
+    "\n",
+    "import ulab\n",
+    "from ulab import user\n",
+    "\n",
+    "user.dummy_function(2.5)\n",
+    "```\n",
+    "which should just return 5.0. Even if `numpy`-compatibility is required (i.e., if most functions are bound at the top level to `ulab` directly), having to `import` the module has a great advantage. Namely, only the [user.h](https://github.com/v923z/micropython-ulab/blob/master/code/user/user.h) and [user.c](https://github.com/v923z/micropython-ulab/blob/master/code/user/user.c) files have to be modified, thus it should be relatively straightforward to update your local copy from [github](https://github.com/v923z/micropython-ulab/blob/master/). \n",
+    "\n",
+    "Now, let us see, how we can add a more meaningful function. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating a new ndarray\n",
+    "\n",
+    "In the [General comments](#General_comments) section we have seen the type definition of an `ndarray`. This structure can be generated by means of a couple of functions listed in [ndarray.c](https://github.com/v923z/micropython-ulab/blob/master/code/ndarray.c). \n",
+    "\n",
+    "\n",
+    "### ndarray_new_ndarray\n",
+    "\n",
+    "The `ndarray_new_ndarray` function is called by all other array-generating functions. It takes the number of dimensions, `ndim`, a `uint8_t`, the `shape`, a pointer to `size_t`, the `strides`, a pointer to `int32_t`, and `dtype`, another `uint8_t` as its arguments, and returns a new array with all entries initialised to 0. \n",
+    "\n",
+    "Assuming that `ULAB_MAX_DIMS > 2`, a new array of dimension 3, of `shape` (3, 4, 5), of `strides` (1000, 200, 10), and `dtype` `uint16` can be generated by the following instructions:\n",
+    "\n",
+    "```c\n",
+    "size_t *shape = m_new(size_t, ULAB_MAX_DIMS);\n",
+    "shape[ULAB_MAX_DIMS - 1] = 5;\n",
+    "shape[ULAB_MAX_DIMS - 2] = 4;\n",
+    "shape[ULAB_MAX_DIMS - 3] = 3;\n",
+    "\n",
+    "int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);\n",
+    "strides[ULAB_MAX_DIMS - 1] = 10;\n",
+    "strides[ULAB_MAX_DIMS - 2] = 200;\n",
+    "strides[ULAB_MAX_DIMS - 3] = 1000;\n",
+    "\n",
+    "ndarray_obj_t *new_ndarray = ndarray_new_ndarray(3, shape, strides, NDARRAY_UINT16);\n",
+    "```\n",
+    "\n",
+    "### ndarray_new_dense_ndarray\n",
+    "\n",
+    "The function simply calculates the `strides` from the `shape`, and calls `ndarray_new_ndarray`. Assuming that `ULAB_MAX_DIMS > 2`, a new dense array of dimension 3, of `shape` (3, 4, 5), and `dtype` `float` can be generated by the following instructions:\n",
+    "\n",
+    "```c\n",
+    "size_t *shape = m_new(size_t, ULAB_MAX_DIMS);\n",
+    "shape[ULAB_MAX_DIMS - 1] = 5;\n",
+    "shape[ULAB_MAX_DIMS - 2] = 4;\n",
+    "shape[ULAB_MAX_DIMS - 3] = 3;\n",
+    "\n",
+    "ndarray_obj_t *new_ndarray = ndarray_new_dense_ndarray(3, shape, NDARRAY_FLOAT);\n",
+    "```\n",
+    "\n",
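+    "How the `strides` follow from the `shape` can be illustrated with a few lines of `python`. The sketch assumes that the strides are counted in bytes, and that the last axis varies fastest; `dense_strides` is a name chosen for this example, and not a `ulab` function.\n",
+    "\n",
+    "```python\n",
+    "def dense_strides(shape, itemsize):\n",
+    "    # walk the axes from last to first, accumulating the step size\n",
+    "    strides = []\n",
+    "    step = itemsize\n",
+    "    for length in reversed(shape):\n",
+    "        strides.append(step)\n",
+    "        step *= length\n",
+    "    return tuple(reversed(strides))\n",
+    "\n",
+    "\n",
+    "# shape (3, 4, 5) with 4-byte floats\n",
+    "print(dense_strides((3, 4, 5), 4))\n",
+    "```\n",
+    "\n",
+    "With 4-byte items, the `shape` (3, 4, 5) leads to the strides (80, 20, 4) under these assumptions.\n",
+    "\n",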
+    "### ndarray_new_linear_array\n",
+    "\n",
+    "Since the number of dimensions of a linear array is known (1), the `ndarray_new_linear_array` function takes only the `length`, a `size_t`, and the `dtype`, a `uint8_t`. Internally, `ndarray_new_linear_array` generates the `shape` array, and calls `ndarray_new_dense_ndarray` with `ndim = 1`.\n",
+    "\n",
+    "A linear array of length 100, and `dtype` `uint8` could be created by the function call\n",
+    "\n",
+    "```c\n",
+    "ndarray_obj_t *new_ndarray = ndarray_new_linear_array(100, NDARRAY_UINT8);\n",
+    "```\n",
+    "\n",
+    "### ndarray_new_ndarray_from_tuple\n",
+    "\n",
+    "This function takes a `tuple`, which should hold the lengths of the axes (in other words, the `shape`), and the `dtype`, and internally calls `ndarray_new_dense_ndarray`. A new `ndarray` can be generated by calling \n",
+    "\n",
+    "```c\n",
+    "ndarray_obj_t *new_ndarray = ndarray_new_ndarray_from_tuple(shape, NDARRAY_FLOAT);\n",
+    "```\n",
+    "where `shape` is a tuple.\n",
+    "\n",
+    "\n",
+    "### ndarray_new_view\n",
+    "\n",
+    "This function creates a *view*, and takes the source, an `ndarray`, the number of dimensions, a `uint8_t`, the `shape`, a pointer to `size_t`, the `strides`, a pointer to `int32_t`, and the offset, an `int32_t` as arguments. The offset is the number of bytes by which the void `array` pointer is shifted. E.g., the `python` statement\n",
+    "\n",
+    "```python\n",
+    "a = np.array([0, 1, 2, 3, 4, 5], dtype=np.uint8)\n",
+    "b = a[1::2]\n",
+    "```\n",
+    "\n",
+    "produces the array\n",
+    "\n",
+    "```python\n",
+    "array([1, 3, 5], dtype=uint8)\n",
+    "```\n",
+    "which holds its data at position `x0 + 1`, if `a`'s pointer is at `x0`. In this particular case, the offset is 1. \n",
+    "\n",
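+    "The role of the offset and the strides can be mimicked in `python` on a plain `bytes` buffer. This is merely an illustration of the pointer arithmetic under the assumption of one-byte items; `view_items` is a hypothetical helper, and not part of `ulab`.\n",
+    "\n",
+    "```python\n",
+    "def view_items(buffer, offset, stride, length):\n",
+    "    # start at `offset` bytes, and advance by `stride` bytes per item\n",
+    "    return [buffer[offset + i * stride] for i in range(length)]\n",
+    "\n",
+    "\n",
+    "data = bytes([0, 1, 2, 3, 4, 5])\n",
+    "# equivalent of data[1::2]: offset 1, stride 2, three items\n",
+    "items = view_items(data, 1, 2, 3)\n",
+    "print(items)\n",
+    "```\n",
+    "\n",
+    "No data are copied in a view: only the starting position and the step size change, while the underlying buffer is shared with the source.\n",
+    "\n",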
+    "The array `b` from the example above could be generated as \n",
+    "\n",
+    "```c\n",
+    "size_t *shape = m_new(size_t, ULAB_MAX_DIMS);\n",
+    "shape[ULAB_MAX_DIMS - 1] = 3;\n",
+    "\n",
+    "int32_t *strides = m_new(int32_t, ULAB_MAX_DIMS);\n",
+    "strides[ULAB_MAX_DIMS - 1] = 2;\n",
+    "\n",
+    "int32_t offset = 1;\n",
+    "uint8_t ndim = 1;\n",
+    "\n",
+    "ndarray_obj_t *new_ndarray = ndarray_new_view(ndarray_a, ndim, shape, strides, offset);\n",
+    "```\n",
+    "\n",
+    "### ndarray_copy_array\n",
+    "\n",
+    "The `ndarray_copy_array` function can be used for copying the contents of an array. Note that the target array has to be created beforehand. E.g., a one-to-one copy can be obtained by \n",
+    "\n",
+    "```c\n",
+    "ndarray_obj_t *new_ndarray = ndarray_new_ndarray(source->ndim, source->shape, source->strides, source->dtype);\n",
+    "ndarray_copy_array(source, new_ndarray);\n",
+    "\n",
+    "```\n",
+    "Note that the function cannot be used for forcing type conversion, i.e., the input and output types must be identical, because the function simply calls the `memcpy` function. On the other hand, the input and output `strides` do not necessarily have to be equal.\n",
+    "\n",
+    "### ndarray_copy_view\n",
+    "\n",
+    "The `ndarray_obj_t *new_ndarray = ...` instruction can be spared by calling the `ndarray_copy_view` function with `source` as its single argument. \n",
+    "\n",
+    "\n",
+    "## Accessing data in the ndarray\n",
+    "\n",
+    "Having seen, how arrays can be generated and copied, it is time to look at how the data in an `ndarray` can be accessed and modified. \n",
+    "\n",
+    "For starters, let us suppose that the object in question comes from the user (i.e., via the `micropython` interface). First, we have to acquire a pointer to the `ndarray` by calling \n",
+    "\n",
+    "```c\n",
+    "ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(object_in);\n",
+    "```\n",
+    "\n",
+    "If it is not clear, whether the object is an `ndarray` (e.g., if we want to write a function that can take `ndarray`s, and other iterables as its argument), we find this out by evaluating \n",
+    "\n",
+    "```c\n",
+    "MP_OBJ_IS_TYPE(object_in, &ulab_ndarray_type)\n",
+    "```\n",
+    "which should return `true`. Once the pointer is at our disposal, we can get a pointer to the underlying numerical array as discussed earlier, i.e., \n",
+    "\n",
+    "```c\n",
+    "uint8_t *array = (uint8_t *)ndarray->array;\n",
+    "```\n",
+    "\n",
+    "If you need to find out the `dtype` of the array, you can get it by accessing the `dtype` member of the `ndarray`, i.e., \n",
+    "\n",
+    "```c\n",
+    "ndarray->dtype\n",
+    "```\n",
+    "should be equal to `B`, `b`, `H`, `h`, `f`, or `d`. The size of a single item is stored in the `itemsize` member. This number should be equal to 1, if the `dtype` is `B`, or `b`, 2, if the `dtype` is `H`, or `h`, 4, if the `dtype` is `f`, and 8, if the `dtype` is `d`. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Boilerplate\n",
+    "\n",
+    "In the next section, we will construct a function that generates the element-wise square of a dense array, and raises a `TypeError` exception otherwise. Dense arrays can easily be iterated over, since we do not have to care about the `shape` and the `strides`. If the array is not dense, the section [Iterating over elements of a tensor](#Iterating-over-elements-of-a-tensor) should contain hints as to how the iteration can be implemented.\n",
+    "\n",
+    "The function is listed under [user.c](https://github.com/v923z/micropython-ulab/tree/master/code/user/). The `user` module is bound to `ulab` in [ulab.c](https://github.com/v923z/micropython-ulab/tree/master/code/ulab.c) in the lines \n",
+    "\n",
+    "```c\n",
+    "    #if ULAB_USER_MODULE\n",
+    "        { MP_ROM_QSTR(MP_QSTR_user), MP_ROM_PTR(&ulab_user_module) },\n",
+    "    #endif\n",
+    "```\n",
+    "which assumes that at the very end of [ulab.h](https://github.com/v923z/micropython-ulab/tree/master/code/ulab.h) the \n",
+    "\n",
+    "```c\n",
+    "// user-defined module\n",
+    "#ifndef ULAB_USER_MODULE\n",
+    "#define ULAB_USER_MODULE                (1)\n",
+    "#endif\n",
+    "```\n",
+    "constant has been set to 1. After compilation, you can call a particular `user` function in `python` by importing the module first, i.e., \n",
+    "\n",
+    "```python\n",
+    "from ulab import numpy as np\n",
+    "from ulab import user\n",
+    "\n",
+    "user.some_function(...)\n",
+    "```\n",
+    "\n",
+    "This separation of user-defined functions from the rest of the code ensures that the integrity of the main module and all its functions is always preserved. Even in case of a catastrophic failure, you can exclude the `user` module, and start over.\n",
+    "\n",
+    "And now the function:\n",
+    "\n",
+    "\n",
+    "```c\n",
+    "static mp_obj_t user_square(mp_obj_t arg) {\n",
+    "    // the function takes a single dense ndarray, and calculates the \n",
+    "    // element-wise square of its entries\n",
+    "    \n",
+    "    // raise a TypeError exception, if the input is not an ndarray\n",
+    "    if(!MP_OBJ_IS_TYPE(arg, &ulab_ndarray_type)) {\n",
+    "        mp_raise_TypeError(translate(\"input must be an ndarray\"));\n",
+    "    }\n",
+    "    ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(arg);\n",
+    "    \n",
+    "    // make sure that the input is a dense array\n",
+    "    if(!ndarray_is_dense(ndarray)) {\n",
+    "        mp_raise_TypeError(translate(\"input must be a dense ndarray\"));\n",
+    "    }\n",
+    "    \n",
+    "    // if the input is a dense array, create `results` with the same number of \n",
+    "    // dimensions, shape, and dtype\n",
+    "    ndarray_obj_t *results = ndarray_new_dense_ndarray(ndarray->ndim, ndarray->shape, ndarray->dtype);\n",
+    "    \n",
+    "    // since in a dense array the iteration over the elements is trivial, we \n",
+    "    // can cast the data arrays ndarray->array and results->array to the actual type\n",
+    "    if(ndarray->dtype == NDARRAY_UINT8) {\n",
+    "        uint8_t *array = (uint8_t *)ndarray->array;\n",
+    "        uint8_t *rarray = (uint8_t *)results->array;\n",
+    "        for(size_t i=0; i < ndarray->len; i++, array++) {\n",
+    "            *rarray++ = (*array) * (*array);\n",
+    "        }\n",
+    "    } else if(ndarray->dtype == NDARRAY_INT8) {\n",
+    "        int8_t *array = (int8_t *)ndarray->array;\n",
+    "        int8_t *rarray = (int8_t *)results->array;\n",
+    "        for(size_t i=0; i < ndarray->len; i++, array++) {\n",
+    "            *rarray++ = (*array) * (*array);\n",
+    "        }\n",
+    "    } else if(ndarray->dtype == NDARRAY_UINT16) {\n",
+    "        uint16_t *array = (uint16_t *)ndarray->array;\n",
+    "        uint16_t *rarray = (uint16_t *)results->array;\n",
+    "        for(size_t i=0; i < ndarray->len; i++, array++) {\n",
+    "            *rarray++ = (*array) * (*array);\n",
+    "        }\n",
+    "    } else if(ndarray->dtype == NDARRAY_INT16) {\n",
+    "        int16_t *array = (int16_t *)ndarray->array;\n",
+    "        int16_t *rarray = (int16_t *)results->array;\n",
+    "        for(size_t i=0; i < ndarray->len; i++, array++) {\n",
+    "            *rarray++ = (*array) * (*array);\n",
+    "        }\n",
+    "    } else { // if we end up here, the dtype is NDARRAY_FLOAT\n",
+    "        mp_float_t *array = (mp_float_t *)ndarray->array;\n",
+    "        mp_float_t *rarray = (mp_float_t *)results->array;\n",
+    "        for(size_t i=0; i < ndarray->len; i++, array++) {\n",
+    "            *rarray++ = (*array) * (*array);\n",
+    "        }        \n",
+    "    }\n",
+    "    // at the end, return a micropython object\n",
+    "    return MP_OBJ_FROM_PTR(results);\n",
+    "}\n",
+    "\n",
+    "```\n",
+    "\n",
+    "To summarise, the steps for *implementing* a function are\n",
+    "\n",
+    "1. If necessary, inspect the type of the input object, which is always an `mp_obj_t` object\n",
+    "2. If the input is an `ndarray_obj_t`, acquire a pointer to it by calling `ndarray_obj_t *ndarray = MP_OBJ_TO_PTR(arg);`\n",
+    "3. Create a new array, or modify the existing one; get a pointer to the data by calling `uint8_t *array = (uint8_t *)ndarray->array;`, or something equivalent\n",
+    "4. Once the new data have been calculated, return a `micropython` object by calling `MP_OBJ_FROM_PTR(...)`.\n",
+    "\n",
+    "The listing above contains the implementation of the function, but as such, it cannot be called from `python`: \n",
+    "it still has to be bound to the namespace. This we do by first defining a function object with \n",
+    "\n",
+    "```c\n",
+    "MP_DEFINE_CONST_FUN_OBJ_1(user_square_obj, user_square);\n",
+    "\n",
+    "```\n",
+    "\n",
+    "`micropython` defines a number of `MP_DEFINE_CONST_FUN_OBJ_N` macros in [obj.h](https://github.com/micropython/micropython/blob/master/py/obj.h). `N` is always the number of arguments the function takes. We had a function definition `static mp_obj_t user_square(mp_obj_t arg)`, i.e., we dealt with a single argument. \n",
+    "\n",
+    "Finally, we have to bind this function object in the globals table of the `user` module: \n",
+    "\n",
+    "```c\n",
+    "STATIC const mp_rom_map_elem_t ulab_user_globals_table[] = {\n",
+    "    { MP_OBJ_NEW_QSTR(MP_QSTR___name__), MP_OBJ_NEW_QSTR(MP_QSTR_user) },\n",
+    "    { MP_OBJ_NEW_QSTR(MP_QSTR_square), (mp_obj_t)&user_square_obj },\n",
+    "};\n",
+    "```\n",
+    "\n",
+    "Thus, the three steps required for the definition of a user-defined function are \n",
+    "\n",
+    "1. The low-level implementation of the function itself\n",
+    "2. The definition of a function object by means of the `MP_DEFINE_CONST_FUN_OBJ_N()` macro\n",
+    "3. Binding this function object to the namespace in the `ulab_user_globals_table[]`"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/docs/ulab.ipynb
+++ b/docs/ulab.ipynb
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1 @@
+sphinx-autoapi
--- a/570
+++ b/570
@ -0,0 +1,570 @@
+#! /usr/bin/env python3
+
+import os
+import subprocess
+import sys
+import platform
+import argparse
+import re
+import threading
+import multiprocessing
+from multiprocessing.pool import ThreadPool
+from glob import glob
+
+if os.name == 'nt':
+    MICROPYTHON = os.getenv('MICROPY_MICROPYTHON', 'micropython/ports/windows/micropython.exe')
+else:
+    MICROPYTHON = os.getenv('MICROPY_MICROPYTHON', 'micropython/ports/unix/micropython')
+
+# mpy-cross is only needed if --via-mpy command-line arg is passed
+MPYCROSS = os.getenv('MICROPY_MPYCROSS', '../mpy-cross/mpy-cross')
+
+# Set PYTHONIOENCODING so that CPython will use utf-8 on systems which set another encoding in the locale
+os.environ['PYTHONIOENCODING'] = 'utf-8'
+
+def rm_f(fname):
+    if os.path.exists(fname):
+        os.remove(fname)
+
+
+# unescape wanted regex chars and escape unwanted ones
+def convert_regex_escapes(line):
+    cs = []
+    escape = False
+    for c in str(line, 'utf8'):
+        if escape:
+            escape = False
+            cs.append(c)
+        elif c == '\\':
+            escape = True
+        elif c in ('(', ')', '[', ']', '{', '}', '.', '*', '+', '^', '$'):
+            cs.append('\\' + c)
+        else:
+            cs.append(c)
+    # accept carriage-return(s) before final newline
+    if cs[-1] == '\n':
+        cs[-1] = '\r*\n'
+    return bytes(''.join(cs), 'utf8')
+
+
+def run_micropython(pyb, args, test_file, is_special=False):
+    special_tests = (
+        'micropython/meminfo.py', 'basics/bytes_compare3.py',
+        'basics/builtin_help.py', 'thread/thread_exc2.py',
+    )
+    had_crash = False
+    if pyb is None:
+        # run on PC
+        if test_file.startswith(('cmdline/', 'feature_check/')) or test_file in special_tests:
+            # special handling for tests of the unix cmdline program
+            is_special = True
+
+        if is_special:
+            # check for any cmdline options needed for this test
+            args = [MICROPYTHON]
+            with open(test_file, 'rb') as f:
+                line = f.readline()
+                if line.startswith(b'# cmdline:'):
+                    # subprocess.check_output on Windows only accepts strings, not bytes
+                    args += [str(c, 'utf-8') for c in line[10:].strip().split()]
+
+            # run the test, possibly with redirected input
+            try:
+                if 'repl_' in test_file:
+                    # Need to use a PTY to test command line editing
+                    try:
+                        import pty
+                    except ImportError:
+                        # in case pty module is not available, like on Windows
+                        return b'SKIP\n'
+                    import select
+
+                    def get(required=False):
+                        rv = b''
+                        while True:
+                            ready = select.select([emulator], [], [], 0.02)
+                            if ready[0] == [emulator]:
+                                rv += os.read(emulator, 1024)
+                            else:
+                                if not required or rv:
+                                    return rv
+
+                    def send_get(what):
+                        os.write(emulator, what)
+                        return get()
+
+                    with open(test_file, 'rb') as f:
+                        # instead of: output_mupy = subprocess.check_output(args, stdin=f)
+                        # openpty returns two read/write file descriptors.  The first one is
+                        # used by the program which provides the virtual
+                        # terminal service, and the second one is used by the
+                        # subprogram which requires a tty to work.
+                        emulator, subterminal = pty.openpty()
+                        p = subprocess.Popen(args, stdin=subterminal, stdout=subterminal,
+                                             stderr=subprocess.STDOUT, bufsize=0)
+                        banner = get(True)
+                        output_mupy = banner + b''.join(send_get(line) for line in f)
+                        send_get(b'\x04') # exit the REPL, so coverage info is saved
+                        p.kill()
+                        os.close(emulator)
+                        os.close(subterminal)
+                else:
+                    output_mupy = subprocess.check_output(args + [test_file], stderr=subprocess.STDOUT)
+            except subprocess.CalledProcessError:
+                return b'CRASH'
+
+        else:
+            # a standard test run on PC
+
+            # create system command
+            cmdlist = [MICROPYTHON, '-X', 'emit=' + args.emit]
+            if args.heapsize is not None:
+                cmdlist.extend(['-X', 'heapsize=' + args.heapsize])
+
+            # if running via .mpy, first compile the .py file
+            if args.via_mpy:
+                subprocess.check_output([MPYCROSS, '-mcache-lookup-bc', '-o', 'mpytest.mpy', test_file])
+                cmdlist.extend(['-m', 'mpytest'])
+            else:
+                cmdlist.append(test_file)
+
+            # run the actual test
+            e = {"MICROPYPATH": os.getcwd() + ":", "LANG": "en_US.UTF-8"}
+            p = subprocess.Popen(cmdlist, env=e, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+            output_mupy = b''
+            while p.poll() is None:
+                output_mupy += p.stdout.read()
+            output_mupy += p.stdout.read()
+            if p.returncode != 0:
+                output_mupy = b'CRASH'
+
+            # clean up if we had an intermediate .mpy file
+            if args.via_mpy:
+                rm_f('mpytest.mpy')
+
+    else:
+        # run on pyboard
+        import pyboard
+        pyb.enter_raw_repl()
+        try:
+            output_mupy = pyb.execfile(test_file)
+        except pyboard.PyboardError:
+            had_crash = True
+            output_mupy = b'CRASH'
+
+    # canonical form for all ports/platforms is to use \n for end-of-line
+    output_mupy = output_mupy.replace(b'\r\n', b'\n')
+
+    # don't try to convert the output if we should skip this test
+    if had_crash or output_mupy in (b'SKIP\n', b'CRASH'):
+        return output_mupy
+
+    if is_special or test_file in special_tests:
+        # convert parts of the output that are not stable across runs
+        with open(test_file + '.exp', 'rb') as f:
+            lines_exp = []
+            for line in f.readlines():
+                if line == b'########\n':
+                    line = (line,)
+                else:
+                    line = (line, re.compile(convert_regex_escapes(line)))
+                lines_exp.append(line)
+        lines_mupy = [line + b'\n' for line in output_mupy.split(b'\n')]
+        if output_mupy.endswith(b'\n'):
+            lines_mupy = lines_mupy[:-1] # remove erroneous last empty line
+        i_mupy = 0
+        for i in range(len(lines_exp)):
+            if lines_exp[i][0] == b'########\n':
+                # 8x #'s means match 0 or more whole lines
+                line_exp = lines_exp[i + 1]
+                skip = 0
+                while i_mupy + skip < len(lines_mupy) and not line_exp[1].match(lines_mupy[i_mupy + skip]):
+                    skip += 1
+                if i_mupy + skip >= len(lines_mupy):
+                    lines_mupy[i_mupy] = b'######## FAIL\n'
+                    break
+                del lines_mupy[i_mupy:i_mupy + skip]
+                lines_mupy.insert(i_mupy, b'########\n')
+                i_mupy += 1
+            else:
+                # a regex
+                if lines_exp[i][1].match(lines_mupy[i_mupy]):
+                    lines_mupy[i_mupy] = lines_exp[i][0]
+                else:
+                    #print("don't match: %r %s" % (lines_exp[i][1], lines_mupy[i_mupy])) # DEBUG
+                    pass
+                i_mupy += 1
+            if i_mupy >= len(lines_mupy):
+                break
+        output_mupy = b''.join(lines_mupy)
+
+    return output_mupy
+
+
+def run_feature_check(pyb, args, base_path, test_file):
+    return run_micropython(pyb, args, base_path + "/feature_check/" + test_file, is_special=True)
+
+class ThreadSafeCounter:
+    def __init__(self, start=0):
+        self._value = start
+        self._lock = threading.Lock()
+
+    def add(self, to_add):
+        with self._lock: self._value += to_add
+
+    def append(self, arg):
+        self.add([arg])
+
+    @property
+    def value(self):
+        return self._value
+
+def run_tests(pyb, tests, args, base_path=".", num_threads=1):
+    test_count = ThreadSafeCounter()
+    testcase_count = ThreadSafeCounter()
+    passed_count = ThreadSafeCounter()
+    failed_tests = ThreadSafeCounter([])
+    skipped_tests = ThreadSafeCounter([])
+
+    skip_tests = set()
+    skip_native = False
+    skip_int_big = False
+    skip_set_type = False
+    skip_async = False
+    skip_const = False
+    skip_revops = False
+    skip_endian = False
+    has_complex = True
+    has_coverage = False
+
+    upy_float_precision = 32
+
+    # Some tests shouldn't be run under Travis CI
+    if os.getenv('TRAVIS') == 'true':
+        skip_tests.add('basics/memoryerror.py')
+        skip_tests.add('thread/thread_gc1.py') # has reliability issues
+        skip_tests.add('thread/thread_lock4.py') # has reliability issues
+        skip_tests.add('thread/stress_heap.py') # has reliability issues
+        skip_tests.add('thread/stress_recurse.py') # has reliability issues
+
+    if upy_float_precision == 0:
+        skip_tests.add('extmod/ujson_dumps_float.py')
+        skip_tests.add('extmod/ujson_loads_float.py')
+        skip_tests.add('misc/rge_sm.py')
+    if upy_float_precision < 32:
+        skip_tests.add('float/float2int_intbig.py') # requires fp32, there's float2int_fp30_intbig.py instead
+        skip_tests.add('float/string_format.py') # requires fp32, there's string_format_fp30.py instead
+        skip_tests.add('float/bytes_construct.py') # requires fp32
+        skip_tests.add('float/bytearray_construct.py') # requires fp32
+    if upy_float_precision < 64:
+        skip_tests.add('float/float_divmod.py') # tested by float/float_divmod_relaxed.py instead
+        skip_tests.add('float/float2int_doubleprec_intbig.py')
+        skip_tests.add('float/float_parse_doubleprec.py')
+
+    if not has_complex:
+        skip_tests.add('float/complex1.py')
+        skip_tests.add('float/complex1_intbig.py')
+        skip_tests.add('float/int_big_float.py')
+        skip_tests.add('float/true_value.py')
+        skip_tests.add('float/types.py')
+
+    if not has_coverage:
+        skip_tests.add('cmdline/cmd_parsetree.py')
+
+    # Some tests shouldn't be run on a PC
+    if args.target == 'unix':
+        # unix build does not have the GIL so can't run thread mutation tests
+        for t in tests:
+            if t.startswith('thread/mutate_'):
+                skip_tests.add(t)
+
+    # Some tests shouldn't be run on pyboard
+    if args.target != 'unix':
+        skip_tests.add('basics/exception_chain.py') # warning is not printed
+        skip_tests.add('micropython/meminfo.py') # output is very different to PC output
+        skip_tests.add('extmod/machine_mem.py') # raw memory access not supported
+
+        if args.target == 'wipy':
+            skip_tests.add('misc/print_exception.py')       # requires error reporting full
+            skip_tests.update({'extmod/uctypes_%s.py' % t for t in 'bytearray le native_le ptr_le ptr_native_le sizeof sizeof_native array_assign_le array_assign_native_le'.split()}) # requires uctypes
+            skip_tests.add('extmod/zlibd_decompress.py')    # requires zlib
+            skip_tests.add('extmod/uheapq1.py')             # uheapq not supported by WiPy
+            skip_tests.add('extmod/urandom_basic.py')       # requires urandom
+            skip_tests.add('extmod/urandom_extra.py')       # requires urandom
+        elif args.target == 'esp8266':
+            skip_tests.add('misc/rge_sm.py')                # too large
+        elif args.target == 'minimal':
+            skip_tests.add('basics/class_inplace_op.py')    # all special methods not supported
+            skip_tests.add('basics/subclass_native_init.py')# native subclassing corner cases not supported
+            skip_tests.add('misc/rge_sm.py')                # too large
+            skip_tests.add('micropython/opt_level.py')      # don't assume line numbers are stored
+
+    # Some tests are known to fail on 64-bit machines
+    if pyb is None and platform.architecture()[0] == '64bit':
+        pass
+
+    # Some tests use unsupported features on Windows
+    if os.name == 'nt':
+        skip_tests.add('import/import_file.py') # works but CPython prints forward slashes
+
+    # Some tests are known to fail with native emitter
+    # Remove them from the below when they work
+    if args.emit == 'native':
+        skip_tests.update({'basics/%s.py' % t for t in 'gen_yield_from gen_yield_from_close gen_yield_from_ducktype gen_yield_from_exc gen_yield_from_executing gen_yield_from_iter gen_yield_from_send gen_yield_from_stopped gen_yield_from_throw gen_yield_from_throw2 gen_yield_from_throw3 generator1 generator2 generator_args generator_close generator_closure generator_exc generator_pend_throw generator_return generator_send'.split()}) # require yield
+        skip_tests.update({'basics/%s.py' % t for t in 'bytes_gen class_store_class globals_del string_join gen_stack_overflow'.split()}) # require yield
+        skip_tests.update({'basics/async_%s.py' % t for t in 'def await await2 for for2 with with2 coroutine'.split()}) # require yield
+        skip_tests.update({'basics/%s.py' % t for t in 'try_reraise try_reraise2'.split()}) # require raise_varargs
+        skip_tests.update({'basics/%s.py' % t for t in 'with_break with_continue with_return'.split()}) # require complete with support
+        skip_tests.add('basics/array_construct2.py') # requires generators
+        skip_tests.add('basics/bool1.py') # seems to randomly fail
+        skip_tests.add('basics/builtin_hash_gen.py') # requires yield
+        skip_tests.add('basics/class_bind_self.py') # requires yield
+        skip_tests.add('basics/del_deref.py') # requires checking for unbound local
+        skip_tests.add('basics/del_local.py') # requires checking for unbound local
+        skip_tests.add('basics/exception_chain.py') # raise from is not supported
+        skip_tests.add('basics/for_range.py') # requires yield_value
+        skip_tests.add('basics/try_finally_loops.py') # requires proper try finally code
+        skip_tests.add('basics/try_finally_return.py') # requires proper try finally code
+        skip_tests.add('basics/try_finally_return2.py') # requires proper try finally code
+        skip_tests.add('basics/unboundlocal.py') # requires checking for unbound local
+        skip_tests.add('import/gen_context.py') # requires yield_value
+        skip_tests.add('misc/features.py') # requires raise_varargs
+        skip_tests.add('misc/rge_sm.py') # requires yield
+        skip_tests.add('misc/print_exception.py') # because native doesn't have proper traceback info
+        skip_tests.add('misc/sys_exc_info.py') # sys.exc_info() is not supported for native
+        skip_tests.add('micropython/emg_exc.py') # because native doesn't have proper traceback info
+        skip_tests.add('micropython/heapalloc_traceback.py') # because native doesn't have proper traceback info
+        skip_tests.add('micropython/heapalloc_iter.py') # requires generators
+        skip_tests.add('micropython/schedule.py') # native code doesn't check pending events
+        skip_tests.add('stress/gc_trace.py') # requires yield
+        skip_tests.add('stress/recursive_gen.py') # requires yield
+        skip_tests.add('extmod/vfs_userfs.py') # because native doesn't properly handle globals across different modules
+        skip_tests.add('../extmod/ulab/tests/argminmax.py') # requires yield
+
+    def run_one_test(test_file):
+        test_file = test_file.replace('\\', '/')
+
+        if args.filters:
+            # Default verdict is the opposite of the first action
+            verdict = "include" if args.filters[0][0] == "exclude" else "exclude"
+            for action, pat in args.filters:
+                if pat.search(test_file):
+                    verdict = action
+            if verdict == "exclude":
+                return
+
+        test_basename = os.path.basename(test_file)
+        test_name = os.path.splitext(test_basename)[0]
+        is_native = test_name.startswith("native_") or test_name.startswith("viper_")
+        is_endian = test_name.endswith("_endian")
+        is_int_big = test_name.startswith("int_big") or test_name.endswith("_intbig")
+        is_set_type = test_name.startswith("set_") or test_name.startswith("frozenset")
+        is_async = test_name.startswith("async_")
+        is_const = test_name.startswith("const")
+
+        skip_it = test_file in skip_tests
+        skip_it |= skip_native and is_native
+        skip_it |= skip_endian and is_endian
+        skip_it |= skip_int_big and is_int_big
+        skip_it |= skip_set_type and is_set_type
+        skip_it |= skip_async and is_async
+        skip_it |= skip_const and is_const
+        skip_it |= skip_revops and test_name.startswith("class_reverse_op")
+
+        if args.list_tests:
+            if not skip_it:
+                print(test_file)
+            return
+
+        if skip_it:
+            print("skip ", test_file)
+            skipped_tests.append(test_name)
+            return
+
+        # get expected output
+        test_file_expected = test_file + '.exp'
+        if os.path.isfile(test_file_expected):
+            # expected output given by a file, so read that in
+            with open(test_file_expected, 'rb') as f:
+                output_expected = f.read()
+        else:
+            if not args.write_exp:
+                output_expected = b"NOEXP\n"
+            else:
+                # run the interpreter to work out expected output (note: this invokes MICROPYTHON, not CPython)
+                e = {"PYTHONPATH": os.getcwd(),
+                     "PATH": os.environ["PATH"],
+                     "LANG": "en_US.UTF-8"}
+                p = subprocess.Popen([MICROPYTHON, test_file], env=e, stdout=subprocess.PIPE)
+                output_expected = b''
+                while p.poll() is None:
+                    output_expected += p.stdout.read()
+                output_expected += p.stdout.read()
+                with open(test_file_expected, 'wb') as f:
+                    f.write(output_expected)
+
+        # canonical form for all host platforms is to use \n for end-of-line
+        output_expected = output_expected.replace(b'\r\n', b'\n')
+
+        if args.write_exp:
+            return
+
+        # run MicroPython
+        output_mupy = run_micropython(pyb, args, test_file)
+
+        if output_mupy == b'SKIP\n':
+            print("skip ", test_file)
+            skipped_tests.append(test_name)
+            return
+
+        if output_expected == b'NOEXP\n':
+            print("noexp", test_file)
+            failed_tests.append(test_name)
+            return
+
+        testcase_count.add(len(output_expected.splitlines()))
+
+        filename_expected = test_basename + ".exp"
+        filename_mupy = test_basename + ".out"
+
+        if output_expected == output_mupy:
+            print("pass ", test_file)
+            passed_count.add(1)
+            rm_f(filename_expected)
+            rm_f(filename_mupy)
+        else:
+            with open(filename_expected, "wb") as f:
+                f.write(output_expected)
+            with open(filename_mupy, "wb") as f:
+                f.write(output_mupy)
+            print("### Expected")
+            print(output_expected)
+            print("### Actual")
+            print(output_mupy)
+            print("FAIL ", test_file)
+            failed_tests.append(test_name)
+
+        test_count.add(1)
+
+    if args.list_tests:
+        return True
+
+    if num_threads > 1:
+        pool = ThreadPool(num_threads)
+        pool.map(run_one_test, tests)
+    else:
+        for test in tests:
+            run_one_test(test)
+
+    print("{} tests performed ({} individual testcases)".format(test_count.value, testcase_count.value))
+    print("{} tests passed".format(passed_count.value))
+
+    if len(skipped_tests.value) > 0:
+        print("{} tests skipped: {}".format(len(skipped_tests.value), ' '.join(sorted(skipped_tests.value))))
+    if len(failed_tests.value) > 0:
+        print("{} tests failed: {}".format(len(failed_tests.value), ' '.join(sorted(failed_tests.value))))
+        return False
+
+    # all tests succeeded
+    return True
+
+
+class append_filter(argparse.Action):
+
+    def __init__(self, option_strings, dest, **kwargs):
+        super().__init__(option_strings, dest, default=[], **kwargs)
+
+    def __call__(self, parser, args, value, option):
+        if not hasattr(args, self.dest):
+            args.filters = []
+        if option.startswith(("-e", "--e")):
+            option = "exclude"
+        else:
+            option = "include"
+        args.filters.append((option, re.compile(value)))
+
+
+def main():
+    cmd_parser = argparse.ArgumentParser(
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        description='Run and manage tests for MicroPython.',
+        epilog='''\
+Options -i and -e can be multiple and processed in the order given. Regex
+"search" (vs "match") operation is used. An action (include/exclude) of
+the last matching regex is used:
+  run-tests -i async - exclude all, then include tests containing "async" anywhere
+  run-tests -e '/big.+int' - include all, then exclude by regex
+  run-tests -e async -i async_foo - include all, exclude async, yet still include async_foo
+''')
+    cmd_parser.add_argument('--target', default='unix', help='the target platform')
+    cmd_parser.add_argument('--device', default='/dev/ttyACM0', help='the serial device or the IP address of the pyboard')
+    cmd_parser.add_argument('-b', '--baudrate', default=115200, help='the baud rate of the serial device')
+    cmd_parser.add_argument('-u', '--user', default='micro', help='the telnet login username')
+    cmd_parser.add_argument('-p', '--password', default='python', help='the telnet login password')
+    cmd_parser.add_argument('-d', '--test-dirs', nargs='*', help='input test directories (if no files given)')
+    cmd_parser.add_argument('-e', '--exclude', action=append_filter, metavar='REGEX', dest='filters', help='exclude test by regex on path/name.py')
+    cmd_parser.add_argument('-i', '--include', action=append_filter, metavar='REGEX', dest='filters', help='include test by regex on path/name.py')
+    cmd_parser.add_argument('--write-exp', action='store_true', help='save .exp files to run tests w/o CPython')
+    cmd_parser.add_argument('--list-tests', action='store_true', help='list tests instead of running them')
+    cmd_parser.add_argument('--emit', default='bytecode', help='MicroPython emitter to use (bytecode or native)')
+    cmd_parser.add_argument('--heapsize', help='heapsize to use (use default if not specified)')
+    cmd_parser.add_argument('--via-mpy', action='store_true', help='compile .py files to .mpy first')
+    cmd_parser.add_argument('--keep-path', action='store_true', help='do not clear MICROPYPATH when running tests')
+    cmd_parser.add_argument('-j', '--jobs', default=1, metavar='N', type=int, help='Number of tests to run simultaneously')
+    cmd_parser.add_argument('--auto-jobs', action='store_const', dest='jobs', const=multiprocessing.cpu_count(), help='Set the -j value to the CPU (thread) count')
+    cmd_parser.add_argument('files', nargs='*', help='input test files')
+    args = cmd_parser.parse_args()
+
+    EXTERNAL_TARGETS = ('pyboard', 'wipy', 'esp8266', 'esp32', 'minimal')
+    if args.target == 'unix' or args.list_tests:
+        pyb = None
+    elif args.target in EXTERNAL_TARGETS:
+        import pyboard
+        pyb = pyboard.Pyboard(args.device, args.baudrate, args.user, args.password)
+        pyb.enter_raw_repl()
+    else:
+        raise ValueError('target must be either %s or unix' % ", ".join(EXTERNAL_TARGETS))
+
+    if len(args.files) == 0:
+        if args.test_dirs is None:
+            if args.target == 'pyboard':
+                # run pyboard tests
+                test_dirs = ('basics', 'micropython', 'float', 'misc', 'stress', 'extmod', 'pyb', 'pybnative', 'inlineasm')
+            elif args.target in ('esp8266', 'esp32', 'minimal'):
+                test_dirs = ('basics', 'micropython', 'float', 'misc', 'extmod')
+            elif args.target == 'wipy':
+                # run WiPy tests
+                test_dirs = ('basics', 'micropython', 'misc', 'extmod', 'wipy')
+            else:
+                # run PC tests
+                test_dirs = (
+                    'basics', 'micropython', 'float', 'import', 'io', 'misc',
+                    'stress', 'unicode', 'extmod', '../extmod/ulab/tests', 'unix', 'cmdline',
+                )
+        else:
+            # run tests from these directories
+            test_dirs = args.test_dirs
+        tests = sorted(test_file for test_files in (glob('{}/*.py'.format(dir)) for dir in test_dirs) for test_file in test_files)
+    else:
+        # tests explicitly given
+        tests = args.files
+
+    if not args.keep_path:
+        # clear search path to make sure tests use only builtin modules
+        os.environ['MICROPYPATH'] = ''
+
+    # Even if we run completely different tests in a different directory,
+    # we need to access feature_check's from the same directory as the
+    # run-tests script itself.
+    base_path = os.path.dirname(sys.argv[0]) or "."
+    try:
+        res = run_tests(pyb, tests, args, base_path, args.jobs)
+    finally:
+        if pyb:
+            pyb.close()
+
+    if not res:
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
--- a/tests/00smoke.py
+++ b/tests/00smoke.py
@ -1,2 +0,0 @@
-from ulab import linalg
-print(linalg.eye(3))
--- a/tests/00smoke.py.exp
+++ b/tests/00smoke.py.exp
@ -1,3 +0,0 @@
-array([[1.0, 0.0, 0.0],
-	 [0.0, 1.0, 0.0],
-	 [0.0, 0.0, 1.0]], dtype=float)
--- a/tests/circuitpy/00smoke.py
+++ b/tests/circuitpy/00smoke.py
@ -0,0 +1,2 @@
+import ulab
+print(ulab.eye(3))
--- a/tests/circuitpy/00smoke.py.exp
+++ b/tests/circuitpy/00smoke.py.exp
@ -0,0 +1,3 @@
+array([[1.0, 0.0, 0.0],
+       [0.0, 1.0, 0.0],
+       [0.0, 0.0, 1.0]], dtype=float64)
--- a/tests/circuitpy/argminmax.py
+++ b/tests/circuitpy/argminmax.py
@ -0,0 +1,62 @@
+import ulab
+
+# Adapted from https://docs.python.org/3.8/library/itertools.html#itertools.permutations
+def permutations(iterable, r=None):
+    # permutations('ABCD', 2) --> AB AC AD BA BC BD CA CB CD DA DB DC
+    # permutations(range(3)) --> 012 021 102 120 201 210
+    pool = tuple(iterable)
+    n = len(pool)
+    r = n if r is None else r
+    if r > n:
+        return
+    indices = list(range(n))
+    cycles = list(range(n, n-r, -1))
+    yield tuple(pool[i] for i in indices[:r])
+    while n:
+        for i in reversed(range(r)):
+            cycles[i] -= 1
+            if cycles[i] == 0:
+                indices[i:] = indices[i+1:] + indices[i:i+1]
+                cycles[i] = n - i
+            else:
+                j = cycles[i]
+                indices[i], indices[-j] = indices[-j], indices[i]
+                yield tuple(pool[i] for i in indices[:r])
+                break
+        else:
+            return
+
+# Combinations expected to throw
+try:
+    print(ulab.numerical.argmin([]))
+except ValueError:
+    print("ValueError")
+
+try:
+    print(ulab.numerical.argmax([]))
+except ValueError:
+    print("ValueError")
+
+# Combinations expected to succeed
+print(ulab.numerical.argmin([1]))
+print(ulab.numerical.argmax([1]))
+print(ulab.numerical.argmin(ulab.array([1])))
+print(ulab.numerical.argmax(ulab.array([1])))
+
+print()
+print("max tests")
+for p in permutations((100,200,300)):
+    m1 = ulab.numerical.argmax(p)
+    m2 = ulab.numerical.argmax(ulab.array(p))
+    print(p, m1, m2)
+    if m1 != m2 or p[m1] != max(p):
+        print("FAIL", p, m1, m2, max(p))
+
+print()
+print("min tests")
+for p in permutations((100,200,300)):
+    m1 = ulab.numerical.argmin(p)
+    m2 = ulab.numerical.argmin(ulab.array(p))
+    print(p, m1, m2)
+    if m1 != m2 or p[m1] != min(p):
+        print("FAIL", p, m1, m2, min(p))