Benchmarking : Smil vs scikit-image

Morphological Image Libraries

Discussion


Summary

  1. SpeedUp

  2. Resource usage : CPU and memory

  3. Some programming issues impacting performance

    1. Parallelism

    2. Structuring Element

    3. Size of data types

  4. Next steps

SpeedUp


Globally, we can see from the taurus and nestor summaries that Smil is always faster than scikit-image for gray-level images. For binary images, scikit-image may win on functions doing labeling (label()) or on functions depending on it.

We can observe four situations regarding how speedups vary with image size (see the taurus and nestor summaries):

  • Very high SpeedUps, mostly in basic morphological functions such as erode(), open(), … This happens on functions whose algorithms can be fully parallelized. In these cases, Smil can be 2 or 3 orders of magnitude faster than Scikit-Image. This is explained below in the section Performance/Parallelism.

  • SpeedUps in the range [1,10], almost the same on both machines (modulo the CPU clock ratio, as nestor's clock is faster than taurus's). These functions are probably based on the same algorithm, which can't be parallelized. Examples are hMinima(), watershed() and distance().

  • High speedups with behavior that differs between the two libraries and between the two computers. These functions are probably not based on the same algorithm, are not programmed in the same way, or are only partially parallelized in Smil. Examples are : thinning() and areaOpen().

  • Smil is slower on the function label() and on other functions making use of it (e.g., areaThreshold()). This is because Scikit-Image uses an optimized, but not generic, algorithm (see [B3]). This algorithm doesn't allow for hexagonal grids and uses a parameter called connectivity, which is conceptually different from Smil's generic structuring element. This makes sense, as it solves most cases, but it lacks genericity. Maybe Smil should provide this algorithm in addition to the current one.

When varying the size of the Structuring Element, we observe that the speedup increases with the size. There are two reasons :

  • the data structure used to represent structuring elements;

  • how these libraries handle structuring elements of size bigger than 1.

Both reasons have a big impact on the computational complexity, as explained below in the section Performance/Structuring Elements.

Resource usage (memory and CPU)


We observed, with the help of an external application, the behaviour of Python processes while running mathematical morphology operations.

CPU usage

Regarding CPU usage, it is obviously always intensive when handling huge data. The point of this observation was simply to verify the effectiveness of parallelization in both libraries. While this was confirmed for Smil, we haven't seen any sign of parallelization in Scikit-Image mathematical morphology operations.

Searching the internet, we haven't found any reference to internal parallelization in Scikit-Image, only some hints on how to launch parallel Python tasks. But that isn't parallelization at the library level and isn't transparent to the final user. Then again, maybe we're simply missing information on how to enable parallelization at the library level in Scikit-Image.
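
As a rough in-process alternative to watching a tool like top, one can compare CPU time with wall-clock time: their ratio approximates the number of cores effectively used. The sketch below is only illustrative (the image size and the choice of skimage.morphology.erosion are arbitrary; the equivalent Smil call depends on your installation):

import time
import numpy as np
from skimage.morphology import erosion, diamond

# Arbitrary test image: 4096x4096 random UINT8 values.
img = np.random.randint(0, 256, (4096, 4096), dtype=np.uint8)

t_wall = time.perf_counter()
t_cpu = time.process_time()      # CPU time, summed over all threads of the process
erosion(img, diamond(3))
cpu = time.process_time() - t_cpu
wall = time.perf_counter() - t_wall

# A ratio close to 1 suggests a single-threaded implementation;
# a ratio close to N suggests the work is spread over N cores.
print('CPU/wall ratio : {:.2f}'.format(cpu / wall))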

Memory footprint

We observed that, for both libraries, the memory footprint of the Python process is much higher than the image memory size, even accounting for the possible use of temporarily allocated images. So this is not related to the libraries, but very probably to Python itself.

The interesting information we got from this experiment is that Scikit-Image always needs more memory than Smil. This imposes a limit on the image size each library can handle on computers with limited memory.

We reached such a limit on the desktop computer (nestor), which has only 16 GiB of installed memory, with the function h_minima(). On this computer the image size limit for Scikit-Image was around 12000x12000 (144 MBytes), while the limit for Smil was around 32000x32000 (1 GByte).
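
A simple way to reproduce this kind of observation from Python is to compare the raw image size with the resident set size (RSS) of the process around a call such as h_minima(). The sketch below is only illustrative: psutil is an extra dependency, the image is kept small enough to run anywhere, and before/after RSS values miss the transient peaks that an external monitor would catch:

import numpy as np
import psutil
from skimage.morphology import h_minima

MB = 1000 * 1000
proc = psutil.Process()

# Raw sizes quoted in the text: 12000x12000 -> 144 MB, 32000x32000 -> 1024 MB.
print('12000x12000 UINT8 :', 12000 * 12000 / MB, 'MB')
print('32000x32000 UINT8 :', 32000 * 32000 / MB, 'MB')

# A smaller image, so that the sketch runs on any machine.
img = np.random.randint(0, 256, (4000, 4000), dtype=np.uint8)
rss_before = proc.memory_info().rss
out = h_minima(img, 10)
rss_after = proc.memory_info().rss

print('image size      : {:8.1f} MB'.format(img.nbytes / MB))
print('RSS before call : {:8.1f} MB'.format(rss_before / MB))
print('RSS after call  : {:8.1f} MB'.format(rss_after / MB))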

Notes

  • the data type of test images is UINT8 (8 bits). So, image size limits above are different for other data types;

  • in both libraries the memory footprint is different for different functions.

Some programming issues impacting performance


Let's now talk just about programming issues. Some apparently strange behaviors can be seen in the graphed results, even for small images. Most of the time these behaviors can be explained by hardware or operating system bottlenecks, such as memory bandwidth, the cost of launching threads for small image sizes, or synchronization issues in parallel code.

Parallelism

Parallelism is implemented in Smil in two ways :

  • Threads - Whenever possible, loops are parallelized in Smil thanks to OpenMP. This kind of parallelization can easily be observed through CPU load (with tools like top). E.g., if a process is seen with a CPU load of “N*100 %”, this means that the task is being distributed over N virtual CPUs (N threads). In this benchmark we haven't noticed any significant parallelism with Scikit-Image.

  • SIMD (Single Instruction Multiple Data) - recent processors have wide registers (128 bits and more), allowing many values to be loaded at once (e.g., 16 one-byte values) and the same instruction to be executed on all of them. Before OpenMP release 4.5 this was done with specific intrinsics from Intel libraries. Releases 4.5 and newer provide specific C and C++ directives to let OpenMP manage this kind of parallelism.

Combining both types of parallelism explains why Smil can be orders of magnitude faster than Scikit-Image, depending on the number of processor cores.
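
The effect of thread-level parallelism can be observed by timing the same Smil operation with different numbers of OpenMP threads. The sketch below is only indicative, not a reference implementation: it assumes the Smil Python binding is importable as smilPython and that Image(), erode() and CrossSE() behave as used elsewhere in this benchmark; OMP_NUM_THREADS must be set before the OpenMP runtime is loaded, and the exact API may differ on your installation.

import os
import time

# Must be set before the library (and its OpenMP runtime) is loaded;
# re-run the script with different values ('1', '4', '8', ...) to compare.
os.environ['OMP_NUM_THREADS'] = '1'

import smilPython as sp        # assumed module name of the Smil Python binding

im_in = sp.Image(8192, 8192)   # assumed width/height constructor; content doesn't matter here
im_out = sp.Image(im_in)
se = sp.CrossSE()

t0 = time.perf_counter()
for _ in range(10):
    sp.erode(im_in, im_out, se)
print('mean erode time : {:.3f} s'.format((time.perf_counter() - t0) / 10))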

Structuring elements

Representation and handling of structuring element

In this work, for all operations needing a structuring element, the default one was chosen : CrossSE() for Smil and diamond() for Scikit-Image. For size 1, both correspond to the elementary cross (4-neighborhood) :

  0 1 0
  1 1 1
  0 1 0

There are two differences between the way this structure is handled in the two libraries with significant impact in their performance :

  • Data representation - In Smil, the structuring element is defined as a list of points (five points for CrossSE()), while in Scikit-Image it is defined as a NumPy array (diamond(1) is an array of shape (3,3)).

  • Handling of structuring elements of size greater than 1 - in Scikit-Image the size of the structuring element defines the size of the data: e.g., a diamond(5) is represented by a NumPy array of size 121 (shape (11,11)), while in Smil the size of the structuring element only defines how many times the operation will be repeated with a structuring element of size 1. Note that this is possible with homothetic structuring elements. This difference is illustrated by the sketch below.
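
A quick way to see the Scikit-Image representation is to print the footprint arrays themselves; the snippet below only uses skimage.morphology.diamond and plain NumPy:

import numpy as np
from skimage.morphology import diamond

print(diamond(1))    # the (3,3) cross used as the default structuring element
for r in (1, 2, 5, 10):
    fp = diamond(r)
    # The total array size grows quadratically with the radius,
    # and so does the number of active (non-zero) neighbors.
    print('radius {:2d} : shape {}, size {:4d}, active pixels {:4d}'.format(
        r, fp.shape, fp.size, int(np.count_nonzero(fp))))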

If we relate this to the complexity - the number of memory accesses (reads and writes) per image pixel (for 2D images, with a square bounding box for the structuring element) - a structuring element of size n costs on the order of n accesses per pixel in Smil (n passes, each with a fixed size-1 neighborhood), versus on the order of (2n+1)^2 accesses per pixel in Scikit-Image (a single pass over the full footprint).

So the complexity grows linearly with the size of the structuring element in Smil, while it grows with its square in Scikit-Image. That's why the SpeedUp grows with the size of the structuring element in some of the functions we benchmarked : erosion() and open() for binary images, and erosion(), open(), topHat(), hMinima().
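
This quadratic growth is easy to observe by timing a Scikit-Image erosion for increasing radii (the image size and radii below are arbitrary); with the decomposition described above, the corresponding Smil timings would instead be expected to grow roughly linearly:

import timeit
import numpy as np
from skimage.morphology import erosion, diamond

img = np.random.randint(0, 256, (2048, 2048), dtype=np.uint8)

for r in (1, 2, 5, 10):
    fp = diamond(r)
    t = timeit.timeit(lambda: erosion(img, fp), number=1)
    print('radius {:2d} ({:4d} pixels in footprint) : {:7.3f} s'.format(
        r, int(np.count_nonzero(fp)), t))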

Structuring element versus connectivity

While Smil always uses a formal structuring element in its morphological operations, Scikit-Image sometimes uses a connectivity parameter defined as:

Maximum number of orthogonal hops to consider a pixel/voxel as a neighbor. Accepted values are ranging from 1 to input.ndim.

Note : this equals 1 for 4-connectivity and 2 for 8-connectivity on 2D images. For 3D images, 3 can be used for full 26-connectivity, but 2 may cause some ambiguity in its interpretation.

Source : scikit-image - Module: morphology

Although this is convenient for 2D images and satisfies most situations, it doesn't allow for generic structuring elements nor for hexagonal grids. On the other hand, it allows the use of an optimized algorithm (See [XXX]), and that's why the Scikit-Image label() function is faster than the Smil equivalent.
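
The practical difference is easy to see with skimage.measure.label(), where two pixels touching only by a corner belong to the same component only with connectivity=2 (the small test array below is arbitrary):

import numpy as np
from skimage.measure import label

# Two pixels touching only by a corner.
img = np.array([[1, 0, 0],
                [0, 1, 0],
                [0, 0, 0]], dtype=np.uint8)

print(label(img, connectivity=1).max())   # 2 components (4-connectivity)
print(label(img, connectivity=2).max())   # 1 component  (8-connectivity)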

Size of data types

All tests here were done with images of type UINT8. A rule of thumb, valid most of the time, is that “the bigger the data type, the longer the processing time”. There are two main reasons :

  • computer architecture : memory bandwidth, processor cache size, … Hard to explain this simply;

  • SIMD parallelization is done with the help of registers that are wide but of fixed size. So, in a 128-bit register, the same instruction can operate on 16 UINT8 values, but only on 4 UINT32 values and 2 FLOAT64 values.

To be convinced of the impact of the size of data types, try the following script on your computer :

import numpy as np
import timeit

# Time the four basic arithmetic operations on d x d arrays of various data types.
d = 4096
print('{:27s} {:^7} {:^7} {:^7} {:^7}'.
      format('', '+', '-', '*', '/'))
for t in [np.uint8, np.uint16, np.uint32, np.float32, np.float64]:
    x = 70 * np.ones((d, d), dtype=t)
    y = 3 * np.ones((d, d), dtype=t)
    # timeit returns the total time of 100 runs in seconds;
    # multiplying by 10 gives the mean time per run in milliseconds.
    dp = 10 * timeit.timeit(lambda: x + y, number=100)
    ds = 10 * timeit.timeit(lambda: x - y, number=100)
    dm = 10 * timeit.timeit(lambda: x * y, number=100)
    dd = 10 * timeit.timeit(lambda: x / y, number=100)
    fmt = '* ({:d},{:d}) - {:8s} : '
    fmt += '{:7.3f} {:7.3f} {:7.3f} {:7.3f} - ms'
    print(fmt.format(d, d, str(np.dtype(t)), dp, ds, dm, dd))

You'll notice that, except for division, the elapsed time grows almost linearly with the size of the data type. The division algorithm is, most of the time, iterative and dependent on the values being handled.

Next steps

  • 3D images - 3D images are a very important target for morphological image applications : materials, medical imaging, … A future version of this benchmark should include images of this type;

  • Other functions to benchmark;

  • Library improvements - both projects are alive and maintained, and improvements are made all the time in both libraries. So this benchmark is just a snapshot taken at some point in time, and it must be updated from time to time.