
I was trying to implement my own version of matplotlib's hist function and I came up with this:

import numpy as np 
import matplotlib.pyplot as plt
import numba as nb
from numba import prange
import time

N = 10_000_000
normal = np.random.standard_normal(N)

@nb.jit(parallel=True)
def histogram(a, bins = 1000, density=False):
    
    min_data = np.min(a)
    max_data = np.max(a)

    dx = (max_data - min_data)/bins
    D = np.zeros([bins,2])
    
    for i in prange(bins):
        dr = i*dx + min_data
        x1 = a[a>dr]
        x2 = x1[x1<(dr+dx)]
        D[i] = [dr, len(x2)]
    
    D = D.T
    x , y = D
    
    if density == True:
        inte = sum((x[1]-x[0])*y)
        y /= inte

    plt.bar(x, y, width=dx)
    return D

def main():
    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    histogram(normal, bins=1000, density=True)
    #plt.hist(normal, bins=1000, density=True)
    print("Execution time:", time.perf_counter() - start_time, "seconds")
    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    main()

This outputs the desired result, but with an execution time of 24.3555 seconds. However, when I use the original hist (just comment out my function and uncomment plt.hist), it does the job in just 0.6283 seconds, which is ridiculously fast. My question is: what am I doing wrong? And how does hist achieve this performance?

P.S.: Using this bit of code, you can see the result of the hist function next to my implementation:

fig, axs = plt.subplots(1,2)
axs[0].hist(normal, bins=100, density=True)
axs[1] = histogram(normal, bins=100, density=True)
plt.tight_layout()
plt.show()

[Screenshot: side-by-side comparison of plt.hist (left) and the custom histogram (right)]

The left one is hist's output and the one on the right is my implementation; both are exactly the same.

  • Can you show screenshots of your histogram? – Reinderien – Commented Oct 18, 2021 at 17:59
  • @Reinderien I just did. – Commented Oct 18, 2021 at 18:47

1 Answer

You are slicing your original dataset 1000 times, making 2000 border comparisons for each value in it. It is much faster to compute the proper bin for each value with simple arithmetic:

bin = int((value - min_data) / dx)
D[bin][1] += 1

And you can use the original loop to set all the D[bin][0] values. The only problem is that at max_data the computed bin equals bins (not bins-1 as expected); you can either clamp it with an extra if (bin = min(bin, bins - 1)), or allocate one extra bin (D = np.zeros([bins+1, 2])), fold the count from bin bins into bin bins-1, and then drop bin bins.

UPD: Here is my full code (based on the original, with some refactoring):

def histogram(a, bins = 1000, density=False):
    min_data = np.min(a)
    max_data = np.max(a)

    dx = (max_data - min_data) / bins
    x = np.zeros(bins)
    y = np.zeros(bins+1)
    
    for i in range(bins):
        x[i] = i*dx + min_data

    for v in a:
        bin = int((v - min_data) / dx)
        y[bin] += 1
    
    y[bins-1] += y[bins]  # fold the max_data overflow bin into the last real bin
    y = y[:bins]
    
    if density == True:
        inte = sum((x[1]-x[0])*y)
        y /= inte

    plt.bar(x, y, width=dx)
    return np.column_stack((x, y))

This gave me 15 seconds, compared with 60 seconds for the original code. What else can we do? The most intensive calculations are in this loop:

    for v in a:
        bin = int((v - min_data) / dx)
        y[bin] += 1

And they are done in pure Python, which is really slow (compiled C, which NumPy uses for its inner calculations, is roughly 10 times faster). So we'd better do these calculations with NumPy (just change the previous 3 lines to these 3):

    a_to_bins = ((a - min_data) / dx).astype(int)
    for bin in a_to_bins:
        y[bin] += 1

That gave me a final time of about 7 seconds.
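The remaining Python loop can itself be pushed into NumPy: np.bincount performs all the `y[bin] += 1` increments in compiled code, removing the last interpreted loop. A minimal self-contained sketch (the array size here is illustrative, not the exact data from the post):

```python
import numpy as np

a = np.random.standard_normal(1_000_000)
bins = 1000
min_data, max_data = a.min(), a.max()
dx = (max_data - min_data) / bins

# Bin index for every value in one vectorized pass; np.minimum clamps
# the value equal to max_data into the last bin instead of bin `bins`.
a_to_bins = np.minimum(((a - min_data) / dx).astype(np.int64), bins - 1)

# bincount does the "y[bin] += 1" counting loop in compiled code.
y = np.bincount(a_to_bins, minlength=bins)

assert y.sum() == a.size  # every value landed in exactly one bin
```

The np.minimum clamp replaces the extra-bin trick, so no overflow bin needs to be folded back afterwards.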

UPD 2: What is the slowest part now? As we know, intensive calculations in pure Python are slow. Let's profile it:

def histogram(a, bins = 1000, density=False):
    start_time = time.perf_counter()
    min_data = np.min(a)
    max_data = np.max(a)
    dx = (max_data - min_data) / bins
    print(time.perf_counter() - start_time, 'to calc min/max')
    
    x = np.zeros(bins)
    y = np.zeros(bins+1)
    print(time.perf_counter() - start_time, 'to create x, y')
    
    for i in range(bins):
        x[i] = i*dx + min_data
    print(time.perf_counter() - start_time, 'to calc bin borders')

    a_to_bins = ((a - min_data) / dx).astype(int)
    print(time.perf_counter() - start_time, 'to calc bins')
    for bin in a_to_bins:
        y[bin] += 1
    print(time.perf_counter() - start_time, 'to fill bins')
    
    y[bins-1] += y[bins]  # fold the max_data overflow bin into the last real bin
    y = y[:bins]
    
    if density == True:
        inte = sum((x[1]-x[0])*y)
        y /= inte

    print(time.perf_counter() - start_time, 'before draw')
    plt.bar(x, y, width=dx)
    print(time.perf_counter() - start_time, 'after draw')
    return np.column_stack((x, y))

Gives me:

0.010483399993972853 to calc min/max
0.011489700002130121 to create x, y
0.012588899990078062 to calc bin borders
0.09252060001017526 to calc bins
7.7265202999988105 to fill bins
7.727168200013693 before draw
8.440735899988795 after draw

So almost all the time is consumed by the Python loop that adds 1 to each bin (note that the preceding NumPy step computed all the bin indices in under 0.1 seconds). If we rewrote that loop in C we would probably get a roughly 10x faster result, which is pretty close to the original .hist timing.

By the way, if I use @nb.jit(parallel=True), then the time goes down to:

0.014584699994884431 to calc min/max
0.014821599994320422 to create x, y
0.3439053999900352 to calc bin borders
0.5012229999992996 to calc bins
0.8304882999800611 to fill bins
0.8317092999932356 before draw
1.5190471999812871 after draw

While the original .hist on my PC takes 0.8466330000082962 seconds (I don't know whether it uses parallel computation or not).

In total: pure Python calculations are terribly slow, and moving them into compiled code has a huge impact on speed. That is why you'd better use NumPy operations (which run compiled C under the hood) instead of Python loops. It can give a speedup of around 10x or even more.
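For comparison, matplotlib's hist does not bin the data itself: it delegates the counting to np.histogram, which runs in NumPy's compiled core. Calling it directly shows that fast path without any plotting overhead (a sketch; the array size is illustrative):

```python
import numpy as np

a = np.random.standard_normal(1_000_000)

# np.histogram returns the bin counts (or densities) and the bin edges;
# the heavy counting runs in NumPy's compiled core.
y, edges = np.histogram(a, bins=1000, density=True)

# With density=True the histogram integrates to 1 over the bin widths.
widths = np.diff(edges)
assert abs(float((y * widths).sum()) - 1.0) < 1e-9
```

This is essentially what plt.hist computes before handing the result to plt.bar-style drawing, which is why its timing is dominated by rendering rather than binning.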

  • I'm not exactly sure how your solution works. Can you make a reproducible example? – Commented Oct 18, 2021 at 18:48
  • Updated the post with the full code – Commented Oct 19, 2021 at 16:18
  • This is very good, but it is still significantly slower than the original hist function's 0.6283 seconds. So the question becomes: what is happening inside the hist function that makes it that fast? – Commented Oct 19, 2021 at 20:33
