Let me start with an example – sum up all numbers in a 10000×10000 matrix (of the default type double), which would involve 100M floating point additions. Look at the two implementations below:
% Code A:
x = randn(1e4);
s=0;
tic
for i=1:1e4
for j=1:1e4
s=s+x(i,j)*x(i,j);
end
end
toc
Avg Time Taken* : 2.2s
% Code B:
s=0; tic, s = sum(x(:)); toc
Avg Time Taken* : 0.06s
* on a 2.5Ghz Intel Core-I5 with R2011a running on OS 10.8
It can be seen that Code-A takes whopping 36 times longer than Code-B to do the same task. The reason B is faster is because it does a vectorised addition in contrast to a repetitive addition by loops of A.
Vectorization is akin to batch processing – its lot faster to execute a task on bunch of numbers than executing the same task individually on the same bunch of numbers. This is not really any magic, but an exploitation of processor architecture with some smart coding. Think of it like time taken for you to fetch groceries all in one go versus the time taken if you had to visit the store once for every item on your wife’s list.
Another example – Get averages of every two successive numbers in an array. i.e. if x = [0,1,2,3,4,5]
create a y which is [0.5,1.5,2.5,3.5,4.5].
It might be tempting to put a loop for the purpose, but it can be done much more elegantly using indexing as follows.
y = (x(1:end-1)+x(2:end))/2;
Another example: evaluating a function in 2 independent variables over a given 2D space. Let the function be the Sinc function given as f(x,y) = { sin((x^2+y^2)^0.5) } / ((x^2+y^2)^0.5) in the space [-2pi, 2pi].
[x,y] = meshgrid(-2*pi:0.1:2*pi,-2*pi:0.1:2*pi);
z = sin(sqrt(x.^2+y.^2)) ./ (sqrt(x.^2+y.^2))
In real world problems, vectorization makes day and night difference. It could be so pronounced for a computationally intensive task that one might have to wait for days in contrast to waiting a few minutes with vectorization exploited. Vectorising has no syntaxes per se, its just an art of utilising some cool features, operators and builtin functions. E.g . (the dot) operator for element-wise operations, functions like repmat, reshape. It’s a skill that you nurture with some practice.
Loops are inherently inefficient in Matlab, but they have come a long way in the last 8 years or so, thanks to JIT compiler (see the 4th comment in the article here). Loops however cannot be ditched altogether. There would be situations where using a loop becomes inevitable to realise some logic (e.g. in cases where current computation depends on previous computation). Its would be a good habit for budding Matlab programmers to think on lines of vectorization and resort to loops only as last option. Vectorization has a downside too. It’s demanding on memory and if computations push the machine to limits with “out of memory” errors you might want to have a relook at vectorization strategy. You might have to reduce the amount of vectorization in favour of loops (with better memory management) to strike a good balance between speed of execution and memory demand.
I would love to see your comments. Share your experiences with vectorization.
Can you tell me if vectorized code is always faster than loops? If its not, any counter-example ?
No, vectorization can sometimes be slower; especially in the cases where huge memory allocations are involved. Consider two ways of calculating kurtosis (ratio of 4th moment about the mean to standard deviation) of columns of a matrix below.
The loop version:
function k = kurtosis2(x,dim)
% Kurtosis function - returns kurtosis of a vector or matrix
k = zeros(1, size(x,2)); % preallocate
if dim==2
x = x';
end
for i = 1:size(x,2)
xi = x(:,i);
mu = mean(xi);
d = (xi-mu).^2;
n = d.^2;
k(i) = mean(n)/(mean(d))^2;
end
if dim==2
k = k';
end
And the vectorised way:
function k = kurtosis3(x,dim)
% Kurtosis function
mu = mean(x,dim);
repsiz = ones(1,length(size(mu)));
repsiz(dim) = size(x,dim);
d = (x-repmat(mu,repsiz)).^2;
n = d.^2;
k = mean(n,dim)./(mean(d,dim)).^2;
Look at how they fare.
>> x = randn(1e4);
>> k1 = zeros(1,1e4); tic, k1 = kurtosis3(x,1); toc
Elapsed time is 1.626570 seconds.
>> k = zeros(1,1e4); tic, k = kurtosis2(x,1); toc
Elapsed time is 1.498233 seconds
Clearly Kurtosis3 (the vectorised version) is a tad slower. As numel(x) increases, vectorised version gets slower and slower due to huge amounts of memory allocations involved in it. Loop version deals only in smaller chunks of memory and is hence faster.