Do you think running code in the cloud makes your code run faster? Let’s rethink

Sumeet More
Nov 9, 2019

Suppose there is a company named “Mumbai Donuts”. They have a sale tomorrow. The VP of sales comes to you and says, “Dude, tomorrow we are expecting a lot of traffic on our web/mobile application. Make sure our backend services are running fine. All we need is to quickly give the discounted price of the donut a customer selects. That’s it.”

Let’s translate this business problem into a technical one:

Okay! We have one lakh (100,000) different types of donuts (I know the situation is a little unrealistic :P! Can’t help it. I love donuts and large-scale problems) and a list of one lakh discount amounts, one for each type. All we have to do is the following:

Final cost = Original cost - Discount amount.

E.g., if the original cost is 40 rs and the discount on it is 5 rs, then the final cost will be 35 rs. Simple! I am giving a trivial example so that it is easy to follow (in reality, the business logic is a little more complex).

From the business problem, we get two lists: the original costs of the different donuts = {23, 56, …} and their corresponding discount amounts = {1, 5, …}.

We get the final cost list by subtracting the second list from the first, element by element, like {23 - 1, 56 - 5, …}.
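As a minimal sketch of that element-by-element subtraction, here it is in Rust. The helper name `final_costs` and the use of `i32` for rupee amounts are assumptions made for illustration only.

```rust
/// Subtract each discount from the matching original cost.
/// `final_costs` is a hypothetical helper name used only in this sketch.
fn final_costs(original: &[i32], discount: &[i32]) -> Vec<i32> {
    // Both lists must describe the same donuts, in the same order.
    assert_eq!(original.len(), discount.len(), "lists must be the same length");
    original
        .iter()
        .zip(discount.iter())
        .map(|(cost, off)| cost - off)
        .collect()
}
```

Calling `final_costs(&[23, 56], &[1, 5])` gives `[22, 51]`, matching the lists above.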

Solutions

  1. Sumeet? Seriously! We can simply subtract the two lists and get the final list. (If you thought of parallelizing the code using different threads, that’s another good idea, but the overhead of creating that many threads for 1 lakh items is a little concerning.)
  2. SIMD

Single instruction, multiple data. Wait, what?

Let’s understand how high-level code works.

First, where is our code stored? It is in RAM (memory). You know that memory runs at a lower frequency than the CPU, which means the CPU is faster than memory. That is why the CPU has a cache, where some of the data from memory is kept so that the CPU does not sit idle. Okay, fine, but why are you telling us this, Sumeet? Because, similarly, when our program has a lot of instructions, the memory and the CPU have to do a lot of work to execute them, and performance suffers. The work a CPU does is measured in CPU cycles, and every instruction has to be fetched, decoded and executed. The more CPU cycles we waste, the less performant the application becomes.

What can we do then? Minimize the number of instructions the CPU has to execute, which saves CPU cycles and makes the application faster.

Take an example: a = b + c

This compiles to assembly instructions roughly like these (a rough representation):

LDR R1, [b]

LDR R2, [c]

ADD R3, R1, R2

These 3 instructions are executed by the CPU when we add two numbers. So if we have a million numbers, will these instructions keep repeating for each of them? YEP.

Can we reduce the number of instructions by having an instruction that performs an add/subtract/multiply operation on an array of numbers at the same time? Instead of executing the above 3 instructions a million times, can’t we have a single instruction that performs the addition on 10 numbers at once? That would reduce 1 million (10 lakh) instructions to 1 lakh instructions and boost the performance of the application. Yes, there are such instructions, and they are called SIMD (single instruction, multiple data).
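The arithmetic behind that reduction can be sketched in a few lines of Rust. The function name `subtract_instructions` is hypothetical, and the count is a simplification: it ignores loads, stores and loop bookkeeping. The width of 10 is only the round number used in the text; real hardware typically works on widths such as 4 or 8 32-bit values per register.

```rust
/// Number of arithmetic instructions needed for `items` values when one
/// instruction handles `lanes` values at a time (rounded up for leftovers).
/// A simplified model: loads, stores and loop overhead are not counted.
fn subtract_instructions(items: usize, lanes: usize) -> usize {
    assert!(lanes > 0, "lanes must be at least 1");
    (items + lanes - 1) / lanes
}
```

With `lanes = 1` a million items need 1,000,000 instructions; with `lanes = 10` they need 100,000.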

Let’s use SIMD for our donuts problem.

Let’s see the code for the “Mumbai Donuts” problem.

Code:

The normal method uses a for loop over a million items, and the better version method uses high-level SIMD code.
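The listing itself is not included in this text. As a rough stand-in, here is a minimal Rust sketch of the two approaches. The function names and the lane count of 8 are assumptions for illustration, and this is not the code behind the benchmark. The chunked version only gives the compiler a SIMD-friendly shape; whether vector instructions are actually emitted depends on the target CPU and the optimization level.

```rust
/// Assumed lane count for this sketch: 8 x i32 fits one 256-bit register.
const LANES: usize = 8;

/// Normal method: one subtraction per loop iteration.
fn final_costs_loop(original: &[i32], discount: &[i32], out: &mut [i32]) {
    assert!(original.len() == out.len() && discount.len() == out.len());
    for i in 0..out.len() {
        out[i] = original[i] - discount[i];
    }
}

/// SIMD-friendly method: process LANES values per outer iteration so the
/// fixed-size inner loop can be turned into a single vector subtraction.
fn final_costs_chunked(original: &[i32], discount: &[i32], out: &mut [i32]) {
    assert!(original.len() == out.len() && discount.len() == out.len());
    let mut o = original.chunks_exact(LANES);
    let mut d = discount.chunks_exact(LANES);
    let mut r = out.chunks_exact_mut(LANES);
    for ((oc, dc), rc) in (&mut o).zip(&mut d).zip(&mut r) {
        for lane in 0..LANES {
            rc[lane] = oc[lane] - dc[lane];
        }
    }
    // Handle the leftover items that do not fill a whole chunk.
    for ((x, y), z) in o.remainder().iter().zip(d.remainder()).zip(r.into_remainder()) {
        *z = x - y;
    }
}
```

Both functions write the same results into `out`; only the shape of the loop differs.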

Let’s see results:

You can clearly see that the better version (SIMD) ran faster than the normal version (for loop). My laptop is old (2013); the latest laptops/CPUs should improve on this benchmark since they have better SIMD hardware support.

Let’s see some assembly code and understand why it is fast.

In the pic below you can clearly see that the better version method (the SIMD one) produces a special subtraction instruction. A single instruction operates on multiple data and helps us save CPU cycles.

The pic below is of the normal for loop method, and you can clearly see the number of extra instructions it has produced.

Let’s address the title of this article now. Even if you run this in the cloud or on any high-end machine, it is still going to produce the same number of instructions if you use the normal for loop. What I have shown in this article is how to parallelize the workload at the instruction and hardware level.

But Sumeet, what if we have multiple threads running on multiple cores? A thread is a high-level concept; at the hardware level a thread is just a set of instructions, and it will execute the same number of instructions as the normal for loop does. SIMD on a single core might even outperform multiple threads running on multiple cores, since using multiple cores and threads brings extra memory usage and CPU utilization, which can in turn degrade the performance of the application. And when we go multi-core, issues with memory access start coming up (how? That’s a story for another day) ☺. Even though we are running stuff in the cloud, let’s not forget that the CPU architecture remains the same. ✌️
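To make the point about threads concrete, here is a minimal Rust sketch that splits the donut list across a few threads. The function name `final_costs_threaded` and the `workers` parameter are assumptions for illustration. Each thread still runs the same one-subtraction-per-item loop on its own slice, so the total number of subtractions does not shrink; the work is only spread across cores.

```rust
/// Split the work across `workers` threads. Every thread runs the plain
/// scalar loop on its own slice of the input.
fn final_costs_threaded(original: &[i32], discount: &[i32], out: &mut [i32], workers: usize) {
    assert!(workers > 0, "need at least one worker");
    assert!(original.len() == out.len() && discount.len() == out.len());
    // Items per thread, rounded up; at least 1 so `chunks` never gets 0.
    let per_worker = ((out.len() + workers - 1) / workers).max(1);
    std::thread::scope(|s| {
        for ((oc, dc), rc) in original
            .chunks(per_worker)
            .zip(discount.chunks(per_worker))
            .zip(out.chunks_mut(per_worker))
        {
            s.spawn(move || {
                for i in 0..rc.len() {
                    rc[i] = oc[i] - dc[i];
                }
            });
        }
    });
}
```

All spawned threads are joined when `std::thread::scope` returns, so `out` is fully written by then.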

Happy Coding ☺.


Sumeet More

Associate Vice President at Kotak Securities | Backend Engineer and Architect| Blockchain & ML enthusiast | C#,.NET Core, Rust, Javascript and Go