Implementing RSA decryption on a Cortex M4 processor

anand srivastava
Jul 19, 2022
5 min read

A multi-precision library is necessary for writing any encryption decryption software, because they all deal with very large numbers. In this blog we will talk about doing Decryption of an RSA encrypted data and how to implement it for a Cortex M4 processor so that it can be fast.

RSA is the oldest and simplest asymmetric algorithm. RSA key has two components an exponent and a modulo. The modulo is shared between the private and the public key. Private key contains several other components for speeding up encryption, but the public key only has the two elements and also the exponent is a constant value 0x10001 (for keys generated by OpenSSL), which is just 17 bits long, and so needs only 17 iterations of the code.

RSA Basics and implementation details

RSA is based on the below result

Encrypted code = (Plain code ^ Secret Exponent) % Modulo

Plain code = (Encrypted code ^ Public Exponent) % Modulo

RSA implementation depends on the following result

X = xM + y

(X * X) % M => ((xM + y) * (xM + y)) % M

=> (xM * xM + 2 * y * xM + y * y) % M

since (M * anything) % M = 0

=> (y * y) % M

So we can keep on doing modulo every step of the way, and the result will hold.

Calculation of exponent of say 5 can be written as below

X * X * X * X * X = X (bit 0) * (X^2)^2 (bit 2)

So we start as follows we have two numbers Result R and Temp T. R is initialized with 1 and T is initialized with the Data X.

for every bit we calculate T = (T * T) % M

And for any bit which is 1 we update the Result as R = (R * T) % M, using T of previous iteration.

Note the size of T and R have to be twice that of E, M, and X. Also note that since R starts at 1 we can avoid one multiplication by copying T into R, when we encounter first least significant 1 bit in E. For the OpenSSL public key the least significant bit is 1, so we can start with copying X to R.

We take the data which cannot be bigger than the key size. That is a restriction with RSA encryption. And it takes a long time to decrypt or encrypt, unless there is hardware support. Many new microcontrollers have built in encryption/decryption blocks. Still RSA is very expensive compared to Symmetric key encryption, so RSA or Asymmetric key encryption is only used where symmetric algo will not work.

High Level code for RSA decryption/encryption.

1) For each bit of Exponent E, starting from least significant, do the following.

2) if bit is 1 we do R = (R * T) % M

3) in both cases we do T = (T * T) % M

We have seen what is needed to do for doing RSA encryption/decryption.

Now to find out what is required for implementing the modulus and multiplication functions.

Multi-Precision Computation

If the key is 1024 bits, we have to multiply two 1024 bit numbers and get a 2048 bit number. These numbers are huge processors can only do 64 bits at once. And the Cortex M4 microcontroller can do only 32bits. It does have a 64 bit result, so that you can multiply two 32 bit number and get a 64bit result. Still it is too small. This is where multi-precision library comes in.

The multi-precision library holds a 1024 bit number as an array 32 32 bit unsigned integers. And for 2048 bit number it would need an array of 64 32 bit unsigned integers.

We have to basically implement multiplication and Modulus operations which operate on an array of 32 bit integers.

Multiplication

Lets start with multiplication. The multiplication is done the same way as we were taught to do multiplication of two decimal numbers on paper in primary school. Below is an example of multiplication of 128bit numbers stored as an array of 2 32 bit integers. X[] and Y[] are two operands, C[] is the carry over, and R[] is the result. A[] and B[] are temporary numbers. We can see that R size must be the sum of X size and Y size.

X1 X0

* Y1 Y0

-----------------------

A2 A1 A0

B2 B1 B0 00

-----------------------

R3 R2 R1 R0

Where

A0 + C0 = X0 * Y0

A1 + C1 = X1 * Y0 + C0

A2 = C1

B0 + C2 = X0 * Y1

B1 + C3 = X1 * Y1 + C2

B2 = C3

R0 = A0

R1 + C4 = A1 + B0

R2 + C5 = A2 + B1 + C4

R3 = B2 + C5 Note: This step cannot produce a carry.

Much of the above code can be implemented in C, the one part that is beneficial to implement in assembly is the multadd function, which multiplies two numbers and then adds a third number. This can be implemented more efficiently by using the umull instruction for multiplying 2 32bit numbers into a pair of 32bit integers yielding a 64bit result. This function should also add the high 32bit to the upper result and then any overflow to the higher integer and so on till there is an overflow.

Modulus

Modulus is basically a divide operation which results in the remainder. The simplest way is to shift M left till the Most significant 1 bit reaches the extreme left bit position. Now in a loop repeat the following

1) shift right till M is smaller than the Target

2) Subtract the number M from Target

3) repeat from 1 till Target is smaller than M

Note: we don't care about the result of the division only remainder.

Modulus is an extremely slow operation as the complexity is N^3. The 3 Ns have their own limits. For decryption, since exponent is always 17bit with only 2 bits 1, and the first 1 bit is handled by copy, so the first N is 18. For Multiplication both following Ns are divided by 32 as its a multiplication of 32 bit numbers. For Modulus operation, since we are doing it bit by bit, the impact is at full N. This is why the Modulus operation takes the most CPU. This is why it is important to optimize using hand optimized assembly code the inner most loop.

The modulus can be improved by using division. I did not complete that part, as the code was fast enough and we had other projects to work on.

Summary

When I started working on it, we had an opensource code which did the RSA decode, but took several minutes to achieve it. I implemented it with my own code and got it down to 1.5 seconds. After optimizing it with Assembly I could get it down to 100ms, which was usable.