[SPO600] Final Thoughts

Coming to the close of this semester I just wanted to make at least one post regarding the entirety of the SPO600 course with Chris Tyler.

The thing I want to start off saying comes from the “SPO600 – Information for Prospective Students” page:

Is this an easy course?
No! – It’s a challenging course.

Chris is definitely right about this: the course is challenging. There were many times when I felt like I wasn’t ready for it. A proficient understanding of Linux is required to get anywhere in this course. Using makefiles, installing applications, trying different compile flags, egrep, and a host of other things were all things I personally had to work through (with a lot of help from Google) just to keep up with the other students in the class. However, having the other students be more proficient with Linux wasn’t all bad. Because of it I was forced to learn so much more about Linux than I had learned in the course we have to take about using Linux.

Chris goes on in the “SPO600 – Information for Prospective Students” page:

However, it covers material which is not covered elsewhere in the program, and if you like to understand technology in detail, you may really enjoy this course. The knowledge and skills covered in this course are of practical value to programmers and system administrators.

Sometimes education in college is all too practical and leaves the theory out because of time constraints or other reasons. Although I understand the reasoning behind this method (and have chosen it for myself), sometimes an understanding of the technologies we are working with is left out, even though it can be of “practical value to programmers and system administrators”. I feel like Chris does a good job in this course of teaching some of the more theoretical parts of our future careers that may be glossed over in other courses. Things like XOR, branch prediction and the many methods of optimization have definitely not been taught in any other course I have taken and are extremely interesting to me.

This course obviously deals with more practical concepts as well: the basics of assembly, how the registers work, and, because not all of us programmers (or sys admins) are going to be dealing with assembly, what I think is most important, working in open source and in a community in general.

Overall, although I thought this course would mostly be about learning to code in assembly, what I got from it was so much better. I feel like I learned the general concepts of so many things that will definitely help me in the future, instead of learning just how to code in assembly, which may be happening less and less because of improving compilers. I’m glad I took this course and hope to use what I have learned in the future to help my career.

[SPO600] Problems with “Blundering through Assembly”

Turns out having (and documenting) a plan really does help. In my previous post, where I outlined my plan for porting 32-bit to 64-bit ARM assembly, I had said this was my plan:

As I am not exactly the most proficient with making assembly code of my own, for my first attempt at porting the assembly code required for a more optimized PolarSSL from aarch32 to aarch64 I am going to try to take the exact same assembly from the package for the 32-bit architecture and use 32-bit-wide access on the registers for the 64-bit assembly.

I just realized now that since the 32-bit assembly code on ARM32 was actually slower than the C fallback on AARCH64, there wouldn’t be any point in porting over that same assembly using 32-bit-wide access on a 64-bit machine. The assembly for AARCH64 using 32-bit-wide access would still not be more efficient than what is being used on AARCH64 by default (the C fallback).

Unfortunately porting over from 32-bit to 64-bit without limiting the 64-bit assembly to just using 32-bit-wide access is far out of my scope at this point in time. This is where the progress comes to an end for me. I hope that if someone plans on porting over PolarSSL to ARM64 they stumble upon this blog as I believe I have provided all the information required prior to coding the assembly.

If you are reading this and plan on embarking on cooking up some assembly I’m glad you found this, and wish you good luck!

[SPO600] Last Step(s)

After spending most of this course laying down all this groundwork, finding the assembly code, looking through bn_mul.h, figuring out that changing bn_mul.h will lead to the best performance increase, benchmarking with and without assembly, and benchmarking on ARM32, I can finally say there is only one last step to take.

Unfortunately for me it is definitely the hardest: Coding assembly for AARCH64.

Hopefully I can manage to get some assembly up into the PolarSSL repo to increase the performance for AARCH64. On the plus side if anyone wants to pick up this project after me all the legwork is already done and if they have a strong understanding of assembly they can go straight to polarssl/include/polarssl/bn_mul.h and get straight to coding.

[SPO600] ARM32 Optimization for PolarSSL

Finally got PolarSSL working on ARM32! It looks like the assembly for ARM32 really helps the performance of PolarSSL on a 32-bit ARM machine. Here are my benchmarking results, with the old AARCH64 benchmarking alongside for comparison:

Times in seconds (5 runs each, then the average):

AARCH32
cfallback   Run 1    Run 2    Run 3    Run 4    Run 5    Average
real        71.871   69.983   73.919   73.59    71.633   72.1992
user        70.37    68.52    72.51    75.2     71.24    71.568
sys         1.6      1.59     1.53     1.49     1.5      1.542
assembly
real        45.866   43.162   43.506   46.363   45.333   44.846
user        44.38    41.68    42.05    44.94    43.94    43.398
sys         4.6      4.59     1.58     1.56     1.51     2.768

AARCH64
cfallback   Run 1    Run 2    Run 3    Run 4    Run 5    Average
real        38.235   36.175   36.567   37.844   38.517   37.4676
user        33.67    31.6     32.08    33.33    33.96    32.928
sys         4.4      4.43     4.36     4.35     4.39     4.386
assembly?
real        39.009   34.48    37.933   39.391   36.038   37.3702
user        34.48    32.13    33.38    34.81    31.49    33.258
sys         4.37     4.41     4.41     4.43     4.4      4.404

It turns out the platform-specific assembly in bn_mul.h is really useful. Running the testing suite with the C fallback is much slower than with the assembly (about 72.2 s versus 44.8 s real time on average, roughly 1.6x as slow). However, it is worth noting that even with the assembly, AARCH32 is still a fair amount slower than AARCH64 with the C fallback (i.e. the AARCH64 testing suite is faster than AARCH32 with or without platform-specific assembly).

Compared to some of the other packages on Linaro Connect which were much slower without assembly, or possibly didn’t even build on AARCH64 this is definitely not something that I feel should be a priority if the 32-bit ARM version is acceptable.

[SPO600] Optimized C

Just a quick update on the progress of porting PolarSSL. While cal-6-5 was busy installing cmake, I tried kicking up the optimization for the C fallback on AARCH64 to see if it would produce any better results. All the tests in the testing suite still pass, but there is no improvement in the benchmarking, so that approach didn’t work. I didn’t expect it to work, because by now someone would probably have taken out the assembly if the C worked better, but it was a quick test so it was worth a shot.

[SPO600] Another Approach – Setup

Recently I have been informed by Chris there is a 32-bit ARM machine called cal-6-5 which I could use. My plan for this would be to test the assembly for 32-bit ARM PolarSSL and see if it is better than the compiled C fallback. If it wasn’t there wouldn’t really be any point in trying to decipher that 32 bit code and port it over for 64-bit ARM.

I have run into a few problems regarding cal-6-5. I’m sure a more experienced Linux user would love having an entirely clean Fedora machine, but for a relatively new Linux user the problems I have encountered have taken me quite a while to get through.

Being a completely(?) clean installation of Fedora the machine didn’t have git, so I couldn’t fetch the PolarSSL repo. Once I had downloaded and installed git I managed to get the repo but then found out that the machine didn’t have functionality to use cmake, which is what I had used to compile the other 2 versions of PolarSSL on the 64 bit ARM and x86 machines.

I tried installing cmake but it seems like there was no C++ compiler available either*. So after a little work I managed to install gcc, and after quite a while (and much frustration) I found out that gcc-c++ is a separate package from gcc and I had to install gcc-c++ as well. After all this work cmake is now installing; hopefully this will help me out and all this setup won’t be for nothing.

* There’s probably tons wrong with this post but these are the steps I’ve taken, I think I’m going in the right direction here

[SPO600][Lab 3] Assembly Lab

Just realized I never blogged about our 3rd lab we did in class, luckily I had the code that we worked on emailed to me from my group. My group consisted of 3 people. [names have been taken out just in case they don’t want them posted here]

The task we were given was to create an assembly program that would loop through the numbers 1-15 and print out each number. While this seems fairly simple it was actually a little bit more difficult than we expected.

Firstly this is the C code that would produce the same result:

#include <stdio.h>

int main() {
    /* print the loop index for 0 through 9 */
    for (int i = 0; i < 10; i++)
        printf("Loop: %d!\n", i);

    return 0;
}

Pretty simple. Here’s the code that we generated to do the same thing in assembly:

/*
This is a ‘hello world’ program in x86_64 assembler using the
GNU assembler (gas) syntax. Note that this program runs in 64-bit
mode.

CTyler, Seneca College, 2014-01-20
Licensed under GNU GPL v2+
*/

.text
.globl _start

start = 0 /* starting value for the loop index */
max = 10 /* loop exits when the index hits this number (loop condition is i<max) */

_start:
mov $start,%r15 /* loop index */
loop:

/*print*/
movq $len,%rdx /* message length */
movq $msg,%rsi /* message location */
movq $1,%rdi /* file descriptor stdout */
movq $1,%rax /* syscall sys_write */

/*start of mangling*/
mov %r15,%r14
add $0x30,%r14b
mov $msg,%r13
add $0x6,%r13
mov %r14b,(%r13)

/* end of mangling*/
syscall

inc %r15 /* increment index */
cmp $max,%r15 /* see if we’re done */
jne loop /* loop if we’re not */

mov $0,%rdi /* exit status */
mov $60,%rax /* syscall sys_exit */
syscall

.section .data
msg: .ascii "loop: !\n"
len = . - msg

As we can see it’s a bit more lengthy, but Chris’ comments really helped us out. Most of the code is dedicated to iterating through the loop and printing the output. Once we could do both of those things, it was fairly easy to put the values we wanted into the registers we wanted: the index is copied into r14, 0x30 is added to turn it into an ASCII digit, and that byte is written into the message at offset 6. The print and loop code that Chris provided does the rest of the work.

[SPO600] Plan to Blunder through assembly

As I am not exactly the most proficient with making assembly code of my own, for my first attempt at porting the assembly code required for a more optimized PolarSSL from aarch32 to aarch64 I am going to try to take the exact same assembly from the package for the 32-bit architecture and use 32-bit-wide access on the registers for the 64-bit assembly.

Chris Tyler has provided the class with a quick start page for the aarch64 registers and instructions. So using this coupled with a bit more research I will be trying to port the instructions over from aarch32. I will be accessing the registers using the w- prefix instead of the r- prefix. I also have to keep in mind the special registers may have changed so I must be sure I’m not using any of the registers that are used for other purposes.

Looking at the code I see mostly r0-r7 which on aarch64 would be used for arguments and return values, I will be looking up if these registers are used for the same things on a 32 bit model but in the context it does make sense if this were the case.

Also, I haven’t shown it in my blog, but bn_mul.h does have some flags that will have to be set to get the compiler to actually use the assembly. I will have to switch on POLARSSL_HAVE_ASM and __thumb__ and make sure __thumb2__ is not defined (off).

Last but not least, MULADDC is referred to a lot in the file, so I have to figure out what MULADDC is. Searching for it results in only people asking on forums if they can learn to code in assembly and posters telling them not to, and that the compiler should optimize fairly well. If only it were that simple.

[SPO600] bn_mul.h

Although it doesn’t say anywhere in the file what bn_mul.h stands for, I think it’s safe to say that it has to do with big number multiplication. If the file name didn’t make it obvious enough, the fact that multiplication is mentioned in the comments and that the file includes a header called bignum.h is enough for me.

I’ve cut out most of the file only leaving the parts relevant to porting the package.

//The Comments regarding architectures

/*
* Multiply source vector [s] with b, add result
* to destination vector [d] and set carry c.
*
* Currently supports:
*
* . IA-32 (386+) . AMD64 / EM64T
* . IA-32 (SSE2) . Motorola 68000
* . PowerPC, 32-bit . MicroBlaze
* . PowerPC, 64-bit . TriCore
* . SPARC v8 . ARM v3+
* . Alpha . MIPS32
* . C, longlong . C, generic
*/
#ifndef POLARSSL_BN_MUL_H
#define POLARSSL_BN_MUL_H

#include "bignum.h"

//Assembly for x86_64

#if defined(__amd64__) || defined (__x86_64__)

#define MULADDC_INIT \
asm( \
"movq %3, %%rsi \n\t" \
"movq %4, %%rdi \n\t" \
"movq %5, %%rcx \n\t" \
"movq %6, %%rbx \n\t" \
"xorq %%r8, %%r8 \n\t"

#define MULADDC_CORE \
"movq (%%rsi), %%rax \n\t" \
"mulq %%rbx \n\t" \
"addq $8, %%rsi \n\t" \
"addq %%rcx, %%rax \n\t" \
"movq %%r8, %%rcx \n\t" \
"adcq $0, %%rdx \n\t" \
"nop \n\t" \
"addq %%rax, (%%rdi) \n\t" \
"adcq %%rdx, %%rcx \n\t" \
"addq $8, %%rdi \n\t"

#define MULADDC_STOP \
"movq %%rcx, %0 \n\t" \
"movq %%rdi, %1 \n\t" \
"movq %%rsi, %2 \n\t" \
: "=m" (c), "=m" (d), "=m" (s) \
: "m" (s), "m" (d), "m" (c), "m" (b) \
: "rax", "rcx", "rdx", "rbx", "rsi", "rdi", "r8" \
);

#endif /* AMD64 */

//Assembly for ARMv3+

#if defined(__arm__)

#if defined(__thumb__) && !defined(__thumb2__)

#define MULADDC_INIT \
asm( \
"ldr r0, %3 \n\t" \
"ldr r1, %4 \n\t" \
"ldr r2, %5 \n\t" \
"ldr r3, %6 \n\t" \
"lsr r7, r3, #16 \n\t" \
"mov r9, r7 \n\t" \
"lsl r7, r3, #16 \n\t" \
"lsr r7, r7, #16 \n\t" \
"mov r8, r7 \n\t"

#define MULADDC_CORE \
"ldmia r0!, {r6} \n\t" \
"lsr r7, r6, #16 \n\t" \
"lsl r6, r6, #16 \n\t" \
"lsr r6, r6, #16 \n\t" \
"mov r4, r8 \n\t" \
"mul r4, r6 \n\t" \
"mov r3, r9 \n\t" \
"mul r6, r3 \n\t" \
"mov r5, r9 \n\t" \
"mul r5, r7 \n\t" \
"mov r3, r8 \n\t" \
"mul r7, r3 \n\t" \
"lsr r3, r6, #16 \n\t" \
"add r5, r5, r3 \n\t" \
"lsr r3, r7, #16 \n\t" \
"add r5, r5, r3 \n\t" \
"add r4, r4, r2 \n\t" \
"mov r2, #0 \n\t" \
"adc r5, r2 \n\t" \
"lsl r3, r6, #16 \n\t" \
"add r4, r4, r3 \n\t" \
"adc r5, r2 \n\t" \
"lsl r3, r7, #16 \n\t" \
"add r4, r4, r3 \n\t" \
"adc r5, r2 \n\t" \
"ldr r3, [r1] \n\t" \
"add r4, r4, r3 \n\t" \
"adc r2, r5 \n\t" \
"stmia r1!, {r4} \n\t"

#define MULADDC_STOP \
"str r2, %0 \n\t" \
"str r1, %1 \n\t" \
"str r0, %2 \n\t" \
: "=m" (c), "=m" (d), "=m" (s) \
: "m" (s), "m" (d), "m" (c), "m" (b) \
: "r0", "r1", "r2", "r3", "r4", "r5", \
"r6", "r7", "r8", "r9", "cc" \
);

#else

#define MULADDC_INIT \
asm( \
"ldr r0, %3 \n\t" \
"ldr r1, %4 \n\t" \
"ldr r2, %5 \n\t" \
"ldr r3, %6 \n\t"

#define MULADDC_CORE \
"ldr r4, [r0], #4 \n\t" \
"mov r5, #0 \n\t" \
"ldr r6, [r1] \n\t" \
"umlal r2, r5, r3, r4 \n\t" \
"adds r7, r6, r2 \n\t" \
"adc r2, r5, #0 \n\t" \
"str r7, [r1], #4 \n\t"

#define MULADDC_STOP \
"str r2, %0 \n\t" \
"str r1, %1 \n\t" \
"str r0, %2 \n\t" \
: "=m" (c), "=m" (d), "=m" (s) \
: "m" (s), "m" (d), "m" (c), "m" (b) \
: "r0", "r1", "r2", "r3", "r4", "r5", \
"r6", "r7", "cc" \
);

#endif /* Thumb */

#endif /* ARMv3 */

So it looks like ARMv3 is being supported. I’ve looked into the ARM microarchitectures and it looks like ARMv3 supports 32-bit memory, so taking from both the x86_64 and the existing ARM assembly code I should be able to build my own 64-bit assembly.

[SPO600] Searching For Assembly

Looking for the assembly code in my package, “PolarSSL”, it seems there are 4 files which contain any assembly language at all. The four files (not including asm mentioned in comments in config.h, and ChangeLog) being:

  1. library/padlock.c
  2. library/aesni.c
  3. library/timing.c
  4. include/polarssl/bn_mul.h

Revisiting the issues on the Linaro website:

lightweight crypto and SSL/TLS library; x86 asm for lowlevel driver support (VIA Padlock), timer access on various arches, bignum maths on various arches (incl A32) with C fallback; would need A64 porting for best results

We can clearly see all the corresponding files.

In the next post we will be diving into bn_mul.h.