Updated on 2025-03-03

This page presents some general info, tips and tricks, and other knowledge I have gained over the years, specifically in developing embedded system firmware.

By embedded systems I mean microcontrollers; think ATtinys and PICs. I will not be covering any “high level” embedded systems like a Raspberry Pi, as I think of those as closer to a general-purpose computer when developing firmware/software.

This page will be very opinionated (it is my site after all). Be warned!


Don’t pre-optimize, but be cautious

The general programming tip of “Don’t pre-optimize” is also applicable to embedded systems.

If you aren’t familiar, the idea is that when you are developing a piece of software (firmware or PC software), you don’t spend a lot of time optimizing code before it is necessary. For example, if you are using software floating point math, don’t try to convert it to fixed point (more on that here) until the math is too slow for your application.

BUT, you should still apply sound design paradigms in your code, which is good practice anyway and can make your code run better. You should also be somewhat wary of what could be an obvious bottleneck, so that if your firmware needs to be optimized you know what to work on first.

Learn the instruction set

When you are dealing with a new CPU architecture, especially for a larger/professional project, I would have a quick glance at the instruction set.

While for the most part you will let the compiler handle things, you might have a problem that can be solved directly by one or a couple of instructions. For example, if you are dealing with an ARM Cortex-M0 processor and need to reverse the byte order of a word, you can write a C implementation, or you can use the built-in REV instruction.

It will also give you a better idea of what the processor is capable of, and you will be better positioned if you have to optimize things in assembly due to constraints.
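As a rough sketch of that trade-off: on ARM, REV reverses the byte order of a 32-bit word. The first function below is a portable C version; the second relies on the GCC/Clang builtin `__builtin_bswap32`, which those compilers typically lower to a single REV on targets that have it. The function names are hypothetical, and whether a given toolchain emits REV is something to confirm in the disassembly.

```c
#include <stdint.h>

/* Portable C: swap the byte order of a 32-bit word with shifts and masks. */
uint32_t swapBytesPortable(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* GCC/Clang builtin: the compiler picks the best instruction it has
 * (assumption: a GCC- or Clang-based toolchain). */
uint32_t swapBytesBuiltin(uint32_t x) {
    return __builtin_bswap32(x);
}
```

Both functions return the same result, so the builtin can be swapped in once you have checked the generated assembly for your target.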

Fixed Variable Size

We all know the built-in C types: int, char, etc.

I try not to use the pre-defined C types in my firmware. Instead, for the most part, I use fixed-width types from <stdint.h> like uint8_t, int32_t, etc.
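A small sketch of what "fixed width" buys you. The width of int depends on the target (16 bits with avr-gcc by default, 32 bits on ARM Cortex-M toolchains), while the <stdint.h> types are the same everywhere they exist. The helper names below are hypothetical and only there for illustration.

```c
#include <limits.h>
#include <stdint.h>

/* Width in bits of a few types, as seen by the current compiler. */
unsigned bitsOfInt(void)   { return (unsigned)(sizeof(int) * CHAR_BIT); }
unsigned bitsOfUint8(void) { return (unsigned)(sizeof(uint8_t) * CHAR_BIT); }
unsigned bitsOfInt32(void) { return (unsigned)(sizeof(int32_t) * CHAR_BIT); }

/* A uint8_t counter wraps from 255 to 0 on every target. */
uint8_t nextCount(uint8_t count) {
    return (uint8_t)(count + 1u);
}
```

Only bitsOfInt() changes when the same code is moved between an 8-bit and a 32-bit target; the other results stay put.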

My reasoning is as follows:

uint8_t vs int in AVR

To see the danger of the last point, the following two disassemblies are of the same function, one with i declared as uint8_t and the other as int. Both are compiled with -Ofast.

void doLoop(char *text, uint8_t n){
    for(int i=0;i<n;i++){
        doSomething();
    }
}
Disassembly with i as uint8_t:

avr_loop1.o:     file format elf32-avr


Disassembly of section .text:

0000008a <doLoopFor>:
  8a: push r16
  8c: push r17
  8e: mov r16, r22

00000090 <.LBB9>:
  90: cp r22, r1
  92: breq .+10      ; 0x9e <.L2>

00000094 <.Loc.7>:
  94: ldi r17, 0x00 ; 0

00000096 <.L4>:
  96: rcall .-30      ; 0x7a <doSomething>

00000098 <.LVL3>:
  98: subi r17, 0xFF ; 255

0000009a <.Loc.10>:
  9a: cpse r16, r17
  9c: rjmp .-8       ; 0x96 <.L4>

0000009e <.L2>:
  9e: pop r17
  a0: pop r16

000000a2 <.Loc.13>:
  a2: ret

Disassembly with i as int:

avr_loop1.o:     file format elf32-avr


Disassembly of section .text:

0000008a <doLoopFor>:
  8a: push r14
  8c: push r15
  8e: push r16
  90: push r17

00000092 <.LBB9>:
  92: mov r14, r22
  94: mov r15, r1
  96: cp r22, r1
  98: breq .+16      ; 0xaa <.L2>

0000009a <.Loc.7>:
  9a: ldi r16, 0x00 ; 0
  9c: ldi r17, 0x00 ; 0

0000009e <.L4>:
  9e: rcall .-38      ; 0x7a <doSomething>

000000a0 <.LVL3>:
  a0: subi r16, 0xFF ; 255
  a2: sbci r17, 0xFF ; 255

000000a4 <.Loc.10>:
  a4: cp r16, r14
  a6: cpc r17, r15
  a8: brne .-12      ; 0x9e <.L4>

000000aa <.L2>:
  aa: pop r17
  ac: pop r16
  ae: pop r15
  b0: pop r14

000000b2 <.Loc.13>:
  b2: ret

    

Try to avoid floats if unnecessary and there is no FPU

Generally, low-end microcontrollers don’t come with an FPU (floating point unit) alongside the CPU. The FPU is a hardware block that assists in doing floating point (think decimal/fractional numbers) math. Without an FPU, the toolchain will pull in software floating point, which is painfully slow, at least compared to integer arithmetic.

If you don’t need floating point, I would just avoid floats in cases where it’s easy to use integer arithmetic instead. With that said, if your application is not bottlenecked by software floating point, by all means use floats. Remember, don’t pre-optimize until needed.

With that said, if you need to apply fractional math, and software floating point is too slow for your application, you can always use fixed point math. I will leave the Wikipedia article on it if you want to do further reading.
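To give a flavour of fixed point, here is a minimal sketch of a Q8.8 format: a 16-bit integer that stores the value multiplied by 256, so the low 8 bits hold the fraction. The type and function names are hypothetical, and overflow handling and rounding are deliberately left out.

```c
#include <stdint.h>

/* Q8.8 fixed point: a 16-bit integer holding (value * 256). */
typedef int16_t q8_8_t;
#define Q8_8_ONE 256

/* Convert a small whole number to Q8.8. */
q8_8_t q8_8_fromInt(int8_t whole) {
    return (q8_8_t)(whole * Q8_8_ONE);
}

/* Multiply two Q8.8 values: widen to 32 bits so the intermediate
 * product fits, then divide the extra scale factor back out.
 * The result is truncated toward zero and not checked for overflow. */
q8_8_t q8_8_mul(q8_8_t a, q8_8_t b) {
    return (q8_8_t)(((int32_t)a * b) / Q8_8_ONE);
}
```

For example, 1.5 is stored as 384 and 2.0 as 512, and q8_8_mul(384, 512) gives 768, which is 3.0 in Q8.8. Addition and subtraction work directly on the raw integers.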

Looping N times

If you have a function that loops n times, where n is passed in, an initial implementation would be to use a for loop:

void doLoop(char *text, uint8_t n){
    for(uint8_t i=0;i<n;i++){
        doSomething();
    }
}

But I prefer another method: the while loop. We use n as the condition for the while loop while decrementing it each iteration. The while statement loops until the condition is false (which is 0), so it will exit when n reaches 0. The following is a C implementation:

void doLoop(char *text, uint8_t n){
    while(n--){
        doSomething();
    }
}

This method is, in my opinion, more readable, and should save on instructions, as one doesn’t need to create an i variable and keep track of it. Plus, CPUs are good at checking if a value is zero, versus having to compare two values and then check the result.

Below is the disassembly of each function, compiled with -Ofast optimization for an AVR system, allowing the compiler to go to town.

For loop:

; Push values to stack to be used as temp registers
push r16       ; to be used as `n`
push r17       ; to be used as `i`

mov r16, r22   ; move the function input `n` to R16

cp r22, r1     ; check if function input `n` is zero
breq .+10      ; if equal, jump to the first pop instruction
ldi r17, 0x00  ; sets `i` to zero

rcall .-30     ; calls doSomething()

subi r17, 0xFF ; subtract 0xFF from `i`, which is the
               ; same as adding 1 (arithmetic wraparound)

cpse r16, r17  ; if `n`=`i`, skip the next instruction
rjmp .-8       ; jump back to `rcall`

; restore temp values back and return from function
pop r17
pop r16
ret

While loop:

push r17        ; pushes r17 to stack, allowing it to be used as temporary variable

ldi r17, 0xFF   ; sets R17 (what will be used as n) to 255
add r17, r22    ; add the function input `n` to R17
; the above essentially subtracts 1 from n, using arithmetic overflow
cp r22, r1      ; compare function input to 0 (r1 is always zero)
breq .+6        ; if equal to zero, go to the final pop instruction

rcall .-70      ; calls doSomething()

subi r17, 0x01  ; subtract 1 from n
brcc .-6        ; if n just turned <0 (carry bit),
                ; don't branch. otherwise go back to `rcall`

pop r17         ; restore value of r17
ret             ; return

Do-while loop:

push r17      ; pushes r17 to stack, allowing it to be used as temporary variable
mov r17, r22  ; moves the function input `n` to R17

rcall .-84    ; calls doSomething()

dec r17       ; decrement 1 from n
brne .-6      ; if n != 0, go to `rcall`, otherwise continue

pop r17       ; restore value of r17
ret           ; return

    

You might notice another method I added: the do-while. As I was typing this up, I figured I’d try it out, and it turns out it is more efficient than the while loop. This makes sense thinking about it: the while loop requires an initial check to see if n == 0, versus the do-while which just checks at the end, where it branches anyway. The trade-off is that the body always runs at least once, so this version assumes n is never 0. It is implemented as follows:

void doLoopDoWhile(char *text, uint8_t n){
    do{
        doSomething();
    }while(--n);
}
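To make the difference between the two loops concrete, here is a small sketch that counts how many times the body runs. The counter and the count* wrappers are hypothetical test scaffolding, with doSomething() standing in for the real work.

```c
#include <stdint.h>

static uint16_t calls;                 /* how many times the body ran */
static void doSomething(void) { calls++; }

/* while(n--): the condition is checked before the first iteration. */
uint16_t countWhile(uint8_t n) {
    calls = 0;
    while (n--) {
        doSomething();
    }
    return calls;
}

/* do{}while(--n): the body runs once before n is ever checked. */
uint16_t countDoWhile(uint8_t n) {
    calls = 0;
    do {
        doSomething();
    } while (--n);
    return calls;
}
```

For any n from 1 to 255 both return n. With n = 0 the while version runs zero times, but in the do-while version --n wraps to 255 and the body runs 256 times, so only use it when the caller can guarantee n is at least 1.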

I will post an ARM comparison soon (need to document the generated assembly)

If your function uses the incrementing variable i inside the loop, this method is not that applicable. With that said, if all you are using i for is to step through an array (for example to print text), a better way of doing so can be found in the section below.

Stepping through arrays inside function

Let’s say you have an array, like a string, to step through in a function (this can also apply outside of a function, but that is not as common). Let’s say the function is as follows:

void doLoop(char *text, uint8_t n){
    for(int i=0;i<n;i++){
        print(text[i]);
    }
}

Instead of getting the value of the array at offset i, I prefer to think of the array passed in as a pointer, which is what it really is.

So I would get the value of the current pointer location through the * operator, then increment the pointer itself.

void doLoop(char *text, uint8_t n){
    // note: like any do-while, this assumes n >= 1
    do{
        // *text returns the value at the pointer location,
        // and text++ increments the pointer
        print(*text++);
    }while(--n);
}

This is useful if the contents of text only need to be used once; otherwise you lose the original pointer location, making this trick a bit moot.
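If you do need the start of the array again, one option is to walk a copy of the pointer and leave the parameter alone. The sketch below makes two passes over the same text; countMax and its local names are hypothetical, and it uses while(n--) so that n = 0 is handled safely.

```c
#include <stdint.h>

/* Count how many characters equal the largest character in the text.
 * `text` itself is never modified; `cursor` does the walking. */
uint8_t countMax(const char *text, uint8_t n) {
    const char *cursor = text;   /* first pass walks a copy */
    uint8_t left = n;
    unsigned char max = 0;
    while (left--) {
        unsigned char c = (unsigned char)*cursor++;
        if (c > max) {
            max = c;
        }
    }

    cursor = text;               /* rewind for the second pass */
    uint8_t count = 0;
    while (n--) {
        if ((unsigned char)*cursor++ == max) {
            count++;
        }
    }
    return count;
}
```

The extra pointer costs a register (or a stack slot), which is the price of being able to start over.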

for(Ever) (main loop)

The following is a cool way of defining your main loop. While it can be used for any while(1) loop, I reserve it only for the main loop.

#define EVER    ;;

int main(void){
    // init stuff
    for(EVER){
        // main stuff
    }
}

Modulus of 2^n

If you are applying a modulus (%, which is a division remainder operation) with a power of two as the divisor, for example to wrap an index at a buffer size, you can instead AND the dividend with the divisor minus 1, for example:

buff[idx++] = r;        // some operation
// the following two are equivalent
idx %= 64;
idx &= 0x3f;    // 63

This works because we are dealing with a binary system: a modulus by a power of two inherently limits the number of bits in the result (a modulus of 64 limits the result to 0 to 63, or 0 to 0b111111). That is the binary equivalent of ANDing with the maximum value 0b111111, as the rest of the bits will be zero. Note that the two are only equivalent for unsigned (non-negative) values.

If the divisor is part of a design you control, for example a buffer size you define, I would limit it to powers of two for this reason.

Why do this? Well, a modulus, if the compiler does not optimize it (which it should if the divisor is a constant), will be an expensive division operation. Many low-end processors don’t come with a division instruction, so you are spending time doing software division. Even if the processor comes with hardware division, it may take several instruction cycles to complete. Compare that to the AND operation, which is common to all processors, and tends to take only a single instruction cycle.
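Putting that together, here is a minimal sketch of a power-of-two buffer that wraps its index with a mask. BUF_SIZE, BUF_MASK, bufPut and bufIndex are hypothetical names; buff and idx follow the snippet above. The _Static_assert (C11) is there so that a later change to a non-power-of-two size fails at compile time instead of silently corrupting the index.

```c
#include <stdint.h>

#define BUF_SIZE 64u                 /* must be a power of two */
#define BUF_MASK (BUF_SIZE - 1u)     /* 63, i.e. 0x3f */

_Static_assert((BUF_SIZE & BUF_MASK) == 0u, "BUF_SIZE must be a power of two");

static uint8_t buff[BUF_SIZE];
static uint8_t idx;                  /* next write position, always 0..63 */

/* Store one byte and advance the index, wrapping with AND instead of %. */
void bufPut(uint8_t r) {
    buff[idx] = r;
    idx = (uint8_t)((idx + 1u) & BUF_MASK);
}

/* Current write position, exposed for inspection. */
uint8_t bufIndex(void) {
    return idx;
}
```

Deriving the mask from the size keeps the two from drifting apart, which is easy to do when 64 and 0x3f are written out by hand in different places.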

Utilize #define for substitution constants

If you have a constant that holds a single simple value, such as an integer for an I/O pin, you are better off using the C preprocessor to substitute in the desired value. An easy example is defining the I/O pin number for an Arduino program.

Defining the variable as a const can reserve space for it in flash with all the other constants, require an instruction to fetch the constant value from memory to use it, and, depending on the architecture and compiler, have it copied to RAM on startup. An optimizing compiler will often fold such a constant away, but that is not guaranteed.

Compare this to a define, where the value is merely substituted in place before the compilation step. This saves a memory allocation, since the value can easily be loaded with a single load-immediate instruction, which tends to be cheaper than reading from memory.

Below is an example of both using the preprocessor and using a constant variable.

#define LED_IO  5

const uint8_t led_io = 5;

void setup(){
    // both calls do exactly the same thing, but the define saves on a variable.
    pinMode(LED_IO, OUTPUT);
    pinMode(led_io, OUTPUT);
}