Navigation:

Sub Blogs

Post

Crumb, Getting Linux to Start

Getting Linux running on Crumb is a major part of the development as categorized here. Doing so is not so easy, and as one part of the long processes, is at least getting Linux to somewhat run on the emulator.

This involves fixing many bugs, and so for a start my first aim was to get Linux to be able to log out to the console. This all takes place during early printk and the Linux kernel should eventually reach a road block as there is not interrupt controller on the system.

Despite how early in the boot process this is, there wany many hard bug that had to be sorted out. This took about 3 days of effort once the debugger was built.

Here I will list out the issues, and how I fixed them.

Issue 0 - set_cpu_online in cpu.c causes access error.

Source:

CRUMB Software Emulator cpu_core.c

Cause:

When CRUMB emulator does arithmetic shift right on 32 bit words, it would incorrectly sign extend the value when the MSB was not set, unlike the 64 bit versions that used a conditional and mask to check for this.

The code for SRAW and SRAIW had to be changed to include said conditional.

The result that was during taking the CPUs online, the address of __num_online_cpus would be incorrectly calculated due to large negative numbers being added to the address causing a trap.

Commented Source

// If a cpu is not marked as online, marks it online and
// atomicly increments the number of online CPUs.
void set_cpu_online(unsigned int cpu, bool online)
{
	// Bug Conditions
	// cpu    = 0
	// online = true

    if (online) {
        // Set bit in mask if not set, continue.
        // Returns the masked bits before it was set.
        if (!cpumask_test_and_set_cpu(cpu, &__cpu_online_mask))
            atomic_inc(&__num_online_cpus);
    } else {
        if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
            atomic_dec(&__num_online_cpus);
    }
}

Commented Disassembly

set_cpu_online:
	# Function Header, save stack frame
	addi	sp,sp,-16
	sd	s0,8(sp)
	addi	s0,sp,16

	# A4 = (int)cpu
	sext.w	a4,a0
	
	# A5 = 0x3F
	li	a5,63
	
	# if(online)
	beqz	a1,0x8000f1b0 NOT_ONLINE
	
	# Assert A0 > 63, then error. (No more than 63 CPUs)
	bltu	a5,a0,0x8000f1f8 ASSERT_0
	
	# A5 = ((int)cpu) >> 57
	sraiw	a5,a4,0x1f
	srliw	a5,a5,0x1a
	
	# A5 = A5 + A4
	addw	a5,a5,a4
	
	# A5 = A5 >> 6
	sraiw	a5,a5,0x6
	
	# A3 = &__cpu_online_mask
	auipc	a3,0x62a
	addi	a3,a3,54 # 0x806391b8 <__cpu_online_mask>

	
	# A5 = A5 << 3
	slli	a5,a5,0x3
	
	# A4 = (1 << CPU)
	li	a4,1
	sll	a4,a4,a0
	
	# A5 = A5 + &__cpu_online_mask
	add	a5,a5,a3
	
	amoor.d.aqrl	a3,a4,(a5)
	
	
	and	a4,a4,a3
	bnez	a4,0x8000f1aa 
	auipc	a5,0x62a
	addi	a5,a5,-12 # 0x80639190 <__num_online_cpus>
	li	a4,1
	amoadd.w	zero,a4,(a5)
	ld	s0,8(sp)
	addi	sp,sp,16
	ret

NOT_ONLINE:
	bltu	a5,a0,0x8000f1f4 
	sraiw	a5,a4,0x1f
	srliw	a5,a5,0x1a
	addw	a5,a5,a4
	sraiw	a5,a5,0x6
	li	a4,1
	sll	a4,a4,a0
	auipc	a3,0x62a
	addi	a3,a3,-16 # 0x806391b8 <__cpu_online_mask>
	slli	a5,a5,0x3
	add	a5,a5,a3
	not	a3,a4
	amoand.d.aqrl	a3,a3,(a5)
	and	a4,a4,a3
	beqz	a4,0x8000f1aa 
	auipc	a5,0x62a
	addi	a5,a5,-80 # 0x80639190 <__num_online_cpus>
	li	a4,-1
	amoadd.w	zero,a4,(a5)
	ld	s0,8(sp)
	addi	sp,sp,16
	ret
	
ASSERT_1:
	ebreak
	j	0x8000f1b4 
ASSERT_0:
	ebreak
	j	0x8000f174 

Hand Traced

Expected Program Flow:

set_cpu_online:
	# A0 = 0
	# A1 = 1

	# We will skip the header
	addi	sp,sp,-16
	sd	s0,8(sp)
	addi	s0,sp,16
	
	# A4 = 0
	sext.w	a4,a0
	
	# A5 = 63
	li	a5,63
	
	# Should not branch
	beqz	a1,0x8000f1b0 NOT_ONLINE
	
	# Should not branch
	bltu	a5,a0,0x8000f1f8 ASSERT_0
	
	# A5 = 0
	sraiw	a5,a4,0x1f
	
	# A5 = 0
	srliw	a5,a5,0x1a
	
	# A5 = 0
	addw	a5,a5,a4
	
	# A5 = 0
	sraiw	a5,a5,0x6
	
	# A3 = 0x806391b8
	auipc	a3,0x62a
	addi	a3,a3,54
	
	# A5 = 0
	slli	a5,a5,0x3
	
	# A4 = 1
	li	a4,1
	
	# A4 = 1
	sll	a4,a4,a0
	
	# A5 = &__cpu_online_mask
	add	a5,a5,a3
	
	# This should work
	amoor.d.aqrl	a3,a4,(a5)

What actually happened:

	# A4 = 0
	# Set A5 to -2. Very sus
	sraiw	a5,a4,0x1f

The shift right sign extended, when there was no 1 in the MSB to begin with. The troubled code:

case 0b101:

            resultant = rs1 >> (imm & 0xFF);

            if(imm & 0x400) {

                #ifdef DEBUG_TRACE

                    Log(TAG, "sraiw X%d, X%d, %d", rd_i, rs1_i, imm);

                #endif

                resultant |= 0xFFFFFFFF ^ (0xFFFFFFFF >> imm);

            } else {

                #ifdef DEBUG_TRACE

                    Log(TAG, "srliw X%d, X%d, %d", rd_i, rs1_i, imm);

                #endif

            }

            break;

The fixed code:

case 0b101:

            resultant = rs1 >> (imm & 0xFF);

            if(imm & 0x400) {

                #ifdef DEBUG_TRACE

                    Log(TAG, "sraiw X%d, X%d, %d", rd_i, rs1_i, imm);

                #endif

                if(resultant & 0x80000000)

                    resultant |= 0xFFFFFFFF ^ (0xFFFFFFFF >> imm);

            } else {

                #ifdef DEBUG_TRACE

                    Log(TAG, "srliw X%d, X%d, %d", rd_i, rs1_i, imm);

                #endif

            }

            break;

Same needed to be fixed for similar arithmetic shift right instructions, though only the 32 bit word ones were effected.

WARNING: A second bug appears in this code. Look to Issue 3 - Messages in Kernel Log Missing Formatting

Issue 1 - memblock_alloc_try_nid in memblock.c causes access error.

Source:

CRUMB Software Emulator system.c

Cause:

When a memory is allocated during early boot for the creation of the flattened device tree, the kernel correctly allocates memory at the end of the kernels memory space, in this case 0xBFFFFFFF. With memory starting at 0x80000000.

Here the issue was that Crumb accidently only mapped 1/16 of the 16 MB pages causes anything past them to come up as unmapped throwing an exception.

This crash would occur after the block has been allocated by a child function, and then crash as it attempts to memset it to zeros.

Original Code:

    for(int i = 0; i < 0x70; i+=0x10) {
        inst->io_level_table[0x80+i].instance = &inst->rom;
        inst->io_level_table[0x80+i].on_write = memory_write;
        inst->io_level_table[0x80+i].on_read = memory_read;
    }

Fixed Code:

    for(int i = 0; i < 0x70; i++) {
        inst->io_level_table[0x80+i].instance = &inst->rom;
        inst->io_level_table[0x80+i].on_write = memory_write;
        inst->io_level_table[0x80+i].on_read = memory_read;
    }

Issue 2 - pcpu_alloc_first_chunk crashes with access error in percpu.c

Source:

CRUMB Software Emulator cpu_core.c

Cause:

Missing bit when decoding the intermediate value of the c.lui instruction resulting in a crash due to heap corruption from a undersized buffer due to an error when calculating its size.

This would corrupt a pointer to 0xFFFFFFFFFFFFFFFF causing the access error.

pcpu_alloc_first_chunk Crash Analysis

Commented Source

Cut to include the part causing the trouble.

...
if (chunk->start_offset) {
    offset_bits = chunk->start_offset / PCPU_MIN_ALLOC_SIZE;
    bitmap_set(chunk->alloc_map, 0, offset_bits);
    set_bit(0, chunk->bound_map); // CRASH
    set_bit(offset_bits, chunk->bound_map);
    chunk->chunk_md.first_free = offset_bits;
    pcpu_block_update_hint_alloc(chunk, 0, offset_bits);
}
...

Broken Down Disassembly

eqz	s2,0x803f93e8 # If chunk->start_offset
li	s6,4
divw	s7,s2,s6
ld	a0,72(s1)
li	a1,0
mv	a2,s7
auipc	ra,0xffead
jalr	-750(ra) # bitmap_set
ld	a5,56(s1)  # Get address from stack frame (a invalid address)
amoor.d	zero,s5,(a5) # CRASH
li	a5,256
divw	s2,s2,a5
ld	a5,56(s1)
sll	s5,s5,s7
slli	s2,s2,0x3
add	a5,a5,s2
amoor.d	zero,s5,(a5)li	a1,0
sw	s7,44(s1)
mv	a2,s7
mv	a0,s1
auipc	ra,0xffd42
jalr	-1392(ra) # 0x8013ae70 
lw	s2,104(s1)

# End of if statement.

After following the code around jalr -750(ra) # bitmap_set we can observe that s1 is never corrupted, but *($s1 + 56) before it is a valid memory address, and changes to 0xFFFFFFFFFFFFFFFF after.

bitmap_set must be corrupting the values at the stack frame.

It is apparent that it is likely setting bits to 1, and bleeding over into neighboring values.

Up to calling

s1 = 0x8fbe5cc0               # Stack Frame
a0 = 0x8fbe5b40               # chunk->alloc_map
a1 = 0                        # Start Bit
a2 = 33704                    # Length Bits (4,213 bytes)

chunk->bound_map = 0x8FBE5CF8 # s1 + 56

Based on this we can see there is 440 bytes between the physical locations of chunk->alloc_map and the pointer of chunk->bound_map. Since we will be settings 4213 bytes worth of bits, it is clear that it will be well overridden by bitmap_set causing the error.

This could be caused by a bad placements of the buffers, or the len bits being too large.

pcpu_alloc_first_chunk start_offset Analysis

Start Offset Calculation

    /* region calculations */
    aligned_addr = tmp_addr & PAGE_MASK; // # #define PAGE_MASK 0xFFFFF

    start_offset = tmp_addr - aligned_addr;
    region_size = ALIGN(start_offset + map_size, PAGE_SIZE);

This shows start_offset nor aligned_addr can be greater than 0xFFFFF000.

That puts start_offset at 134816 bytes, thus how do the allocations get so small.

Allocation Size Calculation

	chunk->base_addr = (void *)aligned_addr;
    chunk->start_offset = start_offset;
    chunk->end_offset = region_size - chunk->start_offset - map_size;

	chunk->nr_pages = region_size >> PAGE_SHIFT;
    region_bits = pcpu_chunk_map_bits(chunk);

    alloc_size = BITS_TO_LONGS(region_bits) * sizeof(chunk->alloc_map[0]);
    chunk->alloc_map = memblock_alloc(alloc_size, SMP_CACHE_BYTES);

    if (!chunk->alloc_map)
        panic("%s: Failed to allocate %zu bytes\n", __func__,
              alloc_size);

Here it seems faulty that region_bits is only calculated to be 3072 bits.

pcpu_chunk_map_bits expands to chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE.

Disassembly

A lot has been optimized out by the compiler, making this arithmetic hard to follow.

So we will just follow it and make sure we got the expected values.

	mv	s9,s3               # s3,s9 = 3
	slli	s3,s3,0xa       # s3    = 3072
	
	addiw	a4,s3,63        # a4    = 3135
	srli	a4,a4,0x6       # a4    = 48
	sd	a0,0(a0)

	subw	a5,s2,s4

	slli	s2,a4,0x3       # s2    = 384

	# Push saved registers to the stack
	mv	s1,a0
	sd	a0,8(a0)
	sd	s8,64(a0)
	sw	s7,100(a0)
	sw	a5,104(a0)
	sw	s9,120(a0)

	# Call function
	li	a4,-1
	li	a3,0
	li	a2,0
	li	a1,64
	mv	a0,s2              # a0, alloc_size, = 384 bytes
	jal	memblock_alloc_try_nid

This matches the emulator's output 384 bytes. Algebraically then, the initial value of s3 must be wrong if a bad allocation size is to be expected.

pcpu_alloc_first_chunk, Tracing S3

Here is the code, (at the beginning of the function), where s3 is set before reaching the code from before. We will equate this out to the C code as well working backwards.

C Code

    struct pcpu_chunk *chunk;
    unsigned long aligned_addr;
    int start_offset, offset_bits, region_size, region_bits;
    size_t alloc_size;


    /* region calculations */
    aligned_addr = tmp_addr & PAGE_MASK;
  
    start_offset = tmp_addr - aligned_addr;
    region_size = ALIGN(start_offset + map_size, PAGE_SIZE);


    /* allocate chunk */
    alloc_size = struct_size(chunk, populated,
                 BITS_TO_LONGS(region_size >> PAGE_SHIFT));

Dissassembly

Disassembly, and stepped through by hand.

	# a0 = 0x8fbf3ea0 = tmp_addr
	# a1 = 8192       = map_size
pcpu_alloc_first_chunk:
	# Begining of function stack push.
	addi	sp,sp,-96
	sd	s8,16(sp)
	
	# aligned_addr = tmp_addr & PAGE_MASK;
	# PAGE_MASK = 0xFFFFF000
	lui	s8,0xfffff              # s8 = 0xFFFFF000
	and	s8,a0,s8                # s8 = 0x8FBF3000

	sd	s7,24(sp)
	
	# = 4096
	lui	a5,0x1                  # a5 = 4096
	
	# start_offset = tmp_addr - aligned_addr;
	subw	s7,a0,s8            # s7 = EA0
	
	sd	s4,48(sp)
	
	# A5 = 4095
	addiw	a5,a5,-1 # 0xfff    # a5 = 4095
	
	# s3 = ((start_offset + map_size + 4095) & 0xfffff000) >> 12
	addw	s4,s7,a1            # s4 = 11936
	lui	a4,0xfffff              # a4 = 0xFFFFF000
	addw	a5,s4,a5            # a5 = 16031
	and	a5,a5,a4                # a5 = 16000
	sd	s3,56(sp)               
	sraiw	s3,a5,0xc           # s3 = 16
	...

Here we get the value of 16, not 3. We will step through the code in the debugger until there is a difference.

and s8,a0,s8 returns 0x8fbd3000 instead of 0x8fbf3000. The LUI instruction sets s8 to 0xfffffffffffdf000 instead of 0xfffffffffffff000. A 1 bit difference.

Resolution

This is a compressed, c.lui instruction.

Here is the cpu_core.c's code for it:

			...
			} else { // C.LUI
               uint64_t imm = ((op & (1<<12)) ? 0xFFFFFFFFFFFC0000 : 0) |
                               ((op << 10) & 0b11111000000000000);


                #ifdef DEBUG_TRACE
                    Log(TAG, "c.lui x%d, %d", rd_i, imm);
                #endif
                v = imm;
            }
            ...

Here we can see the intermediate value is calculated, that C needs to be an E:

			...
			} else { // C.LUI
               uint64_t imm = ((op & (1<<12)) ? 0xFFFFFFFFFFFE0000 : 0) |
                               ((op << 10) & 0b11111000000000000);


                #ifdef DEBUG_TRACE
                    Log(TAG, "c.lui x%d, %d", rd_i, imm);
                #endif
                v = imm;
            }
            ...

Issue 3 - Messages in Kernel Log Missing Formatting

Source:

CRUMB Software Emulator cpu_core.c

Cause:

A bug with how the intermediate value is used to sign extend during a sraiw instruction resulted in no sign extension at all.

This caused a miscalculation in buffer sizes returned from vsnprintf causing corrupted kernel log messages.

Investigation

The kernel logging function kprintf uses vsnprintf to calculate the size of the buffer needed for the message before calling vsnprintf again to fill in the message.

When formatting is used through, the size is too small, and the message is clipping, often making it appear there was no formatting to begin with.

This miscalculation occurs in widen_string where the iterated buffer is set back by a large negative number.

Commented Source

char *widen_string(char *buf, int n, char *end, struct printf_spec spec)
{
	// field_width = -1
	// n = some normal small posative number

    unsigned spaces;
	
	// Should exit here, but it does not and keeps going!
    if (likely(n >= spec.field_width))
        return buf;

    /* we want to pad the sucker */
    // Results in negative number, since spaces
    // is unsigned, it becomes very large
    spaces = spec.field_width - n;
    if (!(spec.flags & LEFT)) { // Branch not taken.
		// Undoes all the movement to buf and undoes all string operatinons added before.
        move_right(buf - n, end, n, spaces);
        return buf + spaces;
    }
    
    while (spaces--) {
        if (buf < end)
            *buf = ' ';
            
        ++buf;
    }
    return buf;
}

Commented Assembly

addi	sp,sp,-64
sd	s0,48(sp)
sd	s1,40(sp)
addi	s0,sp,64
sd	ra,56(sp)

# a3 is negative 32 bit number
# After shift a4 is weird posative number
sraiw	a4,a3,0x8
sd	a3,-56(s0)
mv	s1,a0

# n >= spec.field_width inverted
# a1 is normal at this point, our n value
# a4 though is a weird posative number
blt	a1,a4,0x8050def4 

sraiw must be the culprit here.

Resolution

Troubled code from cpu_core.c

As we can see the XOR operation's bit shift used to sign extend the shifted value is enabled when an arithmetic shift is selected, simultaneously though, we forget to mask the imitate value.

			if(imm & 0x400) {
                #ifdef DEBUG_TRACE
                    Log(TAG, "sraiw X%d, X%d, %d", rd_i, rs1_i, imm);
                #endif

                if(resultant & 0x80000000)
                    resultant |= 0xFFFFFFFF ^ (0xFFFFFFFF >> imm);
            }

The corrected code:

			if(imm & 0x400) {
                #ifdef DEBUG_TRACE
                    Log(TAG, "sraiw X%d, X%d, %d", rd_i, rs1_i, imm);
                #endif

                if(resultant & 0x80000000)
                    resultant |= 0xFFFFFFFF ^ (0xFFFFFFFF >> (imm & 0xFF));
            }

The second bug here is that we don't want to check the resultant for bits, we want to check the input.

			if(imm & 0x400) {
                #ifdef DEBUG_TRACE
                    Log(TAG, "sraiw X%d, X%d, %d", rd_i, rs1_i, imm);
                #endif

                if(rs1 & 0x80000000)
                    resultant |= 0xFFFFFFFF ^ (0xFFFFFFFF >> (imm & 0xFF));
            }

Our Pay Off

Now that all those issues have been fixed, the Linux kernel is now able to boot and actually print out an error through earlycon that we have a missing interrupt controller.

Maybe now we can focus more on Linux and not CEMU, our emulator, for now.

Not for long though, soon Linux will likely reach issues with how interrupts and CSRs are handled.

[    0.000000] Linux version 6.11.0-rc3 (rainbain@debian) (riscv64-unknown-linux-gnu-gcc (gc891d8dc23e) 13.2.0, GNU ld (GNU Binutils) 2.42) #12 Mon Sep  9 00:49:02 UTC 2024
[    0.000000] Machine model: Crumb RISC-V SoC
[    0.000000] Forcing kernel command line to: earlycon=uart8250,mmio32,0x10000000
[    0.000000] earlycon: uart8250 at MMIO32 0x0000000010000000 (options '')
[    0.000000] printk: legacy bootconsole [uart8250] enabled
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080000000-0x00000000bfffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x00000000bfffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000000bfffffff]
[    0.000000] Falling back to deprecated "riscv,isa"
[    0.000000] riscv: base ISA extensions acdfim
[    0.000000] riscv: ELF capabilities acdfim
[    0.000000] Kernel command line: earlycon=uart8250,mmio32,0x10000000
[    0.000000] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.000000] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 262144
[    0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] Invalid parameters for execmem allocator, module loading will fail
[    0.000000] RCU Tasks Trace: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] Kernel panic - not syncing: No interrupt controller found.
Last Edited September 14, 2024