The Hidden Complexity of the Simplest C Program
When you write the absolute simplest C program—one that does nothing but exit successfully—you might expect the compiled output to be trivial.
1
2
3
int main() {
return 0;
}
However, executing this program involves a highly intricate dance coordinated between the compiler, the linker, the C runtime (CRT), and the operating system’s loader. Let’s peel back the abstraction layers to understand exactly what happens before and after main executes.
1. The True Entry Point: _start
Contrary to popular belief, main is not the first thing executed when you run a C program.
When the linker stitches your program together, it includes startup code provided by the C standard library (e.g., crt1.o in glibc). This object file defines a symbol named _start.
But exactly how does the linker mark this as the entry point?
Under the hood, the linker uses a default linker script (which contains a directive like ENTRY(_start)). During the final linking phase, it resolves the virtual memory address of the _start symbol and explicitly writes this address into the e_entry field of the resulting ELF file’s header (the Elf64_Ehdr structure). You can inspect this yourself by running readelf -h a.out | grep "Entry point address".
When the operating system (specifically the execve syscall) loads your binary into memory, it parses the ELF header, extracts the e_entry address, and sets the CPU’s instruction pointer (%rip on x86-64) to that exact address. Thus, execution officially begins at _start, not main.
The _start routine itself is written in pure assembly because it has to deal with the raw state of the machine exactly as the kernel left it.
2. Setting Up the Stack Pointer and Environment
Before _start is executed, the OS kernel (via the execve system call) has already set up the initial execution environment.
When the kernel maps your program into memory, it populates the top of the stack with crucial data:
argc(the number of arguments)argv(pointers to the argument strings)envp(pointers to environment variables)- The auxiliary vector (information passed from the kernel to the user-space loader)
When _start wakes up, the stack pointer (%rsp) points directly to argc. The kernel didn’t execute user-space instructions to do this; instead, during execve, it manually constructed this stack layout in memory and set the initial CPU %rsp register to point at its top before transitioning to user mode.
If we were to write our own raw, freestanding _start routine (bypassing __libc_start_main entirely) to manually call main and then trap to the OS to exit, it would look exactly like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
.global _start
.text
_start:
; 1. Mark the deepest stack frame
xor ebp, ebp ; Clear the frame pointer
; 2. Extract argc and argv from the stack
pop rdi ; Pop the top of the stack into %rdi. This is 'argc'.
mov rsi, rsp ; Since argc was popped, %rsp now points directly at 'argv'.
; Move this address into %rsi.
; 3. Stack Alignment
; The System V ABI requires a 16-byte aligned stack before function calls.
and rsp, -16 ; Mask the lowest 4 bits (equivalent to 0xfffffffffffffff0)
; 4. Call our C function
call main ; main(argc, argv)
; The return value of main is left in %eax
; 5. Trap into the kernel to gracefully exit
mov edi, eax ; Move main's return value into %edi (1st arg for syscall)
mov eax, 60 ; Syscall number 60 is 'sys_exit' on x86-64 Linux
syscall ; Trap into the kernel to tear down the process
; 6. Failsafe halt (the kernel should never return here)
hlt
In a standard C program, instead of calling main and sys_exit directly like our raw assembly above, _start passes argc, argv, and the environment pointers to __libc_start_main. That glibc function initializes the C library environment, sets up thread-local storage, calls global constructors, and then finally calls your main function before routing its return value to exit().
3. How the Loader Steps In
Before your binary’s code even runs, the program must be loaded into memory. When you execute the program, the kernel reads the ELF header. If the program is dynamically linked (which it is by default on modern systems), the kernel notices an INTERP segment.
This segment specifies the dynamic linker/loader (often /lib64/ld-linux-x86-64.so.2). The kernel maps both your program and the dynamic loader into memory, but it passes control to the loader first.
The loader:
- Resolves dynamic symbols (like functions from
libc.so). - Performs relocations.
- Finally, jumps to the
_startaddress of your executable.
4. Handling the Return Code
Once __libc_start_main has initialized threading, global constructors, and security cookies, it finally calls your main(argc, argv, envp) function.
Your main function executes its single instruction: return 0;. In assembly, this simply moves 0 into the %eax register and issues a ret instruction.
1
2
3
main:
xor eax, eax ; Set return value to 0
ret ; Return to __libc_start_main
When main returns, control flows back to __libc_start_main. The C library captures the value left in %eax (which is 0) and uses it to determine the process exit code.
5. The Final Act: exit()
__libc_start_main takes the return value from main and immediately passes it to the exit() function.
You might think returning 0 immediately kills the process, but exit() has a lot of housekeeping to do:
- It walks through the list of functions registered via
atexit()andon_exit()and calls them in reverse order. - It calls global destructors (for C++ objects or C functions marked with
__attribute__((destructor))). - It flushes and closes all open standard I/O streams (like
stdout).
Finally, exit() invokes the _exit() system call (specifically, exit_group on Linux) to trap into the kernel. The kernel then reclaims the memory, closes file descriptors, and notifies the parent process that the program has terminated with status 0.
6. PIC vs. Non-PIC Behavior
The complexity of this sequence depends heavily on whether the program is compiled as Position Independent Code (PIC).
Non-PIC Executables
In the old days, executables were linked to a fixed absolute memory address (often 0x400000 on x86-64). The compiler could hardcode absolute memory addresses for function calls and global variables. The loader’s job was simple: map the file to that exact address and run it.
PIC and PIE (Position Independent Executable)
Modern compilers build programs as PIC/PIE by default for security (to enable ASLR - Address Space Layout Randomization). Because the binary can be loaded at any random memory address, the compiler cannot use absolute addresses.
Instead:
- The compiler uses RIP-relative addressing (addressing data relative to the current instruction pointer).
- To call external functions (like those in
libc), it uses the PLT (Procedure Linkage Table) and GOT (Global Offset Table).
When your PIE program is loaded, the dynamic linker must fix up the GOT so that indirect jumps to shared library functions point to the correct randomized addresses. This adds significant overhead to the loader’s execution before _start even begins, making our simple return 0; program dependent on a highly sophisticated dynamic linking mechanism.
7. What if we actually do something? puts("Hello, World!")
If we graduate from return 0; to actually printing something, we usually add printf("Hello, World!\n");. Interestingly, modern compilers (like GCC and Clang) will optimize a simple constant printf ending in a newline directly into a call to puts("Hello, World!").
But how does that puts call actually execute? Since puts lives in the shared C library (libc.so), the compiler doesn’t know its memory address at compile time.
Here is the exact sequence of events when puts is called in a dynamically linked PIE binary:
- The PLT Stub: The
call putsinstruction in your code actually jumps to a small piece of trampoline code in the PLT (Procedure Linkage Table). - Lazy Binding: By default, Linux uses “lazy binding” (unless compiled with
-z now). The very first timeputsis called, its address isn’t in the GOT (Global Offset Table) yet. Instead, the GOT entry points right back into the PLT stub. - The Dynamic Linker Resolver: The PLT pushes a relocation index onto the stack and jumps to the dynamic linker’s resolver function (e.g.,
_dl_runtime_resolve). The dynamic linker looks up the true memory address ofputsin the loadedlibc.somemory space and dynamically overwrites the GOT entry with that exact address. - Execution: The resolver then transfers control to the actual
putsfunction. On all subsequent calls toputs, the PLT stub jumps directly to the function via the now-populated GOT entry, bypassing the resolver entirely. - The System Call: Deep inside
libc,putshandles appending a newline to your string and eventually invokes thewritesystem call (syscall number1on x86-64), targeting File Descriptor1(stdout).
1
2
3
4
5
6
; A raw representation of the underlying write syscall
mov rdi, 1 ; File descriptor 1 (stdout)
lea rsi, [rip + string_ptr] ; Pointer to "Hello, World!\n"
mov rdx, 14 ; Length of the string
mov eax, 1 ; sys_write syscall number
syscall ; Trap to the kernel to push characters to the terminal
Only after the kernel handles the character buffer and prints it to your terminal does control return to user-space, where your program can finally hit that return 0;.
References
For those who want to dive directly into the source code to see how these abstractions are implemented in the real world, here are some excellent starting points:
- glibc
sysdeps/x86_64/start.S: The actual assembly implementation of_startfor x86-64 in the GNU C Library. (Source Code) - glibc
csu/libc-start.c: The C source for__libc_start_main, illustrating how the CRT sets up thread-local storage, constructors, and invokes yourmainfunction. (Source Code) - System V Application Binary Interface (x86-64): The definitive specification detailing exactly how the stack, registers, and memory must be formatted by the kernel before
_startexecutes. (GitLab) - Linux Syscall Reference: A quick reference for mapping x86-64 system call numbers (like
1forsys_writeand60forsys_exit). (Table)
The next time you compile an empty main function or print a simple greeting, take a moment to appreciate the monumental software stack—the compiler, the linker, the CRT, the dynamic loader, and the kernel—all working in perfect harmony under the hood.
Food for Thought
With everything you’ve learned here about loaders, stack initialization, and ABI constraints—how would you now go about running a simple “Hello, World!” in a heterogeneous system, where a host machine is solely responsible for launching the program on a completely different target architecture?
Acknowledgements
I would like to dedicate this post to Elliott Hughes and Reid Tatge, as I learned most of these deep systems-level intricacies from/because-of them.