The Hidden Complexity of the Simplest C Program

Posted Jun 10, 2026 Updated Jun 13, 2026

By Aditya Kumar

When you write the absolute simplest C program—one that does nothing but exit successfully—you might expect the compiled output to be trivial.

  
int main() { 
    return 0; 
}

However, executing this program involves a complex sequence of steps coordinated between the compiler, the linker, the C runtime (CRT), and the operating system’s loader. Let’s peel back the abstraction layers to understand exactly what happens before and after main executes.

graph TD
    Kernel["OS Kernel / execve"] -->|"Loads ELF & Sets Stack"| Start["_start"]
    Start -->|"Passes argc/argv"| LibcStart["__libc_start_main"]
    LibcStart -->|"Initializes TLS/Constructors"| Main["main"]
    Main -->|"returns 0"| LibcStart
    LibcStart -->|"Passes 0"| Exit["exit"]
    Exit -->|"Calls Destructors & sys_exit"| Kernel2["Kernel terminates process"]

1. The True Entry Point: `_start`

Contrary to popular belief, main is not the first thing executed when you run a C program.

When the linker stitches your program together, it includes startup code provided by the C standard library (e.g., crt1.o in glibc). This object file defines a symbol named _start¹.

The linker designates this symbol as the entry point through its default linker script, which contains a directive like ENTRY(_start). During the final link, it resolves the virtual address of the _start symbol and writes it into the e_entry field of the resulting ELF file's header (the Elf64_Ehdr structure). You can inspect this yourself by running readelf -h a.out | grep "Entry point address".

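When you run the binary, the kernel's execve implementation parses the ELF header and maps the file into memory. For a statically linked, non-PIE executable it then sets the CPU's instruction pointer (%rip on x86-64) to the e_entry address. For a position-independent executable the randomized load base is added to e_entry, and for a dynamically linked program the dynamic loader runs first and jumps to that address once it has finished (see section 3). Either way, your program's own code begins executing at _start, not main.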

The _start routine itself is written in pure assembly because it has to deal with the raw state of the machine exactly as the kernel left it.

2. Setting Up the Stack Pointer and Environment

Before _start is executed, the OS kernel (via the execve system call) has already set up the initial execution environment.

When the kernel maps your program into memory, it populates the top of the stack with crucial data. In simplified form, the layout looks like this (the NULL above the auxiliary vector stands for its terminating AT_NULL entry):

+-------------------------+ High Addresses
| Environment Strings     |
| Argument Strings        |
+-------------------------+
| NULL                    |
| Auxiliary Vector (Elf64)|
| NULL                    |
| envp pointers           |
| NULL                    |
| argv pointers           |
| argc                    | <-- %rsp points here
+-------------------------+ Low Addresses

When _start wakes up, the stack pointer (%rsp) points directly to argc. The kernel didn’t execute user-space instructions to do this; instead, during execve, it manually constructed this stack layout in memory and set the initial CPU %rsp register to point at its top before transitioning to user mode.

If we were to write our own raw, freestanding _start routine (bypassing __libc_start_main entirely) that calls main and then traps into the OS to exit, it would look roughly like this in GNU assembler Intel syntax (built with -nostartfiles so that the regular crt1.o is left out):

  
.intel_syntax noprefix
.global _start
.text

_start:
    # 1. Mark the deepest stack frame
    xor ebp, ebp    # Clear the frame pointer

    # 2. Extract argc and argv from the stack
    pop rdi         # Pop the top of the stack into rdi. This is 'argc'.
    mov rsi, rsp    # Since argc was popped, rsp now points directly at 'argv'.
                    # Move this address into rsi.

    # 3. Stack alignment
    # The System V ABI requires a 16-byte aligned stack at every call instruction.
    and rsp, -16    # Clear the lowest 4 bits (mask 0xfffffffffffffff0)

    # 4. Call our C function
    call main       # main(argc, argv)
                    # The return value of main is left in eax

    # 5. Trap into the kernel to exit
    mov edi, eax    # Move main's return value into edi (1st syscall argument)
    mov eax, 60     # Syscall number 60 is 'exit' on x86-64 Linux
    syscall         # Trap into the kernel to tear down the process

    # 6. Failsafe (the kernel should never return here)
    hlt             # Privileged instruction: faults if ever reached in user mode

In a standard C program, instead of calling main and sys_exit directly like our raw assembly above, _start passes argc, argv, and the environment pointers to __libc_start_main².

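Here is a simplified pseudo-C sketch of what __libc_start_main does under the hood (the real function does considerably more, and the details vary between glibc versions):

  
int __libc_start_main(
    int (*main) (int, char**, char**),
    int argc,
    char **argv,
    void (*init) (void),
    void (*fini) (void),
    void (*rtld_fini) (void),   // Dynamic loader's cleanup hook, passed in by _start
    void *stack_end
) {
    // 1. Set up Thread Local Storage (TLS) and the stack protector canary
    __pthread_initialize_minimal();

    // 2. Register cleanup hooks so that exit() runs them later
    __cxa_atexit((void (*)(void *)) rtld_fini, NULL, NULL);
    __cxa_atexit((void (*)(void *)) fini, NULL, NULL); // Global destructors

    // 3. Call global constructors (.init_array)
    init();

    // 4. Call the user's main function
    int result = main(argc, argv, __environ);

    // 5. Pass the return code to exit(), which never returns
    exit(result);
}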

That glibc function initializes the C library environment, sets up thread-local storage, calls global constructors, and then finally calls your main function before routing its return value to exit().

Static vs. Dynamic Linking

Before the loader even enters the picture, it's important to distinguish how the program was linked:

Statically Linked: The linker copies crt1.o and every object from the C library archive (libc.a) that the program needs directly into your single binary. The resulting executable is larger but self-contained: it doesn't rely on the dynamic loader or on shared libraries being installed. The kernel jumps straight to _start.
Dynamically Linked (Default): The linker records references to libc.so instead of copying its code. The binary is tiny, but it cannot run on its own. The kernel must map a dynamic loader into memory alongside your program, and that loader resolves the references before _start is executed.

3. How the Loader Steps In

Before your binary's code even runs, the program must be loaded into memory. When you execute the program, the kernel reads the ELF header and the program headers. If the program is dynamically linked (which it is by default on modern systems), the kernel finds a PT_INTERP program header (listed as INTERP by readelf -l).

This segment specifies the dynamic linker/loader (often /lib64/ld-linux-x86-64.so.2). The kernel maps both your program and the dynamic loader into memory, but it passes control to the loader first.

The loader:

Resolves dynamic symbols (like functions from libc.so).
Performs relocations.
Finally, jumps to the _start address of your executable.

4. Handling the Return Code

Once __libc_start_main has initialized threading, global constructors, and security cookies, it finally calls your main(argc, argv, envp) function.

Your main function executes its single statement: return 0;. In assembly, this amounts to placing 0 in the %eax register (optimizing compilers typically do so with xor) and issuing a ret instruction.

  
main:
    xor eax, eax  # Set return value to 0
    ret           # Return to __libc_start_main

When main returns, control flows back to __libc_start_main. The C library captures the value left in %eax (which is 0) and uses it to determine the process exit code.

5. The Final Act: `exit()`

__libc_start_main takes the return value from main and immediately passes it to the exit() function.

You might think returning 0 immediately kills the process, but exit() has a lot of housekeeping to do:

It walks through the list of functions registered via atexit() and on_exit() and calls them in reverse order.
It calls global destructors (for C++ objects or C functions marked with __attribute__((destructor))).
It flushes and closes all open standard I/O streams (like stdout).

Finally, exit() calls _exit(), which on Linux issues the exit_group system call to trap into the kernel. The kernel then reclaims the memory, closes file descriptors, and notifies the parent process that the program has terminated with status 0.

6. PIC vs. Non-PIC Behavior

The complexity of this sequence depends heavily on whether the program is compiled as Position Independent Code (PIC).

Non-PIC Executables

In the old days, executables were linked to a fixed absolute memory address (often 0x400000 on x86-64). The compiler could hardcode absolute memory addresses for function calls and global variables. The loader’s job was simple: map the file to that exact address and run it.

PIC and PIE (Position Independent Executable)

Most modern Linux distributions configure their compilers to build programs as PIE by default for security (so that ASLR, Address Space Layout Randomization, also covers the executable itself). Because the binary can be loaded at any random memory address, the compiler cannot use absolute addresses.

Instead:

The compiler uses RIP-relative addressing (addressing data relative to the current instruction pointer).
To call external functions (like those in libc), it uses the PLT (Procedure Linkage Table) and GOT (Global Offset Table).

When a dynamically linked program is loaded, the dynamic linker must fill in the GOT so that indirect jumps to shared library functions point to the correct randomized addresses, and for a PIE it must also apply relocations to the executable's own data. This work happens inside the loader before _start even begins (or on the first call, with lazy binding), making our simple return 0; program dependent on a complex dynamic linking mechanism.

7. Cross-Platform Considerations

While this post focuses heavily on Linux and x86-64, the fundamental concepts remain similar across operating systems and architectures, though the specific implementations differ significantly:

Windows: Instead of ELF, Windows uses the PE (Portable Executable) format. The entry point is typically mainCRTStartup or WinMainCRTStartup (provided by the MSVC CRT). Loading is split between the kernel, which maps the image, and the user-mode loader in ntdll.dll, which resolves DLL imports via the Import Address Table (IAT), the closest Windows equivalent of the GOT.
macOS: Apple platforms use the Mach-O binary format and the dyld dynamic linker. The entry point structure is similar, but dyld operates quite differently from Linux's ld.so, especially with recent optimizations like dyld3 closure caches.
ARM64 (AArch64): On Linux, the kernel builds the same initial stack on AArch64 as on x86-64, with argc, argv, envp, and the auxiliary vector sitting at the stack pointer. What differs is the instruction sequence in _start: there is no pop instruction, so argc is loaded from [sp], and the arguments for __libc_start_main go into registers x0 onward as the AArch64 procedure call standard requires. System calls use svc #0 with the number in x8 rather than syscall with the number in rax.

8. Beyond C: The Pre-Main of Objective-C and Rust

The concepts we’ve explored apply to C, but modern languages build their own runtime environments on top of this foundation before passing control to the developer’s code.

Objective-C: The Runtime Initialization

In Objective-C (commonly used on Apple platforms), the main function is not the true beginning of the application’s logic. Before main is ever reached, the dynamic linker (dyld) maps the executable and its libraries into memory and begins invoking initialization routines.

Crucially, dyld initializes the Objective-C runtime. During this phase, the runtime discovers the classes and categories in the binary and calls +load on each one that implements it³. It wires up the class hierarchy, registers method selectors, and allocates necessary runtime data structures. Only after this extensive implicit setup finishes does control pass to the standard C main function, which typically just delegates execution to UIApplicationMain or NSApplicationMain to start the UI event loop.

Rust: Bridging `fn main()` to `int main()`

In C, the contract with the OS is clear: main returns an int. But in Rust, the standard entry point is fn main(), which returns the unit type () (essentially nothing) or a Result. So how does the OS get its integer exit code?

When you compile a Rust binary, the Rust compiler (rustc) automatically generates a hidden C-ABI compliant main(int argc, char **argv) function. This generated main acts as a trampoline. It immediately calls an internal Rust runtime function—specifically std::rt::lang_start⁴.

The lang_start routine is responsible for:

Setting up stack overflow guards.
Initializing the thread-local storage for the main thread.
Setting up the global panic handler.

Once the environment is secure, lang_start invokes your actual Rust fn main(). When your main finishes, the runtime catches any panics, determines the correct termination code (e.g., 0 for success or non-zero if a Result::Err was returned), and passes that integer back to the generated C main. The C main then returns this integer to __libc_start_main, completely satisfying the operating system’s standard C ABI expectations while allowing the developer to write safe, idiomatic Rust.

[!NOTE] Do you think Rust’s approach of having another layer of fn main provides any advantages?

9. Executing a System Call with `puts("Hello, World!")`

If we graduate from return 0; to actually printing something, we usually add printf("Hello, World!\n");. Interestingly, modern compilers (like GCC and Clang) will optimize a simple constant printf ending in a newline directly into a call to puts("Hello, World!").

The execution of that puts call requires dynamic resolution. Since puts lives in the shared C library (libc.so), the compiler doesn’t know its memory address at compile time.

Here is the exact sequence of events when puts is called in a dynamically linked PIE binary:

The PLT Stub: The call puts instruction in your code actually jumps to a small piece of trampoline code in the PLT (Procedure Linkage Table).
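Lazy Binding: By default, the glibc dynamic linker uses “lazy binding” (unless the program was linked with -z now or is run with LD_BIND_NOW set, and some distributions enable immediate binding by default). The very first time puts is called, its address isn’t in the GOT (Global Offset Table) yet. Instead, the GOT entry points right back into the PLT stub.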
The Dynamic Linker Resolver: The PLT pushes a relocation index onto the stack and jumps to the dynamic linker’s resolver function (e.g., _dl_runtime_resolve). The dynamic linker looks up the true memory address of puts in the loaded libc.so memory space and dynamically overwrites the GOT entry with that exact address.
Execution: The resolver then transfers control to the actual puts function. On all subsequent calls to puts, the PLT stub jumps directly to the function via the now-populated GOT entry, bypassing the resolver entirely.

graph LR
    subgraph Before_Lazy_Binding ["Before Lazy Binding"]
        PLT1["PLT Entry for puts"] --> GOT1["GOT Entry: Points back to PLT"]
        GOT1 -.-> Linker["Dynamic Linker Resolver"]
    end

    subgraph After_Lazy_Binding ["After Lazy Binding"]
        PLT2["PLT Entry for puts"] --> GOT2["GOT Entry: Points directly to libc"]
        GOT2 --> LIBC["libc.so: puts()"]
    end

The System Call: Deep inside libc, puts handles appending a newline to your string and eventually invokes the write system call (syscall number 1 on x86-64), targeting File Descriptor 1 (stdout).

  
    # A raw representation of the underlying write syscall
    mov rdi, 1                  # File descriptor 1 (stdout)
    lea rsi, [rip + string_ptr] # Pointer to "Hello, World!\n"
    mov rdx, 14                 # Length of the string in bytes
    mov eax, 1                  # write syscall number
    syscall                     # Trap to the kernel to push characters to the terminal

Only after the kernel has accepted the bytes and handed them to your terminal does control return to user space, where your program can finally hit that return 0;.

10. Architectural Challenges in Heterogeneous Systems

Armed with the knowledge of how loaders, stack initialization, and ABI constraints operate on a single machine, how would you design the execution flow for a “Hello, World!” program in a heterogeneous system—where a host processor is responsible for bootstrapping and launching the binary on an entirely different target architecture?

Conclusion

The next time you compile an empty main function or print a simple greeting, take a moment to appreciate the extensive software stack—the compiler, the linker, the CRT, the dynamic loader, and the kernel—all working in close coordination under the hood.

Acknowledgements

I would like to dedicate this post to Elliott Hughes and Reid Tatge, as I learned most of these systems-level intricacies from/because-of them.

References

For those who want to dive directly into the source code to see how these abstractions are implemented in the real world, here are some excellent starting points:

Disclaimer: This article was generated using the Gemini 3.1 Pro model.

¹ glibc sysdeps/x86_64/start.S: The actual assembly implementation of _start for x86-64 in the GNU C Library.
² glibc csu/libc-start.c: The C source for __libc_start_main, illustrating how the CRT sets up thread-local storage, constructors, and invokes your main function.
³ Apple Objective-C Runtime: The open-source implementations of Apple’s Objective-C runtime, showcasing how dyld triggers +load methods during early process initialization.
⁴ Rust std::rt Module: The Rust standard library source code detailing how lang_start bridges the C-ABI main to the Rust fn main() and configures panic handlers.

Compilers, Systems

This post is licensed under CC BY 4.0 by the author.