Debugging -Wframe-larger-than=

Unless work is done per architecture to implement HAVE_ARCH_VMAP_STACK / CONFIG_VMAP_STACK, the Linux kernel defaults to two pages worth of stack per thread.

Note: on many contemporary systems the page size is 4KiB, but this is actually configurable for many architectures. The trade offs probably require a separate post. If you see code that checks for alignment via bitwise tricks like addr & 4095 == 0 without checking sysconf(_SC_PAGESIZE) it is perhaps a red flag for code that might be to reused on different systems.

As a first line of defense against overflowing a thread’s kernel stack, we enable -Wframe-larger-than= with a value based on CONFIG_FRAME_WARN (commonly 1024). This doesn’t guarantee we won’t recurse enough at runtime to overflow the stack. Defending against that is akin to solving the Halting Problem unless you want to go to the extreme length of MISRA C’s guidance that “functions should not call themselves, either directly or indirectly.”

It does help us frequently find large structs that were stack allocated and probably should have been heap (or sometimes statically) allocated. But the ergonomics of this warning have room for improvement.

Let’s say one day the build is failing because someone has introduced a new instance of -Wframe-larger-than=. In the logs you see:

<source>:8:6: warning: stack frame size (4104) exceeds limit (1024) in 'foo' [-Wframe-larger-than]
void foo (void) {
     ^

So you go and look at source and see:

void foo (void) {
    bar();
    baz();
    quux();
}

See any large local variables there? Thus begins the goose chase to understand what inlining decisions led to foo having a large stack frame. Or how about this?

void baz (void) {
    struct widget;
    ...

is a struct widget too large to be putting on the stack? Let’s look at the definition:

struct widget {
    struct gadget gadget;
    struct trombone trombone;
    long data [42];
    ...

What’s the sizeof struct widget? Can you do that calculation in your head quickly? What if the definitions of those structs are in other headers? More goose chasing and perhaps an argument in favor of an IDE.

DWARF has this information (if/when it’s produced) but we don’t have really great ways to visualize this information. DWARF is a Jack of All Trades, it’s good at many things, but kind of great at none and so gets easily dunked on leading to distinct unwind formats and type formats being used in the kernel.

While commiserating about this conundrum with my colleague, Paul Kirth, I mentioned that I really wished there was tooling that would “just break out the crayons and draw me a picture of the stack [usage of a given function].” Well, Paul must have been more upset than I was, because he implemented some really nice optimization remarks to help.

Starting with clang-16, you should be able to use -Rpass-analysis=stack-frame-layout to get a drawing of your stack layout.

Let’s see if we can better debug an instance of -Wframe-larger-than= affecting the Linux kernel. Looking at logs from the CI from last night, I see a case:

Warning: /builds/linux/fs/jffs2/xattr.c:775:6: warning: stack frame size (1216) exceeds limit (1024) in 'jffs2_build_xattr_subsystem' [-Wframe-larger-than]
void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
     ^

Looks like I can reproduce this locally:

$ wget https://src.fedoraproject.org/rpms/kernel/raw/rawhide/f/kernel-aarch64-fedora.config -O .config
$ make -s LLVM=1 ARCH=arm64 -j128 olddefconfig fs/jffs2/xattr.o
fs/jffs2/xattr.c:775:6: warning: stack frame size (1216) exceeds limit (1024) in 'jffs2_build_xattr_subsystem' [-Wframe-larger-than]
void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
     ^
119/1216 (9.79%) spills, 1097/1216 (90.21%) variables
1 warning generated.

Paul also recently added that tidbit (that was cutoff from CI logs) about the relative number of spills vs variables. That can help you quickly get a sense how many stack slots are variables’ storage vs spills from excessive register pressure (too many live values).

If we simply add -Rpass-analysis=stack-frame-layout, we’re going to get a beautiful ASCII table for every function in a given TU. We can use the flag -mllvm -fiter-print-funcs= to reduce the number of optimization remarks emitted. For the kernel, that might look like:

make -s LLVM=1 ARCH=arm64 -j128 fs/jffs2/xattr.o KCFLAGS="-Rpass-analysis=stack-frame-layout -mllvm -filter-print-funcs=jffs2_build_xattr_subsystem" 
fs/jffs2/xattr.c:775:6: warning: stack frame size (1216) exceeds limit (1024) in 'jffs2_build_xattr_subsystem' [-Wframe-larger-than]
void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
     ^
119/1216 (9.79%) spills, 1097/1216 (90.21%) variables
fs/jffs2/xattr.c:776:1: remark: 
Function: jffs2_build_xattr_subsystem
Offset: [SP-8], Type: Spill, Align: 8, Size: 8
Offset: [SP-16], Type: Spill, Align: 8, Size: 8
Offset: [SP-24], Type: Spill, Align: 8, Size: 8
Offset: [SP-32], Type: Spill, Align: 8, Size: 8
Offset: [SP-40], Type: Spill, Align: 8, Size: 8
Offset: [SP-48], Type: Spill, Align: 8, Size: 8
Offset: [SP-56], Type: Spill, Align: 8, Size: 8
Offset: [SP-64], Type: Spill, Align: 8, Size: 8
Offset: [SP-72], Type: Spill, Align: 8, Size: 8
Offset: [SP-80], Type: Spill, Align: 8, Size: 8
Offset: [SP-88], Type: Spill, Align: 8, Size: 8
Offset: [SP-96], Type: Spill, Align: 8, Size: 8
Offset: [SP-104], Type: Variable, Align: 8, Size: 8
Offset: [SP-136], Type: Variable, Align: 8, Size: 28
    rr @ fs/jffs2/xattr.c:448
Offset: [SP-144], Type: Variable, Align: 8, Size: 8
Offset: [SP-1168], Type: Variable, Align: 8, Size: 1024
    xref_tmphash @ fs/jffs2/xattr.c:778
Offset: [SP-1176], Type: Spill, Align: 8, Size: 8
Offset: [SP-1180], Type: Spill, Align: 4, Size: 4
Offset: [SP-1192], Type: Spill, Align: 8, Size: 8
Offset: [SP-1196], Type: Spill, Align: 4, Size: 4 [-Rpass-analysis=stack-frame-layout]
{
^
1 warning generated.

(This kernel config had debug info enabled. Without this, lines above printing variable and line number for rr and xref_tmphash would be omitted).

Spills can be occupied by different variables at different points for the program counter (this is how DWARF encodes DW_AT_location). Not sure about the Variable slots at Offsets [SP-104] and [SP-144] yet, maybe there’s more to fix in this nascent analysis, but the sizes show those aren’t the droids… err… stack slots that I’m looking for.

So right off the bat, if I’m setting -Wframe-larger-than=1024 and xref_tmphash is 1024B, that’s a problem. fs/jffs2/xattr.c:778 corresponds to this statement.

#define XREF_TMPHASH_SIZE	(128)
...
struct jffs2_xattr_ref *xref_tmphash[XREF_TMPHASH_SIZE];

It doesn’t matter what the sizeof struct jffs2_xattr_ref since xref_tmphash is a 128 element array of pointers to such struct. 128 times 8 (pointers are 64b on the kernel’s aarch64 target) is 1024.

Let’s see if anyone has tried to fix this. Yep, looking at lore I see:

fix sent in 2016
- rejected
another fix sent in 2017 (looks like it never received feedback)
report from my co-maintainer Nathan in 2021
report from bot in 2021
report to stable last year

The patch from 2017 LGTM, and perhaps has fallen through the cracks and just needs to be pinged/reviewed, which I’ve done. Hopefully it can get picked up.

-Wframe-larger-than= issues in the Linux kernel are numerous (at least with clang builds for now). They can be a bit of work to track down. One of our oldest open issues still on the TODO list to fix is -fsanitize=kernel-address (via CONFIG_KASAN=y) leading to excessive stack usage (at least when compared to -fsanitize=address aka ASAN). Another TODO seems to be related to passing structs by value. My hope is these optimization remarks will help us differentiate between compiler bugs vs kernel source bugs quicker.

Whether or not an optimization remark is the most ergonomic tooling is hopefully still up for debate, though the current implementation was an iteration based on compromise. Requiring debug info for better diagnostics trades compile time in clang, which is also unfortunate.

For now though, I’m happy to celebrate improved tooling in this regard. Cheers Paul!

That’s pretty much the end of the post. Some other tools I’ve used in this area if you’re still interested:

llvm-dwarfdump or GNU objdump --dwarf=info can print the DWARF stream. I had written a python script to attempt to decode this information. It’s manually tested, incomplete, and quite buggy. The library I depend on for decoding DWARF and ELF doesn’t support all of the architectures the Linux kernel does, which has led to issues being unable to debug these warnings for specific architectures. A tool failure when things are actively on fire is stressful.

$ frame_larger_than.py arch/x86/kernel/kvm.o kvm_send_ipi_mask_allbutself
kvm_send_ipi_mask_allbutself:
        1024    struct cpumask          new_mask
        4       unsigned int            this_cpu
        8       const struct cpumask*   local_mask
        4       int                     pscr_ret__
        4       int                     pfo_ret__
cpumask_copy:
bitmap_copy:
        4       unsigned int            len
        4       unsigned int            len
cpumask_clear_cpu:
clear_bit:
arch_clear_bit:

“Poke-a-hole” aka pahole can print the size of structs. Check out the LWN article.

# Make sure you've built with debug info enabled!
$ pahole fs/jffs2/xattr.o
...
struct jffs2_xattr_ref {
        void *                     always_null;          /*     0     8 */
        struct jffs2_raw_node_ref * node;                /*     8     8 */
        uint8_t                    class;                /*    16     1 */
        uint8_t                    flags;                /*    17     1 */
        u16                        unused;               /*    18     2 */
        uint32_t                   xseqno;               /*    20     4 */
        union {
                struct jffs2_inode_cache * ic;           /*    24     8 */
                uint32_t           ino;                  /*    24     4 */
        };                                               /*    24     8 */
        union {
                struct jffs2_inode_cache * ic;                   /*     0     8 */
                uint32_t                   ino;                  /*     0     4 */
        };

        union {
                struct jffs2_xattr_datum * xd;           /*    32     8 */
                uint32_t           xid;                  /*    32     4 */
        };                                               /*    32     8 */
        union {
                struct jffs2_xattr_datum * xd;                   /*     0     8 */
                uint32_t                   xid;                  /*     0     4 */
        };

        struct jffs2_xattr_ref *   next;                 /*    40     8 */

        /* size: 48, cachelines: 1, members: 9 */
        /* last cacheline: 48 bytes */
};

The kernel has a script (scripts/stackusage) that can print the estimated stack usage of each function in vmlinux to a file. Example:

$ ./scripts/stackusage LLVM=1 -j128 defconfig all
...
./scripts/stackusage: output written to /tmp/stackusage.3991505.3gRw
$ cat /tmp/stackusage.3991505.3gRw
arch/x86/entry/common.c:119:do_int80_syscall_32	24	dynamic
arch/x86/entry/common.c:138:__do_fast_syscall_32	40	dynamic
arch/x86/entry/common.c:186:do_fast_syscall_32	16	static
arch/x86/entry/common.c:238:do_SYSENTER_32	0	static
...

Finally, GCC currently supports (but clang currently does not) -fconserve-stack. From playing with it, it seems that this flag causes GCC to limit inlining if would increase the stack usage of a caller beyond what appears to be through experimentation an arch specific threshold.

Clang recently got -finline-max-stacksize=, which feels like a nuclear option to me. We haven’t deployed it yet in the kernel, but it might ultimately be necessary to use. We’ll see. I would hate to potentially cover up other issues that should perhaps be fixed first.

Nick Desaulniers

The enemy's gate is down

Debugging -Wframe-larger-than=