Nick Desaulniers

The enemy's gate is down

Static and Dynamic Libraries

| Comments

This is the second post in a series on memory segmentation. It covers working with static and dynamic libraries in Linux and OSX. Make sure to check out the first on object files and symbols.

Let’s say we wanted to reuse some of the code from our previous project in our next one. We could continue to copy around object files, but let’s say we have a bunch and it’s hard to keep track of all of them. Let’s combine multiple object files into an archive or static library. Similar to a more conventional zip file or “compressed archive,” our static library will be an uncompressed archive.

We can use the ar command to create and manipulate a static archive.

1
2
$ clang -c x.c y.c
$ ar -rv libhello.a x.o y.o

The -r flag will create the archive named libhello.a and add the files x.o and y.o to its index. I like to add the -v flag for verbose output. Then we can use the familiar nm tool I introduced in the previous post to examine the content of the archives and their symbols.

1
2
3
4
5
6
7
8
9
10
11
$ file libhello.a
libhello.a: current ar archive random library
$ nm libhello.a
libhello.a(x.o):
                 U _puts
0000000000000000 T _x


libhello.a(y.o):
                 U _puts
0000000000000000 T _y

Some other useful flags for ar are -d to delete an object file, ex. ar -d libhello.a y.o and -u to update existing members of the archive when their source and object files are updated. Not only can we run nm on our archive, otool and objdump both work.

Now that we have our static library, we can statically link it to our program and see the resulting symbols. The .a suffix is typical on both OSX and Linux for archive files.

1
2
3
4
5
6
$ clang main.o libhello.a
$ nm a.out
0000000100000f30 T _main
                 U _puts
0000000100000f50 T _x
0000000100000f70 T _y

Our compiler understands how to index into archive files and pull out the functions it needs to combine into the final executable. If we use a static library to statically link all functions required, we can have one binary with no dependencies. This can make deployment of binaries simple, but also greatly increase their size. Upgrading large binaries incrementally becomes more costly in terms of space.

While static libraries allowed us to reuse source code, static linkage does not allow us to reuse memory for executable code between different processes. I really want to put off talking about memory benefits until the next post, but know that the solution to this problem lies in “dynamic libraries.”

While having a single binary file keeps things simple, it can really hamper memory sharing and incremental relinking. For example, if you have multiple executables that are all built with the same static library, unless your OS is really smart about copy-on-write page sharing, then you’re likely loading multiple copies of the same exact code into memory! What a waste! Also, when you want to rebuild or update your binary, you spend time performing relocation again and again with static libraries. What if we could set aside object files that we could share amongst multiple instances of the same or even different processes, and perform relocation at runtime?

The solution is known as dynamic libraries. If static libraries and static linkage were Atari controllers, dynamic libraries and dynamic linkage are Steel Battalion controllers. We’ll show how to work with them in the rest of this post, but I’ll prove how memory is saved in a later post.

Let’s say we want to created a shared version of libhello. Dynamic libraries typically have different suffixes per OS since each OS has it’s preferred object file format. On Linux the .so suffix is common, .dylib on OSX, and .dll on Windows.

1
2
3
4
5
6
7
$ clang -shared -fpic x.c y.c -o libhello.dylib
$ file libhello.dylib
libhello.dylib: Mach-O 64-bit dynamically linked shared library x86_64
$ nm libhello.dylib
                 U _puts
0000000000000f50 T _x
0000000000000f70 T _y

The -shared flag tells the linker to create a special file called a shared library. The -fpic option converts absolute addresses to relative addresses, which allows for different processes to load the library at different virtual addresses and share memory.

Now that we have our shared library, let’s dynamically link it into our executable.

1
2
3
4
$ clang main.c libhello.dylib
$ ./a.out
x
y

The dynamic linker essential produces an incomplete binary. You can verify with nm. At runtime, we’ll delay start up to perform some memory mapping early on in the process start (performed by the dynamic linker) and pay slight costs for trampolining into position independent code.

Let’s say we want to know what dynamic libraries a binary is using. You can either query the executable (most executable object file formats contain a header the dynamic linker will parse and pull in libs) or observe the executable while running it. Because each major OS has its own object file format, they each have their own tools for these two checks. Note that statically linked libraries won’t show up here, since their object code has already been linked in and thus we’re not able to differentiate between object code that came from our first party code vs third party static libraries.

On OSX, we can use otool -L <bin> to check which .dylibs will get pulled in.

1
2
3
4
$ otool -L a.out
a.out:
           libhello.dylib (compatibility version 0.0.0, current version 0.0.0)
           /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1226.10.1)

So we can see that a.out depends on libhello.dylib (and expects to find it in the same directory as a.out). It also depends on shared library called libSystem.B.dylib. If you run otool -L on libSystem itself, you’ll see it depends on a bunch of other libraries including a C runtime, malloc implementation, pthreads implementation, and more. Let’s say you want to find the final resting place of where a symbol is defined, without digging with nm and otool, you can fire up your trusty debugger and ask it.

1
2
3
4
5
$ lldb a.out
...
(lldb) image lookup -r -s puts
...
        Summary: libsystem_c.dylib`puts        Address: libsystem_c.dylib[0x0000000000085c30] (libsystem_c.dylib.__TEXT.__stubs + 3216)

You’ll see a lot of output since puts is treated as a regex. You’re looking for the Summary line that has an address and is not a symbol stub. You can then check your work with otool and nm.

If we want to observe the dynamic linker in action on OSX, we can use dtruss:

1
2
3
4
5
6
7
8
9
10
11
$ sudo dtruss ./a.out
...
stat64("libhello.dylib\0", 0x7FFF50CEAC68, 0x1)         = 0 0
open("libhello.dylib\0", 0x0, 0x0)              = 3 0
...
mmap(0x10EF27000, 0x1000, 0x5, 0x12, 0x3, 0x0)          = 0x10EF27000 0
mmap(0x10EF28000, 0x1000, 0x3, 0x12, 0x3, 0x1000)               = 0x10EF28000 0
mmap(0x10EF29000, 0xC0, 0x1, 0x12, 0x3, 0x2000)         = 0x10EF29000 0
...
close(0x3)              = 0 0
...

On Linux, we can simply use ldd or readelf -d to query an executable for a list of its dynamic libraries.

1
2
3
4
5
6
7
8
9
10
11
12
13
$ clang -shared -fpic x.c y.c -o libhello.so
$ clang main.c libhello.so
$ ldd a.out
           linux-vdso.so.1 =>  (0x00007fff95d43000)
           libhello.so => not found
           libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcc98c5f000)
           /lib64/ld-linux-x86-64.so.2 (0x0000555993852000)
$ readelf -d a.out
Dynamic section at offset 0xe18 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libhello.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
...

We can then use strace to observe the dynamic linker in action on Linux:

1
2
3
4
5
6
7
8
$ LD_LIBRARY_PATH=. strace ./a.out
...
open("./libhello.so", O_RDONLY|O_CLOEXEC) = 3
...
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\5\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=8216, ...}) = 0
close(3)                                = 0
...

What’s this LD_LIBRARY_PATH thing? That’s shell syntax for setting an environmental variable just for the duration of that command (as opposed to exporting it so it stays set for multiple commands). As opposed to OSX’s dynamic linker, which was happy to look in the cwd for libhello.dylib, on Linux we must supply the cwd if the dynamic library we want to link in is not in the standard search path.

But what is the standard search path? Well, there’s another environmental variable we can set to see this, LD_DEBUG. For example, on OSX:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ LD_DEBUG=libs LD_LIBRARY_PATH=. ./a.out
     15828:        find library=libhello.so [0]; searching
     15828:         search path=./tls/x86_64:./tls:./x86_64:.             (LD_LIBRARY_PATH)
     15828:          trying file=./tls/x86_64/libhello.so
     15828:          trying file=./tls/libhello.so
     15828:          trying file=./x86_64/libhello.so
     15828:          trying file=./libhello.so
     15828:
     15828:        find library=libc.so.6 [0]; searching
     15828:         search path=./tls/x86_64:./tls:./x86_64:.             (LD_LIBRARY_PATH)
     15828:          trying file=./tls/x86_64/libc.so.6
     1earc:          trying file=./tls/libc.so.6
     15828:          trying file=./x86_64/libc.so.6
     15828:          trying file=./libc.so.6
     15828:         search cache=/etc/ld.so.cache
     15828:          trying file=/lib/x86_64-linux-gnu/libc.so.6
     15828:        calling init: /lib/x86_64-linux-gnu/libc.so.6
     15828:        calling init: ./libhello.so
     15828:        initialize program: ./a.out
     15828:        transferring control: ./a.out
x
y
     15828:        calling fini: ./a.out [0]
     15828:        calling fini: ./libhello.so [0]

LD_DEBUG is pretty useful. Try:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
$ LD_DEBUG=help ./a.out
Valid options for the LD_DEBUG environment variable are:


  libs        display library search paths
  reloc       display relocation processing
  files       display progress for input file
  symbols     display symbol table processing
  bindings    display information about symbol binding
  versions    display version dependencies
  scopes      display scope information
  all         all previous options combined
  statistics  display relocation statistics
  unused      determined unused DSOs
  help        display this help message and exit


To direct the debugging output into a file instead of standard output
a filename can be specified using the LD_DEBUG_OUTPUT environment variable.

For some cool stuff, I recommend checking out LD_DEBUG=symbols and LD_DEBUG=statistics.

Going back to LD_LIBRARY_PATH, usually libraries you create and want to reuse between projects go into /usr/local/lib and the headers into /usr/local/include. I think of the convention as:

1
2
3
4
5
6
7
8
9
$ tree -L 2 /usr/
/usr
├── bin # system installed binaries like nm, gcc
├── include # system installed headers like stdio.h
├── lib # system installed libraries, both static and dynamic
└── local
    ├── bin # user installed binaries like rustc
    ├── include # user installed headers
    └── lib # user installed

Unfortunately, it’s a loose convention that’s broken down over the years and things are scattered all over the place. You can also run into dependency and versioning issues, that I don’t want to get into here, by placing libraries here instead of keeping them in-tree or out-of-tree of the source code of a project. Just know when you see a library like libc.so.6 that the numeric suffix is a major version number that follows semantic versioning. For more information, you should read Michael Kerrisk’s excellent book The Linux Programming Interface. This post is based on his chapter’s 41 & 42 (but with more info on tooling and OSX).

If we were to place our libhello.so into /usr/local/lib (on Linux you need to then run sudo ldconfig) and move x.h and y.h to /usr/local/include, then we could then compile with:

1
$ clang main.c -lhello

Note that rather than give a full path to our library, we can use the -l flag followed by the name of our library with the lib prefix and .so suffix removed.

When working with shared libraries and external code, three flags I use pretty often:

1
2
3
* -l<libname to link, no lib prefix or file extension; ex: -lnanomsg to link libnanomsg.so>
* -L <path to search for lib if in non standard directory>
* -I <path to headers for that library, if in non standard directory>

For finding specific flags needed for compilation where dynamic linkage is required, a tool called pkg-config can be used for finding appropriate flags. I’ve had less than stellar experiences with the tool as it puts the onus on the library author to maintain the .pc files, and the user to have them installed in the right place that pkg-config looks. When they do exist and are installed properly, the tool works well:

1
2
3
4
$ sudo apt-get install libpng12-dev
$ pkg-config --libs --cflags libpng12
-I/usr/include/libpng12  -lpng12
$ clang program.c `!!`

Using another neat environmental variable, we can hook into the dynamic linkage process and inject our own shared libraries to be linked instead of the expected libraries. Let’s say libgood.so and libmalicous.so both define a symbol for a function (the same symbol name and function signature). We can get a binary that links in libgood.so’s function to instead call libmalicous.so’s version:

1
2
3
4
$ ./a.out
hello from libgood
$ LD_PRELOAD=./libmalicious.so ./a.out
hello from libmalicious

Manually invoking the dynamic linker from our code, we can even man in the middle library calls (call our hooked function first, then invoke the original target). We’ll see more of this in the next post on using the dynamic linker.

As you can guess, readjusting the search paths for dynamic libraries is a security concern as it let’s good and bad folks change the expected execution paths. Guarding against the use of these env vars becomes a rabbit hole that gets pretty tricky to solve without the heavy handed use of statically linking dependencies.

In the the previous post, I alluded to undefined symbols like puts. puts is part of libc, which is probably the most shared dynamic library on most computing devices since most every program makes use of the C runtime. (I think of a “runtime” as implicit code that runs in your program that you didn’t necessarily write yourself. Usually a runtime is provided as a library that gets implicitly linked into your executable when you compile.) You can statically link against libc with the -static flag, on Linux at least (OSX makes this difficult, “Apple does not support statically linked binaries on Mac OS X”).

I’m not sure what the benefit would be to mixing static and dynamic linking, but after searching the search paths from LD_DEBUG=libs for shared versions of a library, if any static ones are found, they will get linked in.

There’s also an interesting form of library called a “virtual dynamic shared object” on Linux. I haven’t covered memory mapping yet, but know it exists, is usually hidden for libc, and that you can read more about it via man 7 vdso.

One thing I find interesting and don’t quite understand how to recreate is that somehow glibc on Linux is also executable:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ /lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Ubuntu GLIBC 2.24-3ubuntu1) stable release version 2.24, by Roland McGrath et al.
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 6.2.0 20161005.
Available extensions:
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.

Also, note that linking against third party code has licensing implications (of course) of particular interest when it’s GPL or LGPL. Here is a good overview which I’d summarize as: code that statically links against LGPL code must also be LGPL, while any form of linkage against GPL code must be GPL’d.

Ok, that was a lot. In the previous post, we covered Object Files and Symbols. In this post we covered hacking around with static and dynamic linkage. In the next post, I hope to talk about manually invoking the dynamic linker at runtime.

Object Files and Symbols

| Comments

What was supposed to be one blog post about memory segmentation turned into what will be a series of posts. As the first in the series, we cover the extreme basics of object files and symbols. In follow up posts, I plan to talk about static libraries, dynamic libraries, dynamic linkage, memory segments, and finally memory usage accounting. I also cover command line tools for working with these notions, both in Linux and OSX.

A quick review of the compilation+execution pipeline (for terminology):

  1. Lexing produces tokens
  2. Parsing produces an abstract syntax tree
  3. Analysis produces a code flow graph
  4. Optimization produces a reduced code flow graph
  5. Code gen produces object code
  6. Linkage produces a complete executable
  7. Loader instructs the OS how to start running the executable

This series will focus on part #6.

Let’s say you have some amazing C/C++ code, but for separations of concerns, you want to start moving it out into separate source files. Whereas previously in one file you had:

1
2
3
4
5
6
7
8
// main.c
#include <stdio.h>
void helper () {
  puts("helper");
}
int main () {
  helper();
}

You now have two source files and maybe a header:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// main.c
#include "helper.h"
int main () {
  helper();
}

// helper.h
void helper();

//helper.c
#include <stdio.h>
#include "helper.h"
void helper () {
  puts("helper");
}

In the single source version, we would have compiled and linked that with clang main.c and had an executable file. In the multiple source version, we first compile our source files to object files, then link them altogether. That can be done separately:

1
2
3
$ clang -c helper.c     # produces helper.o
$ clang -c main.c       # produces main.o
$ clang main.o helper.o # produces a.out

We can also do the compilation and linkage in one step:

1
$ clang helper.c main.c # produces a.out

Nothing special thus far; C/C++ 101. In the first case of separate compilation and linkage steps, we were left with intermediate object files (.o). What exactly are these?

Object files are almost full executables. They contain machine code, but that code still requires a relocation step. It also contains metadata about the addresses of its variables and functions (called symbols) in an associative data structure called a symbol table. The addresses may not be the final address of the symbol in the final executable. They also contain some information for the loader and probably some other stuff.

Remember that if we fail to specify the helper object file, we’ll get an undefined symbol error.

1
2
3
4
5
6
$ clang main.c
Undefined symbols for architecture x86_64:
  "_helper", referenced from:
      _main in main-459dde.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

The problem is main.o refers to some symbol called helper, but on it’s own doesn’t contain any more information about it. Let’s say we want to know what symbols an object file contains, or expects to find elsewhere. Let’s introduce our first tool, nm. nm will print the name list or symbol table for a given object or executable file. On OSX, these are prefixed with an underscore.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ nm helper.o
0000000000000000 T _helper
                 U _puts

$ nm main.o
                 U _helper
0000000000000000 T _main

$ nm a.out
...
0000000100000f50 T _helper
0000000100000f70 T _main
                 U _puts
...

Let’s dissect what’s going on here. The output (as understood by man 1 nm) is a space separated list of address, type, and symbol name. We can see that the addresses are placeholders in object files, and final in executables. The name should make sense; it’s the name of the function or variable. While I’d love to get in depth on the various symbol types and talk about sections, I don’t think I could do as great a job as Peter Van Der Linden in his book “Expert C Programming: Deep C Secrets.”

For our case, we just care about whether the symbol in a given object file is defined or not. The type U (undefined) means that this symbol is referenced or used in this object code/executable, but it’s value wasn’t defined here. When we compiled main.c alone and got the undefined symbol error, it should now make sense why we got the undefined symbol error for helper. main.o contains a symbol for main, and references helper. helper.o contains a symbol for helper, and references to puts. The final executable contains symbols for main and helper and references to puts.

You might be wondering where puts comes from then, and why didn’t we get an undefined symbol error for puts like we did earlier for helper. The answer is the C runtime. libc is implicitly dynamically linked to all executables created by the C compiler. We’ll cover dynamic linkage in a later post in this series.

When the linker performs relocation on the object files, combining them into a final executable, it goes through placeholders of addresses and fills them in. We did this manually in our post on JIT compilers.

While nm gave us a look into our symbol table, two other tools I use frequently are objdump on Linux and otool on OSX. Both of these provide disassembled assembly instructions and their addresses. Note how the symbols for functions get translated into labels of the disassembled functions, and that their address points to the first instruction in that label. Since I’ve shown objdump numerous times in previous posts, here’s otool.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
$ otool -tV helper.o
helper.o:
(__TEXT,__text) section
_helper:
0000000000000000    pushq    %rbp
0000000000000001    movq    %rsp, %rbp
0000000000000004    subq    $0x10, %rsp
0000000000000008    leaq    0xe(%rip), %rdi         ## literal pool for: "helper"
000000000000000f    callq    _puts
0000000000000014    movl    %eax, -0x4(%rbp)
0000000000000017    addq    $0x10, %rsp
000000000000001b    popq    %rbp
000000000000001c    retq
$ otool -tV main.o
main.o:
(__TEXT,__text) section
_main:
0000000000000000    pushq    %rbp
0000000000000001    movq    %rsp, %rbp
0000000000000004    movb    $0x0, %al
0000000000000006    callq    _helper
000000000000000b    xorl    %eax, %eax
000000000000000d    popq    %rbp
000000000000000e    retq
$ otool -tV a.out
a.out:
(__TEXT,__text) section
_helper:
0000000100000f50    pushq    %rbp
0000000100000f51    movq    %rsp, %rbp
0000000100000f54    subq    $0x10, %rsp
0000000100000f58    leaq    0x43(%rip), %rdi        ## literal pool for: "helper"
0000000100000f5f    callq    0x100000f80             ## symbol stub for: _puts
0000000100000f64    movl    %eax, -0x4(%rbp)
0000000100000f67    addq    $0x10, %rsp
0000000100000f6b    popq    %rbp
0000000100000f6c    retq
0000000100000f6d    nop
0000000100000f6e    nop
0000000100000f6f    nop
_main:
0000000100000f70    pushq    %rbp
0000000100000f71    movq    %rsp, %rbp
0000000100000f74    movb    $0x0, %al
0000000100000f76    callq    _helper
0000000100000f7b    xorl    %eax, %eax
0000000100000f7d    popq    %rbp
0000000100000f7e    retq

readelf -s <object file> will give us a list of symbols on Linux. ELF is the file format used by the loader on Linux, while OSX uses Mach-O. Thus readelf and otool, respectively.

Also note that for static linkage, symbols need to be unique*, as they refer to memory locations to either read/write to in the case of variables or locations to jump to in the case of functions.

1
2
3
4
5
6
7
8
9
10
11
12
$ cat double_define.c
void a () {}
void a () {}
int main () {}
$ clang double_define.c
double_define.c:2:6: error: redefinition of 'a'
void a () {}
     ^
double_define.c:1:6: note: previous definition is here
void a () {}
     ^
1 error generated.

*: there’s a notion of weak symbols, and some special things for dynamic libraries we’ll see in a follow up post.

Languages like C++ that support function overloading (functions with the same name but different arguments, return types, namespaces, or class) must mangle their function names to make them unique.

Code like:

1
2
3
4
5
6
7
namespace util {
  class Widget {
    public:
      void doSomething (bool save);
      void doSomething (int n);
  };
}

Will produce symbols like:

1
2
3
4
5
$ clang class.cpp -std=c++11
$ nm a.out
0000000100000f70 T __ZN4util6Widget11doSomethingEb
0000000100000f60 T __ZN4util6Widget11doSomethingEi
...

Note: GNU nm on Linux distros will have a --demangle option:

1
2
3
4
5
$ nm --demangle a.out
...
00000000004006d0 T util::Widget::doSomething(bool)
00000000004006a0 T util::Widget::doSomething(int)
...

On OSX, we can pipe nm into c++filt:

1
2
3
4
$ nm a.out | c++filt
0000000100000f70 T util::Widget::doSomething(bool)
0000000100000f60 T util::Widget::doSomething(int)
...

Finally, if you don’t have an object file, but instead a backtrace that needs demangling, you can either invoke c++filt manually or use demangler.com.

Rust also mangles its function names. For FFI or interface with C functions, other languages usually have to look for or expose symbols in a manner suited to C, the lowest common denominator. C++ has extern "C" blocks and Rust has extern blocks.

We can use strip to remove symbols from a binary. This can slim down a binary at the cost of making stack traces unreadable. If you’re following along at home, try comparing the output from your disassembler and nm before and after running strip on the executable. Luckily, you can’t strip the symbols out of object files, otherwise they’d be useless as you’d no longer be able to link them.

If we compile with the -g flag, we can create a different kind of symbol; debug symbols. Depending on your compiler+host OS, you’ll get another file you can run through nm to see an entry per symbol. You’ll get more info by using dwarfdump on this file. Debug symbols will retain source information such as filename and line number for all symbols.

This post should have been a simple refresher of some of the basics of working with C code. Finding symbols to be placed into a final executable and relocating addresses are the main job of the linker, and will be the main theme of the posts in this series. Keep your eyes out for more in this series on memory segmentation.

Cross Compiling C/C++ for Android

| Comments

Let’s say you want to build a hello world command line application in C or C++ and run it on your Android phone. How would you go about it? It’s not super practical; apps visible and distributable to end users must use the framework (AFAIK), but for folks looking to get into developing on ARM it’s they likely have an ARM device in their pocket.

This post is for folks who typically invoke their compiler from the command line, either explicitly, from build scripts, or other forms of automation.

At work, when working on Android, we typically checkout the entire Android source code (which is huge), use lunch to configure a ton of environmental variables, then use Makefiles with lots of includes and special vars. We don’t want to spend the time and disk space checking out the Android source code just to have a working cross compiler. Luckily, the Android tools team has an excellent utility to grab a prebuilt cross compiler.

This assumes you’re building from a Linux host. Android is a distribution of Linux, which is much easier to target from a Linux host. At home, I’ll usually develop on my OSX machine, ssh’d into my headless Linux box. (iTerm2 and tmux both have exclusive features, but I currently prefer iTerm2.)

The first thing we want to do is fetch the Android NDK. Not the SDK, the NDK.

1
2
3
➜  ~ curl -O \
  http://dl.google.com/android/repository/android-ndk-r12b-linux-x86_64.zip
➜  ~ unzip android-ndk-r12b-linux-x86_64.zip

It would be helpful to install adb and fastboot, too. This might be different for your distro’s package manager. Better yet may be to just build from source.

1
➜  ~ sudo apt-get install android-tools-adb android-tools-fastboot

Now for you Android device that you want to target, you’ll want to know the ISA. Let’s say I want to target my Nexus 6P, which has an ARMv8-A ISA (the first 64b ARM ISA).

1
2
➜  ~ ./android-ndk-r12b/build/tools/make_standalone_toolchain.py --arch arm64 \
  --install-dir ~/arm

This will create a nice standalone bundle in ~/arm. It will contain our cross compiler, linker, headers, libs, and sysroot (crt.o and friends). Most Android devices are ARMv7-A, so you’d use --arch arm. See the other supported architectures for cross compiling under table 4.

You might also want to change your install-dir and possible add it to your $PATH, or set $CC and $CXX.

Now we can compile hello_world.c.

1
2
3
4
5
6
7
8
9
10
11
➜  ~ cat hello_world.c
#include <stdio.h>
int main () {
  puts("hello world");
}

➜  ~ ~/arm/bin/clang -pie hello_world.c
➜  ~ file a.out
a.out: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically
linked, interpreter /system/bin/linker64,
BuildID[sha1]=ecc46648cf2c873253b3b522c0d14b91cf17c70f, not stripped

Since Android Lollipop, Android has required that executables be linked as position independent (-pie) to help provide ASLR.

<install-dir>/bin/ also has shell scripts with more full featured names like aarch64-linux-android-clang if you prefer to have clearer named executables in your $PATH.

Connect your phone, enable remote debugging, and accept the prompt for remote debugging.

1
2
3
➜  ~ adb push a.out /data/local/tmp/.
➜  ~ adb shell "./data/local/tmp/a.out"
hello world

We’ll use this toolchain in a follow post to start writing some ARMv8-A assembly. Stay tuned.

Setting Up Mutt With Gmail on Ubuntu

| Comments

I was looking to set up the mutt email client on my Ubuntu box to go through my gmail account. Since it took me a couple of hours to figure out, and I’ll probably forget by the time I need to know again, I figure I’d post my steps here.

I’m on Ubuntu 16.04 LTS (lsb_release -a)

Install mutt:

1
$ sudo apt-get install mutt

In gmail, allow other apps to access gmail:

Allowing less secure apps to access your account Less Secure Apps

Create the folder

1
2
3
$ sudo touch $MAIL
$ sudo chmod 660 $MAIL
$ sudo chown `whoami`:mail $MAIL

where $MAIL for me was /var/mail/nick.

Create the ~/.muttrc file

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
set realname = "<first and last name>"
set from = "<gmail username>@gmail.com"
set use_from = yes
set envelope_from = yes

set smtp_url = "smtps://<gmail username>@gmail.com@smtp.gmail.com:465/"
set smtp_pass = "<gmail password>"
set imap_user = "<gmail username>@gmail.com"
set imap_pass = "<gmail password>"
set folder = "imaps://imap.gmail.com:993"
set spoolfile = "+INBOX"
set ssl_force_tls = yes

# G to get mail
bind index G imap-fetch-mail
set editor = "vim"
unset record
set move = no
set charset = "utf-8"

I’m sure there’s better/more config options. Feel free to go wild, this is by no means a comprehensive setup.

Run mutt:

1
$ mutt

We should see it connect and download our messages. m to start sending new messages. G to fetch new messages.

Data Models and Word Size

| Comments

This post is a follow up to my previous blog post about word size.

Three C/C++ programmers walk into a bar. One argues that sizeof(void*) is equivalent to sizeof(long), one argues that sizeof(void*) is equivalent to sizeof(int), and the third argues it’s sizeof(long long). Simultaneously, they’re all right, but they’re also all wrong (and need a lesson about portable C code). What the hell is going on?

One of the first few programs a programmer might write after hello world is something like this:

1
2
3
4
5
6
7
#include <stdio.h>
int main () {
  printf("sizeof(int): %zu\n", sizeof(int));
  printf("sizeof(long): %zu\n", sizeof(long));
  printf("sizeof(long long): %zu\n", sizeof(long long));
  printf("sizeof(void*): %zu\n", sizeof(void*));
}

Note the use of the %zu format specifier, a C99 addition that isn’t portable to older compilers! (This post is more about considerations when porting older code to newer machines, not about porting newer code to run on older machines. Not having a standards compliant C compiler makes writing more portable C code even trickier, and is a subject for another blog post).

When I run that code on my x86-64 OSX machine, I get the following output:

1
2
3
4
sizeof(int): 4
sizeof(long): 8
sizeof(long long): 8
sizeof(void*): 8

So it looks like I would be the first programmer in the story in the first paragraph, since on my machine, it looks like sizeof(long) == sizeof(void*). Also note how sizeof(long long) is equivalent as well.

But what would happen if I compiled my code on a 32 bit machine? Luckily, my processor has backwards compatibility with 32b binaries, so I can cross compile it locally and still run it. Ex:

1
2
3
4
5
6
7
8
9
10
11
➜  clang sizeof.c -Wall -Wextra -Wpedantic
➜  file a.out
a.out: Mach-O 64-bit executable x86_64
➜  clang sizeof.c -Wall -Wextra -Wpedantic -m32
➜  file a.out
a.out: Mach-O executable i386
➜  ./a.out
sizeof(int): 4
sizeof(long): 4
sizeof(long long): 8
sizeof(void*): 4

Huh, suddenly sizeof(void*) == sizeof(int) == sizeof(long)! This seems to be the case of the second programmer from the story.

Both programmer 1 and programmer 2 might agree that the size of a pointer is equivalent to their machine’s respective word size, but that too would be an incorrect assumption for portable C/C++ code!

Programmer 3 goes through the hellscape that is installing a working compiler for Windows and building a 64b command line application (to be fair, installing command line tools for OSX is worse; installing a compiler for most OS’ leaves much to be desired). When they run that program, they see:

1
2
3
4
sizeof(int): 4
sizeof(long): 4
sizeof(long long): 8
sizeof(void*): 8

This is yet a third case (the third programmer from the story). In this case, only sizeof(long long) is equivalent to sizeof(void*).

Data Models

What these programmers are seeing is known as data models. Programmer 1 one on a 64b x86-64 OSX machine had an LP64 data model where longs (L), (larger long longs,) and pointers (P) are 64b, but ints were 32b. Programmer 2 on a 32b x86 OSX machine had an ILP32 data model where ints (I), longs (L), and pointers (P) were 32b, but long longs were 64b. Programmer 3 on a 64b x86-64 Windows machine had a LLP64 data model, where only long longs (LL) and pointers (P) were 64b, ints and longs being 32b.

Data model sizeof(int) sizeof(long) sizeof(long long) sizeof(void*) example
ILP32 32b 32b 64b 32b Win32, i386 OSX & Linux
LP64 32b 64b 64b 64b x86-64 OSX & Linux
LLP64 32b 32b 64b 64b Win64

There are older data models such as LP32 (Windows 3.1, Macintosh, where ints are 16b), and more exotic ones like ILP64, and SILP64. Knowing the data model thus is important for portable C/C++ code.

Historical Perspective

Running out of address space is and will continue to be tradition in computing. Applications become bigger as computer power and memory gets cheaper. Companies want to sell chips that have larger word sizes to address more memory, but early adopters don’t want to buy a computer where there favorite application hasn’t been compiled and thus doesn’t exist on yet. Someone from the back shouts virtual machines then ducks as a chair is thrown.

This document highlights some reasons why LP64 is preferred to ILP64: ILP64 made portable C code that only needed 32b of precision harder to maintain (on ILP64 an int was 64b, but a short was 16b!). It mentions how for data structures that did not contain pointers, their size would be the same on LLP64 as ILP32, which is the direction Microsoft went. LLP64 was essentially the ILP32 model with 64b pointers.

Linux also supports an ABI called x32 which can use x86-64 ISA improvements but uses 32b pointers to reduce the size of data structures that would otherwise have 64b pointers.

For a great historical perspective on the evolution of word size and data models, as well as the “toil and trouble” caused, this paper was an excellent reference. It describes Microsoft finally abandoning support for 16b data models in Windows XP 64. It mentions that the industry was pretty split between LP64, LLP64, and ILP64 as porting code from the good old days of ILP32 would break in different ways. That the use of long was more prevalent in Windows applications vs the use of int in unix applications. It also makes the key point that a lot of programmers from the ILP32 era made assumptions that sizeof(int) == sizeof(long) == sizeof(void*) which would not hold true for the LP64/LLP64 era.

One important point the paper makes makes that’s easily missed is that typedef wasn’t added to C until 1977 when hardware manufactures still couldn’t agree on how many bits were in a char (CHAR_BITS) and some machines were using 24b addressing schemes. stdint.h and inttypes.h did not exist yet.

This article talks about two main categories of effects of switching from ILP32 to LP64 and has excellent examples of problematic code. That section near the end is worth the read alone and makes excellent points to look for during code review.

Conclusion

Word size or ISA doesn’t tell you anything about sizeof(int), sizeof(long), or sizeof(long long). We also saw that one machine can support multiple different data models (when I compiled and ran the same code with the -m32 flag).

The C standard tells you minimum guaranteed sizes for these types, but the data model (part of the ABI, external to but abiding by the C standard) is what tells you about the specifics sizes of standard integers and pointers.

Further Reading

What’s in a Word?

| Comments

Recently, there some was some confusion between myself and a coworker over the definition of a “word.” I’m currently working on a blog post about data alignment and figured it would be good to clarify some things now, that we can refer to later.

Having studied computer engineering and being quite fond of processor design, when I think of a “word,” I think of the number of bits wide a processor’s general purpose registers are (aka word size). This places hard requirements on the largest representable number and address space. A 64 bit processor can represent 264-1 (1.8x1019) as the largest unsigned long integer, and address up to 264-1 (16 EiB) different addresses in memory.

Further, word size limits the possible combinations of operations the processor can perform, length of immediate values used, inflates the size of binary files and memory needed to store pointers, and puts pressure on instruction caches.

Word size also has implications on loads and stores based on alignment, as we’ll see in a follow up post.

When I think of 8 bit computers, I think of my first microcontroller: an Arduino with an Atmel AVR processor. When I think of 16 bit computers, I think of my first game console, a Super Nintendo with a Ricoh 5A22. When I think of 32 bit computers, I think of my first desktop with Intel’s Pentium III. And when I think of 64 bit computers, I think modern smartphones with ARMv8 instruction sets. When someone mentions a particular word size, what are the machines that come to mind for you?

So to me, when someone’s talking about a 64b processor, to that machine (and me) a word is 64b. When we’re referring to a 8b processor, a word is 8b.

Now, some confusion.

Back in my previous blog posts about x86-64 assembly, JITs, or debugging, you might have seen me use instructions that have suffixes of b for byte (8b), w for word (16b), dw for double word (32b), and qw for quad word (64b) (since SSE2 there’s also double quadwords of 128b).

Wait a minute! How suddenly does a “word” refer to 16b on a 64b processor, as opposed to a 64b “word?”

In short, historical baggage. Intel’s first hit processor was the 4004, a 4b processor released in 1971. It wasn’t until 1979 that Intel created the 16b 8086 processor.

The 8086 was created to compete with other 16b processors that beat it to the market, like the Zilog Z80 (any Gameboy emulator fans out there? Yes, I know about the Sharp LR35902). The 8086 was the first design in the x86 family, and it allowed for the same assembly syntax from the earlier 8008, 8080, and 8085 to be reassembled for it. The 8086’s little brother (8088) would be used in IBM’s PC, and the rest is history. x86 would become one of the most successful ISAs in history.

For backwards compatibility, it seems that both Microsoft’s (whose success has tracked that of x86 since MS-DOS and IBM’s PC) and Intel’s documentation refers to words still as being 16b. This allowed 16b PE32+ executables to be run on 32b or even 64b newer versions of Windows, without requiring recompilation of source or source code modification.

This isn’t necessarily wrong to refer to a word based on backwards compatibility, it’s just important to understand the context in which the term “word” is being used, and that there might be some confusion if you have a background with x86 assembly, Windows API programming, or processor design.

So the next time someone asks: why does Intel’s documentation commonly refer to a “word” as 16b, you can tell them that the x86 and x86-64 ISAs have maintained the notion of a word being 16b since the first x86 processor, the 8086, which was a 16b processor.

Side Note: for an excellent historical perspective programming early x86 chips, I recommend Michael Abrash’s Graphics Programming Black Book. For instance he talks about 8086’s little brother, the 8088, being a 16b chip but only having an 8b bus with which to access memory. This caused a mysterious “cycle eater” to prevent fast access to 16b variables, though they were the processor’s natural size. Michael also alludes to alignment issues we’ll see in a follow up post.

Intro to Debugging X86-64 Assembly

| Comments

I’m hacking on an assembly project, and wanted to document some of the tricks I was using for figuring out what was going on. This post might seem a little basic for folks who spend all day heads down in gdb or who do this stuff professionally, but I just wanted to share a quick intro to some tools that others may find useful. (oh god, I’m doing it)

If your coming from gdb to lldb, there’s a few differences in commands. LLDB has great documentation on some of the differences. Everything in this post about LLDB is pretty much there.

The bread and butter commands when working with gdb or lldb are:

  • r (run the program)
  • s (step in)
  • n (step over)
  • finish (step out)
  • c (continue)
  • q (quit the program)

You can hit enter if you want to run the last command again, which is really useful if you want to keep stepping over statements repeatedly.

I’ve been using LLDB on OSX. Let’s say I want to debug a program I can build, but is crashing or something:

1
$ sudo lldb ./asmttpd web_root

Setting a breakpoint on jump to label:

1
2
(lldb) b sys_write
Breakpoint 3: where = asmttpd`sys_write, address = 0x00000000000029ae

Running the program until breakpoint hit:

1
2
3
4
5
6
7
8
9
10
(lldb) r
Process 32236 launched: './asmttpd' (x86_64)
Process 32236 stopped
* thread #1: tid = 0xe69b9, 0x00000000000029ae asmttpd`sys_write, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
    frame #0: 0x00000000000029ae asmttpd`sys_write
asmttpd`sys_write:
->  0x29ae <+0>: pushq  %rdi
    0x29af <+1>: pushq  %rsi
    0x29b0 <+2>: pushq  %rdx
    0x29b1 <+3>: pushq  %r10

Seeing more of the current stack frame:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
(lldb) d
asmttpd`sys_write:
->  0x29ae <+0>:  pushq  %rdi
    0x29af <+1>:  pushq  %rsi
    0x29b0 <+2>:  pushq  %rdx
    0x29b1 <+3>:  pushq  %r10
    0x29b3 <+5>:  pushq  %r8
    0x29b5 <+7>:  pushq  %r9
    0x29b7 <+9>:  pushq  %rbx
    0x29b8 <+10>: pushq  %rcx
    0x29b9 <+11>: movq   %rsi, %rdx
    0x29bc <+14>: movq   %rdi, %rsi
    0x29bf <+17>: movq   $0x1, %rdi
    0x29c6 <+24>: movq   $0x2000004, %rax
    0x29cd <+31>: syscall
    0x29cf <+33>: popq   %rcx
    0x29d0 <+34>: popq   %rbx
    0x29d1 <+35>: popq   %r9
    0x29d3 <+37>: popq   %r8
    0x29 <+39>: popq   %r10
    0x29d7 <+41>: popq   %rdx
    0x29d8 <+42>: popq   %rsi
    0x29d9 <+43>: popq   %rdi
    0x29da <+44>: retq

Getting a back trace (call stack):

1
2
3
4
5
6
7
(lldb) bt
* thread #1: tid = 0xe69b9, 0x00000000000029ae asmttpd`sys_write, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000000000029ae asmttpd`sys_write
    frame #1: 0x00000000000021b6 asmttpd`print_line + 16
    frame #2: 0x0000000000002ab3 asmttpd`start + 35
    frame #3: 0x00007fff9900c5ad libdyld.dylib`start + 1
    frame #4: 0x00007fff9900c5ad libdyld.dylib`start + 1

peeking at the upper stack frame:

1
2
3
4
5
6
7
(lldb) up
frame #1: 0x00000000000021b6 asmttpd`print_line + 16
asmttpd`print_line:
    0x21b6 <+16>: movabsq $0x30cb, %rdi
    0x21c0 <+26>: movq   $0x1, %rsi
    0x21c7 <+33>: callq  0x29ae                    ; sys_write
    0x21cc <+38>: popq   %rcx

back down to the breakpoint-halted stack frame:

1
2
3
4
5
6
7
(lldb) down
frame #0: 0x00000000000029ae asmttpd`sys_write
asmttpd`sys_write:
->  0x29ae <+0>: pushq  %rdi
    0x29af <+1>: pushq  %rsi
    0x29b0 <+2>: pushq  %rdx
    0x29b1 <+3>: pushq  %r10

dumping the values of registers:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
(lldb) register read
General Purpose Registers:
       rax = 0x0000000000002a90  asmttpd`start
       rbx = 0x0000000000000000
       rcx = 0x00007fff5fbffaf8
       rdx = 0x00007fff5fbffa40
       rdi = 0x00000000000030cc  start_text
       rsi = 0x000000000000000f
       rbp = 0x00007fff5fbffa18
       rsp = 0x00007fff5fbff9b8
        r8 = 0x0000000000000000
        r9 = 0x00007fff7b1670c8  atexit_mutex + 24
       r10 = 0x00000000ffffffff
       r11 = 0xffffffff00000000
       r12 = 0x0000000000000000
       r13 = 0x0000000000000000
       r14 = 0x0000000000000000
       r15 = 0x0000000000000000
       rip = 0x00000000000029ae  asmttpd`sys_write
    rflags = 0x0000000000000246
        cs = 0x000000000000002b
        fs = 0x0000000000000000
        gs = 0x0000000000000000

read just one register:

1
2
(lldb) register read rdi
     rdi = 0x00000000000030cc  start_text

When you’re trying to figure out what system calls are made by some C code, using dtruss is very helpful. dtruss is available on OSX and seems to be some kind of wrapper around DTrace.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
$ cat sleep.c
#include <time.h>
int main () {
  struct timespec rqtp = {
    2,
    0
  };

  nanosleep(&rqtp, NULL);
}

$ clang sleep.c

$ sudo dtruss ./a.out
...all kinds of fun stuff
__semwait_signal(0xB03, 0x0, 0x1)    = -1 Err#60

If you compile with -g to emit debug symbols, you can use lldb’s disassemble command to get the equivalent assembly:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
$ clang sleep.c -g
$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) b main
Breakpoint 1: where = a.out`main + 16 at sleep.c:3, address = 0x0000000100000f40
(lldb) r
Process 33213 launched: '/Users/Nicholas/code/assembly/asmttpd/a.out' (x86_64)
Process 33213 stopped
* thread #1: tid = 0xeca04, 0x0000000100000f40 a.out`main + 16 at sleep.c:3, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100000f40 a.out`main + 16 at sleep.c:3
   1    #include <time.h>
   2    int main () {
-> 3      struct timespec rqtp = {
   4        2,
   5        0
   6      };
   7
(lldb) disassemble
a.out`main:
    0x100000f30 <+0>:  pushq  %rbp
    0x100000f31 <+1>:  movq   %rsp, %rbp
    0x100000f34 <+4>:  subq   $0x20, %rsp
    0x100000f38 <+8>:  leaq   -0x10(%rbp), %rdi
    0x100000f3c <+12>: xorl   %eax, %eax
    0x100000f3e <+14>: movl   %eax, %esi
->  0x100000f40 <+16>: movq   0x49(%rip), %rcx
    0x100000f47 <+23>: movq   %rcx, -0x10(%rbp)
    0x100000f4b <+27>: movq   0x46(%rip), %rcx
    0x100000f52 <+34>: movq   %rcx, -0x8(%rbp)
    0x100000f56 <+38>: callq  0x100000f68               ; symbol stub for: nanosleep
    0x100000f5b <+43>: xorl   %edx, %edx
    0x100000f5d <+45>: movl   %eax, -0x14(%rbp)
    0x100000f60 <+48>: movl   %edx, %eax
    0x100000f62 <+50>: addq   $0x20, %rsp
    0x100000f66 <+54>: popq   %rbp
    0x100000f67 <+55>: retq

Anyways, I’ve been learning some interesting things about OSX that I’ll be sharing soon. If you’d like to learn more about x86-64 assembly programming, you should read my other posts about writing x86-64 and a toy JIT for Brainfuck (the creator of Brainfuck liked it).

I should also do a post on Mozilla’s rr, because it can do amazing things like step backwards. Another day…

My SIGGRAPH 2015 Experience

| Comments

I was recently lucky enough to get to attend my first SIGGRAPH conference this year. While I didn’t attend any talks, I did spend some time in the expo. Here is a collection of some of the neat things I saw at SIGGRAPH 2015. Sorry it’s not more collected; I didn’t have the intention of writing a blog post until after folks kept asking me “how was it?”

VR

Most booths had demos on VR headsets. Many were DK2’s and GearVR’s. AMD and NVIDIA had Crescent Bay’s (next gen VR headset). It was noticeably lighter than the DK2, and I thought it rendered better quality. It had nicer cable bundling, and headphones built in, that could fold up and lock out of the way that made it nice to put on/take off. I also tried a Sony Morpheus. They had a very engaging demo that was a tie in to the upcoming movie about tight rope walking, “The Walk”. They had a thin PVC pipe taped to the floor that you had to balance on, and a fan, and you were tight rope walking between the Twin Towers. Looking down and trying to balance was terrifying. There were some demos with a strange mobile VR setup where folks had a backpack on that had an open laptop hanging off the back and could walk around. Toyota and Ford had demos where you could inspect their vehicles in virtual space. I did not see a single HTC/Valve Vive at SIGGRAPH.

AR

Epson had some AR glasses. They were very glasses friendly, unlike most VR headsets. The nose piece was flexible, and if you flattened it out, the headset could rest on top of your glasses and worked well. The headset had some very thick compound lenses. There was a front facing camera and they had a simple demo using image recognition of simple logos (like QR codes) that helped provide position data. There were other demos with orientation tracking that worked well. They didn’t have positional sensor info, but had some hack that tried to estimate positional velocity off the angular momentum (I spoke with the programmer who implemented it). https://moverio.epson.biz/

Holograms

There was a demo of holograms using tilted pieces of plastic arranged in a box. Also, there was a multiple (200+) projector array that projected a scene onto a special screen. When walking around the screen, the viewing angle always seemed correct. It was very convincing, except for the jarring restart of the animated loop which could be smoothed out (think looping/seamless gifs).

VR/3D video

Google cardboard had a booth showing off 3D videos from youtube. I had a hard time telling if the video were stereoscopic or monoptic since the demo videos only had things in the distance so it was hard to tell if parallax was implemented correctly. A bunch of booths were showing off 3D video, but as far as I could tell, all of the correctly rendered stereoscopic shots were computer rendered. I could not find a single instance with footage shot from a stereoscopic rig, though I tried.

Booths/Expo

NVIDIA and Intel had the largest booths, followed by Pixar’s Renderman. Felt like a GDC event, smaller, but definitely larger than GDC next. More focus on shiny photorealism demos, artistic tools, less on game engines themselves.

Vulcan/OpenGL ES 3.2

Intel had demos of Vulcan and OpenGL ES 3.2. For 3.2 they were showing off tessellation shaders, I think. For the Vulcan demo, they had a cool demo showing how with a particle demo scene rendered with OpenGL 4, a single CPU was pegged, it was using a lot of power, and had pretty abysmal framerate. When rendering the same scene with Vulcan, they were able to more evenly distribute the workload across CPUs, achieve higher framerate, while using less power. The API to Vulcan is still not published, so no source code is available. It was explained to me that Vulcan is still not thread safe; instead you get the freedom to implement synchronization rather than the driver.

Planetarium

There was a neat demo of a planetarium projector being repurposed to display an “on rails” demo of a virtual scene. You didn’t get parallax since it was being projected on a hemisphere, but it was neat in that like IMAX your entire FOV was encompassed, but you could move your head, not see any pixels, and didn’t experience any motion sickness or disorientation.

X3D/X3DOM

I spoke with some folks at the X3D booth about X3DOM. To me, it seems like a bunch of previous attempts have kind of added on too much complexity in an effort to support every use case under the sun, rather than just accept limitations, so much so that getting started writing hello world became difficult. Some of the folks I spoke to at the booth echoed this sentiment, but also noted the lack of authoring tools as things that hurt adoption. I have some neat things I’m working on in this space, based on this and other prior works, that I plan on showing off at the upcoming BrazilJS.

Maker Faire

There was a cool maker faire, some things I’ll have to order for family members (young hackers in training) were Canny bots, eBee and Piper.

Experimental tech

Bunch of neat input devices, one I liked used directional sound as tactile feedback. One demo was rearranging icons on a home screen. Rather than touch the screen, there was a field of tiny speakers that would blast your finger with sound when it entered to simulate the feeling of vibration. It would vibrate to let you know you had “grabbed” and icon, and then drag it.

Book Signing

This was the first time I got to see my book printed in physical form! It looked gorgeous, hardcover printed in color. I met about half of the fellow authors who were also at SIGGRAPH, and our editor. I even got to meet Eric Haines, who reviewed my chapter before publication!

Additional C/C++ Tooling

| Comments

21st Century C by Ben Klemens was a great read. It had a section with an intro to autotools, git, and gdb. There are a few other useful tools that came to mind that I’ve used when working with C and C++ codebases. These tools are a great way to start contributing to Open Source C & C++ codebases; running these tools on the code or adding them to the codebases. A lot of these favor command line, open source utilities. See how many you are familiar with!

Build Tools

CMake

The first tool I’d like to take a look at is CMake. CMake is yet another build tool; I realize how contentious it is to even discuss one of the many. From my experience working with Emscripten, we recommend the use of CMake for people writing portable C/C++ programs. CMake is able to emit Makefiles for unixes, project files for Xcode on OSX, and project files for Visual Studio on Windows. There are also a few other “generators” that you can use.

I’ve been really impressed with CMake’s modules for finding dependencies and another for fetching and building external dependencies. I think C++ needs a package manager badly, and I think CMake would be a solid foundation for one.

The syntax isn’t the greatest, but when I wanted to try to build one of my C++ projects on Windows which I know nothing about developing on, I was able to install CMake and Visual Studio and get my project building. If you can build your code on one platform, it will usually build on the others.

If you’re not worried about writing cross platform C/C++, maybe CMake is not worth the effort, but I find it useful. I wrestle with the syntax sometimes, but documentation is not bad and it’s something you deal with early on in the development of a project and hopefully never have to touch again (how I wish that were true).

Code Formatters

ClangFormat

Another contentious point of concern amongst developers is code style. Big companies with lots of C++ code have documents explaining their stylistic choices. Don’t waste another hour of your life arguing about something that really doesn’t matter. ClangFormat will help you codify your style and format your code for you to match the style. Simply write the code however you want, and run the formatter on it before commiting it.

It can also emit a .clang-format file that you can commit and clang-format will automatically look for that file and use the rules codified there.

Linters

Flint / Flint++

Flint is a C++ linter in use at Facebook. Since it moved from being implemented in C++ to D, I’ve had issues building it. I’ve had better luck with a fork that’s pure C++ without any of the third party dependencies Flint originally had, called Flint++. While not quite full-on static analyzers, both can be used for finding potential issues in your code ahead of time. Linters can look at individual files in isolation; you don’t have to wait for long recompiles like you would with a static analyzer.

Static Analyzers

Scan-build

Scan-build is a static analyzer for C and C++ code. You build your code “through” it, then use the sibling tool scan-view to see the results. Scan-view will emit and open an html file that shows a list of the errors it detected. It will insert hyperlinks into the resulting document that step you through how certain conditions could lead to a null pointer dereference, for example. You can also save and share those html files with others in the project. Static analyzers will help you catch bugs at compile time before you run the code.

Runtime Sanitizers

ASan and UBSan

Clang’s Address (ASan) and Undefined Behavior (UBSan) sanitizers are simply compiler flags that can be used to detect errors at runtime. ASan and UBSan two of the more popular tools, but there are actually a ton and more being implemented. See the list here. These sanitizers will catch bugs at runtime, so you’ll have to run the code to notice any violations, at variable runtime performance costs per sanitizer. ASan and TSan (Thread Sanitizer) made it into gcc4.8 and UBSan is in gcc4.9.

Header Analysis

Include What You Use

Include What You Use (IWYU) helps you find unused or unnecessary #include preprocessor directives. It should be obvious how this can help improve compile times. IWYU can also help cut down on recompiles by recommending forward declarations under certain conditions. I look forward to the C++ module proposal being adopted, but until then this tool can help you spot cruft that can be removed.

Rapid Recompiles

ccache

ccache greatly improves recompile times by caching the results of parts of the compilation process. I use when building Firefox, and it saves a great deal of time.

distcc

distcc is a distributed build system. Some folks at Mozilla speed up their Firefox builds with it.

Memory Leak Detectors

Valgrind

Valgrind has a suite of tools, my favorite being memcheck for finding memory leaks. Unfortunately, it doesn’t seem to work on OSX since 10.10. This page referring to ASan seems to indicate that it can do everything Valgrind’s Memcheck can, at less of a runtime performance cost, but I’m not sure how true this is exactly.

leaks

A much more primitive tool for finding leaks from the command line, BSD’s have leaks.

1
2
3
MallocStackLogging=1 ./a.out
leaks a.out
...

Profilers

Perf

Perf, and Brendan Gregg’s tools for emitting SVG flamegraphs from the output are helpful for finding where time is spent in a program. In fact, there are numerous perfomance analysis tools that are Linux specific. My recommendation is spend some time on Brendan Gregg’s blog.

DTrace

OSX doesn’t have the same tooling as Linux, but DTrace was ported to it. I’ve used it to find sampling profiles of my code before. Again, Brendan Gregg’s blog is a good resource; there are some fantastic DTrace one liners.

Debuggers

lldb

lldb is analogous to gdb. I can’t say I have enough experience with LLDB and GDB to note the difference between the two, but LLDB did show the relative statements forward and back from the current statement by default. I’m informed by my friends/mortal enemies using emacs that this is less of an issue when using emacs/gdb in combination.

Fuzzers

American Fuzzy Lop

American Fuzzy Lop (AFL) is a neat program that performs fuzzing on programs that take inputs from files and repeatedly runs the program, modifies the input trying to get full code coverage, and tries to find crashes. It’s been getting lots of attention lately, and while I haven’t used it myself yet, it seems like a very powerful tool. Mozilla employs the use of fuzzers on their JavaScript engine, for instance (not AFL, but one developed in house).

Disassemblers

gobjdump

If you really need to make sure the higher level code you’re writing is getting translated into the assembly your expecting, gobjdump -S will intermix the emitted binary’s disassembled assembly and the source code. This was used extensively while developing my Brainfuck JIT.

Conclusion

Hopefully you learned of some useful tools that you should know about when working with C or C++. What did I miss?

Interpreter, Compiler, JIT

| Comments

Interpreters and compilers are interesting programs, themselves used to run or translate other programs, respectively. Those other programs that might be interpreted might be languages like JavaScript, Ruby, Python, PHP, and Perl. The other programs that might be compiled are C, C++, and to some extent Java and C#.

Taking the time to do translation to native machine code ahead of time can result in better performance at runtime, but an interpreter can get to work right away without spending any time translating. There happens to be a sweet spot somewhere in between interpretation and compilation that combines the best of both worlds. Such a technique is called Just In Time (JIT) compiling. While interpreting, compiling, and JIT’ing code might sound radically different, they’re actually strikingly similar. In this post, I hope to show how similar by comparing the code for an interpreter, a compiler, and a JIT compiler for the language Brainfuck in around 100 lines of C code each.

All of the code in the post is up on GitHub.

Brainfuck is an interesting, if hard to read, language. It only has eight operations it can perform > < + - . , [ ], yet is Turing complete. There’s nothing really to lex; each character is a token, and if the token is not one of the eight operators, it’s ignored. There’s also not much of a grammar to parse; the forward jumping and backwards jumping operators should be matched for well formed input, but that’s about it. In this post, we’ll skip over validating input assuming well formed input so we can focus on the interpretation/code generation. You can read more about it on the Wikipedia page, which we’ll be using as a reference throughout.

A Brainfuck program operates on a 30,000 element byte array initialized to all zeros. It starts off with an instruction pointer, that initially points to the first element in the data array or “tape.” In C code for an interpreter that might look like:

1
2
3
4
5
// Initialize the tape with 30,000 zeroes.
unsigned char tape [30000] = { 0 };

// Set the pointer to point at the left most cell of the tape.
unsigned char* ptr = tape;

Then, since we’re performing an operation for each character in the Brainfuck source, we can have a for loop over every character with a nested switch statement containing case statements for each operator.

The first two operators, > and < increment and decrement the data pointer.

1
2
case '>': ++ptr; break;
case '<': --ptr; break;

One thing that could be bad is that because the interpreter is written in C and we’re representing the tape as an array but we’re not validating our inputs, there’s potential for stack buffer overrun since we’re not performing bounds checks. Again, punting and assuming well formed input to keep the code and the point more precise.

Next up are the + and - operators, used for incrementing and decrementing the cell pointed to by the data pointer by one.

1
2
case '+': ++(*ptr); break;
case '-': --(*ptr); break;

The operators . and , provide Brainfuck’s only means of input or output, by writing the value pointed to by the instruction pointer to stdout as an ASCII value, or reading one byte from stdin as an ASCII value and writing it to the cell pointed to by the instruction pointer.

1
2
case '.': putchar(*ptr); break;
case ',': *ptr = getchar(); break;

Finally, our looping constructs, [ and ]. From the definition on Wikipedia for [: if the byte at the data pointer is zero, then instead of moving the instruction pointer forward to the next command, jump it forward to the command after the matching ] command and for ]: if the byte at the data pointer is nonzero, then instead of moving the instruction pointer forward to the next command, jump it back to the command after the matching [ command.

I interpret that as:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
case '[':
  if (!(*ptr)) {
    int loop = 1;
    while (loop > 0) {
      unsigned char current_char = input[++i];
      if (current_char == ']') {
        --loop;
      } else if (current_char == '[') {
        ++loop;
      }
    }
  }
  break;
case ']':
  if (*ptr) {
    int loop = 1;
    while (loop > 0) {
      unsigned char current_char = input[--i];
      if (current_char == '[') {
        --loop;
      } else if (current_char == ']') {
        ++loop;
      }
    }
  }
  break;

Where the variable loop keeps track of open brackets for which we’ve not seen a matching close bracket, aka our nested depth.

So we can see the interpreter is quite basic, in around 50 SLOC were able to read a byte, and immediately perform an action based on the operator. How we perform that operation might not be the fastest though.

How about if we want to compile the Brainfuck source code to native machine code? Well, we need to know a little bit about our host machine’s Instruction Set Architecture (ISA) and Application Binary Interface (ABI). The rest of the code in this post will not be as portable as the above C code, since it assumes an x86-64 ISA and UNIX ABI. Now would be a good time to take a detour and learn more about writing assembly for x86-64. The interpreter is even portable enough to build with Emscripten and run in a browser!

For our compiler, we’ll iterate over every character in the source file again, switching on the recognized operator. This time, instead of performing an action right away, we’ll print assembly instructions to stdout. Doing so requires running the compiler on an input file, redirecting stdout to a file, then running the system assembler and linker on that file. We’ll stick with just compiling and not assembling (though it’s not too difficult), and linking (for now).

First, we need to print a prologue for our compiled code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
const char* const prologue =
  ".text\n"
  ".globl _main\n"
  "_main:\n"
  "  pushq %rbp\n"
  "  movq %rsp, %rbp\n"
  "  pushq %r12\n"        // store callee saved register
  "  subq $30008, %rsp\n" // allocate 30,008 B on stack, and realign
  "  leaq (%rsp), %rdi\n" // address of beginning of tape
  "  movl $0, %esi\n"     // fill with 0's
  "  movq $30000, %rdx\n" // length 30,000 B
  "  call _memset\n"      // memset
  "  movq %rsp, %r12";
puts(prologue);

During the linking phase, we’ll make sure to link in libc so we can call memset. What we’re doing is backing up callee saved registers we’ll be using, stack allocating the tape, realigning the stack (x86-64 ABI point #1), copying the address of the only item on the stack into a register for our first argument, setting the second argument to the constant 0, the third arg to 30000, then calling memset. Finally, we use the callee saved register %r12 as our instruction pointer, which is the address into a value on the stack.

We can expect the call to memset to result in a segfault if simply subtract just 30000B, and not realign for the 2 registers (64 b each, 8 B each) we pushed on the stack. The first pushed register aligns the stack on a 16 B boundary, the second misaligns it; that’s why we allocate an additional 8 B on the stack (x86-64 ABI point #1). The stack is mis-aligned upon function entry in x86-64. 30000 is a multiple of 16.

Moving the instruction pointer (>, <) and modifying the pointed to value (+, -) are straight-forward:

1
2
3
4
5
6
7
8
9
10
11
12
case '>':
  puts("  inc %r12");
  break;
case '<':
  puts("  dec %r12");
  break;
case '+':
  puts("  incb (%r12)");
  break;
case '-':
  puts("  decb (%r12)");
  break;

For output, ., we need to copy the pointed to byte into the register for the first argument to putchar. We explicitly zero out the register before calling putchar, since it takes an int (32 b), but we’re only copying a char (8 b) (Look up C’s type promotion rules for more info). x86-64 has an instruction that does both, movzXX, Where the first X is the source size (b, w) and the second is the destination size (w, l, q). Thus movzbl moves a byte (8 b) into a double word (32 b). %rdi and %edi are the same register, but %rdi is the full 64 b register, while %edi is the lowest (or least significant) 32 b.

1
2
3
4
5
6
case '.':
  // move byte to double word and zero upper bits since putchar takes an
  // int.
  puts("  movzbl (%r12), %edi");
  puts("  call _putchar");
  break;

Input (,) is easy; call getchar, move the resulting lowest byte into the cell pointed to by the instruction pointer. %al is the lowest 8 b of the 64 b %rax register.

1
2
3
4
case ',':
  puts("  call _getchar");
  puts("  movb %al, (%r12)");
  break;

As usual, the looping constructs ([ & ]) are much more work. We have to match up jumps to matching labels, but for an assembly program, labels must be unique. One way we can solve for this is whenever we encounter an opening brace, push a monotonically increasing number that represents the numbers of opening brackets we’ve seen so far onto a stack like data structure. Then, we do our comparison and jump to what will be the label that should be produced by the matching close label. Next, we insert our starting label, and finally increment the number of brackets seen.

1
2
3
4
5
6
case '[':
  stack_push(&stack, num_brackets);
  puts("  cmpb $0, (%r12)");
  printf("  je bracket_%d_end\n", num_brackets);
  printf("bracket_%d_start:\n", num_brackets++);
  break;

For close brackets, we pop the number of brackets seen (or rather, number of pending open brackets which we have yet to see a matching close bracket) off of the stack, do our comparison, jump to the matching start label, and finally place our end label.

1
2
3
4
5
6
case ']':
  stack_pop(&stack, &matching_bracket);
  puts("  cmpb $0, (%r12)");
  printf("  jne bracket_%d_start\n", matching_bracket);
  printf("bracket_%d_end:\n", matching_bracket);
  break;

So for sequential loops ([][]) we can expect the relevant assembly to look like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  cmpb $0, (%r12)
  je bracket_0_end
bracket_0_start:

  cmpb $0, (%r12)
  jne bracket_0_start
bracket_0_end:

  cmpb $0, (%r12)
  je bracket_1_end
bracket_1_start:

  cmpb $0, (%r12)
  jne bracket_1_start
bracket_1_end:

and for nested loops ([[]]), we can expect assembly like the following (note the difference in the order of numbered start and end labels):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  cmpb $0, (%r12)
  je bracket_0_end
bracket_0_start:

  cmpb $0, (%r12)
  je bracket_1_end
bracket_1_start:

  cmpb $0, (%r12)
  jne bracket_1_start
bracket_1_end:

  cmpb $0, (%r12)
  jne bracket_0_start
bracket_0_end:

Finally, we need an epilogue to clean up the stack and callee saved registers after ourselves.

1
2
3
4
5
6
const char* const epilogue =
  "  addq $30008, %rsp\n" // clean up tape from stack.
  "  popq %r12\n" // restore callee saved register
  "  popq %rbp\n"
  "  ret\n";
puts(epilogue);

The compiler is a pain when modifying and running a Brainfuck program; it takes a couple extra commands to compile the Brainfuck program to assembly, assemble the assembly into an object file, link it into an executable, and run it whereas with the interpreter we can just run it. The trade off is that the compiled version is quite a bit faster. How much faster? Let’s save that for later.

Wouldn’t it be nice if there was a translation & execution technique that didn’t force us to compile our code every time we changed it and wanted to run it, but also performance closer to that of compiled code? That’s where a JIT compiler comes in!

For the basics of JITing code, make sure you read my previous article on the basics of JITing code in C. We’re going to follow the same technique of creating executable memory, copying bytes into that memory, casting it to a function pointer, then calling it. Just like the interpreter and the compiler, we’re going to do a unique action for each recognized token. What’s different is that for each operator, we’re going to push opcodes into a dynamic array, that way it can grow based on our sequential reading of input and will simplify our calculation of relative offsets for branching operations.

The other special thing we’re going to do it that we’re going to pass the address of our libc functions (memset, putchar, and getchar) into our JIT’ed function at runtime. This avoids those kooky stub functions you might see in a disassembled executable. That means we’ll be invoking our JIT’ed function like:

1
2
3
4
5
typedef void* fn_memset (void*, int, size_t);
typedef int fn_putchar (int);
typedef int fn_getchar ();
void (*jitted_func) (fn_memset, fn_putchar, fn_getchar) = mem;
jitted_func(memset, putchar, getchar);

Where mem is our mmap’ed executable memory with our opcodes copied into it, and the typedef’s are for the respective function signatures for our function pointers we’ll be passing to our JIT’ed code. We’re kind of getting ahead of ourselves, but knowing how we will invoke the dynamically created executable code will give us an idea of how the code itself will work.

The prologue is quite a bit involved, so we’ll take it step at a time. First, we have the usual prologue:

1
2
3
char prologue [] = {
  0x55, // push rbp
  0x48, 0x89, 0xE5, // mov rsp, rbp

Then we want to back up our callee saved registers that we’ll be using. Expect horrific and difficult to debug bugs if you forget to do this.

1
2
3
  0x41, 0x54, // pushq %r12
  0x41, 0x55, // pushq %r13
  0x41, 0x56, // pushq %r14

At this point, %rdi will contain the address of memset, %rsi will contain the address of putchar, and %rdx will contain the address of getchar, see x86-64 ABI point #2. We want to store these in callee saved registers before calling any of them, else they may clobber %rdi, %rsi, or %rdx since they’re not “callee saved,” rather “call clobbered.” See x86-64 ABI point #4.

1
2
3
  0x49, 0x89, 0xFC, // movq %rdi, %r12
  0x49, 0x89, 0xF5, // movq %rsi, %r13
  0x49, 0x89, 0xD6, // movq %rdx, %r14

At this point, %r12 will contain the address of memset, %r13 will contain the address of putchar, and %r14 will contain the address of getchar.

Next up is allocating 30008 B on the stack:

1
  0x48, 0x81, 0xEC, 0x38, 0x75, 0x00, 0x00, // subq $30008, %rsp

This is our first hint at how numbers, whose value is larger than the maximum representable value in a byte, are represented on x86-64. Where in this instruction is the value 30008? The answer is the 4 byte sequence 0x38, 0x75, 0x00, 0x00. The x86-64 architecture is “Little Endian,” which means that the least significant bit (LSB) is first and the most significant bit (MSB) is last. When humans do math, they typically represent numbers the other way, or “Big Endian.” Thus we write decimal ten as “10” and not “01.” So that means that 0x38, 0x75, 0x00, 0x00 in Little Endian is 0x00, 0x00, 0x75, 0x38 in Big Endian, which then is 7*16^3+5*16^2+3*16^1+8*16^0 which is 30008 in decimal, the amount of bytes we want to subtract from the stack. We’re allocating an additional 8 B on the stack for alignment requirements, similar to the compiler. By pushing even numbers of 64 b registers, we need to realign our stack pointer.

Next in the prologue, we set up and call memset:

1
2
3
4
5
6
7
8
  // address of beginning of tape
  0x48, 0x8D, 0x3C, 0x24, // leaq (%rsp), %rdi
  // fill with 0's
  0xBE, 0x00, 0x00, 0x00, 0x00, // movl $0, %esi
  // length 30,000 B
  0x48, 0xC7, 0xC2, 0x30, 0x75, 0x00, 0x00, // movq $30000, %rdx
  // memset
  0x41, 0xFF, 0xD4, // callq *%r12

After invoking memset, %rdi, %rsi, & %rcx will contain garbage values since they are “call clobbered” registers. At this point we no longer need memset, so we now use %r12 as our instruction pointer. %rsp will point to the top (technically the bottom) of the stack, which is the beginning of our memset’ed tape. That’s the end of our prologue.

1
2
  0x49, 0x89, 0xE4 // movq %rsp, %r12
};

We can then push our prologue into our dynamic array implementation:

1
vector_push(&instruction_stream, prologue, sizeof(prologue))

Now we iterate over our Brainfuck program and switch on the operations again. For pointer increment and decrement, we just nudge %r12.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
case '>':
  {
    char opcodes [] = {
      0x49, 0xFF, 0xC4 // inc %r12
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;
case '<':
  {
    char opcodes [] = {
      0x49, 0xFF, 0xCC // dec %r12
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

That extra fun block in the switch statement is because in C/C++, we can’t define variables in the branches of switch statements.

Pointer deref then increment/decrement are equally uninspiring:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
case '+':
  {
    char opcodes [] = {
      0x41, 0xFE, 0x04, 0x24 // incb (%r12)
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;
case '-':
  {
    char opcodes [] = {
      0x41, 0xFE, 0x0C, 0x24 // decv (%r12)
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

I/O might be interesting, but in x86-64 we have an opcode for calling the function at the end of a pointer. %r13 contains the address of putchar while %r14 contains the address of getchar.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
case '.':
  {
    char opcodes [] = {
      0x41, 0x0F, 0xB6, 0x3C, 0x24, // movzbl (%r12), %edi
      0x41, 0xFF, 0xD5 // callq *%r13
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;
case ',':
  {
    char opcodes [] = {
      0x41, 0xFF, 0xD6, // callq *%r14
      0x41, 0x88, 0x04, 0x24 // movb %al, (%r12)
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

Now with our looping constructs, we get to the fun part. With the compiler, we deferred the concept of “relocation” to the assembler. We simply emitted labels, that the assembler turned into relative offsets (jumps by values relative to the last byte in the jump instruction). We’ve found ourselves in a Catch-22 though: how many bytes forward do we jump to the matching close bracket that we haven’t seen yet?

Normally, an assembler might have a data structure known as a “relocation table.” It keeps track of the first byte after a label and jumps, rewriting jumps-to-labels (which aren’t kept around in the resulting binary executable) to relative jumps. Spidermonkey, Firefox’s JavaScript Virtual Machine has two classes for this, MacroAssembler and Label. Spidermonkey embeds a linked list in the opcodes it generates for jumps with which it’s yet to see a label for. Once it finds the label, it walks the linked list (which itself is embedded in the emitted instruction stream) patching up these locations as it goes.

For Brainfuck, we don’t have to anything quite as fancy since each label only ends up having one jump site. Instead, we can use a stack of integers that are offsets into our dynamic array, and do the relocation once we know where exactly we’re jumping to.

1
2
3
4
5
6
7
8
9
10
11
case '[':
  {
    char opcodes [] = {
      0x41, 0x80, 0x3C, 0x24, 0x00, // cmpb $0, (%r12)
      // Needs to be patched up
      0x0F, 0x84, 0x00, 0x00, 0x00, 0x00 // je <32b relative offset, 2's compliment, LE>
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  stack_push(&relocation_table, instruction_stream.size); // create a label after
  break;

First we push the compare and jump opcodes, but for now we leave the relative offset blank (four zero bytes). We will come back and patch it up later. Then, we push the current length of dynamic array, which just so happens to be the offset into the instruction stream of the next instruction.

All of the relocation magic happens in the case for the closing bracket.

1
2
3
4
5
6
7
8
9
10
case ']':
  {
    char opcodes [] = {
      0x41, 0x80, 0x3C, 0x24, 0x00, // cmpb $0, (%r12)
      // Needs to be patched up
      0x0F, 0x85, 0x00, 0x00, 0x00, 0x00 // jne <33b relative offset, 2's compliment, LE>
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  // ...

First, we push our comparison and jump instructions into the dynamic array. We should know the relative offset we need to jump back to at this point, and thus don’t need to push four empty bytes, but it makes the following math a little simpler, as were not done yet with this case.

1
2
3
4
  // ...
  stack_pop(&relocation_table, &relocation_site);
  relative_offset = instruction_stream.size - relocation_site;
  // ...

We pop the matching offset into the dynamic array (from the matching open bracket), and calculate the difference from the current size of the instruction stream to the matching offset to get our relative offset. What’s interesting is that this offset is equal in magnitude for the forward and backwards jumps that we now need to patch up. We simply go back in our instruction stream 4 B, and write that relative offset negated as a 32 b LE number (patching our backwards jump), then go back to the site of our forward jump minus 4 B and write that relative offset as a 32 b LE number (patching our forwards jump).

1
2
3
4
  // ...
  vector_write32LE(&instruction_stream, instruction_stream.size - 4, -relative_offset);
  vector_write32LE(&instruction_stream, relocation_site - 4, relative_offset);
  break;

Thus, when writing a JIT, one must worry about manual relocation. From the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z “A relative offset (rel8, rel16, or rel32) is generally specified as a label in assembly code, but at the machine code level, it is encoded as a signed, 8-bit or 32-bit immediate value, which is added to the instruction pointer.”

The last thing we push onto our instruction stream is clean up code in the epilogue.

1
2
3
4
5
6
7
8
9
10
char epilogue [] = {
  0x48, 0x81, 0xC4, 0x38, 0x75, 0x00, 0x00, // addq $30008, %rsp
  // restore callee saved registers
  0x41, 0x5E, // popq %r14
  0x41, 0x5D, // popq %r13
  0x41, 0x5C, // popq %r12
  0x5d, // pop rbp
  0xC3 // ret
};
vector_push(&instruction_stream, epilogue, sizeof(epilogue));

A dynamic array of bytes isn’t really useful, so we need to create executable memory the size of the current instruction stream and copy all of the machine opcodes into it, cast it to a function pointer, call it, and finally clean up:

1
2
3
4
5
6
7
void* mem = mmap(NULL, instruction_stream.size, PROT_WRITE | PROT_EXEC,
  MAP_ANON | MAP_PRIVATE, -1, 0);
memcpy(mem, instruction_stream.data, instruction_stream.size);
void (*jitted_func) (fn_memset, fn_putchar, fn_getchar) = mem;
jitted_func(memcpy, putchar, getchar);
munmap(mem, instruction_stream.size);
vector_destroy(&instruction_stream);

Note: we could have used the instruction stream rewinding technique to move the address of memset, putchar, and getchar as 64 b immediate values into %r12-%r14, which would have simplified our JIT’d function’s type signature.

Compile that, and we now have a function that will JIT compile and execute Brainfuck in roughly 141 SLOC. And, we can make changes to our Brainfuck program and not have to recompile it like we did with the Brainfuck compiler.

Hopefully it’s becoming apparent how similar an interpreter, compiler, and JIT behave. In the interpreter, we immediately execute some operation. In the compiler, we emit the equivalent text based assembly instructions corresponding to what the higher level language might get translated to in the interpreter. In the JIT, we emit the binary opcodes into executable memory and manually perform relocation, where the binary opcodes are equivalent to the text based assembly we might emit in the compiler. A production ready JIT would probably have macros for each operation in the JIT would perform, so the code would look more like the compiler rather than raw arrays of bytes (though the preprocessor would translate those macros into such). The entire process is basically disassembling C code with gobjdump -S -M suffix a.out, and punching in hex like one would a Gameshark.

Compare pointer incrementing from the three:

Interpreter:

1
case '>': ++ptr; break;

Compiler:

1
2
3
case '>':
  puts("  inc %r12");
  break;

JIT:

1
2
3
4
5
6
7
8
case '>':
  {
    char opcodes [] = {
      0x49, 0xFF, 0xC4 // inc %r12
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

Or compare the full sources of the the interpreter, the compiler, and the JIT. Each at ~100 lines of code should be fairly easy to digest.

Let’s now examine the performance of these three. One of the longer running Brainfuck programs I can find is one that prints the Mandelbrot set as ASCII art to stdout.

Running the UNIX command time on the interpreter, compiled result, and the JIT, we should expect numbers similar to:

1
2
3
4
5
6
7
8
$ time ./interpreter ../samples/mandelbrot.b
43.54s user 0.03s system 99% cpu 43.581 total

$ ./compiler ../samples/mandelbrot.b > temp.s; ../assemble.sh temp.s; time ./a.out
3.24s user 0.01s system 99% cpu 3.254 total

$ time ./jit ../samples/mandelbrot.b
3.27s user 0.01s system 99% cpu 3.282 total

The interpreter is an order of magnitude slower than the compiled result or run of the JIT. Then again, the interpreter isn’t able to jump back and forth as efficiently as the compiler or JIT, since it scans back and forth for matching brackets O(N), while the other two can jump to where they need to go in a few instructions O(1). A production interpreter would probably translate the higher level language to a byte code, and thus be able to calculate the offsets used for jumps directly, rather than scanning back and forth.

The interpreter bounces back and forth between looking up an operation, then doing something based on the operation, then lookup, etc.. The compiler and JIT preform the translation first, then the execution, not interleaving the two.

The compiled result is the fastest, as expected, since it doesn’t have the overhead the JIT does of having to read the input file or build up the instructions to execute at runtime. The compiler has read and translated the input file ahead of time.

What if we take into account the time it takes to compile the source code, and run it?

1
2
$ time (./compiler ../samples/mandelbrot.b > temp.s; ../assemble.sh temp.s; ./a.out)
3.27s user 0.08s system 99% cpu 3.353 total

Including the time it takes to compile the code then run it, the compiled results are now slightly slower than the JIT (though I bet the multiple processes we start up are suspect), but with the JIT we pay the price to compile each and every time we run our code. With the compiler, we pay that tax once. When compilation time is cheap, as is the case with our Brainfuck compiler & JIT, it makes sense to prefer the JIT; it allows us to quickly make changes to our code and re-run it. When compilation is expensive, we might only want to pay the compilation tax once, especially if we plan on running the program repeatedly.

JIT’s are neat but compared to compilers can be more complex to implement. They also repeatedly re-parse input files and re-build instruction streams at runtime. Where they can shine is bridging the gap for dynamically typed languages where the runtime itself is much more dynamic, and thus harder (if not, impossible) to optimize ahead of time. Being able to jump into JIT’d native code from an interpreter and back gives you the best of both (interpreted and compiled) worlds.