Nick Desaulniers

The enemy's gate is down

My SIGGRAPH 2015 Experience


I was lucky enough to attend my first SIGGRAPH conference this year. While I didn’t attend any talks, I did spend some time in the expo. Here is a collection of some of the neat things I saw at SIGGRAPH 2015. Sorry it’s not more organized; I didn’t intend to write a blog post until folks kept asking me “how was it?”

VR

Most booths had demos on VR headsets. Many were DK2s and GearVRs. AMD and NVIDIA had Crescent Bays (the next gen Oculus VR headset). It was noticeably lighter than the DK2, and I thought it rendered at better quality. It had nicer cable bundling and built-in headphones that could fold up and lock out of the way, which made it easy to put on and take off. I also tried a Sony Morpheus. They had a very engaging demo that was a tie-in to “The Walk”, the upcoming movie about tightrope walking. They had a thin PVC pipe taped to the floor that you had to balance on, and a fan, and you were tightrope walking between the Twin Towers. Looking down and trying to balance was terrifying. There were some demos with a strange mobile VR setup where folks wore a backpack with an open laptop hanging off the back and could walk around. Toyota and Ford had demos where you could inspect their vehicles in virtual space. I did not see a single HTC/Valve Vive at SIGGRAPH.

AR

Epson had some AR glasses. They were very glasses-friendly, unlike most VR headsets. The nose piece was flexible, and if you flattened it out, the headset could rest on top of your glasses and work well. The headset had some very thick compound lenses. There was a front-facing camera, and they had a simple demo using image recognition of simple logos (like QR codes) to help provide position data. There were other demos with orientation tracking that worked well. They didn’t have positional sensor info, but had a hack that tried to estimate positional velocity from the angular momentum (I spoke with the programmer who implemented it). https://moverio.epson.biz/

Holograms

There was a demo of holograms using tilted pieces of plastic arranged in a box. Also, there was a multi-projector (200+) array that projected a scene onto a special screen. When walking around the screen, the viewing angle always seemed correct. It was very convincing, except for the jarring restart of the animation loop, which could have been smoothed out (think looping/seamless GIFs).

VR/3D video

Google Cardboard had a booth showing off 3D videos from YouTube. I had a hard time telling whether the videos were stereoscopic or monoscopic, since the demo videos only had things in the distance, making it hard to tell if parallax was implemented correctly. A bunch of booths were showing off 3D video, but as far as I could tell, all of the correctly rendered stereoscopic shots were computer rendered. I could not find a single instance of footage shot from a stereoscopic rig, though I tried.

Booths/Expo

NVIDIA and Intel had the largest booths, followed by Pixar’s RenderMan. It felt like a GDC event: smaller, but definitely larger than GDC Next. There was more focus on shiny photorealism demos and artistic tools, less on game engines themselves.

Vulkan/OpenGL ES 3.2

Intel had demos of Vulkan and OpenGL ES 3.2. For 3.2, they were showing off tessellation shaders, I think. For Vulkan, they had a cool demo showing that with a particle scene rendered with OpenGL 4, a single CPU was pegged, power draw was high, and the framerate was pretty abysmal. When rendering the same scene with Vulkan, they were able to distribute the workload more evenly across CPUs and achieve a higher framerate while using less power. The Vulkan API is still not published, so no source code is available. It was explained to me that Vulkan is still not thread safe; instead, you get the freedom to implement synchronization yourself, rather than having the driver do it.

Planetarium

There was a neat demo of a planetarium projector being repurposed to display an “on rails” demo of a virtual scene. You didn’t get parallax since it was being projected on a hemisphere, but it was neat in that, like IMAX, your entire FOV was encompassed; you could move your head, not see any pixels, and not experience any motion sickness or disorientation.

X3D/X3DOM

I spoke with some folks at the X3D booth about X3DOM. To me, it seems like a bunch of previous attempts added too much complexity in an effort to support every use case under the sun, rather than just accepting limitations, so much so that writing hello world became difficult. Some of the folks I spoke to at the booth echoed this sentiment, but also noted the lack of authoring tools as something that hurt adoption. I have some neat things I’m working on in this space, based on this and other prior work, that I plan on showing off at the upcoming BrazilJS.

Maker Faire

There was a cool maker faire; some things I’ll have to order for family members (young hackers in training) were Canny bots, eBee, and Piper.

Experimental tech

There were a bunch of neat input devices; one I liked used directional sound as tactile feedback. One demo was rearranging icons on a home screen. Rather than touching the screen, there was a field of tiny speakers that would blast your finger with sound when it entered, to simulate the feeling of vibration. It would vibrate to let you know you had “grabbed” an icon, which you could then drag.

Book Signing

This was the first time I got to see my book printed in physical form! It looked gorgeous, hardcover printed in color. I met about half of the fellow authors who were also at SIGGRAPH, and our editor. I even got to meet Eric Haines, who reviewed my chapter before publication!

Additional C/C++ Tooling


21st Century C by Ben Klemens was a great read. It had a section with an intro to autotools, git, and gdb. There are a few other useful tools that came to mind that I’ve used when working with C and C++ codebases. These tools are a great way to start contributing to open source C & C++ codebases: run them on the code, or add them to the codebase’s tooling. A lot of these favor command line, open source utilities. See how many you are familiar with!

Build Tools

CMake

The first tool I’d like to take a look at is CMake. CMake is yet another build tool; I realize how contentious it is to even discuss one of the many. From my experience working with Emscripten, we recommend the use of CMake for people writing portable C/C++ programs. CMake is able to emit Makefiles for Unix-like systems, project files for Xcode on OS X, and project files for Visual Studio on Windows. There are also a few other “generators” that you can use.
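
For instance, here’s roughly what targeting different generators looks like from the command line (a hypothetical out-of-tree build; the generator names are CMake’s own):

mkdir build && cd build
cmake ..                              # default generator, e.g. Unix Makefiles
cmake -G Xcode ..                     # Xcode project files on OS X
cmake -G "Visual Studio 14 2015" ..   # Visual Studio solution on Windows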

I’ve been really impressed with CMake’s modules: one for finding dependencies, and another for fetching and building external dependencies. I think C++ badly needs a package manager, and I think CMake would be a solid foundation for one.

The syntax isn’t the greatest, but when I wanted to try to build one of my C++ projects on Windows which I know nothing about developing on, I was able to install CMake and Visual Studio and get my project building. If you can build your code on one platform, it will usually build on the others.

If you’re not worried about writing cross platform C/C++, maybe CMake is not worth the effort, but I find it useful. I wrestle with the syntax sometimes, but documentation is not bad and it’s something you deal with early on in the development of a project and hopefully never have to touch again (how I wish that were true).

Code Formatters

ClangFormat

Another contentious point of concern amongst developers is code style. Big companies with lots of C++ code have documents explaining their stylistic choices. Don’t waste another hour of your life arguing about something that really doesn’t matter. ClangFormat will help you codify your style and format your code to match it. Simply write the code however you want, and run the formatter on it before committing.

It can also emit a .clang-format file that you can commit; clang-format will automatically look for that file and use the rules codified there.
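
For example, you might seed a .clang-format file from one of the built-in styles and then format files in place (both flags are standard clang-format options; the source paths are placeholders):

clang-format -style=llvm -dump-config > .clang-format
clang-format -i -style=file src/*.cpp   # rewrite files using the checked-in style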

Linters

Flint / Flint++

Flint is a C++ linter in use at Facebook. Since it moved from being implemented in C++ to D, I’ve had issues building it. I’ve had better luck with Flint++, a fork that’s pure C++ without any of the third party dependencies Flint originally had. While not quite full-on static analyzers, both can be used for finding potential issues in your code ahead of time. Linters can look at individual files in isolation; you don’t have to wait for long recompiles like you would with a static analyzer.

Static Analyzers

Scan-build

Scan-build is a static analyzer for C and C++ code. You build your code “through” it, then use the sibling tool scan-view to see the results. Scan-view will emit and open an HTML file that lists the errors it detected. It inserts hyperlinks into the resulting document that step you through how certain conditions could lead to, for example, a null pointer dereference. You can also save and share those HTML files with others in the project. Static analyzers help you catch bugs at compile time, before you run the code.
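
Usage looks roughly like this (scan-build prints the exact report directory when it finds issues):

scan-build make                            # build the project "through" the analyzer
scan-view <report directory printed above> # browse the generated HTML report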

Runtime Sanitizers

ASan and UBSan

Clang’s Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan) are enabled by compiler flags and can be used to detect errors at runtime. ASan and UBSan are two of the more popular tools, but there are actually a ton, and more are being implemented. See the list here. These sanitizers catch bugs at runtime, so you’ll have to run the code to notice any violations, at variable runtime performance costs per sanitizer. ASan and TSan (ThreadSanitizer) made it into GCC 4.8, and UBSan is in GCC 4.9.
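
A minimal sketch of enabling them with Clang (the -fsanitize flags are the real ones; main.c is a placeholder):

clang -g -fsanitize=address main.c && ./a.out    # ASan: catches heap/stack buffer errors, use-after-free
clang -g -fsanitize=undefined main.c && ./a.out  # UBSan: catches UB like signed overflow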

Header Analysis

Include What You Use

Include What You Use (IWYU) helps you find unused or unnecessary #include preprocessor directives. It should be obvious how this can help improve compile times. IWYU can also help cut down on recompiles by recommending forward declarations under certain conditions. I look forward to the C++ module proposal being adopted, but until then this tool can help you spot cruft that can be removed.
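
IWYU is invoked much like the compiler it wraps; roughly (the flags and file name here are illustrative):

include-what-you-use -std=c++11 main.cpp   # prints suggested #include additions/removals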

Rapid Recompiles

ccache

ccache greatly improves recompile times by caching the results of parts of the compilation process. I use it when building Firefox, and it saves a great deal of time.
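
Typical usage is simply prefixing your compiler invocation (ccache -s is ccache’s statistics flag):

ccache clang -c main.c   # compile through the cache
ccache -s                # show cache hit/miss statistics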

distcc

distcc is a distributed build system. Some folks at Mozilla speed up their Firefox builds with it.
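
A sketch of how that might look (the host names are hypothetical; DISTCC_HOSTS is distcc’s configuration variable):

export DISTCC_HOSTS="localhost buildbox1 buildbox2"
make -j8 CC="distcc gcc"   # farm compile jobs out to the listed hosts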

Memory Leak Detectors

Valgrind

Valgrind has a suite of tools, my favorite being Memcheck for finding memory leaks. Unfortunately, it doesn’t seem to work on OS X since 10.10. This page referring to ASan seems to indicate that ASan can do everything Valgrind’s Memcheck can at less of a runtime performance cost, but I’m not sure how true this is exactly.
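
Running Memcheck is just (the flag is Memcheck’s standard one):

valgrind --leak-check=full ./a.out   # reports leaked blocks with allocation stacks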

leaks

As a much more primitive tool for finding leaks from the command line, BSDs have leaks.

MallocStackLogging=1 ./a.out
leaks a.out
...

Profilers

Perf

Perf, and Brendan Gregg’s tools for emitting SVG flamegraphs from its output, are helpful for finding where time is spent in a program. In fact, there are numerous performance analysis tools that are Linux specific. My recommendation is to spend some time on Brendan Gregg’s blog.
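
A sketch of the flamegraph workflow (the two Perl scripts come from Brendan Gregg’s FlameGraph repository and are run from a checkout of it):

perf record -g ./a.out   # sample with call graphs
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg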

DTrace

OS X doesn’t have the same tooling as Linux, but DTrace was ported to it. I’ve used it to get sampling profiles of my code before. Again, Brendan Gregg’s blog is a good resource; there are some fantastic DTrace one-liners.
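
For instance, a classic one-liner counting the syscalls a program makes (a.out is a placeholder; DTrace needs root):

sudo dtrace -n 'syscall:::entry /execname == "a.out"/ { @[probefunc] = count(); }'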

Debuggers

lldb

lldb is analogous to gdb. I can’t say I have enough experience with LLDB and GDB to note the differences between the two, but LLDB did show the statements before and after the current statement by default. I’m informed by my friends/mortal enemies using Emacs that this is less of an issue when using Emacs and GDB in combination.
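
A minimal session might look like this (lldb also accepts gdb-style abbreviations for most of these):

lldb ./a.out
(lldb) breakpoint set --name main
(lldb) run
(lldb) next
(lldb) bt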

Fuzzers

American Fuzzy Lop

American Fuzzy Lop (AFL) is a neat program that performs fuzzing on programs that take inputs from files: it repeatedly runs the program, mutating the input trying to get full code coverage, and tries to find crashes. It’s been getting lots of attention lately, and while I haven’t used it myself yet, it seems like a very powerful tool. Mozilla employs fuzzers on their JavaScript engine, for instance (not AFL, but one developed in house).
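
A sketch of driving it (the @@ marker is where afl-fuzz substitutes the mutated input file; names are illustrative):

afl-gcc -o target target.c                     # instrumented build
afl-fuzz -i testcases -o findings ./target @@  # fuzz with seed inputs from testcases/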

Disassemblers

gobjdump

If you really need to make sure the higher level code you’re writing is getting translated into the assembly you’re expecting, gobjdump -S will intermix the emitted binary’s disassembly with the source code. I used this extensively while developing my Brainfuck JIT.
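
For the source interleaving to work, compile with debug info first; something like:

gcc -g -O1 main.c     # -g emits the line info gobjdump -S needs
gobjdump -S a.out     # disassembly intermixed with source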

Conclusion

Hopefully you learned of some useful tools that you should know about when working with C or C++. What did I miss?

Interpreter, Compiler, JIT


Interpreters and compilers are interesting programs, themselves used to run or translate other programs, respectively. The programs that get interpreted tend to be written in languages like JavaScript, Ruby, Python, PHP, and Perl, while the programs that get compiled are written in C, C++, and, to some extent, Java and C#.

Taking the time to translate to native machine code ahead of time can result in better performance at runtime, but an interpreter can get to work right away without spending any time translating. There happens to be a sweet spot somewhere in between interpretation and compilation that combines the best of both worlds. Such a technique is called Just In Time (JIT) compiling. While interpreting, compiling, and JIT’ing code might sound radically different, they’re actually strikingly similar. In this post, I hope to show how similar they are by comparing the code for an interpreter, a compiler, and a JIT compiler for the language Brainfuck, in around 100 lines of C code each.

All of the code in the post is up on GitHub.

Brainfuck is an interesting, if hard to read, language. It only has eight operations it can perform (> < + - . , [ ]), yet is Turing complete. There’s nothing really to lex; each character is a token, and if the token is not one of the eight operators, it’s ignored. There’s also not much of a grammar to parse; the forward and backward jumping operators should be matched for well formed input, but that’s about it. In this post, we’ll skip validating input, assuming well formed input, so we can focus on the interpretation/code generation. You can read more about the language on the Wikipedia page, which we’ll be using as a reference throughout.

A Brainfuck program operates on a 30,000 element byte array initialized to all zeros. It starts off with a data pointer that initially points to the first element in the data array, or “tape.” In C code for an interpreter, that might look like:

// Initialize the tape with 30,000 zeroes.
unsigned char tape [30000] = { 0 };

// Set the pointer to point at the left most cell of the tape.
unsigned char* ptr = tape;

Then, since we’re performing an operation for each character in the Brainfuck source, we can have a for loop over every character, with a nested switch statement containing case statements for each operator, sketched below.
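
A minimal sketch of that driver loop (assuming the program text has been read into a NUL terminated buffer called input; i is the same index variable the bracket handling further down mutates):

// Walk the Brainfuck source one character (token) at a time.
for (unsigned int i = 0; input[i] != '\0'; ++i) {
  switch (input[i]) {
    // ...one case per operator, shown below; everything else is ignored.
  }
}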

The first two operators, > and <, increment and decrement the data pointer.

case '>': ++ptr; break;
case '<': --ptr; break;

One thing that could be bad: because the interpreter is written in C and we’re representing the tape as an array, but not validating our inputs, there’s potential for a stack buffer overrun since we’re not performing bounds checks. Again, we’re punting and assuming well formed input to keep the code and the point more precise.

Next up are the + and - operators, used for incrementing and decrementing the cell pointed to by the data pointer by one.

case '+': ++(*ptr); break;
case '-': --(*ptr); break;

The operators . and , provide Brainfuck’s only means of input or output, by writing the value pointed to by the data pointer to stdout as an ASCII value, or reading one byte from stdin as an ASCII value and writing it to the cell pointed to by the data pointer.

case '.': putchar(*ptr); break;
case ',': *ptr = getchar(); break;

Finally, our looping constructs, [ and ]. From the definition on Wikipedia, for [: “if the byte at the data pointer is zero, then instead of moving the instruction pointer forward to the next command, jump it forward to the command after the matching ] command” and for ]: “if the byte at the data pointer is nonzero, then instead of moving the instruction pointer forward to the next command, jump it back to the command after the matching [ command.”

I interpret that as:

case '[':
  if (!(*ptr)) {
    int loop = 1;
    while (loop > 0) {
      unsigned char current_char = input[++i];
      if (current_char == ']') {
        --loop;
      } else if (current_char == '[') {
        ++loop;
      }
    }
  }
  break;
case ']':
  if (*ptr) {
    int loop = 1;
    while (loop > 0) {
      unsigned char current_char = input[--i];
      if (current_char == '[') {
        --loop;
      } else if (current_char == ']') {
        ++loop;
      }
    }
  }
  break;

Where the variable loop keeps track of open brackets for which we’ve not seen a matching close bracket, aka our nesting depth.

So we can see the interpreter is quite basic; in around 50 SLOC we’re able to read a byte and immediately perform an action based on the operator. How we perform that operation might not be the fastest, though.

How about if we want to compile the Brainfuck source code to native machine code? Well, we need to know a little bit about our host machine’s Instruction Set Architecture (ISA) and Application Binary Interface (ABI). The rest of the code in this post will not be as portable as the above C code, since it assumes an x86-64 ISA and UNIX ABI. Now would be a good time to take a detour and learn more about writing assembly for x86-64. The interpreter is even portable enough to build with Emscripten and run in a browser!

For our compiler, we’ll iterate over every character in the source file again, switching on the recognized operator. This time, instead of performing an action right away, we’ll print assembly instructions to stdout. Doing so requires running the compiler on an input file, redirecting stdout to a file, then running the system assembler and linker on that file. We’ll stick with just compiling, not assembling (though it’s not too difficult) or linking (for now).
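
That workflow looks roughly like this from a shell (file names are placeholders; the cc driver both assembles and links in libc for us):

./compiler hello.b > hello.s   # compile Brainfuck to assembly
cc hello.s -o hello            # assemble and link
./hello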

First, we need to print a prologue for our compiled code:

const char* const prologue =
  ".text\n"
  ".globl _main\n"
  "_main:\n"
  "  pushq %rbp\n"
  "  movq %rsp, %rbp\n"
  "  pushq %r12\n"        // store callee saved register
  "  subq $30008, %rsp\n" // allocate 30,008 B on stack, and realign
  "  leaq (%rsp), %rdi\n" // address of beginning of tape
  "  movl $0, %esi\n"     // fill with 0's
  "  movq $30000, %rdx\n" // length 30,000 B
  "  call _memset\n"      // memset
  "  movq %rsp, %r12";
puts(prologue);

During the linking phase, we’ll make sure to link in libc so we can call memset. What we’re doing is backing up the callee saved registers we’ll be using, stack allocating the tape, realigning the stack (x86-64 ABI point #1), copying the address of the only item on the stack into a register for our first argument, setting the second argument to the constant 0 and the third to 30000, then calling memset. Finally, we use the callee saved register %r12 as our data pointer, which holds an address into the tape on the stack.

We can expect the call to memset to result in a segfault if we simply subtract just 30000 B and don’t realign for the two registers (64 b, or 8 B, each) we pushed on the stack. The stack is misaligned upon function entry in x86-64; the first pushed register aligns it on a 16 B boundary, and the second misaligns it again. That’s why we allocate an additional 8 B on the stack (x86-64 ABI point #1): 30000 is a multiple of 16, so subtracting 30008 realigns.

Moving the data pointer (>, <) and modifying the pointed to value (+, -) are straightforward:

case '>':
  puts("  inc %r12");
  break;
case '<':
  puts("  dec %r12");
  break;
case '+':
  puts("  incb (%r12)");
  break;
case '-':
  puts("  decb (%r12)");
  break;

For output, ., we need to copy the pointed to byte into the register for the first argument to putchar. We explicitly zero out the upper bits of the register before calling putchar, since it takes an int (32 b) but we’re only copying a char (8 b) (look up C’s type promotion rules for more info). x86-64 has an instruction that does both, movzXX, where the first X is the source size (b, w) and the second is the destination size (w, l, q). Thus movzbl moves a byte (8 b) into a double word (32 b). %rdi and %edi are the same register, but %rdi is the full 64 b register, while %edi is the lowest (least significant) 32 b.

case '.':
  // move byte to double word and zero upper bits since putchar takes an
  // int.
  puts("  movzbl (%r12), %edi");
  puts("  call _putchar");
  break;

Input (,) is easy: call getchar, then move the resulting lowest byte into the cell pointed to by the data pointer. %al is the lowest 8 b of the 64 b %rax register.

case ',':
  puts("  call _getchar");
  puts("  movb %al, (%r12)");
  break;

As usual, the looping constructs ([ & ]) are much more work. We have to match up jumps to matching labels, but for an assembly program, labels must be unique. One way we can solve this: whenever we encounter an opening bracket, push onto a stack-like data structure a monotonically increasing number representing the number of opening brackets we’ve seen so far. Then, we do our comparison and jump to the label that will be produced by the matching close bracket. Next, we insert our starting label, and finally increment the number of brackets seen.

case '[':
  stack_push(&stack, num_brackets);
  puts("  cmpb $0, (%r12)");
  printf("  je bracket_%d_end\n", num_brackets);
  printf("bracket_%d_start:\n", num_brackets++);
  break;

For close brackets, we pop the number of the matching open bracket (that is, the most recent open bracket for which we have yet to see a matching close bracket) off of the stack, do our comparison, jump to the matching start label, and finally place our end label.

case ']':
  stack_pop(&stack, &matching_bracket);
  puts("  cmpb $0, (%r12)");
  printf("  jne bracket_%d_start\n", matching_bracket);
  printf("bracket_%d_end:\n", matching_bracket);
  break;

So for sequential loops ([][]) we can expect the relevant assembly to look like:

  cmpb $0, (%r12)
  je bracket_0_end
bracket_0_start:

  cmpb $0, (%r12)
  jne bracket_0_start
bracket_0_end:

  cmpb $0, (%r12)
  je bracket_1_end
bracket_1_start:

  cmpb $0, (%r12)
  jne bracket_1_start
bracket_1_end:

and for nested loops ([[]]), we can expect assembly like the following (note the difference in the order of numbered start and end labels):

  cmpb $0, (%r12)
  je bracket_0_end
bracket_0_start:

  cmpb $0, (%r12)
  je bracket_1_end
bracket_1_start:

  cmpb $0, (%r12)
  jne bracket_1_start
bracket_1_end:

  cmpb $0, (%r12)
  jne bracket_0_start
bracket_0_end:

Finally, we need an epilogue to clean up the stack and callee saved registers after ourselves.

const char* const epilogue =
  "  addq $30008, %rsp\n" // clean up tape from stack.
  "  popq %r12\n" // restore callee saved register
  "  popq %rbp\n"
  "  ret\n";
puts(epilogue);

The compiler is a pain when modifying and running a Brainfuck program; it takes a couple of extra commands to compile the Brainfuck program to assembly, assemble the assembly into an object file, link it into an executable, and run it, whereas with the interpreter we can just run it. The trade-off is that the compiled version is quite a bit faster. How much faster? Let’s save that for later.

Wouldn’t it be nice if there were a translation & execution technique that didn’t force us to compile our code every time we changed it and wanted to run it, but also gave performance closer to that of compiled code? That’s where a JIT compiler comes in!

For the basics of JITing code, make sure you read my previous article on the subject in C. We’re going to follow the same technique of creating executable memory, copying bytes into that memory, casting it to a function pointer, and calling it. Just like the interpreter and the compiler, we’re going to do a unique action for each recognized token. What’s different is that for each operator, we’re going to push opcodes into a dynamic array; that way it can grow based on our sequential reading of input, and it simplifies our calculation of relative offsets for branching operations.

The other special thing we’re going to do is pass the addresses of our libc functions (memset, putchar, and getchar) into our JIT’ed function at runtime. This avoids those kooky stub functions you might see in a disassembled executable. That means we’ll be invoking our JIT’ed function like:

typedef void* fn_memset (void*, int, size_t);
typedef int fn_putchar (int);
typedef int fn_getchar ();
void (*jitted_func) (fn_memset, fn_putchar, fn_getchar) = mem;
jitted_func(memset, putchar, getchar);

Where mem is our mmap’ed executable memory with our opcodes copied into it, and the typedefs are the respective function signatures of the function pointers we’ll be passing to our JIT’ed code. We’re kind of getting ahead of ourselves, but knowing how we will invoke the dynamically created executable code gives us an idea of how the code itself will work.

The prologue is quite involved, so we’ll take it a step at a time. First, we have the usual prologue:

char prologue [] = {
  0x55, // pushq %rbp
  0x48, 0x89, 0xE5, // movq %rsp, %rbp

Then we want to back up our callee saved registers that we’ll be using. Expect horrific and difficult to debug bugs if you forget to do this.

  0x41, 0x54, // pushq %r12
  0x41, 0x55, // pushq %r13
  0x41, 0x56, // pushq %r14

At this point, %rdi will contain the address of memset, %rsi the address of putchar, and %rdx the address of getchar (see x86-64 ABI point #2). We want to store these in callee saved registers before calling any of them, since a call may clobber %rdi, %rsi, and %rdx; they’re not “callee saved,” rather “call clobbered.” See x86-64 ABI point #4.

  0x49, 0x89, 0xFC, // movq %rdi, %r12
  0x49, 0x89, 0xF5, // movq %rsi, %r13
  0x49, 0x89, 0xD6, // movq %rdx, %r14

At this point, %r12 will contain the address of memset, %r13 will contain the address of putchar, and %r14 will contain the address of getchar.

Next up is allocating 30008 B on the stack:

  0x48, 0x81, 0xEC, 0x38, 0x75, 0x00, 0x00, // subq $30008, %rsp

This is our first hint at how numbers whose value is larger than the maximum representable value in a byte are represented on x86-64. Where in this instruction is the value 30008? The answer is the 4 byte sequence 0x38, 0x75, 0x00, 0x00. The x86-64 architecture is “Little Endian,” which means that the least significant byte (LSB) comes first and the most significant byte (MSB) comes last. When humans do math, they typically represent numbers the other way, or “Big Endian.” Thus we write decimal ten as “10” and not “01.” So 0x38, 0x75, 0x00, 0x00 in Little Endian is 0x00, 0x00, 0x75, 0x38 in Big Endian, which is 7*16^3 + 5*16^2 + 3*16^1 + 8*16^0 = 30008 in decimal, the number of bytes we want to subtract from the stack pointer. As in the compiler, we’re allocating an additional 8 B on the stack for alignment requirements; by pushing an even number of 64 b registers, we need to realign our stack pointer.

Next in the prologue, we set up and call memset:

  // address of beginning of tape
  0x48, 0x8D, 0x3C, 0x24, // leaq (%rsp), %rdi
  // fill with 0's
  0xBE, 0x00, 0x00, 0x00, 0x00, // movl $0, %esi
  // length 30,000 B
  0x48, 0xC7, 0xC2, 0x30, 0x75, 0x00, 0x00, // movq $30000, %rdx
  // memset
  0x41, 0xFF, 0xD4, // callq *%r12

After invoking memset, %rdi, %rsi, & %rdx may contain garbage values, since they are “call clobbered” registers. At this point we no longer need memset, so we can now use %r12 as our data pointer. %rsp will point to the top (technically the bottom) of the stack, which is the beginning of our memset’ed tape. That’s the end of our prologue.

  0x49, 0x89, 0xE4 // movq %rsp, %r12
};

We can then push our prologue into our dynamic array implementation:

vector_push(&instruction_stream, prologue, sizeof(prologue))

Now we iterate over our Brainfuck program and switch on the operations again. For pointer increment and decrement, we just nudge %r12.

case '>':
  {
    char opcodes [] = {
      0x49, 0xFF, 0xC4 // inc %r12
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;
case '<':
  {
    char opcodes [] = {
      0x49, 0xFF, 0xCC // dec %r12
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

That extra fun block in the switch statement is there because in C/C++ we can’t declare variables immediately after a case label without introducing a new scope.

Pointer deref then increment/decrement are equally uninspiring:

case '+':
  {
    char opcodes [] = {
      0x41, 0xFE, 0x04, 0x24 // incb (%r12)
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;
case '-':
  {
    char opcodes [] = {
      0x41, 0xFE, 0x0C, 0x24 // decb (%r12)
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

I/O might be interesting, but in x86-64 we have an opcode for calling the function a register points to. %r13 contains the address of putchar, while %r14 contains the address of getchar.

case '.':
  {
    char opcodes [] = {
      0x41, 0x0F, 0xB6, 0x3C, 0x24, // movzbl (%r12), %edi
      0x41, 0xFF, 0xD5 // callq *%r13
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;
case ',':
  {
    char opcodes [] = {
      0x41, 0xFF, 0xD6, // callq *%r14
      0x41, 0x88, 0x04, 0x24 // movb %al, (%r12)
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

Now, with our looping constructs, we get to the fun part. With the compiler, we deferred the concept of “relocation” to the assembler. We simply emitted labels that the assembler turned into relative offsets (jumps by values relative to the last byte of the jump instruction). We’ve found ourselves in a catch-22, though: how many bytes forward do we jump to the matching close bracket that we haven’t seen yet?

Normally, an assembler has a data structure known as a “relocation table.” It keeps track of the first byte after a label and jumps, rewriting jumps-to-labels (which aren’t kept around in the resulting binary executable) to relative jumps. SpiderMonkey, Firefox’s JavaScript virtual machine, has two classes for this: MacroAssembler and Label. SpiderMonkey embeds a linked list in the opcodes it generates for jumps for which it has yet to see a label. Once it finds the label, it walks the linked list (which itself is embedded in the emitted instruction stream), patching up these locations as it goes.

For Brainfuck, we don’t have to do anything quite as fancy, since each label only ends up having one jump site. Instead, we can use a stack of integers that are offsets into our dynamic array, and do the relocation once we know exactly where we’re jumping to.

case '[':
  {
    char opcodes [] = {
      0x41, 0x80, 0x3C, 0x24, 0x00, // cmpb $0, (%r12)
      // Needs to be patched up
      0x0F, 0x84, 0x00, 0x00, 0x00, 0x00 // je <32b relative offset, 2's complement, LE>
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  stack_push(&relocation_table, instruction_stream.size); // create a label after
  break;

First, we push the compare and jump opcodes, but for now we leave the relative offset blank (four zero bytes); we’ll come back and patch it up later. Then we push the current length of the dynamic array, which just so happens to be the offset into the instruction stream of the next instruction.

All of the relocation magic happens in the case for the closing bracket.

case ']':
  {
    char opcodes [] = {
      0x41, 0x80, 0x3C, 0x24, 0x00, // cmpb $0, (%r12)
      // Needs to be patched up
      0x0F, 0x85, 0x00, 0x00, 0x00, 0x00 // jne <32b relative offset, 2's complement, LE>
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  // ...

First, we push our comparison and jump instructions into the dynamic array. We do know the relative offset we need to jump back to at this point, and thus don’t strictly need to push four empty bytes, but doing so makes the following math a little simpler, as we’re not done yet with this case.

  // ...
  stack_pop(&relocation_table, &relocation_site);
  relative_offset = instruction_stream.size - relocation_site;
  // ...

We pop the offset of the matching open bracket off our stack, and calculate the difference between the current size of the instruction stream and that offset to get our relative offset. What’s interesting is that this offset is equal in magnitude for the forward and backward jumps we now need to patch up. We simply go back 4 B in our instruction stream and write that relative offset negated as a 32 b LE number (patching our backwards jump), then go to the site of our forward jump minus 4 B and write that relative offset as a 32 b LE number (patching our forwards jump).

  // ...
  vector_write32LE(&instruction_stream, instruction_stream.size - 4, -relative_offset);
  vector_write32LE(&instruction_stream, relocation_site - 4, relative_offset);
  break;

Thus, when writing a JIT, one must worry about manual relocation. From the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z: “A relative offset (rel8, rel16, or rel32) is generally specified as a label in assembly code, but at the machine code level, it is encoded as a signed, 8-bit or 32-bit immediate value, which is added to the instruction pointer.”

The last thing we push onto our instruction stream is the cleanup code in the epilogue.

char epilogue [] = {
  0x48, 0x81, 0xC4, 0x38, 0x75, 0x00, 0x00, // addq $30008, %rsp
  // restore callee saved registers
  0x41, 0x5E, // popq %r14
  0x41, 0x5D, // popq %r13
  0x41, 0x5C, // popq %r12
  0x5D, // popq %rbp
  0xC3 // ret
};
vector_push(&instruction_stream, epilogue, sizeof(epilogue));

A dynamic array of bytes isn’t of much use by itself, so we need to create executable memory the size of the current instruction stream, copy all of the machine opcodes into it, cast it to a function pointer, call it, and finally clean up:

void* mem = mmap(NULL, instruction_stream.size, PROT_WRITE | PROT_EXEC,
  MAP_ANON | MAP_PRIVATE, -1, 0);
memcpy(mem, instruction_stream.data, instruction_stream.size);
void (*jitted_func) (fn_memset, fn_putchar, fn_getchar) = mem;
jitted_func(memset, putchar, getchar);
munmap(mem, instruction_stream.size);
vector_destroy(&instruction_stream);

Note: we could have used the instruction stream rewinding technique to move the address of memset, putchar, and getchar as 64 b immediate values into %r12-%r14, which would have simplified our JIT’d function’s type signature.

Compile that, and we now have a function that will JIT compile and execute Brainfuck in roughly 141 SLOC. And, we can make changes to our Brainfuck program and not have to recompile it like we did with the Brainfuck compiler.

Hopefully it’s becoming apparent how similarly an interpreter, compiler, and JIT behave. In the interpreter, we immediately execute some operation. In the compiler, we emit the equivalent text based assembly instructions corresponding to what the higher level language might get translated to in the interpreter. In the JIT, we emit the binary opcodes into executable memory and manually perform relocation, where the binary opcodes are equivalent to the text based assembly we might emit in the compiler. A production ready JIT would probably have macros for each operation the JIT would perform, so the code would look more like the compiler rather than raw arrays of bytes (though the preprocessor would translate those macros into such). The entire process is basically disassembling C code with gobjdump -S -M suffix a.out, and punching in hex like one would with a GameShark.

Compare pointer incrementing from the three:

Interpreter:

case '>': ++ptr; break;

Compiler:

case '>':
  puts("  inc %r12");
  break;

JIT:

case '>':
  {
    char opcodes [] = {
      0x49, 0xFF, 0xC4 // inc %r12
    };
    vector_push(&instruction_stream, opcodes, sizeof(opcodes));
  }
  break;

Or compare the full sources of the interpreter, the compiler, and the JIT. Each, at ~100 lines of code, should be fairly easy to digest.

Let’s now examine the performance of these three. One of the longer running Brainfuck programs I can find is one that prints the Mandelbrot set as ASCII art to stdout.

Running the UNIX command time on the interpreter, compiled result, and the JIT, we should expect numbers similar to:

$ time ./interpreter ../samples/mandelbrot.b
43.54s user 0.03s system 99% cpu 43.581 total

$ ./compiler ../samples/mandelbrot.b > temp.s; ../assemble.sh temp.s; time ./a.out
3.24s user 0.01s system 99% cpu 3.254 total

$ time ./jit ../samples/mandelbrot.b
3.27s user 0.01s system 99% cpu 3.282 total

The interpreter is an order of magnitude slower than the compiled result or the JIT. Then again, the interpreter isn’t able to jump back and forth as efficiently as the compiler or JIT, since it scans back and forth for matching brackets (O(N)), while the other two can jump to where they need to go in a few instructions (O(1)). A production interpreter would probably translate the higher level language to a byte code, and thus be able to calculate the offsets used for jumps directly, rather than scanning back and forth.

The interpreter bounces back and forth between looking up an operation and then doing something based on it, then lookup, etc. The compiler and JIT perform the translation first, then the execution, without interleaving the two.

The compiled result is the fastest, as expected, since it doesn’t have the overhead the JIT does of having to read the input file or build up the instructions to execute at runtime. The compiler has read and translated the input file ahead of time.

What if we take into account the time it takes to compile the source code, and run it?

$ time (./compiler ../samples/mandelbrot.b > temp.s; ../assemble.sh temp.s; ./a.out)
3.27s user 0.08s system 99% cpu 3.353 total

Including the time it takes to compile the code and then run it, the compiled result is now slightly slower than the JIT (though I bet the multiple processes we start up are suspect), but with the JIT we pay the price to compile each and every time we run our code. With the compiler, we pay that tax once. When compilation time is cheap, as is the case with our Brainfuck compiler & JIT, it makes sense to prefer the JIT; it allows us to quickly make changes to our code and re-run it. When compilation is expensive, we might only want to pay the compilation tax once, especially if we plan on running the program repeatedly.

JITs are neat, but compared to compilers they can be more complex to implement. They also repeatedly re-parse input files and re-build instruction streams at runtime. Where they can shine is in bridging the gap for dynamically typed languages, where the runtime itself is much more dynamic and thus harder (if not impossible) to optimize ahead of time. Being able to jump from an interpreter into JIT’d native code and back gives you the best of both (interpreted and compiled) worlds.

Hidden in Plain Sight - Public Key Crypto


How is it possible for us to communicate securely when there’s the possibility of a third party eavesdropping on us? How can we communicate private secrets through public channels? How do such techniques enable us to bank online and carry out other sensitive transactions on the Internet while trusting numerous relays? In this post, I hope to explain public key cryptography, with actual code examples, so that the concepts are a little more concrete.

First, please check out this excellent video on public key crypto:

Hopefully that explains the gist of the technique, but what might it actually look like in code? Let’s take a look at example code in JavaScript using the Node.js crypto module. We’ll later compare it with the upcoming WebCrypto API and look at a TLS handshake.

Meet Alice. Meet Bob. Meet Eve. Alice would like to send Bob a secret message. Alice would not like Eve to view the message. Assume Eve can intercept, but not tamper with, everything Alice and Bob try to share with each other.

Alice chooses a modular exponential key group, such as modp14, then creates a public and private key pair.

var crypto = require("crypto");

var group = "modp14";
var aliceDH = crypto.getDiffieHellman(group);
aliceDH.generateKeys();

A modular exponential key group is simply a “sufficiently large” prime number paired with a generator (a specific number), such as those defined in RFC2412 and RFC3526.

The public key is meant to be shared; it is OK for Eve to know the public key. The private key must never be shared, not even with the person you’re communicating with.

Alice then shares her public key and group with Bob.

Public Key:
 <Buffer 96 33 c5 9e b9 07 3e f2 ec 56 6d f4 1a b4 f8 4c 77 e6 5f a0 93 cf 32 d3 22 42 c8 b4 7b 2b 1f a9 55 86 05 a4 60 17 ae f9 ee bf b3 c9 05 a9 31 31 94 0f ... >
Group:
 modp14

Bob now creates a public and private key pair with the same group as Alice.

var bobDH = crypto.getDiffieHellman(group);
bobDH.generateKeys();

Bob shares his public key with Alice.

Public key:
 <Buffer ee d7 e2 00 e5 82 11 eb 67 ab 50 20 30 81 b1 74 7a 51 0d 7e 2a de b7 df db cf ac 57 de a4 f0 bd bc b5 7e ea df b0 3b c3 3a e2 fa 0e ed 22 90 31 01 67 ... >

Alice and Bob now compute a shared secret.

var aliceSecret = aliceDH.computeSecret(bobDH.getPublicKey(), null, "hex");
var bobSecret = bobDH.computeSecret(aliceDH.getPublicKey(), null, "hex");

Alice and Bob have now derived a shared secret from each others’ public keys.

aliceSecret === bobSecret; // => true

Meanwhile, Eve has intercepted Alice and Bob’s public keys and group. Eve tries to compute the same secret.

var eveDH = crypto.getDiffieHellman(group);
eveDH.generateKeys();
var eveSecret = eveDH.computeSecret(aliceDH.getPublicKey(), null, "hex");

eveSecret === aliceSecret; // => false

This is because Alice’s secret is derived from Alice and Bob’s private keys, which Eve does not have. Eve may not realize her secret is not the same as Alice and Bob’s until later.

That was asymmetric crypto, where each party uses different keys. The shared secret may now be used in symmetric encryption, where both parties use the same key.

Alice creates a symmetric block cypher using her favorite algorithm, a hash of their secret as a key, and random bytes as an initialization vector.

var cypher = "aes-256-ctr";
var hash = "sha256";
var aliceIV = crypto.randomBytes(16); // AES's 16 B block size
var aliceHashedSecret = crypto.createHash(hash).update(aliceSecret).digest("binary");
var aliceCypher = crypto.createCipheriv(cypher, aliceHashedSecret, aliceIV);

Alice then uses her cypher to encrypt her message to Bob.

var cypherText = aliceCypher.update("...");

Alice then sends the cypher text, cypher, hash, and IV to Bob.

cypherText:
 <Buffer bd 29 96 83 fa a8 7d 9c ea 90 ab>
cypher:
 aes-256-ctr
hash:
 sha256
iv:
 <Buffer ... >

Bob now constructs a symmetric block cypher using the algorithm and IV from Alice, and a hash of their shared secret.

var bobHashedSecret = crypto.createHash(hash).update(bobSecret).digest("binary");
var bobCypher = crypto.createDecipheriv(cypher, bobHashedSecret, aliceIV);

Bob now decyphers the encrypted message (cypher text) from Alice.

var plainText = bobCypher.update(cypherText);
console.log(plainText.toString()); // => "I love you"

Eve has intercepted the cypher text, cypher, hash, and IV, and tries to decrypt the message.

var eveHashedSecret = crypto.createHash(hash).update(eveSecret).digest("binary");
var eveCypher = crypto.createDecipheriv(cypher, eveHashedSecret, aliceIV);
console.log(eveCypher.update(cypherText).toString());

// => ��_r](�i)

Here’s where Eve realizes her secret is not correct.

This prevents passive eavesdropping, but not active man-in-the-middle (MITM) attacks. For example, how does Alice know that the messages she was supposedly receiving from Bob actually came from Bob, not Eve posing as Bob?

Today, we use a system of certificates to provide authentication. This system certainly has its flaws, but it is what we use today. This is a more advanced topic that won’t be covered here. Trust is a funny thing.

What’s interesting to note is that the primes and generators used to generate Diffie-Hellman public and private keys have strings that represent the corresponding modular exponential key groups, i.e. “modp14”. WebCrypto’s API gives you finer grained control, letting you specify the generator and large prime number in a Typed Array. I’m not sure why this is; maybe it gives you finer grained control, or lets you support newer groups before the implementation does? To me, it seems like a source for errors to be made; hopefully someone will make a library to provide these prime/generator pairs.

One issue with my approach is that I assumed that Alice and Bob both had support for the same hashing algorithms, modular exponential key group, and symmetric block cypher. In the real world, this is not always the case. Instead, it is much more common for the client to broadcast publicly all of the algorithms it supports, and for the server to pick one. This list of algorithms is called a “suite,” i.e. a “cypher suite.” I learned this the hard way recently, trying to upgrade the cypher suite on my ssh server and finding out that my client did not support the latest cyphers. In this case, Alice and Bob might not have the same versions of Node.js, which statically link their own versions of OpenSSL. Thus, one should use crypto.getCiphers() and crypto.getHashes() before assuming the party they’re communicating with can do the math to decrypt. We’ll see “cypher suites” come up again in TLS handshakes. The NSA publishes a list of endorsed cryptographic components, for what it’s worth. There are also neat tricks we can do to prevent the message from being decrypted at a later time, should the private key be compromised and the encrypted message recorded, called Perfect Forward Secrecy.

Let’s take a look now at how a browser does a TLS handshake. Here’s a capture from Wireshark of me navigating to https://google.com. First we have a TLSv1.2 Client Hello to start the handshake. Here we can see a list of the cypher suites.

Next is the response from the server, a TLSv1.2 Server Hello. Here you can see the server has picked a cypher to use.

The server then sends its certificate, which contains a copy of its public key.

Now that we’ve agreed on a cypher suite, the client now sends its public key. The server sets up a session, that way it may abbreviate the handshake in the future. Finally, the client may now start making requests to the server with encrypted application data.

For more information on TLS handshakes, you should read the TLS Handshake chapter of Ilya Grigorik’s High Performance Browser Networking book, Mozilla OpSec’s fantastic wiki, and this excellent Stack Exchange post. As you might imagine, all of these back and forth trips made during the TLS handshake add latency overhead when compared to unencrypted HTTP requests.

I hope this post helped you understand how we can use cryptography to exchange secret information through public channels. This isn’t enough information to implement a perfectly secure system; end to end security means one single mistake can compromise the entire system. Peer review and open source, battle tested implementations go a long way.

A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.

Kerckhoffs’s principle

I wanted to write this post because I believe abstinence-only crypto education isn’t working, and I can’t stand when anyone acts like part of a cabal, speaking from their ivory tower down to those trying to learn new things. Someone will surely cite JavaScript Cryptography Considered Harmful, which, while valid, misses my point of simply trying to show people more concrete basics with code examples. The first crypto system you implement will have its holes, but you can’t go from ignorance of crypto to perfect knowledge without implementing a few imperfect systems. Don’t be afraid to; just don’t start by trying to protect high value data. Crypto is dangerous, because it can be difficult to impossible to tell when your system fails. Assembly is also akin to juggling knives, but at least you’ll usually segfault if you mess up, and program execution will halt.

With upcoming APIs like Service Workers requiring TLS, protocols like HTTP2, pushes for all web traffic to be encrypted, and shitty things governments, politicians, and ISPs do, web developers are going to have to start boning up on their crypto knowledge.

What are your recommendations for correctly learning crypto? Leave me some thoughts in the comments below.

Writing My First Technical Book Chapter


It’s a feeling of immense satisfaction when we complete a major achievement. Being able to say “it’s done” is such a great stress relief. Recently, I completed work on my first publication, a chapter about Emscripten for the upcoming book WebGL Insights to be published by CRC Press in time for SIGGRAPH 2015.

One of the life goals I’ve had for a while is writing a book. A romantic idea it seems to have your ideas transcribed to a medium that will outlast your bones. It’s enamoring to hold books from long dead authors, and see that their ideas are still valid and powerful. Being able to write a book, in my eyes, provides some form of life after death. Though, one could imagine ancestors reading blog posts from long dead relatives via utilities like the Internet Archive’s WayBack Machine.

Writing about highly technical content places an upper limit on the usefulness of the content, and it shows its age quickly. A book I recently ordered was Scott Meyers’ Effective Modern C++. This title strikes me, because what exactly do we consider modern or contemporary? Those adjectives only make sense in a time limited context. When C++ undergoes another revolution, Scott’s book may become irrelevant, at which point the adjective modern becomes incorrect. Not that I think Scott’s book or my own is time-limited in usefulness; it’s more that technical books’ duration of usefulness is significantly less than that of philosophical works like 1984 or Brave New World. It’s almost like holding a record in a sport: a feather in one’s cap, until the next best thing comes along and you’re forgotten to time.

Somewhat short of my goal of writing an entire book, I wrote a single chapter for one. It’s interesting to see that a lot of graphics programming books follow the format of one author per chapter, or at least multiple authors; book series such as GPU Gems, ShaderX, and GPU Pro follow this pattern. After seeing how much work goes into one chapter, I think I’m content with not writing an entire book, though I may revisit that decision later in life.

How did this all get started? I followed Graham Sellers on Twitter and saw a tweet from him about a call for authors for WebGL Insights. Explicitly listed on the linked page, under the call for authors, was interest in proposals about Emscripten and asm.js.

At the time, I was headlong into a project helping Disney port Where’s My Water from C++ to JavaScript using Emscripten. I was intimately familiar with Emscripten, having been trained by one of its most prolific contributors, Jukka Jylänki. Also, Emscripten’s creator, Alon Zakai, sat on the other side of the office from me, so I was constantly pestering him about how to do different things with Emscripten. The #emscripten irc channel on irc.mozilla.org is very active, but there’s no substitute for being able to have a second pair of eyes look over your shoulder when something is going wrong.

Knowing Emscripten’s strengths and limitations, seeing interest in the subject I knew a bit about (but wouldn’t consider myself an expert in), and having the goal of writing something to be published in book form, this was my opportunity to seize.

I wrote up a quick proposal with a few figures about why Emscripten was important and how it worked, and sent it off with fingers crossed. Initially, I was overjoyed to learn when my proposal was accepted, but then there was a slow realization that I had a lot of work to do. The editor, Patrick Cozzi, set up a GitHub repo for our additional code and figures, a mailing list, and sent us a chapter template document detailing the process. We had 6 weeks to write the rough draft, then 6 weeks to work with reviewers to get the chapter done. The chapter was written as a Google Doc, so that we could have explicit control over who we shared the document with, and what kinds of editing power they had over the document. I think this approach worked well.

I had most of the content written by week 2. This was surprising to me, because I’m a heavy procrastinator. The only issue was that I had written double the allowed number of pages; way over page count. I was worried about the amount of content, but told myself to try not to be attached to it, just as you shouldn’t stay attached to your code.

I took the additional 4 weeks I had left to finish the rough draft to invite some of my friends and coworkers to provide feedback. It’s useful to have a short list of people who have ever offered to help in this regard or owe you one. You’ll also want a diverse team of reviewers that are either close to the subject matter, or approaching it as new information. This allows you to stay technically correct, while not presuming your readers know everything that you do.

The strategy worked out well; some of the content I had initially written about how JavaScript VMs and JITs speculate types was straight up wrong. While it played nicely into the narrative I was weaving, someone more well versed in JavaScript virtual machines would have been able to call BS on my work. The reviewers who weren’t as close to the subject matter were able to point out when logical progressions did not follow.

Fear of being publicly corrected prevents a lot of people from blogging or contributing to open source. It’s important to not stay attached to your work, especially when you need to make cuts. When push came to shove, I did have difficulty removing sections.

Let’s say you have three sequential sections: A, B, & C. If sections A and B both set up section C, and someone tells you section B has to go, it can be difficult to cut section B, because as the author you may think it’s really important to include B as the lead-in to C. My recommendation: sum up the most important idea from section B and add it to the end of section A.

For the last six weeks, the editor, some invited third parties, and other authors reviewed my chapter. It was great that others even followed along and pointed out when I was making assumptions based on a specific compiler or browser. Eric Haines even reviewed my chapter! That was definitely a highlight for me.

We used a Google Sheet to keep track of the state of reviews. Reviewers were able to comment on sections of the chapter. What was nice was that you were able to keep using a comment as a thread, responding directly to a criticism. What didn’t work so well was that once you edited that line, the comment, and thus the thread, was lost.

Once everything was done, we zipped up the assets to be used as figures, submitted bios, and wrote a tips and tricks section. Now, it’s just a long waiting game until the book is published.

As far as dealing with the publisher, I didn’t have much interaction. Since the book was assembled by a dedicated editor, Patrick did most of the leg work. I only asked that what royalties I would receive be donated to Mozilla, which the publisher said would be too small (est. $250) to be worth the paperwork. I would advise against writing a technical book if your sole motivation is monetary benefit. I’m excited to be receiving a hardcover copy of the book when it’s published. I’ll also have to see if I can find my way to SIGGRAPH this year; I’d love to meet my fellow authors and potential readers in person. Just seeing the list of authors was really a who’s-who of folks doing cool WebGL stuff.

If you're interested in learning more about working with Emscripten, asm.js, and WebGL, I suggest you pick up a copy of WebGL Insights in August when it's published. A big thank you to my reviewers: Eric Haines, Havi Hoffman, Jukka Jylänki, Chris Mills, Traian Stanev, Luke Wagner, and Alon Zakai.

So that was a little bit about my first experience with authorship. I'd be happy to answer any further questions you might have. Ask in the comments below, on Twitter, HN, or wherever, and I'll probably find it!

Let’s Write Some X86-64

| Comments

…“‘Our speech interposes itself between apprehension and truth like a dusty pane or warped mirror. The tongue of Eden was like a flawless glass; a light of total understanding streamed through it. Thus Babel was a second Fall.’ And Isaac the Blind, an early Kabbalist, said that, to quote Gershom Scholem’s translation, ‘The speech of men is connected with divine speech and all language whether heavenly or human derives from one source: the Divine Name.’ The practical kabbalists, the sorcerers, bore the title Ba’al Shem, meaning ‘master of the divine name.’”

“The machine language of the world,” Hiro says.

“Is this another analogy?”

“Computers speak machine language,” Hiro says. “It’s written in ones and zeroes - binary code. At the lowest level, all computers are programmed with strings of ones and zeroes. When you program in machine language, you are controlling the computer at its brainstem, the root of its existence. It’s the tongue of Eden. But it’s very difficult to work in machine language because you go crazy after a while, working at such a minute level. So a whole Babel of computer languages has been created for programmers: FORTRAN, BASIC, COBOL, LISP, Pascal, C, PROLOG, FORTH. You talk to the computer in one of these languages, and a piece of software called a compiler converts it into machine language. But you never can tell exactly what the compiler is doing. It doesn’t always come out the way you want. Like a dusty pane or warped mirror. A really advanced hacker comes to understand the true inner workings of the machine – he sees through the language he’s working in and glimpses the secret functioning of the binary code – becomes a Ba’al Shem of sorts.”

Hiro Protagonist and The Librarian, in Snow Crash by Neal Stephenson

This is a beautiful quote, one that I think truly captures the relationship between higher level languages and the Instruction Set Architecture (ISA)'s machine code, though it comes from the angle of controlling the machine, whose implementation-specific quirks can detract from what you're actually trying to do.

This blog is meant for those who don't know x86-64 assembly, but maybe know a little C, and are curious about code generation. It's also for those who have tried to hand write x86-64 assembly and got stuck trying to understand the tooling, or the seemingly random segfaults from what appear to be valid instructions.

I really enjoy writing code in CoffeeScript and C, so I have a quick anecdote about CoffeeScript, though you don't need to know the language. When writing CoffeeScript, I find myself frequently using a vim plugin to view the emitted JavaScript, so I know when CoffeeScript emits less than optimal JavaScript. For example, in this code:

nick = -> console.log x for x in [0..100]

I know that CoffeeScript is going to push the results of the call to console.log into an array and return that, because of the implicit return of the final expression in a function body, which in this case happens to be a for loop (the array comprehension itself being treated as an expression). The emitted JavaScript looks like:

var nick;

nick = function() {
  var x, _i, _results;
  _results = [];
  for (x = _i = 0; _i <= 100; x = ++_i) {
    _results.push(console.log(x));
  }
  return _results;
};

By putting a seemingly meaningless undefined statement as the final statement in the function body, we can significantly reduce what the function is doing and decrease the number of allocations:

nick = ->
  console.log x for x in [0..100]
  undefined

emits:

nick = function() {
  var x, _i;
  for (x = _i = 0; _i <= 100; x = ++_i) {
    console.log(x);
  }
  return void 0;
};

That return void 0 may seem odd. Functions in JavaScript without an explicit return value return undefined, but since the undefined identifier can be shadowed or reassigned, the expression void 0 is used instead: it always evaluates to the value undefined.
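To see why void 0 is the safer spelling, here's a quick sketch of my own (not from the CoffeeScript compiler) showing undefined being shadowed in a function scope:

// `undefined` is not a reserved word, just an identifier, so in sloppy mode
// it can be shadowed by a local variable:
(function () {
  var undefined = 42;
  console.log(undefined); // 42
  console.log(void 0);    // undefined -- `void expr` always yields the real value
})();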

You can see that by making the CoffeeScript function body slightly longer, adding a seemingly meaningless lone undefined statement at the end, the emitted JavaScript no longer allocates an array or wastes time pushing the results of console.log (which would all be undefined) into it a hundred times. This reminds me of how seemingly meaningless noop instructions can keep a processor's pipeline full by preventing stalls, though a pipeline stall doesn't change the correctness of a program, so it's an imperfect analogy.

Now, I'm not saying that you should be thinking about these kinds of optimizations when programming at such a high level, as they might be premature. I shared this example because, while writing C code and reading Michael Abrash's Graphics Programming Black Book, I wondered to myself whether hardcore C programmers would also know the equivalent assembly instructions that would be emitted from their higher level C code (before optimizing compilers even existed).

In college, I was taught the 68k and MIPS ISAs. To understand x86-64, we need to be able to write and run it, and unfortunately, I did not have the training to know how to do so. My 68k code was run on an MCU from a Freescale IDE in Windows, so the process might as well have been indistinguishable from magic to me. I understood that you start with low level source, in (somewhat) human readable instructions that get converted to binary representing opcodes. The assembler translates the assembly into non-executable object files containing binary code with placeholders for sections of code defined in other object files. The linker then replaces the placeholders with the combined binary code's relative positions and produces an executable. But how do I do this from my x86-64 machine itself? The go-to book I've been recommended many times is Professional Assembly Language by Richard Blum, but this book only covers x86, not x86-64. There have been some very big changes to the ABI between x86 and x86-64. You may be familiar with Application Programming Interfaces (APIs), but what is an Application Binary Interface? I think of an ABI as how two pieces of native code interact with one another, such as the calling convention (how arguments are passed to functions at the ISA level).

I'm very lucky to have the privilege to work with a compiler engineer, Dan Gohman, who has worked on a variety of compilers. I was citing a particular email of Dan's for some time before he came to work with us, when I would talk about how the naming of LLVM conjures the imagery of a virtual machine, though it's really a compiler intermediate representation. Dan is an amazing and patient resource who has helped me learn more about the subtleties of the x86-64 ABI. Throughout this blog, I'll copy some of Dan's responses to questions I've had.

Our first goal is to write an x86-64 program that does nothing, but that we can build. Assembly files typically have the .s file extension, so let's fire up our text editor and get started. I'll be doing my coding from OSX 10.8.5, but most examples will work from Linux. All of my symbol names, like _main, _exit, and _printf, are prefixed with underscores, as Darwin requires. Most Linux systems don't require this, so Linux users should omit the leading underscores from all such names. Unfortunately, I cannot figure out how to link with ld in Linux, so I recommend trying to understand what gcc -v your_obj_file.o is doing, and this might help. Let me know in the comments if there's an easy way to use ld when linking your object files from Linux and I'll be happy to post an edit.

Let’s start with this fragment and get it building, then I’ll cover what it’s doing.

.text
.globl _main
_main:
  subq $8, %rsp
  movq $0, %rdi
  call _exit

Let’s use OSX’s built in assembler (as) and linker (ld).

as nothing.s -o nothing.o
ld -lc -macosx_version_min 10.8.5 nothing.o -o nothing

We should now be able to run ./nothing without any segfaults. Without -macosx_version_min 10.8.5 I get a warning and my executable segfaults.

Now let’s create a basic generic Makefile to help us automate these steps. Watch your step; archaic syntax ahead.

SOURCES = $(wildcard *.s)
OBJECTS = $(SOURCES:.s=.o)
EXECUTABLES = $(OBJECTS:.o=)

# Generic rule
# $< is the first dependency name
# $@ is the target filename
%.o: %.s
  as $< -o $@

default: $(OBJECTS)
  for exe in $(EXECUTABLES) ; do \
    ld -lc -macosx_version_min 10.8.5 $$exe.o -o $$exe ; \
  done

.PHONY: clean
clean:
  rm *.o
  for exe in $(EXECUTABLES) ; do \
    rm $$exe ; \
  done

By the time you read this, I’ve already forgotten how half of that code works. But, this will allow us to run make to assemble and link all of our .s files individually, and make clean to remove our object and executable files. Surely you can whip up a better build script? Let me know in the comments below.

So now let’s go back to our assembly file and go over it line by line. Again, it looks like:

.text
.globl _main
_main:
  subq $8, %rsp
  movq $0, %rdi
  call _exit

.text is the text section. This section defines the instructions that the processor will execute. There can be other sections as well; we'll see more later. The "data" section typically holds static variables that have been initialized with non-null (and non-zero) values, whereas "bss" holds static but uninitialized values. There are also a heap and a stack, although you don't declare them the way you would text or data.
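As a rough sketch of how those sections get used (GNU as flavored directives here; Darwin's assembler spells a few of these differently, so treat this as illustrative):

.data
_initialized:
  .quad 42          # initialized static: stored in the binary itself

.bss
_uninitialized:
  .skip 4096        # uninitialized static: only its size is recorded

.text
# instructions go here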

Next up is the .globl directive. It marks the _main symbol as global, making it visible to the linker so that other code may call into it. You may be wondering why directives and section names both begin with a dot.

'.' isn't a valid identifier character in C, so way back when, it became common to use '.' as a prefix in the assembler and linker in order to avoid clashing with C symbol names. This unfortunately was used for both section names and directives, because both can appear in contexts where a simple parser wouldn't be able to disambiguate otherwise.

I don’t know why they used the same convention for directives and section names, but they did, and it’s not practical to change it now.

Dan Gohman

Ok, now the subtraction instruction. We've got a bit to go over with just this one line. First is the instruction itself. The sub instruction has numerous suffixes that specify how many bytes to operate on. The typical convention is a suffix of b for 1 byte (8 bits), w for a word (2 bytes, 16 bits), l for a long or double word (4 bytes, 32 bits), and q for a quad word (8 bytes, 64 bits). If you leave off the suffix, the assembler will try to guess based on the operands, which can lead to obscure bugs. So subb operates on 8 bits, subw on 16 bits, subl on 32 bits, and subq on 64 bits. What's important to understand is that the instruction suffix is dictated by the size of the inputs and destination. See Figure 3-3 of the AMD64 ABI.
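For illustration, here's the same subtraction at each width, using the appropriately sized names of the "a" register (my own example, not from the ABI doc):

subb $8, %al     # 8 bits
subw $8, %ax     # 16 bits
subl $8, %eax    # 32 bits
subq $8, %rax    # 64 bits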

Ok, now let's look at the full instruction subq $8, %rsp. This order of operands is known as the AT&T syntax, where the destination is specified last (as opposed to the Intel syntax, where the destination immediately follows the instruction name, e.g. sub rsp, 8).

I’m biased towards AT&T-syntax because GCC, LLVM, and icc (at least on Unix-like platforms) all use it, so it’s what I’m used to by necessity. People familiar with assembly languages on other platforms sometimes find it feels backwards from what they’re used to, but it is learnable.

Dan Gohman

I'm writing my examples in AT&T syntax simply because when I compile my C code from clang with the -S flag, or run my object files through gobjdump, I get AT&T syntax by default (though I'm sure there are flags to choose either syntax). Also, the ABI docs use AT&T. What are your thoughts on the two syntaxes? Let me know in the comments below.
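For a concrete comparison, here's the same pair of instructions in both syntaxes (the Intel spelling is from my recollection; check your assembler's docs):

# AT&T: source first, destination last; % for registers, $ for immediates
subq $8, %rsp
movq $0, %rdi

; Intel: destination first, no sigils, width usually inferred
sub rsp, 8
mov rdi, 0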

So when we say subq $8, %rsp, we're subtracting the immediate value 8 from the stack pointer (the register %rsp contains our stack pointer). But why are we doing this? This is something that gets left out of some of the basic hello world assembly programs I've seen. It's the first ABI point I want to make:

x86-64 ABI point 1: function calls need the stack pointer to be aligned by a multiple of 16 bytes.

By default, it is off by 8 on function entry. See Section 3.2.2, page 16, of the ABI.

Why is the stack pointer misaligned by 8 bytes on function entry? I'm going to punt on the answer to that for a bit, but I promise I'll come back to it. The most important thing is that the call instruction later on will fail unless we align our stack pointer, which started out misaligned. If we comment it out (a pound sign, #, comments out the rest of the line) and make our executable, we'll get a segfault. You could even add 8 bytes to the stack pointer and our basic example would work (we just need a multiple of 16, remember), but when we learn later (I promise) about how the stack works in x86-64, we'll see we can mess things up by adding rather than subtracting.

Next up, we're moving the immediate value 0x0 into %rdi. You may have heard that arguments to functions are pushed onto the stack in reverse order, but that's an old x86 convention. With the addition of 8 more general purpose registers, we now pass up to the first 6 integer arguments in registers (and push the rest, if any, onto the stack in reverse order). The convention (on OSX and Linux) is our second ABI point:

x86-64 ABI point 2: The calling conventions for function invocations require passing integer arguments in the following sequence of registers: %rdi, %rsi, %rdx, %rcx, %r8, %r9, then pushing the rest on the stack in reverse order.

See section 3.2.3 under "Passing". Warning: Microsoft has a different calling convention. This is quite troubling to me, because I had assumed that Instruction Set Architectures were created so that the same code could run on any two machines implementing the same ISA; but because the ISA does not define how arguments are passed, that choice is left up to the OS implementor. Thus the same code may not run on two different machines with the same ISA if their operating systems are incompatible at the ABI layer.
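To make the registers-then-stack ordering concrete, here's a sketch of a call with seven integer arguments (_seven_args is a hypothetical function, not defined in this post; note the extra padding to preserve the 16B alignment from point 1):

subq $8, %rsp       # pad so %rsp is a 16B multiple after the push below
movq $1, %rdi       # arguments 1 through 6 go in registers...
movq $2, %rsi
movq $3, %rdx
movq $4, %rcx
movq $5, %r8
movq $6, %r9
pushq $7            # ...the 7th goes on the stack
call _seven_args    # hypothetical
addq $16, %rsp      # caller pops the stack argument and the padding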

UPDATE: Further, I just learned that Go, and C code compiled by 6c don’t use the “normal” SysV ABI and calling convention, but have their own.

Our goal is to call exit(0);, where exit is defined in libc, which we link against during the linking phase with the flag -lc. This is another punt, on system calls. So to invoke exit with the first integer argument of 0, we first need to move the immediate value 0x0 into %rdi. Now if you run your executable from your shell, then echo $?, you should see that the previous command's exit code was 0. Try changing the exit code to 42 and verify that it works.
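In case you want to check your work, the only change is the immediate moved into %rdi (Darwin underscore convention, as above):

movq $42, %rdi   # exit(42);
call _exit

Rebuild, run it, and echo $? should now print 42.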

Ok, well a program that does nothing is even more boring than hello world. Now that we have our build setup out of the way, let's make a hello world program. If you're familiar with ASCII tables, we can use putchar from libc, since we're already linking against it. Use man putchar to look at its signature and an ASCII table to pick the immediate values to move into a certain register (remember the calling convention, point #2), and make sure you set up the stack pointer before any calls and exit after all other calls.

I'll leave that as an exercise for the reader. Let's use a string and printf.

.data
_hello:
  .asciz "hello world\n"

.text
.globl _main
_main:
  subq $8, %rsp

  movb $0, %al
  leaq _hello(%rip), %rdi
  call _printf

  movq $0, %rdi
  call _exit

First up is our data section, .data. I previously mentioned that the data section contains global non-null and non-zero variables. You can see that the string itself becomes part of the binary by running the Unix command strings with your executable as the first argument. Further, if you pass your executable to hexdump, you can even see the ASCII codes in hex:

0001020 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a 00 00 00 00

Also, we can run our binary through objdump as well and see the string:

gobjdump -j .data -s hello_world
...
Contents of section .data:
 2020 68656c6c 6f20776f 726c640a 00        hello world..

Ok, so now we're moving an immediate value of 0x0 into %al. %al is 1 byte wide, so we use the b suffix on the mov instruction. The next important point of the ABI has to do with functions that take a variable number of arguments (varargs), like printf:

x86-64 ABI point 3: Variadic functions need to have the number of vector arguments specified in %al.

Without this, printf debugging will be hard. Also in section 3.2.3, under "Passing".

If you don't know what vector arguments are, no worries! I'm not going to cover them. Just know that without this, the contents of %al may happen to work in a basic example where we haven't touched %al, %ax, %eax, or %rax yet, but we shouldn't bank on it being 0x0. In fact, we shouldn't bank on most registers being preserved across a function call. Now's a good time to talk about volatility:

x86-64 ABI point 4: Most registers are not preserved across function calls.

Only %rbx, %rsp, %rbp, and %r12-%r15 (and some others) are. These are called "call saved" or "non volatile" registers. The rest should be considered "call clobbered" or "volatile." That means every time we invoke a call like printf, we need to reset %al: it is the low 8 bits of %rax, the first return register, so it is always clobbered.
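A sketch of the practical consequence: to keep a value live across a call, put it in a call-saved register (or spill it to the stack). _some_func here is hypothetical:

movq $100, %rbx   # %rbx is call saved; any callee must restore it
call _some_func   # hypothetical; may clobber %rax, %rdi, %rsi, etc. at will
# %rbx is still 100 here; don't assume the same of the volatile registers

Of course, if we use a call-saved register like %rbx in our own function, the same rule obligates us to save and restore it for our caller.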

The next instruction loads the effective address of the string relative to the current instruction pointer into %rdi, the first argument for printf. The .asciz directive appends the null byte for us, since C strings are null terminated.

With this knowledge, can you modify hello world to print “hello world 42”, without putting 42 into the string in the data section? Hint: you’ll need a placeholder in your string and need to know the x86-64 calling convention to pass an additional argument to printf.

Finally, let's talk about the stack. When we create automatic variables in C, they are created in the segment called the stack. On x86-64, the stack starts at some arbitrary address (virtual memory backed by physical memory) and "grows" downwards. That is why we subtracted 8 bytes from the stack pointer, rather than adding 8 bytes, for alignment earlier. The stack-of-plates metaphor is kind of upside-down here: additional plates (variables) go underneath the current bottom plate, if you can imagine that. The stack grows towards the heap, and it is possible for them to collide if you don't ask the OS to expand your data segment (sbrk).
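You can watch the downward growth from C (a quick sketch of my own; the exact addresses are up to the OS and compiler, but on a typical x86-64 system the callee's local prints a lower address):

#include <stdio.h>

void callee(void) {
  int y;
  printf("callee's local: %p\n", (void *)&y); // typically lower
}

int main(void) {
  int x;
  printf("main's local:   %p\n", (void *)&x); // typically higher
  callee();
  return 0;
}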


Let's say we want to call something like memset, which, from man memset, takes an address, a value to fill, and a number of bytes to fill. Say we want the equivalent of this C code:

#include <stdio.h>
#include <string.h>
#include <inttypes.h>
int main () {
  int8_t array [16];
  memset(&array, 42, 16);
  int8_t* ptr = array;
  printf("Current byte: %" PRId8 "\n", *ptr);
  ++(*ptr);
  printf("Current byte: %" PRId8 "\n", *ptr);
  ++ptr;
  printf("Current byte: %" PRId8 "\n", *ptr);
}

Well, that might look like:

.data
_answer:
  .asciz "Current byte: %d\n"

.text
.globl _main
_main:
  subq $8, %rsp

  subq $16, %rsp # allocate 16B
  leaq (%rsp), %rdi # first arg, &array
  movq $42, %rsi # second arg, 42
  movq $16, %rdx # third arg, 16B
  call _memset

  leaq _answer(%rip), %rdi
  movq $0, %rsi
  movb (%rsp), %sil # these two are equivalent to movzbl (%rsp), %esi
  movb $0, %al
  call _printf

  incq (%rsp)

  leaq _answer(%rip), %rdi
  movq $0, %rsi
  movb (%rsp), %sil
  movb $0, %al
  call _printf

  leaq _answer(%rip), %rdi
  movq $0, %rsi
  movb 1(%rsp), %sil
  movb $0, %al
  call _printf

  addq $16, %rsp # clean up stack

  movq $0, %rdi
  call _exit

This isn’t a perfect example because I’m not allocating space for the ptr on the stack. Instead, I’m using the %rsp register to keep track of the address I’m working with.

What we're doing is allocating 16B on the stack. Remember, we need to keep %rsp aligned on 16B boundaries, so our allocation is a multiple of 16B. If we needed a non-16B multiple, we could allocate more than needed on the stack, then do some arithmetic later when accessing our automatic variables.

For memset, we pass the address of our array as the first argument. On x86-64 the stack grows downwards, but our variables "point" upwards, so %rsp and the 16B above it are the memory addresses of our array, with %rsp currently pointing at the front. The rest you should recognize by now as part of the calling convention.

In the next grouping of instructions, we want to verify that memset set every byte to 42 (0x2A). So we copy the first byte of our array, currently pointed to by %rsp, into the low 8 bits of %rsi, which are named %sil. It's important to zero out the full 64-bit contents of %rsi first, since it may have been clobbered by our previous call to memset.

Then we dereference and increment the value pointed to by our array pointer, ++(*ptr) or ++array[0]. Now array[0] is 43, not 42.

In the next grouping of instructions, we print the second byte of our array, array[1], and get the 42 from memset. Now, we could try to increment the stack pointer itself by one, but then the call to printf would fail, so instead, when we load the value of array[1], we do some pointer arithmetic: movb 1(%rsp), %sil. This is relative addressing, which you've already seen when loading the strings. You might wonder why I'm not loading the byte in the other "direction," say movb -1(%rsp), %sil. Well, that goes back to my point that while the stack pointer moves down as we allocate automatic variables, their addresses and the memory they take up "point up."

Finally, we clean up our automatic variable's allocated space on the stack. Note that we do not zero out that memory. A subsequent function call might overwrite that data on the stack, but until it does, or unless we explicitly zero it out, a buffer overrun could accidentally read that data, a la Heartbleed.

Now, I did promise I would talk about why the stack pointer is misaligned by 8 bytes on function entry. That is because unoptimized functions typically have a function prolog and epilog. Besides creating room on the stack for automatic variables at the beginning of a function, we typically want to save the frame (AKA base) pointer, %rbp, on the stack. Since %rbp is 64b or 8B, and the push instruction decrements the stack pointer by 8B, this push realigns the misaligned stack to a 16B multiple. So in function bodies, you'll typically see:

my_func:
  push %rbp
  movq %rsp, %rbp
  # your code here...
  popq %rbp
  ret

This great article explains why: you may want to profile your running application, or look at the call stack in a debugger, and having a linked list of stack frames, the function that invoked yours, its caller, and so on, rooted in a dedicated register makes it trivial to recover the call stack at any given point at runtime. Since we always push %rbp immediately, thereby saving it on the stack, and put our stack pointer (%rsp) in the base pointer (%rbp) (later restoring it; %rbp is call saved), we can keep popping %rbp and moving our stack pointer to that value to see that quux was called by bar, which was called by foo (foo(bar(quux()));). Now, you saw that I was able to write code that clearly worked without the three additional instructions in the prolog and epilog, and indeed that's what happens with optimized code emitted from your compiler. And since GDB can use DWARF debug info in your objects anyways, removing the prolog and epilog isn't a huge issue.
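As a toy version of what such a profiler or debugger does, here's a sketch in C of my own that follows the saved %rbp chain a few hops. It assumes frame pointers aren't omitted, so compile with -O0 or -fno-omit-frame-pointer; __builtin_frame_address is a GCC/Clang builtin:

#include <stdio.h>

void leaf(void) {
  // Each frame begins with [saved %rbp][return address]; chasing the
  // saved %rbp values walks the chain of callers.
  void **fp = __builtin_frame_address(0);
  for (int i = 0; i < 3 && fp; ++i) {
    printf("frame %d: return address %p\n", i, fp[1]);
    fp = (void **)fp[0]; // hop to the caller's frame
  }
}

void middle(void) { leaf(); }

int main(void) {
  middle(); // prints return addresses for leaf, middle, and main
  return 0;
}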

So, I think I've shown you enough to get started hand writing assembly. To learn more, you should write the higher level C code for what you're trying to do, then study the emitted assembly by compiling with the -S flag. With clang, you'll probably see a bunch of stack check guards for each frame, but those just guard against stack buffer overflows. Try compiling simple conditionals (jumps forward), then simple loops (jumps backward), without optimizations. Jumping to sections and calling your own functions should be pretty easy to figure out. Hint: don't duplicate section names, but if you do, the assembler will catch it and warn you pretty explicitly.

Don't let people discourage you from learning assembly because "compilers will always beat you" or "just use LLVM or libjit or whatever for codegen." Existing solutions aren't perfect in every scenario. Someday you might be tasked with doing codegen because LLVM is not optimal under certain constraints. You'll never know if you can beat them unless you try; those most comfortable are those most vulnerable. I'm afraid that if enough people are turned away from learning the lower levels of programming because the higher level is unquestionably better, then assembly will ultimately be forgotten and the computer will become a black box again. This is something that troubles me, and that I see occurring around me frequently; a lot of devs new to web development conflate jQuery with JavaScript, and Three.js with WebGL. If you're around the Bay Area, I'll be giving a talk at HTML5DevConf on May 22 demystifying raw WebGL. You should come check it out.

In summary, remember:

  • The stack pointer needs to be aligned by 16B multiples when calling another function.
  • Calling convention dictates passing arguments in %rdi, %rsi, %rdx, %rcx, %r8, %r9, then stack.
  • %al needs the number of vector arguments for variadic functions.
  • Know which registers are call saved (%rbx, %rsp, %rbp, and %r12-%r15 (and some others)) and call clobbered.

Closing thoughts by Dan:

To me, machine code doesn’t feel analogous to this concept of the divine name. I certainly wouldn’t describe it as “like a flawless glass; a light of total understanding streamed through it”. Even if you strip away all the accidental complexity, machine code is still permeated by mundane implementation-oriented concerns. Also, the abstractions which we use to organize and “understand” the program are gone; information is lost.

My imagination of a divine language looks a lot more like an idealized and cleaned-up LISP. It’d be capable of representing all known means of abstraction and combination (in the SICP sense), it might be thought of as a kind of super-language of which all of our real-world languages are just messy subsets.

Dan Gohman

Write a Test Case

| Comments

Your application just broke, oh no! It couldn’t have been your code, right?

I've always had trouble spotting mistakes in my own work, whether in spelling, grammar, math, or programming. With spelling or grammar, office applications quickly pick up on my mistakes and underline them for me, but most of my mistakes come from my own hubris. I'm confident in what I do, and that gets me in trouble when I make little mistakes: surely I solved this problem correctly and didn't make any small mistakes. As a kid competing in timed math competitions, I quickly learned that reviewing your work costs time, so it was important to recognize where common mistakes would crop up on certain problems and spend slightly more time there the first time through, rather than double checking the work in its entirety, unless I had time to spare. Measure twice, cut once.

The same is true for programming. Writing test cases is time consuming, and if time is money, then it could be said that writing tests is costly. There's probably a logical fallacy in there. In general, we hope that the time-cost of writing test cases will be recuperated by the time-cost of bugs caught, though it's not easy to measure the time-cost of averted bugs. I'm all for test cases. I think that SQLite having 100% branch coverage is incredible, truly a lofty achievement. People get bogged down in arguments over type systems, but testing is even more of a must for languages without a compiler to catch certain classes of bugs.

Ok, so going back to your code: something is wrong. But we're confident it couldn't be our code. Maybe it's one of those open source modules I'm using that isn't tested enough, or that pesky browser implementation, or this damned OS. It couldn't be my code.

Prove it.

I see bug reports all the time where people say your X breaks my Y, but when asked to provide Y, they can't, for whatever software licensing reason. The bug resolver (the person who is able to fix said bug in X) doesn't know at this point whether the bug reporter (the developer of Y) did their homework; the bug is unconfirmed. Both reporter and resolver are suspicious of each other's code. I don't make silly mistakes, right?

So it's up to the resolver to work out a test case. Being able to reproduce said issue is the first step in resolving a bug. Just as scientific claims aren't taken as fact until they're reproducible, this bug is mere conjecture at this point. It kills me when a bug resolver closes an issue as "works for me." Nothing boils my blood more; I took the time to try to help you and wasn't rewarded on some level, so I feel that I wasted my time. But from where you stand, the resolver is not correct, so again I say…

Prove it.

I think it's super hard to get meaningful feedback from users. It's hard for them to express why they're frustrated, and we as software developers don't always point them in the right direction. Since users have wide ranges of technical capability, sometimes we more technically oriented people have to do a little hand holding. For instance, from Firefox, how many hops does it take to get from the application's menus to its open bug tracker, Bugzilla? Where can I report an issue? How do I report an issue? Do I annoy the hell out of the user until they rate my app? Users might be complaining on Twitter, but that's sure as hell not where developers are looking for bug reports. Enabling the user to provide feedback could be viewed as a double edged sword. Can I make meaningful conclusions from the torrent of feedback I'm now getting? Did I structure my form in such a way as to ask the right questions? Should we make our issue tracker public? Should we make security issues that could harm users, if more widely understood, public? Public to fellow employees, even? What did you expect to happen, and is that what actually happened? This rabbit hole goes deep.

The other day, I was writing a patch for a large application that was developed by someone else and that I still don't fully comprehend. My patch should have worked; why wasn't it working? Probably something in this person's code, right? Sure enough, I wrote a simple test case to eliminate everything but the control variables, and it turns out I was wrong.

It's hard to get users to do more than the bare minimum to report an issue, if they report it at all. But to prevent the issue from being ignored or closed with "works for me" in the developer's environment, take the time to report it correctly the first time. As someone who understands technology more than the average person, take the time to write a cut down test case that exposes the issue. The developer who looks at the issue will be grateful; instead of wasting their time on something that may not even be a bug, you've done work that would have to be done anyway if it is indeed a bug. You might find that the reduced code shows it's in fact not an issue with someone else's code, but a simple mistake you made somewhere in your monolithic application. Or it might be exactly the evidence you need to get someone's attention to fix the problem. That test case might even be included in the project's test suite. Either way, you have now established beyond a reasonable doubt where the fault lies.

Write a Test Case for Your Next Bug Report or Issue.

Function.prototype.bind Edge Cases

| Comments

ECMAScript 5’s Function.prototype.bind is a great tool that’s implemented in all modern browser JavaScript engines. It allows you to modify the context, this, of a function when it is evaluated in the future. Knowing what this refers to in various contexts is key to being a professional JavaScript developer; don’t show up to an interview without knowing all about it.

Here’s a common use case that developers need to watch for. Can you spot the mistake?

var person = "Bill";

var obj = {
  person: "Nick",
  hi: function () {
    console.log("Hi " + this.person);
  }
};

window.addEventListener("DOMContentLoaded", obj.hi);

Ooops!!! Turns out that since we added the event listener to the window object, this in the event handler or callback refers to window. So this code prints "Hi Bill" instead of "Hi Nick". We could wrap obj.hi in an anonymous function:

window.addEventListener("DOMContentLoaded", function () {
  obj.hi();
});

But that is so needlessly verbose, and exactly what we were trying to avoid in the first place. The three functions you should know for modifying this (a question I ask all my interview candidates) are Function.prototype.call, Function.prototype.apply, and Function.prototype.bind. call is variadic, while apply takes an array of arguments; both immediately invoke the function. We don't want to do that just yet. The fix we need is Function.prototype.bind.
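To see all three side by side first (a quick sketch of my own):

function greet(greeting, punctuation) {
  console.log(greeting + " " + this.person + punctuation);
}
var ctx = { person: "Nick" };

greet.call(ctx, "Hi", "!");        // invokes now; arguments passed variadically
greet.apply(ctx, ["Hi", "!"]);     // invokes now; arguments passed as an array
var later = greet.bind(ctx, "Hi"); // invokes later; "Hi" is pre-applied
later("?");                        // "Hi Nick?"

With that distinction in mind, the fix is a one-liner: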

window.addEventListener("DOMContentLoaded", obj.hi.bind(obj));

There, now isn’t that nice and short? Instead of saving this as another variable then closing over it, you can instead use bind!

var obj = {
  person: "Nick",
  wait: function () {
    var self = this;
    someButton.onclick = function () {
      console.log(self.person + " clicked!");
    };
  },
};

becomes

var obj = {
  person: "Nick",
  wait: function () {
    someButton.onclick = function () {
      console.log(this.person + " clicked!");
    }.bind(this);
  },
};

No need to store this into self, then close over it. One great shortcut I use all the time is creating an alias for document.getElementById.

var $ = document.getElementById.bind(document);
$('someElementsId').doSomething();
$('anotherElement').doSomethingElse();
$('aThirdElement').doSomethingDifferent();
$('theFifthElementOops').doSomethingFun();

Why did I bind getElementById back to document? Try it without the call to bind. Any luck?
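For the curious, here's roughly what goes wrong without bind (the exact error message varies by engine; this one is Chrome-flavored):

var $ = document.getElementById; // un-bound alias
$('someElementsId');
// TypeError: Illegal invocation
// getElementById gets invoked with `this` as the global object instead of
// `document`, and the native method refuses to run against it.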

bind is also great for partially applying functions.

function add (a, b) {
  console.log("a: " + a);
  console.log("b: " + b);
  return a + b;
};
var todo = add.bind(null, 4);
console.log(todo(7));

will print

a: 4
b: 7
11

What Function.prototype.bind is doing is essentially wrapping add in a function that looks like:

var todo = function () {
  add.apply(null, [4].concat(Array.prototype.slice.call(arguments)));
};

The array holds the captured arguments (just 4); the wrapper converts todo's arguments into an array (a common idiom for converting "Array-like" objects into Arrays), joins the two (concat), and invokes the bound function (apply) with the chosen value for this (in this case, null).

In fact, if you look at the compatibility section of the MDN page for bind, you'll see a function that returns a function that is essentially the above. One caveat is that this approach only allows you to partially apply arguments in order.
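Here's a stripped-down sketch of such a shim, under the name myBind to make clear it's mine and not MDN's code; it ignores edge cases (like being called with new) that the real compatibility version handles:

Function.prototype.myBind = function (thisArg) {
  var fn = this;
  var captured = Array.prototype.slice.call(arguments, 1);
  return function () {
    var rest = Array.prototype.slice.call(arguments);
    return fn.apply(thisArg, captured.concat(rest));
  };
};

// usage: add.myBind(null, 4) behaves like add.bind(null, 4) above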

So bind is a great addition to the language. Now to the point I wanted to make: there are edge cases where bind doesn't work or might trip you up. The first is that bind evaluates its arguments when bound, not when invoked. The second is that bind returns a new function, always. And the final is to be careful binding variadic functions when you don't intend to use all of the passed-in arguments. Um, duh, right? Well, let me show you three examples that have bitten me (recently). The first involves ajax calls.

function crunch (data) {
  // operate on data
};

var xhr = new XMLHttpRequest;
xhr.open("GET", "data.json");
xhr.onload = crunch.bind(this.response);
xhr.send();

Oops! While I do want to operate on this.response within crunch, with this referring to xhr, this at the time of binding was referring to window! Let's hope window.response is undefined! What if we changed this.response to xhr.response? Well, we're no longer referring to the window object, but xhr.response is evaluated at bind time (and for an unsent XMLHttpRequest object, it's not yet populated), so we've bound an empty value as the first argument. We must delay the handling of xhr.onload; use either an inline anonymous function or a named function to control nesting depth.

xhr.onload = function () {
  crunch(this.response);
};

The next gotcha is that bind always returns a new function. Dude, it says that in the docs, RTFM. Yeah, I know, but this case still caught me. When removing an event listener, you need to supply the same handler function. For example, a once function:

function todo () {
  document.removeEventListener("myCustomEvent", todo);
  console.log(this.person);
}

document.addEventListener("myCustomEvent", todo.bind({ person: "Nick" }));

Try firing myCustomEvent twice and see what happens! "Nick" is logged twice. A once function that handles two separate events is not very good. In fact, it will continue to handle events, since document does not have todo itself as an event handler for myCustomEvent events; the event listener you added was a new function, because bind always returns a new function. The solution:

var todo = function () {
  console.log(this.person);
  document.removeEventListener("myCustomEvent", todo);
}.bind({ person: "Nick" });
document.addEventListener("myCustomEvent", todo);

That would be a good interview question. The final gotcha is with functions that are variadic. Extending one of my earlier examples:

var obj = {
  person: "Nick",
  wait: function () {
    var someButton = document.createElement("button");
    someButton.onclick = function () {
      console.log(this.person + " clicked!");
    }.bind(this);
    someButton.click();
  },
};
obj.wait();

Let’s say I thought I could use bind to simplify the onclick using the trick I did with document.getElementById:

var obj = {
  person: "Nick",
  wait: function () {
    var someButton = document.createElement("button");
    someButton.onclick = console.log.bind(console, this.person + " clicked!");
    someButton.click();
  },
};
obj.wait();

Can you guess what this prints? It does print the expected output, but with an unexpected addition. Think about what I said about variadic functions. What might be wrong here?

Turns out this prints "Nick clicked! [object MouseEvent]". This one took me a while to think through, but luckily I had other experiences with bind that helped me understand why this occurred.

console.log is variadic, so it prints all of its arguments. When we called bind on console.log, we set the onclick handler to be a new function that applies the expected string along with any additional arguments. Well, onclick handlers are passed a MouseEvent object (think e.target), which winds up being passed as the second argument to console.log. If this were the example with add from earlier, this.person + " clicked!" would be the 4 and the MouseEvent would be the 7:

someButton.onclick = function (e) {
  console.log.apply(console, ["Nick clicked!"].concat([e]));
};

I love bind, but sometimes, it will get you. What are some examples of times when you’ve been bitten by bind?

Making Great Node.js Modules With CoffeeScript

| Comments

Node.js is a great runtime for writing applications in JavaScript, the language I primarily develop in. CoffeeScript is a programming language that compiles to JavaScript. Why would we write a reusable piece of code, a module, in CoffeeScript? CoffeeScript is a very high level language and beautifully brings together my favorite aspects of JavaScript, Ruby, and Python. In this tutorial, I'll show you how I create reusable open source modules for Node.js from CoffeeScript, something I recently figured out while creating a playlist parser module. The point is to focus on how to turn a quick hack into a nicely laid out Node.js module.

The steps are as follows:

  1. Turn an idea into a git repo.
  2. Add directory structure.
  3. Split library functions from testing.
  4. Add a Build script.
  5. Create node module.
  6. Add LICENSE and README.
  7. Publish.

First things first: we have to have an idea. It doesn't have to be revolutionary; just do one thing and do it well. That is the first rule of UNIX fightclub philosophy, which resonates well within the Node.js community. When I'm hacking on something, I start out with a single file to test something out, then progressively refine the example until it's something reusable. That way, I can reuse it, others can reuse it, others can learn from it, and the world can be a better place.

For this tutorial, I'll show you my process for creating a binding to nanomsg, the latest scalability protocol library from the creator of ZeroMQ, Martin Sústrik. I had played with ZeroMQ in the past and thought it was really awesome, and I was excited to see a new library from its creator, written in C, since I also really enjoyed his post on why he shouldn't have written ZeroMQ in C++.

So messing around real quick, let's make sure we have node up to date. I like to use nvm and the latest stable minor version of node (stable versions have an even minor number, where versions are in the format major.minor.patch, so v0.11.0 is unstable). node -v -> v0.10.17

Then I need to download and install the library that I’ll be dynamically linking to, build, and install it.

curl -O http://download.nanomsg.org/nanomsg-0.1-alpha.zip && \
unzip nanomsg-0.1-alpha.zip && \
cd nanomsg-0.1-alpha && \
mkdir build && \
cd build && \
../configure && \
make && \
make install

We'll use node's FFI module to interface with the dynamically linked library, because it's easier to write bindings that way than with native addons, and v8's API has recently changed, causing some headaches for native extensions.

npm install ffi

We’ll be writing the example in CoffeeScript.

npm install -g coffee-script

Now to mess around we can create main.coffee based on the C++ binding’s example:

main.coffee
ffi = require 'ffi'
assert = require 'assert'

AF_SP = 1
NN_PAIR = 16

nanomsg = ffi.Library 'libnanomsg',
  nn_socket: [ 'int', [ 'int', 'int' ]]
  nn_bind: [ 'int', [ 'int', 'string' ]]
  nn_connect: [ 'int', ['int', 'string' ]]
  nn_send: [ 'int', ['int', 'pointer', 'int', 'int']]
  nn_recv: [ 'int', ['int', 'pointer', 'int', 'int']]
  nn_errno: [ 'int', []]

# test
s1 = nanomsg.nn_socket AF_SP, NN_PAIR
assert s1 >= 0, 's1: ' + nanomsg.nn_errno()

ret = nanomsg.nn_bind s1, 'inproc://a'
assert ret > 0, 'bind'

s2 = nanomsg.nn_socket AF_SP, NN_PAIR
assert s2 >= 0, 's2: ' + nanomsg.nn_errno()

ret = nanomsg.nn_connect s2, 'inproc://a'
assert ret > 0, 'connect'

msg = new Buffer 'hello'
ret = nanomsg.nn_send s2, msg, msg.length, 0
assert ret > 0, 'send'

recv = new Buffer msg.length
ret = nanomsg.nn_recv s1, recv, recv.length, 0
assert ret > 0, 'recv'

console.log recv.toString()
assert msg.toString() is recv.toString(), 'received message did not match sent'

coffee main.coffee -> hello

This quick example shows that we have something working. Currently our working directory should look like:

tree -L 2
.
├── main.coffee
└── node_modules
    └── ffi

2 directories, 1 file

Turn an idea into a git repo

Next up is to create a repository using a version control system like git and start saving our work. Check in early, check in often.

Let's add a .gitignore so that we're not committing files that don't need to be committed. The node_modules folder is unnecessary because when this node module is installed, its dependencies will be recursively installed, so there's no need to commit them to source control. The swap-file entry is because I use vim and accidentally commit the swap files from open buffers all the time like a noob.

.gitignore
node_modules/
*.swp

Let’s turn this into a git repo:

git init && \
git add . && \
git commit -am "initial commit"

Up on github, let’s create an uninitialized repo and push to it:

git remote add origin git@github.com:nickdesaulniers/node-nanomsg.git && \
git push -u origin master

So we have:

tree -L 2 -a
.
├── .gitignore
├── main.coffee
└── node_modules
    └── ffi

2 directories, 2 files

Add directory structure

Now that we have our repo under version control, let’s start adding some structure. Let’s create src/, lib/, and test/ directories. Our CoffeeScript will live in src/, compiled JavaScript will be in lib/, and our test code will be in test/.

mkdir src lib test

Split library functions from testing

Now let’s move a copy of main.coffee into src/ and one into test/. We are going to split the library definition away from the testing logic.

cp main.coffee test/test.coffee && \
git add test/test.coffee && \
git mv main.coffee src/nanomsg.coffee

This way git status tells us:

# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
# renamed:    main.coffee -> src/nanomsg.coffee
# new file:   test/test.coffee
#

Let's edit src/nanomsg.coffee to look like:

src/nanomsg.coffee
ffi = require 'ffi'

exports = module.exports = ffi.Library 'libnanomsg',
  nn_socket: [ 'int', [ 'int', 'int' ]]
  nn_bind: [ 'int', [ 'int', 'string' ]]
  nn_connect: [ 'int', ['int', 'string' ]]
  nn_send: [ 'int', ['int', 'pointer', 'int', 'int']]
  nn_recv: [ 'int', ['int', 'pointer', 'int', 'int']]
  nn_errno: [ 'int', []]

exports.AF_SP = 1
exports.NN_PAIR = 16

and edit the tests to:

test/test.coffee
assert = require 'assert'
nanomsg = require '../lib/nanomsg.js'

{ AF_SP, NN_PAIR } = nanomsg

s1 = nanomsg.nn_socket AF_SP, NN_PAIR
assert s1 >= 0, 's1: ' + nanomsg.nn_errno()

ret = nanomsg.nn_bind s1, 'inproc://a'
assert ret > 0, 'bind'

s2 = nanomsg.nn_socket AF_SP, NN_PAIR
assert s2 >= 0, 's2: ' + nanomsg.nn_errno()

ret = nanomsg.nn_connect s2, 'inproc://a'
assert ret > 0, 'connect'

msg = new Buffer 'hello'
ret = nanomsg.nn_send s2, msg, msg.length, 0
assert ret > 0, 'send'

recv = new Buffer msg.length
ret = nanomsg.nn_recv s1, recv, recv.length, 0
assert ret > 0, 'recv'

assert msg.toString() is recv.toString(), 'received message did not match sent'

Notice how in the test we're requiring the compiled JavaScript from lib/, which doesn't exist yet? If you try running coffee test/test.coffee, it should crash. Let's make the compiled version: coffee -o lib -c src/nanomsg.coffee

Once the compiled lib exists, we can run our tests with coffee test/test.coffee and shouldn’t see any errors.

Now that we have a little more order, let's commit. Hold off on adding lib/ to version control; I'll explain why in a bit.

tree -L 2 -C -a -I '.git'
.
├── .gitignore
├── lib
│   └── nanomsg.js
├── node_modules
│   └── ffi
├── src
│   └── nanomsg.coffee
└── test
    └── test.coffee

5 directories, 4 files

At this point, if we add features and want to rerun our tests, we need to execute:

coffee -o lib -c src/nanomsg.coffee && coffee test/test.coffee

While this command is simple for now and easy to reverse-search, anyone else contributing to your project is going to have to know the commands to run the tests. Let's use Grunt, the JavaScript task runner, to automate our build and test process.

Add a Build script

npm install -g grunt-cli && \
npm install grunt-contrib-coffee

Create a simple Gruntfile which can also be written in CoffeeScript:

Gruntfile.coffee
module.exports = (grunt) ->
  grunt.initConfig
    coffee:
      compile:
        files:
          'lib/nanomsg.js': ['src/*.coffee']
  grunt.loadNpmTasks 'grunt-contrib-coffee'
  grunt.registerTask 'default', ['coffee']

Running grunt builds our lib which is a start, so let’s commit that.

But grunt is not running our tests. And our tests don’t have nice output. Let’s change that:

npm install -g mocha && \
npm install chai grunt-mocha-test

edit test/test.coffee to:

test/test.coffee
assert = require 'assert'
should = require('chai').should()
nanomsg = require '../lib/nanomsg.js'

describe 'nanomsg', ->
  it 'should at least work', ->
    { AF_SP, NN_PAIR } = nanomsg

    s1 = nanomsg.nn_socket AF_SP, NN_PAIR
    s1.should.be.at.least 0

    ret = nanomsg.nn_bind s1, 'inproc://a'
    ret.should.be.above 0

    s2 = nanomsg.nn_socket AF_SP, NN_PAIR
    s2.should.be.at.least 0

    ret = nanomsg.nn_connect s2, 'inproc://a'
    ret.should.be.above 0

    msg = new Buffer 'hello'
    ret = nanomsg.nn_send s2, msg, msg.length, 0
    ret.should.be.above 0

    recv = new Buffer msg.length
    ret = nanomsg.nn_recv s1, recv, recv.length, 0
    ret.should.be.above 0

    msg.toString().should.equal recv.toString()

and modify your gruntfile to add a testing step:

Gruntfile.coffee
module.exports = (grunt) ->
  grunt.initConfig
    coffee:
      compile:
        files:
          'lib/nanomsg.js': ['src/*.coffee']
    mochaTest:
      options:
        reporter: 'nyan'
      src: ['test/test.coffee']

  grunt.loadNpmTasks 'grunt-contrib-coffee'
  grunt.loadNpmTasks 'grunt-mocha-test'

  grunt.registerTask 'default', ['coffee', 'mochaTest']

Now when we run grunt, our build process will run, then our test process, then we should see one incredibly happy nyan cat. The nyan cat mocha test reporter is basically the pinnacle of human intellectual achievement.

grunt
Running "coffee:compile" (coffee) task
File lib/nanomsg.js created.

Running "mochaTest:src" (mochaTest) task
 1   -__,------,
 0   -__|  /\_/\
 0   -_~|_( ^ .^)
     -_ ""  ""

  1 passing (5 ms)


Done, without errors.

Commit time.

tree -L 2 -C -a -I '.git'
.
├── .gitignore
├── Gruntfile.coffee
├── lib
│   └── nanomsg.js
├── node_modules
│   ├── ffi
│   ├── grunt
│   └── grunt-contrib-coffee
├── src
│   └── nanomsg.coffee
└── test
    └── test.coffee

7 directories, 5 files

Create node module

Now that we have a more modular design with build and test logic built in, let's make this module redistributable. First, let's talk about ignoring files. Create a .npmignore file to specify what not to include in the module that is downloaded. Node Package Manager, npm, will ignore a bunch of files by default for us.

.npmignore
Gruntfile.coffee
src/
test/

Here we're ignoring the src/ dir, whereas in our .gitignore we are going to ignore lib/.

.gitignore
node_modules/
lib/
*.swp

Why are we doing this? Admittedly, none of it is strictly necessary, but here's why I think it is useful. When someone is checking out the source, they don't need the results of the compilation step, since they can make modifications and would need to recompile anyway. Adding lib/nanomsg.js would just be another thing to download (though its size is relatively insignificant). Likewise, when someone downloads the module, they most likely just want the results of the compilation step, not the source, build script, or test suite. If I planned on making the compiled JavaScript accessible to a web browser, I would not add lib/ to .gitignore; that way it could be referenced from the github raw URL. Again, these are generalizations that are not always true. To make up for not having the entire source when installed as a module, we'll add a link to the repo in the manifest. But first, let's commit!

Time to create a manifest file with some basic info about our app. It's a pretty good idea to run npm search <packagename> beforehand to make sure your planned package name is not taken. Since we have all of our dependencies in a row, let's run npm init.

This utility will walk you through creating a package.json file.
It only covers the most common items, and tries to guess sane defaults.

See `npm help json` for definitive documentation on these fields
and exactly what they do.

Use `npm install <pkg> --save` afterwards to install a package and
save it as a dependency in the package.json file.

Press ^C at any time to quit.
name: (nanomsg)
version: (0.0.0)
description: nanomsg bindings
entry point: (index.js) lib/nanomsg.js
test command: grunt
git repository: (git://github.com/nickdesaulniers/node-nanomsg.git)
keywords: nanomsg
author: Nick Desaulniers
license: (BSD-2-Clause) Beerware
About to write to /Users/Nicholas/code/c/nanomsg/package.json:

{
  "name": "nanomsg",
  "version": "0.0.0",
  "description": "nanomsg bindings",
  "main": "lib/nanomsg.js",
  "directories": {
    "test": "test"
  },
  "dependencies": {
    "chai": "~1.7.2",
    "ffi": "~1.2.5",
    "grunt": "~0.4.1",
    "grunt-mocha-test": "~0.6.3",
    "grunt-contrib-coffee": "~0.7.0"
  },
  "devDependencies": {},
  "scripts": {
    "test": "grunt"
  },
  "repository": {
    "type": "git",
    "url": "git://github.com/nickdesaulniers/node-nanomsg.git"
  },
  "keywords": [
    "nanomsg"
  ],
  "author": "Nick Desaulniers",
  "license": "Beerware",
  "bugs": {
    "url": "https://github.com/nickdesaulniers/node-nanomsg/issues"
  }
}


Is this ok? (yes)

That should create for us a nice package.json manifest file for npm.

We can now run our tests with the command npm test in addition to grunt. Let’s hold off on publishing just yet, committing instead.

tree -L 2 -C -a -I '.git'
.
├── .gitignore
├── .npmignore
├── Gruntfile.coffee
├── lib
│   └── nanomsg.js
├── node_modules
│   ├── .bin
│   ├── chai
│   ├── ffi
│   ├── grunt
│   ├── grunt-contrib-coffee
│   └── grunt-mocha-test
├── package.json
├── src
│   └── nanomsg.coffee
└── test
    └── test.coffee

10 directories, 7 files

Add LICENSE and README

So we have a module that’s almost ready to go. But how will developers know how to reuse this code? As much as I like to view the source, Luke, npm will complain without a readme. The readme also looks nice on the github repo.

# Node-NanoMSG
Node.js binding for [nanomsg](http://nanomsg.org/index.html).

## Usage

`npm install nanomsg`

```javascript
var nanomsg = require('nanomsg');
var assert = require('assert');
var AF_SP = nanomsg.AF_SP;
var NN_PAIR = nanomsg.NN_PAIR;
var msg = new Buffer('hello');
var recv = new Buffer(msg.length);
var s1, s2, ret;

s1 = nanomsg.nn_socket(AF_SP, NN_PAIR);
assert(s1 >= 0, 's1: ' + nanomsg.errno());

ret = nanomsg.nn_bind(s1, 'inproc://a');
assert(ret > 0, 'bind');

s2 = nanomsg.nn_socket(AF_SP, NN_PAIR);
assert(s2 >= 0, 's2: ' + nanomsg.errno());

ret = nanomsg.nn_connect(s2, 'inproc://a');
assert(ret > 0, 'connect');

ret = nanomsg.nn_send(s2, msg, msg.length, 0);
assert(ret > 0, 'send');

ret = nanomsg.recv(s1, recv, recv.length, 0);
assert(ret > 0, 'recv');

assert(msg.toString() === recv.toString(), "didn't receive sent message");
console.log(recv.toString());
```

Before we publish, let’s create a license file, because though we are making our code publicly viewable, public source code without an explicit license is still under copyright and cannot be reused.

LICENSE
/*
 * ----------------------------------------------------------------------------
 * "THE BEER-WARE LICENSE" (Revision 42):
 * <nick@mozilla.com> wrote this file. As long as you retain this notice you
 * can do whatever you want with this stuff. If we meet some day, and you think
 * this stuff is worth it, you can buy me a beer in return. Nick Desaulniers
 * ----------------------------------------------------------------------------
 */

If you want to be more serious, shoot instead for an MIT or BSD style license if you don’t care what your repo gets used for, or a GPL style license if you do. TLDRLegal has a great breakdown of common licenses.

tree -L 2 -C -a -I '.git'
.
├── .gitignore
├── .npmignore
├── Gruntfile.coffee
├── LICENSE
├── README.md
├── lib
│   └── nanomsg.js
├── node_modules
│   ├── .bin
│   ├── chai
│   ├── ffi
│   ├── grunt
│   ├── grunt-contrib-coffee
│   └── grunt-mocha-test
├── package.json
├── src
│   └── nanomsg.coffee
└── test
    └── test.coffee

10 directories, 9 files

Publish

npm publish

npm http PUT https://registry.npmjs.org/nanomsg
npm http 201 https://registry.npmjs.org/nanomsg
npm http GET https://registry.npmjs.org/nanomsg
npm http 200 https://registry.npmjs.org/nanomsg
npm http PUT https://registry.npmjs.org/nanomsg/-/nanomsg-0.0.0.tgz/-rev/1-20f1ec5ca2eed51e840feff22479bb5d
npm http 201 https://registry.npmjs.org/nanomsg/-/nanomsg-0.0.0.tgz/-rev/1-20f1ec5ca2eed51e840feff22479bb5d
npm http PUT https://registry.npmjs.org/nanomsg/0.0.0/-tag/latest
npm http 201 https://registry.npmjs.org/nanomsg/0.0.0/-tag/latest
+ nanomsg@0.0.0

Finally, as a sanity check, I like to make a new folder elsewhere and run through the steps in the readme manually to make sure the package is reusable. Good thing, too: in the readme I accidentally forgot the nn_ prefix in front of errno and recv!
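
That sanity check is quick to sketch out; the scratch directory name is arbitrary, and the REPL lines are just an assumption of how you might poke at the fresh install:

mkdir /tmp/nanomsg-check && cd /tmp/nanomsg-check
npm install nanomsg
node
> var nanomsg = require('nanomsg');
> nanomsg.nn_socket(nanomsg.AF_SP, nanomsg.NN_PAIR); // => a small nonnegative int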

After updating the example in the readme, let’s bump the version number and republish. Run npm version without arguments to find the current version, then npm version patch to bump it. You have to commit the readme changes before bumping the version. Finally, don’t forget to rerun npm publish.
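
Sketched out, that flow looks something like this (the commit message is arbitrary):

git add README.md
git commit -m 'fix nn_ prefix in readme example'
npm version        # prints the current version info
npm version patch  # bumps 0.0.0 -> 0.0.1, committing and tagging
npm publish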

Our final directory structure ends up looking like:

tree -L 2 -C -a -I '.git'
.
├── .gitignore
├── .npmignore
├── Gruntfile.coffee
├── LICENSE
├── README.md
├── lib
│   └── nanomsg.js
├── node_modules
│   ├── .bin
│   ├── chai
│   ├── ffi
│   ├── grunt
│   ├── grunt-contrib-coffee
│   └── grunt-mocha-test
├── package.json
├── src
│   └── nanomsg.coffee
└── test
    └── test.coffee

10 directories, 9 files

Lastly, I’ll reach out to Martin Sústrik and let him know that nanomsg has a new binding.

The bindings are far from complete, the test coverage could be better, and the API is very C-like and could use some OO syntactic sugar, but we’re at a great starting point and ready to rock and roll. If you’d like to help out, fork https://github.com/nickdesaulniers/node-nanomsg.git.

What are your thoughts on the build steps, testing, and directory layout of a node module? This tutorial was definitely not meant to be an authoritative guide. I look forward to your comments on your experiences!

Designated Initialization With Compound Literals in C

| Comments

Just a quick post on something I recently discovered and found neat (I always find obscure C syntax interesting). I was trying to figure out how to use a C designated initializer where one member was a pointer to another designated-initialized struct. For that, you need a compound literal. First, a quick background on C initialization:

// verbosely create an array with a known size
int arr [3];
arr[0] = 1;
arr[1] = 2;
arr[2] = 3;
// => [1, 2, 3]

// concisely create an array with a known size
int arr [3] = { 1, 2, 3 }; // => [1, 2, 3]

// elements without an initializer are set to 0
int arr [4] = { 1, 2, 3 }; // => [1, 2, 3, 0]

// excess initializers: a constraint violation, though most compilers
// just warn and truncate
int arr [1] = { 1, 2, 3 }; // => [1]

// size inferred from the number of initializers
int arr [] = { 1, 2, 3 }; // => [1, 2, 3]

Let’s look at how we might have initialized a struct in C89, which requires local variables to be declared at the top of a block. An initialization of a point struct might have looked like:

struct point {
  int x, y;
};

{
  struct point a;
  a.x = 2;
  a.y = 3;
}

Just as we can define array literals in C using the initializer list syntax, we can use the same concise syntax for initializing structs!

// point a is located at (2, 3)
struct point a = { 2, 3 };

Well, this can be bad. Where would point a be located if, say, a teammate came along and modified the definition of the point struct to:

struct point {
  int y, x; // used to be `int x, y;`
};

Suddenly point a is located at (3, 2), not (2, 3). It’s better to use designated initializers to declare the values for the members of your struct. It’s up to the compiler to decide on the order of initialization, but it won’t mess up where the data is intended to go.

// point b is located at (2, 3)
struct point b = { .y = 3, .x = 2 };

So now we have designated initializers, cool. What about if we want to use the same syntax to reassign point b?

b = { .x = 5, .y = 6 };
//    ^
// error: expected expression

While you are being explicit about the shape of the struct you’re trying to assign to b, the compiler cannot figure out that you’re trying to assign one point struct to another. A cast-like syntax helps here, and that’s essentially what a compound literal is.

b = (struct point) { .x = 5, .y = 6 }; // works!

Notice that I just combined a compound literal with a designated initializer. A compound literal on its own would look like:

b = (struct point) { 5, 6 }; // works!

To recap, we can define points like so:

struct point a;
a.x = 1;
a.y = 2; // C89 style (verbose)
struct point b = { 3, 4 }; // initializer list (depends on member order)
struct point c = { .x = 5, .y = 6 }; // designated initializer (member order doesn't matter)
struct point d = (struct point) { .x = 7, .y = 8 }; // compound literal (cast + designated initializer)

My favorite part of compound literals is that you can define values inline in an argument list. Say you have a function prototype like:

int distance (struct point, struct point);

// instead of calling it like this
// (creating temp objects just to pass in)
struct point a = { .x = 1, .y = 2 };
struct point b = { .x = 5, .y = 6 };
distance(a, b);

// we can use compound literals
distance((struct point) { .x = 1, .y = 2 }, (struct point) { .x = 5, .y = 6 });
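
For completeness, here’s a minimal sketch of that call in a full program. The Euclidean distance body is my assumption, since only the prototype is given above:

#include <math.h>
#include <stdio.h>

struct point {
  int x, y;
};

// hypothetical body: Euclidean distance, truncated to an int
int distance (struct point a, struct point b) {
  int dx = a.x - b.x;
  int dy = a.y - b.y;
  return (int) sqrt(dx * dx + dy * dy);
}

int main () {
  // compound literals spare us the temporary variables
  printf("%d\n", distance((struct point) { .x = 1, .y = 2 },
                          (struct point) { .x = 5, .y = 6 })); // => 5
  return 0;
}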

So compound literals help with reassigning structs and with avoiding temporary variables that exist only to be passed as function arguments. What happens, though, when one of your members is a pointer? C strings are easy because they already have a literal form:

struct node {
  char *value;
  struct node *next;
};

// just using designated initialization
struct node a = {
  .value = "hello world",
  .next = NULL
};

But what happens if we want to initialize node.next? We could do:

struct node b = {
  .value = "foo",
  .next = NULL
};
a.next = &b;

Again, we have to define b before pointing a.next at it. That’s worthwhile if you need to reference b later in that scope, but sometimes you don’t (just as with function arguments, where compound literals helped)! That’s where I was stumped: how do you nest designated initializers when a member is a pointer to another designated-initialized struct? A first naïve attempt was:

struct node c = {
  .value = "bar",
  .next = {
    .value = "baz",
//  ^
// error: designator in initializer for scalar type 'struct node *'
    .next = NULL
  }
};

WTF? Well, if you go back to the example with nodes a and b, we don’t assign the value of b to a.next; we assign it a pointer to b. So how can we use designated initializers to define, say, the first two nodes of a linked list? Compound literals. Remember, a compound literal is essentially a designated initializer plus a cast.

struct node d = {
  .value = "qux",
  .next = &((struct node) {
    .value = "fred",
    .next = NULL
  })
};

And that works, but why? d.next is assigned the address of a compound literal. Granted, you probably don’t want to declare your entire linked list like this, as the nesting gets out of control fast. I really like this style because it reminds me of JavaScript’s syntax for declaring object literals. It would look even nicer if all of your nested structs were values rather than pointers; then you could use designated initializers alone and wouldn’t need compound literals or the address-of operator.
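
To convince yourself that the compound literal really does stick around (it has automatic storage duration for the enclosing block), here’s a minimal sketch that builds the two-node list from above and walks it; the traversal loop is my addition:

#include <stdio.h>

struct node {
  char *value;
  struct node *next;
};

int main () {
  struct node d = {
    .value = "qux",
    .next = &((struct node) {
      .value = "fred",
      .next = NULL
    })
  };

  // walk the list, printing each node's value
  for (struct node *it = &d; it != NULL; it = it->next)
    puts(it->value); // => qux, then fred

  return 0;
}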

What’s your favorite or most interesting part of C syntax?

Acknowledgements: