Time Travel Debugging for C/C++

Markus Woschank

In software development, one of the most valuable tools in your toolbox is the debugger. Typical debuggers for C/C++ let you halt the execution of a program, inspect the current state (including variables and registers), and continue execution until the next source line, instruction, or breakpoint. Most of them also let you change the current state by modifying registers and variables, and even call functions directly from the debugger before continuing execution. In this way, you can experiment without going through another compile-and-run cycle.

But even armed with this great tool, fixing some bugs can be difficult and time-consuming. Knowing that a variable contains the wrong value, and managing to halt execution at a point where we can verify it, doesn’t mean we know how the wrong value got there. We have to embark on a journey of guesswork and trial and error to narrow down the culprit, potentially involving many reruns of the program. Add some nondeterminism like threading into the mix and things can get very frustrating.

Reverse Debugging to the Rescue

This is where time travel debugging — also known as reverse debugging or reversible debugging — comes into play. It’s an idea that’s been around for quite some time (as early as 1973), but it’s surprisingly not very well known. However, the idea promises to make our lives easier.

While traditional debuggers only let us step forward in (execution) time, reverse debugging also lets us step backward. Basically, this means we can “undo” already executed instructions. With the ability to set breakpoints and let time run backward, it’s suddenly much easier to catch who set a variable to the wrong value or who corrupted the stack. Because we don’t need to restart the process, we can simply set watchpoints that stop when specific memory locations are accessed and let time run backward. This works even for heap-allocated objects that would have different addresses on each run, or for addresses that would change because of address space layout randomization (ASLR).

Under the Hood

To be able to step back in time, the reverse debugging system must be able to reproduce the process state at any point in time (of the captured process execution). Simply making a copy of the whole process state after every instruction is clearly not feasible. Instead, reverse debugging systems use the fact that most instructions are deterministic. If the state of the process is known before a series of deterministic instructions is executed, the reverse debugging system doesn’t need to record the changes these instructions make to the process state, because it can always replay them if needed. This approach dramatically reduces memory usage and runtime overhead, and it makes reverse debugging at least somewhat usable.

Not every instruction is deterministic; the side effects and results of nondeterministic instructions must be recorded, as should the operations that depend on the state of the environment, like reading from files or querying the current time. Capturing the application state — including nondeterministic system interactions — at every step of the program is called record and replay (debugging), and it provides the foundation for most reverse debugging systems. To enable the debugger to quickly jump to different points in time, techniques like creating full snapshots at regular intervals (like key frames in video encodings) can be used.
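To make the idea more concrete, here’s a toy sketch of record and replay with periodic snapshots. It isn’t how GDB or rr are implemented: The "process" is reduced to a single number, and the names State, step, and Recording are made up for this illustration. Only the nondeterministic inputs are logged, a full snapshot is stored every few steps, and any earlier state is rebuilt by restoring the closest snapshot and replaying the deterministic steps from there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy "process": the whole state is one 64-bit accumulator.
struct State {
  std::uint64_t acc = 0;
};

// Deterministic step: the result depends only on the current state
// and on the input value consumed at this step.
inline void step(State& s, std::uint64_t input) {
  s.acc = s.acc * 31 + input;
}

// Hypothetical recorder: logs only the nondeterministic inputs and
// stores a full snapshot every `interval` steps (the "key frames").
class Recording {
public:
  explicit Recording(std::size_t interval) : interval_(interval) {
    assert(interval > 0);
    snapshots_[0] = State{};  // initial state, before any step
  }

  // Called while recording. `input` stands in for a nondeterministic
  // result, such as data read from a file or the current time.
  void record(std::uint64_t input) {
    step(live_, input);
    inputs_.push_back(input);
    if (inputs_.size() % interval_ == 0) snapshots_[inputs_.size()] = live_;
  }

  // Rebuilds the state after `t` steps: restore the closest snapshot
  // at or before `t`, then replay the logged inputs up to `t`.
  State state_at(std::size_t t) const {
    assert(t <= inputs_.size());
    auto it = snapshots_.upper_bound(t);
    --it;  // safe: the snapshot for step 0 always exists
    State s = it->second;
    for (std::size_t i = it->first; i < t; ++i) step(s, inputs_[i]);
    return s;
  }

  // Number of steps recorded so far.
  std::size_t length() const { return inputs_.size(); }

private:
  std::size_t interval_;
  State live_;
  std::vector<std::uint64_t> inputs_;
  std::map<std::size_t, State> snapshots_;
};
```

In this model, a "reverse step" from the end of the recording is simply rec.state_at(rec.length() - 1). A larger snapshot interval saves memory but means more instructions have to be replayed for each jump back in time.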

Being able to capture the execution of a program down to the instruction level opens up the possibility of creating a trace that can be shared and replayed on remote systems. Imagine a bug found during QA or in production, with the corresponding trace captured. This trace can be forwarded to developers, who can then debug it without needing to recreate the exact conditions that led to the failure.

While the concept doesn’t sound that difficult, given the variety of CPUs with different instruction sets, it can be challenging to correctly create traces that can be replayed and produce the exact same state.

Time Travel Debugging in Action

There are a few reverse debugging systems that have been freely available for a while now. On Windows, WinDbg Preview has supported time travel debugging since 2017. Meanwhile, GDB has had support for it on Linux since 2009. And rr 1.0 from Mozilla was released in 2014.

There’s a nice time travel debugging walkthrough for WinDbg showing how to find a stack corruption error on Windows, so let’s use it to see how we can find such an error on Linux using GDB’s process record and replay feature first, and then using rr.

GDB

We’ll use a slightly modified version of the WinDbg walkthrough’s sample app code:

#include <array>
#include <algorithm>
#include <iterator>
#include <wchar.h>

void fill(wchar_t* dst,size_t sz)
{
  const wchar_t msg[] = L"Hello, World!";
  wcsncpy(dst,msg,std::min(wcslen(msg),sz));
}

int main(int,char**)
{
  std::array<wchar_t,8> buffer;
  fill(data(buffer),sizeof(buffer)); // Should be `size(buffer)`.
  // Some other code here.
}

The code above includes an easy-to-miss mistake and leads to a buffer overflow overwriting the stack. To make the disassembly a little easier to read, we’re disabling gcc’s stack protector feature. The stack protector feature, which some distributions enable by default, detects most stack overflows, prints a diagnostic message, and terminates the program. However, it won’t tell you the cause of the stack corruption.
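To see why the size argument matters, here’s a small sketch of the arithmetic behind the overflow. The names Buffer, kElements, kBytes, kMessageLength, copied, and overflow are made up for this illustration, and the sketch assumes that std::array stores nothing but its elements, which holds for the common standard library implementations.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Same buffer type as in the sample; all names here are illustrative.
using Buffer = std::array<wchar_t, 8>;

constexpr std::size_t kElements = std::tuple_size<Buffer>::value;  // element count: 8
constexpr std::size_t kBytes = sizeof(Buffer);  // byte count: 32 if wchar_t has 4 bytes
constexpr std::size_t kMessageLength = 13;      // wcslen(L"Hello, World!")

// Number of wide characters wcsncpy is asked to copy for a given `sz`,
// mirroring std::min(wcslen(msg), sz) in fill().
constexpr std::size_t copied(std::size_t sz) {
  return std::min(kMessageLength, sz);
}

// Number of wide characters written past the end of the buffer.
constexpr std::size_t overflow(std::size_t sz) {
  return copied(sz) > kElements ? copied(sz) - kElements : 0;
}
```

With the byte count, copied(kBytes) is 13 and overflow(kBytes) is 5. On Linux, where wchar_t has 4 bytes, that’s 20 bytes written past the end of the 32-byte buffer. With the element count, copied(kElements) is 8 and nothing is written out of bounds.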

If we compile and run our program, we get a segmentation fault:

$ g++ -g -Wall -std=c++17 -fno-stack-protector stack-smasher.cc -o stack-smasher
$ ./stack-smasher
Segmentation fault

Let’s jump into reverse debugging! We’ll start GDB, break on main, enable recording, and test out reverse debugging:

$ gdb ./stack-smasher
(gdb) break main
Breakpoint 1 at 0x122c: file stack-smasher.cc, line 15.
(gdb) run
Breakpoint 1, main () at stack-smasher.cc:15
15              fill(data(buffer),sizeof(buffer)); // Should be `size(buffer)`.
(gdb) target record-full
(gdb) continue
Process record does not support instruction 0xc4 at address 0x7ffff7fdc930.
Process record: failed to record execution log.

:(

Our enthusiasm is a little dampened by this cryptic error message. It turns out GDB’s process recording has trouble with some SIMD/AVX instructions. After consulting Stack Overflow, it becomes clear that glibc nowadays detects advanced CPU features at runtime, and there’s no easy way to disable its use of SIMD/AVX instructions. Long story short: We can get past that with some hackery.

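Disable SIMD/AVX use in glibc by patching a copy of the dynamic loader so that a cpuid instruction (bytes 0f a2) following an xor %eax,%eax (bytes 31 c0) becomes a two-byte nop (bytes 66 90), and then pointing the binary at the patched loader: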

$ perl -0777 -pe 's/\x31\xc0.{0,32}?\K\x0f\xa2/\x66\x90/' \
  < /lib64/ld-linux-x86-64.so.2 > ld-linux
$ chmod u+rx ld-linux
$ patchelf --set-interpreter `pwd`/ld-linux stack-smasher
$ LD_BIND_NOW=1 gdb ./stack-smasher

With that out of the way, let’s dive into reverse debugging:

(gdb) break main
Breakpoint 1 at 0x122c: file stack-smasher.cc, line 15.
(gdb) run
Breakpoint 1, main () at stack-smasher.cc:15
15              fill(data(buffer),sizeof(buffer)); // Should be `size(buffer)`.
(gdb) target record-full
(gdb) continue
Program stopped.
0x000000640000006c in ?? ()
(gdb) backtrace
#0  0x000000640000006c in ?? ()
#1  0x0000000000000021 in ?? ()
#2  0x00007fffffffdb28 in ?? ()
#3  0x00000001f7f9afa0 in ?? ()
#4  0x000055555555521d in fill (dst=0x6500000048 <error: Cannot access memory at address 0x6500000048>, sz=4294967296) at stack-smasher.cc:10
#5  0x0000000000000000 in ?? ()

As the name of our program suggests, we’re smashing the stack, and as a result, the backtrace we’re getting isn’t very helpful. But with our new-found superpower, we can step back one instruction in time. We now get a useful backtrace, and looking into the disassembly lets us get closer to what’s going on:

(gdb) reverse-stepi
0x000055555555524b in main () at stack-smasher.cc:17
(gdb) layout asm
   0x55555555521d <main(int, char**)>       push   %rbp
   0x55555555521e <main(int, char**)+1>     mov    %rsp,%rbp
   0x555555555221 <main(int, char**)+4>     sub    $0x30,%rsp
   0x555555555225 <main(int, char**)+8>     mov    %edi,-0x24(%rbp)
   0x555555555228 <main(int, char**)+11>    mov    %rsi,-0x30(%rbp)
B+ 0x55555555522c <main(int, char**)+15>    lea    -0x20(%rbp),%rax
   0x555555555230 <main(int, char**)+19>    mov    %rax,%rdi
   0x555555555233 <main(int, char**)+22>    callq  0x555555555277 <std::data<std::array<wchar_t, 8ul> >(std::array<wchar_t, 8ul>&)>
   0x555555555238 <main(int, char**)+27>    mov    $0x20,%esi
   0x55555555523d <main(int, char**)+32>    mov    %rax,%rdi
   0x555555555240 <main(int, char**)+35>    callq  0x555555555175 <fill(wchar_t*, unsigned long)>
   0x555555555245 <main(int, char**)+40>    mov    $0x0,%eax
   0x55555555524a <main(int, char**)+45>    leaveq
  >0x55555555524b <main(int, char**)+46>    retq

The segmentation fault is triggered at the retq, which tries to hand control back to the calling function. It’s a shortcut for popping the current value from the stack and jumping to that address. So let’s inspect the current value on the stack:

(gdb) x $rsp
0x7fffffffda48: 0x0000006c

That doesn’t look right. We could jump back to the beginning of the function (or even to the calling code, if we break in _start instead of main) and see what that value should be. If you’re interested in this kind of low-level stuff, just try it out. To keep this post a little shorter, we’ll assume that such a low number is unlikely to be a valid return address and that someone clearly overwrote the stack. The nice thing is that we can now set a watchpoint for writes to this address and let our program run backward:

(gdb) set can-use-hw-watchpoints 0
(gdb) watch *0x7fffffffda48
Watchpoint 2: *0x7fffffffda48
(gdb) reverse-continue
Watchpoint 2: *0x7fffffffda48
__memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:371
(gdb) backtrace
#0  __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:371
#1  0x000055555555521a in fill (dst=0x7fffffffda20 L"Hello, W\x555552c0啕\xf7a34e3b翿𑰀", sz=32) at stack-smasher.cc:9
#2  0x0000555555555245 in main () at stack-smasher.cc:15

Here we have the culprit: The fill function calls wcsncpy, which in turn overwrites the return address on the stack. We narrowed down the error to this invocation and it’s now much easier to track it down to the incorrect use of the sizeof operator. Time travel debugging enabled us to find the bug in a systematic way without needing to guess and restart the program several times.
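The garbage address from the first backtrace, 0x000000640000006c, can be checked against this explanation. The following sketch assumes the stack layout visible in the disassembly (the buffer at -0x20(%rbp) and, as usual on x86-64, the return address 8 bytes above %rbp), a 4-byte wchar_t, and little-endian byte order. The helper smashed_return_address and the constants are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumptions taken from the disassembly and the x86-64 Linux ABI:
// the buffer starts 0x20 bytes below %rbp, the return address sits
// 8 bytes above %rbp, and wchar_t is a 4-byte little-endian value.
constexpr std::size_t kWideCharSize = 4;
constexpr std::size_t kBufferToReturnAddress = 0x20 + 8;  // 40 bytes

// Index of the first message character that lands on the return address.
constexpr std::size_t kFirstIndex = kBufferToReturnAddress / kWideCharSize;

// Hypothetical helper: the 64-bit value formed by the two adjacent wide
// characters of `msg` that cover the return address slot, read back as
// a little-endian address. `msg` needs at least kFirstIndex + 2 characters.
constexpr std::uint64_t smashed_return_address(const wchar_t* msg) {
  return static_cast<std::uint64_t>(msg[kFirstIndex]) |
         (static_cast<std::uint64_t>(msg[kFirstIndex + 1]) << 32);
}
```

Under these assumptions, kFirstIndex is 10, so the return address is overwritten by the characters 'l' (0x6c) and 'd' (0x64) of "World". smashed_return_address(L"Hello, World!") evaluates to 0x000000640000006c, which is exactly where GDB reported that the program stopped.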

Note: Hardware watchpoints don’t seem to work with GDB’s process record feature, although GDB accepts them silently, so we have to make sure software watchpoints are used by running set can-use-hw-watchpoints 0 first.
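With the cause identified, the fix is to pass the number of elements instead of the number of bytes. Here’s one possible corrected version as a sketch. The guaranteed null terminator and the fill_array wrapper aren’t part of the original sample. They’re additions for this illustration, meant to show how the size can be derived from the array type so that it can’t be passed incorrectly.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cwchar>

// Copies as much of the message as fits into `dst`, which has room for
// `count` wide characters (not bytes), and always null-terminates it.
void fill(wchar_t* dst, std::size_t count)
{
  assert(dst != nullptr);
  if (count == 0) return;
  const wchar_t msg[] = L"Hello, World!";
  // Reserve one element for the terminator (an addition to the sample).
  const std::size_t n = std::min(std::wcslen(msg), count - 1);
  std::wcsncpy(dst, msg, n);
  dst[n] = L'\0';
}

// Illustrative wrapper: takes the element count from the array itself,
// so the caller never has to choose between sizeof and size.
template <std::size_t N>
void fill_array(std::array<wchar_t, N>& buffer)
{
  fill(buffer.data(), buffer.size());
}
```

For the sample’s std::array<wchar_t,8> buffer, calling fill_array(buffer) or fill(data(buffer),size(buffer)) stores the first seven characters of the message followed by the terminator, and the stack stays intact.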

rr

With rr, we first need to capture an execution trace that can be replayed later:

$ rr ./stack-smasher
[FATAL /home/roc/rr/rr/src/PerfCounters.cc:247:get_cpu_microarch()] AMD CPUs not supported.
For Ryzen, see https://github.com/mozilla/rr/issues/2034.
For post-Ryzen CPUs, please file a GitHub issue.

Sigh. OK, switching to an Intel machine:

$ rr ./stack-smasher --args
rr: Saving execution to trace directory `/home/ubuntu/.local/share/rr/stack-smasher-0'.
Segmentation fault

We now have a trace of the entire program execution recorded on disk, and we can replay it in the debugger:

$ rr replay
GNU gdb ..
0x00007f7e27f9a090 in _start () from /lib64/ld-linux-x86-64.so.2
(rr)

From here on, it’s basically the same as with GDB’s record functionality, except we don’t have to start recording, and the reverse commands are available right away. Notably, hardware watchpoints worked this time, so set can-use-hw-watchpoints 0 wasn’t needed. On Intel CPUs, rr seems to work out of the box, and we didn’t need any of the SIMD/AVX hackery. It doesn’t work on AMD, though, which is a bummer.

Limitations

Traces can quickly eat up disk space: Capturing a trace with WinDbg for even a few minutes can add up to several gigabytes. GDB’s process record feature slows down program execution by several orders of magnitude. rr, on the other hand, made very low runtime overhead one of its design goals and claims to add less than 20 percent to the execution time in typical scenarios.

Architecture support is another issue. rr not working on AMD CPUs (with an issue open since 2017), ARM support being very unlikely, and GDB not supporting the AVX instructions of modern CPUs all limit adoption. As a result, many users can’t simply capture a trace of their program: They may have the wrong CPU, or they may need to compile a version of their program that doesn’t use AVX instructions. Multithreading is one more thing to note: rr allows only one thread to run at any given time, which severely slows down heavily parallelized applications.

Conclusion

The advantages of time travel debugging may not outweigh its limitations enough to justify day-to-day use, but it’s a handy tool worth having in your toolbox to catch some tricky bugs. rr’s idea of running all tests with tracing enabled in your CI setup sounds practical and might be useful for certain projects, and commercial products like UndoDB/LiveRecorder and TotalView claim that time travel debugging will cut your development costs.

Author
Markus Woschank, Core Engineer

Markus loves to build things. He’s learned way too many things about document formats and text systems since he joined in 2019.
