Reverse Engineering UIKit to Fix Our Top Crash
Our most recent releases of PSPDFKit for iOS introduced many new features, such as Electronic Signatures, Instant Comments, and the revamped undo and redo architecture. But we’ve also been consistently working hard on under-the-hood enhancements and bug fixes. One such improvement was fixing an extremely hard-to-reproduce crash that, incidentally, was our top reported crash for a very long time.
I took on the challenge of hunting down the root cause of this crash and addressing it. This blog post describes my process of identifying it, reverse engineering UIKit to understand what was going on, and coming up with a fix.
Compiling Information to Reproduce It
We had been gathering information about this crash for a long time before I took it on. The issue had been open for more than a year, and we had multiple engineers attempting to resolve it. Eventually, we managed to gain a vague understanding of the conditions that caused it.
In short, the crash occurred when viewing documents and changing pages using the page curl transition style; it was caused by an assertion failure inside UIPageViewController. We knew that a set of specific conditions had to be satisfied for it to occur:
- Configuring the view controller to use a page curl transition style and double-page mode
- Putting the device in landscape orientation
- Going to the beginning of the document with the first page on the right
- Double-tapping near the center of the screen to zoom in
- Swiping left from near the center of the screen to go to the next page
Even when repeating the above steps, the chance of stumbling upon the crash was still very low: It took multiple attempts of zooming in and randomly swiping to trigger it. I knew I had to gather more information.
Looking at the Crash Logs
So next, I logged into our crash reporting service to collect the symbolicated .crash files. I chose one of them and found three clues inside: the device hardware model (iPad11,6), the iOS version (14.4), and where the exception was raised:
-[UIPageViewController _validatedViewControllersForTransitionWithViewControllers:animated:] + 516 (UIPageViewController.m:1224)
The above means that the exception was raised in the UIPageViewController class, in the _validatedViewControllers… method, and at offset 516, which corresponds to line 1224 in UIPageViewController.m. Without access to UIKit’s source files, the file name and line number are meaningless. But what about the offset?
Hopping into Hopper
To find out what exact condition was causing the exception to be raised, I needed to disassemble UIKit — and not just any UIKit — the exact UIKit binary from iOS 14.4 compiled for iPad11,6. I went to the IPSW Downloads website and found the specific firmware archive I needed. I downloaded it, unzipped it, and mounted the largest DMG I found inside, having correctly assumed that it must contain the file system. 🛎
On iOS, framework binaries can’t be found at their classic /System location. Instead, they’re all packed into a shared DYLD cache at /System.
Fortunately, Hopper, my disassembler of choice, has supported disassembling DYLD caches for quite some time! I opened it and was presented with the choice of a concrete framework to disassemble.
Choosing UIKit would lead me nowhere. That’s because, beginning with iOS 12, Apple started using UIKit as a wrapper around UIKitCore, where all the implementation now lives. So I went with the latter and left my desk to make myself a cup of coffee. The disassembly process takes time.
After Hopper was done disassembling, I searched for the _validatedViewControllers… symbol in the Procedures tab, selected Navigate → Go To Offset in Procedure, and entered 516, the value from the crash report. I landed at a line that raises an exception with the exact message we were seeing.
Unfortunately, the surrounding pseudocode Hopper produced wasn’t helpful: It involved too many incomprehensible instructions that were hard to reason about. It turned out that the disassembly of Apple Silicon binaries generally produces a more cryptic pseudocode than the disassembly of Intel binaries.
Rather than spend more time on an evident dead end, I disassembled the UIKit binary built for the iOS Simulator instead. It can be found at Xcode.app.
I searched for _validatedViewControllers… again, but due to the architectural differences between Apple Silicon and Intel CPUs, offset 516 no longer pointed to anything useful. So I just looked for the exception message and found the equivalent offset myself.
At this point, I was looking at something that was a bit more understandable but still contained a large amount of noise that needed to be deciphered.
Some Disassembly Required
There are two key aspects to understanding the disassembled pseudocode: how methods are called in Objective-C, and how conditions are compiled. Knowledge of assembly isn’t actually that important!
In short, the job of any compiler is to transform code written in a programming language into a series of instructions that a processor understands. The most common instructions manipulate data placed in registers, which are units of memory that are incredibly fast to access. There are different types of registers, but general-purpose registers are the most important for reverse engineering. On all modern Apple devices, each one can hold up to 64 bits of data. Intel processors have 16 of them: rax, rbx, rcx, rdx, rdi, rsi, rsp, rbp, and r8–r15. Apple Silicon processors have nearly twice as many.
When the Objective-C compiler encounters a method call, it follows a strict set of rules, known as the calling convention, that dictate how arguments and return values are passed between the caller and the method being called. For Intel processors, these rules are:
- rdi holds the object (or class) receiving a message: it’s self, the first implicit argument of every Objective-C method.
- rsi is the selector being sent: the second implicit argument of every Objective-C method, also known as _cmd.
- rdx, rcx, r8, and r9 contain the first four actual arguments of a method.
- The remaining arguments are placed on the stack.
- If a method returns a value, it’ll be put in rax.
Values that are 64 bits or smaller — like pointers, integers, or Booleans — occupy one register and are relatively easy to track. For larger values and structs… it gets complicated. Sometimes they’re split between multiple registers, and sometimes they’re placed on the stack.
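To make the two implicit arguments more concrete, here is a minimal C sketch that pictures a method implementation as a plain function whose first two parameters are self and _cmd. The Greeter struct, the Selector typedef, and name_at_index_imp are hypothetical names used only for illustration; the real Objective-C runtime has its own types and dispatches through objc_msgSend.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the receiver and selector types. */
typedef struct Greeter {
    const char *const *names;
    size_t count;
} Greeter;
typedef const char *Selector;

/*
 * A method body pictured as a C function. Under the Intel convention
 * described above, `self` would arrive in rdi, `_cmd` in rsi, and the
 * first explicit argument (`index`) in rdx. The pointer-sized return
 * value would leave in rax.
 */
static const char *name_at_index_imp(const Greeter *self, Selector _cmd,
                                     size_t index) {
    (void)_cmd; /* Passed on every call, though this body never reads it. */
    assert(self != NULL && index < self->count);
    return self->names[index];
}
```

Calling name_at_index_imp(&greeter, "nameAtIndex:", 1) mirrors the shape of a message send: receiver first, selector second, explicit arguments after that.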
With that in mind, it’s best to see the calling convention in action. Imagine the following method:
- (nullable NSString *)greetPersonAtIndex:(NSInteger)index {
    NSArray<NSString *> *names = self.names;
    return [NSString stringWithFormat:@"Hello, %@!", names[index]];
}
According to the calling convention described above, the compiler should emit a set of assembly instructions equivalent to the following pseudocode:
// Hold on to `self` and the index argument.
r10 = rdi;
r11 = rdx;

// Call `[self names]`.
rdi = r10;
rsi = @selector(names);
rax = _objc_msgSend(rdi, rsi);

// Call `names[index]`.
rdi = rax;
rsi = @selector(objectAtIndexedSubscript:);
rdx = r11;
rax = _objc_msgSend(rdi, rsi, rdx);

// Call `[NSString stringWithFormat:]`.
rdi = _OBJC_CLASS_$_NSString;
rsi = @selector(stringWithFormat:);
rdx = @"Hello, %@!";
rcx = rax;
rax = _objc_msgSend(rdi, rsi, rdx, rcx);

// Return the last set return value.
return rax;
A disassembler like Hopper will often recognize the pattern above and try to produce a more legible pseudocode resembling actual method calls. But that’s not always possible, and the ability to recognize the pattern manually and fill in the necessary gaps is an extremely useful skill when reverse engineering.
Now, imagine the following, safer version of the above method that checks the array’s bounds before accessing an object at an arbitrary index:
- (nullable NSString *)greetPersonAtIndex:(NSUInteger)index {
    NSArray<NSString *> *names = self.names;
    if (index < names.count) {
        return [NSString stringWithFormat:@"Hello, %@!", names[index]];
    } else {
        return nil;
    }
}
Given a method with more than one execution path, the compiler will chop it up into labeled sections that can be either explicitly entered using goto in pseudocode, or by falling through from earlier code. In the case of the example above, the compiler should emit assembly instructions similar to the following pseudocode:
// Hold on to `self` and the index argument.
r10 = rdi;
r11 = rdx;

// Call `[self names]` and hold on to the return value.
rdi = r10;
rsi = @selector(names);
rax = _objc_msgSend(rdi, rsi);
r12 = rax;

// Call `[names count]`.
rdi = r12;
rsi = @selector(count);
rax = _objc_msgSend(rdi, rsi);

// If the count is less than or equal to the index argument, jump to
// `loc_3`. Otherwise, fall through to `loc_1`.
if (rax <= r11) goto loc_3;

loc_1:
// Call `names[index]`.
rdi = r12;
rsi = @selector(objectAtIndexedSubscript:);
rdx = r11;
rax = _objc_msgSend(rdi, rsi, rdx);

// Call `[NSString stringWithFormat:]`.
rdi = _OBJC_CLASS_$_NSString;
rsi = @selector(stringWithFormat:);
rdx = @"Hello, %@!";
rcx = rax;
rax = _objc_msgSend(rdi, rsi, rdx, rcx);

// Fall through to `loc_2`.

loc_2:
// Return the last set return value.
return rax;

loc_3:
// Set the return value to `nil` and jump to `loc_2`.
rax = 0x0;
goto loc_2;
Note how loc_1 can only be entered by falling through from above, loc_3 can only be entered explicitly, and loc_2 can be entered both ways. Understanding this basic mechanism, along with the calling convention, should be enough for now.
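The same shape can be written by hand in C, which makes it easy to step through in a debugger. The sketch below is a hypothetical C stand-in for the bounds-checked method (name_or_null is not part of any real API, and it skips the string formatting). It keeps the inverted condition and the label layout from the pseudocode above; loc_1 appears only as a comment because nothing jumps to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C stand-in for the bounds-checked method above. */
static const char *name_or_null(const char *const *names, size_t count,
                                size_t index) {
    const char *result;
    assert(names != NULL);

    /* Inverted condition, as in the pseudocode: leave when out of bounds. */
    if (count <= index) goto loc_3;

    /* loc_1: reachable only by falling through from the check above. */
    result = names[index];
    /* Fall through to loc_2. */

loc_2:
    /* Reachable both by falling through and by the jump from loc_3. */
    return result;

loc_3:
    /* Reachable only through the explicit jump. */
    result = NULL;
    goto loc_2;
}
```

Reading compiler output often comes down to undoing this transformation: find the inverted condition, then fold the labeled sections back into an if/else.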
Making Sense of Things
Armed with the above knowledge, I went back to the _validatedViewControllers… method and attempted to make sense of the pseudocode I was seeing:
// Hold on to the arguments.
r13 = rdi;
r14 = rdx;
rbx = rcx;

// If `self->_transitionStyle` is `.pageCurl` (raw value of `0`), jump
// to `loc_447019`. What happens otherwise isn't relevant.
rax = *ivar_offset(_transitionStyle);
rax = *(r13 + rax);
if (rax == 0x0) goto loc_447019;

loc_447191:
// Get the count of view controllers. Note how `rsi` (the `count`
// selector) is set in `loc_447050` and `loc_447187`.
rdi = r14;
rax = _objc_msgSend(rdi, rsi);

// Raise the exception. Note how `r12` (required count) is set in
// `loc_447050` and `loc_447187`.
rdi = _OBJC_CLASS_$_NSException;
rsi = @selector(raise:format:);
rdx = **_NSInvalidArgumentException;
rcx = @"The number of view controllers provided (%ld) doesn't match the number required (%ld) for the requested transition";
r8 = rax;
r9 = r12;
rax = _objc_msgSend(rdi, rsi, rdx, rcx, r8, r9);

// Even though an exception was just raised, every method must
// return in case the exception is caught or ignored. Therefore,
// jump to `loc_4471d2`.
goto loc_4471d2;

loc_4471d2:
// Return the slice of view controllers. Note how the return value
// is set in `loc_447050`, but it holds gibberish if `loc_447050`
// isn't executed. This is a demonstration of why the return value
// of a method that raised an exception should be ignored.
rax = rbx;
return rax;

loc_447019:
// If the animated argument is `true`, jump to `loc_447021`. What
// happens otherwise isn't relevant.
if (rbx != 0x0) goto loc_447021;

loc_447021:
// Compare `self->_doubleSided` to `1` and subtract the Boolean
// value from `2`, which could result in either `2` or `1`.
CMP(*(int8_t *)&r13->_doubleSided, 0x1);
rbx = 0x2 - 0x0 + CARRY(RFLAGS(cf));

// If the count of the view controllers isn't equal to the result of
// the above subtraction, jump to `loc_447187`. Otherwise, fall through.
if ([r14 count] != rbx) goto loc_447187;

loc_447050:
// Hold on to the difference from `loc_447021`.
r12 = rbx;

// Get the slice of the view controllers in range at location `0`
// and the length equal to the difference from `loc_447021`. Note how
// the `NSRange` value here is split between `rdx` and `rcx`.
rcx = rbx;
rax = [r14 subarrayWithRange:0x0];
rbx = rax;

// Prepare the selector for `_objc_msgSend` and jump to `loc_4471d2`.
rsi = @selector(count);
goto loc_4471d2;

loc_447187:
// Hold on to the difference from `loc_447021`.
r12 = rbx;

// Prepare the selector for `_objc_msgSend` and jump to `loc_447191`.
rsi = @selector(count);
goto loc_447191;
I decided to translate the above pseudocode into actual Objective-C code to better reason about what was going on inside:
- (NSArray<UIViewController *> *)_validatedViewControllersForTransitionWithViewControllers:(NSArray<UIViewController *> *)viewControllers animated:(BOOL)animated {
    if (self->_transitionStyle == UIPageViewControllerTransitionStylePageCurl) {
        if (animated) {
            NSUInteger requiredCount = self->_doubleSided ? 2 : 1;
            if (viewControllers.count == requiredCount) {
                return [viewControllers subarrayWithRange:NSMakeRange(0, requiredCount)];
            } else {
                // The disassembly sends `raise:format:`, not `raise:reason:`.
                [NSException raise:NSInvalidArgumentException format:@"The number of view controllers provided (%ld) doesn't match the number required (%ld) for the requested transition", (long)viewControllers.count, (long)requiredCount];
            }
        } else {
            /* irrelevant */
        }
    } else {
        /* irrelevant */
    }
}
From what I was seeing, this method was just asserting the correctness of its arguments. I enabled a symbolic breakpoint and attempted to reproduce the crash. Then, using LLDB, I promptly confirmed that the viewControllers argument was indeed nil. This meant I had to go up the backtrace to find where the invalid values were coming from.
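The CMP and CARRY pair in the pseudocode is the least obvious part of that translation, so here is one plausible reading of it as a C sketch. It assumes the usual x86 behavior, where an unsigned compare sets the carry flag when the left operand is smaller than the right one, and it follows the pseudocode comment in treating the flag as something subtracted from 2. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed semantics: comparing an unsigned byte with 1 sets the carry
 * flag exactly when the byte is below 1, that is, when it is 0. */
static unsigned carry_after_cmp_with_one(uint8_t value) {
    assert(value <= 1); /* The input is expected to be a Boolean byte. */
    return value < 1 ? 1u : 0u;
}

/* Branch-free form: subtract the borrow from 2. */
static unsigned required_count_branchless(bool double_sided) {
    return 2u - carry_after_cmp_with_one((uint8_t)double_sided);
}

/* The readable form used in the reconstructed Objective-C. */
static unsigned required_count_ternary(bool double_sided) {
    return double_sided ? 2u : 1u;
}
```

Both forms agree for either input, which is why the ternary is a reasonable way to write the condition back in source form.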
Reverse Engineering the Page View Controller
To help me with this task, I overrode all private methods I saw in the backtrace, with the help of class-dumped runtime header files, which provided some hints regarding the types of arguments and return values. I translated all of them into actual Objective-C code using the method described above. That allowed me to more easily set up breakpoints and track data going in and out of the methods, greatly improving my debugging experience.
My next steps mostly consisted of playing cat and mouse with UIKit, featuring tons of logs, breakpoints, stepping in and out of dozens of methods, printing register values, and frantically manipulating arguments and return values. This was the most time-consuming part of my investigation: It took multiple sessions over the course of several days.
At some point, I noticed that the gesture recognizer calls a method on UIPageViewController that asks its dataSource for view controllers. From what I understood, the nil argument causing the exception to be raised was the return value of this method. I looked at our dataSource implementation and confirmed that our logic was correct: We returned nil only if UIPageViewController asked for a view controller at an index that was out of bounds of a document.
I had more questions than answers. Why was UIPageViewController requesting a page in a reverse direction when swiping forward? And how does it handle its pan gesture? I disassembled the responsible pan gesture recognizer handler:
- (void)_handlePanGesture:(UIPanGestureRecognizer *)recognizer {
    if (self->_panGestureRecognizer != nil && recognizer == self->_panGestureRecognizer) {
        if (recognizer.state == UIGestureRecognizerStateBegan) {
            UIPageViewControllerNavigationDirection direction = UIPageViewControllerNavigationDirectionForward;
            if ([self _shouldBeginNavigationInDirection:&direction inResponseToPanGestureRecognizer:recognizer] == YES && /* irrelevant */) {
                NSArray<UIViewController *> *viewControllers = [self _incomingViewControllersForGestureDrivenCurlInDirection:direction];
                CGPoint location = [recognizer locationInView:self.view];
                [self _setViewControllers:viewControllers withCurlOfType:/* irrelevant */ fromLocation:location direction:direction animated:YES notifyDelegate:NO completion:/* irrelevant */];
            } else {
                /* irrelevant */
            }
        } else {
            /* irrelevant */
        }
    } else {
        /* irrelevant */
    }
}
In the course of my investigation, I’d already learned that the _incomingViewControllers… method was being called with an incorrect direction, so the culprit must have been the return value of the _shouldBeginNavigation… method. So I disassembled that as well:
- (BOOL)_shouldBeginNavigationInDirection:(UIPageViewControllerNavigationDirection *)direction inResponseToPanGestureRecognizer:(UIPanGestureRecognizer *)recognizer {
    if (self->_transitionStyle == UIPageViewControllerTransitionStylePageCurl) {
        if ([self _shouldNavigateInDirection:direction inResponseToVelocity:NULL ofGestureRecognizedByPanGestureRecognizer:recognizer] == YES) {
            return YES;
        } else {
            if (self->_navigationOrientation == UIPageViewControllerNavigationOrientationHorizontal) {
                CGPoint translation = [recognizer translationInView:self.view.superview];
                /* tons of floating point operations */
                if (/* the result of floating point operations */ > 0) {
                    *direction = /* the result of floating point operations */;
                    return YES;
                } else {
                    return NO;
                }
            } else {
                /* irrelevant */
            }
        }
    } else {
        /* irrelevant */
    }
}
Two things struck me. First, it calls another “should begin” method but falls back to its own logic if the other method returns NO. Second, it requests the pan gesture recognizer’s translation vector in the view controller’s superview’s coordinate space. This looked suspicious, so I overrode this method and put an NSLog inside to watch the return values:
- (BOOL)_shouldBeginNavigationInDirection:(UIPageViewControllerNavigationDirection *)direction inResponseToPanGestureRecognizer:(UIPanGestureRecognizer *)recognizer {
    CGPoint translation = [recognizer translationInView:self.view.superview];
    BOOL first = [self _shouldNavigateInDirection:direction inResponseToVelocity:NULL ofGestureRecognizedByPanGestureRecognizer:recognizer];
    BOOL second = [super _shouldBeginNavigationInDirection:direction inResponseToPanGestureRecognizer:recognizer];
    NSLog(@"_shouldBeginNavigationInDirection, translation: %@, first: %i, second: %i, direction: %ld, state: %ld", NSStringFromCGPoint(translation), (int)first, (int)second, (long)(direction != NULL ? *direction : -1), (long)recognizer.state);
    // Return the value computed by the original implementation.
    return second;
}
I attempted to trigger the crash once again and saw that the only log before the exception was raised was the following:
_shouldBeginNavigationInDirection, translation: {1.1368683772161603e-13, 0}, first: 0, second: 1, direction: 1, state: 1
I repeated my tests over and over again and consistently got the same log every single time. I took note of the first value in the translation vector and pasted it in IEEE-754 Floating Point Converter to check if it was anything special.
The result: 2 to the power of -43. Oddly specific. I knew I was onto something.
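The same check can be done without an online converter. The sketch below is a small helper (exact_power_of_two is a hypothetical name) that uses the standard frexp function to split a double into a mantissa and a binary exponent; a positive value is an exact power of two only when that mantissa is exactly 0.5.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Returns 1 and stores n when value equals 2^n exactly; returns 0 otherwise. */
static int exact_power_of_two(double value, int *n) {
    int exponent;
    double mantissa;
    assert(n != NULL);
    /* frexp yields value == mantissa * 2^exponent, with the mantissa's
     * magnitude in [0.5, 1) for any finite nonzero input. */
    mantissa = frexp(value, &exponent);
    if (mantissa != 0.5) return 0;
    *n = exponent - 1; /* 0.5 * 2^e == 2^(e - 1) */
    return 1;
}
```

Passing the logged x component, 1.1368683772161603e-13, stores -43 in n, while a value such as 0.1 is rejected.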
Finding the Cause
At this point, I had a suspicion that the crash was caused by faulty arithmetic on Apple’s side, combined with touch location precision errors caused by the pan gesture recognizer being inside a zoomed scroll view. The value 2^-43 is indeed greater than zero, even though it actually represents zero pixels.
To confirm my theory, I looked at the disassembly of translationInView: in Hopper. This method takes the current location, along with a previously saved initial location of the touch, and it computes a difference between them. I triggered the crash once again and, based on my findings, checked the calculation results in the debugger:
(lldb) e [recognizer _convertPoint:recognizer->_firstSceneReferenceLocation fromSceneReferenceCoordinatesToView:self.view.superview];
(CGPoint) $1 = (x = 564.65827338129475, y = 263.75899280575533)
(lldb) e [recognizer locationInView:self.view.superview]
(CGPoint) $2 = (x = 564.65827338129486, y = 263.75899280575533)
(lldb) e $2.x - $1.x
(double) $3 = 1.1368683772161603e-13
That’s our 2^-43. Even though my touch didn’t move at all, the translation vector’s x component was greater than zero due to good old floating-point precision.
The difference between $1 and $2, even though they both refer to the initial location of the touch, is still beyond my understanding. Perhaps this was sensory imprecision or a side effect of some implementation detail inside gesture recognizers. But it was no longer important.
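Whatever the source of the discrepancy, it is possible to check how small it is relative to the values involved. The sketch below is an illustration rather than an explanation of the root cause. It assumes IEEE-754 binary64 doubles and a positive finite input, and next_double_up is a hypothetical helper that steps to the next representable double by incrementing the bit pattern.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the next representable double above a positive finite value. */
static double next_double_up(double value) {
    uint64_t bits;
    assert(value > 0.0);
    /* For positive finite doubles, adjacent values have adjacent bit
     * patterns, so adding 1 to the bits moves up by one step. */
    memcpy(&bits, &value, sizeof bits);
    bits += 1;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With the x component of $1 as input, next_double_up returns the x component of $2, and the gap between them is exactly 2^-43. That suggests the two locations are neighboring doubles, separated by the smallest nonzero difference that can exist at that magnitude.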
Applying the Fix
This step was only a formality. I overrode the _shouldBeginNavigation… method and fixed the comparison to return NO if any of the translation vector components were less than one point:
- (BOOL)_shouldBeginNavigationInDirection:(UIPageViewControllerNavigationDirection *)direction inResponseToPanGestureRecognizer:(UIPanGestureRecognizer *)recognizer {
    if (ABS([recognizer translationInView:self.view.superview].x) < 1 || ABS([recognizer translationInView:self.view.superview].y) < 1) {
        return NO;
    }
    return [super _shouldBeginNavigationInDirection:direction inResponseToPanGestureRecognizer:recognizer];
}
To be completely sure, I put breakpoints inside the custom conditions and attempted to trigger a crash one more time. The breakpoints were being hit. The crash was gone! 🎉
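Stripped of UIKit, the difference between the two checks fits in a few lines. The sketch below is a hypothetical reduction to a single translation component, not Apple's code or the shipped fix: one predicate accepts anything above zero, and the other requires at least one point of movement in either direction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the original check: any positive value passes. */
static bool passes_sign_check(double translation) {
    return translation > 0.0;
}

/* Hypothetical reduction of the fix: require at least one point of movement. */
static bool passes_threshold_check(double translation) {
    return translation >= 1.0 || translation <= -1.0;
}

/* Self-check using the translation value from the log. */
static void check_logged_translation(void) {
    const double logged = 1.1368683772161603e-13;
    assert(passes_sign_check(logged));       /* Treated as movement. */
    assert(!passes_threshold_check(logged)); /* Treated as noise. */
}
```

The one-point threshold is a judgment call: it only needs to sit comfortably above floating-point noise and below any movement a user would actually notice.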
The final step was to make this fix safer and future-proof by using swizzling instead. We wouldn’t dare to ship code that explicitly overrides a private method, as that would definitely cause problems with Apple’s app review process.
Additional tests confirmed that the crash has indeed been fixed. And we’ve had no new reports of it happening since releasing PSPDFKit 10.3 for iOS.
Conclusion
Floating-point precision errors are vicious. They’re hard to notice, they pile up the more operations you perform, and they’re extremely difficult to investigate. When dealing with floating-point values, round them when it makes sense, and always take accuracy into account when doing comparisons. Let this also be a periodic reminder to not use Float or Double to represent money: you can’t express ⅒ in a base-2 system!
Perseverance is key with bugs like this one. They tend to be very frustrating in the beginning. No matter how much time you allocate for your investigation, you’ll probably exceed it. If stuck, throw a bunch of print statements of seemingly unrelated stuff and look for patterns: computers aren’t that random.
In the end, I hope you’ll find this blog post useful next time you encounter an obscure bug in the iOS SDK, and that the process I described encourages you to give reverse engineering a try.
We emphasize consistently making our products more stable here at PSPDFKit, and we’ll continue doing so as much as we can. You can expect this kind of detailed investigation when you reach us through customer support.
If you’re interested in a PDF solution for your business but you don’t want to deal with the intricacies of building and maintaining complex user interfaces, just reach out to our sales team.