Regular Expression Backtracking on StackOverflow

The Stack Overflow team wrote a really nice postmortem analyzing and explaining the problem, along with a linear-time fix:

http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016

I see quadratic-time regular expressions all the time, and multiple have been submitted that I haven’t posted just because they’re so common. This one is noteworthy because of the scale of the impact it had.

As Rob Pike points out, linking to Russ Cox’s excellent post, these bugs are all the more tragic for being utterly avoidable: building guaranteed-linear-time regex matching is quite simple with a bit of automaton theory.
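
To see the failure mode concretely, here’s a minimal sketch in Python, whose re module is a backtracking matcher like the .NET engine involved in the outage (the actual outage regex also matched U+200C; a plain trailing-whitespace match is enough to show the blow-up):

import re
import time

# A regex in the spirit of the one from the outage: match a run of
# whitespace at the end of the string.
trailing_ws = re.compile(r'\s+$')

for n in (2000, 4000, 8000):
    s = ' ' * n + 'x'  # n spaces, then a non-space: can never match
    t0 = time.perf_counter()
    trailing_ws.search(s)
    # The engine retries \s+$ from each of ~n starting offsets, doing
    # O(n) backtracking work at each one, so doubling n roughly
    # quadruples the runtime.
    print(n, time.perf_counter() - t0)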

accidentally quadratic submission regex

Dolphin Emulator Trampoline Generation

[Ed note: This post was contributed by Dougall Johnson via email, and edited by me, with permission]

Dolphin is a Wii/GameCube emulator. A few years back, it had a bug where it was crashing with the error “Trampoline cache full” during the “Super Smash Bros. Brawl” character selection screen.

Dolphin uses a JIT compiler that translates blocks of PowerPC code from Wii or GameCube games into x86-64 instructions as the game runs. It generally uses a feature called “fastmem” to quickly emulate the Wii’s memory management unit by leveraging the host’s memory management. Fastmem configures a 4GB range of memory to match the Wii’s address space, and PPC memory accesses are translated directly into x86 memory accesses into this region. Certain parts of memory are special and require more complex handling: for example, the FIFO address, which forwards all data written to it to the GPU. Fastmem doesn’t map these parts of memory, so when JIT code tries to write to the FIFO, a fault occurs. Dolphin handles these faults and simulates the behaviour of the special areas of memory (in the FIFO-write example, by buffering the data and sending it to the emulated GPU).

However, fault handling is extremely slow, so Dolphin also “back patches” the faulting instruction to call directly into the full memory-write code (implemented in C). That call requires saving registers and setting up the arguments for the C function, which takes much more space than the single write instruction it replaces, so Dolphin generates a piece of “trampoline” code to do the setup and invoke the C code, and patches the original JIT code to replace the faulting write instruction with a call to the trampoline.

4MB of JIT memory is allocated for trampoline generation (the “trampoline cache”), and this was somehow being exceeded, even though each trampoline is very small and there couldn’t have been more than a few thousand write instructions needing patching in the game code.

However, it turns out that writing to the FIFO is slightly more complicated than it looks. Whenever a write overflows the FIFO buffer, Dolphin adds the address to an internal table (jit->js.fifoWriteAddresses) and triggers a recompilation of the faulting address, which will cause additional exception-handling code to be compiled in. The code that handles this looks like:

void CompileExceptionCheck()
{
    std::unordered_set<u32> *exception_addresses;
    exception_addresses = &jit->js.fifoWriteAddresses;

    if (PC != 0 && (exception_addresses->find(PC) ==
                    exception_addresses->end()))
    {
        int optype = GetOpInfo(Memory::ReadUnchecked_U32(PC))->type;
        if (optype == OPTYPE_STORE || optype == OPTYPE_STOREFP ||
            optype == OPTYPE_STOREPS)
        {
            exception_addresses->insert(PC);

            // Invalidate the JIT block so that it gets recompiled
            // with the external exception check included.
            jit->GetBlockCache()->InvalidateICache(PC, 4);
        }
    }
}

The key detail here is InvalidateICache(PC, 4), which invalidates the compiled code for the entire basic block containing PC (since code is JIT'ed a basic block at a time, there’s no way to invalidate a smaller range).

When the block is next invoked, the JIT compiler re-compiles the code, adding an external exception check because the instruction is in jit->js.fifoWriteAddresses (which is the whole point of the recompilation). Recompiling inadvertently has the side effect of removing any back-patches from the code. Because of the exception_addresses->find(PC) check in the above function, this recompilation can only occur once per distinct instruction that writes to the FIFO.

It turns out that there is a single block of code in Super Smash Bros. Brawl which contains 216 FIFO write instructions. You can see the PPC assembly code here: http://codepad.org/XMRCtzaG

JITing and then executing this block will generate 216 faults (one for each write), 216 back-patches, and 216 trampolines.

However, as the code runs, eventually each of those 216 FIFO writes will happen to be the write that overflows the FIFO. When that happens, it will trigger a recompilation of the entire block, meaning the next execution will result in an additional 216 faults, backpatches, and trampolines.

In the limit, this can lead to 217 compilations of this block (the initial one, plus one per write), each of which results in 216 backpatches, for a theoretical limit of 217 × 216 = 46872 trampolines! This accidentally-quadratic trampoline generation was more than sufficient to blow the 4MB cache.
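
As a back-of-the-envelope check, here’s that worst case as a tiny Python model (an upper bound: writes that have already been recompiled with the exception check included may stop faulting):

W = 216               # FIFO write instructions in the block
compilations = 1 + W  # the initial compile, plus one recompile each
                      # time another write first overflows the FIFO
trampolines = compilations * W  # every compilation discards the old
                                # back-patches, so each write can
                                # fault and emit a fresh trampoline
print(compilations, trampolines)  # 217 46872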

The bug was initially fixed by adding a map lookup to find and reuse existing trampolines. Later, the constant propagation was improved to detect all the FIFO writes ahead of time, avoiding all back patching in this case.

Confusingly, the bug first appeared in Dolphin 4.0-140, a tiny patch that introduced the following code just before generating each write instruction:

// Helps external systems know which instruction triggered the write
MOV(32, M(&PC), Imm32(jit->js.compilerPC));

Prior to this commit, the PC wasn’t updated within the block (only at the start/end), so each block would only have been recompiled once, preventing the excessive trampoline generation. At that time, the character selection screen in Super Smash Bros. Brawl would work with only 1500 trampolines generated. After that commit, the trampoline count would start at around 4000 and quickly grow to over 20000.


For more interesting Dolphin stories, see the Dolphin Progress Reports (https://dolphin-emu.org/blog/), and “Gekko the Dolphin” by Fiora in PoC||GTFO 0x06 (https://www.alchemistowl.org/pocorgtfo/pocorgtfo06.pdf — 97MB PDF).

[Ed note: I love the war stories that come from the Dolphin team. Emulators are amazing pieces of software, and the Dolphin team writes some awesome stories of reverse-engineering and low-level debugging to figure out how the hell the platform they’re emulating ever worked in the first place.]

accidentally quadratic dolphin jit submission

Apache Spark

Apache Spark is a cluster computing engine, essentially an alternative computation model to MapReduce for executing jobs across large clusters.

Spark’s scheduler stores pending work on a number of arrays, to keep track of which work is available to be executed where in the cluster.

In Spark 1.6, a bug was introduced that meant that any time Spark added a new task to one of these arrays, it would first exhaustively search the array to ensure it wasn’t re-adding a duplicate.

Since the array search is O(n), this resulted in O(n²) behavior to schedule n tasks.

Like all good accidentally-quadratic bugs, this went unnoticed until some poor user tried to submit 200k jobs to their Spark install in one go, and observed it hanging effectively forever.

It turns out that the code didn’t actually require the uniqueness guarantee in the array, so the fix was simply to remove the check. (Of course, if they had needed the check, keeping a secondary hash map, or turning the array into an ordered hash map, would likely have worked just as well).
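
For illustration, here’s the shape of the bug and of the hypothetical hash-based fix as a Python sketch (Spark itself is Scala, and these names are made up):

pending = []

def add_task_buggy(task):
    # O(n) membership scan on every insert: O(n^2) to schedule n tasks.
    if task not in pending:
        pending.append(task)

# Had the uniqueness guarantee actually been needed, a secondary hash
# set would have restored O(1) inserts:
seen = set()

def add_task_with_set(task):
    if task not in seen:  # O(1) expected hash lookup
        seen.add(task)
        pending.append(task)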

(Thanks to @leifwalsh for bringing this one to my attention)

accidentally quadratic apache spark bigdata

node.js left-pad

If you’re a programmer, there’s a good chance you noticed the node.js left-pad fiasco of a few weeks back that temporarily broke most of the npm ecosystem.

This blog doesn’t have an opinion on any of that. However, in the ensuing kerfuffle, several people observed that left-pad appears to be quadratic, and that is definitely this blog’s concern.

If we look at the code, we see that the main loop looks like so:

while (++i < len) {
  str = ch + str;
}

In most languages’ string implementations, this would definitely be quadratic — + copies the entire string (of size O(len)) on each iteration, and this loop runs O(len) times.
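
As a sanity check on that claim, here’s a rough Python transliteration of the loop; CPython has no rope optimization for building a string by prepending, so the quadratic behavior shows up directly:

import time

def leftpad(s, length, ch=' '):
    # The same shape as the JS loop above: prepend one character at a
    # time. Each `ch + s` allocates and copies a brand-new string.
    while len(s) < length:
        s = ch + s
    return s

for n in (25_000, 50_000, 100_000):
    t0 = time.perf_counter()
    leftpad('hi', n, 'x')
    print(n, time.perf_counter() - t0)  # roughly 4x per doubling of n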

However, I happen to know that modern JS implementations have a whole pile of complex String optimizations, including rope-like data structures. And this is an evidence-driven blog, so I decided to take a deeper look. And what I found is fascinating and bizarre and I honestly can’t explain it yet.

The benchmarks

But let’s start with something easy. I ran a benchmark in rhino, which I judged to be the Javascript implementation I had easy access to that was least likely to have super-advanced string optimizations. We can clearly see the telltale quadratic curve:

But now let’s try something a bit more sophisticated, which also happens to be my primary browser: Chrome. Running a left-pad benchmark in Chrome yields the following result:

There’s a bit of anomalous behavior, especially at small sizes, but it makes a compelling case for being linear out through the 1MB limit I ran the test over!

(I’m running Chrome Version 50.0.2661.57 beta (64-bit); your results may well vary with Chrome version!)

And in fact, the Chrome developer tools will let us see the rope structure that Chrome has used to make these concatenations efficient! If we leftpad('hi', 100000, 'x') in a Chrome console and then take a heap snapshot, we can see that the string is a “concatenated string” made up of a huge number of chunks linked together:

(The downside of this optimization, as you can also see from that screenshot, is that our 100kb string now consumes 4MB of RAM…)

Moving on, we can also try Safari, another modern browser with an incredibly sophisticated Javascript engine. In Safari, we see this odd behavior:

It’s a bit hard to be sure, but the data appears to fit two linear fits, with a cutover somewhere around 350k. This pattern was reproducible across multiple experiments, but I don’t have an explanation.

I also decided to try node, since that is the actual runtime that NPM primarily targets, after all. node runs the same v8 engine as Chrome, so we’d expect similar behavior, but version skew, configuration, or who knows what could cause divergence. And in fact, we see:

Oddly, it also seems to exhibit two separate linear regimes!

At this point, I’m out of energy for what was meant to be a short post, but if anyone can follow up and explain what’s happening, I’d be terribly curious.

In Conclusion

This adventure turned out to be another excellent demonstration of a principle I’ve expounded on frequently in this blog: quadratic behavior (or, in this case, the lack thereof!) often results from complex interactions between pieces of code operating at different layers of abstraction. In this case, the action-at-a-distance saved us: modern Javascript VMs, in their sophistication, were able to run what looked like extremely quadratic code in linear time, via a rope-like linked structure! But the principle remains: it was actually completely impossible to analyze left-pad’s performance without a deep understanding of the specific underlying Javascript VM we cared about (or, in my case, without resorting to brute experiment!).

accidentally quadratic left-pad node javascript js

Linux `/proc/pid/maps`

The /proc/ filesystem, if you’re not familiar with it, is a magical place full of all kinds of useful debugging tools for introspecting (and modifying) the state of a Linux machine – especially for inspecting other processes.

/proc/<pid>/maps shows, for any process on the system, a list of all of the memory mappings in its address space. For a simple cat execution on my machine, it looks like:

[nelhage@nelhage:~]$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 09:01 254164                             /bin/cat
0060a000-0060b000 r--p 0000a000 09:01 254164                             /bin/cat
0060b000-0060c000 rw-p 0000b000 09:01 254164                             /bin/cat
007dc000-007fd000 rw-p 00000000 00:00 0                                  [heap]
7fc8e702b000-7fc8e72f4000 r--p 00000000 fc:00 4861                       /usr/lib/locale/locale-archive
7fc8e72f4000-7fc8e74af000 r-xp 00000000 09:01 223514                     /lib/x86_64-linux-gnu/libc-2.19.so
7fc8e74af000-7fc8e76ae000 ---p 001bb000 09:01 223514                     /lib/x86_64-linux-gnu/libc-2.19.so
7fc8e76ae000-7fc8e76b2000 r--p 001ba000 09:01 223514                     /lib/x86_64-linux-gnu/libc-2.19.so
7fc8e76b2000-7fc8e76b4000 rw-p 001be000 09:01 223514                     /lib/x86_64-linux-gnu/libc-2.19.so
7fc8e76b4000-7fc8e76b9000 rw-p 00000000 00:00 0
7fc8e76b9000-7fc8e76dc000 r-xp 00000000 09:01 221267                     /lib/x86_64-linux-gnu/ld-2.19.so
7fc8e78cc000-7fc8e78cf000 rw-p 00000000 00:00 0
7fc8e78d9000-7fc8e78db000 rw-p 00000000 00:00 0
7fc8e78db000-7fc8e78dc000 r--p 00022000 09:01 221267                     /lib/x86_64-linux-gnu/ld-2.19.so
7fc8e78dc000-7fc8e78dd000 rw-p 00023000 09:01 221267                     /lib/x86_64-linux-gnu/ld-2.19.so
7fc8e78dd000-7fc8e78de000 rw-p 00000000 00:00 0
7fffd912a000-7fffd914b000 rw-p 00000000 00:00 0                          [stack]
7fffd91a9000-7fffd91ab000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Notably, /proc/$pid/maps attempts to label each mapping with information about its provenance. For regions that map files, the kernel has to track this information directly anyway, but for regions like the [heap] and [stack] maps, the labeling is more heuristic.

As of kernel 3.3, the kernel attempts to mark the stack of each thread in a process as [stack:<thread-id>].

However, the code that checks whether a mapping is a stack does so by looping over every thread in the process, checking whether that thread’s stack pointer points into the mapping.

Since each thread will typically have its own stack, and each stack is typically its own mapping, this means that enumerating /proc/$pid/maps is quadratic in the number of threads in a process. For most applications this isn’t a big deal, but if we consider, say, a database that uses a thread per connection and has tens of thousands of connections, we can rapidly get very sad…
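
Here’s the shape of the problem as a Python sketch (the real code is C, in the kernel’s procfs implementation):

from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    sp: int  # current stack pointer

@dataclass
class Mapping:
    start: int
    end: int

def stack_tid(m, threads):
    # The per-mapping check: scan every thread in the process.
    for t in threads:
        if m.start <= t.sp < m.end:
            return t.tid
    return None

def label_maps(mappings, threads):
    # One stack mapping per thread means O(threads) mappings, each
    # scanned against O(threads) threads: quadratic overall.
    return [(m, stack_tid(m, threads)) for m in mappings]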

This bug was noticed fairly recently, and it looks like it will be resolved in the next kernel release by just removing the offending code: The feature isn’t worth the cost, and userspace can always look at threads individually if it really needs this information.

linux kernel accidentally quadratic memory procfs

Ruby `parser` gem

The ruby parser gem is an implementation of a Ruby parser in Ruby itself, which allows for Ruby code to parse and introspect Ruby code. It’s used in a number of places, but perhaps most prominently by rubocop, a Ruby linter and style checker.

In parser versions <= 2.3.0.4, the parser is quadratic in the length of the input string for any input containing at least one Unicode codepoint outside the ASCII range.

The problem arises initially in the lexer, which turns the source input into a sequence of tokens. The lexer is implemented using ragel to generate a state machine that processes the input sequentially, generating tokens as it goes.

The problem comes when the lexer attempts to extract tokens and return them to its caller. As it lexes, it keeps track of character offsets, and it returns tokens via a character-range slice of the input:

def tok(s = @ts, e = @te)
  source = @source[s...e]
  return source unless @need_encode
  source.encode(@encoding)
end

The problem arises because the input, at this point, has been UTF-8-encoded. Because UTF-8 is a variable-length encoding, finding the nth character requires a linear traversal to skip over one codepoint at a time. Therefore, @source[s...e] is linear in the end position. Since the input contains O(n) tokens, and each one requires an O(n) scan to extract, just extracting the tokens requires quadratic time.
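
To make the linear traversal concrete, here’s a Python sketch of finding the n-th character’s byte offset in UTF-8 data (CPython’s own str sidesteps this with fixed-width storage, so we operate on raw bytes):

def nth_char_offset(buf, n):
    # Walk n codepoints from the front: O(n); there's no shortcut.
    i = 0
    for _ in range(n):
        i += 1
        # Skip this codepoint's continuation bytes (0b10xxxxxx).
        while i < len(buf) and buf[i] & 0xC0 == 0x80:
            i += 1
    return i

s = 'héllo wörld'.encode('utf-8')
print(nth_char_offset(s, 4))  # byte offset of the fifth character: 5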

The fix, which I implemented after discussion with the authors, was to re-encode any non-ASCII input into UTF-32 before processing. UTF-32 imposes a significant memory overhead, but it obviously admits O(1) character indexing, and the tradeoff is well worth avoiding quadratic behavior – on a 1500-line test case from the Stripe codebase, the change dropped parsing from about 2.5s to about 900ms.

I’m fond of this one because of how subtle it was. I’m obviously well-attuned to being on the lookout for these issues, but even after my profiler showed 50% of the parser’s time in tok, it took me a long time to realize what was happening. I eventually had to run it under Linux perf, and see a large amount of time in str_utf8_nth, before it clicked.

And it’s subtle, too, because of Ruby’s “clever” handling of strings: Even if a string is marked as UTF-8-encoded, Ruby knows internally if it happens to only contain 7-bit codepoints, and optimizes access to O(1) in that case. And so the exact same code, on the exact same “shape” of data (a UTF-8-encoded string) could change dramatically in time complexity by the modification of a single character anywhere in the input!

ruby parser utf8 accidentally quadratic strings

GHC Derived Foldable and Traversable Instances

(Boy is there a lot of Haskell here now. What’s up with that?)

This one comes by way of Shachaf Ben-Kiki, who found it originally and mentioned it to me.

Data.Traversable and Data.Foldable are two Haskell typeclasses that express related notions of a data structure being “list-like” in the sense that it contains zero or more elements, and an externally-provided function can be mapped over those elements in some way.

GHC provides support (as a language extension) for automatically “deriving” instances of these typeclasses for algebraic data types; essentially, this asks the compiler to generate the “obvious” code to traverse your data structure in the appropriate way.

In GHC prior to 7.8, given a classic definition of a cons-cell list:

data List a = Nil | Cons a (List a)

GHC would derive a Foldable instance that looks something like this:

instance Foldable List where
    foldr f z Nil = z
    foldr f z (Cons x xs) = f x (foldr (\a b -> f a b) z xs)

The bug arises from the unnecessary eta-expansion of f in the second line. A list N elements long will recurse into the Cons case N times, and each time f gets wrapped in another no-op lambda. When the evaluations are finally forced, we get a series of evaluations of expressions like:

f a b
(\a b -> f a b) a b
(\a b -> (\a b -> f a b) a b) a b
(\a b -> (\a b -> (\a b -> f a b) a b) a b) a b
…

This is a classic “triangle” pattern of evaluations, resulting in O(N²) reductions and thus runtime.
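
For intuition, here’s a strict Python analogue of the pile-up (a sketch, not GHC’s generated code): each recursive call wraps f in one more no-op lambda, so the application at depth k has to unwind k wrappers.

def foldr_derived(f, z, xs, i=0):
    if i == len(xs):
        return z
    g = lambda a, b: f(a, b)  # the needless eta-expansion
    return f(xs[i], foldr_derived(g, z, xs, i + 1))

# 1 + 2 + ... + N nested calls in all: O(N^2) in call overhead alone.
print(foldr_derived(lambda a, b: a + b, 0, list(range(100))))  # 4950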

The fix involved tweaking the code generator to not produce the extra lambdas, which were essentially only there as a quirk of the abstractions used in the code generation.

ghc haskell accidentally quadratic

`puppet apply`

puppet is a popular configuration-management tool. Puppet’s basic model is declarative: you define a set of “resources” and the state they should be in. A “resource” can be basically anything that might be managed on a server: a file on disk, a user account, a provisioned database instance, a running service, …

Puppet compiles the input puppet configuration into a “catalog” with all the defined resources, and creates a dependency graph: e.g. before the MySQL service can be started, the MySQL package has to be installed.

Applying a puppet catalog involves walking the catalog in dependency order, analyzing each resource in turn and modifying the running system to reflect the desired state (creating or removing a user, starting or stopping a service, …).

As with most problems in engineering, puppet has to deal with the ever-present possibility of failures or errors: What happens if a resource node cannot be applied correctly? Permission errors, insufficient disk space, being asked to install a typoed package, …

If a resource fails, puppet records this fact and then continues applying the catalog, attempting to apply as much of the catalog as it can. Since it maintains a dependency graph, it can selectively skip only the resources that depend on a failed resource.

However, until recently, puppet implemented this skipping by, for each node, visiting every one of that node’s recursive dependencies and checking whether any of them had failed.

It performed this check for every node, regardless of whether any failures had happened at all. This trivially leads to O(n²) behavior for a depth-n dependency chain!

My fix, scheduled for release with Puppet 4.2, attaches a list of failed recursive-dependencies to each node. When visiting a node, the list is computed for that node by directly unioning the lists of the immediate dependencies.
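
In sketch form (Python here; the real change lives in Puppet’s Ruby transaction code), the fixed traversal looks something like:

def apply_catalog(order, deps, apply_resource):
    # `order` lists nodes in dependency order; deps[node] holds the
    # node's *immediate* dependencies. failed[node] accumulates the
    # failed recursive dependencies without re-walking the graph.
    failed = {}
    for node in order:
        blocked = set()
        for dep in deps[node]:
            blocked |= failed[dep]  # union the immediate deps' lists
        if blocked:
            failed[node] = blocked  # skip: something it needs failed
        elif apply_resource(node):
            failed[node] = set()
        else:
            failed[node] = {node}   # the node itself failed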

To demonstrate the fix, I constructed a series of puppet manifests that just included N notify resources in a linear chain, and compared runtime before and after my patch:

[edited to add]: The above graph is plotted to an artificially large N to make the quadratic behavior extremely obvious to visual inspection; I don’t mean to imply that real manifests will have depth-6000 dependency trees. However, the patch is also a significant improvement on real-life manifests: as noted in the PR, it cut puppet runtime nearly in half on many real servers at Stripe.

puppet accidentally quadratic ruby graph

Revisiting Haskell Network.HTTP

I spent a while talking with Greg Price about the Haskell Network.HTTP issue previously featured here, and he was unconvinced I had the whole story. He spent a while reading some source and thinking about folds, and pointed out today that my analysis was incomplete, and flat-out wrong for the cabal case referenced in the commit.

One detail I glossed over in the last post (deliberately for simplicity, but over-zealously!) was the bufOps variable. bufOps is an instance of BufferOps, which provides an abstraction layer that allows the same code to work on a variety of byte-buffer or string types.

For simplicity, I chose to ignore the indirection and did the analysis assuming a strict buffer backed by a flat memory array, such as Haskell’s Data.ByteString. And my analysis was correct for a caller using Data.ByteString.

However, Greg pointed out, it is much more common in modern idiomatic Haskell to use Data.ByteString.Lazy, and indeed cabal does so. And for Data.ByteString.Lazy’s different buffer implementation, my analysis is completely wrong!

For Data.ByteString.Lazy, the issue hinges on the representation of the bytestring, and the implementation of append. ByteString.Lazy stores buffers as immutable cons-cell style lists of immutable chunks of string data – the string “Hello, World” might be stored as

("Hell") -> ("o, Wor") -> ("ld") -> null

In order to append two of these structures, we must copy the list structure of the left-hand side, replacing the final null with a pointer to the right-hand side, e.g.

append(("Hello, ") -> ("World") -> null,
       ("Good Night, ") -> ("Moon") -> null)
= ("Hello, ") -> ("World") -> ("Good Night, ") -> ("Moon") -> null

This copies the left-hand side but reuses the right-hand side directly, and thus has performance O(n), where n is the length of the left-hand side; the length of the right-hand side is irrelevant.

Armed with that knowledge, we can consider the effects of

foldr (flip append) empty strs

foldr will repeatedly apply flip append (which is just append with the arguments reversed: (flip append) a b = append b a) to pairs of strings, starting with the empty string, and working from the end of the list backwards. If we suppose strs = ["a", "b", "c", "d"], this will result in the following series of calls:

(flip append) "d" ""    = append "" "d"    = "d"
(flip append) "c" "d"   = append "d" "c"   = "dc"
(flip append) "b" "dc"  = append "dc" "b"  = "dcb"
(flip append) "a" "dcb" = append "dcb" "a" = "dcba"

Note that the ever-growing result always ends up on the left-hand side of append. Since append is linear in its left-hand-side, this repeated copying and accumulation yields quadratic behavior.

But consider instead foldr append empty $ reverse strs (which is just foldr append empty (reverse strs) – here you can think of $ as an open-left-paren that is closed by end-of-line). That expression results in this sequence of calls:

 append "a" ""     = "a"
 append "b" "a"    = "ba"
 append "c" "ba"   = "cba"
 append "d" "cba"  = "dcba"

Now, the left-hand side always consists of individual strings, giving a total runtime of O(Σn) = O(N), where N is the total length.
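
To make the accounting concrete, here’s a toy cons-list model in Python that counts spine copies at chunk granularity (a sketch, not the real ByteString code):

COPIES = 0

def append(xs, ys):
    # Copy the spine of xs and share ys unchanged, as
    # Data.ByteString.Lazy's append does.
    global COPIES
    if xs is None:
        return ys
    COPIES += 1
    return (xs[0], append(xs[1], ys))

chunks = [(c, None) for c in 'ab' * 250]  # 500 one-chunk strings

# foldr (flip append) empty strs: the growing result is always the
# *left* argument, so every step re-copies everything built so far.
acc = None
for s in reversed(chunks):
    acc = append(acc, s)
print(COPIES)  # 124750 copies: quadratic

# foldr append empty (reverse strs): singletons stay on the left.
COPIES, acc = 0, None
for s in chunks:
    acc = append(s, acc)
print(COPIES)  # 500 copies: linear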

For Data.ByteString.Lazy, then, the whole improvement comes from replacing the flip with a reverse; switching from an explicit fold to concat makes the code more concise, but doesn’t affect the asymptotics. But for Data.ByteString, as we saw last time, it is the switch to concat that nets the improvement to O(n).

This code was, in essence, O(n²) for two different reasons, depending on which bytestring implementation the caller asked for, and the patch fixed both of them at once! Now that is the level of subtlety I expect to see packed into a few characters of Haskell code :)

I also think this example makes for a great illustration of the points I made in my accidentally quadratic manifesto.

Whoever wrote the original code was a skilled Haskell programmer, and almost certainly more than familiar with big-O notation and basic algorithmic analysis. This quadratic pathology arose not by incompetence or lack of knowledge by the author, but because of the subtle interaction of two different individually-linear layers (foldr and append). And as we’ve seen, the layering really is key: the author could not have fully analyzed this code without peeking below the abstraction boundary and considering various concrete buffer implementations.

Additionally, this example shows off the interplay between constant factors and algorithmics. If your HTTP connections are slow, which are you going to assume: that the network or the server is slow, or that your HTTP client library has quadratic behavior hidden inside a simple operation? I’m speculating here, but it seems likely to me that this was a factor in allowing the bug to lurk longer than it might have otherwise.

accidentally quadratic haskell lazy strict http