Inside String

If we take a look into the V8 sources we discover a somewhat intimidating zoo of different string representations, each optimized for a particular use case: indexing, concatenation, or slicing, for example.

Whenever you see a string value in JavaScript code, know that it can actually be backed by any of these representations. The runtime can operate on them interchangeably and can even switch from one representation to another dynamically if that improves the performance of some operation.

But memory leaks happen. Consider a 20-character slice of a 10 GB string. Counterintuitively, it will retain the whole 10 GB input string, because the slice is internally represented as a SlicedString that points back to the source string. This might seem like a contrived example, but leaks like this do tend to happen in the real world: for example, three.js had to work around this issue.
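A minimal sketch of one commonly cited workaround: rebuild the short string so that it no longer points into the large one. The helper name detach is hypothetical, and the claim that this trick yields an independent copy is an assumption about V8's current behavior, not something the language specification promises.

```javascript
// Hypothetical helper: try to obtain a copy of `str` that does not keep
// a large parent string alive. Relies on V8 implementation details
// (an assumption, not a guarantee): concatenating and then slicing makes
// the engine build new backing storage instead of reusing the parent's.
function detach(str) {
  return (' ' + str).slice(1);
}

// Stand-in for a huge input: 1 MiB of filler followed by the part we want.
const big = 'x'.repeat(1 << 20) + 'the interesting part!';

// Per the text above, this short slice may be a SlicedString that
// references all of `big`.
const token = big.slice(-21);

// Same characters, but intended to be independent of `big`.
const safeToken = detach(token);
```

The two strings compare equal, so the difference is invisible to the program; it can only be observed in a heap snapshot or in overall memory use.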

The eagerness of a runtime to fall over itself and make your code faster with clever hidden optimizations has an ugly side too. Issue 2869 tracks the progress of fixing this on the V8 side, but nothing has really happened since 2013, probably because the only simple and robust solution is to remove sliced strings altogether.

Interestingly, that's precisely what Java did: String.substring used to run in O(1) time by reusing the parent String's char[] storage for the substring object, but that led to memory leaks and the sharing was eventually removed in 2012.

V8's history with string slices is even more curious: originally V8 had them, then removed them in 2009, then added them back in 2011.

One encounters this behavior when parsing input using regular expressions instead of writing a custom lexer. Any modern JavaScript engine that supports the sticky RegExp flag (y), introduced in ES6, makes this approach simpler.
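As a rough illustration of that approach, here is a small tokenizer built on the sticky flag. The rule table, token names and tokenize function are all made up for this sketch; only the y flag, lastIndex and exec are standard RegExp features.

```javascript
// Hypothetical token rules. The sticky flag `y` makes each pattern match
// only exactly at `lastIndex`, never further ahead in the input.
const rules = [
  ['number', /\d+/y],
  ['ident', /[A-Za-z_]\w*/y],
  ['space', /\s+/y],
  ['punct', /[-+*\/()=]/y],
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, re] of rules) {
      re.lastIndex = pos;           // anchor the attempt at the current position
      const m = re.exec(input);
      if (m !== null) {
        // m[0] is a substring of `input`; for longer tokens the engine may
        // represent it as a slice that keeps the whole input alive.
        if (type !== 'space') tokens.push({ type, text: m[0], pos });
        pos = re.lastIndex;         // exec advanced lastIndex past the match
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError('Unexpected character at ' + pos);
  }
  return tokens;
}

const tokens = tokenize('x = 42 + y1');
```

Because no substring of the remaining input is created just to test a pattern, the only strings allocated are the token texts themselves, which are exactly the values that may end up retaining the original input.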

I parsed web pages and wiki markup as fast as cat could copy them. The "allocate nothing" approach of the pegleg parser generator enabled my Exploratory Parsing method.