somewhere to talk about random ideas and projects like everyone else

stuff

December 2015 Archive

Methods and Functions 16 December 2015

I watched a recording of Rob Pike’s talk Simplicity is Complicated a few weeks ago, and I was struck by his sentiment that many popular languages seem to be converging by adding popular features from other languages.

In particular, it seems that many popular languages combine aspects of object-oriented and functional programming— support for classes, high-order functions, and reduction. Sometimes this leads to situations where the distinction between methods and functions are confusing or inconsistent.

In Javascript, you find the length of an array or string using a getter method "blah".length but in Python[^2], it’s done through a function invocation len("blah").

This dichotomy is more evident when operations get chained in sequence to process some stream of data. Here’s an example in Java 8 of the object-oriented approach where methods are chained[^1]:

double average = roster.stream()
    .filter(p -> p.getGender() == Person.Sex.MALE)
    .mapToInt(Person::getAge)
    .average()

An alternative way to write the code is by thinking about it as function composition— as you can do in Haskell:

(sum (map sqrt (filter (>3) [1,2,3,4,5,6,7])))

An interesting exercise is to pay attention how your eyes move as you scan through snippets of code to figure out what it does.

To do that, let’s abstract away the slightly different syntax for these different languages and imagine a hypothetical language where x.y.z is syntactic sugar for z(y(x)). That is, when a function name is suffixed after a . behind some expression, it is equivalent to wrapping the entire expression and using it as the first argument.

More concretely, 32.double is equivalent to double(32), and 49.divide(7) is equivalent to divide(49, 7).

For a more complex example:

range(0, 20).map(x: sqrt(x)).filter(x: x > 3).sort.sum

sum(sort(filter(map(range(0, 20), x: sqrt(x)), x: x > 3)))

With the first code snippet, the process can be read left-to-right, and you can imagine it as the story of a little chunk of data flowing through the system.

With the second code snippet, you instead read the function from the inside-out. To a certain extent, it’s right-to-left, but you have to check the other side to make sure you aren’t forgetting any extra arguments.

Let’s try to visualize this:

sum(sort(filter(map(range(0, 20), x: sqrt(x)), x: x > 3)))
                    ------------
                ----            ------------
         -------                             ----------
    ----                                               -
---                                                      -

You can see here that interpreting the behavior with the functional style[^3] begins in the middle and approaches the beginning right-to-left. But when you reach some new function name, you have to zig-zag to the opposite side to visually inspect whether or not the method has any additional arguments.

range(0, 20).map(x: sqrt(x)).filter(x: x > 3).sort.sum
------------
            ----------------
                            -----------------
                                             -----
                                                  ----

The method chaining approach (read: OOP style) is nice because it fits the conceptual model of data flowing through a sequence of transformations, and it doesn’t disrupt the typical western left-to-right reading order.

To me, this seems much easier both to read and to write (for I am not so blessed as to use Paredit and must carefully tread a vast sea of syntax to manually close my parens) and I’ve always been mildly infuriated trying to learn Haskell[^4] because function composition feels much less natural[^5].

One of the stated reasons for why Python uses len is that you can’t glance at the type of the expression by looking at its head. When you’re scanning through a line and you see that the outermost function is “len”, you can immediately tell that the output is a number. When length is determined in postfix, you can’t determine the final type of an expression until you look at the end of the expression.

Another problem is that this linearized sequential syntax only helps when the the structure of an expression is relatively linear. Imagine instead that you have an expression like

sum(range(floor(sqrt(multiply(3, 4))), divide(square(5), 2)))
                     --------------           
                -----              -          ---------
          ------                    -  -------         ----
    ------                           --                    -
----                                                        -

In this case, both multiply(3, 4) and square(5) are of relatively similar depth which gives it the humps making it look like a snake eating an elephant.

Flattening it with the . syntax then forces range into a little invisible spot in the middle and that makes it a lot harder to interpret the behavior of the program.

3.multiply(4).sqrt.floor.range(5.square.divide(2)).sum

In fact this dot stuff is beginning to look a lot like the postfix notation in stack-based programming languages like FORTH) (read: similarly unreadable).

That actually leads to the nifty realization that it’s kind of like tail call optimization for the human mind.

When reading some code composed of nested function application, you end up having to keep track of how many level within the function you are. There’s an idea in psychology that human working memory can contain up to 7 ± 2 items. So reading some deeply nested set of expressions is naturally difficult.

But if the expression tree naturally represents a sequence of actions transforming a piece of data— you can reorganize it in such a way that depth of your mental stack doesn’t keep growing and objects can be substituted in place.


Passing objects as the first argument isn’t really all that strange. The way classes are implemented in python, methods are just functions with an implicit self argument.

class Dog:
    def talk(self, text):
        print "woof!", text, "bark!"

shibe = Dog()
shibe.talk("wow! such speak!")

So with the code above, shibe.talk("wow! such speak!") is really just shorthand for Dog.talk(shibe, "wow! such speak!").

Neither is it particularly strange to access method properties through function calls. This seems to be quite typical for accessing fields of typeclasses in Haskell.

data Car = Car {
    company :: String, 
    model :: String, 
    year :: Int
}
let WusTangClan = Car {company="Ford", model="Wustang", year=1967}
model(WusTangClan) -- "Wustang"

[^3]: Note that while I’m characterizing this as the functional style, many functional languages such as F# have a “pipe-forward” operator “|>” which has a similar effect. Haskell seems to have a convention where the data is passed as the last argument rather than the first argument, which makes implementing this particular strategy a bit clumsy.

[^2]: In the Python FAQ Guido describes the rationale behind the “len” function as opposed to using a method on strings and lists. Interestingly more recent revisions of the FAQ make it sound as if it were some mistake rendered unfixable by history.

[^1]: Something analogous to this sort of chained-method style is also used in popular frameworks like d3 and jQuery.

[^4]: Well, there’s also that incident where I tried installing some Hello World example with Cabal and ended up, through a far more infuriating series of events, rendering my Macbook unbootable.

[^5]: In similar vein, I’d like there to be a “@\” in Mathemath so that I can do data \@ transformation 1 rather than the other way around with the existing “/@” operator


Recovering Deleted Data from LevelDB 02 December 2015

It was 4am this morning, and I did something stupid. I was trying to write something which would automatically delete all empty documents. Within a few seconds of running it, all my precious data was gone. Oops.

There’s probably some grander lesson to be learned about keeping routine backups and not being an idiot. But the strength of that parable is somewhat weakened because I did manage to get the data back.

How LevelDB Works

LevelDB, developed by Sanjay Ghemawat and Jeff Dean of Google, is a fast key-value store based on a log-structured merge tree.

It’s not a relational database. It can’t do complex SQL queries. In fact, the only type of query that one can do is to ask for the set of entries whose keys fit lexicographically in some range. Think of it as a list of entities perpetually sorted by a key, where you can quickly read out any entry (or contiguous range) with a simple binary search.

It’s not just a metaphor— the basic primitive underlying LevelDB is literally called an SSTable (Sorted String Table). If you have a big arbitrarily large sorted list, the task of locating any single entity in a sorted list can be done in O(log n) time, which is nice because hopping around the big spinning metal hard disk platter is glacially slow in computer time.

In RAM, hopping around is pretty fast. It’s part of the name— Random Access Memory. In RAM you can maintain a balanced binary search tree, so that you can search, insert, delete, or update the contents of any entry in O(log n) time— all while keeping the contents sorted.

Computers tend to have a limited amount of RAM compared to hard disk space, so most databases have to somehow deal with both beasts. To a certain extent, LevelDB plays to the strengths of each while using the other to fill in weaknesses.

Diagram taken from https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/

To add some entry to the key-value store, LevelDB first sticks it into RAM— in particular some sort of data structure that keeps everything all nice and sorted. But over time, this chunk of memory may grow uncomfortably large, at which point its sorted contents get flushed into a file, and the old chunk of memory is emptied and ready for new data.

Once data is saved to the disk, the saved SSTables are essentially treated as immutable. Otherwise, if you want to keep the data sorted you might have to rewrite the entire file whenever a new entry is added, or an old one is deleted. This immutability property is really what ends up saving me, because it means that in spite of deleting everything— nothing is ever truly lost.

To look up some range of entries, LevelDB first checks RAM in case something matching the query has been manipulated recently. Then it checks the little immutable chunks which have already been flushed to disk, and merges them all together quickly in a manner similar to merge sort.

There’s some more to it which I haven’t covered, such as an append-only log to aid recovery in the event of some catastrophic power failure or software crash, and the process by which older SSTables get compacted together. But this essentially represents the basic design of an LSM database, at least enough to understand the principle of recovering deleted files.

LevelDB Structure

When you save some stuff to a LevelDB database, what you find is actually a directory consisting of a couple of .log files, .ldb files, MANIFEST, CURRENT, LOG and LOCK. As far as I’m aware the log files are temporary and kept in case the database is interrupted before it has an opportunity to write things down into a sorted table. Thus I’ll pretend that after having cleanly exited the database (following accidentally deleting lots of stuff), all the content which might possibly be interesting is within the .ldb files.

If you try to open up an .ldb in a hex editor, you’ll notice bits which look tantalizingly like the data you might want to recover, albeit subtly mangled by the Snappy compression algorithm.

Recovery

While you might imagine that being able to access previous versions of the data would be a natural and obvious affordance for a log-structured immutable append-only key-value store, I couldn’t find any APIs for accessing deleted data.

A good first step might be figuring out some way to parse those .ldb files. Fortunately, it’s well-documented enough that there are actually third-party implementations. And from the looks of it’s actually pretty simple.

That golang leveldb repo is particularly cool because it includes a helper tool ldbdump which reads in an .ldb file and dumps its contents. In fact the whole process of using it was surprisingly rather painless. I just ran go get github.com/golang/leveldb, navigated to the cmd/ldbdump directory and ran go build -o ldbdump main.go. I then wrote a little Python script which would call that program for all the .ldb files in a particular directory and parse it into a more canonical JSON form.

base = "test-stuff copy"

import os
from subprocess import Popen, PIPE
import json
import ast
for f in os.listdir(base):
  if f.endswith(".ldb"):
    process = Popen(["ldbdump", os.path.join(base, f)], stdout=PIPE)
    (output, err) = process.communicate()
    exit_code = process.wait()
    for line in (output.split("\n")[1:]):
      if line.strip() == "": continue
      parsed = ast.literal_eval("{" + line + "}")
      key = parsed.keys()[0]
      print json.dumps({ "key": key.encode('string-escape'), "value": parsed[key] })

Out of this I got a 73MB JSON file, which seemed to have all the contents in some form but unfortunately ordered by the key. This is bad because it conflates different versions of a particular key— an object may have been changed, deleted, and created again.

{"value": "{\"id\":\"Whole Earth Catalog\",\"type\":\"text\",\"parent\":\"__root__\",\"created\":1447821578094,\"list\":[\"MI11NZ\"]}", "key": "!entities/aleph!Whole Earth Catalog\\x01RI\\x05\\x00\\x00\\x00\\x00"}
{"value": "{\"id\":\"Wikia\",\"type\":\"text\",\"parent\":\"__root__\",\"created\":1447821578094,\"list\":[\"XB8UAU\"]}", "key": "!entities/aleph!Wikia\\x01UI\\x05\\x00\\x00\\x00\\x00"}
{"value": "{\"id\":\"Will Ferrell\",\"type\":\"text\",\"parent\":\"__root__\",\"created\":1447841260204,\"list\":[\"AKVVBF\"]}", "key": "!entities/aleph!Will Ferrell\\x01=N\\x05\\x00\\x00\\x00\\x00"}

One curious thing I noticed was the presence of some strange binary digits after the end of each key. After a little bit of experimentation it turns out that if you convert that binary string into a number, and sort the entire database by it, sticks all the changes in the proper chronological order!

import sys
import json
import ast 
import binascii

things = []
for line in sys.stdin:
  obj = json.loads(line)
  halves = obj['key'].split("\\x", 1)
  key = ast.literal_eval("'\\x" + halves[1].replace("\\\\u00", "\\x") + "'")
  if len(key) == 8:
    time = int(binascii.hexlify(key[1:][::-1]), 16)
    things.append((time, halves[0], obj['value']))

for x in sorted(things, key = lambda k: k[0]):
  print json.dumps(x)

And my precious data is now saved! Or well, rather, it always was— but now I finally have access to it in some reasonable form.

I could see the first few entries, which happened to be some particularly fecal test data

[1, "!changes/wumbo/!00000001448849217216", "poop"]
[2, "!entities/wumbo/!00000001448849471307", "poop"]
[3, "!changes/wumbo/!00000001448849471308", "poop"]

Then some data in the middle with an imported movie database

[2435, "!changes/merp!000000014488557172160000001163", "Deja Vu"]
[2437, "!changes/merp!000000014488557172160000001164", "Gattaca"]
[2439, "!changes/merp!000000014488557172170000001165", "Sunshine"]

And of course, the part of it where I accidentally deleted everything

[394061, "!entities/aleph!OYHEDW", ""]
[394062, "!entities/aleph!Gerry Sussman", ""]
[394063, "!entities/aleph!VSPKP8", ""]

So I guess that’s it! In case you ever get stuck in a rut where you’ve unwittingly deleted the contents of your LevelDB and forgot to keep decent backups, and you’re lucky enough that the changes haven’t been compactified yet, you can possibly get your data back.

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 57 FB 80 8B 24 75 47 DB