Invalid Unicode encodings in Facebook data exports

An interesting case study of how even a large company can get Unicode encoding wrong in their data export format.

When you delete your Facebook account, you can download a copy of your data so that years of savvy punchlines are not lost to posterity.

What you get is a massive ZIP that contains JSON, images etc. The next logical step is to import the posts into Elasticsearch, so that they are easily searchable. A regular json.load(), however, gives you garbled output if you happened to write in one of the many languages that use non-ASCII characters.

The issue is not obvious at first glance. They’re using \u escape sequences to encode the Unicode characters, right?

000022a0  6c 69 20 7a 61 77 69 65  72 61 6a 5c 75 30 30 63  |li zawieraj\u00c|
000022b0  34 5c 75 30 30 38 35 63  79 63 68 20 5c 22 74 72  |4\u0085cych \"tr|

Well, no. What was there originally is the Polish letter ą, which is U+0105 (LATIN SMALL LETTER A WITH OGONEK) in Unicode. Proper JSON encoding would produce the following:

> print(json.dumps('ą'))
"\u0105"

The \u escape sequence encodes a Unicode character using the four hexadecimal digits of its code point, so \u0105 encodes the character U+0105 (ą).

Unicode code point vs encoding

So what is this \u00c4\u0085 sequence? It turns out to be the binary representation of the UTF-8 encoded character U+0105.

  • A Unicode code point (U+0105) is basically just the number of the character in the Unicode catalogue, which is literally a book, where U+0061 is letter a, U+0062 is letter b and so on, until finally U+0105 is letter ą. Any compliant JSON processor will decode the escape sequence \u0105 as letter ą.
  • The escape sequence occupies 6 bytes, which is a lot for just a single character. To save space, there are many ways in which a Unicode character, an abstract object referenced by its code point, can be encoded into a sequence of bytes.
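As a small illustration of the two points above, the following Python sketch separates the code point (a number) from a few of its possible byte encodings. The variable names are made up for this example only.

```python
import json

ch = 'ą'

# The code point is just the character's number in the Unicode catalogue.
code_point = ord(ch)
assert chr(0x105) == ch

# The JSON escape sequence spells that number out in six ASCII characters.
escaped = json.dumps(ch)[1:-1]

# Encodings turn the same code point into different byte sequences.
utf8 = ch.encode('utf-8')        # two bytes
utf16 = ch.encode('utf-16-le')   # two bytes, but different values
utf32 = ch.encode('utf-32-le')   # four bytes

print(hex(code_point), escaped, utf8, utf16, utf32)
```

The code point stays the same throughout; only the bytes that represent it change with the chosen encoding.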

This is the confusing part of Unicode encodings, because we’re all used to ASCII, where byte 0x61 always represents the character a. In Unicode that is not the case: it really depends on which encoding you select.
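A minimal sketch of that point, using UTF-16 as an example of an encoding where the byte 0x61 does not have to mean the letter a (the variable names are illustrative):

```python
# In ASCII (and in UTF-8) the single byte 0x61 is the letter a.
ascii_a = b'\x61'.decode('ascii')

# In UTF-16 the same letter takes two bytes...
a_utf16 = 'a'.encode('utf-16-le')

# ...and the byte 0x61 can just as well be one half of another character.
other = b'\x61\x01'.decode('utf-16-le')

print(ascii_a, a_utf16, other, f'U+{ord(other):04X}')
```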

If you find the topics of Unicode and UTF-8 confusing, it’s because they are. Don’t be like Facebook and don’t go writing custom encoders without understanding what you’re doing. Just watch my OWASP AppSec 2018 video “Unicode: The hero or villain? Input Validation of free-form Unicode text in Web Applications” (on PeerTube or LBRY).

RFC 7159 allows JSON texts to be encoded entirely in UTF-8, in which case the problem of escaping becomes irrelevant, as Unicode characters are simply inlined in the text. Let’s see how UTF-8 handles the ą character:

> 'ą'.encode('utf-8')
b'\xc4\x85'

The character ą is encoded as two bytes under UTF-8: 0xc4 and 0x85. Notice any similarities? This is precisely the mysterious \u00c4\u0085 sequence, just with each byte serialised as a separate \u escape sequence.
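For comparison, here is a short sketch of the two correct ways to serialise ą with Python's standard json module: escaped (the default) and inlined as UTF-8 (with ensure_ascii=False). The variable names are only for illustration.

```python
import json

ch = 'ą'

# Default: non-ASCII characters are escaped, so the output is pure ASCII.
escaped_json = json.dumps(ch)

# ensure_ascii=False keeps the character inline; the text is then
# stored as UTF-8 bytes.
inline_json = json.dumps(ch, ensure_ascii=False)
inline_bytes = inline_json.encode('utf-8')

# Both forms decode back to the same one-character string.
assert json.loads(escaped_json) == json.loads(inline_bytes) == ch

print(escaped_json, inline_bytes)
```

Either form is fine on its own; the trouble starts only when the UTF-8 bytes of the second form are written out using the escape syntax of the first.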

The story here seems to be that Facebook programmers mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. What Facebook outputs is the binary representation of the UTF-8 encoded Unicode character U+0105 (LATIN SMALL LETTER A WITH OGONEK), with each byte confusingly prefixed with \u00.

The prefix is confusing because it implies an escape sequence referring to a single Unicode character, U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS), followed by another single Unicode character, U+0085 (NEXT LINE).
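This is easy to check with a standards-compliant decoder. The sketch below feeds the two escape sequences from the hexdump to Python's json module and prints what comes out; the variable names are illustrative.

```python
import json
import unicodedata

# The two escape sequences exactly as they appear in the export.
facebook_json = r'"\u00c4\u0085"'

decoded = json.loads(facebook_json)

# The decoder returns two characters, not the single letter ą.
for ch in decoded:
    # Control characters such as U+0085 have no name in unicodedata,
    # hence the fallback value.
    print(f'U+{ord(ch):04X}', unicodedata.name(ch, '<no name>'))
```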

This kind of “Unicode characters pretending to be bytes” is possible for characters U+0000 up to U+00FF (the Latin-1 range, of which ASCII is only the lower half, U+0000 to U+007F), because these 256 code points map one-to-one onto the byte values 0x00 to 0xFF. Nonetheless, it’s used against its purpose here and is clearly confusing for both humans and JSON decoders.
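To make the ranges concrete, the following sketch uses Python's latin-1 (ISO-8859-1) codec, which maps every byte value to the code point with the same number, and contrasts it with UTF-8, which matches ASCII only up to U+007F. The variable names are made up for the example.

```python
# Byte values 0x00..0xFF decoded as latin-1 give the code points
# U+0000..U+00FF with the very same numbers.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')
one_to_one = all(ord(ch) == b for ch, b in zip(as_text, all_bytes))

# UTF-8, by contrast, is identical to ASCII only up to U+007F.
last_ascii = '\u007f'.encode('utf-8')        # a single byte
first_non_ascii = '\u0080'.encode('utf-8')   # already two bytes

print(one_to_one, last_ascii, first_non_ascii)
```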

RFC 7159

In Facebook’s defense, the section on character encoding in JSON in RFC 7159 is quite confusing on its own, with too many possibilities left as optional and a general writing style more suitable for a blog than a technical standard. Section 7 is filled with statements like “may be encoded”, “alternatively”, “may be represented” etc.

For example, the standard states that “if the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented (…) as "\u005C"”. This clearly indicates that U+0105 may be represented as \u0105, but the curse of “may” in standards is that it leaves the programmer free to use whatever other encoding they can come up with.

Never use lower-case “may” in standards. Note that upper-case MAY is a completely different story, as its meaning is clearly defined in RFC 2119. It’s also worth exploring the very pragmatic keywords defined in RFC 6919.

How to fix Facebook export encoding?

Note that this bug was noticed back in 2018 on Stack Overflow (Facebook JSON badly encoded). I dumped my data somewhere around 2019 and the bug was still there, which may suggest that helping people get their data off Facebook isn’t the company’s first priority.

I’ve come up with a Python class that wraps the io.FileIO class and fixes the input files on the fly. It’s slow, but it correctly handles runs of two or more \u00XX sequences, which I have also seen in my export file. Each run is UTF-8 decoded into Unicode characters, which are then JSON-encoded again into proper \uXXXX sequences.

import json
import io

class FacebookIO(io.FileIO):
    """Binary file reader that repairs Facebook's byte-wise escapes on the fly."""

    def read(self, size: int = -1) -> bytes:
        # The size argument is ignored: the whole file is read and fixed at once.
        data: bytes = super().readall()
        new_data: bytes = b''
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data.startswith(b'\\u00', i):
                u: int = 0
                new_char: bytes = b''
                # Collect a run of consecutive \u00XX escapes as raw bytes.
                while data.startswith(b'\\u00', i + u):
                    byte_value = int(data[i+u+4:i+u+6], 16)
                    new_char += bytes([byte_value])
                    u += 6

                # Decode the bytes as UTF-8, then re-encode them as proper
                # JSON escapes, dropping the surrounding quotes.
                char: str = new_char.decode('utf-8')
                new_chars: bytes = bytes(json.dumps(char)[1:-1], 'ascii')
                new_data += new_chars
                i += u
            else:
                # Any other byte is copied through unchanged.
                new_data += data[i:i+1]
                i += 1

        return new_data

if __name__ == '__main__':
    with FacebookIO('data.json', 'rb') as f:
        d = json.load(f)
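An alternative that avoids patching the raw bytes is to load the file with the regular JSON decoder first and repair the strings afterwards. The sketch below is only an illustration of that idea, not the method used above: fix_mojibake is a hypothetical helper name, and it assumes that every string in the export consists solely of characters in the U+0000 to U+00FF range. If a string contains any character above U+00FF, encode('latin-1') raises UnicodeEncodeError, and if the resulting bytes are not valid UTF-8, decode('utf-8') raises UnicodeDecodeError.

```python
import json

def fix_mojibake(value):
    """Recursively repair strings whose characters are really UTF-8 bytes."""
    if isinstance(value, str):
        # Assumption: all characters are in U+0000..U+00FF, so latin-1
        # turns them back into the original bytes one-to-one.
        return value.encode('latin-1').decode('utf-8')
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    # Numbers, booleans and None are returned unchanged.
    return value

# A made-up fragment in the style of the export shown in the hexdump.
raw = r'{"content": "zawieraj\u00c4\u0085cych"}'
fixed = fix_mojibake(json.loads(raw))
print(fixed)
```

This version is shorter, but it holds the whole decoded document in memory and fails loudly on any string that was encoded correctly in the first place.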