Invalid Unicode encodings in Facebook data exports
An interesting case study of how even a large company can get Unicode encoding wrong in their data export format.
When you delete your Facebook account, you can download a copy of your data so that years of witty punchlines are not lost to posterity.
What you get is a massive ZIP that contains JSON, images etc. The next logical step
is to import the posts into ElasticSearch, so that they are easily searchable. A regular
json.load()
however gets you garbled output, if you happened to write in one of the many
languages that use non-ASCII characters.
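To see what “garbled” means in practice, here is a quick sketch in a Python shell (the string literal is lifted from the hex dump shown below, with backslashes doubled so that Python passes them through to the JSON parser verbatim):

> json.loads('"zawieraj\\u00c4\\u0085cych"')
'zawierajÄ\x85cych'

That fragment should have decoded to the Polish word “zawierających”.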
The issue is non-obvious at first glance. They’re using \u
escape sequences
to encode the Unicode characters, right?
000022a0 6c 69 20 7a 61 77 69 65 72 61 6a 5c 75 30 30 63 |li zawieraj\u00c|
000022b0 34 5c 75 30 30 38 35 63 79 63 68 20 5c 22 74 72 |4\u0085cych \"tr|
Well, no. What was there originally is the Polish diacritic character ą,
known in Unicode as U+0105 (LATIN SMALL LETTER A WITH OGONEK). Proper JSON
encoding would produce the following:
> print(json.dumps('ą'))
"\u0105"
The \u escape sequence encodes a Unicode character whose code point is given
by the four hexadecimal digits that follow, so \u0105 encodes the character ą
at code point U+0105.
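The decoding side mirrors this; a quick check in a Python shell, assuming any standards-compliant parser (Python’s json module here):

> json.loads('"\\u0105"')
'ą'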
Unicode code point vs encoding
So what are these \u00c4\u0085? Turns out it’s the UTF-8 encoding of the
character U+0105, written out byte by byte.
- A Unicode code point (such as U+0105) is basically just the number of the character in the Unicode catalogue, which is literally a book, where U+0061 is the letter a, U+0062 is the letter b and so on, until finally U+0105 is the letter ą. Any compliant JSON processor will decode the escape sequence \u0105 as the letter ą.
- The escape sequence occupies 6 bytes, which is a lot for a single character. To save space, there are many ways in which a Unicode character, an abstract object referenced by its code point, can be encoded into a sequence of bytes (see the snippet below).
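A quick way to see the difference between the catalogue number and its possible byte encodings is to ask Python for a few of them (a sketch; the byte sequences differ per encoding, the code point does not):

> hex(ord('ą'))            # the code point: just a number in the catalogue
'0x105'
> 'ą'.encode('utf-8')      # one possible byte encoding
b'\xc4\x85'
> 'ą'.encode('utf-16-be')  # another one
b'\x01\x05'
> 'ą'.encode('utf-32-be')  # yet another
b'\x00\x00\x01\x05'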
This is the confusing part of Unicode encodings, because we’re all used to ASCII,
where byte 0x61 always represents the character a. In Unicode that is no longer true:
it really depends on which encoding you select.
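For example, the very same two bytes decode to different characters depending on the encoding you pick (Latin-1 is used here as a stand-in for the old “one byte = one character” mindset):

> b'\xc4\x85'.decode('utf-8')
'ą'
> b'\xc4\x85'.decode('latin-1')
'Ä\x85'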
If you find the topics of Unicode and UTF-8 confusing, it’s because they are. Don’t be like Facebook and don’t go writing custom encoders without understanding what you’re doing. Just watch my OWASP AppSec 2018 video “Unicode: The hero or villain? Input Validation of free-form Unicode text in Web Applications” on scitech.video (PeerTube or LBRY).
RFC 7159 allows JSON
text to be encoded entirely in UTF-8, in which case the problem of
escaping would be irrelevant, as Unicode characters would simply be
inlined in the text. Let’s see how UTF-8 handles the ą
character:
> 'ą'.encode('utf-8')
b'\xc4\x85'
The character ą is encoded into two bytes under UTF-8: 0xc4, 0x85. Notice
any similarities? This is precisely the mysterious \u00c4\u0085 sequence,
just with each byte serialised as a \u escape sequence.
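For the record, the broken output is easy to reproduce; a rough sketch of what the serializer appears to be doing (my reconstruction, not Facebook’s actual code):

> ''.join('\\u00{:02x}'.format(b) for b in 'ą'.encode('utf-8'))
'\\u00c4\\u0085'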
The story here seems to be that Facebook programmers mixed up the concepts
of Unicode encoding and escape sequences, probably while implementing their
own ad-hoc serializer. What Facebook outputs is the UTF-8 encoding of the Unicode
character U+0105 (LATIN SMALL LETTER A WITH OGONEK), but with each byte confusingly
prefixed with \u00 as if it were an escape sequence.
The prefix is confusing, because it implies an escape sequence
referring to a single Unicode character U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)
followed by another single Unicode character, U+0085 (NEXT LINE).
This kind of “Unicode characters pretending to be bytes” is possible for code points U+0000 through U+00FF, because their numeric values map one-to-one onto the byte values 0x00 through 0xFF (the ISO 8859-1, or Latin-1, mapping); for the ASCII subset U+0000 through U+007F the UTF-8 encoding is additionally identical to the raw byte. Nonetheless, it’s used against its purpose here and is clearly confusing for both humans and JSON decoders.
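As a side effect, this is also why the damage is reversible for a single, already-parsed string: re-encode the mojibake as Latin-1 (where code points U+0000 through U+00FF map straight back to byte values 0x00 through 0xFF) and decode the result as UTF-8. A quick illustration, assuming every non-ASCII character in the string was mangled this way (the file-level fix I actually use is shown further below):

> 'zawierajÄ\x85cych'.encode('latin-1').decode('utf-8')
'zawierających'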
RFC 7159
In Facebook’s defense, the section on character encoding in RFC 7159 is quite confusing on its own, with too many possibilities left optional and a general writing style more suitable for a blog than for a technical standard. Section 7 is filled with statements like “may be encoded”, “alternatively”, “may be represented” etc.
For example, the standard states that “if the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be represented (…)”
as a “\u005C”-style sequence, which clearly indicates that U+0105 may be
represented as \u0105. But the curse of “may” in standards is that it leaves
the programmer with the freedom to use whatever other encoding they can
come up with here.
Never use lower-case “may” in standards. Note that upper-case MAY is a
completely different story, as its meaning is clearly defined in
RFC 2119. It’s also worth
exploring the very pragmatic keywords defined in RFC 6919.
How to fix Facebook export encoding?
Note that this bug was noticed on Stack Overflow back in 2018 (Facebook JSON badly encoded). I dumped my data somewhere around 2019 and the bug was still there, which may suggest that helping people get their data off Facebook isn’t the company’s first priority.
I’ve come up with a Python class that wraps around the io.FileIO
class and allows on-the-fly fixing of the input files. It’s slow,
but it correctly handles runs of two or more \u00XX sequences, which
I have also seen in my export file. Each run is UTF-8 decoded
into a Unicode character and then JSON-encoded again into a proper
\uXXXX sequence.
import json
import io


class FacebookIO(io.FileIO):
    def read(self, size: int = -1) -> bytes:
        # Read the whole file and rewrite the bogus \u00XX runs on the fly.
        data: bytes = super(FacebookIO, self).readall()
        new_data: bytes = b''
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data[i:].startswith(b'\\u00'):
                u: int = 0
                new_char: bytes = b''
                # Collect the whole run of consecutive \u00XX sequences and
                # turn each one back into the raw byte it stands for.
                while data[i+u:].startswith(b'\\u00'):
                    byte_val = int(data[i+u+4:i+u+6], 16)
                    new_char = b''.join([new_char, bytes([byte_val])])
                    u += 6
                # The collected bytes are valid UTF-8: decode them into the
                # original character(s)...
                char: str = new_char.decode('utf-8')
                # ...and re-encode them as proper JSON \uXXXX escapes,
                # dropping the surrounding quotes added by json.dumps().
                new_chars: bytes = bytes(json.dumps(char)[1:-1], 'ascii')
                new_data += new_chars
                i += u
            else:
                new_data = b''.join([new_data, bytes([data[i]])])
                i += 1
        return new_data


if __name__ == '__main__':
    f = FacebookIO('data.json', 'rb')
    d = json.load(f)
    print(d)