It seems emoji break websocket.io (the websocket library used by socket.io): they appear to make the message payload terminate prematurely.
After a morning of research on the subject, 3rdEden on irc.freenode.net/socket.io pointed me at mranney’s essay on the subject. It turns out that the issue is due to V8 (the JavaScript engine used by Node.js) using the UCS-2 encoding internally rather than the more modern UTF-16. Emoji code points are larger than 65,535, so they need more than 16 bits, which is more than UCS-2 can give: UCS-2 uses exactly 2 bytes (16 bits) to represent every character, so it only supports values between 0 and 65,535, whereas UTF-8 and UTF-16 use a variable number of bytes and can reach the full Unicode range.
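To make that concrete, here is the arithmetic for U+1F613: in UTF-16, any code point above U+FFFF is split into a surrogate pair of two 16-bit units (a plain JavaScript sketch, not code from the bug itself):

```javascript
// Split a supplementary-plane code point into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20 bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f613);
console.log(high.toString(16), low.toString(16)); // d83d de13
// A UCS-2-only layer sees these as two separate 16-bit characters,
// not one emoji -- which is where things can go wrong.
```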
I’ve confirmed with Wireshark that Safari is sending valid UTF-8 (11110000 10011111 10011000 10010011, or hex: f0 9f 98 93, which gives the Unicode codepoint U+1F613 😓). So the problem is on Node’s side, receiving and processing it.
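You can double-check those four bytes by hand; this is just the standard UTF-8 decoding maths in plain JavaScript, nothing specific to the bug:

```javascript
// Decode a 4-byte UTF-8 sequence (leading byte 11110xxx) to a code point.
function decodeUtf8FourBytes(bytes) {
  const [b0, b1, b2, b3] = bytes;
  return ((b0 & 0x07) << 18) | // 3 payload bits from the leading byte
         ((b1 & 0x3f) << 12) | // 6 payload bits from each
         ((b2 & 0x3f) << 6)  | // continuation byte (10xxxxxx)
         (b3 & 0x3f);
}

const cp = decodeUtf8FourBytes([0xf0, 0x9f, 0x98, 0x93]);
console.log('U+' + cp.toString(16).toUpperCase()); // U+1F613
```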
This is probably one of the reasons that Google Chrome doesn’t support Emoji - check out this page in Chrome, then view it in Safari to see what you’re missing! (Chrome uses the V8 engine.)
Solution
You could base64-encode your payload, or simply escape()/unescape() it. Or you could strip anything outside the UCS-2 range. Or you could do a custom encode/decode such as this one. I’m not really happy with any of these, so I’m still looking for a solution.
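As a sketch of the escape()/unescape() option: escape() is non-standard but still present in browsers and Node, and it turns each surrogate half into an ASCII-safe %uXXXX sequence, so nothing above 16 bits ever hits the wire:

```javascript
// Round-trip an emoji through escape()/unescape().
// escape() works per UTF-16 code unit, so the surrogate pair
// becomes two plain-ASCII %uXXXX escapes.
const original = '\u{1F613}'; // 😓
const encoded = escape(original);
console.log(encoded);                        // %uD83D%uDE13
console.log(unescape(encoded) === original); // true
```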
Update
In the end I implemented this (see below) on the client side, and left the data encoded server side. All clients are responsible for encode()ing characters when sending, and decode()ing upon receiving (whether that be via websockets or HTTP). It seems to work quite well and doesn’t massively inflate the content size for ASCII and possibly more[citation needed], so I’m relatively happy with it.
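The shape of that client-side wrapper looks roughly like this, with escape()/unescape() standing in for whichever encode()/decode() pair you settle on (the names and the socket interface here are illustrative, not the exact code I shipped):

```javascript
// Hypothetical wrapper: encode before sending, decode after receiving.
// `socket` stands in for a socket.io client instance.
function sendMessage(socket, text) {
  socket.send(escape(text)); // non-BMP characters leave as ASCII %uXXXX
}

function onMessage(raw, handler) {
  handler(unescape(raw));    // restore the original string
}

// Usage sketch:
// sendMessage(socket, 'hi \u{1F613}');
// socket.on('message', raw => onMessage(raw, text => console.log(text)));
```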
Besides it’s midnight and I need to get some sleep.