OpenSSL and Breaking UTF-8 Change (fixed in Node v0.8.27 and v0.10.29)

2014-06-16

Today we are releasing new versions of Node: v0.8.27 and v0.10.29.

First and foremost, these releases address the current OpenSSL vulnerability CVE-2014-0224. For both 0.8 and 0.10 we've upgraded the bundled OpenSSL to the fixed versions, v1.0.0m and v1.0.1h respectively.

Additionally, these releases address the fact that V8's UTF-8 encoding would allow unmatched surrogates (one half of a surrogate pair without the other). That is to say, previously you could construct a valid JavaScript string (strings are stored internally as UCS-2), pass it to a Buffer as UTF-8, and send it to another process, which would then fail to interpret it because the UTF-8 was invalid.

Note, the results encoded by V8 in this case are exactly what was passed into the encoding routine. There is no overflow, underflow, or the inclusion of other arbitrary memory, merely an unmatched UTF-8 surrogate resulting in invalid UTF-8.

As of these releases, if you try to pass a string with an unmatched surrogate, Node will replace that character with the Unicode replacement character (U+FFFD). To preserve the old behavior, set the environment variable NODE_INVALID_UTF8 to anything (even nothing). If the environment variable is present at all, Node will revert to the old behavior.

This breaks backward compatibility for a specific reason: unsanitized strings sent as a text payload to an RFC-compliant WebSocket implementation should result in the disconnection of the client. If the client attempts to reconnect and receives another invalid payload, it must disconnect again. If there is no logic to handle the reconnection attempts, this may lead to a denial of service attack. For instance, socket.io attempts to reconnect by default.

// Prior to these releases:
new Buffer('ab\ud800cd', 'utf8');
// <Buffer 61 62 ed a0 80 63 64>

// After these releases:
new Buffer('ab\ud800cd', 'utf8');
// <Buffer 61 62 ef bf bd 63 64>

// This is an explicit conversion to a Buffer, but the implicit
// .write('ab\ud800cd') also results in the same pattern
websocket.write(new Buffer('ab\ud800cd', 'utf8'));
// This would result in the client disconnecting.

Node's default encoding for strings is UTF-8, so even if you're not explicitly creating Buffers out of strings, Node may be doing so under the hood. If what you're passing is not actually UTF-8, you can be specific and call .write(str, 'binary') instead of .write(str), which signals Node to pass the string through without interpreting it.
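As a rough illustration of the difference between the two encodings, the sketch below encodes the same string both ways. The sample string is made up for the example, and it uses Buffer.from, which only exists in later versions of Node; on the releases discussed here, new Buffer(str, encoding) plays the same role.

```javascript
// Illustrative sample: a string holding raw byte values, not real text.
var str = '\xff\xfe\x00A';

// 'binary' keeps the low byte of each character, so nothing is
// re-encoded as UTF-8 on the way into the Buffer.
var asBinary = Buffer.from(str, 'binary');

// 'utf8' (the default) encodes each character, so code points above
// U+007F turn into multi-byte sequences.
var asUtf8 = Buffer.from(str, 'utf8');

console.log(asBinary); // 4 bytes: ff fe 00 41
console.log(asUtf8);   // 6 bytes: c3 bf c3 be 00 41
```

Note that 'binary' only makes sense when every character in the string is in the range U+0000 to U+00FF; anything above that loses its high byte.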

You can also mitigate this in pure JavaScript by sanitizing your strings. As an example, see node-unicode-sanitize, which will similarly replace unmatched surrogates with the Unicode replacement character.
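To give a feel for what such sanitizing involves, here is a minimal sketch. The function name replaceLoneSurrogates is hypothetical and this is not the node-unicode-sanitize API; it simply walks the string and swaps any surrogate that is not part of a valid pair for U+FFFD.

```javascript
// Hypothetical helper for illustration only (not node-unicode-sanitize).
function replaceLoneSurrogates(str) {
  var out = '';
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: only valid when a low surrogate follows it.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += str.charAt(i) + str.charAt(i + 1);
        i++; // skip the low surrogate we just copied
      } else {
        out += '\uFFFD';
      }
    } else if (code >= 0xDC00 && code <= 0xDFFF) {
      // Low surrogate with no high surrogate before it.
      out += '\uFFFD';
    } else {
      out += str.charAt(i);
    }
  }
  return out;
}

// The unmatched surrogate is replaced; valid pairs are left alone.
console.log(replaceLoneSurrogates('ab\ud800cd') === 'ab\ufffdcd'); // true
```

Sanitizing at the point where untrusted strings enter your application means the bytes you send are the same regardless of which Node version does the encoding.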

Thanks to Node.js alum Felix Geisendörfer for finding the issue, getting the fixes upstreamed, and helping with the testing and mitigation, and also for helping to inform and improve the process for Node.js security issues.

To float these fixes in your own builds you can apply the following patch with git am