I'm getting weather alerts from a weather service. Although the HTTP response claims to be UTF-8, it clearly contains text like this:
SuÃ°austan 13-20 m/s og snjÃ³koma meÃ° lÃ©legu skyggni og versnandi akstursskilyrÃ°um.
...that should look like this:
Suðaustan 13-20 m/s og snjókoma með lélegu skyggni og versnandi akstursskilyrðum.
...but it had already been decoded improperly, and then re-encoded as UTF-8, before it ever reached me. Most of us have probably seen this kind of "mojibake" garbage before, and visually at least, it tends to share common characteristics, such as lots of Ã characters, ¢ signs and the like.
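For illustration, here is a minimal Node.js sketch of how this kind of damage arises and why it is reversible. It assumes the wrong decoding step was Latin-1 (ISO-8859-1); the variable names are made up for the example.

```javascript
// Simulate the upstream bug: UTF-8 bytes wrongly decoded as Latin-1.
const original = 'með';
const utf8Bytes = Buffer.from(original, 'utf8');   // bytes 6d 65 c3 b0
const garbled = utf8Bytes.toString('latin1');      // each byte becomes one character

// Reverse it: turn each character back into one byte, then decode as UTF-8.
const repaired = Buffer.from(garbled, 'latin1').toString('utf8');

console.log(garbled);   // meÃ°
console.log(repaired);  // með
```

The round trip only works when every character in the garbled text is at or below U+00FF, because Node's 'latin1' encoding keeps just the low byte of each character.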
I'm using this code to fix it up right now:
// Check for UTF-8 wrongly decoded as Latin-1
if (/[\x80-\xC5]/.test(result)) {
  const bytes = Buffer.from(result, 'latin1');
  const altText = bytes.toString('utf8');

  if (altText.length < result.length)
    result = altText;
}
...and that's doing the job for now, but it's not a very sophisticated test.
Anyone know of a better method?
Answer:
"Anyone know of a better method?"
I'm not sure how you would define "better", but I wrote this function a while ago to do exactly this transformation on a string.
I don't know whether it beats the Buffer approach.
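function utf8_decode(str) {
  // Assumes each character of the input stands for one byte of a UTF-8 stream.
  // Anything that is not shaped like a multi-byte UTF-8 sequence stays in the string untouched.
  // The four-byte branch stops at F4 8F BF BF (U+10FFFF); beyond that, String.fromCodePoint would throw.
  return str.replace(
    /[\u00c0-\u00df][\u0080-\u00bf]|([\u00e0-\u00ef][\u0080-\u00bf]{2})|((?:[\u00f0-\u00f3][\u0080-\u00bf]|\u00f4[\u0080-\u008f])[\u0080-\u00bf]{2})/g,
    // `two` is the whole match (only used when neither group captured);
    // `three` and `four` are the capture groups for 3- and 4-byte sequences.
    (two, three, four) => String.fromCodePoint(
      // 4-byte sequence: 3 + 6 + 6 + 6 payload bits
      four ? (four.charCodeAt(0) & 7) << 18 | (four.charCodeAt(1) & 63) << 12 | (four.charCodeAt(2) & 63) << 6 | (four.charCodeAt(3) & 63) :
      // 3-byte sequence: 4 + 6 + 6 payload bits
      three ? (three.charCodeAt(0) & 15) << 12 | (three.charCodeAt(1) & 63) << 6 | (three.charCodeAt(2) & 63) :
      // 2-byte sequence: 5 + 6 payload bits
      (two.charCodeAt(0) & 31) << 6 | (two.charCodeAt(1) & 63)
    )
  );
}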
console.log(utf8_decode("SuÃ°austan 13-20 m/s og snjÃ³koma meÃ° lÃ©legu skyggni og versnandi akstursskilyrÃ°um."));
console.log(utf8_decode("ð\x9F\x98\x8B"));
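The regex is stricter than yours: it only matches character runs shaped like multi-byte UTF-8 sequences.
That means there is no need to check afterwards whether the transform actually changed anything.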
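If you would rather stay with Buffer, one possible refinement is to let a strict decoder decide whether the text really is mis-decoded UTF-8, instead of comparing lengths. This is only a sketch: the name repairMojibake is made up, and it assumes the bad decode was Latin-1, so text that went through Windows-1252 instead would need a different mapping back to bytes.

```javascript
// Hypothetical helper: undo "UTF-8 decoded as Latin-1", but only when the
// result is strictly valid UTF-8. Otherwise return the input unchanged.
function repairMojibake(text) {
  // Pure ASCII needs no repair, and any character above U+00FF cannot have
  // come out of a Latin-1 decode, so leave such strings alone.
  if (!/[\x80-\xFF]/.test(text) || /[^\x00-\xFF]/.test(text)) {
    return text;
  }

  const bytes = Buffer.from(text, 'latin1');
  try {
    // With fatal: true, decode() throws a TypeError on malformed UTF-8
    // instead of silently inserting U+FFFD replacement characters.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    return text; // not valid UTF-8, so the text was probably fine as it was
  }
}

console.log(repairMojibake('me\u00C3\u00B0'));           // með
console.log(repairMojibake('með'));                      // með (left alone)
console.log(repairMojibake('\u00F0\u009F\u0098\u008B')); // 😋
```

The trade-off is that it is all or nothing: a string that mixes correctly decoded accents with garbled ones fails strict decoding and comes back unchanged, whereas the regex version repairs the parts it recognizes.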
