JavaScript - Remove XML-invalid chars from a Unicode string or file

Table of Contents

The Specs
The RegEx(s)
The JavaScript function

Today I was developing an Electron application for a client and I was looking for a way to remove invalid characters from a typical XML file in UTF-8 format. Unfortunately, StackOverflow was unable to help me with that, since I only found questions (and answers) related to stripping/removing non-UTF8 characters: close, yet still not enough for what I need, since there are a lot of legitimate UTF-8 characters that might cause issues within an XML file.

The Specs

As a matter of fact, according to the official XML 1.0 specification, a valid XML file should only contain Unicode characters, excluding most control characters, the surrogate blocks, FFFE, and FFFF. To keep it short, this means that the only valid characters should fall into one of the following groups:

#x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF]
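
The groups above translate directly into a small predicate. The following is only an illustrative sketch: the name isXmlChar is made up for this example and does not come from any library.

```javascript
// Hypothetical helper: returns true when the given code point falls into
// one of the groups allowed by XML 1.0 listed above.
function isXmlChar(cp) {
    return cp === 0x9 || cp === 0xA || cp === 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
}

console.log(isXmlChar(0x41));   // true  ("A")
console.log(isXmlChar(0x0));    // false (NUL)
console.log(isXmlChar(0xFFFE)); // false
```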

Additionally, the same specification explains that "Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters.". In short, this means that the following characters should be considered "discouraged" as well:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
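
The long list of supplementary ranges follows a regular pattern: it is always the last two code points of each plane from 1 to 16. A possible way to check it in code is sketched below; isDiscouragedXmlChar is a hypothetical name used only for this example.

```javascript
// Hypothetical helper: returns true for the code points listed above as
// "discouraged" by the XML 1.0 specification.
function isDiscouragedXmlChar(cp) {
    if ((cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F)) return true;
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
    // Last two code points of planes 1 to 16: #x1FFFE-#x1FFFF ... #x10FFFE-#x10FFFF
    return cp >= 0x1FFFE && cp <= 0x10FFFF && (cp & 0xFFFE) === 0xFFFE;
}

console.log(isDiscouragedXmlChar(0x7F));    // true
console.log(isDiscouragedXmlChar(0x85));    // false (not in the list)
console.log(isDiscouragedXmlChar(0x2FFFF)); // true
```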

The RegEx(s)

Now that we know what to keep and what to remove, we can build the regular expressions that filter out the unwanted characters.

ECMAScript 6

This can be easily done using ECMAScript 6, which features a great Unicode code point escape syntax that can be used to obtain any Unicode character in the following way:

\u{...}

Where the dots between the braces can be replaced with any hex value from 0 to 10FFFF, which is the highest code point defined by Unicode. Inside a regular expression, this syntax is only recognized when the u flag is set.
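
As a quick illustration (the variable name and the sample character are arbitrary), this is how the escape behaves in a string and in a regular expression with and without the u flag:

```javascript
// U+1F600 lies outside the Basic Multilingual Plane.
var face = "\u{1F600}";

console.log(face.length);                          // 2 (two UTF-16 code units)
console.log(face.codePointAt(0).toString(16));     // "1f600"
console.log(/^.$/.test(face));                     // false: "." sees one code unit at a time
console.log(/^.$/u.test(face));                    // true: with "u", "." matches a whole code point
console.log(/[\u{1F600}-\u{1F64F}]/u.test(face));  // true
```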

Here are the ES6 regular expressions:

// remove everything forbidden by XML 1.0 specifications, plus the unicode replacement character U+FFFD
var regex = /([^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFC\u{10000}-\u{10FFFF}])/ug;

// remove everything discouraged by XML 1.0 specifications
regex = /([\x7F-\x84]|[\x86-\x9F]|[\uFDD0-\uFDEF]|[\u{1FFFE}-\u{1FFFF}]|[\u{2FFFE}-\u{2FFFF}]|[\u{3FFFE}-\u{3FFFF}]|[\u{4FFFE}-\u{4FFFF}]|[\u{5FFFE}-\u{5FFFF}]|[\u{6FFFE}-\u{6FFFF}]|[\u{7FFFE}-\u{7FFFF}]|[\u{8FFFE}-\u{8FFFF}]|[\u{9FFFE}-\u{9FFFF}]|[\u{AFFFE}-\u{AFFFF}]|[\u{BFFFE}-\u{BFFFF}]|[\u{CFFFE}-\u{CFFFF}]|[\u{DFFFE}-\u{DFFFF}]|[\u{EFFFE}-\u{EFFFF}]|[\u{FFFFE}-\u{FFFFF}]|[\u{10FFFE}-\u{10FFFF}])/ug;

IMPORTANT: since I was here, I also took the chance to exclude the #xFFFD character, aka the Unicode Replacement Character, which is usually used as a placeholder when a decoder encounters an invalid sequence of bytes. Despite being formally accepted by the XML specification, this character can still cause issues with some XML parsers and consumers. If you don't want to suppress it like I did, just replace "\uFFFC" with "\uFFFD" in the first RegExp.
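
To see the first expression at work, here is a minimal sketch that applies it to a sample string. The variable names and the sample text are made up for the example, and the pattern is the one shown above without the outer capture group:

```javascript
// Forbidden characters according to XML 1.0, plus U+FFFD.
var forbidden = /[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFC\u{10000}-\u{10FFFF}]/ug;

// NUL, BACKSPACE, U+FFFE and a lone high surrogate mixed with valid text.
var dirty = "ok\u0000\u0008 text\uFFFE \uD800 \u{1F600}";
var clean = dirty.replace(forbidden, "");

console.log(clean === "ok text  \u{1F600}"); // true
```

With the u flag the lone surrogate is treated as a single code point in the D800-DFFF block, so it is removed, while the valid supplementary character at the end is kept.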

ECMAScript 5

Unfortunately, since we're using Electron, we have to build our RegExp using ECMAScript 5, which is not that great when dealing with RegExp and Unicode characters. As you might already know, ECMAScript 5 has only the following escape sequences:

Octal, which can be used to escape any character with a character code lower than 256 (i.e. any character in the extended ASCII range): \42
Hexadecimal, which works the same way as Octal but uses exactly two hex digits instead: \x22
Unicode, which uses exactly four hex digits and can be used to escape any character with a character code up to 65535: \uFFFF

What about the code points above 65535? Well, since JavaScript strings are sequences of 16-bit UTF-16 code units, all code points higher than that must be represented by a pair of (lower valued) surrogate pseudo-characters which together comprise the real character: this basically means that, in order to match these higher code point characters in an ES5 regular expression, we need to spell out the two UTF-16 surrogate halves (a high surrogate followed by a low surrogate) that encode the corresponding code point.

Now, since we've seen that the XML specification does indeed allow a nice amount of these characters - the whole [#x10000-#x10FFFF] block, not to mention all the discouraged characters - we are forced to do some extra work and convert these code points to their lower valued UTF-16 surrogate pairs.
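
The conversion itself is simple arithmetic. The sketch below shows it for a single code point; toSurrogatePair is a hypothetical helper name used only for illustration.

```javascript
// Hypothetical helper: splits a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(cp) {
    var offset = cp - 0x10000;
    var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
    var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
    return [high, low];
}

var pair = toSurrogatePair(0x1FFFE);
var hex = pair.map(function (unit) {
    return unit.toString(16).toUpperCase();
}).join(" ");

console.log(hex); // "D83F DFFE"
```

So the discouraged code point #x1FFFE has to be written as \uD83F\uDFFE in an ES5 regular expression.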

Luckily enough, there's a great JS library called regexpu that will do that automatically for you, and even a free online tool that implements it to perform the ES6-to-ES5 conversion that we need. All we need to do is paste our ES6 regular expressions there and have them converted (actually, transpiled) to a backward-compatible, Electron-friendly ES5 syntax:

// remove everything forbidden by XML 1.0 specifications, plus the unicode replacement character U+FFFD
// caveat: ES5 has no lookbehind, so the last alternative also matches the character that precedes a lone low surrogate
var regex = /((?:[\0-\x08\x0B\f\x0E-\x1F\uFFFD\uFFFE\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]))/g;

// remove everything discouraged by XML 1.0 specifications
regex = /([\x7F-\x84]|[\x86-\x9F]|[\uFDD0-\uFDEF]|(?:\uD83F[\uDFFE\uDFFF])|(?:\uD87F[\uDFFE\uDFFF])|(?:\uD8BF[\uDFFE\uDFFF])|(?:\uD8FF[\uDFFE\uDFFF])|(?:\uD93F[\uDFFE\uDFFF])|(?:\uD97F[\uDFFE\uDFFF])|(?:\uD9BF[\uDFFE\uDFFF])|(?:\uD9FF[\uDFFE\uDFFF])|(?:\uDA3F[\uDFFE\uDFFF])|(?:\uDA7F[\uDFFE\uDFFF])|(?:\uDABF[\uDFFE\uDFFF])|(?:\uDAFF[\uDFFE\uDFFF])|(?:\uDB3F[\uDFFE\uDFFF])|(?:\uDB7F[\uDFFE\uDFFF])|(?:\uDBBF[\uDFFE\uDFFF])|(?:\uDBFF[\uDFFE\uDFFF]))/g;

Electron-prebuilt-compile

If you don't want to transpile these ES6 regular expressions to their ES5 counterpart, you can replace your vanilla electron build with the electron-prebuilt-compile NPM package, which comes with native ES6 and TypeScript support.

Single-line to Multiline RegExp

Regardless of what ECMAScript version you're using, the above RegExps all have a major flaw: they are insanely long, which makes them ugly in terms of code readability. On top of that, since they're written as JavaScript RegExp literals, they cannot be split across multiple lines as if they were strings. What can we do to fix this issue?

I found a number of possible techniques for splitting JS RegExps across multiple lines while digging on StackOverflow and other tech sites. The best workaround I've found so far is explained in this StackOverflow answer, which shows how you can build an array of mini-regexps and join them at the end: despite being a good idea, it can only be used when we can effectively split the regexp, since each single "subset" has to be internally consistent. Other answers suggest using the JavaScript RegExp object, whose constructor accepts a string: the problem with that is that we need to manually escape the RegExp string, which could be a real bummer (and it alters the "actual" string, thus crippling our chances to run further tests with third-party tools).
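
For reference, the "array of mini-regexps" technique can be sketched roughly as follows. This is my own minimal reading of the idea, not the code from that answer, and it only uses a few of the discouraged ranges to keep the example short:

```javascript
// Each fragment is a small, self-consistent regular expression.
var parts = [
    /[\x7F-\x84]/,
    /[\x86-\x9F]/,
    /[\uFDD0-\uFDEF]/,
    /\uD83F[\uDFFE\uDFFF]/   // U+1FFFE and U+1FFFF as surrogate pairs
];

// RegExp.prototype.source returns the pattern text without delimiters
// or flags, so the fragments can be joined without manual escaping.
var joined = new RegExp(
    parts.map(function (re) { return re.source; }).join("|"),
    "g"
);

console.log("a\x7Fb\uFDD0c".replace(joined, "")); // "abc"
```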

Long story short, I ended up coding a couple online tools:

RegEx Splitter, which can be used to split a single-line JavaScript RegEx into multiple lines of JavaScript code.
RegEx Slasher, which can be used to add slashes and/or remove slashes from any RegEx - basically escaping and/or unescaping it.

I put both tools online in the Ryadel.IO project hub, hoping that they'll help other developers dealing with these kinds of issues: they are also both available on my GitHub repo page.

The JavaScript function

Thanks to my RegExSplitter I was eventually able to convert those ugly regular expressions into a more readable set of code lines.

Once done, I just wrapped everything up in the following JavaScript method:

/**
 * Removes XML-invalid characters from a string.
 * @param {string} string - a string potentially containing XML-invalid characters, such as control characters (STX, ETX and so on) or unpaired surrogates.
 * @param {boolean} removeDiscouragedChars - when true (the default), also removes the characters that the XML 1.0 specification marks as discouraged.
 * @returns {string} a sanitized string without all the XML-invalid characters.
 */
function removeXMLInvalidChars(string, removeDiscouragedChars = true)
{
    // remove everything forbidden by XML 1.0 specifications, plus the unicode replacement character U+FFFD
    // the only capture group holds the character that precedes a lone low surrogate (ES5 has no lookbehind): "$1" puts it back
    var regex = /[\0-\x08\x0B\f\x0E-\x1F\uFFFD\uFFFE\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/g;
    string = string.replace(regex, "$1");

    if (removeDiscouragedChars) {
        // remove everything discouraged by XML 1.0 specifications
        regex = new RegExp(
            "([\\x7F-\\x84]|[\\x86-\\x9F]|[\\uFDD0-\\uFDEF]|(?:\\uD83F[\\uDFFE\\uDFFF])|(?:\\uD87F[\\uDF"+
            "FE\\uDFFF])|(?:\\uD8BF[\\uDFFE\\uDFFF])|(?:\\uD8FF[\\uDFFE\\uDFFF])|(?:\\uD93F[\\uDFFE\\uD"+
            "FFF])|(?:\\uD97F[\\uDFFE\\uDFFF])|(?:\\uD9BF[\\uDFFE\\uDFFF])|(?:\\uD9FF[\\uDFFE\\uDFFF])"+
            "|(?:\\uDA3F[\\uDFFE\\uDFFF])|(?:\\uDA7F[\\uDFFE\\uDFFF])|(?:\\uDABF[\\uDFFE\\uDFFF])|(?:\\"+
            "uDAFF[\\uDFFE\\uDFFF])|(?:\\uDB3F[\\uDFFE\\uDFFF])|(?:\\uDB7F[\\uDFFE\\uDFFF])|(?:\\uDBBF"+
            "[\\uDFFE\\uDFFF])|(?:\\uDBFF[\\uDFFE\\uDFFF]))", "g");
        string = string.replace(regex, "");
    }

    return string;
}
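
As a cross-check for the regular expressions, the same kind of filtering can also be sketched without any RegExp, by walking the UTF-16 code units one by one. The function name stripForbiddenXmlChars is made up for this example. Note that, unlike the function above, this sketch keeps U+FFFD and does not touch the discouraged ranges.

```javascript
// Hypothetical alternative: drops the characters forbidden by XML 1.0
// by scanning the UTF-16 code units of the input string.
function stripForbiddenXmlChars(input) {
    var out = "";
    for (var i = 0; i < input.length; i++) {
        var unit = input.charCodeAt(i);
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: keep it only when a low surrogate follows.
            var next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
            if (next >= 0xDC00 && next <= 0xDFFF) {
                out += input.charAt(i) + input.charAt(i + 1);
                i++;
            }
        } else if (unit === 0x9 || unit === 0xA || unit === 0xD ||
                   (unit >= 0x20 && unit <= 0xD7FF) ||
                   (unit >= 0xE000 && unit <= 0xFFFD)) {
            out += input.charAt(i);
        }
        // Anything else (control characters, lone low surrogates,
        // U+FFFE, U+FFFF) is dropped.
    }
    return out;
}

console.log(stripForbiddenXmlChars("ab\uDC00c") === "abc"); // true
```

Comparing the output of the two approaches on a few sample strings is a cheap way to catch mistakes in a long regular expression.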


... And I was finally able to achieve what I needed!

Well, that's it for now: I sincerely hope that this post, as well as the RegEx Splitter and RegEx Slasher tools from the Ryadel.IO project hub, will also help other developers to overcome these kinds of issues.

This post is part of a series of articles, tutorials and guides on the Electron development framework. To read the other posts, click here!
