JavaScript - Remove XML-invalid chars from a Unicode string or file

Two Regular Expressions and a useful JavaScript / ECMAScript function to strip invalid characters from UTF8 strings and XML documents or other text files

Today I was developing an Electron application for a client and I was looking for a way to remove invalid characters from a typical XML file in UTF-8 format. Unfortunately, StackOverflow was unable to help me with that, since I only found questions (and answers) related to stripping/removing non-UTF8 characters: close, yet still not enough for what I needed, since there are a lot of legitimate UTF8 characters that might cause issues within an XML file.

The Specs

As a matter of fact, according to the official XML 1.0 specification, a valid XML file should only contain Unicode characters, excluding the surrogate blocks, FFFE, and FFFF; most of the control characters below #x20 (everything except tab, line feed and carriage return) are excluded as well. To keep it short, this means that the only valid characters should fall into one of the following groups:

#x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF]
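As a minimal sketch, the allowed ranges above can be written as a plain code point check. The function name isXmlChar is an assumption made for illustration only: it is not defined by the XML specification or by any library.

```javascript
// Returns true when the given code point falls into one of the ranges
// allowed by XML 1.0, as listed above.
// NOTE: "isXmlChar" is a hypothetical helper name.
function isXmlChar(codePoint) {
  return codePoint === 0x9 ||
    codePoint === 0xA ||
    codePoint === 0xD ||
    (codePoint >= 0x20 && codePoint <= 0xD7FF) ||
    (codePoint >= 0xE000 && codePoint <= 0xFFFD) ||
    (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
}

console.log(isXmlChar(0x41));    // true  (LATIN CAPITAL LETTER A)
console.log(isXmlChar(0x0));     // false (NUL)
console.log(isXmlChar(0xD800));  // false (surrogate)
console.log(isXmlChar(0xFFFE));  // false
console.log(isXmlChar(0x1F600)); // true  (outside the BMP)
```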

Additionally, the above specification explains that 'Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters.' In short, this means that the following characters should be considered "discouraged" as well:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
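The list above follows a regular pattern: three ranges in the Basic Multilingual Plane, then the last two code points of each of the planes 1 to 16. A possible way to express that as a check is sketched below; the function name isDiscouragedXmlChar is hypothetical and used only for illustration.

```javascript
// Returns true when the code point belongs to one of the "discouraged"
// ranges listed above.
// NOTE: "isDiscouragedXmlChar" is a hypothetical helper name.
function isDiscouragedXmlChar(codePoint) {
  return (codePoint >= 0x7F && codePoint <= 0x84) ||
    (codePoint >= 0x86 && codePoint <= 0x9F) ||
    (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) ||
    // Last two code points of planes 1 to 16:
    // #x1FFFE-#x1FFFF, #x2FFFE-#x2FFFF, ..., #x10FFFE-#x10FFFF.
    (codePoint >= 0x1FFFE && codePoint <= 0x10FFFF &&
      (codePoint & 0xFFFF) >= 0xFFFE);
}

console.log(isDiscouragedXmlChar(0x7F));     // true
console.log(isDiscouragedXmlChar(0x85));     // false (NEL is not in the list)
console.log(isDiscouragedXmlChar(0x1FFFE));  // true
console.log(isDiscouragedXmlChar(0x20000));  // false
console.log(isDiscouragedXmlChar(0x10FFFF)); // true
```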

The RegEx(s)

Now that we know what to keep and what to remove, we can build the regular expressions that match the unwanted characters, so that they can be filtered out.

ECMAScript 6

This could be easily done using ECMAScript 6, which features a great Unicode code point escape syntax, \u{...}, that can be used to obtain any Unicode character: the dots between the curly braces can be replaced with any hex value from 0 to 10FFFF, which is the highest code point defined by Unicode. Inside a regular expression, this escape is only recognized when the u flag is set.
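A short illustration of the code point escape, both in a string literal and in a regular expression (the variable names are arbitrary):

```javascript
// Code point escape in a string literal (ES6 and later).
var grinning = '\u{1F600}'; // U+1F600, a character outside the BMP

console.log(grinning.length);                       // 2 (two UTF-16 code units)
console.log(grinning.codePointAt(0).toString(16));  // "1f600"

// In a regular expression the escape requires the "u" flag.
var astral = /[\u{10000}-\u{10FFFF}]/u;

console.log(astral.test(grinning)); // true
console.log(astral.test('A'));      // false
```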

Here are the ES6 regular expressions:
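As a minimal sketch built directly from the ranges listed in the specs section, they could look like the following. The variable names invalidXmlChars and discouragedXmlChars are assumptions made for illustration, and the expressions are a reconstruction from the ranges, not necessarily character-by-character identical to the ones originally used.

```javascript
// Matches every character that is NOT allowed by XML 1.0. The allowed range
// stops at FFFC instead of FFFD, so U+FFFD is matched (and removed) as well.
var invalidXmlChars = /[^\u{9}\u{A}\u{D}\u{20}-\u{D7FF}\u{E000}-\u{FFFC}\u{10000}-\u{10FFFF}]/gu;

// Matches every character that XML 1.0 lists as "discouraged".
var discouragedXmlChars = /[\u{7F}-\u{84}\u{86}-\u{9F}\u{FDD0}-\u{FDEF}\u{1FFFE}\u{1FFFF}\u{2FFFE}\u{2FFFF}\u{3FFFE}\u{3FFFF}\u{4FFFE}\u{4FFFF}\u{5FFFE}\u{5FFFF}\u{6FFFE}\u{6FFFF}\u{7FFFE}\u{7FFFF}\u{8FFFE}\u{8FFFF}\u{9FFFE}\u{9FFFF}\u{AFFFE}\u{AFFFF}\u{BFFFE}\u{BFFFF}\u{CFFFE}\u{CFFFF}\u{DFFFE}\u{DFFFF}\u{EFFFE}\u{EFFFF}\u{FFFFE}\u{FFFFF}\u{10FFFE}\u{10FFFF}]/gu;

var dirty = 'a\u0000b\uFFFDc\u007Fd\u{1F600}';
var clean = dirty.replace(invalidXmlChars, '').replace(discouragedXmlChars, '');

console.log(clean === 'abcd\u{1F600}'); // true
```

With the u flag the string is processed code point by code point, so a well-formed surrogate pair counts as a single (allowed) character, while a lone surrogate is matched by the first expression and removed.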

IMPORTANT: since I was here, I also took the chance to exclude the #xFFFD character, aka the Unicode Replacement Character, which is usually used as a placeholder when the decoder encounters an invalid sequence of bytes. Despite being formally accepted by the XML specification, this character can still raise issues with XML parsers. If you don't want to suppress it like I did, just replace "uFFFC" with "uFFFD" in the first RegExp.

ECMAScript 5

Unfortunately, since we're using Electron, we have to build our RegExp using ECMAScript 5, which is not that great when dealing with RegExp and Unicode characters. As you might already know, ECMAScript 5 has only the following escape sequences:

  • Octal, which can be used to escape any character with a character code lower than 256 (i.e. any character in the extended ASCII range): \42
  • Hexadecimal, which works the same way as Octal but uses exactly two hex digits instead: \x22
  • Unicode, which can be used to escape any character with a character code up to 65535, using exactly four hex digits: \uFFFF
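The three escape styles listed above can be compared on the same character, the double quote (decimal 34, octal 42, hex 22). This is only an illustrative sketch; note that octal escapes are a legacy syntax and are not accepted in regular expressions that use the u flag.

```javascript
var quote = '"'; // U+0022

console.log(/\42/.test(quote));    // true: octal escape
console.log(/\x22/.test(quote));   // true: hexadecimal escape, two digits
console.log(/\u0022/.test(quote)); // true: Unicode escape, four digits

// A third hex digit is not part of the escape:
// \x125 means the character \x12 followed by a literal "5".
console.log(/^\x125$/.test('\x12' + '5')); // true
```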

What about the code points above 65535? Well, since JavaScript strings are internally sequences of 16-bit code units (UCS-2/UTF-16), all code points higher than that must be represented by a pair of (lower valued) surrogate pseudo-characters which together make up the real character. This basically means that, in order to match these higher code point characters in JavaScript, we need to use the two UTF-16 surrogate halves (a high surrogate followed by a low surrogate) that correspond to the code point.

Now, since we've seen that the XML specification does indeed allow a nice amount of these characters - the whole [#x10000-#x10FFFF] block, not to mention the discouraged characters above #xFFFF - we are forced to do some extra work and convert these code points to their lower valued UTF-16 surrogate pairs.
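For reference, this is the standard UTF-16 arithmetic behind that conversion, written as a small sketch. The helper names toSurrogatePair and hex are hypothetical.

```javascript
// Converts a code point above 0xFFFF into its UTF-16 surrogate pair.
// NOTE: "toSurrogatePair" and "hex" are hypothetical helper names.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;      // 20 significant bits
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

function hex(n) {
  return n.toString(16).toUpperCase();
}

console.log(toSurrogatePair(0x10000).map(hex).join(' '));  // D800 DC00
console.log(toSurrogatePair(0x1FFFE).map(hex).join(' '));  // D83F DFFE
console.log(toSurrogatePair(0x10FFFF).map(hex).join(' ')); // DBFF DFFF
```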

Luckily enough, there's a great JS library called regexpu that will do that automatically for you, and even a free online tool that uses it to perform the ES6-to-ES5 conversion that we need. All we need to do is to paste our ES6 regular expressions there and have them converted (actually, transpiled) to a backward-compatible, Electron-friendly ES5 syntax:
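The following is a hand-written ES5 sketch of what such expressions can look like; it is not the actual regexpu output, and all the names in it are assumptions. Since ES5 has no lookbehind, the first expression matches well-formed surrogate pairs on purpose and a replacer function puts them back, so that only lone surrogates are dropped.

```javascript
// Invalid characters, ES5 syntax. The first alternative matches a well-formed
// surrogate pair so that the replacer can keep it; the second one matches the
// forbidden control characters, any lone surrogate, U+FFFD, U+FFFE and U+FFFF.
var invalidXmlCharsEs5 = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\0-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFD-\uFFFF]/g;

// NOTE: "stripInvalidXmlChars" is a hypothetical helper name.
function stripInvalidXmlChars(str) {
  return str.replace(invalidXmlCharsEs5, function (match) {
    // A two-unit match is a valid surrogate pair: keep it.
    return match.length === 2 ? match : '';
  });
}

// Discouraged characters, ES5 syntax. The code points xFFFE and xFFFF of
// planes 1 to 16 become a fixed high surrogate followed by DFFE or DFFF.
var discouragedXmlCharsEs5 = /[\x7F-\x84\x86-\x9F\uFDD0-\uFDEF]|[\uD83F\uD87F\uD8BF\uD8FF\uD93F\uD97F\uD9BF\uD9FF\uDA3F\uDA7F\uDABF\uDAFF\uDB3F\uDB7F\uDBBF\uDBFF][\uDFFE\uDFFF]/g;

var sample = 'a\u0000b\uFFFDc\uD83D\uDE00d\uD83D';
console.log(stripInvalidXmlChars(sample) === 'abc\uD83D\uDE00d'); // true
console.log('a\x7Fb\uD83F\uDFFEc'.replace(discouragedXmlCharsEs5, '') === 'abc'); // true
```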

Electron-prebuilt-compile

If you don't want to transpile these ES6 regular expressions to their ES5 counterparts, you can replace your vanilla electron build with the electron-prebuilt-compile NPM package, which comes with native ES6 and TypeScript support.

Single-line to Multiline RegExp

Regardless of what ECMAScript version you're using, the above RegExps all have a major flaw: they are insanely long, which makes them ugly in terms of code readability. On top of that, since they're using the peculiar JavaScript RegExp literal syntax, they cannot be split into multiple lines as if they were strings. What can we do to fix this issue?

I found a number of possible techniques for splitting JS RegExps over multiple lines while digging on StackOverflow and other tech sites. The best workaround I've found so far is explained in a StackOverflow answer that shows how you can build an array of mini-regexps and join them at the end: despite being a good idea, it can only be used when we can effectively split the regexp, since each single "subset" has to be internally consistent. Other answers suggest using the JavaScript RegExp object, whose constructor accepts a string: the problem with that is that we need to manually escape the RegExp string, which could be a real bummer (and it alters the "actual" string, thus crippling our chances to run further tests with third-party tools).
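Both techniques can be sketched on a short pattern (the BMP part of the discouraged ranges). The variable names are arbitrary and the snippet is only meant to show the trade-off described above.

```javascript
// Technique 1: an array of small RegExp literals, joined through .source.
// Each piece must be a valid regular expression on its own.
var pieces = [
  /[\x7F-\x84]/,
  /[\x86-\x9F]/,
  /[\uFDD0-\uFDEF]/
];
var fromPieces = new RegExp(
  pieces.map(function (piece) { return piece.source; }).join('|'),
  'g'
);

// Technique 2: the RegExp constructor fed with a concatenated string.
// Every backslash of the original expression has to be doubled.
var fromString = new RegExp(
  '[\\x7F-\\x84]|' +
  '[\\x86-\\x9F]|' +
  '[\\uFDD0-\\uFDEF]',
  'g'
);

var text = 'a\x7Fb\x90c\uFDD0d';
console.log(text.replace(fromPieces, '')); // abcd
console.log(text.replace(fromString, '')); // abcd
```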

Long story short, I ended up coding a couple of online tools:

  • RegEx Splitter, which can be used to split a single-line JavaScript RegEx into multiple lines of JavaScript code.
  • RegEx Slasher, which can be used to add slashes and/or remove slashes from any RegEx - basically escaping and/or unescaping it.

I put both tools online in the Ryadel.IO project hub, hoping that they'll help other developers dealing with these kinds of issues: they are also both available on my GitHub repo page.

The JavaScript function

Thanks to my RegEx Splitter I was eventually able to convert those ugly regular expressions into a more readable set of code lines.

Once done, I just wrapped everything up in the following JavaScript method:
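A minimal ES5-compatible sketch of such a function is shown below. The function name removeXMLInvalidChars and the removeDiscouragedChars parameter are assumptions made for illustration, and the expressions are hand-written from the ranges in the specs section, split over multiple lines through the RegExp constructor.

```javascript
// Removes the characters forbidden by XML 1.0 (plus U+FFFD) from a string and,
// optionally, the discouraged ones too.
// NOTE: the function and parameter names are hypothetical.
function removeXMLInvalidChars(str, removeDiscouragedChars) {
  // Well-formed surrogate pairs are matched first and kept by the replacer;
  // control characters, lone surrogates and U+FFFD-U+FFFF are removed.
  var invalid = new RegExp(
    '[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|' +
    '[\\0-\\x08\\x0B\\x0C\\x0E-\\x1F]|' +
    '[\\uD800-\\uDFFF]|' +
    '[\\uFFFD-\\uFFFF]',
    'g'
  );
  str = str.replace(invalid, function (match) {
    return match.length === 2 ? match : '';
  });

  if (removeDiscouragedChars) {
    var discouraged = new RegExp(
      '[\\x7F-\\x84\\x86-\\x9F\\uFDD0-\\uFDEF]|' +
      '[\\uD83F\\uD87F\\uD8BF\\uD8FF\\uD93F\\uD97F\\uD9BF\\uD9FF' +
      '\\uDA3F\\uDA7F\\uDABF\\uDAFF\\uDB3F\\uDB7F\\uDBBF\\uDBFF]' +
      '[\\uDFFE\\uDFFF]',
      'g'
    );
    str = str.replace(discouraged, '');
  }

  return str;
}

var input = '<a>\u0000x\uFFFD\u007F\uD83D\uDE00</a>';
console.log(removeXMLInvalidChars(input, false) === '<a>x\u007F\uD83D\uDE00</a>'); // true
console.log(removeXMLInvalidChars(input, true) === '<a>x\uD83D\uDE00</a>');        // true
```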

... And I was finally able to achieve what I needed!

Well, that's it for now: I sincerely hope that this post, as well as the RegEx Splitter and RegEx Slasher tools from the Ryadel.IO project hub, will also help other developers to overcome these kinds of issues.

This post is part of a series of articles, tutorials and guides on the Electron development framework.

About Ryan

IT Project Manager, Web Interface Architect and Lead Developer for many high-traffic web sites & services hosted in Italy and Europe. Since 2010 he has also been a lead designer for many apps and games for Android, iOS and Windows Phone mobile devices for a number of Italian companies. Microsoft MVP for Development Technologies since 2018.
