How to skip invalid characters from a UTF-8 XML file or string in PHP

Yesterday I wrote something about stripping out P7M data from an XML P7M file or string, as long as it was encoded using the CAdES format. It was quite ugly, yet it does the job.

Today I will raise the ugly-but-working bar even further by publishing the method I wrote as a follow-up, which strips all the invalid characters from the resulting XML string so that it can be loaded into a PHP SimpleXML object.

This is nothing less than a mix of two methods I found here and here on StackOverflow, so the credit goes to their respective authors (whom I thank). I needed both because I had to deal with invalid UTF-8 characters and with invalid XML characters: the method uses a regular expression, shortly followed by an iterative, char-by-char pass.
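To give an idea of how such a two-pass cleanup can work, here is a minimal sketch. It is not the original method: the function names `utf8CodePoint()`, `isValidXmlChar()` and `sanitizeXmlString()` are made up for illustration. The first pass assumes the usual "well-formed UTF-8" byte patterns and uses a regular expression to keep only the sequences that match them; the second pass walks the surviving characters one by one and keeps only the code points allowed by the XML 1.0 `Char` production.

```php
<?php

// Hypothetical helper: decode one already validated UTF-8 character
// into its Unicode code point.
function utf8CodePoint(string $char): int
{
    $b = array_values(unpack('C*', $char)); // bytes as integers, 0-indexed
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Hypothetical helper: true if the code point is allowed by XML 1.0:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isValidXmlChar(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Hypothetical entry point: returns $input without invalid UTF-8 bytes
// and without characters that XML 1.0 does not allow.
function sanitizeXmlString(string $input): string
{
    // Pass 1 (regex): each match is exactly one well-formed UTF-8 character;
    // bytes that match no alternative are silently skipped.
    $wellFormedUtf8 = '/
          [\x00-\x7F]                        # ASCII (control chars are filtered in pass 2)
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    /x';
    preg_match_all($wellFormedUtf8, $input, $matches);

    // Pass 2 (char by char): keep only characters that XML 1.0 accepts.
    $result = '';
    foreach ($matches[0] as $char) {
        if (isValidXmlChar(utf8CodePoint($char))) {
            $result .= $char;
        }
    }
    return $result;
}
```

The cleaned string can then be handed to `simplexml_load_string()`. Building one array entry per character is anything but memory-friendly, so for big files a streaming approach would probably be a better fit.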

As I said before, it’s rather ugly and highly inefficient, possibly even more than the previous one… however it gets the job done, and since I had to complete the task in a ridiculously short amount of time, that’s the best I’ve come up with. In case someone wants to come up with something better, they’re VERY welcome… I’ll gladly accept any suggestions. Until then, I hope that this will actually help other PHP “double-clawed” developers to achieve decent results as well.

… It definitely seems like the PHP hammer has scored yet another hit!

I swear it won’t happen again anytime soon… 🙂



About Ryan

IT Project Manager, Web Interface Architect and Lead Developer for many high-traffic web sites & services hosted in Italy and Europe. Since 2010 he has also been a lead designer for many apps and games for Android, iOS and Windows Phone mobile devices for a number of Italian companies.
