How to strip invalid characters from an UTF-8 XML file or string in PHP


Yesterday I wrote something about stripping out P7M data from an XML P7M file or string, as long as it was encoded using the CAdES format. It was quite ugly, yet it does the job for the most part, which is stripping the header & footer signature info.

Today I will raise the ugly-but-working bar even further by publishing the method I wrote as a follow-up, which strips (or rather skips) all the invalid characters from the resulting XML string, so that it can be loaded into a PHP SimpleXML object.

This is nothing less than a mixup of two methods I found on StackOverflow, so the credits go to the respective authors (whom I thank). I needed them both because I had to deal with invalid UTF-8 characters and with invalid XML characters: the method makes use of a regular expression, which is shortly followed by an iterative, char-by-char approach.
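To make the two-pass idea concrete, here is a minimal sketch of it rather than the original listing. It assumes that the regular expression is the pass that drops malformed UTF-8 byte sequences and that the char-by-char loop is the pass that drops characters outside the XML 1.0 `Char` range. The function names (`stripInvalidUtf8`, `stripInvalidXmlChars`, `sanitizeXmlString`) are hypothetical, and the loop assumes PHP 7.4+ with the mbstring extension enabled.

```php
<?php
// Hypothetical helpers sketching the approach; not the original code.

// Pass 1 (regex): keep runs of well-formed UTF-8, drop any other byte.
function stripInvalidUtf8(string $text): string
{
    $regex = <<<'REGEX'
/
  (
    (?: [\x00-\x7F]                        # ASCII
    |   [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences, no overlongs
    |   \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte sequences, no overlongs
    |   [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte sequences
    |   \xED[\x80-\x9F][\x80-\xBF]         # 3-byte sequences, no surrogates
    |   \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte sequences, no overlongs
    |   [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte sequences
    |   \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte sequences up to U+10FFFF
    ){1,100}                               # a valid run, kept through group 1
  )
| .                                        # any other single byte is dropped
/xs
REGEX;
    $clean = preg_replace($regex, '$1', $text);

    // preg_replace() returns null on a PCRE error; fall back to an empty string.
    return $clean ?? '';
}

// Pass 2 (char-by-char): keep only code points allowed by the XML 1.0 Char rule.
// Expects valid UTF-8 input, i.e. the output of stripInvalidUtf8().
function stripInvalidXmlChars(string $text): string
{
    $out = '';
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp === 0x9 || $cp === 0xA || $cp === 0xD
            || ($cp >= 0x20 && $cp <= 0xD7FF)
            || ($cp >= 0xE000 && $cp <= 0xFFFD)
            || ($cp >= 0x10000 && $cp <= 0x10FFFF)) {
            $out .= $char;
        }
    }
    return $out;
}

// Both passes in the order described in the text: regex first, then the loop.
function sanitizeXmlString(string $text): string
{
    return stripInvalidXmlChars(stripInvalidUtf8($text));
}
```

With these helpers, something like `simplexml_load_string(sanitizeXmlString($raw))` would be the typical call, where `$raw` stands for the XML string left over after the P7M envelope has been stripped. Building the output one character at a time is slow on large inputs, which matches the "ugly but working" spirit of the approach.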

As I said before, it's rather ugly and highly inefficient, possibly even more than the previous one... however it gets the job done, and since I had to complete the task in a ridiculously short amount of time that's the best I've come up with. In case someone wants to come up with something better, they're VERY welcome... I'll gladly accept their suggestions. Until then, I hope that this will actually help other PHP "double-clawed" developers to achieve decent results as well.

... It definitely seems like the PHP hammer has scored yet another hit!


I swear it won't happen again anytime soon... :)

 

About Ryan

IT Project Manager, Web Interface Architect and Lead Developer for many high-traffic web sites & services hosted in Italy and Europe. Since 2010 he has also been a lead designer of many apps and games for Android, iOS and Windows Phone mobile devices for a number of Italian companies. Microsoft MVP for Development Technologies since 2018.


3 Comments on “How to strip invalid characters from an UTF-8 XML file or string in PHP”

  1. Pingback: PHP - How to strip P7M data from a XML.P7M file or string (CAdES)
  2. Many thanks for this code.

    It saved my life today (Sunday 29th August 2021) when I was transferring three gigabytes of old and potentially dodgy mails (e.g. lots of spam) from mbox format into a MySQL database.

    All I had to do was cut’n’paste it into my script and it worked immediately and at the first time of asking.

  3. Pingback: Remove non-utf8 characters from string
