Solipsistic Meanderings

September 20, 2004

So what’s the stat, Matt? I’ve been working over the weekend on the weird XMLRPC problem that crops up when sending a post from Blog to WordPress. The problem, as I mentioned in the previous entry, has to do with various non-ASCII characters like the pound sign (£) or the Euro sign (€). It seems to crop up because the XMLRPC code in PHP expects the character data sent from Blog to be UTF-8 encoded, whereas I am not exactly certain how the data from Blog is actually encoded – but it certainly is not UTF-8 :p I had been hoping to find a fix at the Blog end that wouldn’t touch the PHP code in WordPress. Unfortunately, after about a day of testing and trying various avenues of attack, I was basically no further ahead than I had been the day before.
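To illustrate the kind of mismatch described above, here is a minimal Python sketch (not code from Blog or WordPress). The post does not say how Blog encodes its data, so Latin-1 is only an assumption chosen for the example: in Latin-1 the pound sign is a single byte, which on its own is not valid UTF-8.

```python
# Illustration only: Blog's real encoding is unknown; Latin-1 is an assumption.
pound = "£"

latin1_bytes = pound.encode("latin-1")  # one byte: 0xA3
utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))
    print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
```

A receiver that insists on UTF-8 would accept the second byte string and choke on the first, which is consistent with the symptom in this entry.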

So, I finally decided to settle on a solution which would involve changes both at the Blog end and the WordPress end. Basically, I would encode selected characters with their numeric HTML entity equivalent at the Blog end. So, £ would become &#163; at the Blog end. You might ask why I didn’t use the HTML character entity &pound; for £ instead of &#163;, and it would be a good question since that’s what I too wanted to do originally :p However, at the WordPress end things blew up when I used the character entity since, apparently, the &pound; entity is not defined in the XML spec and so is not recognized by the PHP XML parser 🙁 I looked at a few suggested workarounds for this problem but none of them really worked for me, so I decided to go with the numeric entity since that at least worked … after a fashion.
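The difference between the two entity forms can be sketched in Python (an illustration using Python’s standard XML parser rather than the PHP one, so the exact error behaviour in WordPress may differ). The helper names `to_numeric_entities` and `parses` are made up for this example, and the helper encodes every non-ASCII character, whereas the approach above encodes only selected characters.

```python
import xml.etree.ElementTree as ET


def to_numeric_entities(text: str) -> str:
    """Replace each non-ASCII character with a decimal numeric reference.

    Hypothetical helper: the real Blog-side code is not shown in the post.
    """
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def parses(xml_text: str) -> bool:
    """Return True if the XML parser accepts the given document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(to_numeric_entities("Price: £5"))
    # Numeric references are part of XML itself, so this parses.
    print(parses("<value>&#163;</value>"))
    # &pound; is an HTML entity, not one of XML's predefined entities.
    print(parses("<value>&pound;</value>"))
```

XML predefines only `&amp;`, `&lt;`, `&gt;`, `&apos;` and `&quot;`; anything else needs a DTD declaration, which is why the numeric form is the safer choice here.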

But this still wasn’t the whole solution because, while the numeric entity was not outright rejected by the XML parser, I would always get a strange character in front of the actual character whenever it was parsed by the PHP XML parser. So, instead of the single character that I originally had, I would get a garbage character followed by the actual character. This turned out to be due to the whole non-UTF-8 encoding problem that I’d already talked about, so I decided to add some extra code to the XML parsing routines in WordPress to handle this particular scenario. The new code looks for particular two-character sequences which meet certain criteria and then encodes them into UTF-8. So far, it has been working fairly well. Of course, whether this particular change affects something else in WordPress is something that I can’t fully confirm or deny at the moment :p However, I’ve managed to transfer around 600 out of the 4,000+ entries that Nigel gave me and progress seems to be much smoother now. Stay tuned for further developments …
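A “garbage character followed by the actual character” is what one typically sees when the two UTF-8 bytes of a character are displayed as if they were two single-byte characters. The Python sketch below reproduces that symptom and reverses it. It is only a guess at the mechanism: Latin-1 is an assumption, `repair_mojibake` is a hypothetical name, and the actual change made to the WordPress parsing routines is not shown in the post.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were misread as Latin-1, if that is what happened.

    Hypothetical helper for illustration; text that does not fit the
    pattern is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    # The pound sign is 0xC2 0xA3 in UTF-8; read as Latin-1, that is two characters.
    garbled = "£".encode("utf-8").decode("latin-1")
    print(garbled)
    print(repair_mojibake(garbled))
```

A blanket repair like this can misfire on text that legitimately contains such pairs, which is the same kind of side effect the entry worries about.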