Invalid XML Entities included in metadata file
Ian W Taylor
ian.wight.taylor at gmail.com
Fri Mar 22 14:17:23 EDT 2013
When using get_iplayer (V2.82) I ask it to produce a "generic" XML
metadata file. But when I run xpath on that file, it says that the XML is invalid.
I think the problem is that get_iplayer uses HTML encode_entities() and
there are about 250 entities defined in HTML but only 5 in the XML
specification. I've read that XML just defines &quot; &amp; &apos; &lt;
and &gt; for the "&'<> characters. However the generic metadata XML
file includes things like &pound; for the British pound sign in the
description nodes.
I fixed it by changing the call to encode_entities() in substitute() from...
    } elsif ($sanitize_mode == 3) {
        $replace = encode_entities( $value );
to be...
    } elsif ($sanitize_mode == 3) {
        $replace = encode_entities( $value, '"&\'<>' );
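To show the difference outside get_iplayer, here is a minimal standalone sketch, assuming HTML::Entities is installed. The helper name xml_escape and the sample description string are made up for illustration; they are not taken from the get_iplayer source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical helper (not part of get_iplayer): escape only the five
# characters that XML treats specially and leave everything else alone.
sub xml_escape {
    my ($value) = @_;
    return encode_entities( $value, '"&\'<>' );
}

# Made-up sample description; \x{A3} is the pound sign.
my $desc = "Budget: \x{A3}5 & <more>";

# Default call: high-bit characters become HTML named entities
# such as &pound;, which XML does not define.
my $html_style = encode_entities($desc);

# Restricted call: the pound sign is passed through as a literal character.
my $xml_style = xml_escape($desc);

binmode STDOUT, ':encoding(UTF-8)';
print "HTML style: $html_style\n";
print "XML style:  $xml_style\n";
```

The first line printed should contain &pound; and the second the literal pound character, with & < and > escaped in both. This only helps if the metadata file is written in an encoding that can hold such characters (UTF-8, for example), which I have not checked in get_iplayer itself.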
And that fixed my problem. However, I notice that there is a lot of other
code (which it looks like I don't use) with lines like "print XML
... encode_entities( ..." and I suspect that it may need fixing too.
I don't know how to produce a "git patch", if that is the correct term,
but I do know how to email so I am passing on my suggestion for a fix here.
Testing it ...
As of today there is a podcast on the BBC site that has a Pound sign in
the description. It is "Wake_Up_To_Money" episode
"Money_Budget_day_20_Mar_13" and it can be obtained using --metadataonly.
The following xpath command barfs under Ubuntu.
xpath -e '//desc/text()' Wake_Up_To_Money/Money_Budget_day_20_Mar_13.xml
Under FreeBSD the xpath command takes the filename as the first param
followed by the XPath queries. It also says that the XML produced by
get_iplayer is invalid. Mind you, both versions of xpath are just Perl
scripts using XML::XPath.
Perhaps the problem is that a DTD used to be obtained from the web site
named in the 2nd line of the XML, but is no longer available?
<program_meta_data xmlns="http://linuxcentre.net/xmlstuff/get_iplayer"
revision="1">
--
Regards,
Ian Taylor