Invalid XML Entities included in metadata file
dinkypumpkin
dinkypumpkin at gmail.com
Sun Mar 24 16:41:14 EDT 2013
On 22/03/2013 20:16, Dave Lambley wrote:
> On 22/03/13 18:36, Roger Bell_West wrote:
>> On Fri, Mar 22, 2013 at 06:17:23PM +0000, Ian W Taylor wrote:
>>> I think the problem is that get_iplayer uses HTML encode_entities()
>>> and there are about 250 entities defined in HTML but only 5 in the
>>> XML specification. I've read that XML just defines &quot; &amp;
>>> &apos; &lt; and &gt; for the "&'<> characters. However the generic
>>> metadata XML file includes things like &pound; for the British Pound
>>> sign in the description nodes.
>>
>> I've had this problem too, and would like not to have to sanitise the
>> XML before reading it.
>
> Try the attached patch, which switches to numeric entity encoding for XML.

A definite case of over-encoding. I've committed a fix for this issue
to the Git repo:

http://git.infradead.org/get_iplayer.git/commit/8fad7b46b626c74082fc9334544fce5d0eeb71d9

The fix is along the lines Ian suggested, i.e., encoding as few
characters as necessary. Since the source content should come down from
the BBC as UTF-8, I think it's a good idea to leave special characters
(like £) as they are received rather than creating numeric entities for
everything. It makes it a little easier to use the XML files in other
applications (e.g., text editors). The BBC metadata may not always be
properly encoded, but there isn't much get_iplayer can do about that.
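
To illustrate the difference, here is a rough sketch in Python (not get_iplayer's Perl, and not the code from the commit). The function names html_style_encode, xml_minimal_encode and parses_as_xml are made up for this example. It assumes the description text is already a decoded Unicode string, and it only shows why a named HTML entity such as &pound; trips up an XML parser while minimal escaping does not.

```python
import xml.etree.ElementTree as ET
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def html_style_encode(text):
    """Mimic HTML-style over-encoding (hypothetical helper): replace every
    character that has a named HTML entity with that entity."""
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in text
    )


def xml_minimal_encode(text):
    """Escape only &, < and >, leaving other characters such as the pound
    sign exactly as received (hypothetical helper)."""
    return escape(text)


def parses_as_xml(encoded):
    """Return True if the encoded text is accepted inside an XML element."""
    try:
        ET.fromstring(f"<desc>{encoded}</desc>")
        return True
    except ET.ParseError:
        return False


sample = "Fish & chips for under £5 <really>"
over = html_style_encode(sample)      # contains &pound;, undefined in XML
minimal = xml_minimal_encode(sample)  # keeps the literal pound sign

print(over, parses_as_xml(over))
print(minimal, parses_as_xml(minimal))
```

The over-encoded string is rejected because XML predefines only the five entities Ian listed, whereas the minimally escaped string round-trips through the parser unchanged, provided the file is written out as UTF-8.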

Dave: Thanks for your patch. If you want to continue using it for
yourself, I should point out a small bug. You need to import
encode_entities_numeric in Programme as well as main. Otherwise,
get_iplayer can't find it when creating the programme metadata files.