Invalid XML Entities included in metadata file

dinkypumpkin dinkypumpkin at gmail.com
Sun Mar 24 16:41:14 EDT 2013


On 22/03/2013 20:16, Dave Lambley wrote:
> On 22/03/13 18:36, Roger Bell_West wrote:
>> On Fri, Mar 22, 2013 at 06:17:23PM +0000, Ian W Taylor wrote:
>>> I think the problem is that get_iplayer uses HTML encode_entities()
>>> and there are about 250 entities defined in HTML but only 5 in the
>>> XML specification.  I've read that XML just defines &quot; &amp;
>>> &apos; &lt; and &gt; for the "&'<> characters.  However the generic
>>> metadata XML file includes things like &pound; for the British? Pound
>>> sign in the description nodes.
>>
>> I've had this problem too, and would like not to have to sanitise the
>> XML before reading it.
>
> Try the attached patch, which switches to numeric entity encoding for XML.

A definite case of over-encoding.  I've committed a fix for this issue 
to the Git repo:

http://git.infradead.org/get_iplayer.git/commit/8fad7b46b626c74082fc9334544fce5d0eeb71d9

The fix is along the lines Ian suggested, i.e., encoding as few 
characters as necessary.  Since the source content should come down from 
the BBC as UTF-8, I think it's a good idea to leave special characters 
(like £) as they are received rather than creating numeric entities for 
everything.  It makes it a little easier to use the XML files in other 
applications (e.g., text editors).  The BBC metadata may not always be 
properly encoded, but there isn't much get_iplayer can do about that.
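
To illustrate the "encode as few characters as necessary" idea, here is a 
minimal sketch in Python.  It is not the code from the commit above 
(get_iplayer is written in Perl), and the names escape_xml_minimal and 
parses_as_xml are made up for this example.  It escapes only the five 
characters that have predefined XML entities and leaves £ alone, on the 
assumption that the file is written out as UTF-8.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_xml_minimal(text):
    """Escape only the five characters XML predefines entities for.

    escape() handles &, < and > by default; the extra mapping adds the
    two quote characters. Everything else (e.g. £) passes through as-is.
    """
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def parses_as_xml(document):
    """Return True if the string is accepted by a standard XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


description = 'Tickets cost £5 & "under 16s" go free <limited>'
escaped = escape_xml_minimal(description)
print(escaped)
# Tickets cost £5 &amp; &quot;under 16s&quot; go free &lt;limited&gt;

# Minimal escaping gives a document that parses and round-trips.
good = "<desc>" + escaped + "</desc>"
print(parses_as_xml(good))                     # True
print(ET.fromstring(good).text == description)  # True

# An HTML-only named entity is rejected when no DTD declares it.
bad = "<desc>&pound;5</desc>"
print(parses_as_xml(bad))                      # False
```

The last check shows the original problem: &pound; is defined in HTML but 
not in XML, so a plain XML parser reports an undefined entity.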

Dave: Thanks for your patch.  If you want to continue using it for 
yourself, I should point out a small bug.  You need to import 
encode_entities_numeric in the Programme package as well as in main. 
Otherwise, get_iplayer can't find it when creating the programme 
metadata files.
