parser error

RS richard22j at zoho.com
Fri Oct 27 03:33:43 PDT 2017


On 26/10/2017 01:27, Jeremy Nicoll - ml gip wrote:
> On 2017-10-26 00:51, RS wrote:
> 
>> The corruption he refers to is a few spurious NUL characters in
>> <head><metadata>.  The subtitles themselves are in <body> and they are
>> intact.
> 
> But you're a human looking at the file.  XML files have a tightly defined
> syntax (defined by a formal grammar called a DTD).  When a program tries
> to extract data from an XML file it does so using standard code that knows
> what the structure of the file is because it has also read the DTD.
> 
> Anyway for a program to be able to parse an XML file the parser reads
> the file character by character and at every point it knows (from the
> grammar definition) exactly what could come next and can classify it
> as required.
> 
> By definition an XML file is only an XML file if it entirely matches
> the grammar that is defined.  As soon as a parser finds a character
> that makes no sense, the whole file is classed as corrupt, not an XML
> file after all.
> 
> Much much more at: https://en.wikipedia.org/wiki/XML
> 
> 

I don't agree with you about the approach to parsing.  The key exercise 
is to match pairs of tags and to associate what is between the matched 
pairs with keywords in the tags, but that is not relevant to this 
discussion.  The Wikipedia article you refer to says in section 3.1:
"The code point U+0000 (Null) is the only character that is not 
permitted in any XML 1.0 or 1.1 document." so you are right to that extent.

That is not the end of the story.  The parser has to decide what to do 
when it finds an invalid character.  It appears (I am guessing) that 
XML::LibXML rejects the entire document even to the extent of rejecting 
tag content which does not include any invalid character.  It also 
appears (and again I am guessing) that XML::Simple takes a different 
approach and ignores invalid characters.  Whether it ignores invalid 
characters anywhere in the document or only if, as is the case here, 
they are outside the desired tag pair (<body> ... </body>) I am not able 
to say on the evidence I have seen.
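
As a rough illustration of the strict behaviour described above, here is a 
minimal sketch in Python rather than Perl.  It uses the standard library's 
expat-based parser, not XML::LibXML, and the sample document and the 
strict_parse helper are made up for the example; nothing here is taken from 
get_iplayer.  It only shows that a single NUL in <head><metadata> is enough 
for a strict parser to reject the whole document, intact <body> included.

```python
import xml.etree.ElementTree as ET

# Hypothetical subtitle-like document: the NUL sits in <head><metadata>,
# while the <body> content is intact.
doc = (
    b"<tt><head><metadata>ab\x00cd</metadata></head>"
    b"<body><p>Hello</p></body></tt>"
)


def strict_parse(data):
    """Return the parsed root element, or None if the parser rejects the input."""
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        # The parser gives up on the whole document, not just the bad element.
        return None


result = strict_parse(doc)
print("rejected" if result is None else "parsed")
```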

It is then up to the calling script (get_iplayer.pl) to decide what 
action to take in response to the action taken by the parser.  It is not 
adequate just to allow XML::LibXML to display "parser error" and take no 
further action.  My knowledge of Perl is not sufficient to understand 
how get_iplayer.pl interacts with XML::LibXML.
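
One possible shape for such handling in the calling code is sketched below, 
again in Python with the standard library parser and not as a description of 
how get_iplayer.pl or XML::LibXML actually work.  The parse_with_fallback 
name and the sample data are hypothetical.  The sketch assumes the document 
is in UTF-8 or another single-byte-compatible encoding; stripping NUL bytes 
from UTF-16 data would damage it.

```python
import xml.etree.ElementTree as ET


def parse_with_fallback(data: bytes):
    """Try a strict parse first; on failure, strip NUL bytes and retry once.

    Assumes UTF-8 (or ASCII) input, where a NUL byte is never part of a
    legitimate character.  A second failure is allowed to propagate.
    """
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        cleaned = data.replace(b"\x00", b"")
        return ET.fromstring(cleaned)


# Hypothetical input with a spurious NUL in the metadata.
sample = (
    b"<tt><head><metadata>ab\x00cd</metadata></head>"
    b"<body><p>Hello</p></body></tt>"
)
root = parse_with_fallback(sample)
subtitle = root.find("body/p").text
```

Retrying only after the strict parse fails means well-formed files are 
never modified, and any other kind of damage still surfaces as an error.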

I said that similar errors in subtitles were rare and so not worth 
bothering with.  That was before I became aware of the v3.02 and v3.03 
changes to cease use of XML::Simple and to require version 1.91 of 
XML::LibXML.  In the past any similar errors will have been masked by 
XML::Simple.

Best wishes
Richard




