parser error
RS
richard22j at zoho.com
Fri Oct 27 03:33:43 PDT 2017
On 26/10/2017 01:27, Jeremy Nicoll - ml gip wrote:
> On 2017-10-26 00:51, RS wrote:
>
>> The corruption he refers to is a few spurious NUL characters in
>> <head><metadata>. The subtitles themselves are in <body> and they are
>> intact.
>
> But you're a human looking at the file. XML files have a tightly defined
> syntax (defined by a formal grammar called a DTD). When a program tries
> to extract data from an XML file it does so using standard code that knows
> what the structure of the file is because it has also read the DTD.
>
> Anyway for a program to be able to parse an XML file the parser reads
> the file character by character and at every point it knows (from the
> grammar definition) exactly what could come next and can classify it
> as required.
>
> By definition an XML file is only an XML file if it entirely matches
> the grammar that is defined. As soon as a parser finds a character
> that makes no sense, the whole file is classed as corrupt, not an XML
> file after all.
>
> Much much more at: https://en.wikipedia.org/wiki/XML
>
>
I don't agree with you about the approach to parsing. The key exercise
is to match pairs of tags and to associate what is between the matched
pairs with keywords in the tags, but that is not relevant to this
discussion. The Wikipedia article you refer to says in section 3.1:
"The code point U+0000 (Null) is the only character that is not
permitted in any XML 1.0 or 1.1 document." So you are right to that extent.
That is not the end of the story. The parser has to decide what to do
when it finds an invalid character. It appears (I am guessing) that
XML::LibXML rejects the entire document even to the extent of rejecting
tag content which does not include any invalid character. It also
appears (and again I am guessing) that XML::Simple takes a different
approach and ignores invalid characters. Whether it ignores invalid
characters anywhere in the document or only if, as is the case here,
they are outside the desired tag pair (<body> ... </body>) I am not able
to say on the evidence I have seen.
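
As a rough illustration of the "reject the entire document" behaviour, here is a small sketch in Python rather than Perl. It uses Python's standard xml.etree.ElementTree parser, not XML::LibXML or XML::Simple, so it says nothing definite about how either of those modules behaves. The sample document and the helper name strict_parse are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical subtitle document: two NUL bytes sit in <head><metadata>,
# while the subtitle text in <body> is intact.
DAMAGED = (
    b"<tt><head><metadata>title\x00\x00</metadata></head>"
    b"<body><p>Hello</p><p>World</p></body></tt>"
)


def strict_parse(data):
    """Return (root, None) on success or (None, error message) on failure."""
    try:
        return ET.fromstring(data), None
    except ET.ParseError as err:
        return None, str(err)


# The strict parser gives up on the whole document, including the intact <body>.
root, error = strict_parse(DAMAGED)
print("damaged:", root, error)

# The same document with the NUL bytes removed parses normally.
clean_root, clean_error = strict_parse(DAMAGED.replace(b"\x00", b""))
print("cleaned:", [p.text for p in clean_root.findall("body/p")])
```

The point is only that a strict parser reports a single error for the whole file, so nothing in <body> is returned even though the damage is confined to <head>.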
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser. It is not
adequate just to allow XML::LibXML to display "parser error" and take no
further action. My knowledge of Perl is not sufficient to understand
how get_iplayer.pl interacts with XML::LibXML.
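
To show the kind of thing a calling script could do, here is a minimal sketch, again in Python and again only an illustration. It is not how get_iplayer.pl works, and the function name parse_with_fallback is hypothetical. The idea is to try a strict parse first and, only if that fails, strip the control characters that XML 1.0 does not permit and try once more.

```python
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_with_fallback(data):
    """Try a strict parse; on failure strip invalid bytes and retry once.

    Returns (root, removed_count). Raises ET.ParseError if there was
    nothing to strip or the cleaned document is still not well-formed.
    """
    try:
        return ET.fromstring(data), 0
    except ET.ParseError:
        cleaned, removed = _INVALID_XML_BYTES.subn(b"", data)
        if removed == 0:
            raise  # the failure had some other cause, so report it
        return ET.fromstring(cleaned), removed
```

A caller could use the returned count to warn the user that the subtitles were recovered from a damaged file, instead of stopping at "parser error".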
I said that similar errors in subtitles were rare and so not worth
bothering with. That was before I became aware of the v3.02 and v3.03
changes to cease use of XML::Simple and to require version 1.91 of
XML::LibXML. In the past any similar errors will have been masked by
XML::Simple.
Best wishes
Richard