parser error

Jeremy Nicoll - ml gip jn.ml.gti.91 at wingsandbeaks.org.uk
Fri Oct 27 05:59:08 PDT 2017


On 2017-10-27 11:33, RS wrote:
> On 26/10/2017 01:27, Jeremy Nicoll - ml gip wrote:
>> On 2017-10-26 00:51, RS wrote:
>> 
>>> The corruption he refers to is a few spurious NUL characters in
>>> <head><metadata>.  The subtitles themselves are in <body> and they
>>> are intact.
>> 
>> But you're a human looking at the file.  XML files have a tightly
>> defined syntax (defined by a formal grammar called a DTD).  When a
>> program tries to extract data from an XML file it does so using
>> standard code that knows what the structure of the file is because it
>> has also read the DTD.
>> 
>> Anyway for a program to be able to parse an XML file the parser reads
>> the file character by character and at every point it knows (from the
>> grammar definition) exactly what could come next and can classify it
>> as required.
>> 
>> By definition an XML file is only an XML file if it entirely matches
>> the grammar that is defined.  As soon as a parser finds a character
>> that makes no sense, the whole file is classed as corrupt, not an XML
>> file after all.
>> 
>> Much much more at: https://en.wikipedia.org/wiki/XML


> I don't agree with you about the approach to parsing.  The key
> exercise is to match pairs of tags and to associate what is between
> the matched pairs with keywords in the tags, but that is not relevant
> to this discussion.  The Wikipedia article you refer to says in 3.1
> "The code point U+0000 (Null) is the only character that is not
> permitted in any XML 1.0 or 1.1 document." so you are right to that
> extent.
> 
> That is not the end of the story.  The parser has to decide what to do
> when it finds an invalid character.

The point you seem to be missing is that for XML parsing, the parser
does not have to decide.  The XML /standard/ says (however inconvenient
it is) that any well-formedness error means the parse stops.

Read the Wikipedia page's section on

   "Well-formedness and error-handling"

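In concrete terms, here is a sketch of that behaviour.  (This uses
Python's xml.etree / Expat rather than the XML::LibXML that get_iplayer
actually calls, and the snippet is my own, but the fatal-error rule is
the same standard both parsers follow: one stray NUL and no tree is
returned at all.)

```python
import xml.etree.ElementTree as ET

good = "<head><metadata>ok</metadata></head>"
bad = "<head><metadata>ok\x00</metadata></head>"  # one spurious NUL

# A well-formed document parses and its data is reachable.
print(ET.fromstring(good).find("metadata").text)  # -> ok

# A single NUL is a fatal well-formedness error: the whole document
# is rejected, even though the surrounding text is perfectly readable.
try:
    ET.fromstring(bad)
except ET.ParseError as e:
    print("fatal:", e)
```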

What you're really arguing for is for g_ip's author NOT to use an XML
parser to parse possibly badly-formed XML pages.

Maybe some sort of regex-based text extraction could, in this specific
case, find the text fields in a well-formed or only slightly
badly-formed XML document.
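A sketch of what such an extraction might look like (the document and
the <p> tag layout are made up for illustration; this is not
get_iplayer's code).  A regex ignores XML rules entirely, so it
tolerates the NUL corruption -- and equally happily mis-reads anything
a real parser would reject for good reason:

```python
import re

# Hypothetical subtitle file: a stray NUL in <head>, intact <body>.
doc = ("<tt><head><metadata>\x00</metadata></head>"
       "<body><p>Hello</p><p>World</p></body></tt>")

# Naive extraction: grab the text inside each <p>...</p> pair,
# paying no attention to well-formedness at all.
texts = re.findall(r"<p[^>]*>(.*?)</p>", doc, re.DOTALL)
print(texts)  # -> ['Hello', 'World']
```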



> It is then up to the calling script (get_iplayer.pl) to decide what
> action to take in response to the action taken by the parser.  It is
> not adequate just to allow XML::LibXML to display "parser error" and
> take no further action.

Even though that's what the XML standard says IS the correct action?



-- 
Jeremy Nicoll - my opinions are my own


