parser error
Jeremy Nicoll - ml gip
jn.ml.gti.91 at wingsandbeaks.org.uk
Fri Oct 27 05:59:08 PDT 2017
On 2017-10-27 11:33, RS wrote:
> On 26/10/2017 01:27, Jeremy Nicoll - ml gip wrote:
>> On 2017-10-26 00:51, RS wrote:
>>
>>> The corruption he refers to is a few spurious NUL characters in
>>> <head><metadata>. The subtitles themselves are in <body> and they
>>> are intact.
>>
>> But you're a human looking at the file. XML files have a tightly
>> defined syntax (a formal grammar, which a DTD can constrain
>> further). When a program tries to extract data from an XML file it
>> does so using standard code that knows what the structure of the
>> file is because it has also read the DTD.
>>
>> Anyway, for a program to be able to parse an XML file, the parser
>> reads the file character by character and at every point it knows
>> (from the grammar definition) exactly what could come next and can
>> classify it as required.
>>
>> By definition an XML file is only an XML file if it entirely matches
>> the grammar that is defined. As soon as a parser finds a character
>> that makes no sense, the whole file is classed as corrupt, not an XML
>> file after all.
>>
>> Much much more at: https://en.wikipedia.org/wiki/XML
> I don't agree with you about the approach to parsing. The key
> exercise is to match pairs of tags and to associate what is between
> the matched pairs with keywords in the tags, but that is not relevant
> to this discussion. The Wikipedia article you refer to says in 3.1
> "The code point U+0000 (Null) is the only character that is not
> permitted in any XML 1.0 or 1.1 document." so you are right to that
> extent.
>
> That is not the end of the story. The parser has to decide what to do
> when it finds an invalid character.
The point you seem to be missing is that for XML parsing, the parser
does not have to decide. The XML /standard/ says (however inconvenient
that is) that any well-formedness error means the parse stops.
Read the Wikipedia page's section on "Well-formedness and
error-handling".
What you're really arguing for is for g_ip's author NOT to use an XML
parser to parse possibly malformed XML pages.
Maybe some sort of regex-based text extraction could, in this specific
case, find the text fields in a well-formed, or only slightly
malformed, XML document.
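Something along these lines, say (a rough sketch only - I'm guessing
that the subtitle text sits in <p>...</p> elements):

  use strict;
  use warnings;

  # Slurp the raw file, NULs and all.
  my $raw = do { local $/; <STDIN> };

  # Pull the text out of each <p>...</p> without ever invoking an
  # XML parser, so stray NULs elsewhere in the file don't matter.
  while ($raw =~ m{<p\b[^>]*>(.*?)</p>}gs) {
      my $text = $1;
      $text =~ s/<[^>]+>//g;    # drop any nested markup
      print "$text\n";
  }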
> It is then up to the calling script (get_iplayer.pl) to decide what
> action to take in response to the action taken by the parser. It is
> not adequate just to allow XML::LibXML to display "parser error" and
> take no further action.
Even though that's what the XML standard says IS the correct action?
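That said, if get_iplayer.pl's author did want to catch the failure
and retry, XML::LibXML exposes libxml2's (non-standard) recovery mode.
A hedged sketch, not a claim about what the script actually does or
should do:

  use strict;
  use warnings;
  use XML::LibXML;

  my $xml = "<tt><head>\x00</head><body><p>subtitle text</p></body></tt>";

  # First attempt: a strict, standard-conforming parse.
  my $dom = eval { XML::LibXML->load_xml(string => $xml) };

  if (!$dom) {
      warn "strict parse failed, retrying leniently: $@";
      # Strip the known corruption (NUL bytes) and ask libxml2 to
      # keep going past anything else (recover => 2 also suppresses
      # the "parser error" messages on STDERR).
      (my $cleaned = $xml) =~ s/\x00//g;
      $dom = eval { XML::LibXML->load_xml(string => $cleaned, recover => 2) };
  }

  print $dom ? "got a DOM\n" : "unrecoverable: $@";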
--
Jeremy Nicoll - my opinions are my own