parser error

Vangelis forthnet northmedia1 at the.forthnet.gr
Tue Oct 24 22:14:46 PDT 2017


 On Tue Oct 24 20:35:11 BST 2017, RS wrote:

> The resultant .mp4 file can be played in VLC,
> but MediaInfo shows no metadata.

 Hello Richard :-)
If you ended up, for whatever reason,
with an untagged file, you can always (re-)tag
post download with the --tag-only switch:

get_iplayer --type=video --pid=b00gmlrx --tag-only --tag-podcast-tv --tag-only-filename="path\to\Suspicion.mp4"

(I assume you renamed the "Suspicion.partial.mp4" to just "Suspicion.mp4")

> The programme is the editorial version of
> the 1941 Hitchcock film Suspicion, b00gmlrx

pid=b00gmlrx => vpid=b09c79wx (needed later...)

> Does anyone have any idea what causes a parser error?

 Answered by Colin; some further analysis below...

 On Tue Oct 24 21:41:54 BST 2017, RS wrote:

> I'm glad I asked because I hadn't realised
> that was where subtitles came from.
> I had assumed there was a ready-made .srt
> file to download.

 On-line media portals (like iPlayer) rarely use the .srt
(subrip text) format, because it's usually incompatible
with their embedded player (Flash based/HTML5 one);
I'm certainly not an expert on this subject, but Flash
based players usually require an XML caption file
(referred to also as DFXP), while HTML5 ones
may use the WebVTT (.vtt) format.

 DFXP is a timed-text format that was developed by W3C
(it stands for "Distribution Format Exchange Profile"); it is
currently referred to as TTML, read more at:
https://en.wikipedia.org/wiki/Timed_Text_Markup_Language

 GiP will use mediaselector URLs (which contain the vpid string)
to retrieve the URIs pointing to the iPlayer ttml files;
PC/iptv-all/apple-ipad-hls mediasets are tried. The URI
you included in your original post will be found, e.g., in

http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/iptv-all/vpid/b09c79wx
(geo-filtered)
in the <media expires="2017-11-21T14:05:00Z" kind="captions"
XML element; this URI is a legacy format, not geo-blocked,
supplier="sis", never expires...
You'll also notice two other URIs for the same subtitles file, these
are the Video Factory flavours; they are served from Akamai/Limelight,
are UK-only and tokenised, with limited lifespans;
but ALL 3 URIs point to the same file!
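
 To make the lookup above a bit more concrete, here is a minimal Python
sketch (not GiP's own perl code) that builds the mediaselector URL from a
vpid and lists the captions connections in a response. The SAMPLE document
is hypothetical: its element and attribute names follow the excerpt quoted
above as far as I can tell, while the hrefs and the exact nesting are my
assumptions, so treat it as an illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down mediaselector response. The hrefs are
# placeholders and the layout is an assumption, not a captured reply.
SAMPLE = """<mediaSelection xmlns="http://bbc.co.uk/2008/mp/mediaselection">
  <media kind="video" expires="2017-11-21T14:05:00Z">
    <connection href="http://example.invalid/video.m3u8" supplier="mf_akamai_uk_hls"/>
  </media>
  <media kind="captions" expires="2017-11-21T14:05:00Z">
    <connection href="http://example.invalid/a/subs.xml?token=1" supplier="mf_akamai_uk_plain" priority="10"/>
    <connection href="http://example.invalid/b/subs.xml?token=2" supplier="mf_limelight_uk_plain" priority="20"/>
    <connection href="http://example.invalid/legacy/subs.xml" supplier="sis"/>
  </media>
</mediaSelection>"""


def mediaselector_url(vpid, mediaset="iptv-all"):
    """Build the mediaselector URL quoted in the post for a given vpid."""
    return ("http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/"
            f"mediaset/{mediaset}/vpid/{vpid}")


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def caption_connections(xml_text):
    """Return (supplier, href) pairs for every captions connection."""
    root = ET.fromstring(xml_text)
    found = []
    for media in root.iter():
        if local(media.tag) != "media" or media.get("kind") != "captions":
            continue
        for conn in media:
            if local(conn.tag) == "connection":
                found.append((conn.get("supplier"), conn.get("href")))
    return found


if __name__ == "__main__":
    print(mediaselector_url("b09c79wx"))
    for supplier, href in caption_connections(SAMPLE):
        print(supplier, href)
```

 Fetching the real URL needs a UK address (it is geo-filtered, as noted
above), which is why the sketch parses a local sample instead.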

 GiP fetches the XML subs file (which is referred to as "raw"
in GiP terminology) and then, through a dedicated perl subroutine
("ttml_to_srt", line 6588 of 3.05 script) converts it to .srt;
--subsraw flag will let you also keep the original file...
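
 For those curious what such a conversion involves, below is a rough
Python sketch of a TTML to .srt conversion. It is emphatically NOT the
"ttml_to_srt" routine from the GiP script (which I haven't reproduced here);
the sample file is made up, and I'm assuming timestamps in plain
HH:MM:SS.mmm clock format and cues stored in <p> elements. All styling is
dropped, so the result is monochrome.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal TTML document (assumption: clock-time timestamps).
TTML_SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.500">SUSPICION</p>
      <p begin="00:00:04.000" end="00:00:06.000">First line<br/>second line</p>
    </div>
  </body>
</tt>"""


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def srt_time(ttml_time):
    """'HH:MM:SS.mmm' -> 'HH:MM:SS,mmm', padding a missing fraction."""
    hms, _, frac = ttml_time.partition(".")
    return f"{hms},{(frac + '000')[:3]}"


def cue_text(elem):
    """Collect the text of a cue, turning <br/> into a line break."""
    text = elem.text or ""
    for child in elem:
        text += "\n" if local(child.tag) == "br" else cue_text(child)
        text += child.tail or ""
    return text


def ttml_to_srt(ttml_text):
    """Convert every <p> cue into a numbered .srt block."""
    root = ET.fromstring(ttml_text)  # raises ParseError on corrupt input
    cues = [e for e in root.iter() if local(e.tag) == "p"]
    blocks = []
    for index, p in enumerate(cues, start=1):
        timing = f"{srt_time(p.get('begin'))} --> {srt_time(p.get('end'))}"
        blocks.append(f"{index}\n{timing}\n{cue_text(p).strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    print(ttml_to_srt(TTML_SAMPLE))
```

 Note that a strict XML parser like this one gives up on the first
malformed byte, so a corrupt file yields no .srt at all.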

> I see from --info there are three subtitle modes.

 I used GiP 3.05 and the following command:

perl get_iplayer-305w.pl --type=tv --pid=b00gmlrx -i --streaminfo > Streams.txt 2>&1

and yes, there are 3 captions modes identified,
but, alas, there's clearly a bug in the
detection scheme somewhere; there's no sign of the
legacy format, plus there's duplication, as

subtitles3=subtitles1

==================================
stream:     subtitles1
bitrate:
expires:    2017-11-21T14:05:00Z
ext:        srt
priority:   20
size:       118212
streamer:   http
streamurl:  http://vod-sub-uk-live.bbcfmt.hs.llnwd.net/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml?s=1508878211&e=1508921411&h=c1d8bb45cd85f418d83103af0ef1979a
type:       (captions) http stream (CDN: mf_limelight_uk_plain/20)

stream:     subtitles2
bitrate:
expires:    2017-11-21T14:05:00Z
ext:        srt
priority:   10
size:       118212
streamer:   http
streamurl:  http://vod-sub-uk-live.akamaized.net/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml?__gda__=1508921411_8042e3b62cef7eb303c0b44d69225c99
type:       (captions) http stream (CDN: mf_akamai_uk_plain/10)

stream:     subtitles3
bitrate:
expires:    2017-11-21T14:05:00Z
ext:        srt
priority:   20
size:       118212
streamer:   http
streamurl:  http://vod-sub-uk-live.bbcfmt.hs.llnwd.net/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml?s=1508878211&e=1508921411&h=c1d8bb45cd85f418d83103af0ef1979a
type:       (captions) http stream (CDN: mf_limelight_uk_plain/20)
==================================

but all three point to the same file!
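
 A quick way to check this without eyeballing the listing is to compare
only the path part of each URL, ignoring the CDN host and the token in the
query string. The Python sketch below does that with the URLs from the
listing (wrapped lines rejoined) plus the legacy one; the helper names are
mine. An identical path (and identical "size:" value) strongly suggests,
but does not strictly prove, identical content.

```python
from urllib.parse import urlsplit

_PATH = ("/iplayer/subtitles/ng/modav/"
         "bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml")

# URLs as reported by --streaminfo, plus the legacy URI from the original post.
STREAM_URLS = {
    "subtitles1": "http://vod-sub-uk-live.bbcfmt.hs.llnwd.net" + _PATH
                  + "?s=1508878211&e=1508921411&h=c1d8bb45cd85f418d83103af0ef1979a",
    "subtitles2": "http://vod-sub-uk-live.akamaized.net" + _PATH
                  + "?__gda__=1508921411_8042e3b62cef7eb303c0b44d69225c99",
    "subtitles3": "http://vod-sub-uk-live.bbcfmt.hs.llnwd.net" + _PATH
                  + "?s=1508878211&e=1508921411&h=c1d8bb45cd85f418d83103af0ef1979a",
    "legacy": "http://www.bbc.co.uk" + _PATH,
}


def file_key(url):
    """Identify a file by URL path only (host and query token ignored)."""
    return urlsplit(url).path


def group_by_file(urls):
    """Map each distinct file path to the list of modes that point at it."""
    groups = {}
    for mode, url in urls.items():
        groups.setdefault(file_key(url), []).append(mode)
    return groups


if __name__ == "__main__":
    for path, modes in group_by_file(STREAM_URLS).items():
        print(path, "<-", ", ".join(modes))
```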

Now, if you load the legacy URL
http://www.bbc.co.uk/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml
in a Firefox tab, the browser will print:
> XML Parsing Error: not well-formed
> Location: (The URI)
> Line Number 7, Column 18:
>
>         SUSPICION

Right-click -> View Page Source
and you'll be able to view the file contents
and actually visualise the corruption:
https://i.imgur.com/lsh8188.jpg
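
 The same well-formedness check can be done outside the browser. Here's a
small Python sketch using the standard library's XML parser; the BROKEN
string is an invented example (I'm not reproducing the actual corruption
here), and the parser's line/column conventions may differ slightly from
the ones Firefox reports.

```python
import xml.etree.ElementTree as ET

# Invented examples: GOOD parses, BROKEN has an unterminated closing tag.
GOOD = "<tt><body><p>SUSPICION</p></body></tt>"
BROKEN = "<tt>\n  <body>\n    <p>SUSPICION</p\n  </body>\n</tt>"


def check_well_formed(xml_text):
    """Return None if the text parses, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        return line, column, str(err)
    return None


if __name__ == "__main__":
    for name, text in (("GOOD", GOOD), ("BROKEN", BROKEN)):
        print(name, check_well_formed(text))
```

 Run against a file saved with --subsraw, this would point at the first
offending spot, which is a reasonable place to start a manual repair.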

With the aid of Fx's Page Source and
a Text Editor, I managed to reconstitute
a proper TTML file, then used SubtitleEdit
to convert to (monochrome) .srt.
If you're in need of it, contact me off-list...

> If I try --subtitles-only --tvmode=subtitles2
> it tells me No media streams found.

 I don't think subtitle mode user selection is supported;
legacy GiP code assumed only one captions mode,
so this could be requested as a new feature; I see no
reason for it though: all modes point to the same file, and
speed differences between CDNs are negligible for such
small files (about 115 KB in this case)...

> get_iplayer ought when unable to download subtitles
> successfully to continue to call AtomicParsley to add metadata.

 While in this case it's not the actual downloading that failed,
but rather the conversion to .srt (e.g. you can fetch the raw
corrupt ttml with --subsraw), I too agree with that.

 After another series of tests, it's worth noting
that every GiP version from 3.00 onwards fails to
convert this corrupted subtitles file but, lo and behold,
v2.99 does so successfully:
=======================================
get_iplayer v2.99, Copyright (C) 2008-2010 Phil Lewis
  This program comes with ABSOLUTELY NO WARRANTY; for details use --warranty.
  This is free software, and you are welcome to redistribute it under certain
  conditions; use --conditions for details.

  NOTE: A UK TV licence is required to legally access BBC iPlayer TV content

INFO Trying to download PID using type tv
INFO: pid found in cache
Matches:
5276:   Suspicion - -, BBC Two, b00gmlrx
WARNING: Could not download programme metadata from http://www.bbc.co.uk/programmes/b00gmlrx.xml
INFO: Downloading Subtitles to 'D:\Vangelis\iPlayer Recordings/Suspicion_-__b00gmlrx_editorial.srt'
=======================================

 Actually, this is not a fluke; prior to 3.00,
GiP would produce monochrome .srt files, so,
without examining the code itself, I suspect
the older TTML parsing code was more forgiving...

> Further, if subtitles1 fails it ought to
> try subtitles2 and subtitles3

 Again, it isn't the actual download that failed,
but the conversion; since all 3 (2 by my tests)
modes point to the same file, conversion of the other
two should fail too; we still don't know at which
stage the corruption took place; I'm presuming
during file generation, not during upload to CDNs (?).

 Now, if the actual download failed, then I see
your point as a valid one... I won't pretend I fully
understand the actual perl code, but perl wizards
could enlighten us as to actual content of GiP
subroutines "subtitles_available" & "download_subtitles";
my hunch is GiP already does what you suggest,
as far as downloading is concerned...
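
 To spell out the distinction I'm drawing between the two kinds of
failure, here's a hypothetical Python sketch of the control flow. It is not
taken from GiP, and every name in it (DownloadError, ConversionError,
fetch_subtitles, record_programme) is invented for illustration: a download
failure moves on to the next URL, a conversion failure stops the retries
(same file everywhere), and tagging runs regardless.

```python
class DownloadError(Exception):
    """Raised by the (hypothetical) downloader when a URL can't be fetched."""


class ConversionError(Exception):
    """Raised by the (hypothetical) converter when the raw file is unusable."""


def fetch_subtitles(urls, download, convert):
    """Return converted subtitles, or None if nothing usable was obtained."""
    for url in urls:
        try:
            raw = download(url)
        except DownloadError:
            continue  # another CDN may still serve the file
        try:
            return convert(raw)
        except ConversionError:
            # Every URL serves the same file, so retrying would fail again.
            return None
    return None


def record_programme(urls, download, convert, tag):
    """Fetch subtitles if possible, then tag no matter what happened."""
    srt = fetch_subtitles(urls, download, convert)
    tag()  # metadata tagging is not skipped when subtitles fail
    return srt
```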

 Apologies for the length of this post and thanks
to those that stayed to read the end of it...

Kindest regards,
Vangelis. 



