UTF8 subs in get_iplayer

Sun Mar 10 16:31:45 EDT 2013

On 10/03/2013 12:35, Jon Davies wrote:
> On 9 March 2013 20:13, Mark Barker <mark.barker at shaw.ca> wrote:
>> Hi Jon,
>
> please write to the get-iplayer list, not to me directly.
>
>> I was just updating all things get_iplayer on my Win7 x64 setup today. One
>> thing I noticed is that your git repo commit:-
>> ... modifies the character encoding used in the .srt files generated by
>> get_iplayer - my Notepad++ identifies the encoding of these files as "ANSI
>> as UTF-8" (I believe this is Notepad++'s nomenclature for a UTF-8 encoding
>> without a BOM).
>>
>> This actually BREAKS the default behaviour when subsequently muxing these
>> .srt files into .mkv containers using mkvtoolnix (mkvmerge GUI)
>> i.e. if I mux the .mp4 and .srt (with its new "ANSI as UTF-8" .srt output
>> format) into an .mkv, I have to take an extra step to manually override the
>> character encoding type (charset) for the .srt file in the mkvmerge GUI (to
>> UTF-8), else the resultant .mkv will end up having corrupted subs for
>> some of the special characters, e.g. £
>>
>> I would actually prefer it if this 'new' format change was an option and not
>> the default behaviour of get_iplayer, since this has been working without
>> incident for me since forever previously! Failing that, a get_iplayer
>> cmdline fallback option to the non-UTF-8 original encoding would be much
>> appreciated.
>
> Seems a reasonable view.  Unless someone else gets there before me,
> I'll have a think about what makes a sensible option, think again
> about what constitutes sensible default behaviour, and propose a
> patch.  But I note that mkvmerge converts the subtitles to UTF-8 if
> they're not already in that encoding.

Nothing was "broken".  With no BOM, you have to tell mkvmerge that the 
input is UTF-8.  That will be true no matter where the content comes from.
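
To illustrate the corruption Mark describes, here is a minimal Python
sketch.  It assumes the Windows active code page is cp1252 (an
assumption, it varies by system) and that the application decodes the
BOM-less UTF-8 bytes using that code page.  The sample text is made up.

```python
# Assumption: the active Windows code page is cp1252 (Western European).
text = "Price: £5"

# These are the bytes a BOM-less UTF-8 .srt file would contain.
utf8_bytes = text.encode("utf-8")

# Misreading those bytes as cp1252 splits the two-byte "£" into "Â£".
misread = utf8_bytes.decode("cp1252")

# Reading them with the correct charset recovers the original text.
correct = utf8_bytes.decode("utf-8")

print(misread)
print(correct)
```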

This issue arises because mkvmerge converts subtitles to UTF-8 when 
merging a subtitles track, which I suspect affects relatively few 
users.  Many media players automatically detect the subtitle encoding 
at playback, so you rarely notice any problems with BBC subtitles if 
you leave them in separate files.
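
As a rough idea of what such auto-detection can look like, here is a
Python sketch.  It is only a heuristic under stated assumptions, not
how any particular player works: check for a UTF-8 BOM, then try a
strict UTF-8 decode, then fall back to a legacy code page.  The
function name and the cp1252 fallback are hypothetical choices.

```python
import codecs


def guess_subtitle_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess the encoding of raw subtitle bytes (heuristic, not foolproof)."""
    # A UTF-8 BOM is an explicit marker, so trust it first.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Text in a legacy 8-bit code page with accented characters is
        # very unlikely to decode cleanly as UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; a real player would use its own default.
        return fallback
```

Plain ASCII input is reported as "utf-8", which is harmless since
ASCII decodes identically either way.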

Of course, some applications still require that the subtitle encoding 
be explicitly specified. In fact, looking at the Debian bug that started 
all this, I have a sneaking suspicion it was down to pilot error in not 
specifying the subtitle encoding to mplayer.

Before Jon's change, the subtitles encoding would have matched the 
active code page on Windows, which mkvmerge uses to determine how to 
convert to UTF-8, so no extra mkvmerge options would have been 
necessary.  Of course, the opposite would have been true for Linux/OSX, 
where the system locale typically uses UTF-8 encoding.  Jon's change 
reversed the situation.  So it seems this all boils down to which set of 
mkvmerge users - Windows or Linux/OSX - may need to add the 
--sub-charset option to their command lines.
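
For anyone scripting the mux, here is a small Python sketch that
builds the mkvmerge argument list with --sub-charset.  The helper name
and file names are hypothetical, and it assumes the .srt file
contributes a single track with ID 0, which is the usual case for a
standalone subtitles file.

```python
def mkvmerge_args(video, subs, output, sub_charset=None):
    """Build an mkvmerge command line as a list of arguments."""
    args = ["mkvmerge", "-o", output, video]
    if sub_charset:
        # --sub-charset takes TID:charset and applies to the next
        # input file; track ID 0 is assumed for a standalone .srt.
        args += ["--sub-charset", "0:" + sub_charset]
    args.append(subs)
    return args


# Hypothetical file names, for illustration only.
print(" ".join(mkvmerge_args("prog.mp4", "prog.srt", "prog.mkv", "UTF-8")))
```

The list can be passed to subprocess.run() as-is if mkvmerge is on
the PATH.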

To me, it seems like a good idea to save subtitles in UTF-8 by default 
so that it will be explicit how to post-process them.  It also covers 
the possibility that subtitles may contain characters outside the Latin-1 
(ISO-8859-1) character set, though I would guess that's a rare occurrence 
for BBC TV.  A useful tweak might be to add a BOM to the subtitles file, 
even though it really shouldn't be necessary.  That would placate 
mkvmerge and other Windows applications, presumably without causing 
trouble elsewhere (touch wood).
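
Adding the BOM amounts to prepending three bytes.  A minimal Python
sketch, assuming the subtitles are already UTF-8 encoded (the helper
name is hypothetical and this is not get_iplayer's own code, which is
Perl):

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend a UTF-8 BOM (EF BB BF) unless one is already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data
```

When writing text rather than bytes, opening the file with
encoding="utf-8-sig" in Python has the same effect.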

Should UTF-8 be mandatory for subtitles?  I suppose it wouldn't hurt to 
have a "--no-subs-utf8" option to switch it off for those who really 
want to do so.  But that only seems useful if you post-process subtitles 
with an application that really, truly cannot read UTF-8 files, which is 
almost a contradiction in terms.
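
For what it's worth, here is a Python sketch of what such a switch
could do.  Everything here is hypothetical: the option is only
floated above, the function name is invented, and ISO-8859-1 is
picked purely for illustration since the legacy encoding would have
to be decided.  It also shows the cost: characters outside the legacy
set cannot be represented.

```python
def encode_subtitles(text: str, utf8: bool = True) -> bytes:
    """Encode subtitle text as UTF-8, or as a legacy 8-bit encoding."""
    if utf8:
        return text.encode("utf-8")
    # Assumed legacy target; unrepresentable characters become "?".
    return text.encode("iso-8859-1", errors="replace")
```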