UTF8 subs in get_iplayer
dinkypumpkin
dinkypumpkin at gmail.com
Sun Mar 10 16:31:45 EDT 2013
On 10/03/2013 12:35, Jon Davies wrote:
> On 9 March 2013 20:13, Mark Barker <mark.barker at shaw.ca> wrote:
>> Hi Jon,
>
> please write to the get-iplayer list, not to me directly.
>
>> I was just updating all things get_iplayer on my Win7 x64 setup today. One
>> thing I noticed is that your git repo commit:-
>> ... modifies the character encoding used in the .srt files generated by
>> get_iplayer - my Notepad++ identifies the encoding of these files as "ANSI
>> as UTF-8" (I believe this is Notepad++'s nomenclature for a UTF-8 encoding
>> without a BOM).
>>
>> This actually BREAKS the default behaviour when subsequently muxing these
>> .srt files into .mkv containers using mkvtoolnix (mkvmerge GUI)
>> i.e. if I mux the .mp4 and .srt (with its new "ANSI as UTF-8" .srt output
>> format) into an .mkv, I have to take an extra step to manually override the
>> character encoding type (charset) for the .srt file in the mkvmerge GUI (to
>> UTF-8), else the resultant .mkv will end up having corrupted subs for some of
>> the special characters, e.g. £
>>
>> I would actually prefer it if this 'new' format change was an option and not
>> the default behaviour of get_iplayer, since this has been working without
>> incident for me since forever previously! Failing that, a get_iplayer
>> cmdline fallback option to the non-UTF-8 original encoding would be much
>> appreciated.
>
> seems a reasonable view. I'll have a think about what makes a
> sensible option, and think again what constitutes sensible default
> behaviour, and propose a patch. But I note that mkvmerge converts the
> subtitles to UTF-8 if they're not already that. Unless someone else
> gets there before me.
Nothing was "broken". With no BOM, you have to tell mkvmerge that the
input is UTF-8. That will be true no matter where the content comes from.
This issue arises because mkvmerge converts to UTF-8 when merging a
subtitles track and, with no BOM, has to be told the source encoding. I'm
not sure that will affect very
many users. Many media players will automatically detect the subtitles
encoding at playback, so you rarely notice any problems with BBC
subtitles if you leave your subtitles in separate files.
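
To illustrate the corruption Mark describes, here is a minimal Python sketch of what happens when UTF-8 bytes are read as a Windows code page and then converted to UTF-8 again. It assumes the active code page is Windows-1252 (cp1252), which is typical for Western European Windows installs but is only an assumption here; the variable names are purely illustrative.

```python
# Sketch: a UTF-8 "£" misread as Windows-1252 (assumed active code page).
text = "£5"

utf8_bytes = text.encode("utf-8")        # "£" becomes the two bytes C2 A3

# Reading those bytes as cp1252 turns each byte into its own character.
misread = utf8_bytes.decode("cp1252")

# Reading them as UTF-8 (what --sub-charset UTF-8 requests) recovers the text.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

The misread string is what ends up in the muxed track once it has been re-encoded as UTF-8, which is why the pound sign shows up with a stray character in front of it.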
Of course, some applications still require that subtitles encoding be
explicitly specified. In fact, looking at the Debian bug that started
all this, I have a sneaking suspicion it was down to pilot error in not
specifying the subtitles encoding to mplayer.
Before Jon's change, the subtitles encoding would have matched the
active code page on Windows, which mkvmerge uses to determine how to
convert to UTF-8, so no extra mkvmerge options would have been
necessary. Of course, the opposite would have been true for Linux/OSX,
where the system locale typically uses UTF-8 encoding. Jon's change
reversed the situation. So it seems this all boils down to which set of
mkvmerge users - Windows or Linux/OSX - may need to add the
--sub-charset option to their command lines.
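
For anyone scripting this, a rough Python sketch of building such a command line follows. The helper name and file names are hypothetical; the only mkvmerge features assumed are the -o output option and --sub-charset in its TID:charset form, placed before the subtitle file it applies to (track ID 0 being the single track in an .srt file).

```python
import subprocess


def mkvmerge_args(video, subs, output, charset="UTF-8"):
    """Build an mkvmerge argument list that declares the subtitle charset.

    Hypothetical helper: file-specific options such as --sub-charset
    must come before the input file they apply to.
    """
    return [
        "mkvmerge",
        "-o", output,
        video,
        "--sub-charset", f"0:{charset}",
        subs,
    ]


def mux(video, subs, output, charset="UTF-8"):
    """Run mkvmerge (must be installed and on PATH) with the built arguments."""
    subprocess.run(mkvmerge_args(video, subs, output, charset), check=True)


# Example argument list for hypothetical file names:
print(mkvmerge_args("prog.mp4", "prog.srt", "prog.mkv"))
```

Under the reasoning above, Windows users would pass the charset for UTF-8 subtitles, while Linux/OSX users would have needed it for the old non-UTF-8 output.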
To me, it seems like a good idea to save subtitles in UTF-8 by default
so that it will be explicit how to post-process them. It also covers
the possibility that subtitles may contain characters outside the Latin-1
(ISO-8859-1) character set, though I would guess that's a rare occurrence
for BBC TV. A useful tweak might be to add a BOM to the subtitles file,
even though it really shouldn't be necessary. That would placate
mkvmerge and other Windows applications, presumably without causing
trouble elsewhere (touch wood).
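
As a sketch of the BOM tweak (in Python for illustration, not how get_iplayer itself would do it), the following prepends the UTF-8 BOM to subtitle data that is already UTF-8, leaving it alone if a BOM is present. The function names and the idea of patching an existing .srt file in place are assumptions made for the example.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Return data with a UTF-8 BOM prepended, unless it already has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def add_bom_to_file(path: str) -> None:
    """Rewrite a UTF-8 subtitles file in place with a leading BOM."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(add_utf8_bom(data))
```

The BOM is just the three bytes EF BB BF at the start of the file, so applying the function twice changes nothing further.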
Should UTF-8 be mandatory for subtitles? I suppose it wouldn't hurt to
have a "--no-subs-utf8" option to switch it off for those who really
want to do so. But that only seems useful if you post-process subtitles
with an application that really, truly cannot read UTF-8 files, which is
almost a contradiction in terms.