FAO BBC: Double-encoded UTF-8 in Programme's JSON.
Ralph Corderoy
ralph at inputplus.co.uk
Sun Mar 4 05:39:36 PST 2018
Hi,
I noticed get_iplayer showing
RothaÃ­ MÃ³ra an tSaoil: Series 1
and wondered if it was a bug, but the BBC's JSON has
$ curl -sS https://www.bbc.co.uk/programmes/b09w6dhm.json |
> grep -o '"Roth[^"]*"'
"Rotha\u00c3\u00ad M\u00c3\u00b3ra an tSaoil"
"Rotha\u00c3\u00ad M\u00c3\u00b3ra an tSaoil"
$
and get_iplayer is correctly showing U+00C3 and U+00AD after `Rotha'.
The problem is that the BBC have taken the UTF-8 encoding of the intended
rune, treated each of its bytes as a character, and encoded those again
as UTF-8.
$ iconv -f utf-8 -t ucs-2be <<<$'\xc3\xad \xc3\xb3' |
> od --endian=big -tx2
0000000 00ed 0020 00f3 000a
0000010
$
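The doubling can be reproduced by feeding the UTF-8 bytes to iconv as
though they were ISO-8859-1.  This is only a guess at what happens on
the BBC's side, and the variable names are made up for the sketch.

```shell
# UTF-8 bytes for U+00ED, written as escapes so the locale doesn't matter.
once=$'\xc3\xad'
# Assumption: the bytes are misread as ISO-8859-1 and converted to UTF-8 again.
twice_hex=$(printf '%s' "$once" |
    iconv -f ISO-8859-1 -t UTF-8 |
    od -An -tx1 | tr -d ' \n')
echo "$twice_hex"    # c383c2ad
```

Bytes c3 83 and c2 ad are the UTF-8 encodings of U+00C3 and U+00AD,
which matches the \u00c3\u00ad in the JSON.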
Thus the title is meant to be
$ printf 'Rotha\u00ed M\u00f3ra an tSaoil\n'
Rothaí Móra an tSaoil
$
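A client could probably undo the doubling by converting the text from
UTF-8 to ISO-8859-1, which turns each code point back into one byte.
A minimal sketch, assuming every doubled character lies in
U+0080..U+00FF; iconv would fail on anything outside that range, and
the variable names are hypothetical.

```shell
# Double-encoded form as in the JSON: U+00C3 U+00AD, space, U+00C3 U+00B3.
double=$'\xc3\x83\xc2\xad \xc3\x83\xc2\xb3'
# Each code point becomes a single byte, leaving the intended UTF-8.
fixed_hex=$(printf '%s' "$double" |
    iconv -f UTF-8 -t ISO-8859-1 |
    od -An -tx1 | tr -d ' \n')
echo "$fixed_hex"    # c3ad20c3b3
```

The result, c3 ad 20 c3 b3, is the single UTF-8 encoding of U+00ED,
a space, and U+00F3.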
Could a BBC lurker please see if they can stop it happening?  Thanks.
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy