FAO BBC: Double-encoded UTF-8 in Programme's JSON.

Ralph Corderoy ralph at inputplus.co.uk
Sun Mar 4 05:39:36 PST 2018


Hi,

I noticed get_iplayer showing

    Rothaí Móra an tSaoil: Series 1

and wondered if it was a bug, but the BBC's JSON has

    $ curl -sS https://www.bbc.co.uk/programmes/b09w6dhm.json |
    > grep -o '"Roth[^"]*"'
    "Rotha\u00c3\u00ad M\u00c3\u00b3ra an tSaoil"
    "Rotha\u00c3\u00ad M\u00c3\u00b3ra an tSaoil"
    $

and get_iplayer is correctly showing U+00C3 and U+00AD after `Rotha'.

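The problem is the BBC have taken the UTF-8 encoding of the intended
rune, treated each of its bytes as a rune in its own right, and encoded
those as UTF-8 again.  Decoding the bytes c3 ad and c3 b3 as UTF-8
gives the intended runes.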

    $ iconv -f utf-8 -t ucs-2be <<<$'\xc3\xad \xc3\xb3' |
    > od --endian=big -tx2
    0000000 00ed 0020 00f3 000a
    0000010
    $

Thus the title is meant to be

    $ printf 'Rotha\u00ed M\u00f3ra an tSaoil\n'
    Rothaí Móra an tSaoil
    $
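
For anyone wanting to see the mistake and its reversal side by side,
here is a minimal sketch.  It assumes exactly one extra round of
encoding happened and that the bytes were misread as ISO-8859-1; the
function names are hypothetical and not part of get_iplayer.

```shell
# Hypothetical helpers, for illustration only.

# Reproduce the mistake: read UTF-8 bytes as if each were an
# ISO-8859-1 character and encode every one of them as UTF-8 again.
double_encode() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Undo it: decode the UTF-8, then write each rune as one ISO-8859-1
# byte, which restores the original UTF-8 byte sequence.
undo_double_encode() {
    iconv -f UTF-8 -t ISO-8859-1
}

# \303\255 is the UTF-8 for U+00ED and \303\263 is the UTF-8 for U+00F3.
printf 'Rotha\303\255 M\303\263ra an tSaoil\n' |
double_encode | undo_double_encode
```

The round trip should print the title unchanged.  Text that was never
double-encoded may make undo_double_encode fail, since iconv rejects
runes above U+00FF when writing ISO-8859-1.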

Can a BBC lurker please see if they can stop it happening?  Thanks.

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy


