Proof-of-concept scraper for iPlayer web frontend TV data to JSON

Fri Oct 31 18:29:40 PDT 2014

On 01/11/2014 01:09, Rob Dixon wrote:
> I would use the BBC server to do the search for me, after which there is
> little work to be done. For instance, if I look for all Book at Bedtime
> episodes with this URL
>
>
> http://www.bbc.co.uk/radio/programmes/a-z/by/book%20at%20bedtime/player
>
> then I am taken to a page with a link to the series at
>
>      http://www.bbc.co.uk/programmes/b006qtlx/episodes/player?page=1
>
> through to `page=6`. That amounts to 52 programmes, which even on my
> meagre 13 megabit connection takes less than ten seconds to fetch, and the
> results could be cached for a practically instantaneous response to a
> similar request in the future. There is also the possibility of writing
> a batch solution that makes a query only every minute or so and could be
> run continuously or overnight.

That's a neat idea! (I'd also been concerned with trying to recreate the 
RSS feeds for programme categories, so I'd focused on pulling everything.)

The search isn't perfect (e.g. try searching for "BBC News"), but you
could use it to narrow the results and so reduce the amount of scraping
you need to do, then do more precise matching against title or synopsis
in get_iplayer.
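
To make the idea concrete, here is a rough sketch in Python (not get_iplayer code) of paging through a series listing, caching each page, and then matching titles locally. The URL pattern and the `b006qtlx` id come from the example quoted above; the function names, the injected `fetch` callable and the `delay` parameter are all hypothetical, and the HTML parsing step is left out entirely.

```python
import time
from typing import Callable, Dict, Iterable, List

# Assumed URL shape, copied from the example earlier in this thread.
EPISODES_URL = "http://www.bbc.co.uk/programmes/{pid}/episodes/player?page={page}"


def page_urls(pid: str, last_page: int) -> List[str]:
    """Build the episode-list URLs for pages 1..last_page of one series."""
    return [EPISODES_URL.format(pid=pid, page=n) for n in range(1, last_page + 1)]


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              cache: Dict[str, str],
              delay: float = 0.0) -> List[str]:
    """Return the body for each URL, fetching only on a cache miss."""
    bodies = []
    for url in urls:
        if url not in cache:
            cache[url] = fetch(url)  # the only place the server is contacted
            time.sleep(delay)        # pause between real requests
        bodies.append(cache[url])
    return bodies


def match_titles(titles: Iterable[str], query: str) -> List[str]:
    """Case-insensitive substring match, standing in for the local matching step."""
    q = query.lower()
    return [t for t in titles if q in t.lower()]
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library, and a larger `delay` would give the "one query every minute or so" batch behaviour suggested above.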

> I'm more than happy to write a proof of concept if you're interested. I
> have it half-written already just to get that timing information.
>
> The one thing that bothers me is the terms and conditions of the web
> site. I scanned through them quickly and couldn't find anything about
> robotic access, but it would be a first if there isn't anything there.
> If it's just a matter of obeying the /robots.txt then I'm more than
> happy to go ahead.
>

At a glance, robots.txt doesn't seem to disallow accessing the sections 
needed. In the terms of use, there is this though:

"(d) You agree to use BBC Online Services and access, download, view 
and/or listen to BBC Content as supplied to you by the BBC and you may 
not, and you may not assist anyone to, or attempt to, reverse engineer, 
decompile, disassemble, adapt, modify, copy, reproduce, lend, hire, 
rent, perform, sub-license, make available to the public, create 
derivative works from, broadcast, distribute, commercially exploit, 
transmit or otherwise use in any way BBC Online Services and/or BBC 
Content in whole or in part except to the extent permitted in these 
Terms of Use, any relevant Additional Terms and at law."

If I'm downloading pages automatically and automatically reading certain 
sections of the HTML, is that viewing it as supplied to me by the BBC?
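
On the robots.txt side at least, the check can be automated rather than done at a glance. A minimal sketch using Python's standard `urllib.robotparser` follows; the sample rules and the `get_iplayer-poc` user-agent string are made up for illustration and are not the BBC's actual robots.txt. It says nothing about the terms-of-use question.

```python
from urllib.robotparser import RobotFileParser
from typing import Iterable

# Hypothetical rules for illustration only, NOT the real BBC robots.txt.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]


def allowed(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.bbc.co.uk/programmes/b006qtlx/episodes/player?page=1"
    print(allowed(SAMPLE_ROBOTS, "get_iplayer-poc", url))
```

Against the live site one would instead call `set_url()` with the robots.txt address and then `read()` on the parser before asking `can_fetch()`.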