Proof-of-concept scraper for iPlayer web frontend TV data to JSON
Rob Dixon
rob.dixon at gmx.com
Fri Oct 31 18:09:09 PDT 2014
On 01/11/2014 00:27, dinkypumpkin wrote:
>
> I tried this same approach, but it foundered on radio programmes. There
> is just too much stuff there. It's soul-crushingly slow to scrape the
> iPlayer Radio site, at least for a desktop cache. It would be great to
> have everything available on iPlayer searchable off-site, but there is
> too much of it for get_iplayer's current local caching model. I'm going
> to have another go at some point.
There is no real need to download *all* of the schedule information;
after all, only a fraction of it will ever be of any use to an
individual user.
I would use the BBC server to do the search for me, after which there is
little work to be done. For instance, if I look for all Book at Bedtime
episodes with this URL
http://www.bbc.co.uk/radio/programmes/a-z/by/book%20at%20bedtime/player
then I am taken to a page with a link to the series at
http://www.bbc.co.uk/programmes/b006qtlx/episodes/player?page=1
through to `page=6`. That amounts to 52 programmes, which even on my
meagre 13 megabit connection takes less than ten seconds to fetch. The
results could be cached to give a practically instantaneous response to
a similar request in the future. There is also the possibility of
writing a batch solution that makes a query only every minute or so and
could be run continuously or overnight.
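
As a rough illustration of the above, here is a minimal Python sketch of
the URL building, the result cache and the "one query a minute" batch
mode. The two URL patterns are the ones quoted above; everything else
(the names `search_url`, `episode_page_urls` and `CachedFetcher`, and the
injectable `fetch`, `clock` and `sleep` parameters) is invented for
illustration and is not part of get_iplayer. It does no parsing of the
returned pages.

```python
import time
from urllib.parse import quote

BASE = "http://www.bbc.co.uk"


def search_url(title):
    """A-Z search URL for a radio programme title (spaces become %20)."""
    return f"{BASE}/radio/programmes/a-z/by/{quote(title)}/player"


def episode_page_urls(pid, pages):
    """Episode listing URLs for pages 1..pages of the series with this PID."""
    return [f"{BASE}/programmes/{pid}/episodes/player?page={n}"
            for n in range(1, pages + 1)]


class CachedFetcher:
    """Wrap a fetch(url) callable with an in-memory cache and a minimum
    delay between real requests. Hypothetical helper, not a get_iplayer API."""

    def __init__(self, fetch, min_interval=0.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last = None  # time of the most recent real request

    def get(self, url):
        # A repeated request is answered from the cache without any delay.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until min_interval has passed since the last request.
        if self._last is not None:
            remaining = self._min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        body = self._fetch(url)
        self._last = self._clock()
        self._cache[url] = body
        return body
```

For real use, `fetch` could be a small wrapper around
`urllib.request.urlopen`, with `min_interval=60.0` for the overnight
batch case and a cache kept on disk rather than in memory.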
I'm more than happy to write a proof of concept if you're interested. I
have it half-written already just to get that timing information.
The one thing that bothers me is the terms and conditions of the web
site. I scanned through them quickly and couldn't find anything about
robotic access, but it would be a first if there isn't anything there.
If it's just a matter of obeying the /robots.txt then I'm more than
happy to go ahead.
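
If obeying robots.txt does turn out to be the requirement, the check
itself is cheap. A minimal sketch using Python's standard
`urllib.robotparser` follows. Note that `SAMPLE_RULES` is a made-up rule
set for illustration only and is not the BBC's actual robots.txt, and
`allowed_urls` and the agent name used with it are likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules, NOT the real contents of the BBC's robots.txt.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]


def allowed_urls(robots_lines, user_agent, urls):
    """Return only those URLs that the given robots.txt lines permit
    for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

In practice the lines would come from fetching the site's /robots.txt
once per run, and every URL would be filtered through this before being
requested.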
Let me know how I can help.
Rob