Proof-of-concept scraper for iPlayer web frontend TV data to JSON
Steven Maude
get_iplayer at stevenmaude.co.uk
Fri Oct 31 19:01:22 PDT 2014
On 01/11/2014 00:27, dinkypumpkin wrote:
> Thanks for that. From underneath 10,000 lines of Perl I gaze longingly
> at that lovely strictly-indented, sigil-less Python.
Python is really easy to use, though not everyone loves the indentation!
> You wouldn't necessarily need to drill all the way down to individual
> episodes, though you would need to drill down one more level for series
> with multiple episodes available. I think the only get_iplayer cache
> field that would require individual episode pages is the guidance
> warning, and that could be sacrificed.
To cut down on that, one way might be just to keep track of expired
programmes. If you build up a feed cache daily and expire programmes at
the correct time, you wouldn't need to delve into the series pages as
you'd "collect" new episodes as they appear.
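A minimal sketch of that idea, assuming each scraped programme carries an ID and an expiry time. The update_cache function and the 'expires' field are hypothetical names for illustration, not anything taken from get_iplayer or the scraper:

```python
from datetime import datetime, timedelta


def update_cache(cache, scraped, now):
    """Merge today's scrape into the cache and drop expired programmes.

    cache and scraped map a programme ID to a dict holding an 'expires'
    datetime (both are hypothetical structures for this sketch).
    """
    # Keep only the cached entries that have not expired yet.
    fresh = {pid: prog for pid, prog in cache.items() if prog["expires"] > now}
    # Add programmes seen today that are not already in the cache.
    for pid, prog in scraped.items():
        if prog["expires"] > now:
            fresh.setdefault(pid, prog)
    return fresh


now = datetime(2014, 11, 1)
cache = {
    "old": {"expires": now - timedelta(days=1)},
    "kept": {"expires": now + timedelta(days=3)},
}
scraped = {"new": {"expires": now + timedelta(days=7)}}
cache = update_cache(cache, scraped, now)
print(sorted(cache))  # ['kept', 'new']
```

Run daily, the cache only ever gains the programmes that appeared since the last run, so the top-level listing would be enough to pick up new episodes.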
> I tried this same approach, but it foundered on radio programmes. There
> is just too much stuff there. It's soul-crushingly slow to scrape the
> iPlayer Radio site, at least for a desktop cache. It would be great to
> have everything available on iPlayer searchable off-site, but there is
> too much of it for get_iplayer's current local caching model. I'm going
> to have another go at some point.
I'm not too familiar with the radio pages. For some reason, clicking 0-9
gives me what might be a mixture of everything: 695 pages of search
results and 66719 programmes, most or all of which don't actually seem
to start with 0-9. Radio buffs might have a better idea of whether
that's accurate. I was looking at:
http://www.bbc.co.uk/radio/programmes/a-z/by/%40/current
If ~60000 programmes is accurate, that's a lengthy-ish scrape - roughly
2 hours if you use some polite request rate limiting (1 request every 2
seconds), but certainly doable daily.
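As a rough sanity check on estimates like this, the total depends mostly on how many requests the scrape needs, which is an assumption here. At one request every 2 seconds, the 695 listing pages alone would take about 23 minutes, and about 3,600 requests fit into 2 hours. The sketch below shows the arithmetic and a simple rate-limited loop; estimate_seconds and polite_fetch are hypothetical helpers, and fetch stands in for whatever function actually downloads a page:

```python
import time


def estimate_seconds(n_requests, delay=2.0):
    """Rough scrape time: the enforced waiting only, ignoring download time."""
    return n_requests * delay


def polite_fetch(urls, fetch, delay=2.0, sleep=time.sleep):
    """Yield fetch(url) for each URL, pausing `delay` seconds between requests."""
    for i, url in enumerate(urls):
        if i:
            # No need to wait before the very first request.
            sleep(delay)
        yield fetch(url)


print(estimate_seconds(695) / 60)     # minutes for 695 listing pages
print(estimate_seconds(3600) / 3600)  # hours for 3,600 requests
```

Passing sleep in as a parameter is only there so the loop can be tried out without actually waiting.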
Another way to cut down scraping would be to scrape each radio category
in turn, iterating through the "latest" results. This means you'd have
one large scrape the first time when you collect 7 days' worth of data,
then smaller incremental ones each day to catch up on yesterday's
results, stopping once you've seen programmes before.
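The stopping rule could be sketched like this, assuming the "latest" listing really is ordered newest first and each entry has a stable ID. The collect_new function and the 'pid' key are hypothetical names for illustration:

```python
def collect_new(latest, seen):
    """Return programmes from a newest-first listing until a known ID appears.

    latest is any iterable of dicts with a 'pid' key (a hypothetical field);
    seen is the set of IDs already held in the cache.
    """
    new = []
    for prog in latest:
        if prog["pid"] in seen:
            # Everything after this point was collected on an earlier run.
            break
        new.append(prog)
    return new


def listing():
    # Stand-in for a generator that would fetch result pages lazily.
    for pid in ["e", "d", "c", "b", "a"]:
        yield {"pid": pid}


new = collect_new(listing(), seen={"a", "b", "c"})
print([prog["pid"] for prog in new])  # ['e', 'd']
```

If latest is a generator that fetches one results page at a time, breaking out of the loop also means the older pages are never requested.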
Finally, Rob's suggestion in this thread of using BBC search is a great
one. It means you don't need to scrape the whole thing, though you'd
have a short, acceptable wait for each search to run; that might be the
way to go, unless there are compelling reasons to retrieve a complete feed.
>> If there's interest, I'm happy to work on wrangling out get_iplayer
>> compatible feed data. (A guide to the structure of the iPlayer feeds
>> would be handy.)
>
> Well, there is more than one feed format. It would probably be better
> just to dump out files in get_iplayer's cache format, though we could
> certainly work out an intermediate format if you were going to implement
> actual data feeds. Contact me off-list if you have some specific
> questions in that area.
A pointer to a spec for that in the code or documentation would be
helpful. Did have a quick browse around the get_iplayer repo, but yes,
10,000 lines of Perl is quite intimidating when I've never looked at
Perl before!
Won't have much time for this in the next month as I'm on holiday from
the end of the coming week until early December, but if there's no
definite solution by then (which seems unlikely, given some of the ideas
being bounced around :) I'm happy to help look into this more.