Proof-of-concept scraper for iPlayer web frontend TV data to JSON

dinkypumpkin dinkypumpkin at gmail.com
Fri Oct 31 17:27:53 PDT 2014


On 31/10/2014 00:08, Steven Maude wrote:
> https://github.com/StevenMaude/nitroradical

Thanks for that.  From underneath 10,000 lines of Perl I gaze longingly 
at that lovely strictly-indented, sigil-less Python.

> It would take some time to populate the programme data. Scraping the
> index pages for TV actually doesn't take that long, but in some cases
> you'd have to pull out individual programme pages to get all the episode
> info for them. As is, my script just gets the most recent episode.

You wouldn't necessarily need to drill all the way down to individual 
episodes, though you would need to drill down one more level for series 
with multiple episodes available.  I think the only get_iplayer cache 
field that would require individual episode pages is the guidance 
warning, and that could be sacrificed.

I tried this same approach, but it foundered on radio programmes.  There 
is just too much stuff there.  It's soul-crushingly slow to scrape the 
iPlayer Radio site, at least for a desktop cache.  It would be great to 
have everything available on iPlayer searchable off-site, but there is 
too much of it for get_iplayer's current local caching model.  I'm going 
to have another go at some point.

> However, dinkypumpkin mentioned that centralising a feed wasn't a
> preferred option. That said, there's nothing to stop having a
> user-specified option to point get_iplayer to a specific feed URL. If
> someone hosts a feed, then decides to take it down, someone else could
> take over.

I'm not going to look a gift horse in the mouth.  I was only saying 
Nitro wasn't the answer from my POV.  If someone hosts a programme data 
service, I'll integrate it.

The truth is that all available options to replace the BBC feeds have 
problems, and they might be for the chop soon as well.  Then the only 
desktop options would involve scraping of site content or search results 
- both of which would require big changes to get_iplayer, so a 
centralised feed of some kind would look pretty good in that situation.

> 3. It would be possible to use the output of this scraper client-side to
> search for programmes of interest, and then call get_iplayer with the
> appropriate pid to download the programme if any are found. More work
> would be needed for this, and it would be hacky, but could work too.
> This wouldn't need get_iplayer to be modified; it would just use the
> existing pid download feature.

If it's going to be TV-only and implemented in Python, I would say 
that's the way to start.  I can more or less replace the old TV cache 
for now, but if - as I would guess is likely - the data sources 
disappear some dark night, people could jump over to this kind of 
application right away.
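
A minimal sketch of that client-side approach, for illustration only. It
assumes the scraper output is a JSON list of objects with "title" and "pid"
keys; that layout is a guess and not taken from the nitroradical code. The
helper names are hypothetical. The only get_iplayer feature relied on is the
existing --pid download option.

```python
import json
import re
import subprocess


def find_matches(programmes, pattern):
    """Return entries whose title matches the regex, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [p for p in programmes if regex.search(p.get("title", ""))]


def build_command(pid):
    """Build the get_iplayer command line for downloading a single pid."""
    return ["get_iplayer", "--pid", pid]


def download_matches(json_path, pattern):
    """Load scraper output (assumed layout) and download every match."""
    with open(json_path) as handle:
        programmes = json.load(handle)
    for programme in find_matches(programmes, pattern):
        # Hands each pid to an unmodified get_iplayer install.
        subprocess.call(build_command(programme["pid"]))
```

Calling download_matches("tv.json", "pattern") would then search the
scraped data and fetch each hit; the file name is a placeholder.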

> If there's interest, I'm happy to work on wrangling out get_iplayer
> compatible feed data. (A guide to the structure of the iPlayer feeds
> would be handy.)

Well, there is more than one feed format.  It would probably be better 
just to dump out files in get_iplayer's cache format, though we could 
certainly work out an intermediate format if you were going to implement 
actual data feeds.  Contact me off-list if you have some specific 
questions in that area.
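
As a rough sketch of the "dump out cache files" idea, the following assumes
a pipe-delimited text layout with a "#"-prefixed header line naming the
fields. The field list shown is illustrative only and should be replaced
with the header from an existing get_iplayer cache file; the function name
and input keys are hypothetical.

```python
# Illustrative field list only; copy the real one from an existing cache file.
FIELDS = ["index", "type", "name", "pid", "episode", "desc"]


def to_cache_lines(programmes, fields=FIELDS):
    """Render programme dicts as pipe-delimited lines under a header line."""
    lines = ["#" + "|".join(fields)]
    for index, programme in enumerate(programmes, start=1):
        row = dict(programme, index=index)
        # Strip the delimiter and newlines so each record stays on one line.
        values = [
            str(row.get(field, "")).replace("|", " ").replace("\n", " ")
            for field in fields
        ]
        lines.append("|".join(values))
    return lines
```

Missing fields are written as empty strings, so a scraper that cannot
supply a value (the guidance warning, say) still produces a complete row.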


