Proof-of-concept scraper for iPlayer web frontend TV data to JSON
dinkypumpkin
dinkypumpkin at gmail.com
Fri Oct 31 17:27:53 PDT 2014
On 31/10/2014 00:08, Steven Maude wrote:
> https://github.com/StevenMaude/nitroradical
Thanks for that. From underneath 10,000 lines of Perl I gaze longingly
at that lovely strictly-indented, sigil-less Python.
> It would take some time to populate the programme data. Scraping the
> index pages for TV actually doesn't take that long, but in some cases
> you'd have to pull out individual programme pages to get all the episode
> info for them. As is, my script just gets the most recent episode.
You wouldn't necessarily need to drill all the way down to individual
episodes, though you would need to drill down one more level for series
with multiple episodes available. I think the only get_iplayer cache
field that would require individual episode pages is the guidance
warning, and that could be sacrificed.
I tried this same approach, but it foundered on radio programmes. There
is just too much stuff there. It's soul-crushingly slow to scrape the
iPlayer Radio site, at least for a desktop cache. It would be great to
have everything available on iPlayer searchable off-site, but there is
too much of it for get_iplayer's current local caching model. I'm going
to have another go at some point.
> However, dinkypumpkin mentioned that centralising a feed wasn't a
> preferred option. That said, there's nothing to stop having a
> user-specified option to point get_iplayer to a specific feed URL. If
> someone hosts a feed, then decides to take it down, someone else could
> take over.
I'm not going to look a gift horse in the mouth. I was only saying
Nitro wasn't the answer from my POV. If someone hosts a programme data
service, I'll integrate it.
The truth is that all available options to replace the BBC feeds have
problems, and they might be for the chop soon as well. Then the only
desktop options would involve scraping of site content or search results
- both of which would require big changes to get_iplayer, so a
centralised feed of some kind would look pretty good in that situation.
> 3. It would be possible to use the output of this scraper client-side to
> search for programmes of interest, and then call get_iplayer with the
> appropriate pid to download the programme if any are found. More work
> would be needed for this, and it would be hacky, but could work too.
> This wouldn't need get_iplayer to be modified; it would just use the
> existing pid download feature.
If it's going to be TV-only and implemented in Python, I would say
that's the way to start. I can more or less replace the old TV cache
for now, but if - as I would guess is likely - the data sources
disappear some dark night, people could jump over to this kind of
application right away.
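
For what it's worth, here is a rough sketch of how that client-side approach might look. It assumes the scraper writes a JSON list of objects with "title" and "pid" keys, which is a guess on my part and not necessarily what the script actually emits; the sample data and pids are made up. The only get_iplayer feature it leans on is the existing --pid download option.

```python
import json
import shlex

# Hypothetical scraper output: the "title" and "pid" keys are assumptions,
# and the pids below are made-up placeholders.
SAMPLE_JSON = """
[
  {"title": "Example Programme", "pid": "b0000001"},
  {"title": "Another Show", "pid": "b0000002"}
]
"""


def find_programmes(programmes, term):
    """Return programmes whose title contains term, ignoring case."""
    needle = term.casefold()
    return [p for p in programmes if needle in p.get("title", "").casefold()]


def build_command(pid, executable="get_iplayer"):
    """Build the argument list for a download by pid."""
    return [executable, "--pid", pid]


if __name__ == "__main__":
    programmes = json.loads(SAMPLE_JSON)
    for prog in find_programmes(programmes, "example"):
        cmd = build_command(prog["pid"])
        # Print instead of running, so the sketch has no side effects.
        # To download for real: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

Building the command as a list rather than a shell string avoids any quoting problems if it is later handed to subprocess.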
> If there's interest, I'm happy to work on wrangling out get_iplayer
> compatible feed data. (A guide to the structure of the iPlayer feeds
> would be handy.)
Well, there is more than one feed format. It would probably be better
just to dump out files in get_iplayer's cache format, though we could
certainly work out an intermediate format if you were going to implement
actual data feeds. Contact me off-list if you have some specific
questions in that area.
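
To give an idea of what dumping out cache-style files could involve, here is a minimal sketch. It assumes, purely for illustration, a pipe-delimited text layout with one programme per line and a "#"-prefixed header line naming the fields. The field list and order below are placeholders, not the real layout, and should be checked against an actual cache file before relying on any of this.

```python
import io

# Placeholder field order, for illustration only. The real field list and
# order must be taken from an actual get_iplayer cache file.
FIELDS = ["index", "type", "name", "pid", "episode", "guidance"]


def clean(value):
    """Flatten a value so it cannot break the one-record-per-line layout."""
    return str(value).replace("|", " ").replace("\n", " ").strip()


def write_cache(programmes, out, fields=FIELDS):
    """Write a header line, then one pipe-delimited line per programme."""
    out.write("#" + "|".join(fields) + "\n")
    for index, prog in enumerate(programmes, start=1):
        # Add a running index; fields missing from the scrape stay empty.
        row = dict(prog, index=index)
        out.write("|".join(clean(row.get(f, "")) for f in fields) + "\n")


if __name__ == "__main__":
    sample = [{"type": "tv", "name": "Example Programme",
               "pid": "b0000001", "episode": "Episode 1"}]
    buf = io.StringIO()
    write_cache(sample, buf)
    print(buf.getvalue(), end="")
```

Fields the scraper can't fill, such as the guidance warning, are simply written out empty.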