get_iplayer repair update #1

Fri Oct 31 17:45:12 PDT 2014

get_iplayer has been more or less repaired, but there are still some 
wounds.  I'm going to release what I have on Sunday.  I'm on the road 
next week, so I've run out of time to do more for the time being. 
Consider it a stopgap until progress can be made on other fronts. This 
is where things are:

1. I've disabled code related to the discontinued feeds, so you 
shouldn't get any more bogus values in your metadata tags.  You should 
also see thumbnails again in files < 7 days old downloaded via PID.

2. The new release will support entry of multiple PIDs.

3. I've more or less restored the 7 day cache for TV and radio.  There 
are still some holes in it:

a. It is not possible to search for audiodescribed versions of 
programmes.  I haven't been able to source that information.  If anyone 
has any clues on the subject, chime in - but not if your suggestion is 
to scrape the iPlayer site.  That isn't on the table just yet.

You can still download audiodescribed versions, but you'll have to look 
for them on the iPlayer site.  Signed versions should still be flagged 
in the get_iplayer cache, but some may be missing.  Again, check the 
iPlayer site if in doubt.

I've changed get_iplayer to always scrape the related episode page to 
look for audiodescribed/signed versions when requested, so hopefully 
more downloads will be successful.  I found a number of cases where the 
playlist data for recent programmes didn't contain identifiers for 
audiodescribed versions even though they existed on the iPlayer site.

b. It is not possible to search radio programmes by category. TV 
programmes still have category information. There is a source for radio 
category information, but it failed consistently for Radio 4 and Radio 4 
Extra, which is where the categories are most meaningful.  I know that 
is going to break some PVR searches, but the alternative is a support 
headache I can't absorb.

c. I can't vouch that every programme from the previous 7 days will show 
up in the cache. As always, you can use the PID for any programme not in 
the cache. By the same token, I can't vouch that every programme in the 
cache will be downloadable.  The new feeds contain noticeably more 
programmes, some due to the inclusion of web-only stuff. With the 
heavier load, cache refreshes are slower than with the discontinued 
feeds, ca. 90 seconds for me for a full TV and radio refresh.

4. The more-or-less restored cache depends on some old data feeds 
lingering at the BBC.  Recent events have taught us that they could 
disappear without warning, so I've implemented a fallback mechanism. 
There will be a new option that will switch the cache to refresh from 
the channel schedule pages instead of the old data feeds.  However, this 
fallback is also limited:

a. It is not possible to search for audiodescribed or signed versions of 
programmes.  That information isn't in the schedule pages.

b. It is not possible to search TV or radio programmes by category. 
Again, that information isn't in the schedule pages.

c. Cache refresh is slow, ca. 4+ minutes for a full TV and radio refresh 
for me.  The time could be cut by about 1/3 by removing regional TV 
channel variations, but that would cut out 50+ programmes, so I've left 
them in for the present.

d. It appears that fewer programmes from the previous 7 days get cached 
compared to the feeds.  Part of that is because the schedule pages don't 
show most web-only programmes.  Part of it may also be because I'm 
checking availability info in the schedule pages more strictly than 
whatever produces the data feeds.  Again, you can use the PID for 
anything not in the cache.

e. The only plus to using the schedule pages to populate the cache is 
that it becomes possible to expand your cache out to 30 days.  It seems 
to work OK, if you have 10-15 minutes to refresh your cache.  There will 
be an option for this.

f. I've given you enough rope to hang yourself, but don't put this 
fallback option into regular use unless it becomes necessary - 
seriously.  It's only there to avoid weeks like this one.  I won't be 
interested in hearing how slow it is or how it doesn't locate some 
particular programme.  And for Pete's sake *don't* use it with the Web 
PVR.  If you insist on playing around with it, you'll probably want to 
bump up --expiry to some gigantic number and refresh your cache manually 
as needed.
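
To show why a gigantic --expiry amounts to manual-only refreshing, here 
is a minimal sketch of an age-based expiry check.  It illustrates the 
general idea only and is not get_iplayer's actual code; the function name 
cache_is_stale and the numbers used are made up for the example.

```python
import time


def cache_is_stale(last_refresh, expiry_seconds, now=None):
    """Return True when the cache is older than the expiry window.

    Hypothetical helper: all times are in seconds since the epoch.
    """
    if now is None:
        now = time.time()
    return (now - last_refresh) > expiry_seconds


# Made-up example: a cache last refreshed 6 hours ago.
now = 1_000_000_000
last_refresh = now - 6 * 3600

# With a short expiry the cache counts as stale and would be refreshed
# automatically on the next run.
print(cache_is_stale(last_refresh, 4 * 3600, now))

# With a very large expiry (roughly 10 years) the automatic refresh never
# triggers, so the cache only changes when you refresh it by hand.
print(cache_is_stale(last_refresh, 10 * 365 * 86400, now))
```

With the expiry set that high, a slow schedule-page refresh only happens 
when you explicitly ask for one.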

5. Looking further ahead

Some things that have been floated here in the past few days:

a. Programme data services: If somebody implements something along these 
lines, I'm sure get_iplayer could be integrated with it.  It's clear 
that get_iplayer would never be able to access Nitro if and when it's 
ever opened up.  But, if somebody can repackage Nitro data for wider 
use, that would be pretty useful.

b. iPlayer site scraping: This could also be the foundation of a 
programme data service instead of Nitro.  It is also the only real hope 
for get_iplayer to regain a full-featured desktop cache, though I'm not 
sure it will be practical.  A full scrape is out of the question for 
local caching - there are just too many programmes on the radio side. 
However, even caching just the previous 7 days will be much, much slower 
than with the old data feeds.  The number of requests and the amount of 
data to move over the wire and parse would be vastly greater. Some sort 
of parallelisation might help. The trick will be to figure out the right 
way to filter the listings down to a practical volume.

I started down this road, but it was way too slow for radio and it was 
going to be too much work for the time available.  Plus, it didn't seem 
worth leaving get_iplayer crippled any longer than necessary.  To do 
this properly will likely mean adding some dependencies to get_iplayer 
as well as some major reworking.  I'm going to keep working in that 
direction just to see if it can be done, but I have no idea whether it 
will be of practical use.

Also see Steven Maude's recent post for his take on the problem.
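
For anyone curious what "filter first, then parallelise" might look like, 
here is a rough sketch using a thread pool.  It is not get_iplayer code 
and it makes no real requests: fetch_all, fake_fetch and the listing page 
names are all hypothetical, and a real scraper would need an actual HTTP 
client, error handling and polite rate limiting.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently and return {url: result}.

    `fetch` is any callable taking one URL; results keep input order.
    """
    urls = list(urls)  # allow generators; we need to iterate twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Stand-in for a real HTTP request (assumption for the example)
    return "<html>listing for %s</html>" % url


# Hypothetical listing pages: one per programme type per day
all_pages = ["listing/%s/%d" % (kind, day)
             for kind in ("tv", "radio")
             for day in range(7)]

# Filter the listings down before fetching anything
wanted = [p for p in all_pages if p.startswith("listing/radio/")]

pages = fetch_all(wanted, fake_fetch)
print(len(pages))
```

Threads suit this kind of job because the time goes on waiting for the 
network rather than on parsing, but the filtering step is what keeps the 
request count practical.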

c. External search/indexing applications:  To my mind, it seems like a 
good idea for some energetic person to split this out.  get_iplayer 
badly needs to lose weight, not gain it, and there is a pretty clear 
functional separation between searching and downloading.  get_iplayer 
needs a lot of work in handling metadata that could make it a better 
downloader, so it would be no bad thing to get out of the caching 
business.  I'll have my pony now, thanks.