get_iplayer repair update #1
dinkypumpkin
dinkypumpkin at gmail.com
Fri Oct 31 17:45:12 PDT 2014
get_iplayer has been more or less repaired, but there are still some
wounds. I'm going to release what I have on Sunday. I'm on the road
next week, so I've run out of time to do more for the time being.
Consider it a stopgap until progress can be made on other fronts. This
is where things are:
1. I've disabled code related to the discontinued feeds, so you
shouldn't get any more bogus values in your metadata tags. You should
also see thumbnails again in files less than 7 days old downloaded via PID.
2. The new release will support entry of multiple PIDs.
3. I've more or less restored the 7 day cache for TV and radio. There
are still some holes in it:
a. It is not possible to search for audiodescribed versions of
programmes. I haven't been able to source that information. If anyone
has any clues on the subject, chime in - but not if your suggestion is
to scrape the iPlayer site. That isn't on the table just yet.
You can still download audiodescribed versions, but you'll have to look
for them on the iPlayer site. Signed versions should still be flagged
in the get_iplayer cache, but some may be missing. Again, check the
iPlayer site if in doubt.
I've changed get_iplayer to always scrape the related episode page to
look for audiodescribed/signed versions when requested, so hopefully
more downloads will be successful. I found a number of cases where the
playlist data for recent programmes didn't contain identifiers for
audiodescribed versions even though they existed on the iPlayer site.
b. It is not possible to search radio programmes by category. TV
programmes still have category information. There is a source for radio
category information, but it consistently failed on Radio 4 and Radio 4
Extra, which is where the categories are most meaningful. I know that
is going to break some PVR searches, but the alternative is a support
headache I can't absorb.
c. I can't vouch that every programme from the previous 7 days will show
up in the cache. As always, you can use the PID for any programme not in
the cache. By the same token, I can't vouch that every programme in the
cache will be downloadable. The new feeds contain noticeably more
programmes, some due to the inclusion of web-only stuff. With the
heavier load, cache refreshes are noticeably slower than with the old
feeds, ca. 90 seconds for me for tv+radio.
4. The more-or-less restored cache depends on some old data feeds
lingering at the BBC. Recent events have taught us that they could
disappear without warning, so I've implemented a fallback mechanism.
There will be a new option that will switch the cache to refresh from
the channel schedule pages instead of the old data feeds. However, this
fallback is also limited:
a. It is not possible to search for audiodescribed or signed versions of
programmes. That information isn't in the schedule pages.
b. It is not possible to search TV or radio programmes by category.
Again, that information isn't in the schedule pages.
c. Cache refresh is slow, ca. 4+ minutes for a full TV and radio refresh
for me. The time could be cut by about 1/3 by removing regional TV
channel variations, but it cuts out 50+ programmes, so I've left them in
for the present.
d. It appears that fewer programmes from the previous 7 days get cached
compared to the feeds. Part of that is because the schedule pages don't
show most web-only programmes. Part of it may also be because I'm
checking availability info in the schedule pages more strictly than
whatever produces the data feeds. Again, you can use the PID for
anything not in the cache.
e. The only plus to using the schedule pages to populate the cache is
that it becomes possible to expand your cache out to 30 days. It seems
to work OK, if you have 10-15 minutes to refresh your cache. There will
be an option for this.
f. I've given you enough rope to hang yourself, but don't put this
fallback option into regular use unless it becomes necessary -
seriously. It's only there to avoid weeks like this one. I won't be
interested in hearing how slow it is or how it doesn't locate some
particular programme. And for Pete's sake *don't* use it with the Web
PVR. If you insist on playing around with it, you'll probably want to
bump up --expiry to some gigantic number and refresh your cache manually
as needed.
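
To illustrate why a very large --expiry effectively switches off automatic
refreshes, here is a minimal Python sketch of an age-based staleness check.
It is not get_iplayer's actual code (get_iplayer is written in Perl). The
function name, and the assumption that the expiry is a number of seconds
compared against the age of the cache, are for illustration only.

```python
import time


def cache_is_stale(cache_mtime, expiry_seconds, now=None):
    """Return True if the cache is older than the expiry window.

    cache_mtime and now are Unix timestamps; expiry_seconds plays the role
    of the --expiry value (assumed here to be in seconds).
    """
    if now is None:
        now = time.time()
    return (now - cache_mtime) > expiry_seconds


if __name__ == "__main__":
    now = 1_000_000
    five_hours_ago = now - 5 * 3600
    # With a 4-hour expiry (an arbitrary example value) the cache is stale.
    print(cache_is_stale(five_hours_ago, 4 * 3600, now=now))
    # With a gigantic expiry it never goes stale, so refreshes stay manual.
    print(cache_is_stale(five_hours_ago, 10**9, now=now))
```

With a check of this kind, a slow 10-15 minute refresh only happens when you
ask for it, not whenever the cache happens to have aged out.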
5. Looking further ahead
Some things that have been floated here in the past few days:
a. Programme data services: If somebody implements something along these
lines, I'm sure get_iplayer could be integrated with it. It's clear
that get_iplayer would never be able to access Nitro if and when it's
ever opened up. But, if somebody can repackage Nitro data for wider
use, that would be pretty useful.
b. iPlayer site scraping: This could also be the foundation of a
programme data service instead of Nitro. It is also the only real hope
for get_iplayer to regain a full-featured desktop cache, though I'm not
sure it will be practical. A full scrape is out of the question for
local caching - there are just too many programmes on the radio side.
However, even caching just the previous 7 days will be much, much slower
than with the old data feeds. The number of requests and the amount of
data to move over the wire and parse would be vastly greater. Some sort
of parallelisation might help. The trick will be to figure out the right
way to filter the listings down to a practical volume.
I started down this road, but it was way too slow for radio and it was
going to be too much work for the time available. Plus, it didn't seem
worth leaving get_iplayer crippled any longer than necessary. To do
this properly will likely mean adding some dependencies to get_iplayer
as well as some major reworking. I'm going to keep working in that
direction just to see if it can be done, but no idea if it will be of
practical use.
Also see Steven Maude's recent post for his take on the problem.
c. External search/indexing applications: To my mind, it seems like a
good idea for some energetic person to split this out. get_iplayer
badly needs to lose weight, not gain it, and there is a pretty clear
functional separation between searching and downloading. get_iplayer
needs a lot of work in handling metadata that could make it a better
downloader, so it would be no bad thing to get out of the caching
business. I'll have my pony now, thanks.
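
As a rough illustration of the parallelisation idea mentioned under iPlayer
site scraping, the Python sketch below fetches one listing per channel per
day for the previous 7 days using a thread pool. Everything in it is
hypothetical: fetch_listing is a stand-in that fabricates identifiers
instead of making a real request, and the channel names are placeholders.
It only shows the shape of the approach (limit the date window first, then
fan the requests out), not how get_iplayer does or would do it.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def recent_days(today, days=7):
    """Return the previous `days` dates, newest first (today excluded)."""
    return [today - timedelta(days=n) for n in range(1, days + 1)]


def fetch_listing(channel, day):
    """Hypothetical stand-in for one page request.

    A real version would download and parse a page; this one just returns
    two made-up identifiers so the sketch runs offline.
    """
    return [f"{channel}-{day.isoformat()}-{i}" for i in range(2)]


def build_cache(channels, today, days=7, workers=8):
    """Fetch every (channel, day) listing concurrently and flatten them."""
    jobs = [(c, d) for c in channels for d in recent_days(today, days)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves job order, so the cache order is deterministic.
        pages = pool.map(lambda job: fetch_listing(*job), jobs)
        return [item for page in pages for item in page]


if __name__ == "__main__":
    cache = build_cache(["channel_a", "channel_b"], date(2014, 10, 31))
    print(len(cache), "entries cached")
```

Threads suit this kind of work because the time goes on waiting for the
network rather than on computation, but the total request count is
unchanged, so filtering the listings down still matters more.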