How can I create a list of programmes from BBC Sounds

Jeremy Nicoll - ml gip jn.ml.gti.91 at wingsandbeaks.org.uk
Wed May 24 13:35:34 PDT 2023


On 2023-05-20 18:37, Budge wrote:

> Using the pid for the entire archive I can get a list of the entire
> archive but I have not found out how to sort the list by genre or
> obtain a list of a single genre.

It looks to me as if "genre" is an arbitrary tag that the BBC do not
expose on any of the html pages - it must exist in their database of
programmes but nowhere else.


> Can anybody please suggest how I might obtain the list.

Use curl or wget to request all the pages that one could fetch manually
for a specific genre, eg for science:

  https://www.bbc.co.uk/programmes/p01gyd7j?page=1
  https://www.bbc.co.uk/programmes/p01gyd7j?page=2
  https://www.bbc.co.uk/programmes/p01gyd7j?page=3
  ...

until you get the "page not found" page.  Then examine the html and
extract the episode name, pid (and maybe episode description) for each
one.
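
A minimal sketch of that fetch loop in Python, using only the standard
library.  It assumes that the "page not found" page arrives as an HTTP
404 response, which I have not verified; the function names, the delay
between requests and the max_pages cap are illustrative choices.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "https://www.bbc.co.uk/programmes/p01gyd7j"  # URL from this post


def fetch_page(url):
    """Return the page's html as text, or None if the server answers 404."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: "page not found" is an HTTP 404
            return None
        raise


def fetch_all_pages(base_url=BASE_URL, fetch=fetch_page, delay=1.0,
                    max_pages=500):
    """Fetch ?page=1, ?page=2, ... and stop at the first missing page."""
    pages = []
    for number in range(1, max_pages + 1):
        html = fetch(f"{base_url}?page={number}")
        if html is None:
            break
        pages.append(html)
        time.sleep(delay)  # be polite to the server between requests
    return pages
```

Passing the fetcher in as a parameter means the loop can be tried out
against saved copies of the pages without touching the network.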


It looks easy(ish) for anyone who can write a computer program; the
drawback of this approach is that you have to study the html quite
carefully to find how its internal structure replicates for each entry
on a page.

But broadly speaking it'll be a whole load of html for all the stuff
that's at the start of a page, then stuff for the start of that page's
list, then the entries (though that may also be more complex than needed
because of the way that they get laid out in rows), then the end of that
page's list, then the stuff at the end of every page.

How you then identify those areas on an html page depends greatly on the
programming language you use, and how complex you want your code to be.

There's also a problem that the (no doubt) machine-generated html on
these pages will quite likely change its layout quite often.

So ... if you need to run your indexer fairly often (to pick up new
entries or just to check that it still produces the same answers as last
time) you ideally need your scanning program to be able to test whether
its assumptions about the layout of the entries are still reasonable.
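
One way such a self-check might look in Python.  This is only a sketch:
the two checks (a <body> tag exists, and it holds between 1 and 24
programme links) and the pid pattern are assumptions drawn from the
examples in this post, not anything the BBC documents.

```python
import re

# Assumed shape of a programme link, based on the examples in this post.
PID_LINK = re.compile(
    r'href="https://www\.bbc\.co\.uk/programmes/([0-9a-z]{8,})"')


def check_assumptions(html, max_entries=24):
    """Return a list of problems found; an empty list means 'looks sane'."""
    problems = []
    _head, found, body = html.partition("<body")
    if not found:
        problems.append("no <body> tag found")
    count = len(PID_LINK.findall(body))
    if not 1 <= count <= max_entries:
        problems.append(f"unexpected number of programme links: {count}")
    return problems
```

Running this on every fetched page before extracting anything gives an
early warning when the layout has changed.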

Adding to the problem, perhaps, is that html doesn't need to be arranged
in a file in the sort of line-by-line layout that one might see if - in
a browser - one does a "view source" for a page.  Things that one day
seem to be on two consecutive lines might on another day be more or less
spread out.


In some cases when I've extracted stuff from html pages I've started off
by eg replacing long runs of repeated spaces by single spaces, and
removing completely some parts of the html because - for what I wanted -
they just muddied the water.  In some cases it made more sense to
introduce more line breaks so the file I then scanned had many more, but
shorter, lines than the html that I got back from the web server.
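
That sort of tidying could be sketched in Python as below.  Which parts
to throw away is a judgement call; dropping <script> blocks and putting
each tag on its own line are just illustrative choices, and a crude
regex approach like this will trip over a "<" inside an attribute value.

```python
import re


def tidy_html(html):
    """Drop <script> blocks, collapse whitespace, start a line at each tag."""
    # Remove parts that only muddy the water (here: inline scripts).
    text = re.sub(r"<script\b.*?</script>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Replace every run of spaces, tabs and newlines by a single space.
    text = re.sub(r"\s+", " ", text)
    # Introduce a line break before every tag: many more, shorter lines.
    text = re.sub(r" ?<", "\n<", text)
    return text.strip()
```

After this step the layout no longer depends on how the server happened
to wrap its lines on a given day.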


But for example, on page 1 of the Science lists, assuming that none of
the relevant links span a line-break, there are 25 occurrences of

  href="https://www.bbc.co.uk/programmes/

nearly all of them occurring at 45 line intervals.  The first one is in
the "<head>" section of the page, so that leaves 24 of them in the
<body> part, corresponding to the 24 programmes described.

The place where two consecutive occurrences are not 45 lines apart is
in the last row, where one episode is a repeat and doesn't have a "play"
button.  You'd need to see how its html differs from the structure of
the entries that do have such a button, and take that into account in
any code you write (AND look out for other unexpected differences).
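
The same interval check can be done without a special editor.  A small
Python sketch (the function names are mine) that lists the line numbers
holding the literal and the gaps between them:

```python
MARKER = 'href="https://www.bbc.co.uk/programmes/'


def marker_lines(html, marker=MARKER):
    """Return the 1-based numbers of the lines that contain the marker."""
    return [number
            for number, line in enumerate(html.splitlines(), start=1)
            if marker in line]


def gaps(line_numbers):
    """Return the distance between each pair of consecutive line numbers."""
    return [b - a for a, b in zip(line_numbers, line_numbers[1:])]
```

Any gap that differs from the usual one points at an entry whose html
needs a closer look.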


Those 24 literals all have a pid in them so really look like eg

        href="https://www.bbc.co.uk/programmes/m001l291"
        href="https://www.bbc.co.uk/programmes/m001jc68"
        href="https://www.bbc.co.uk/programmes/m001hnlf"
        ...
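
Pulling the pids out of those literals is a one-line regular expression
in Python.  A sketch, assuming each link is written exactly as in the
examples above (absolute URL, double quotes, nothing after the pid):

```python
import re

PID_HREF = re.compile(
    r'href="https://www\.bbc\.co\.uk/programmes/([^"/?#]+)"')


def extract_pids(html):
    """Return the pids in page order, with repeats removed."""
    return list(dict.fromkeys(PID_HREF.findall(html)))
```

Note that this also picks up the link in the <head> section, so that one
has to be skipped or filtered out separately.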

I'd /guess/ that one could eg take such a file, skip past its first 300
or so lines (the first meaningful pid line is around line 330 at the
moment), then repeatedly scan forwards looking for

  href="https://www.bbc.co.uk/programmes/

and read the pid that immediately follows that, then scan forwards for
what marks the start of the episode name (but not scan more than - say -
ten lines if - at the moment - you'd expect to find the episode name in
the next (say) 5 lines), then similarly scan for the episode
description's start.

Repeat until you fall off the end of that page's list.
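
A rough Python sketch of that scan.  The text that marks the start of an
episode name is NOT something I have looked up, so the caller has to
supply it (title_marker below is hypothetical); the skip and limit
values are the guesses given above.  The description could be located
with a second call to find_within in the same way.

```python
MARKER = 'href="https://www.bbc.co.uk/programmes/'


def find_within(lines, start, needle, limit):
    """Index of the first of lines[start:start+limit] holding needle,
    or None if it does not turn up within that window."""
    for i in range(start, min(start + limit, len(lines))):
        if needle in lines[i]:
            return i
    return None


def scan_entries(lines, title_marker, skip=300, limit=10):
    """Return a list of (pid, title_line) pairs found after line 'skip'.

    title_line is None when the title marker is not within 'limit' lines,
    which is a sign that the entry does not follow the usual structure.
    """
    entries = []
    i = skip
    while i < len(lines):
        pos = lines[i].find(MARKER)
        if pos == -1:
            i += 1
            continue
        # The pid is whatever follows the literal, up to the closing quote.
        pid = lines[i][pos + len(MARKER):].split('"', 1)[0]
        found = find_within(lines, i, title_marker, limit)
        title = lines[found].strip() if found is not None else None
        entries.append((pid, title))
        i = (found if found is not None else i) + 1
    return entries
```

Bounding the forward scan is what stops a malformed entry from silently
borrowing the title of the entry after it.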



If you can't program, eg in any version of BASIC, or python or perl or
... anything, maybe this would be a good time to learn how to?  Your
code would not need to be elegant or sophisticated ... just work.


For all I know, there may be utilities into which one could drag an
html page, and then manipulate it reasonably easily to extract the data
you want.  The trouble is, I don't know my way around tools that I don't
use.


I /do/ use a programmers' text editor; that's what showed me at a
glance that the instances of

  href="https://www.bbc.co.uk/programmes/

are at a specific repeating interval (though I expected that they
would be, more or less).  I'm sure that some other editors would show
the same thing but in different ways - many would probably let you
find successive such lines, but not simultaneously show how far apart
they all are.

Any sophisticated text editor has a steep learning curve and - if eg
you only use a very basic one - it's hard to know whether you'd benefit
from acquiring another one, and impossible to recommend one.  Typically,
when I've periodically looked at others to see what they offer, some
feel "right" but not versatile enough, and some feel "alien" in some
way and I never explore what they can do, and some are more versatile
than what I use now, but also far too complex.

Some - including the one I use - are scriptable.  Often that just means
that you can tell it to save a temporary copy of the file you're
editing, run an external program against it, then display the results
(eg in another tab) - which is better than nothing, but limiting.

Mine though has a programming language built into it and that can
examine the data that the editor is displaying (so it avoids the
overhead of making an external copy and later reading results), and
manipulate it.  But to do that, you need to know not just the
programming language, but also the internal form of commands that
otherwise one might just normally issue via menus in the editor's
interface, and how to ask the editor what those commands did.  I've used
this editor (& a similar-looking one with a different command-set) for
over 30 years.




Bear in mind that if you fetch the html for such a page you don't get
all the javascript, images, CSS, etc that make what a browser sees and
does so complex.  That can make this sort of thing really hard to do (if
how a page behaves depends on both CSS and Javascript) but in fact all
the info you need on one of these pages IS in just the html.

-- 
Jeremy Nicoll - my opinions are my own


