How can I create a list of programmes from BBC Sounds

Fri May 26 08:04:14 PDT 2023

On 24/05/2023 21:35, Jeremy Nicoll - ml gip wrote:
> On 2023-05-20 18:37, Budge wrote:
>
>> Using the pid for the entire archive I can get a list of the entire
>> archive but I have not found out how to sort the list by genre or
>> obtain a list of a single genre.
>
> It looks to me as if "genre" is an arbitrary tag that the BBC do not expose
> on any of the html pages - it must exist in their database of programmes
> but nowhere else.
>
>
>> Can anybody please suggest how I might obtain the list.
>
> Use curl or wget to request all the pages that one could fetch manually
> for a specific genre, eg for science:
>
>  https://www.bbc.co.uk/programmes/p01gyd7j?page=1
>  https://www.bbc.co.uk/programmes/p01gyd7j?page=2
>  https://www.bbc.co.uk/programmes/p01gyd7j?page=3
>  ...
>
> until you get the "page not found" page.  Then examine the html and extract
> the episode name, pid (and maybe episode description) for each one.
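A minimal sketch of that fetch loop in Python, assuming the "page not found" page arrives with an HTTP 404 status (if the site instead returns a normal page with an error message in it, the stop test would need to look at the html). The function names and the max_pages safety limit are made up for illustration.

```python
import urllib.error
import urllib.request

# Base URL taken from the example pages quoted above.
BASE = "https://www.bbc.co.uk/programmes/p01gyd7j"


def http_fetch(url):
    """Return the page's html, or None if the server answers 404 (assumed)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def fetch_all_pages(base=BASE, fetch=http_fetch, max_pages=200):
    """Request ?page=1, ?page=2, ... until a page is missing."""
    pages = []
    for n in range(1, max_pages + 1):
        html = fetch(f"{base}?page={n}")
        if html is None:  # fell off the end of the list
            break
        pages.append(html)
    return pages
```

Passing the fetch function in as a parameter means the loop can be tried out against saved copies of pages without hitting the web server each time.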
>
>
> It looks easy(ish) for anyone who can write a computer program; the drawback
> of this approach is that you have to study the html quite carefully to find
> how its internal structure replicates for each entry on a page.
>
> But broadly speaking it'll be a whole load of html for all the stuff that's
> at the start of a page, then stuff for the start of that page's list, then
> the entries (though that may also be more complex than needed because of
> the way that they get laid out in rows), then the end of that page's list,
> then the stuff at the end of every page.
>
> How you then identify those areas on an html page depends greatly on the
> programming language you use, and how complex you want your code to be.
>
> There's also a problem that the (no doubt) machine-generated html on these
> pages will quite likely change its layout quite often.
>
> So ... if you need to run your indexer fairly often (to pick up new entries
> or just to check that it still produces the same answers as last time) you
> need ideally to have your scanning program able to test if its assumptions
> about the layout of the entries are still reasonable.
>
> Adding to the problem, perhaps, is that html doesn't need to be arranged in
> a file in the sort of line-by-line layout that one might see if - in a
> browser - one does a "view source" for a page.  Things that one day seem to
> be on two consecutive lines might on another day be more or less spread out.
>
>
> In some cases when I've extracted stuff from html pages I've started off by
> eg replacing long runs of repeated spaces by single spaces, and removed
> completely some parts of the html because - for what I wanted - it just
> muddied the water.  In some cases it made more sense to introduce more line
> breaks so the file I then scanned had many more, but shorter, lines than the
> html that I got back from the web server.
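That kind of tidying-up might look something like the following in Python. This is only a sketch: which parts to throw away is a guess (script and style blocks here), and putting every tag on its own line is just one possible way of introducing more line breaks.

```python
import re


def tidy(html):
    """Simplify html before scanning it line by line."""
    # Remove parts that only muddy the water (an assumption about what
    # is not wanted: whole <script> and <style> blocks).
    text = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.S | re.I)
    # Replace every run of spaces, tabs and newlines by a single space.
    text = re.sub(r"\s+", " ", text)
    # Start a new line wherever one tag directly follows another.
    text = re.sub(r">\s*<", ">\n<", text)
    return text.strip()
```

After this step the line layout no longer depends on how the server happened to wrap the html on a given day.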
>
>
> But for example, on page 1 of the Science lists, assuming that none of the
> relevant links span a line-break, there's 25 occurrences of
>
>  href="https://www.bbc.co.uk/programmes/
>
> nearly all of them occurring at 45 line intervals.  The first one is in the
> "<head>" section of the page, so that leaves 24 of them in the <body> part,
> corresponding to the 24 programmes described.
>
> The place where two consecutive occurrences are not 45 lines apart is in
> the last row, where one episode is a repeat and doesn't have a "play" button.
> You'd need to see how its html doesn't follow the structure of the entries
> that do have such a button, and take that into account in any code you
> write (AND look out for other unexpected differences).
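For anyone without an editor that shows the spacing at a glance, a few lines of Python can report the same thing. This is a rough sketch with made-up function names; it treats the most common gap as "normal" and lists the occurrences that break the pattern.

```python
from collections import Counter

MARKER = 'href="https://www.bbc.co.uk/programmes/'


def marker_lines(html):
    """Line numbers (counting from 1) of the lines containing MARKER."""
    return [n for n, line in enumerate(html.splitlines(), start=1)
            if MARKER in line]


def odd_gaps(line_numbers):
    """Return (line_number, gap) for each occurrence whose distance from
    the previous occurrence differs from the most common distance."""
    gaps = [b - a for a, b in zip(line_numbers, line_numbers[1:])]
    if not gaps:
        return []
    usual = Counter(gaps).most_common(1)[0][0]
    return [(line_numbers[i + 1], gap)
            for i, gap in enumerate(gaps) if gap != usual]
```

Any entry that odd_gaps() reports is one whose html is worth reading by eye before trusting a scanner on it.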
>
>
> Those 24 literals all have a pid in them so really look like eg
>
>        href="https://www.bbc.co.uk/programmes/m001l291"
>        href="https://www.bbc.co.uk/programmes/m001jc68"
>        href="https://www.bbc.co.uk/programmes/m001hnlf"
>        ...
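Pulling the pids out of those links could be sketched like this. The pid pattern (eight or more lowercase letters and digits) is an assumption based only on the examples above, and the expected figure of 24 per page is simply what page 1 shows at the moment. The count check is there so that a change in the page layout is noticed rather than silently producing a wrong list.

```python
import re

# Assumed pid shape, inferred from the examples quoted above.
PID_LINK = re.compile(
    r'href="https://www\.bbc\.co\.uk/programmes/([a-z0-9]{8,})"')


def body_pids(html):
    """Return the pids linked from the <body>, in page order, no repeats."""
    start = html.find("<body")
    body = html[start:] if start != -1 else html  # ignore the <head> link
    pids = []
    for pid in PID_LINK.findall(body):
        if pid not in pids:
            pids.append(pid)
    return pids


def check_count(pids, expected=24):
    """Raise an error if the page no longer looks the way we assumed."""
    if not 0 < len(pids) <= expected:
        raise ValueError(f"found {len(pids)} pids, expected 1 to {expected}")
    return pids
```

The last page of a genre will normally hold fewer entries than the others, which is why the check accepts anything from 1 up to the expected number.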
>
> I'd /guess/ that one could eg take such a file, skip past its first 300 or
> so lines (the first meaningful pid line is around line 330 at the moment),
> then repeatedly scan forwards looking for
>
>  href="https://www.bbc.co.uk/programmes/    and
>
> read the pid that immediately follows that, then scan forwards for what
> marks the start of the episode name (but not scan more than - say - ten
> lines if - at the moment - you'd expect to find the episode name in the
> next (say) 5 lines), then similarly scan for the episode description's start.
>
> Repeat until you fall off the end of that page's list.
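The scan described above could be sketched roughly as follows. Note that title_marker is a placeholder: the text that really marks the start of an episode name has to be found by studying the html, and nothing here claims to know it. The skip and limit values are likewise just the sort of numbers mentioned above.

```python
import re

# Assumed pid shape, inferred from the example links.
PID_RE = re.compile(
    r'href="https://www\.bbc\.co\.uk/programmes/([a-z0-9]{8,})"')


def find_within(lines, start, needle, limit):
    """Index of the first line at or after 'start' containing 'needle',
    looking at no more than 'limit' lines; None if it is not found."""
    for i in range(start, min(start + limit, len(lines))):
        if needle in lines[i]:
            return i
    return None


def scan_entries(lines, title_marker, skip=300, limit=10):
    """Return (pid, title_line) pairs; title_line is None when the marker
    was not found close enough, which signals a layout surprise."""
    entries = []
    for i in range(skip, len(lines)):
        match = PID_RE.search(lines[i])
        if match:
            t = find_within(lines, i + 1, title_marker, limit)
            title_line = lines[t].strip() if t is not None else None
            entries.append((match.group(1), title_line))
    return entries
```

The same bounded search could be repeated with a second marker for the episode description. Entries that come back with None are the ones to inspect by hand, such as the repeat without a "play" button.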
>
>
>
> If you can't program, eg in any version of BASIC, or python or perl or ...
> anything, maybe this would be a good time to learn how to?  Your code
> would not need to be elegant or sophisticated ... just work.
>
>
> For all I know, there may be utilities into which one could drag an
> html page, and then manipulate it reasonably easily to extract the data
> you want.  The trouble is, I don't know my way around tools that I don't
> use.
>
>
> I /do/ use a programmers' text editor; that's what showed me at a
> glance that the instances of
>
>  href="https://www.bbc.co.uk/programmes/
>
> are at a specific repeating interval (though I expected that they
> would be, more or less).  I'm sure that some other editors would show
> the same thing but in different ways - many would probably let you
> find successive such lines, but not simultaneously show how far apart
> they all are.
>
> Any sophisticated text editor has a steep learning curve and - if eg
> you only use a very basic one - it's hard to know whether you'd benefit
> from acquiring another one, and impossible to recommend one. Typically,
> when I've periodically looked at others to see what they offer, some
> feel "right" but not versatile enough, and some feel "alien" in some
> way and I never explore what they can do, and some are more versatile
> than what I use now, but also far too complex.
>
> Some - including the one I use - are scriptable.  Often that just means
> that you can tell it to save a temporary copy of the file you're editing,
> run an external program against it, then display the results (eg in
> another tab) - which is better than nothing, but limiting.
>
> Mine though has a programming language built into it and that can examine
> the data that the editor is displaying (so it avoids the overhead of
> making an external copy and later reading results), and manipulate it.
> But to do that, you need to know not just the programming language, but
> also the internal form of commands that otherwise one might just normally
> issue via menus in the editor's interface, and how to ask the editor what
> those commands did.  I've used this editor (& a similar-looking one with
> a different command-set) for over 30 years.
>
>
>
>
> Bear in mind that if you fetch the html for such a page you don't get all
> the javascript, images, CSS, etc that make what a browser sees and does
> so complex.  That can make this sort of thing really hard to do (if how
> a page behaves depends on both CSS and Javascript) but in fact all the
> info you need on one of these pages IS in just the html.
>
Hi Jeremy, and many thanks for your suggestions and detailed reply.
As will be evident from my earlier posts, I am not a coder. I am also buried in work, trying to sort out our accounts following a change of accounting software, so I do not have much time before I nod off at the end of the day.

My original question was just picking up where I left off a while ago. I too have found that the genres used on the BBC website do not seem to be tagged anywhere I could find. My only reason for asking is to make casual searching easier by using the tags. Most of what I have on my system from IOT is tagged from the days when I somehow knew the genre; I honestly do not recall how. Now that I have more unclassified programmes than tagged ones, I thought I would do something to improve the situation.

What I have done by way of an experiment is take screenshots of all the web pages by genre and then turn them into text files using tesseract-ocr on the .png image.   This gives me a dirty text file which I can clean up without too much effort.  I can then use it to find the file in my collection and tag it with the appropriate genre.  The advantage is that with a bit of manipulation I can generate a list of titles and compare it with my own.
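For the comparison step, a small Python sketch along these lines might cope with the odd OCR misreading. It uses difflib from the standard library; the function name and the 0.8 cutoff are arbitrary choices for illustration, and the cutoff would need tuning against real tesseract output.

```python
import difflib


def match_titles(ocr_titles, my_titles, cutoff=0.8):
    """Map each OCR'd title to the closest title in my own list,
    or to None when nothing is similar enough."""
    by_lower = {title.lower(): title for title in my_titles}
    result = {}
    for raw in ocr_titles:
        hits = difflib.get_close_matches(
            raw.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
        result[raw] = by_lower[hits[0]] if hits else None
    return result
```

Titles that come back as None are the ones still needing a manual look.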

It all looks good. It is still slow and painful, but I can do it in small sessions when I have an opportunity.

Any further thoughts welcome and thanks again,
Alastair.