podverse / podcast-db
Podcast RSS feed parsing utility
We'll want to be able to check logs for parser failures...
Podverse's current podcast and episode unique id problem:
Today the app uses podcastFeedURLs as the unique ids for podcasts, and episodeMediaURLs (url to mp3/ogg/etc) as the unique ids for episodes.
After having the site deployed and parsing between 50 - 150 podcasts a day over the past 3 weeks, I've seen podcastFeedURLs and/or episodeMediaURLs change for ~5 podcasts. Whenever this happens, I have to manually fix / update the db.
I can live with that failure rate, and for now I can manually fix podcasts that fall out of sync while the site supports ~2000 podcasts, but hopefully we can find better ways to handle these things.
Potential options for unique ids that some podcasts currently use:
An example of an ideal guid is in #WeThePeople Live's RSS feed, where all episodes have what I believe to be a proper uuid:
<guid isPermaLink="false"><![CDATA[ c9aa7c12-334b-47fc-8d60-eb28f527c8d0 ]]></guid>
If every RSS feed used a different uuid like that we'd be in great shape, but as of today most do not. Many podcasts use a different guid format, like this one from the Joe Rogan Experience RSS feed:
<guid isPermaLink="false"><![CDATA[483a81100097301f38b7dc15427599ef]]></guid>
Have you ever seen a guid like this? What is this format called? Can we validate it?
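On the validation question, here is a minimal sketch, assuming all we want is to tell the two shapes apart. The Joe Rogan Experience value is 32 hexadecimal characters, the same length as an MD5 hex digest or a uuid with the hyphens stripped. The feed alone doesn't say which it is, so the check below only validates the shape. `classifyGuid` and the `kind` labels are made-up names, not anything that exists in Podverse today.

```javascript
// Hypothetical helper: classify the text pulled out of a <guid> element.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const HEX32_RE = /^[0-9a-f]{32}$/i;

function classifyGuid(raw) {
  // CDATA content can carry padding spaces, as in the #WeThePeople Live example.
  const value = String(raw).trim();
  if (UUID_RE.test(value)) return { kind: "uuid", value: value.toLowerCase() };
  if (HEX32_RE.test(value)) return { kind: "hex32", value: value.toLowerCase() };
  return { kind: "unknown", value };
}
```

Lowercasing the matched value is a design choice so that the same id compares equal if a feed ever changes letter case. Drop it if ids should be stored exactly as published.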
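Another guid format is a URI rather than an opaque id, which is what I'd expect to see with isPermaLink="true". This gid:// example can be found in the Duncan Trussell Family Hour feed (note that the feed itself marks it isPermaLink="false"):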
<guid isPermaLink="false">gid://art19-episode-locator/V0/Gx0Krxiq-AwcKcvw8RE2g-uWf_9A-iQPRnjqlj_EosQ</guid>
I think I've seen https URLs in there as well. I don't have an example right now, but it would look something like:
<guid isPermaLink="true"><![CDATA[https://example.podcaster.com/unique123abc]]></guid>
Implementation of these two unique ids is spotty (sometimes people use integers for guids, sometimes people use non-permanent or non-unique URLs as the permaLink). Still, it seems worthwhile to me to leverage each of them as unique ids wherever possible, in order to minimize maintenance / tech debt.
Proposed approach for handling unique ids for podcasts and episodes in Podverse
1. check for a valid uuid in the guid field, and if that's not available
2. check for a valid alternate guid in the format without hyphens (32 hex characters), and if that's not available
3. check for a valid permalink-style guid that uses the gid:// protocol, and if that's not available
4. check for a valid permalink-style guid that uses the https:// protocol, and if none of those are available
5. check for a valid podcastFeedURL / episodeMediaURL as a last resort.
NOTE: I have seen a feed that included multiple <guid> tags per episode, so we should probably store guids in an array, then loop over the values checking for the first match in the order listed above.
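The order above could be sketched roughly like this. It is an illustration under assumptions, not existing Podverse code: `pickUniqueId`, `CHECKS`, and the `kind` labels are hypothetical names, and the patterns only check the shape of each value, not whether it is actually unique or permanent.

```javascript
// Priority order from the list above; every name here is illustrative.
const CHECKS = [
  ["uuid", (s) => /^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$/i.test(s)],
  ["hex32", (s) => /^[0-9a-f]{32}$/i.test(s)],
  ["gid", (s) => /^gid:\/\/\S+$/.test(s)],
  ["https", (s) => /^https:\/\/\S+$/.test(s)],
];

// guids: every <guid> value found for one episode.
// fallbackUrl: the episodeMediaURL (or podcastFeedURL), used as the last resort.
function pickUniqueId(guids, fallbackUrl) {
  const candidates = guids.map((g) => String(g).trim()).filter(Boolean);
  for (const [kind, matches] of CHECKS) {
    // The first value that fits the highest-priority format wins.
    const hit = candidates.find(matches);
    if (hit) return { kind, id: hit };
  }
  return { kind: "url", id: fallbackUrl };
}
```

For example, `pickUniqueId(["42"], mediaUrl)` falls through every check and returns the media URL, which is what happens today for every episode.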
Any thoughts on this proposed direction?
Any other ideas for how these podcast feed unique id issues can be ameliorated?
I deleted a podcast with the following command on my local host:
DELETE FROM "podcasts" WHERE "id"='idGoesHere';
But it appears that the podcast's episodes were not deleted when I did that. I thought the db was already set up to cascade deletes to episodes, but now I'm not sure.
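One way to find out what the db is actually set up to do is to ask PostgreSQL for the ON DELETE rule on each foreign key. This is a rough sketch: the `information_schema` views are standard PostgreSQL, but `findNonCascadingKeys` is a made-up name, the "episodes" table name is an assumption, and the `client` argument is assumed to be anything with a node-postgres style `query(text, values)` method.

```javascript
// Lists each foreign key on a table together with its ON DELETE rule.
const DELETE_RULES_SQL = `
  SELECT tc.constraint_name, rc.delete_rule
  FROM information_schema.table_constraints AS tc
  JOIN information_schema.referential_constraints AS rc
    ON rc.constraint_name = tc.constraint_name
   AND rc.constraint_schema = tc.constraint_schema
  WHERE tc.constraint_type = 'FOREIGN KEY'
    AND tc.table_name = $1
`;

// Returns the foreign keys whose rule is anything other than CASCADE.
async function findNonCascadingKeys(client, tableName) {
  const { rows } = await client.query(DELETE_RULES_SQL, [tableName]);
  return rows.filter((row) => row.delete_rule !== "CASCADE");
}
```

If the key from episodes to podcasts comes back as NO ACTION or SET NULL, the cascade was never created at the database level, and a raw DELETE like the one above will not remove the episodes.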
First, a rough idea of how I imagine podverse-feedparser working:
podverse-feedparser, podverse-web, and the podverse PostgreSQL database all listen on their separate ports, deployed on their separate servers.
Every few hours or so, a cron job triggers podverse-feedparser to query for all podcast RSS feed URLs in the database.
The parseFeeds method is called with the array of all the feed URLs currently in the db. parseFeeds adds each of these feeds to a queue to be parsed.
The parseFeeds queue runs sequentially, calling the parseFeed method with each URL until finished. As it goes, parseFeed writes updated podcast and episode feeds to the PostgreSQL db. (This parseFeed function already exists in podverse-web here.)
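A minimal sketch of the sequential loop described above, assuming parseFeed returns a promise that settles once its db write is done. parseFeed is passed in as an argument here only so the sketch doesn't depend on podverse-web, and the shape of the returned failures list is my own assumption.

```javascript
// Runs parseFeed over every URL, one at a time and in order.
// A failing feed is recorded and skipped so one bad feed cannot stall the rest.
async function parseFeeds(feedUrls, parseFeed) {
  const failures = [];
  for (const url of feedUrls) {
    try {
      await parseFeed(url); // wait for this feed before starting the next
    } catch (err) {
      failures.push({ url, message: err.message }); // keep for the logs
    }
  }
  return failures;
}
```

Returning the failures gives the cron job something concrete to write to a log, which would also help with checking for parser failures later.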
I feel confident I can write code to make each of these things happen, but I am not sure how to elegantly reuse the podverse-web repositories/sequelize/engineFactory.js and models in the separate podverse-feedparser app.
I considered using npm install git://podverse-web as a dependency in podverse-feedparser, then somehow loading the models within podverse-feedparser by loading podverse-web files available in node_modules...but I'm not quite sure how I'd do that yet, and I wonder if I'm heading down the wrong path.
Having two separate apps that share a PostgreSQL db is new territory for me. Any tips on how to architect this stuff?
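One possible shape, offered as a sketch and not a recommendation: move the model definitions into a small shared module that both apps install, and have it export a function that takes each app's own Sequelize connection. Everything below is hypothetical, including `defineModels`, the model names, and the `feedURL` / `mediaURL` columns. The only Sequelize calls it relies on are `define`, `hasMany`, and `belongsTo`.

```javascript
// Hypothetical shared module that podverse-web and podverse-feedparser both depend on.
// sequelize and DataTypes are injected so each app brings its own connection.
function defineModels(sequelize, DataTypes) {
  const Podcast = sequelize.define("podcast", {
    feedURL: { type: DataTypes.TEXT, allowNull: false }, // assumed column
  });
  const Episode = sequelize.define("episode", {
    mediaURL: { type: DataTypes.TEXT, allowNull: false }, // assumed column
  });

  // Associations live next to the models so both apps see the same schema.
  Podcast.hasMany(Episode, { onDelete: "CASCADE" });
  Episode.belongsTo(Podcast);

  return { Podcast, Episode };
}
```

Each app would then call `defineModels(sequelize, DataTypes)` once at startup with the connection it created from its own config, so neither app needs to reach into the other's files under node_modules.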