podverse / podcast-db

Podcast RSS feed parsing utility

JavaScript 98.41% Shell 0.74% Dockerfile 0.85%

podcast-db's People

Forkers

tarsbase

podcast-db's Issues

Setup log tracking wherever podcast-db gets deployed

We'll want to be able to check logs for parser failures...

What are our options for podcast and episode unique ids?

Podverse's current podcast and episode unique id problem:

Today the app uses podcastFeedURLs as the unique ids for podcasts, and episodeMediaURLs (url to mp3/ogg/etc) as the unique ids for episodes.

After having the site deployed and parsing between 50 - 150 podcasts a day over the past 3 weeks, I've seen podcastFeedURLs and/or episodeMediaURLs change for ~5 podcasts. Whenever this happens, I have to manually fix / update the db.

I can live with that failure rate, and can manually fix podcasts when they fall out of sync for now while supporting ~2000 podcasts on the site, but hopefully we can find better ways to handle this.

Potential options for unique ids that some podcasts currently use:

An example of an ideal guid is in #WeThePeople Live's RSS feed, where all episodes have what I believe to be a proper uuid:

<guid isPermaLink="false"><![CDATA[ c9aa7c12-334b-47fc-8d60-eb28f527c8d0 ]]></guid>

If every RSS feed used a different uuid like that we'd be in great shape, but as of today most do not. Many podcasts use a different guid format, like this one from the Joe Rogan Experience RSS feed:

<guid isPermaLink="false"><![CDATA[483a81100097301f38b7dc15427599ef]]></guid>

Have you ever seen a guid like this? What is this format called? Can we validate it?

Another guid format is a URI, which I'd expect to appear when isPermaLink="true". This gid:// example can be found in the Duncan Trussell Family Hour feed, although there it is marked isPermaLink="false":

<guid isPermaLink="false">gid://art19-episode-locator/V0/Gx0Krxiq-AwcKcvw8RE2g-uWf_9A-iQPRnjqlj_EosQ</guid>

I think I've seen https urls in there as well. I don't have an example right now, but it would look something like:

<guid isPermaLink="true"><![CDATA[https://example.podcaster.com/unique123abc]]></guid>

Implementation of these two kinds of unique id is spotty (sometimes people use integers for guids, sometimes people use non-permanent or non-unique urls as the permaLink). Still, it seems worthwhile to me to leverage each of them as unique ids wherever possible, in order to minimize maintenance / tech debt.

Proposed approach for handling unique ids for podcasts and episodes in Podverse

1. check for a valid uuid in the guid field, if that's not available
2. check for a valid one of those alternate guids without the hyphens, if that's not available
3. check for a valid isPermaLink that uses the gid:// protocol, and if none of those are available
4. check for a valid isPermaLink that uses the https:// protocol, and if none of those are available
5. check for a valid podcastFeedURL / episodeMediaURL as a last resort.

NOTE: I have seen a feed that included multiple <guid> tags per episode, so we should probably store guids in an array, then loop over the values checking for the first match in the order listed above.
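As a rough sketch of that lookup order, something like the following could work. The function name `pickUniqueId`, the type labels, and the regular expressions are assumptions made for illustration, not existing Podverse code. On the "what is this format called" question: 32 hex characters is the length of an MD5 digest and also of a uuid with the hyphens removed, and the value alone does not tell us which one a feed uses, so the sketch only validates the shape.

```javascript
// Hypothetical patterns for the guid shapes discussed in this issue.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const HEX32_RE = /^[0-9a-f]{32}$/i; // 32 hex chars, no hyphens

// Ordered from most to least preferred, mirroring the proposed approach.
const MATCHERS = [
  { type: 'uuid', test: (value) => UUID_RE.test(value) },
  { type: 'hex32', test: (value) => HEX32_RE.test(value) },
  { type: 'gid', test: (value) => value.startsWith('gid://') },
  { type: 'https', test: (value) => value.startsWith('https://') },
];

// guids: every <guid> value found for one episode (or podcast).
// fallbackUrl: the podcastFeedURL / episodeMediaURL used as a last resort.
function pickUniqueId(guids, fallbackUrl) {
  // CDATA values can carry surrounding whitespace, so trim before matching.
  const cleaned = guids.map((guid) => String(guid).trim()).filter(Boolean);

  for (const { type, test } of MATCHERS) {
    const match = cleaned.find(test);
    if (match) return { type, value: match };
  }
  return { type: 'url', value: fallbackUrl };
}
```

Because the outer loop walks the preference list rather than the guid array, a uuid wins even when it is not the first guid tag in the feed. Integer guids and http:// urls match nothing here and fall through to the url fallback.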

Any thoughts on this proposed direction?

Any other ideas for how these podcast feed unique id issues can be ameliorated?

Setup shell scripts and a queue for running feed parser CRON job

I don't have much experience with writing shell scripts or queues (besides JS Promise.all() stuff), so I'll see what I can do.

Refer to @scvnc's notes here

"Possible EventEmitter memory leak detected" error when running npm test

This error pops up during npm test. The tests still pass, but I don't know what is causing it, and "memory leak" doesn't sound good.

(node:7292) Warning: Possible EventEmitter memory leak detected. 11 exit listeners added. Use emitter.setMaxListeners() to increase limit
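For context, Node prints this warning when more than 10 listeners (the default limit) are registered for the same event on one emitter; here the event is exit on process. One possible cause, which is only a guess for this repo, is that a module required by each test file adds its own exit handler every time it is set up. Running node with the --trace-warnings flag prints a stack trace showing where the warning was triggered. The sketch below uses a stand-in emitter and a hypothetical helper named `onceOnly` to show how registering the same handler only once keeps the count down.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: attach a handler only if that exact function
// is not already registered for the event on this emitter.
function onceOnly(emitter, eventName, handler) {
  if (emitter.listeners(eventName).includes(handler)) return false;
  emitter.on(eventName, handler);
  return true;
}

// Stand-in for `process`, so the sketch does not touch real exit handling.
const fakeProcess = new EventEmitter();

function cleanup() {
  // e.g. close a database connection (placeholder)
}

// Simulate 11 setup calls that all try to register the same cleanup handler.
for (let i = 0; i < 11; i++) {
  onceOnly(fakeProcess, 'exit', cleanup);
}

console.log(fakeProcess.listenerCount('exit')); // 1
```

Raising the limit with setMaxListeners() would silence the warning, but it would not explain why the listeners are piling up.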

Add thumbnail image parsing and saving

Add and then fix all the tests broken since the podverse-web decoupling

Is the db set up to cascade delete episodes when you delete a podcast?

I deleted a podcast with the following command on my local host:

DELETE FROM "podcasts" WHERE "id"='idGoesHere';

But it appears that the podcast's episodes were not deleted when I did that. I thought the db was already set up to cascade delete episodes, but now I'm not sure.
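One way to check is to ask PostgreSQL which ON DELETE action the foreign key on the episodes table actually has. The sketch below keeps the SQL as strings that could be pasted into psql. It assumes the table is named episodes and references podcasts through a podcastId column, and the constraint name is made up; the real names may differ.

```javascript
// Lists foreign keys defined on the (assumed) "episodes" table together with
// their ON DELETE action code from the pg_constraint system catalog.
const CHECK_FK_SQL = `
SELECT conname, confdeltype
FROM pg_constraint
WHERE contype = 'f'
  AND conrelid = 'episodes'::regclass;`;

// Translates PostgreSQL's one-letter confdeltype codes into readable labels.
function describeDeleteAction(code) {
  const actions = {
    a: 'NO ACTION',
    r: 'RESTRICT',
    c: 'CASCADE',
    n: 'SET NULL',
    d: 'SET DEFAULT',
  };
  return actions[code] || 'UNKNOWN';
}

// Adds a cascading foreign key. The column and constraint names are assumptions,
// and an existing foreign key on the same column would need to be dropped first.
const ADD_CASCADE_SQL = `
ALTER TABLE "episodes"
  ADD CONSTRAINT "episodes_podcastId_fkey"
  FOREIGN KEY ("podcastId") REFERENCES "podcasts" ("id")
  ON DELETE CASCADE;`;
```

If the first query returns a code other than c, or returns no rows at all, deleting a row from podcasts will not remove its episodes.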

Relevant file

How can we share Sequelize models between podverse-web and podverse-feedparser as separate apps?

First, a rough idea of how I imagine podverse-feedparser working:

podverse-feedparser, podverse-web, and the podverse PostgreSQL database all listen on their separate ports, deployed on their separate servers.
Every few hours or so, a cron job triggers podverse-feedparser to query for all podcast RSS feed URLs in the database.
The parseFeeds method is called with the array of all the feed URLs currently in the db. parseFeeds adds each of these feeds to a queue to be parsed.
The parseFeeds queue runs sequentially, calling the parseFeed method with each URL until finished. As it goes, parseFeed writes updated podcast and episode data to the PostgreSQL db. (This parseFeed function already exists in podverse-web here.)
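The sequential part of that flow could be sketched like this. It is only an outline: parseFeed is passed in as a parameter so the sketch does not depend on the real podverse-web implementation, and the shape of the returned results array is an assumption.

```javascript
// Runs parseFeed for each feed URL one at a time, in order.
// A failure on one feed is recorded and does not stop the rest of the queue.
async function parseFeeds(feedUrls, parseFeed) {
  const results = [];

  for (const url of feedUrls) {
    try {
      // Waiting here is what makes the queue sequential: the next feed
      // does not start until this one has finished or failed.
      await parseFeed(url);
      results.push({ url, ok: true });
    } catch (err) {
      results.push({ url, ok: false, error: err.message });
    }
  }

  return results;
}
```

Compared with Promise.all(), this never has more than one feed in flight, which keeps the load on the database and on the remote servers predictable at the cost of a longer total run time.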

I feel confident I can write code to make each of these things happen, but I am not sure how to elegantly reuse the podverse-web repositories/sequelize/engineFactory.js and models in the separate podverse-feedparser app.

I considered using npm install git://podverse-web as a dependency in podverse-feedparser, then somehow loading the models within podverse-feedparser by loading podverse-web files available in node_modules...but I'm not quite sure how I'd do that yet, and I wonder if I'm heading down the wrong path.

Having two separate apps that share a PostgreSQL db is new territory for me. Any tips on how to architect this stuff?

Write method to determine which podcast feeds should be parsed, then add those feeds to the toBeParsed queue