Limit metadata scanning to new files

Limit metadata scanning

To extract the metadata stored in an MP3 file on Amazon S3, the file must be (partially) downloaded. This is somewhat expensive and racks up our Amazon S3 usage.

Problem

Phoenix tags its MP3 sermons in ID3v1 format, which puts the metadata in the last 128 bytes of the file. Processing currently relies on the metadata being at the very start of a file.

Possible Solution

If the first part of an MP3 file does not contain a valid MPEG frame, download the last 128 bytes to check for metadata.

See http://id3.org/FAQ
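As a minimal sketch of the check described above, the following parses a 128-byte ID3v1 trailer using the fixed layout of the format ("TAG" marker, then 30-byte title, artist, and album fields, a 4-byte year, and a 30-byte comment). The class name `Id3v1Trailer` is hypothetical, and how the trailing bytes are fetched (for example an HTTP `Range: bytes=-128` request) is left out and would need to be confirmed against S3.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the existing code base.
final class Id3v1Trailer {
    static final int TAG_SIZE = 128;

    /** Returns the tag fields, or an empty map if the bytes are not an ID3v1 tag. */
    static Map<String, String> parse(byte[] trailer) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (trailer == null || trailer.length != TAG_SIZE
                || trailer[0] != 'T' || trailer[1] != 'A' || trailer[2] != 'G') {
            return fields;
        }
        fields.put("title", text(trailer, 3, 30));
        fields.put("artist", text(trailer, 33, 30));
        fields.put("album", text(trailer, 63, 30));
        fields.put("year", text(trailer, 93, 4));
        fields.put("comment", text(trailer, 97, 30));
        return fields;
    }

    // Fields are fixed width and padded with NUL bytes or spaces.
    private static String text(byte[] data, int offset, int length) {
        String raw = new String(data, offset, length, StandardCharsets.ISO_8859_1);
        int nul = raw.indexOf('\0');
        return (nul >= 0 ? raw.substring(0, nul) : raw).trim();
    }

    public static void main(String[] args) {
        byte[] trailer = new byte[TAG_SIZE];
        System.arraycopy("TAG".getBytes(StandardCharsets.ISO_8859_1), 0, trailer, 0, 3);
        byte[] artist = "John Doe".getBytes(StandardCharsets.ISO_8859_1);
        System.arraycopy(artist, 0, trailer, 33, artist.length);
        System.out.println(parse(trailer));
    }
}
```

An empty result means the last 128 bytes are not an ID3v1 tag, in which case another fallback (such as the filename) would be needed.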

Delete old sync execution records

The free usage tier of postgres limits us to a set number of database rows. As the sync execution audit log is not that useful, only maintain a single record in the table.

AWS links need to be region agnostic

llc-archives/src/main/groovy/org/llc/archive/service/ArchivedSermonsServiceImpl.groovy

Line 95 in a37a3bc

fileUrl: "https://s3-us-west-2.amazonaws.com/${bucket}/${sermon.file}"

This line currently has a hard-coded Amazon S3 URL pointing to s3-us-west-2. It should be changed to s3.amazonaws.com, because buckets outside that region respond with:

<Error>
<Code>PermanentRedirect</Code>
<Message>
The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
</Message>
<Bucket>phoenix-archives</Bucket>
<Endpoint>s3.amazonaws.com</Endpoint>
<RequestId>93C8E90247E629B7</RequestId>
<HostId>
ivU1FTm2aEDtMGIrum8Vie2VR9O3BV+Chb9d7iu41z6LDPejxSzexa30Gu1rIhAF1HShXWNSyfQ=
</HostId>
</Error>

This will require running a query to update all existing MP3 file URLs in the database.
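The rewrite that such a migration would apply to each stored URL could look roughly like the sketch below. The class name `S3UrlRewriter` is hypothetical, and the pattern assumes every stored URL uses the path-style `https://s3-<region>.amazonaws.com/<bucket>/<key>` form shown above.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the URL change; not existing project code.
final class S3UrlRewriter {
    // Matches path-style regional hosts such as s3-us-west-2.amazonaws.com
    private static final Pattern REGIONAL_HOST =
            Pattern.compile("^https://s3-[a-z0-9-]+\\.amazonaws\\.com/");

    /** Replaces a regional host with s3.amazonaws.com; other URLs are returned unchanged. */
    static String toRegionAgnostic(String url) {
        return REGIONAL_HOST.matcher(url).replaceFirst("https://s3.amazonaws.com/");
    }

    public static void main(String[] args) {
        System.out.println(toRegionAgnostic(
                "https://s3-us-west-2.amazonaws.com/phoenix-archives/2015/20150830_RNikula.mp3"));
    }
}
```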

Create a REST API to expose sermons

We need to be able to support multiple congregations, perhaps with endpoints such as:

GET    /v1/{congregation_id}/sermons
GET    /v1/{congregation_id}/sermons/{sermon_id}
GET    /v1/congregations
GET    /v1/congregations/{congregation_id}

Allow custom mp3 tag mappings

Rockford uses the following convention:

mp3 tag field	parsed as
title	date + time
album	bible text
artist	minister

Other congregations may not use the same MP3 tag conventions.
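One way to make the convention configurable is to treat it as data: a per-congregation map from tag field to sermon field. The sketch below encodes Rockford's convention from the table; the class and enum names are hypothetical, and where such a mapping would be stored is left open.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of a per-congregation tag mapping.
final class TagMappingSketch {
    enum TagField { TITLE, ALBUM, ARTIST }
    enum SermonField { DATE_TIME, BIBLE_TEXT, MINISTER }

    /** Rockford's convention: title = date + time, album = bible text, artist = minister. */
    static Map<TagField, SermonField> rockfordMapping() {
        Map<TagField, SermonField> mapping = new EnumMap<>(TagField.class);
        mapping.put(TagField.TITLE, SermonField.DATE_TIME);
        mapping.put(TagField.ALBUM, SermonField.BIBLE_TEXT);
        mapping.put(TagField.ARTIST, SermonField.MINISTER);
        return mapping;
    }

    /** Re-keys raw tag values by the sermon field they should populate. */
    static Map<SermonField, String> apply(Map<TagField, SermonField> mapping,
                                          Map<TagField, String> tagValues) {
        Map<SermonField, String> result = new EnumMap<>(SermonField.class);
        for (Map.Entry<TagField, String> entry : tagValues.entrySet()) {
            SermonField target = mapping.get(entry.getKey());
            if (target != null) {
                result.put(target, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<TagField, String> tags = new EnumMap<>(TagField.class);
        tags.put(TagField.TITLE, "04/12/2015 18:30");
        tags.put(TagField.ALBUM, "1 Kings 19:9-18");
        tags.put(TagField.ARTIST, "John Doe");
        System.out.println(apply(rockfordMapping(), tags));
    }
}
```

A congregation with a different convention would only need a different map, not different parsing code.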

Update sermon "comment" domain field

Sermons have an event field displayed in the UI, but the domain object backing it calls the field "comment". Update the domain model so the naming is consistent.

Reduce logging level

The logs are getting very noisy. Reduce logging for the MP3 parsing library.

Add heroku recharging page

Currently we are using the Heroku free usage tier, which does not allow an app to run for more than 18 hours in a 24 hour period.

Create a custom HTML page for errors and maintenance that Heroku will serve in the event that our quota is hit and the app is offline.

It would be useful to include the Google analytics hooks on this page as well, so we can track how often people are unable to view sermons due to downtime.

https://devcenter.heroku.com/articles/error-pages#customize-pages

Create a RemoteMp3DiscoveryService

RemoteMp3DiscoveryService

Create a RemoteMp3DiscoveryService strategy implementation that uses the Amazon S3 Java API to iterate over all uploaded files. Download only the ID3 tag of each file and extract the MP3 metadata from it.

architecture

benefits

MP3 metadata synchronization as a service. Congregations would not need to run the spreadsheet-updater locally on their webcast computer. Synchronization of MP3 data would happen via a quartz/cron job.

ugly example

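        // Assumes the mp3agic library (com.mpatric.mp3agic.Mp3File) and a String `sermonUrl`
        // holding the S3 URL; the original reused the name `sermon` for both URL and result.
        int size = 1024
        byte[] buf = new byte[size]
        // Download only the first 1 KB; withInputStream closes the stream afterwards
        int read = new URL(sermonUrl).withInputStream { it.read(buf, 0, size) }

        // Write only the bytes that were actually read to a temp file
        File targetFile = File.createTempFile("temp", ".mp3")
        targetFile.withOutputStream { it.write(buf, 0, Math.max(read, 0)) }

        def mp3file = new Mp3File(targetFile.absolutePath)
        def tag = mp3file.hasId3v1Tag() ? mp3file.id3v1Tag : mp3file.id3v2Tag
        def sermon = mp3DiscoveryService.extractId3v1TagData(targetFile, tag)

        // Expected values for the sample file
        assert sermon.minister == 'John Doe'
        assert sermon.bibletext == '1 Kings 19:9-18'
        assert sermon.date == '04/12/2015'
        assert sermon.time == '18:30'
        assert sermon.notes == ''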

Schedule auto-refresh sermons task

Set up a schedule to kick off synchronization of AWS S3 buckets with the database.

Dates are not displaying correctly

Dates seem to be displayed as one day behind the actual date.

For example the file: 2015/20150830_RNikula.mp3

has a date of 08/29/2015 19:00
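The reported value is consistent with a time zone shift, although the issue does not confirm the cause: midnight UTC on 08/30/2015 is 19:00 on 08/29/2015 in US Central Daylight Time. The sketch below reproduces that arithmetic; the zone `America/Chicago` and the idea that dates are stored as midnight UTC are assumptions, and `DateShiftDemo` is a hypothetical name.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical demo of how a date stored in one zone can render a day early in another.
final class DateShiftDemo {
    private static final DateTimeFormatter DISPLAY = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm");

    /** Stores the date as midnight in storageZone, then renders it in displayZone. */
    static String render(LocalDate recorded, ZoneId storageZone, ZoneId displayZone) {
        Instant stored = recorded.atStartOfDay(storageZone).toInstant();
        return DISPLAY.format(stored.atZone(displayZone));
    }

    public static void main(String[] args) {
        LocalDate recorded = LocalDate.of(2015, 8, 30);
        // Stored as UTC, displayed in US Central time: one day behind
        System.out.println(render(recorded, ZoneOffset.UTC, ZoneId.of("America/Chicago")));
        // Stored and displayed in the same zone: the expected date
        System.out.println(render(recorded, ZoneOffset.UTC, ZoneOffset.UTC));
    }
}
```

If this is the cause, parsing and rendering with the same zone, or storing a plain date without a time component, would avoid the shift.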

MP3 metadata API enhancements

Allow query parameters to specify various filters to select which sermons to refresh. This would allow one to sync specific files.

e.g.

Time Range

fromDate=02/03/2015&toDate=02/10/2015

Congregation

congregation=rllc
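A rough sketch of how these parameters might be turned into a filter is shown below. The parameter names and the `MM/dd/yyyy` format come from the examples above; the class `RefreshFilter`, its `matches` method, and the choice to treat missing parameters as "no restriction" are assumptions. URL decoding is omitted for brevity.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical filter built from the refresh query string.
final class RefreshFilter {
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    final LocalDate fromDate;     // null means no lower bound
    final LocalDate toDate;       // null means no upper bound
    final String congregation;    // null means all congregations

    private RefreshFilter(LocalDate fromDate, LocalDate toDate, String congregation) {
        this.fromDate = fromDate;
        this.toDate = toDate;
        this.congregation = congregation;
    }

    static RefreshFilter fromQuery(String query) {
        Map<String, String> params = new HashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return new RefreshFilter(parseDate(params.get("fromDate")),
                parseDate(params.get("toDate")), params.get("congregation"));
    }

    private static LocalDate parseDate(String value) {
        return value == null ? null : LocalDate.parse(value, DATE);
    }

    /** True if a sermon with this congregation and date should be refreshed. */
    boolean matches(String sermonCongregation, LocalDate sermonDate) {
        if (congregation != null && !congregation.equalsIgnoreCase(sermonCongregation)) {
            return false;
        }
        if (fromDate != null && sermonDate.isBefore(fromDate)) {
            return false;
        }
        return toDate == null || !sermonDate.isAfter(toDate);
    }

    public static void main(String[] args) {
        RefreshFilter filter = fromQuery("fromDate=02/03/2015&toDate=02/10/2015&congregation=rllc");
        System.out.println(filter.matches("rllc", LocalDate.of(2015, 2, 5)));
    }
}
```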

email status reports

As a webcast admin, it would be useful to get status emails listing archive changes.

Add podcast feed for sermons.

Several people mentioned it would be nice to have a podcast feed of the sermons. I assume it would be fed from each congregation's page on Heroku.

MP3 metadata performance enhancements

It chews up a fair amount of RAM and CPU to download and process all files in one shot. Refactor processing to process files one at a time.

Nice To Have : ability to throttle processing at a threshold.

Dates parsing incorrectly

Dates are not being parsed properly in all cases.

Date	File Name
0014-03-16 00:00:00	2014/20140316_SRoiko.mp3
0014-03-23 00:00:00	2014/20140323_JHaapsaari.mp3
0014-04-27 00:00:00	/2014/20140427_NMuhonen.mp3
0014-04-27 00:00:00	/2014/20141027_RNevala.mp3
0014-05-04 00:00:00	/2014/20140504_JHaapsaari.mp3
0014-06-20 00:00:00	/2014/20140620_JHaapsaari.mp3
0014-08-10 00:00:00	/2014/20140810_RNevala.mp3
0014-09-14 00:00:00	/2014/20140914_CKumpula.mp3
0014-11-23 00:00:00	/2014/20141123_JHaapsaari.mp3
0015-05-03 00:00:00	/2015/20150503_NMuhonen.mp3
0015-07-12 00:00:00	/2015/20150712_CKumpula.mp3
0015-08-02 00:00:00	/2015/20150802_JLehtola.mp3
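Years such as 0014 are what `java.text.SimpleDateFormat` produces when a two-digit year is parsed with a four-letter `yyyy` pattern, since the year is then taken literally. Whether that is what happens here is an assumption (the issue does not show the parsing code or the tag values); the sketch below only demonstrates the behaviour, with `TwoDigitYearDemo` as a hypothetical name.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical demo of literal versus two-digit year parsing.
final class TwoDigitYearDemo {
    /** Parses text with the given pattern and returns the resulting calendar year. */
    static int parsedYear(String pattern, String text) {
        try {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            Calendar calendar = new GregorianCalendar();
            calendar.setTime(format.parse(text));
            return calendar.get(Calendar.YEAR);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        // "yyyy" takes the digits literally, so "14" becomes the year 14
        System.out.println(parsedYear("MM/dd/yyyy", "03/16/14"));
        // "yy" resolves exactly two digits to the nearby century
        System.out.println(parsedYear("MM/dd/yy", "03/16/14"));
        // "yy" still accepts a four-digit year and takes it literally
        System.out.println(parsedYear("MM/dd/yy", "03/16/2014"));
    }
}
```

If the tag dates do mix two- and four-digit years, a `yy` pattern would handle both forms.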

Setup UI build structure

determine which framework to use
** angular
** ember
** ???
scaffold out initial project structure
setup grunt/gulp

use AWS S3 notifications to kick off processing

Stop polling every hour to find changes. Use the built-in event notification feature of Amazon S3 instead.

https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html

Support Comment tag

Populate `Notes` from ID3v1 tag Comments field

The Google spreadsheet has a Notes column. Map the ID3v1 comment field to the Notes column.

Bonus Points

The use case for the Notes column is displaying church calendar events to make specific sermons easier to locate. Examples of Notes data:

Mary's Day Services
Good Friday
New Years Day Service

If the church calendar could be parsed, the Notes field could be auto populated with the church calendar event by matching the date of the recording with event dates.

Revamp minister parsing

A few issues have surfaced regarding minister name parsing.

Parsing from MP3 Metadata

Names are converted incorrectly: "Art Simon" becomes Martin Simonson instead of Arthur Simonson, because the similarity measure scores "Martin" as closer to "Art" than "Arthur" is.

Falling back to filename

If an MP3 file does not have a minister in the artist tag, an attempt is made to parse the minister out of the filename. Rockford assumes a file name in the format YYYYMMDD_JHaapsaari.mp3. When this happens, the most similar minister is picked even when the match is ambiguous. For example, a filename of 2014/0629_RSorvala.mp3 could be from either Rodrick Sorvala or Rory Sorvala. In a case like this, the minister should be left blank.
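The "leave it blank when ambiguous" rule could be sketched as follows. This assumes the YYYYMMDD_<initial><lastname>.mp3 format described above and a master list of "First Last" names; the class name and the exact matching rule (first initial plus last name) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical resolver that refuses to guess between several candidates.
final class FilenameMinisterResolver {
    // e.g. 20140629_RSorvala.mp3 -> initial "R", last name "Sorvala"
    private static final Pattern NAME = Pattern.compile("\\d{8}_([A-Z])([A-Za-z]+)\\.mp3$");

    /** Returns the single matching minister, or "" when none or several match. */
    static String resolve(String fileName, List<String> ministers) {
        Matcher matcher = NAME.matcher(fileName);
        if (!matcher.find()) {
            return "";
        }
        String initial = matcher.group(1);
        String lastName = matcher.group(2);
        List<String> candidates = new ArrayList<>();
        for (String minister : ministers) {
            String[] parts = minister.trim().split("\\s+");
            boolean sameInitial = parts[0].startsWith(initial);
            boolean sameLastName = parts[parts.length - 1].equalsIgnoreCase(lastName);
            if (sameInitial && sameLastName) {
                candidates.add(minister);
            }
        }
        return candidates.size() == 1 ? candidates.get(0) : "";
    }

    public static void main(String[] args) {
        List<String> ministers = Arrays.asList("Rodrick Sorvala", "Rory Sorvala", "John Doe");
        System.out.println("[" + resolve("2014/20140629_RSorvala.mp3", ministers) + "]");
        System.out.println("[" + resolve("2015/20150412_JDoe.mp3", ministers) + "]");
    }
}
```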

Delete minister database

Adrian has provided a revised list of ministers. Drop the existing table and import the new list.

update library versions

According to the README.md versions badge, our dependencies are out of date.

Update to the latest version of the outdated libraries.

Full Congregation Name On Tab

Currently, when a congregation is selected, the name on the tab is the abbreviation (e.g. PLLC). It should be the full congregation name (e.g. Phoenix), since the title already indicates LLC Archived Sermons.

setup servlet welcome page

Refactor MP3DiscoverService

Problem

Mp3DiscoverServiceImpl only supports local directory scanning.

Suggested Action

Refactor to an abstract base class such that a Local and Remote Mp3 discovery service can be created easily.

Add favicon

The Spring Boot favicon is not ideal. Replace it with something church related.

Remove minister name autocorrect

Autocorrecting names adds a lot of complexity to the application and chooses incorrect names in some cases.

Remove auto correction for the time being.

Create an Admin panel to allow CRUD operations on domain objects

As an admin user, I would like the ability to edit the metadata that is displayed for sermons.

Documentation for setting up amazon s3 sync

Amazon S3 Documentation

Each congregation will be responsible for uploading its archived sermons to Amazon S3 buckets.
Provide documentation on downloading, installing, and configuring the Amazon S3 command line tool.

Also provide instruction on scheduling this command to run periodically. Consider both Windows + Linux environments.

windows environment detailed
linux environment detailed

Sortable, Searchable tables

Ideally all columns on the sermons table would be sortable (multi column sort would be fantastic). Search/Filter utilities for bonus points.

Perhaps the angular-datatables library would do what we need?

https://l-lin.github.io/angular-datatables/#/welcome

Auto-Correct minister name

Situation

When webcasters are exporting audio, they hand type the minister's name. This is error prone, leading to misspelled names being displayed on the public facing archives. The spreadsheet-updater is downstream from the actual MP3 creation, as such it has no control over source data.

Proposed Solution

When parsing minister name from the MP3 tag, compare it to a master list of ministers; pick the minister's name that is the most similar. There are various algorithms for finding string similarity. See Grails' implementation of CosineSimilarity for an example.
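To illustrate the idea, here is a small cosine similarity over character bigrams with a "pick the closest" helper. It is a simplified stand-in written for this note, not the implementation referenced above, and the names `NameMatcher`, `cosine`, and `closest` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: cosine similarity between character-bigram count vectors.
final class NameMatcher {
    static Map<String, Integer> bigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns a value in [0, 1]; 1 means identical bigram profiles. */
    static double cosine(String a, String b) {
        Map<String, Integer> va = bigrams(a);
        Map<String, Integer> vb = bigrams(b);
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> entry : va.entrySet()) {
            normA += entry.getValue() * entry.getValue();
            Integer other = vb.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        for (int count : vb.values()) {
            normB += count * count;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the master-list entry most similar to the typed name, or "" for an empty list. */
    static String closest(String typed, List<String> masterList) {
        String best = "";
        double bestScore = -1;
        for (String candidate : masterList) {
            double score = cosine(typed, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(closest("Jon Doe", Arrays.asList("John Doe", "Jane Roe")));
    }
}
```

A helper like this always returns some candidate, so a minimum score threshold would likely be needed to avoid confident but wrong corrections.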

Bonus Points

Extra nice if the master list is maintainable by LLC (perhaps another Google spreadsheet?).

Google Analytics Tracking

It would be nice to track user events to know what content is being interacted with the most.

for example:

what do people sort sermons by most frequently
what search terms are people using
download count of sermons

Allow multiple congregations to be configured

In preparation for #9, configuration for multiple congregations will need to be supported.
Perhaps this could be as simple as a properties file like:

llc.rockford.shortName=RLLC
llc.rockford.longName=Rockford Laestadian Lutheran Church
llc.rockford.aws.username=rllc-read
llc.rockford.aws.bucket=https://s3-us-west-2.amazonaws.com/
llc.rockford.aws.key.id=<aws-key>
llc.rockford.aws.key.token=<aws-token>
llc.rockford.google.username=<google-username>
llc.rockford.google.password=<google-password>
llc.rockford.google.spreadsheet=RLLC Archived Sermons
llc.rockford.google.worksheet=Sheet1

llc.minneapolis.shortName=MLLC
llc.minneapolis.longName=Minneapolis Laestadian Lutheran Church
llc.minneapolis.aws.username=mllc-read
...
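Reading such a file into one settings map per congregation could be sketched as below. The key layout `llc.<congregation>.<setting>` is taken from the example above; the class name `CongregationConfig` and the nested-map return shape are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical loader that groups llc.<congregation>.<setting> keys by congregation.
final class CongregationConfig {
    static Map<String, Map<String, String>> load(String propertiesText) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Map<String, String>> byCongregation = new TreeMap<>();
        for (String key : properties.stringPropertyNames()) {
            // Split into "llc", congregation, and the remaining setting name
            String[] parts = key.split("\\.", 3);
            if (parts.length == 3 && parts[0].equals("llc")) {
                byCongregation.computeIfAbsent(parts[1], name -> new TreeMap<>())
                        .put(parts[2], properties.getProperty(key));
            }
        }
        return byCongregation;
    }

    public static void main(String[] args) {
        String text = "llc.rockford.shortName=RLLC\n"
                + "llc.rockford.aws.username=rllc-read\n"
                + "llc.minneapolis.shortName=MLLC\n";
        System.out.println(load(text));
    }
}
```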

Render ministers as Last Name, First Name

LLC is requesting that ministers be rendered as Lastname, Firstname.

scheduled aws scan not picking up older files, not deleting removed files

Description

If a file is uploaded to AWS that has a lastModified timestamp that is older than the lastExecution time of the scheduled scan, the file is not picked up.
If a file is removed from AWS, it is not removed from the database

The remote (aws s3) directory should be compared to the locally synced database content and examined for differences.

If the file exists on S3 but not in our database, process it.
If the file does not exist on S3 but is in our database, delete it.
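The comparison amounts to two set differences over file keys. A minimal sketch, assuming both sides can be listed as sets of keys (the class name `SyncPlan` is hypothetical, and changed files with an unchanged key are not covered):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical diff between the S3 listing and the database contents.
final class SyncPlan {
    final Set<String> toProcess;  // on S3 but not in the database
    final Set<String> toDelete;   // in the database but no longer on S3

    private SyncPlan(Set<String> toProcess, Set<String> toDelete) {
        this.toProcess = toProcess;
        this.toDelete = toDelete;
    }

    static SyncPlan diff(Set<String> s3Keys, Set<String> databaseKeys) {
        Set<String> toProcess = new TreeSet<>(s3Keys);
        toProcess.removeAll(databaseKeys);
        Set<String> toDelete = new TreeSet<>(databaseKeys);
        toDelete.removeAll(s3Keys);
        return new SyncPlan(toProcess, toDelete);
    }

    public static void main(String[] args) {
        Set<String> s3 = new HashSet<>(Arrays.asList(
                "2015/20150830_RNikula.mp3", "2015/20150802_JLehtola.mp3"));
        Set<String> database = new HashSet<>(Arrays.asList(
                "2015/20150802_JLehtola.mp3", "2015/20150712_CKumpula.mp3"));
        SyncPlan plan = diff(s3, database);
        System.out.println("process: " + plan.toProcess + ", delete: " + plan.toDelete);
    }
}
```

Because the diff ignores lastModified entirely, a file uploaded with an old timestamp is still picked up.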

Setup a collection mechanism for token requests

Create an electronic form to capture congregation requests for S3 credentials

Scan filename for minister and date when no mpeg frames are found

When no MPEG frames are found in the first snippet of a file, fall back to parsing the filename for the date and minister name.

Use tag for minister if not found in database

If the minister is not found in the database it is left blank. This happens for ministers from Finland who are not in the database.

existingSermon.minister = minister ? minister.naturalName : ''

rename application package

Description

If this is to be an LLC-wide application, we should rename the application package from com.rllc.spreadsheet to something more appropriate.

Suggestions

org.llc.spreadsheet
org.llc.archive
org.llc.webcast

Expose ids for Sermons and Congregations

The Problem

Using the REST API for traversing sermons and congregations is a bit treacherous without exposing IDs. When sermons are sorted a certain way and an index is used to pick the sermon, the rendered sermon may be incorrect.

The solution

Expose id values for Sermon and Congregation objects

sync amazon s3 bucket

Purpose

Currently the Amazon S3 command line tool is used to sync the local directory with the Amazon S3 bucket. While it is nice to rely on a prebuilt command line tool, this requires two applications to run. Consolidate syncing into some kind of awsSyncService inside spreadsheet-updater. Perhaps it could even be a wrapper around the s3 process?

May be OBE

If #9 is realized, the s3 command line tool will be the only tooling needed on congregational computers.

Order congregations by name

Congregations are currently sorted by the order in which they were inserted into the database.

Create logo

Add logo to web app.

Currently the favicon is the spring boot icon. Update to whatever logo is created.

Auto-Correct bible text

Situation

When webcasters are exporting audio, they hand type the bible text. This is error prone, leading to misspelled bible text and inconsistent abbreviations being displayed on the public facing archives. The spreadsheet-updater is downstream from the actual MP3 creation, as such it has no control over source data.

Proposed Solution

When parsing bible text from the MP3 tag, compare it to a master list of bible text; pick the bible text that is the most similar. There are various algorithms for finding string similarity. See Grails' implementation of CosineSimilarity for an example.

Bonus Points

Extra nice if the master list is maintainable by LLC

Concerns

This is more complicated than minister names, as abbreviations are in play. The ideal solution would create a mapping from all expected variants of each book name to the preferred convention, then use the mapping to resolve the true value.

Example Mapping

[
    'Matthew' : 'Matt.',
    'St Matthew' : 'Matt.',
    'Matt' : 'Matt.',
    'Ezekiel' : 'Ezek.',
    'Ezek' : 'Ezek.',
    'Ez' : 'Ezek.'   
]
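Applying such a mapping might look like the sketch below, which splits the text into a book name and a chapter/verse reference and replaces only the book name. The map entries are the ones from the example; the class name, the regular expression, and the rule of leaving unknown books untouched are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical normalizer that rewrites known book-name variants to a preferred form.
final class BibleBookNormalizer {
    private static final Map<String, String> PREFERRED = new HashMap<>();
    static {
        PREFERRED.put("matthew", "Matt.");
        PREFERRED.put("st matthew", "Matt.");
        PREFERRED.put("matt", "Matt.");
        PREFERRED.put("ezekiel", "Ezek.");
        PREFERRED.put("ezek", "Ezek.");
        PREFERRED.put("ez", "Ezek.");
    }

    // Book name (optionally starting with a digit, as in "1 Kings"), then the reference
    private static final Pattern TEXT = Pattern.compile("^(\\d?\\s*[A-Za-z ]+?)\\.?\\s+(\\d.*)$");

    /** Returns the text with a preferred book name, or the trimmed input if the book is unknown. */
    static String normalize(String bibleText) {
        String trimmed = bibleText.trim();
        Matcher matcher = TEXT.matcher(trimmed);
        if (!matcher.matches()) {
            return trimmed;
        }
        String preferred = PREFERRED.get(matcher.group(1).trim().toLowerCase(Locale.ROOT));
        return preferred == null ? trimmed : preferred + " " + matcher.group(2);
    }

    public static void main(String[] args) {
        System.out.println(normalize("St Matthew 5:1-12"));
        System.out.println(normalize("Ez 37:1-14"));
        System.out.println(normalize("1 Kings 19:9-18"));
    }
}
```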

Update to latest spring-boot

http://mvnrepository.com/artifact/org.springframework.boot/spring-boot-starter-web
