pariyatti / kosa

Digital library service

License: GNU Affero General Public License v3.0

Clojure 83.16% SCSS 4.84% Makefile 1.43% JavaScript 0.21% Shell 1.70% Jinja 0.13% Emacs Lisp 0.05% HCL 5.80% Python 2.11% Dockerfile 0.57%

kosa's People

Contributors

alokkhs, balwa, deobald, oxalorg, yudistrange

kosa's Issues

kutis.storage - ActiveStorage Equivalent

  1. wire up to image-handler
  2. NEXT: try "resources/storage" instead of "resources/public/uploads"
  3. get file root from config
  4. grab byte-size and checksum
  5. blake2b instead of clojure hash
  6. write high-level tests
  7. sanity the bytearray approach
  8. restrict filetypes — ActiveStorage doesn't do this (in-library) so we won't, either.
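
As a rough illustration of items 4 and 5 (byte size plus a BLAKE2b checksum), here is a minimal Python sketch. It is not the kutis.storage implementation; `describe_file` and its `chunk_size` parameter are hypothetical names chosen for this example.

```python
import hashlib
from pathlib import Path


def describe_file(path, chunk_size=64 * 1024):
    """Return the byte size and BLAKE2b hex digest of the file at `path`.

    Hypothetical helper: the name and return shape are assumptions,
    not part of kutis.storage.
    """
    digest = hashlib.blake2b()
    byte_size = 0
    with Path(path).open("rb") as f:
        # Read in chunks so a large upload is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            byte_size += len(chunk)
    return {"byte_size": byte_size, "checksum": digest.hexdigest()}
```

Unlike a language-level hash of an in-memory value, a content checksum like this stays stable across processes and runtimes, which is presumably the motivation for item 5.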

Edit Resources Tab

An Editor should be able to customize what shows up in the Resources tab. Carousels, categories, etc. should be easy to modify in an admin interface that roughly resembles what the user will see, in terms of layout. These are determined by metadata types like Topic, Collection, etc.

These will probably need to be broken out into their own tasks eventually:

  • Create a Topic
  • Create a Collection
  • Create an Audience

Standardize Languages

Current Thinking (2021 July 22)

  1. Language: ISO 639-3 (eng, hin, etc.)
  2. Script: ISO 15924 only for Chinese (hans vs. hant)
  3. Kosa: Construct a flat list of languages on the server side, but content is divided by text vs. audio/video. Non-Chinese text and audio happen to have the same keys. Chinese text content is either keyed as zho-hant or zho-hans, but nothing else. Audio content is keyed as cmn, yue, nan, or hak.
  4. Mobile App: Users select a language and the app constructs a "preferred languages list" behind the scenes. All languages have a backup language of English. Selecting a Chinese language presents the user with a script selection box as well, with options of Traditional or Simplified. The "preferred" language is always zho-hant or zho-hans and the second-most-preferred language will be the spoken Chinese language (of cmn, yue, nan, or hak). Chinese users also get a final backup of English.
  5. Thinking: This system allows Kosa to serve content with a flat language key for anything, greatly simplifying how it tracks languages and preventing a language tree from emerging anywhere in the API. The "preferred languages list" allows us to (a) back up everything with English content and (b) add flexible language preferences and new script options later, if required. The language-selection algorithm can be dumb-but-flexible, allowing us to avoid lookup trees entirely.
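
The "preferred languages list" described in items 4 and 5 can be sketched as follows. This is only an illustration in Python under the assumptions stated in the list above (English as the universal backup, script first for Chinese); the function names `preferred_languages` and `pick_content` are hypothetical and do not come from kosa or the mobile app.

```python
SPOKEN_CHINESE = {"cmn", "yue", "nan", "hak"}
CHINESE_SCRIPTS = {"hant", "hans"}
FALLBACK = "eng"


def preferred_languages(language, script=None):
    """Build the flat, ordered preference list for a user's selection."""
    if language in SPOKEN_CHINESE:
        if script not in CHINESE_SCRIPTS:
            raise ValueError("Chinese languages need a script: 'hant' or 'hans'")
        # Written script first, spoken language second.
        preferred = [f"zho-{script}", language]
    else:
        preferred = [language]
    if FALLBACK not in preferred:
        preferred.append(FALLBACK)  # every list ends with English
    return preferred


def pick_content(content_by_language, preferred):
    """Return the first (key, content) match, walking the flat list in order."""
    for key in preferred:
        if key in content_by_language:
            return key, content_by_language[key]
    return None
```

For example, `preferred_languages("yue", "hant")` yields `["zho-hant", "yue", "eng"]`, so text resolves to Traditional Chinese, audio resolves to Cantonese, and anything missing falls back to English without a lookup tree.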

Requirements

  • embed standardized language names in Dart and Ruby/Clojure using ICU libraries (?)
  • a minimum required set of languages include:
    • pali
    • english
    • espanol
    • italiano
    • simplified chinese
    • francais
    • portugues
    • srpsko-hrvatski (serbo-croatian)

The complete list of languages currently supported by Pariyatti:

It seems that ISO 639-3 (an extension of ISO 639-2) has reasonably comprehensive support:

My current thinking is ISO 639-3 + (optional) region specifier. Alternatively, some BCP 47 subset... but it's just so complicated.

Wikipedia uses a number of hacks to get around BCP 47 limitations:

Examples explaining why flattening Chinese languages won't work:

  1. Taiwan speaks cmn, nan, hak but always uses zho-hant
  2. Fujian / Guangdong (China) speak nan and hak but always use zho-hans

Chinese scripts can be decoded here:

https://www.chineseconverter.com/en/convert/find-out-if-simplified-or-traditional-chinese


Old notes from Asana:

1:

My first round of research turned up this:

A Language should have three fields: IANA code, English name ("Hindi"), Actual name ("हिंदी")

IANA tag registry is here: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Prefer tag combinations which are the nearest matches to the Gettext locale standard, wherever possible:

https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html#Locale-Names


2:

Ooooohhhkkkayyyyy. It looks like THIS is maybe the standard way to do this? At least according to friends at Wikipedia:

https://github.com/unicode-org/cldr/tree/release-37/common/main

  • This list is available through ICU libraries. This CLDR format also contains the language name equivalents (अंग्रेज़ी / English vs. Hindi / हिंदी vs every other possible combination).

3:

The canonical ICU webpage is here: http://site.icu-project.org/home

The Ruby library is listed here (gem icu): http://site.icu-project.org/related

There is a Dart package: https://pub.dev/packages/icu


4: (post-Asana)

Clojure: https://github.com/Vincit/satakieli (wraps ICU4J)
Java: http://site.icu-project.org (ICU4J)

Clojure + Crux

infra

  • initial setup? -- 👍
  • deployment story?
  • unit tests? -- @holtzermann17 will try spec generative tests
  • acceptance tests?

database

  1. Upgrade to 1.12: It's the new hotness.
  2. One-to-Many:
    • 1-N relationships for pali word & translations -- vector internal to the pali word doc
    • stacked inspiration -- (N-1) -- image
  3. Schema: crux-schema or spec for tabular entities for mobile app front-end API data
    • inserting (try it out for the "New Pali Word Card" screen): Manual validation works fine
    • schema / data migration
  4. Model Time: 3rd tier time model for ancient publication dates? (parallel to valid-time)
  5. Chained Relations: What do the queries look like over A->B->C->D ?
  6. Reify Relationship: graph-shaped documents with named (doc) relationships?
    • Answer: mostly no. Just use "vertex duck-typing" by creating meaningful ids on each entity.
  7. File Uploads:
    • what do db fields look like?
    • what happens when we switch disk storage? (local disk / S3 / cloudflare / etc)

code (non-database)

  1. Javascript:
    • install clr-icons from npm during deployment rather than vendoring it; otherwise we get source map errors
    • ClojureScript or vanilla JS?
  2. File Uploads:

rails parity

  1. how hard is it to migrate Rails ERB templates? -- not too bad. use html2hiccup
  2. dev vs. test environments? -- method mostly borrowed from nilenso approach (/config/config.test.edn etc)
  3. replace all html, css, controller behaviour, model behaviour -- chore for @deobald

fun stuff

  • enhance reitit route conflict handling. ... checked off, but I did not actually finish this since it's not of immediate value to the Pariyatti project. - @deobald

From ikitommi in #reitit clojurians slack:

the conflict resolver could be smarter and know that
[["/kikka/kikka"]
 ["/kikka/:id"]]
actually is not conflicting as the fixed term(s) comes first. PR welcome :)
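
To make the suggestion concrete, here is a small Python sketch of an order-aware conflict check. It is not reitit's algorithm or API. It assumes only `:name` single-segment wildcards (no catch-all parameters or bracket syntax), and every function name is hypothetical.

```python
def is_wildcard(segment):
    return segment.startswith(":")


def segments(path):
    return [s for s in path.split("/") if s]


def overlaps(a, b):
    """True if some request path could match both route templates."""
    sa, sb = segments(a), segments(b)
    if len(sa) != len(sb):
        return False
    return all(x == y or is_wildcard(x) or is_wildcard(y) for x, y in zip(sa, sb))


def shadows(earlier, later):
    """True if `earlier` matches every path that `later` matches."""
    se, sl = segments(earlier), segments(later)
    if len(se) != len(sl):
        return False
    return all(is_wildcard(x) or x == y for x, y in zip(se, sl))


def real_conflicts(routes):
    """Pairs (earlier, later) where the later route can never be reached."""
    found = []
    for i, earlier in enumerate(routes):
        for later in routes[i + 1:]:
            if overlaps(earlier, later) and shadows(earlier, later):
                found.append((earlier, later))
    return found
```

With this check, `["/kikka/kikka", "/kikka/:id"]` overlaps but reports no conflict because the fixed route comes first, while the reverse order reports the fixed route as unreachable.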

Why is Lightsail DNS resolution so slow?

On first Ansible deploy, the Terraform script fails because DNS hasn't propagated yet. It resolves for the Pariyatti nameserver and 8.8.8.8, though. Can we work around or figure out why resolution is taking so long?

Pali Word: Multiple Translations

An Editor should be able to add multiple translations to a Pali Word Card when creating one manually.

This should be a dynamically-sized list which matches the has_many relationship on PaliWordCard (in Rails). The Editor should only be able to add a translation for any one of the predefined Standard Languages.

  • Do not link PaliWordCard to Language -- copy the language code into PaliWordCard instead
  • Provide an "Add Another Translation" button which adds a language+translation to the list
  • "Language" should be a drop-down from the predefined Languages

Topic-of-the-Week Cards

An Editor should be able to create a Topic-of-the-Week card.

  • research what this entails (probably speak to Nishant and Renate)

Storage cleanup script

Write a script to purge old uploaded files which don't have a corresponding kutis.storage document in Crux.
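
A minimal sketch of such a script, in Python for illustration. It assumes the set of known file paths has already been fetched from the kutis.storage documents (that query is left out) and that paths are stored relative to the storage root. `find_orphans` and `purge_orphans` are hypothetical names.

```python
from pathlib import Path


def find_orphans(storage_root, known_paths):
    """List files under `storage_root` with no matching storage document.

    `known_paths` is assumed to hold paths relative to the storage root,
    as read from the database beforehand.
    """
    root = Path(storage_root)
    known = {Path(p).as_posix() for p in known_paths}
    return sorted(
        f.relative_to(root).as_posix()
        for f in root.rglob("*")
        if f.is_file() and f.relative_to(root).as_posix() not in known
    )


def purge_orphans(storage_root, known_paths, dry_run=True):
    """Delete orphaned files; by default only report what would be removed."""
    orphans = find_orphans(storage_root, known_paths)
    if not dry_run:
        for rel in orphans:
            (Path(storage_root) / rel).unlink()
    return orphans
```

Defaulting to `dry_run=True` means the first run only lists candidates, which seems a safer default for a script that deletes uploads.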

"looping" cards: mimic RSS feeds

  • permit the config file to set the exact start date/time of the loop so that the looping cards can mimic the RSS feeds precisely
  • allow the creation of retroactive "looped" cards so new server installs can have historical entries

Infinite Scroll JSON API

  • work on a new branch
  • implement infinite scroll (pagination)
  • change consumption on mobile app ... pagination is optional for now, so the mobile app has not changed yet
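
One common way to back infinite scroll is cursor-based pagination. The Python sketch below is only an assumption about how it could work, not the kosa API: the `published_at` field, the `before` cursor, and the response shape are all hypothetical.

```python
def page_of_cards(cards, limit=20, before=None):
    """Return one page of cards, newest first, plus a cursor for the next page.

    `published_at` values are assumed to be ISO-8601 UTC strings, which sort
    correctly as plain text. Cards sharing a timestamp would need a tie-breaker.
    """
    ordered = sorted(cards, key=lambda c: c["published_at"], reverse=True)
    if before is not None:
        ordered = [c for c in ordered if c["published_at"] < before]
    page = ordered[:limit]
    # Only hand out a cursor when more cards remain after this page.
    next_cursor = page[-1]["published_at"] if len(ordered) > limit else None
    return {"cards": page, "next": next_cursor}
```

A client that never sends `before` simply receives the newest page, which fits pagination staying optional for existing consumers.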

chores:

  • consider renaming :looped-words-of-buddha/audio-url to :looped-words-of-buddha/original-audio-url as a related breaking change

Upgrade production and sandbox to Ubuntu 22.04 LTS (Jammy)

A new LTS of Ubuntu was released on April 21st. By mid-May it should be safe to upgrade (create new Lightsail VMs and redeploy). As of 2023-04-09, our upgrade process probably looks like:

  1. backup our current 20.04 VMs
  2. stop + rename current VMs?
  3. create new Lightsail VMs with Ubuntu 22.04
  4. restore backups to new VMs
  5. redirect DNS

Reconcile Library

A Librarian or Editor should be able to "reconcile" the kosa library contents with the contents available on the https://pariyatti.org website. This will include www.pariyatti.org and store.pariyatti.org but possibly others.

  • discovery. should we:
    • use a site map from pariyatti.org? DotNetNuke has sitemaps
    • OR
    • crawl pariyatti.org (will be independent of CMS, hopefully)
  • provide user with a "start reconciliation" option or run on a weekly schedule
  • ingest and convert pariyatti.org web pages into kosa-shaped documents
  • diff:
    • additions / modifications (missing in kosa)
    • additions (missing on pariyatti.org) --- low priority
  • render a human-friendly diff in the web UI
    • each new / updated document becomes an operation the user can accept or reject from the web UI
    • provide an option to bulk-accept all changes
  • email Pariyatti staff a notification of the diff, requesting them to review proposed changes
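
The diff and accept/reject steps above could look roughly like this Python sketch. It assumes both sides have already been converted into kosa-shaped documents keyed by a stable identifier such as a URL. The operation names and both function names are hypothetical.

```python
def reconcile(site_docs, kosa_docs):
    """Compare two {key: document} mappings and list the proposed operations."""
    operations = []
    for key, doc in sorted(site_docs.items()):
        if key not in kosa_docs:
            operations.append({"op": "add", "key": key, "doc": doc})
        elif kosa_docs[key] != doc:
            operations.append({"op": "update", "key": key, "doc": doc})
    # Low priority in the list above: documents kosa has that the site lacks.
    for key in sorted(kosa_docs.keys() - site_docs.keys()):
        operations.append({"op": "missing-on-site", "key": key})
    return operations


def apply_accepted(kosa_docs, operations, accepted_keys):
    """Return a new mapping with only the accepted additions and updates applied."""
    result = dict(kosa_docs)
    for op in operations:
        if op["key"] in accepted_keys and op["op"] in ("add", "update"):
            result[op["key"]] = op["doc"]
    return result
```

Bulk-accept is then just passing every operation key as `accepted_keys`.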

Generic Kosa (Plugins)

Plugins: can pariyatti-specific domain entities live in their own .jar?

  • mounting routes?
  • db access? "namespaced", somehow?
  • localize db schema
  • views & handlers?

Kosa: production-readiness

Production-Readiness

  • add kosa.pariyatti.app for production
    • after file permissions change in Ansible, restart the service
    • copy authorized_keys from ubuntu user to kosa-user: provision with ubuntu, deploy and seed with kosa-user
    • add a DNS latch script to wait/retry until DNS resolves or fails (for 10 minutes)
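
A DNS latch of the kind described in the last item might look like the Python sketch below. The 10-minute limit comes from the item above; the 10-second retry interval, the function name `wait_for_dns`, and the injectable `resolve`, `sleep`, and `clock` parameters are assumptions made so the loop can be tested without real DNS.

```python
import socket
import time


def wait_for_dns(hostname, timeout_s=600, interval_s=10,
                 resolve=socket.gethostbyname, sleep=time.sleep,
                 clock=time.monotonic):
    """Retry DNS resolution until it succeeds or `timeout_s` seconds pass.

    Returns the resolved address, or raises TimeoutError.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if clock() + interval_s > deadline:
                raise TimeoutError(
                    f"{hostname} did not resolve within {timeout_s}s")
            sleep(interval_s)
```

A deploy script could call `wait_for_dns("kosa.pariyatti.app")` before handing over to Ansible and abort if it raises.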

Ingest TXT: Daily Words

  • parse TXT file
  • add "looped" entries in db:
    • parse basic text
    • bump loop index
    • download and attach audio
    • parse non-english translations
    • test all TXT files
    • extract generic code (Pali Word + Words of Buddha)
  • seed script to ingest TXT files
  • restrict MP3 downloads to "en" processing
  • daily job to republish each card in a loop
  • confirm API compatibility with mobile app
;; from pali_word/txt_job.clj —

;; 1. lookup word by `:words-of-buddha/words`
;;    (a) lookup translation for the current language
;;    (b) add translation if (a) fails

;; 2. add word if (1) fails
;;    (a) lookup largest "looped" index
;;    (b) add "largest + 1" as index
;;        ...but if largest = nil, index is 0
;;    (c) download and attach audio
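
The lookup-or-add steps in these notes can be sketched in Python as follows. This is an illustration only, not the txt_job.clj code: the in-memory list stands in for the database, the audio download in step 2(c) is left out, and `ingest_entry` and its field names are hypothetical.

```python
def ingest_entry(db, words, language, translation):
    """Apply steps 1 and 2 to `db`, a list of dicts standing in for the database.

    Mutates `db` in place and returns the affected entry.
    """
    # Step 1: look the entry up by its words.
    for entry in db:
        if entry["words"] == words:
            # 1(a)/(b): keep an existing translation, add it if missing.
            entry["translations"].setdefault(language, translation)
            return entry
    # Step 2: not found, so add it with index "largest + 1" (0 when empty).
    indexes = [e["index"] for e in db]
    entry = {
        "words": words,
        "index": max(indexes) + 1 if indexes else 0,
        "translations": {language: translation},
        "audio": None,  # step 2(c), download and attach, is not sketched here
    }
    db.append(entry)
    return entry
```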

TODO: sanity-check against Perl script

Scheduled Cards

An editor should be able to queue up a number of cards created in advance to self-publish on a schedule.

  • option to provide a future publish date
  • a scheduler shouldn't be necessary: API queries should simply start returning a card once its publish date has passed. Sanity-check that this still holds when this card is implemented, in case the mobile app's API queries change before then.
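
The "no scheduler" idea amounts to filtering on the publish date at query time. A minimal Python sketch, assuming cards carry a timezone-aware `published_at` value (a hypothetical field name) and that the query runs against an in-memory list rather than the real database:

```python
from datetime import datetime, timezone


def visible_cards(cards, now=None):
    """Return cards whose publish date has passed, newest first.

    Cards dated in the future stay in storage but are not returned,
    so they appear on their own once `now` reaches their date.
    """
    now = now or datetime.now(timezone.utc)
    visible = [c for c in cards if c["published_at"] <= now]
    return sorted(visible, key=lambda c: c["published_at"], reverse=True)
```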

Bootstrap: Sumukha

Steven Action Items:

  • Invite Sumukha as Triage to the kosa GitHub repo

Sumukha Research Action Items:

Sumukha Concrete Action Items:

Ingest TXT: Daily Doha

Refactoring

  • refactor: remove looped_xyz/db.clj duplication
  • refactor: remove xyz/db.clj duplication
  • refactor: remove publish_job.clj duplication
  • mark all DB tests as :db or :integration
  • remove duplicate publishing code
  • collapse q
  • nest / un-nest attachments automatically?

Parsing

  • parse TXT file
  • add "looped" entries in db:
    • parse basic text
    • bump loop index
    • download and attach audio
    • parse non-english translations
    • test all TXT files
    • restrict MP3 downloads to "en" processing
  • seed script to ingest TXT files

Publishing

  • daily job to republish each card in a loop
  • confirm API compatibility with mobile app

Pilgrimage Card

An Editor should be able to create a Pilgrimage card to announce news about Pariyatti-organized pilgrimages to sites in India, Nepal, and Myanmar.

  • research: talk to Brihas, Nishant, and Renate

Ingest RSS Feeds

  • sanity-check against old DWOB code (Swift and Java)
  • build a scheduler
  • schedule RSS downloads / parsing
  • alert (email [email protected]) on errors
  • parse RSS:
    • Pali Word
    • Words of Buddha
    • Daily Doha
  • create published cards in db:
    • Pali Word
    • Overlay Inspiration? (Words of Buddha)
    • Daily Doha
  • confirm API compatibility with mobile apps:
    • Pali Word
    • Words of Buddha
    • Daily Doha
  • kosa.mobile should add poller to kosa.library, not the other way around
  • store UTC everywhere
  • make jobs configurable (period, on/off, fn)
  • download and attach audio for Pali Word and Words of the Buddha cards

Implementation:

URLs

Recommended Reading Card

An Editor should be able to create a Recommended Reading card. Is this just a variation on a Staff Pick card? Do we really need both?

  • research: talk to Brihas, Nishant, Renate

Correct Size and/or Aspect Ratio

When an editor attaches an image to a card, kosa should take whatever actions are appropriate to make the image display attractively on all devices. This should include rejecting the image for being too small. Only oversized images can be downsized, not the other way around.
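
The accept / downsize / reject rule can be expressed as a small decision function. The Python sketch below is illustrative only: the minimum and maximum dimensions are made-up placeholders, `plan_image` is a hypothetical name, and extreme aspect ratios (where downsizing would push one side below the minimum) are not handled.

```python
# Placeholder limits: real values would come from the app's design requirements.
MIN_WIDTH, MIN_HEIGHT = 800, 600
MAX_WIDTH, MAX_HEIGHT = 2400, 1800


def plan_image(width, height):
    """Decide what to do with an uploaded image of the given pixel size.

    Returns ("reject", None), ("accept", (w, h)) or ("downsize", (w, h)).
    Images are never upscaled.
    """
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return ("reject", None)
    # Scale factor that fits both dimensions, capped at 1.0 (no upscaling).
    scale = min(MAX_WIDTH / width, MAX_HEIGHT / height, 1.0)
    if scale == 1.0:
        return ("accept", (width, height))
    return ("downsize", (round(width * scale), round(height * scale)))
```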

Bootstrap: Raghu

Steven Action Items:

  • Invite Raghu as Triage to the kosa GitHub repo

Raghu Research Action Items:

Raghu Concrete Action Items:

Setup AWS Lightsail

AWS Lightsail will be our production environment. We will deploy to Lightsail when we are ready to beta test with Pariyatti staff and internal (old meditator) users.

Production-Readiness

Note: these tasks have moved to #55

Initial Setup

  • @brihas to setup Pariyatti AWS Organization
  • @brihas give AWS permission to @deobald and @balwa to manage:
    • Lightsail boxes
    • Route53 (for DNS)
    • CloudFront (for CDN)

Create Terraform scripts

  • fix 500 errors
    • fix logging (see #kosa on discord)
    • resolve error for fresh kosa install
  • Terraform a $20 box so we don't run out of memory :)
  • delegate to Ansible for deployment:
    • run provision.yml
    • run deploy.yml
    • run seeds.yml (but only once)
    • Ansible should update all Ubuntu packages on deploy
    • Caddyfile should be conditional by environment to avoid certificate errors
  • documentation
    • add "install terraform" to readme
    • add "getting default secret key" to readme
    • add terraform state file to vault? (currently only in S3) or just explain in readme?
  • request AWS Support to bump Lightsail instances from 2 to 20

Secrets

  • figure out how to use .kdbx (if possible) within scripts ... or pull scripted secrets into Vault

Search

This is obviously an epic, but the Resources tab should provide a full-text search of the Pariyatti library and return meaningful results before any search feature is released.

  • ElasticSearch on Neo4j?
  • Built-in Lucene on Crux?

Build fails on a clean checkout

This appears to be a silly mistake we wouldn't have made if we had a CI box set up. The resources/storage directory doesn't exist by default. There is a second issue which is that we don't actually initialize kutis.storage... it was always initialized automatically in the dev environment but it's nil for make run.

Fix:

  • file copy should fail loudly, not silently
  • resources/storage should exist in a new clone
  • mount kutis.storage on startup
  • FileNotFoundException should render cleanly to an end user, not in JSON

Acceptance Tests

The search has now broken a few times (due to JSON serialization weirdness) without me noticing. The only way to test this properly is with a high-level acceptance test, of which we should really have a few anyway. Headless Selenium is probably the way to go here.

Ingest TXT: Pali Word

  • parse TXT file
  • add "looped" entries in db:
  • daily job to republish each card in a loop
  • why are there only 209 :looped-pali-words after ingest?
  • avoid duplicate publishing on the same day
  • count ( N / M ingested ) during ingestion
  • ensure empty audio URL
  • confirm API compatibility with mobile app
  • seed script to ingest TXT files
  • web UI to ingest TXT files? - not now

;; from pali_word/txt_job.clj — (old notes)

;; 1. lookup word by `:pali-word/pali`
;;    (a) lookup translation for the current language
;;    (b) add translation if (a) fails

;; 2. add word if (1) fails
;;    (a) lookup largest "looped" index
;;    (b) add "largest + 1" as index
;;        ...but if largest = nil, index is 0
;;    (c) download and attach audio

;; 3. daily job creates "published on today" cards at 00:00:00:
;;
;; (def looped-card-count 220)
;; (def days-since-epoch (t/days (t/between (t/epoch) (t/now))))
;; (def days-since-perl (- days-since-epoch 12902))
;; (def todays-word (mod days-since-perl looped-card-count))
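
For readers who want to check the arithmetic in step 3, here is the same calculation as a Python sketch. The constants 220 and 12902 are taken from the notes above; the function name `todays_index` is hypothetical.

```python
from datetime import date

EPOCH = date(1970, 1, 1)
PERL_OFFSET_DAYS = 12902  # "days-since-perl" offset from the notes above
LOOPED_CARD_COUNT = 220   # looped-card-count from the notes above


def todays_index(today, card_count=LOOPED_CARD_COUNT):
    """Index of the looped card to publish on `today`."""
    days_since_epoch = (today - EPOCH).days
    # Shift so the loop starts where the old Perl script started, then wrap.
    return (days_since_epoch - PERL_OFFSET_DAYS) % card_count
```

The index is 0 on the day that falls 12902 days after the Unix epoch and wraps back to 0 every 220 days after that.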

Staff Pick Cards

An Editor should be able to create a "Staff Pick" card for books, audiobooks, and videos.

  • Free (digital) book card
  • Paper book (bookstore / Amazon) card
  • Audiobook card
  • Video card

Migration Scripts

Schema-on-Write

  • implement all Datomic schema value types
  • implement rollbacks
  • :published-at => :type/published-at
  • :updated-at => :type/updated-at
  • :type => :kuti/type
  • switch all put to save!
    • ImageArtefact
    • Pali Word
    • Stacked Inspiration

Migration tool (joplin.crux):

  • try to build this as a generic library / tool
  • deal with data migration in addition to schema migration
  • publish to clojars.org

Consume tool:

  • match types based on type portion of Attribute? [entity :type/attr value]?
    • No.
  • keep a top-level :type key anyway?
    • Yes.
  • normalize :type/attr everywhere
  • normalize :modified-at / :updated-at
  • normalize :published-at
  • UUIDs with #uuid

Old notes:

Not sure why I assumed we needed to build this ourselves. There are two projects we could probably use quite happily:

https://github.com/macourtney/drift
https://github.com/juxt/joplin

Beautify

Let's clean up some messes we made when we started the Crux spike.

  • fix the resources macro test
  • cages.dispatch => kutis.dispatch
  • kosa.crux => kutis.record
  • kosa-crux => kosa ns
  • card CSS
  • font-family body CSS

Bootstrap: Ashok

Steven Action Items:

  • Ashok on Roster
  • Invite Ashok as Triage to the kosa GitHub repo

Ashok Research Action Items:

Ashok Concrete Action Items:

GitHub Actions CI / CD

We should configure some basic continuous integration with GitHub Actions. The Asana comment is a little out of date, as it would be relatively easy to move whatever we set up on GitHub to another provider anyway. This seems worth doing before our v1 release.

  1. Configure continuous integration -- run lein test
  2. Configure continuous deployment -- push to staging automatically?

Should we consider CircleCI instead? Circle can pull from multiple repos (in case that's ever something we want to do).


From Asana:

Doing this will tie us to GitHub since GitHub Actions don't conform to an open standard. Actions are free for open source projects, though. Let's consider this an extremely low priority until there are more developers working on kosa. For now, it's not hard for developers to run the build locally.

Remove XTDB ByteUtils hack

Not really a "hack" per se... but we're overriding the SHA1 algorithm using environment variables everywhere at the moment (throughout the Makefile) which is messy. It would be nice to apply this change universally — preferably somewhere other than lein / project.clj.

Rename all `master` branches to `main`

This is a meta-issue, since we should rename the primary (trunk) branch from master to main in all the repos:

  • agga
  • kosa
  • mobile-app
  • design
  • vault
  • Daily_emails_RSS
  • joplin.xtdb

`spec` for tests

  • what are the right seams to capture data formats in spec? only the boundaries like HTTP and Crux inserts? or elsewhere as well?
  • how much of the system do we test with each spec-enabled test? how easy is it to detect the root cause of a failed test?
  • are the tests easy to read?
