cjcodeproj / medialibrary


Python code to read XML media files

License: MIT License

Python 100.00%

xml movies dvds music vinyl-records audio-cds

medialibrary's Issues

Random sample output in POC tools

Most of the proof of concept tools should have a command line flag to output a random sample of the data instead of the entire list. This would allow for random spot checks that surface problems which might go unnoticed in a complete listing.

Each tool where it's suitable will have a new option.

--random N

Where N is a positive integer value not exceeding the maximum number of units in the normal list output.

The tools should still return sorted output with identical headers to normal output.
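A minimal sketch of how the sampling step could work, assuming the tool already holds its records in a list. The function name `sample_sorted` and its parameters are hypothetical and not part of the existing code base.

```python
import random


def sample_sorted(items, count=None, key=None, rng=random):
    """Return the items sorted; if count is given, sort a random sample instead."""
    items = list(items)
    if count is not None:
        # N must be a positive integer no larger than the full list.
        if count < 1 or count > len(items):
            raise ValueError(f"--random must be between 1 and {len(items)}")
        items = rng.sample(items, count)
    # Sorting after sampling keeps the output order consistent with normal runs.
    return sorted(items, key=key)


if __name__ == "__main__":
    titles = ["Zulu", "Alien", "Heat", "Brazil"]
    print(sample_sorted(titles, count=2))
```

Sampling first and sorting second means the header and ordering logic of each tool does not need to know whether a sample was taken.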

XML elements without text may cause an error when stripped

The following element sequence is fine:

<keywords>
 <generic> </generic>
</keywords>

This one is not:

<keywords>
 <generic></generic>
</keywords>

The keyword initialization code in media/data/media/contents/generic/keywords.py fails to catch the case where the element is truly empty. Ironically, it does catch the case where the text is only whitespace, and does not create a GenericKeyword object for it.

The code should catch this, and skip entries without text.

File involved:

https://github.com/cjcodeproj/medialibrary/blob/main/src/media/data/media/contents/generic/keywords.py
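The difference between the two cases can be sketched as follows, assuming the elements are parsed with the standard library `xml.etree.ElementTree` (the parser actually used in keywords.py is not stated in this ticket). The helper name `keyword_texts` is hypothetical.

```python
import xml.etree.ElementTree as ET


def keyword_texts(keywords_element):
    """Collect stripped text from child elements, skipping empty ones."""
    values = []
    for child in keywords_element:
        # For <generic></generic> the text attribute is None, not "",
        # so calling .strip() on it directly would raise AttributeError.
        text = (child.text or "").strip()
        if text:
            values.append(text)
    return values


if __name__ == "__main__":
    doc = ET.fromstring(
        "<keywords><generic></generic><generic> </generic>"
        "<generic> heist </generic></keywords>"
    )
    print(keyword_texts(doc))
```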

Exceptions for some missing data.

Some fields, like <movie><title></title> should not be empty.

Consider throwing an exception when some fields with empty values are encountered.

Or in other cases, prevent the object associated with the element from being created. For example, in a group of ten <keyword>/<generic> elements, if one of those element values is empty, it should just be ignored.

Title sort should have a language based exception for articles.

See this issue: cjcodeproj/vtmedia-schema#40

There are very rare cases where the article in a title should not be popped for the purpose of building the sort_string. But, those exceptions should be handled automatically by the appropriate code, and not something the data entry user needs to worry about.

Consider a rule where for any English title, if an article is discovered for the first word, but the second word is a known verb, then the popping operation should not take place.

Issue with Movie class when catalog data is incomplete.

Almost all data testing has been done using XML generated from the templates here.

https://github.com/cjcodeproj/vtmedia-schema/tree/main/templates

That's good for the data, but bad for the code base, since the code has rarely encountered XML where normally expected elements were missing.

Every Movie object has a friendly unique id value that is used to generate the hash value for object comparison, which is based on the title, copyright year, and an extra integer when needed.

#17

There needs to be a code fix to address cases when data is assumed to be present, but it is missing.

Simple tool for validating movie data

Create a simple POC tool to detect movie data that could be improved.

Ideally, this should be a full framework, but a proof of concept is acceptable for now.

It should catch common errors in data entry, and the framework should have a scoring system to weight the severity of an issue.

It should provide output of movies with common faults, and also allow for random sampling.

pydoc: documentation improvements

All purpose ticket for documentation improvements to code base.

Stronger input sanitation checking and exception handling

There needs to be stronger input validation/sanitation checks for handling XML element text, such as:

<primary>Adventure</primary>

The checking code should eliminate all leading and trailing whitespace characters, and maybe check for nonsensical characters as well. There should be at least one exception class that can be thrown in cases where the text value ends up being empty or composed of invalid characters.

Develop a common set of functions to handle the input sanitation and test it out in classes that rely heavily on parsing XML elements that only contain text.
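One possible shape for such a shared helper is sketched below. The names `clean_text`, `EmptyTextError`, and `InvalidTextError` are hypothetical, and treating "invalid" as "contains non-printable characters" is an assumption, since the ticket does not define what counts as nonsensical.

```python
class EmptyTextError(ValueError):
    """Raised when element text is missing or only whitespace."""


class InvalidTextError(ValueError):
    """Raised when element text contains non-printable characters."""


def clean_text(raw):
    """Strip surrounding whitespace and validate single-value element text."""
    # raw may be None when the XML element is completely empty.
    text = (raw or "").strip()
    if not text:
        raise EmptyTextError("element text is empty")
    # str.isprintable() is False for control characters, including
    # embedded newlines and tabs, so this suits short one-line values only.
    if not text.isprintable():
        raise InvalidTextError(f"element text has invalid characters: {text!r}")
    return text
```

Deriving both exceptions from `ValueError` lets callers catch either the specific case or the general one.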

The proper noun code is kind of sloppy

The ProperNoun code needs a standard interface structure to handle operations like returning the presentable string representation of the value, returning the casefold() representation of the value for sorting, and possibly a third value suitable for search operations.

Consider moving the CharacterName class to a different module, since it's not 100% suitable for this module (even though it would likely adhere to the same interface structure).

Consider making Name, Place into subclasses of Noun.

Name values should return a search value that is properly (family name) followed by (given name).

Also investigate the pylint too-many-branches error.

Unit tests

Start bringing unit tests into the fold.

Primary goal:

One suite of tests against most (if not all) classes.

Secondary goals:

Identify how testing parameters are supposed to be supported in setup.cfg, pyproject.toml, or something else suitable for a CI/CD build operation that is launched from an external tool. Python documentation in that sense seems to be lacking.

Organizational goal:

There are 27 .py files in this code base, not counting the __init__.py files. Finding some good documentation or examples of how tests are organized in similarly large projects would be helpful. I'm fine with a one-to-one ratio of one test module to one source module.

My primary reference is going to be the Python Packaging tutorial which suggests that tests should be in a separate directory tree from source code.

https://packaging.python.org/en/latest/tutorials/packaging-projects/

Outstanding questions:

For distributions of the module, what is the final directory structure of code and tests going to look like?

Command line tools should be a separate module.

The core function of the medialibrary module is reading the XML files in a repository and providing Python code representations of the data. The Proof Of Concept tools included with the module are written to show what can be made with the code base. They all provide barebones examples of how things work, and can be used as instructional guides for anyone using the code base.

If further improvements are made to the proof of concept tools, they should be directed to the effort of building a separate Python module designed to act as a command line interface client.

The code should act as a command line level tool, similar to other Python tools, where a front end program is installed in a bin/ path, accessible from a user shell, capable of handling command line requests.

$ media movies list --sort runtime
$ media movies cast list 
$ media devices list --type bluray

It will require a code base capable of handling a complex argparse command tree, and delegate the functionality to the correct module. It would also require thoughtful delegation between where the code goes in the medialibrary module, and the command line module, as well as a coordinated release effort in order to keep dependencies in check.
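The command tree for the three example invocations could be sketched with nested argparse subparsers, as below. This is only an illustration of the structure: the `dest` names and the `build_parser` function are hypothetical, and no delegation to real medialibrary modules is shown.

```python
import argparse


def build_parser():
    """Build a nested command tree: media <area> <command> [<subcommand>]."""
    parser = argparse.ArgumentParser(prog="media")
    areas = parser.add_subparsers(dest="area", required=True)

    # media movies list --sort runtime
    movies = areas.add_parser("movies")
    movie_cmds = movies.add_subparsers(dest="command", required=True)
    movie_list = movie_cmds.add_parser("list")
    movie_list.add_argument("--sort", default="title")

    # media movies cast list
    cast = movie_cmds.add_parser("cast")
    cast_cmds = cast.add_subparsers(dest="subcommand", required=True)
    cast_cmds.add_parser("list")

    # media devices list --type bluray
    devices = areas.add_parser("devices")
    device_cmds = devices.add_subparsers(dest="command", required=True)
    device_list = device_cmds.add_parser("list")
    device_list.add_argument("--type", dest="device_type")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

The parsed namespace carries the path through the tree (`area`, `command`, `subcommand`), which a dispatcher could map to the module that implements each command.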

Code styling updates (PEP8, PEP257)

Right now I'm using pycodestyle and pylint to enforce style consistency across the code-base, with minor exceptions declared at the source file level.

I should look into additional tools like pydocstyle, flake8, and maybe black to do more style checking. I should also look into pyproject.toml integrations, because when the project started the configurations were pretty lacking.

Caveats:

I don't want automatic corrections, so black is probably out.
These will be gradual changes across the code-base, so this ticket won't be closed immediately.

Command line tools should have an internal call option for Python interactive mode.

All of the command line tools should be slightly refactored to be callable from the Python CLI, as closely as possible to how it would be called from the shell.

For example, this call:

python -m media.tools.movies.list --sort runtime

Should be usable in the Python shell as:

>>> media.tools.movies.list.list(repo_object, sort='runtime')

This will take a little bit of investigation and testing.

Every command line option flag available should be replicated in interactive mode.

Documentation should be enhanced to show calling invocation of the functions, both external documentation and pydoc.

Create technical/runtime code

Create code to handle the <technical>/<runtime> element for movies/visual works of art.

Update POC tools to report on the duration of a film, and possibly sort titles by duration.

Address leading/trailing whitespace in data

Most fields should not have leading or trailing whitespace. The python code should account for this when extracting data from the XML elements.

Title (and other information) sorting

Proof of concept tools, like listmovies and showmovies should sort movie titles on output.

Consideration should be made for sorting rules based on language (i.e., removing first-word definite articles in English titles).

Title sorting should also take into consideration that the full unique key for a title should be (Title + Year + Optional Unique Incremental Value), for the rare cases that two movies could have the same title. In that case, the year should determine which one goes first.

The Catalog object should probably contain another object that is specifically a modified casefold string that is the value used for sorting comparisons, containing all 3 values. It should be sortable through the normal Python sort functions.

I'm not sure if that means movie objects should be directly comparable through the sort() function.

Sorting routines should also be considered for data, such as <subgenres> for improved readability since there is no implied ranking between multiple entries; but that can be deferred for a later ticket.

Every content object should have a unique string value based on its basic information.

Every content element contains data that allows for the creation of a unique key string that can be used to differentiate one work of art from another.

For movies, that is assumed to be a string created by combining these elements:

movie_title : (copyright_year/creation_year) : optional_incrementor_integer

That same string formula should probably work as well for other works of art. There may be a need to add a qualifier regarding the type of art. For example, a book and movie with the same title could conceivably be released in the same year.

Whatever the value is, it needs to be accessible from the main object. IE, an instance of a media.data.media.contents.movie.Movie object needs a simple attribute or getter method to return the value.
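A rough sketch of the key as a simple attribute, assuming the three parts are joined with " : " and the incrementor is omitted when it is zero. The class name `CatalogEntry` and both property names are hypothetical, and this is not the actual Movie class.

```python
class CatalogEntry:
    """Illustrative stand-in for an object holding title, year, and incrementor."""

    def __init__(self, title, year, index=0):
        self.title = title
        self.year = year
        self.index = index

    @property
    def unique_key(self):
        """Human-readable key: title : year [: incrementor]."""
        parts = [self.title, str(self.year)]
        if self.index:
            parts.append(str(self.index))
        return " : ".join(parts)

    @property
    def sort_key(self):
        """Tuple used for ordering: casefolded title, then year, then incrementor."""
        return (self.title.casefold(), self.year, self.index)


if __name__ == "__main__":
    entries = [CatalogEntry("Same Title", 1999), CatalogEntry("Same Title", 1963)]
    for entry in sorted(entries, key=lambda e: e.sort_key):
        print(entry.unique_key)
```

Keeping the year as an integer inside the sort tuple means two works with the same title are ordered by year numerically rather than as text.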

Related issues/issue comments:

cjcodeproj/vtmedia-schema#9
#25 (comment)

Reevaluate file loading code.

The Python code that walks the filesystem is somewhat robust, but it's a little backwards in the sense of OOP practices.

The primary flow of all operations is:

Identify all files in a directory structure, matching a suitable filename pattern.
Iterate through all filenames, load each file with the XML parser, and then return a Media object for everything matching the XML criteria.
Iterate through the list of Media objects.

For situations where we want to extend the POC tools so they can be easily usable within the Python shell, I want to consider this logical flow.

A pool list object containing nothing but Media objects in memory.
Repository objects which point to paths where XML files reside.
Implementor scans the repository using a suitable filename match.
The repository object returns the list of files.
The loader object (or maybe the repository object?) reads the files, creates the Media objects, and puts them in the pool object.

Anyone should be able to perform this functionality from the Python shell where they have a persistent pool object that can be populated by reading one or more directory paths.

Something along the lines of...

>>> x = Repo(pathA)
>>> p = []
>>> fileset = x.identify(movie_files_regex_pattern)
>>> p.extend(loader_object.scan_files(fileset))
>>> p[0]
<media.data.media.contents.movie.internal.Movie object at 0xfffffc7fedb6fbe0>

Additional notes (based on the example above):

The p object could be either a regular Python list, or even a custom object that's iterable. It shouldn't matter.
The p object in this case, would store all of the Media device objects.
The current code that walks the directory tree is pretty good, with a decent feature set, but not too complex. Consider a model where the Repo object doesn't do the scanning directly, but passes the job onto a delegate object.
Filename components don't matter to the internal data, so the current method of assuming anything with '-dvd' in the filename is a DVD is kind of sloppy. However, there should be future tools that can scan the pool and legitimately filter out objects by things like device type or other characteristics.
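The proposed flow can be sketched with two small classes. Everything here is hypothetical: the `Repo` and `Loader` names follow the shell example above, and the `factory` callable stands in for whatever code would parse a file into a Media object.

```python
import re
from pathlib import Path


class Repo:
    """Points at a directory tree and reports which files match a pattern."""

    def __init__(self, path):
        self.path = Path(path)

    def identify(self, pattern):
        """Return sorted paths whose filename matches the regex pattern."""
        regex = re.compile(pattern)
        return sorted(
            p for p in self.path.rglob("*")
            if p.is_file() and regex.search(p.name)
        )


class Loader:
    """Turns file paths into objects using a caller-supplied factory."""

    def __init__(self, factory):
        # factory is a placeholder for the real XML-to-Media parsing step.
        self.factory = factory

    def scan_files(self, fileset):
        return [self.factory(path) for path in fileset]


if __name__ == "__main__":
    pool = []  # a plain list works as the pool
    repo = Repo(".")
    pool.extend(Loader(lambda p: p.name).scan_files(repo.identify(r"\.xml$")))
    print(len(pool))
```

Because the pool is just a list, it can be extended repeatedly from several Repo objects within one interactive session.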

Handling pylint R0801 (duplicate-code) error

Figure out a global solution to wipe out the pylint R0801 error.

Per-file disable statements probably aren't feasible.

Implement a solution either in pyproject.toml or a project scoped pylintrc file.

Handle description element for non-fiction films

Fiction films have this element structure:

/movie/story/plot

Non-fiction films have this element structure:

/movie/description/overview

Fix the story code so it can handle non-fiction movies temporarily. There should be a better fix to handle the tags in case the applications for fiction movies and non-fiction movies diverge.

Format specifiers for some data types

Some data types would benefit from having format specifiers that can change the output based on what data the caller wants.
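Python supports this through the `__format__` method, which `format()` and f-strings both call. The sketch below is illustrative only: the `Runtime` class and the "hm" specifier are hypothetical and not taken from the code base.

```python
class Runtime:
    """Illustrative duration value measured in whole minutes."""

    def __init__(self, minutes):
        self.minutes = minutes

    def __format__(self, spec):
        # Hypothetical custom specifier: hours and minutes.
        if spec == "hm":
            hours, mins = divmod(self.minutes, 60)
            return f"{hours}h {mins:02d}m"
        # Anything else falls back to normal integer formatting.
        return format(self.minutes, spec)


if __name__ == "__main__":
    runtime = Runtime(135)
    print(f"{runtime:hm}")
    print(f"{runtime:>5}")
```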

Proper Noun name sorting mixes up when there is no family name.

The name sorting code generates a value like this.

{family}_{given}

But if there is no family name, it generates.

_{given}

This puts given-name-only monikers at the top of the sorted list every time. The code should be adjusted to remove the leading underscore.
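One way to avoid the leading underscore is to join only the parts that are present. A minimal sketch, with the hypothetical function name `name_sort_key`:

```python
def name_sort_key(given, family=None):
    """Build a casefolded sort value, family name first when there is one."""
    parts = [
        part.strip().casefold()
        for part in (family, given)
        if part and part.strip()
    ]
    # Joining only the non-empty parts never produces a leading underscore.
    return "_".join(parts)


if __name__ == "__main__":
    print(name_sort_key("Bob", "Aliceton"))
    print(name_sort_key("Bob"))
```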

Implement code for pulling movie/crew/cast data

Implement code for pulling data on cast members for all visual mediums.

Update code for Character Names

The CharacterName class should be updated to handle the following.

Recognizing element order for handling structures that include nicknames.
Better class structure for non-titled roles like 'self' or 'narrator'.
Add test POC tool to display actor roles
Provide ways to present both full name, and formal expression of full name ("Bob Aliceton" vs "Professor Bob Aliceton")

Basic code for handling movie classifications

Build implementation code for handling movie classifications, including categories, genres, subgenres.

Skip for now: subjects (non-fiction movies), certifications (movie ratings)

Improve proof of concept tool namelist

Expand the proof of concept tool media.tools.movies.namelist to output movie titles associated with job roles.

Benefits: More work with objects and relations with objects (Title object improvements tied to sorting).

Note: Build up better classes for crew members to deal with the XML attributes.

The keyword code is kind of sloppy

Individual keyword elements in the schema fall under two classes: GenericKeyword(), and ProperNounKeyword().

Both have many identical operations and internal values, so consider creating an abstract parent class they can both inherit from.

Also, the primary sorting routines for keywords should still take the relevance value into account; however, there should be a common method in the parent abstract class for retrieving the raw string value of the keyword.

Slight flaw with title search code

The search code makes concessions for movie titles that start with articles by moving the article to the end of the title, and doing a Python casefold operation.

A title like "The Courier" becomes "courier_the".

Unfortunately, by keeping the article, these two titles now appear out of order:

The Rescuers (modified to "rescuers_the")
The Rescuers Down Under (modified to "rescuers_down_under_the")

The solution is one of two possibilities:

Drop the article instead of appending it.
Prefix the article with a second underscore, which would fix the order.

Also consider moving the title search code to a separate function, which may make dealing with other languages easier in the future, and allow for generating sort values against other data types, like keywords.
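Both possibilities can be sketched in one standalone function. The name `title_sort_key`, the `keep_article` flag, and the English article list are assumptions for illustration; the existing search code is not shown in this ticket.

```python
ARTICLES = ("the", "an", "a")  # English only; other languages need their own lists


def title_sort_key(title, keep_article=False):
    """Return a casefolded sort value with any leading article handled."""
    words = title.casefold().split()
    article = None
    if len(words) > 1 and words[0] in ARTICLES:
        article = words.pop(0)
    key = "_".join(words)
    if keep_article and article:
        # Option 2: a double underscore sorts before any letter,
        # so the shorter title still comes first.
        key = f"{key}__{article}"
    # Option 1 (default): the article is simply dropped.
    return key


if __name__ == "__main__":
    titles = ["The Rescuers Down Under", "The Rescuers", "The Courier"]
    print(sorted(titles, key=title_sort_key))
```

With either option "The Rescuers" sorts ahead of "The Rescuers Down Under", because the underscore character compares lower than lowercase letters.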

File loader ignored Blu-Ray 3d movies

The file loader code ignores media device filenames for "3D Blu-Ray" movies, which should be "bluray3d".

Provide breakdown of film genres

Provide a proof of concept tool that reports on all films organized by genre. Focus on primary genre, but include reports on secondary genre breakdown. Include a count breakdown of all films in a repo for comparison purposes, and output random sample titles from the library for illustrative purposes.

Search Tools: Starting Framework and Title Search

Put together a framework for search tools.

Python module path will most likely be: media.tools.movies.search.title

Title search should cover the primary title of the feature, as well as the variant title, or other titles.

Tool should be able to output matching results in a list format, and also provide a detail option to present a full text record, identical to the output from media.tools.movies.show. It should also provide summary information, detailing the number of matches out of the total number of records.

Fix too-many-branches pylint error in nouns.py

Address the issue of the pylint warning "too-many-branches" found here.

medialibrary/src/media/data/nouns.py

Line 51 in d043ddc

def _build_string(self):

There is probably a better way to construct a proper noun string using reusable code.

Improvements to POC tools

Make the following improvements to the Proof Of Concept tools.

Eliminate the duplicate output of movies for situations where a movie exists on multiple media, like both a DVD and a Blu-ray Disc.
Implement a --random command line flag to output a small subset of movies, for the purpose of spot checking and quality control.

Depends on completion of #17

FilenameMatches regex pattern is incorrect

The regex pattern should accept an optional numerical digit after the device portion of the filename, for example.

g/gre/the_great_escape-1963-bluray-1.xml

Unfortunately, the pattern for the regex is incorrect (\n instead of \d).
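The actual FilenameMatches pattern is not reproduced in this ticket, so the expression below is a hypothetical stand-in inferred from the single example filename (title, year, device, optional copy number). It shows the optional trailing digit group written with \d.

```python
import re

# Hypothetical pattern: <title>-<year>-<device>[-<number>].xml
FILENAME_PATTERN = re.compile(
    r"^(?P<stem>.+)-(?P<year>\d{4})-(?P<device>[a-z0-9]+)(?:-(?P<copy>\d+))?\.xml$"
)


def parse_filename(name):
    """Return the named parts of a media filename, or None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_filename("the_great_escape-1963-bluray-1.xml"))
    print(parse_filename("the_great_escape-1963-bluray.xml"))
```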

Need code handling for other content elements.

Besides the <movie> element, the schema also supports the <test> element, which has been ignored in the code base for now. It should be supported.

Also, a new element named <essay> is being tested in a code branch right now. It should be supported in the code base.

cjcodeproj/vtmedia-schema#28

There should also be a POC tool that can catalog all of the different types of content.

Investigate extensibility with extensions to the schema.

Investigate ways to integrate code with the library when other parties decide to extend the media schema.

If anyone adds elements to the schema, there should be a code mechanism that would allow them to use this code base to still read the custom data they added.

POC tool to graph time distribution

Create a proof of concept tool to report on time distribution of all movies in the repository.

It should take every movie runtime, sort the values, and break them up into distribution buckets to identify the range of runtimes in the collection.

It should also report the mean (average) runtime and the median runtime.
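The bucketing step can be sketched with the standard library. The 30-minute bucket width and the function name `runtime_buckets` are assumptions, and the median is included alongside the mean as an extra statistic that is often useful for skewed runtime distributions.

```python
from collections import Counter
from statistics import mean, median


def runtime_buckets(minutes, width=30):
    """Count runtimes per fixed-width bucket, keyed by a 'start-end' label."""
    counts = Counter((value // width) * width for value in minutes)
    return {
        f"{start}-{start + width - 1}": counts[start]
        for start in sorted(counts)
    }


if __name__ == "__main__":
    runtimes = [88, 95, 102, 118, 121, 172]  # made-up sample values in minutes
    for label, count in runtime_buckets(runtimes).items():
        print(f"{label:>8} {'#' * count}")
    print("mean:", mean(runtimes), "median:", median(runtimes))
```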

Handle bad keyword values with a Python exception.

Working on a previous bug (see below) I realized the creation of keyword objects should be handled better.

Right now 50% of the processing is handled by the containing Keywords object, when it should probably be handled directly by the GenericKeyword and ProperNounKeyword object classes. The reason this was done was to catch situations where those elements may have improper values; based on the fact that initializing an object always returns an object, even if the data in the object is bad. (IE, it's not possible to return None).

But the better pattern is to raise a ValueError whenever the instantiation of an object goes bad, and then have the outside code catch the exception and work around it.
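A minimal sketch of that pattern follows. The class below is a simplified stand-in that only borrows the GenericKeyword name; its constructor signature and the `build_keywords` helper are hypothetical.

```python
class GenericKeyword:
    """Simplified stand-in: validates its own text at construction time."""

    def __init__(self, raw_text):
        value = (raw_text or "").strip()
        if not value:
            # The object refuses to exist with bad data.
            raise ValueError("keyword text is empty")
        self.value = value


def build_keywords(raw_values):
    """Container-side code: create what it can and skip bad entries."""
    keywords = []
    for raw in raw_values:
        try:
            keywords.append(GenericKeyword(raw))
        except ValueError:
            continue  # ignore the bad element and keep going
    return keywords


if __name__ == "__main__":
    print([k.value for k in build_keywords(["heist", None, "  ", " train "])])
```

This moves all validation into the keyword class, so the containing object no longer needs to inspect the element text before constructing anything.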

Related bug: #45

Investigate delegation pattern for code extensibility.

One goal of the medialibrary module is usability and extensibility.

This ticket is to take a look at the code from a higher level and investigate delegation patterns, and find suitable entry points.

Allow implementors to set delegate objects that can interact with the code base at certain points, similar to the Objective-C delegate pattern. Likely points would be whenever a new Media object is created, or a new Content object within a Media object is created.

Not sure what the pattern would be, and Python does not yet have support for something like Obj-C protocols that can be used to ensure delegate code works properly.

More metadata in the pyproject.toml

See what additional information can be put into the pyproject.toml file that won't risk breaking the build.

The information provided should be enough to give more context when the module is uploaded to PyPI: homepages, repo pages, links to the CHANGELOG, additional documentation, and so on.

At the same time, there should be testing to see whether or not information should be pulled from setup.cfg, or whether it should stay intact.

These will be the two reference sources of data for the change:
https://peps.python.org/pep-0621/
https://packaging.python.org/en/latest/specifications/declaring-project-metadata/

Movie list tool should have more sort/grouping options.

Should be able to sort films by copyright year.

Code should have a common grouping/buckets mechanism for pooling the list into smaller sets.

Sort options should also allow for sorting by year.

Grouping options should allow for grouping by first title letter, decade, runtime, or genre.

Random sampling will still be supported.

Operational order would be:

Load all data
Create random sample of data if needed
Group data based on desired trait
Sort data within groups
Output data
Output stats
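The sample, group, and sort steps could be sketched as one function. The name `group_and_sort` and its parameters are hypothetical, and movies are represented as plain (title, year) tuples for brevity rather than real Movie objects.

```python
import random
from collections import defaultdict


def group_and_sort(movies, group_key, sort_key, sample_size=None, rng=random):
    """Optionally sample, then group, then sort within each group."""
    pool = list(movies)
    if sample_size is not None:
        pool = rng.sample(pool, sample_size)
    groups = defaultdict(list)
    for movie in pool:
        groups[group_key(movie)].append(movie)
    # Groups are emitted in sorted order, each with its members sorted.
    return {name: sorted(groups[name], key=sort_key) for name in sorted(groups)}


if __name__ == "__main__":
    movies = [("Heat", 1995), ("Alien", 1979), ("Brazil", 1985), ("Aliens", 1986)]
    by_decade = group_and_sort(
        movies,
        group_key=lambda m: m[1] // 10 * 10,
        sort_key=lambda m: m[0].casefold(),
    )
    for decade, members in by_decade.items():
        print(decade, [title for title, _ in members])
```

Passing the grouping and sorting rules as callables keeps one mechanism usable for first letter, decade, runtime, or genre.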

Implement proof of concept code for pulling names from film data

Implement proof of concept code that can pull all associated names from the movie data.

List POC tools should support CSV output.

The proof of concept tools that output a single record per line should support CSV output to make data usage easier.
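The standard library `csv` module handles quoting of values that contain commas, which matters for movie titles. A small sketch, with the hypothetical helper name `write_csv` and made-up row data:

```python
import csv
import io


def write_csv(rows, header, stream):
    """Write a header line and one record per row to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(header)
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv([("Example, The Sequel", 1999, 121)], ["title", "year", "runtime"], buffer)
    print(buffer.getvalue(), end="")
```

When writing to a real file, the csv documentation recommends opening it with `newline=""` so line endings are not translated twice.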

Enhancement to repository format.

The vtmedia-schema specifies a format for the data, but not a format for how the data is stored. Technically, the XML could be stored directly in an SQL database, if desired. The repo structure is a system used by the medialibrary for the purpose of finding all of the XML files in a directory structure that will not get overloaded (too many files in a single directory). There is one sample of the directory structure in the vtmedia-schema repository.

The repo code takes a directory path, and searches recursively. It's easy to manage, easy to search, usable with other tools (like 'grep -R'), easy to archive and replicate ('tar'/'rsync'), and easy to do version control ('git', 'hg', or 'svn').

The only downside is the repository code loads every single file for operations ('media.tools.movies.list', 'media.tools.movies.keywordlist'). It's still fast, but it's probably going to get inefficient someday if a repository has sufficient growth.

There are enhancement tickets to make the repository more usable, with tools that allow for searching. In order for those tools to be usable, the repo format should probably change.

Enhancement ticket:

medialibrary-31 Search Tools

A repo should be considered a single directory endpoint.

repo1

Underneath that path should be a new directory structure:

repo1
repo1/data
repo1/data/movies (optional?)
repo1/data/test
repo1/index/
repo1/index/... (to be determined)
repo1/cache/
repo1/cache/... (to be determined)

End user would have a MEDIAPATH variable value of MEDIAPATH=repo1/

All raw XML source files would be kept under the data directory in a path structure of the end user's own choosing. The index and/or cache structures are filled with supplemental data that is built on the fly through usage of the metadata tools. It could contain a title index, a keyword index, or anything that makes search operations faster.

The files under index or cache could be archived, but they should not be recorded in a version control history due to the possibility of a high rate of change. The files should be considered 'build products' that don't need to be preserved (similar to a C project object file .o). All tools should operate identically whether or not files are kept in the index/cache directories.

Consider f-strings for output

Previous pylint runs were done using version 2.10.2. Subsequent testing using version 2.11.1 introduced a new rule.

C0209: Formatting a regular string which could be a f-string (consider-using-f-string)

This ticket is a placeholder to see how much effort it would take to replace format strings with f-strings. Considerations should be made for object classes with format methods defined.

Setup object hashes

Complex objects, like "movie" should have hash values to identify uniqueness, so those objects can be used as keys in dictionaries.

The hash should be computed based on hash values of child objects, like title and catalog information. Just enough to guarantee that two different objects will not be the same.

A movie object generated with identical data should create an identical hash value; however, the internal code to generate the hash can be subject to change. That is, the hash guarantees uniqueness at runtime, but is not suitable to use as a permanent unchangeable unique id.
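In Python this usually means defining `__eq__` and `__hash__` together from the same identifying values. The sketch below uses a hypothetical `MovieSketch` class with title, year, and incrementor as the identity; it is not the real Movie implementation.

```python
class MovieSketch:
    """Illustrative hashable object identified by title, year, and incrementor."""

    def __init__(self, title, year, index=0):
        self.title = title
        self.year = year
        self.index = index

    def _identity(self):
        # Built from child values; the exact composition may change over time.
        return (self.title, self.year, self.index)

    def __eq__(self, other):
        if not isinstance(other, MovieSketch):
            return NotImplemented
        return self._identity() == other._identity()

    def __hash__(self):
        # Equal objects must produce equal hashes.
        return hash(self._identity())


if __name__ == "__main__":
    formats = {MovieSketch("The Great Escape", 1963): ["bluray"]}
    print(formats[MovieSketch("The Great Escape", 1963)])
```

Python randomizes string hashing per process by default, so these hash values differ between runs, which matches the point that the hash is a runtime aid and not a permanent id.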

Rein in the README.md file

The README.md file needs to be scaled back a bit to focus on the details of what the module is, similar to other modules on PyPI. The example information can be moved to a separate file, or consider a better documentation system for the command line tools.

First distribution test

First build and distribution test.

Code to read physical media data

Currently, there is no code to report on the physical media data; only code to read movies.

There is a need for data structures to represent the physical aspect of the data, like Blu-Rays, UltraHD disks, and so on. At the very least, the barebones objects need to be created, with an abstract class to handle the common aspects of the media, with subclasses where appropriate. There should also be an instance object to track specific copies.

A barebones reporting tool should be created under media.tools that can list every physical media device in a repository. It should at least report on type of device, number of instances, and the media contents of the device.

Command line tools should have stats options

All of the command line tools should include some statistical output, which may or may not be customized to the application at hand.

At the very least, they should report on the number of movies/media objects processed, and any additional data that may be derived from it. For the tools that support the --random option flag, it should report on what percentage of the total data set was output.

For example, media.tools.movies.castlist could report statistics on the total number of actors and maybe the average number of roles each actor had in the output.

Also consider timing metrics, such as repository load time, and total program run time.
