Improve migration from v1 to v2 (ytcc, closed)

woefe commented on August 26, 2024
Improve migration from v1 to v2

from ytcc.

Comments (13)

EmRowlands commented on August 26, 2024

I am currently working on a pair of scripts that will perform the migration in a somewhat automated manner. They will carry over all videos, whether they have been watched, and anything else that is in the version 1 database. The first script is run with version 1 installed and the second with v2, because Python cannot import two modules with the same name at the same time.
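A minimal sketch of the two-stage idea, assuming an intermediate JSON file is used to hand data from the first run to the second. The function names, the record fields (`yt_video_id`, `title`, `watched`) and the file layout are hypothetical and not taken from either ytcc version.

```python
import json
from pathlib import Path
from typing import Dict, List


def export_videos(videos: List[Dict], dump_path: Path) -> None:
    """Stage 1 (would run with v1 installed): write plain records to a JSON file."""
    dump_path.write_text(json.dumps(videos, indent=2), encoding="utf-8")


def import_videos(dump_path: Path) -> List[Dict]:
    """Stage 2 (would run with v2 installed): read the records back."""
    return json.loads(dump_path.read_text(encoding="utf-8"))
```

Because the two stages only share a file, neither process ever needs both versions of the module loaded at once.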

woefe commented on August 26, 2024

First approach depends on #43

woefe commented on August 26, 2024

First approach has been implemented in 1da677b and 0d03bac

woefe commented on August 26, 2024

I'm going to leave this issue open for now. Hoping to get more feedback.

EmRowlands commented on August 26, 2024

I'm currently trying to work out how to calculate the extractor_hash from a video URL, but I'm not getting very far. Could you offer some help with this? I think it is the only thing holding back the entire script.

woefe commented on August 26, 2024

@EmRowlands Awesome!!

The extractor_hash is calculated as a sha256 of youtube-dl's (unprocessed) information extractor output.

Basically pseudocode for one item:

sha256(YoutubeDL(...).extract_info(..., process=False).entries[0])

Relevant lines:

def extractor_hash(data: Dict[str, str]) -> str:

e_hash = extractor_hash(entry)

woefe commented on August 26, 2024

An important detail here is that the hashed entry comes from a playlist, not from the video page itself. I'm not sure whether we can easily reconstruct it from a yt_video_id, which is probably what we would need when converting from v1.

EmRowlands commented on August 26, 2024

I knew how it was being created; I just couldn't reproduce it because I didn't have access to the playlist data. I'm also not sure this way of generating the hash makes sense, since a video that is in multiple playlists will have multiple extractor_hashes (unless this is intentional). I considered suggesting the same method applied to the extractor info for the specific video instead, but that would require fetching the info with youtube-dl for every single video that is being imported.

Perhaps it would be better to use something like the format provided by --download-archive, which provides strings that look like this:

youtube dQw4w9WgXcQ

Where the first part is the name of the extractor, and the second is a site-specific string that uniquely identifies a video.
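To make the format concrete, here is a small sketch that splits such an archive line into its two parts. The `ArchiveEntry` type and `parse_archive_line` function are hypothetical helpers written for illustration; they are not part of youtube-dl or ytcc.

```python
from typing import NamedTuple


class ArchiveEntry(NamedTuple):
    extractor: str  # name of the extractor, e.g. "youtube"
    video_id: str   # site-specific identifier of the video


def parse_archive_line(line: str) -> ArchiveEntry:
    """Split one archive line at the first space into extractor and id."""
    extractor, video_id = line.strip().split(" ", 1)
    return ArchiveEntry(extractor, video_id)
```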

woefe commented on August 26, 2024

Admittedly, the extractor_hash approach has problems. I actually found cases where it won't work with the current function that relies on a Dict[str, str], which is not always the output of processors. Sometimes the values might be more complex structures.
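To illustrate the problem with nested values: a hash that assumes every value is a string breaks as soon as a value is a list or a dict. One possible workaround, shown here only as a sketch and not as what ytcc does, is to serialize the entry to canonical JSON before hashing. The function name `hash_entry` and the choice of JSON are assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def hash_entry(entry: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON form, so nested values and key order are handled."""
    # sort_keys makes the output independent of dict ordering;
    # default=str is a fallback for values json cannot serialize natively.
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```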

The hash should be the same for videos of different playlists. At least, for all examples that I checked it was the same.

I have looked into the --download-archive option again. It uses _make_archive_id to create the id, which can be generated from an unprocessed result and therefore does not require more network requests than the current approach. I think using _make_archive_id is more reliable, because then we rely on existing youtube_dl internals, which should work nicer with the rest of youtube_dl.

It is possible to replace the extractor_hash() with _make_archive_id(). Ytcc will simply resync all playlist content on the next update. I'll commit my changes and release a second beta soon.

EmRowlands commented on August 26, 2024

I've done some testing, and it appears that this approach will work with my scripts as a drop-in replacement. Since all of the videos from v1 are from YouTube, it's trivial to reimplement _make_archive_id() so that it does not require network access. I'm also not sure how to contribute these scripts, since they require v1 to be installed, with the v2 code sitting in a different directory (or vice versa, for the price of a trivial change).
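A minimal sketch of such an offline reimplementation, assuming the archive id has the form `<lowercased extractor key> <video id>`, as in the `youtube dQw4w9WgXcQ` example from the `--download-archive` format. The function name is hypothetical, and the default extractor key only holds because every v1 video comes from YouTube.

```python
def make_archive_id_offline(yt_video_id: str, extractor_key: str = "Youtube") -> str:
    """Build an archive-style id from a v1 video id without any network request."""
    # Assumption: the id is the lowercased extractor key, a space, then the video id.
    return f"{extractor_key.lower()} {yt_video_id}"
```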

EmRowlands commented on August 26, 2024

I have finished my implementation, but there are some caveats:

  • It requires version 1 to be installed
  • It requires a copy of the source code of version 2
  • It exists entirely "out of tree"

As such, I'm not really sure how to submit it for review. It could sit in the scripts directory if it was only a single file, but it includes 5 files (a common file, an export and import script, a config file, and a migration shell script which runs them all).

If you think it would be acceptable to put them in a subdirectory of scripts, I'm happy to submit a PR.

woefe commented on August 26, 2024

@EmRowlands, I'm not sure how to handle it. Is it public somewhere for me to see? Could you push it to a new branch on your fork, in a new subfolder of scripts/? Then we can still decide where to put it when we merge it. Maybe we could create an orphan branch (git checkout --orphan ...).

EmRowlands commented on August 26, 2024

I have added them in #50
