The overview from openzim

2023-01 Python Security Version Upgrade

Please bump Python version used in all python repos (docker images) to the
latest minor versions (within that major version)

Note: this is an automatic reminder intended for the assignee(s).

artofproblemsolving_en_all_maxi_2019-04 zim file is not rendering

Hi,
I tested artofproblemsolving_en_all_maxi_2019-04 zim file, some pages are empty or missing elements in the page.
I also tested it in this link : http://library.kiwix.org/artofproblemsolving_en_all_maxi_2019-04/A/Main_Page
the result is the same.

Needing soon hard token for code signing

https://knowledge.digicert.com/generalinformation/new-private-key-storage-requirement-for-standard-code-signing-certificates-november-2022.html

I guess this will mean we have to move to a cloud code signing solution. For example https://www.ssl.com/esigner/ (electron-userland/electron-builder#6158)

https://www.ssl.com/how-to/cloud-code-signing-integration-with-github-actions/

Made a ZIM of openfoodfacts.org

From @kelson42 on March 16, 2017 11:53

Data are here https://world.openfoodfacts.org/data

Copied from original issue: openzim/zimfarm#1

New scraper: Confluence / Atlassian

@Popolechien commented on Jan 24, 2020, 3:01 PM UTC:

There is a bunch of wikis running on that mediawiki competitor, it would be a neat addition to our portfolio.

This issue was moved by kelson42 from openzim/zim-requests#221.

2023-04 Python Security Version Upgrade

Please bump Python version used in all python repos (docker images) to the
latest minor versions (within that major version)

Note: this is an automatic reminder intended for the assignee(s).

Define a convention on TCP/UDP ports used by development stacks

Background

In many of our projects, we deploy local development servers (e.g., an API and a Database) on our development machines for testing purposes. These servers expose a TCP (occasionally UDP) port on our local machine. Currently, there is no standardized convention for the usage of these TCP/UDP ports across projects. For instance, some projects use port 8000 for web APIs, while others use 8080.

Note: This intentionally simplifies the distinction between TCP and UDP ports and assumes we don't want two distinct services, one on TCP and one on UDP, running on the same port number. Although technically possible, it's deemed cumbersome for our purposes.

Problem Statement

The absence of a convention on TCP/UDP port assignments for local development services leads to two issues:

After starting a local development stack, it's unclear where the services are listening, causing delays when switching between projects.
- This becomes more pronounced with the shift to docker-compose-based local dev stacks, initiated with a simple docker compose up -d.
Running two local development stacks simultaneously is usually impossible due to port conflicts.
- This often occurs when transitioning from developing project A to reviewing project B.

Proposition

We can address the problem by establishing a convention for TCP/UDP port assignments.

The proposed convention is to use port XXXY for every server in our systems, where:

Y is a number indicating the type of service:
- UI is always on Y=0
- Backend server (+/- API) is on Y=1
- Database is on Y=2
- Y=3 to 5 are reserved for potential generic usage
- Y=6 to 9 are available for non-generic services (e.g., a second backend server)
XXX is a number reserved per project (Github repository)
- Each Github repository will reserve a number in a centralized reference.
- Repositories may reserve multiple numbers if needed, and these numbers are contiguous. If the need wasn't anticipated, the project is moved to other contiguous numbers.
- To determine where XXX starts, we need a broad port range to accommodate all our projects. Since we don't have many other services running on our development machines and the TCP/UDP port ranges are cluttered with various services, we can use any meaningful port range for these assignments, reserving some numbers for external services if conflicts arise.
- XXX will hence start at 800, with the 800 and 808 ranges already reserved due to known conflicts with many of our (not yet migrated) projects and other web servers.

Feedback and implementation

All feedbacks are welcomed, after that I will transition this to a Wiki entry.

Proposal: Create a bot for the organisation.

@srv-twry commented on Feb 13, 2018, 9:31 PM UTC:

Proposal

Most of the open-source organisations these days have a bot to manage things like assigning, unassigning developers. Adding labels like "In-Progress" etc.

OpenDataKit - opendatakitbot handles labelling/unlabelling of issues and pr's using commands.
You can claim an issue using it.
It automatically assigns developers and shows a welcome message if it's your first contribution to the project.
It automatically un-assigns the developer in 7 days if there isn't any activity associated with it.
Here is an example issue and how it works.

Other organisations such as Fossasia also have their bots.

This will be very helpful in the long run specially with GSoC round the corner.

@mhutti1 @kelson42 Please review.
PS: I don't know how to make one but i can certainly ask them if it seems interesting to you.

This issue was moved by kelson42 from kiwix/kiwix-android#373.

2022 Python Version Upgrade

Please assess most appropriate major Python version to use for next year.

https://en.wikipedia.org/wiki/History_of_Python#Table_of_versions

If different than current one, upgrade all python repositories to it.

Note: this is an automatic reminder intended for the assignee(s).

Create a Python libzim binding

Should work on Linux
Python3

Vuex is not anymore the standard Vue state management library

In our standard stack, we previously decided to use Vuex for state management: https://github.com/openzim/overview/wiki/openZIM-workflow#frontend

This is not anymore the official state management management library : https://vuex.vuejs.org/

I recommend to adapt our standard to use Pinia instead of Vuex, so that new projects (including offspot/metrics frontend) use this new Vue official library.

Clarify the PR review process

I need you to help me clarify the PR review process, because it is too blurry for me and causing frustration.

I do not find enough precision in kiwix/overview/CONTRIBUTING.md or in openzim/overview Wiki. Please point me to the correct direction if I missed something.

This is a mid-term enhancement, I hope to reach a conclusion on this within few weeks, but there is clearly no hurry.

This issue describes below the situation(s) that occurred between Renaud and I to give a background, but this is not a personal conflict (at least from my position 😄 ).

What happened

I've once again failed to understand the PR process and caused frustation to Renaud by resolving a conversation too soon (this time) : offspot/metrics#25 (comment)

I had understood that once PR is approved, I have to resolve all conversations on my own because they are just comments to raise my awareness should I want to fix this. But obviously this is not always the case. And I find that it is weird to have pending conversation but an approved PR.

In contradiction to that, on an unapproved PR Renaud asked me to stop waiting for him on unresolved conversations and resolve the conversations on my own once the requested code change is done, but I'm always hesitating to do that because I often have doubt whether I got the code change request properly and whether I did the right change.

In the past, the situation already occurred that Renaud was disappointed by a change I've made after his PR approval and explicit request to resolve conversation and proceed with the merge once the change is done. This was exactly because I didn't applied the change / understood the change request properly. I consider that misunderstanding is normal.

All this is very frustrating for all of thus, and I would really like to come up with a much simpler process / rule(s). Simpler meaning, for me, less subject to personal interpretation.

Background research

Luckily (or not), we are not the only ones to struggle on this topic (I just did a very fast Google search):

Proposition

My process proposition is:

PR must not be approved until the reviewer is happy with the code / conversations
no conversation must be left unresolved before the approval is given by the reviewer
it is the reviewer responsibility to resolve conversations
authors must not resolve conversations, except for obvious code suggestions that have been applied to the code base
once the PR is approved it means that the author can merge
author must not change the code once the approval is given, at least not without requiring a new approval (but this should be a rare case, normal process is to merge asap to spread the change, and open a new PR for new changes ; this only makes sense if the merge makes no sense, e.g. because a very significant bug has been discovered)
should something still have to be discussed in a conversation but is not blocking for the change to be merged, an issue must be open to track the discussion point and the conversation will be resolved (reviewer can explicitly ask the author to do it, or the author can suggest it in a conversation)

This has some drawbacks of course (probably more work for the reviewer), but it is way clearer for me.

I would like also to add some resources to read as a PR author:

And other to read as a PR reviewer:

https://mtlynch.io/human-code-reviews-1/ (Part 1) and https://mtlynch.io/human-code-reviews-2/ (Part 2)

Suggestions / feedbacks welcomed!

Enable basic html formatting in content descriptor

copy/paste from zimfarm/719

Some of our zim files have fairly long descriptions and we end up with a block of text. It would be convenient if in the Description field of recipes we could insert some basic HTML (e.g.
) so as to render this text in a more palatable format

Switch to repeatable Python setups

We are now sufficiently affected by sub-dependencies issues for it to be necessary.

Scrapers and other Python projects should all be switched from good-old requirements.txt to repeatable, frozen environments. Harmonizing how we build/publish (from python perspective, not publish workflows!) should probably be looked at as well.

I'd suggest using pipenv/Pipfile/Pipfile.lock but looking at what's the current recommended way is mandatory.

Note: this should be tackled with #13 and #14

Decide where content team documentation should be placed

Currently, it looks like there are a lot of places where documentation about content edition is placed, and I probably miss some locations.

This documentation is about:

what is the overall process of content edition (high level picture of how do we go from a zim request to a published ZIM)
how do we configure an offliner kind (youtube, mwoffliner, ...)

Some documentation is stored in various google docs (and I don't know where most of them are stored).

Some documentation is placed in the Zimfarm wiki (https://github.com/openzim/zimfarm/wiki), e.g. https://github.com/openzim/zimfarm/wiki/Ticket-Lifecycle-(Zimit), https://github.com/openzim/zimfarm/wiki/Tickets-Lifecycle-(Mwoffliner), https://github.com/openzim/zimfarm/wiki/Youtube-scraper-configuration-and-debug

And I just added a new location with Kiwix content Google shared drive.

I find this situation not convenient because:

the various google docs are not centralized and will probably get lost at some point ; from my experience, every team document created in Google drive must be placed in a Google Shared drive and this Shared drive must be shared with appropriate persons of the team
the documentation placed in Zimfarm wiki is mixed with very technical content for devs/ops
Github wikis do not allow at all a review process, every change is immediately implemented in production
Github wikis do not allow to have "folders" of documentation to structure stuff

From my perspective we should have:

a central, public, reviewable, dedicated location for most documentation
a very small Google shared drive (or something else) only for documentation which cannot be made public (like the API keys file I just created since these are secrets)

WDYT? Did I missed some locations? What would you suggest?

Harmonize python repos to 3.11

For the sake of harmonization and ease of dev/maint, we want to use a single major Python version as base for all our python projects.

Following https://en.wikipedia.org/wiki/History_of_Python#Table_of_versions, we want to use 3.11 as of now.
As such, support for python 3.6 (not receiving secu updates) and 3.7 (secu updates to end in 6m) will be dropped from those projects.

Upgrade python-scraperlib to 3.x, including CLI support for description / long_description flags

Python scrapers must be updated to use python-scraperlib 3.x

At the same time, they must:

add (if not already present) CLI parameters to set description + long_description
use the shared logic of openzim/python-scraperlib#110 to handle these fields

For every scraper, do not forget to also update Zimfarm configuration to add these new CLI parameters + set type (input or textarea) + maximum length.

This is an overview ticket, works has to be done in the individual scrapers:

Code norms

It seems to be defficult to get a natural/grassroot agreement about coding style. In particular the case regarding variables, types, classes, constants, etc. I would be happy if we get a minimal agreement on this.

wikipedia_tr_all_novid_2019-04.zim is not browsable

Hi,
First of all thank you for developing and providing this tool.
I use Zimserver python module to serve zim archive in localhost.
I noticed that wikipedia_tr_all_novid_2019-04.zim wouldn't render the pages, I just see blurry empty pages. I also tried other tools such as web-archives and kiwix but none of them worked.

How can I create zim archive of https://tr.wikipedia.org by myself?
I also want to inform maintainers of https://download.kiwix.org/zim about the issue.

There are no portable (split) distributions of stackexchange ZIMs

Not sure if this is the right place to note this issue. It may be "by design", which is fine, but at least one of the ZIMs for Stackexchange -- stackoverflow.com_eng_all_2017-05.zim -- is 52GB, so a split version could be useful if time/resources permit. Or if not, perhaps a README should be put in the /portable/stackexchange directory pointing people to the FAQ where it explains how to split files manually.

Consider saving content publication date in the ZIM

And specify it is the ZIM specification.

For the moment, we have only the ZIM creation date, but this might be really different from the content publication date, in particular if the content is really old.

Folllowing a comment from https://github.com/veloman-yunkan at kiwix/libkiwix#702 (comment)

Provide documentation for ZIM Index Format

Currently there is a link in https://wiki.openzim.org/wiki/ZIM_file_format#Namespaces to documentation on the fulltext index that resides in namespace X of some ZIMs. Clicking on this link leads to https://wiki.openzim.org/wiki/ZIM_Index_Format, which is a mainly empty page "There is currently no text in this page". It would be useful to have some information about the format.

openzim / overview Goto Github PK

overview's People

Contributors

Stargazers

Watchers

overview's Issues

Background

Problem Statement

Proposition

Feedback and implementation

What happened

Background research

Proposition

Recommend Projects

Recommend Topics

Recommend Org