Giter Site home page Giter Site logo

jupyter-server / jupyter_server_fileid Goto Github PK

View Code? Open in Web Editor NEW
3.0 3.0 11.0 136 KB

An extension that maintains file IDs for documents in a running Jupyter Server

Home Page: https://jupyter-server-fileid.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
jupyter jupyter-server jupyter-server-extension

jupyter_server_fileid's People

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

jupyter_server_fileid's Issues

tmpfs support

Description

When running the test suite, I'm getting inconsistent results, with 1-4 of the following tests failing:

FAILED tests/test_manager.py::test_get_path_oob_move_nested[False] - AssertionError: assert None == 'new_path/test_path'
FAILED tests/test_manager.py::test_get_path_oob_move_nested[True] - AssertionError: assert None == 'new_path/test_path'
FAILED tests/test_manager.py::test_get_path_oob_move_deeply_nested[True] - AssertionError: assert None == 'new_path/child/test_path'
FAILED tests/test_manager.py::test_get_path_oob_move_deeply_nested[False] - AssertionError: assert None == 'new_path/child/test_path'

Full output from one run (where two of them failed)

Full output from one run (where two of them failed)
========================================================= test session starts =========================================================
platform linux -- Python 3.11.2, pytest-7.2.1, pluggy-1.0.0
rootdir: /tmp/jupyter_server_fileid, configfile: pyproject.toml, testpaths: tests/
collected 53 items                                                                                                                    

tests/test_manager.py ............ss.................F..F..................                                                     [100%]

============================================================== FAILURES ===============================================================
_________________________________________________ test_get_path_oob_move_nested[True] _________________________________________________

fid_manager = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab1eec90>, old_path = 'old_path', new_path = 'new_path'
stub_stat_crtime = True, fs_helpers = <jupyter_server_fileid.pytest_plugin.fs_helpers.<locals>.FsHelpers object at 0x7fddab1efad0>

    @pytest.mark.parametrize("stub_stat_crtime", [True, False], indirect=["stub_stat_crtime"])
    def test_get_path_oob_move_nested(fid_manager, old_path, new_path, stub_stat_crtime, fs_helpers):
        old_test_path = "test_path"
        new_test_path = os.path.join(new_path, "test_path")
        fs_helpers.touch(old_test_path)
        fid_manager.index(old_path)
        id = fid_manager.index(old_test_path)
    
        fs_helpers.move(old_path, new_path)
        fs_helpers.move(old_test_path, new_test_path)
    
>       assert fid_manager.get_path(id) == normalize_path(fid_manager, new_test_path)
E       AssertionError: assert None == 'new_path/test_path'
E        +  where None = <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab1eec90>>('e4f6ec8b-0e47-48f1-96a8-dae4de72a331')
E        +    where <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab1eec90>> = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab1eec90>.get_path
E        +  and   'new_path/test_path' = normalize_path(<jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab1eec90>, 'new_path/test_path')

tests/test_manager.py:429: AssertionError
_____________________________________________ test_get_path_oob_move_deeply_nested[False] _____________________________________________

fid_manager = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab0d0c50>, old_path = 'old_path', new_path = 'new_path'
old_path_child = 'old_path/child', new_path_child = 'new_path/child', stub_stat_crtime = False
fs_helpers = <jupyter_server_fileid.pytest_plugin.fs_helpers.<locals>.FsHelpers object at 0x7fddab0d1cd0>

    @pytest.mark.parametrize("stub_stat_crtime", [True, False], indirect=["stub_stat_crtime"])
    def test_get_path_oob_move_deeply_nested(
        fid_manager, old_path, new_path, old_path_child, new_path_child, stub_stat_crtime, fs_helpers
    ):
        old_test_path = "test_path"
        new_test_path = os.path.join(new_path_child, "test_path")
        fs_helpers.touch(old_test_path)
        fid_manager.index(old_path)
        fid_manager.index(old_path_child)
        id = fid_manager.index(old_test_path)
    
        fs_helpers.move(old_path, new_path)
        fs_helpers.move(old_test_path, new_test_path)
    
>       assert fid_manager.get_path(id) == normalize_path(fid_manager, new_test_path)
E       AssertionError: assert None == 'new_path/child/test_path'
E        +  where None = <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab0d0c50>>('dd1de351-0a20-419b-a42c-e58657a1d0a6')
E        +    where <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab0d0c50>> = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab0d0c50>.get_path
E        +  and   'new_path/child/test_path' = normalize_path(<jupyter_server_fileid.manager.LocalFileIdManager object at 0x7fddab0d0c50>, 'new_path/child/test_path')

tests/test_manager.py:448: AssertionError
========================================================== warnings summary ===========================================================
.venv/lib/python3.11/site-packages/jupyter_server/services/contents/filemanager.py:16
  /tmp/jupyter_server_fileid/.venv/lib/python3.11/site-packages/jupyter_server/services/contents/filemanager.py:16: DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
  given by the platformdirs library.  To remove this warning and
  see the appropriate new directories, set the environment variable
  `JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
  The use of platformdirs will be the default in `jupyter_core` v6
    from jupyter_core.paths import exists, is_file_hidden, is_hidden

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================== slowest 10 durations =========================================================

(10 durations < 0.005s hidden.  Use -vv to show these durations.)
======================================================= short test summary info =======================================================
SKIPPED [1] tests/test_manager.py:215: Requires crtime support.
SKIPPED [1] tests/test_manager.py:228: Requires crtime support.
FAILED tests/test_manager.py::test_get_path_oob_move_nested[True] - AssertionError: assert None == 'new_path/test_path'
FAILED tests/test_manager.py::test_get_path_oob_move_deeply_nested[False] - AssertionError: assert None == 'new_path/child/test_path'
========================================= 2 failed, 49 passed, 2 skipped, 1 warning in 0.34s ==========================================

Reproduce

  1. python -m venv .venv
  2. . venv/bin/activate
  3. pip install .[test]
  4. pytest

Expected behavior

Tests passing on every run ;-).

Context

  • Operating System and version: Gentoo Linux amd64
  • Browser and version: n/a
  • Jupyter Server version: 2.3.0
Troubleshoot Output
$PATH:
	/tmp/jupyter_server_fileid/.venv/bin
	/home/mgorny/perl5/bin
	/home/mgorny/node_modules/.bin/
	/home/mgorny/bin
	/usr/i686-pc-linux-gnu/gcc-bin/4.9.2
	/usr/local/sbin
	/usr/local/bin
	/usr/sbin
	/usr/bin
	/sbin
	/bin
	/opt/bin
	/usr/lib/llvm/17/bin
	/usr/lib/llvm/16/bin
	/usr/lib/llvm/15/bin
	/usr/lib/llvm/9/bin
	/etc/eselect/wine/bin
	/home/mgorny/.local/bin

sys.path:
/tmp/jupyter_server_fileid/.venv/bin
/usr/lib/python311.zip
/usr/lib/python3.11
/usr/lib/python3.11/lib-dynload
/tmp/jupyter_server_fileid/.venv/lib/python3.11/site-packages

sys.executable:
/tmp/jupyter_server_fileid/.venv/bin/python

sys.version:
3.11.2 (main, Feb 15 2023, 08:19:40) [GCC 12.2.1 20230121]

platform.platform():
Linux-6.1.12-gentoo-dist-x86_64-AMD_Ryzen_5_3600_6-Core_Processor-with-glibc2.36

which -a jupyter:
/tmp/jupyter_server_fileid/.venv/bin/jupyter
/usr/bin/jupyter

pip list:
Package Version
------------------------ ---------
anyio 3.6.2
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
attrs 22.2.0
backcall 0.2.0
beautifulsoup4 4.11.2
bleach 6.0.0
certifi 2022.12.7
cffi 1.15.1
cfgv 3.3.1
charset-normalizer 3.0.1
comm 0.1.2
coverage 7.1.0
debugpy 1.6.6
decorator 5.1.1
defusedxml 0.7.1
distlib 0.3.6
executing 1.2.0
fastjsonschema 2.16.2
filelock 3.9.0
fqdn 1.5.1
identify 2.5.18
idna 3.4
iniconfig 2.0.0
ipykernel 6.21.2
ipython 8.10.0
isoduration 20.11.0
jedi 0.18.2
Jinja2 3.1.2
jsonpointer 2.3
jsonschema 4.17.3
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyter-events 0.6.3
jupyter_server 2.3.0
jupyter_server_fileid 0.7.0
jupyter_server_terminals 0.4.4
jupyterlab-pygments 0.2.2
MarkupSafe 2.1.2
matplotlib-inline 0.1.6
mistune 2.0.5
nbclient 0.7.2
nbconvert 7.2.9
nbformat 5.7.3
nest-asyncio 1.5.6
nodeenv 1.7.0
packaging 23.0
pandocfilters 1.5.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pip 23.0
platformdirs 3.0.0
pluggy 1.0.0
pre-commit 3.0.4
prometheus-client 0.16.0
prompt-toolkit 3.0.36
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pycparser 2.21
Pygments 2.14.0
pyrsistent 0.19.3
pytest 7.2.1
pytest-console-scripts 1.3.1
pytest-cov 4.0.0
pytest-jupyter 0.6.2
pytest-timeout 2.1.0
python-dateutil 2.8.2
python-json-logger 2.0.6
PyYAML 6.0
pyzmq 25.0.0
requests 2.28.2
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
Send2Trash 1.8.0
setuptools 67.3.2
six 1.16.0
sniffio 1.3.0
soupsieve 2.4
stack-data 0.6.2
terminado 0.17.1
tinycss2 1.2.1
tornado 6.2
traitlets 5.9.0
uri-template 1.2.0
urllib3 1.26.14
virtualenv 20.19.0
wcwidth 0.2.6
webcolors 1.12
webencodings 0.5.1
websocket-client 1.5.1

Tests are failing on archs like ppc64le or s390x

Description

I maintain this lib as an RPM package in Fedora Linux and when I try to build it on non-mainstream architectures, some tests fail.

Reproduce

  1. Install this package on a s390x, ppc64le or aarch64 machine
  2. Run tests

Expected behavior

All tests should pass.

Context

Tests Output ======================================= FAILURES ======================================== __________________________ test_get_path_oob_move_nested[True] __________________________

fid_manager = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82368250>
old_path = 'old_path', new_path = 'new_path', stub_stat_crtime = True
fs_helpers = <jupyter_server_fileid.pytest_plugin.fs_helpers..FsHelpers object at 0x3ff8254c050>

@pytest.mark.parametrize("stub_stat_crtime", [True, False], indirect=["stub_stat_crtime"])
def test_get_path_oob_move_nested(fid_manager, old_path, new_path, stub_stat_crtime, fs_helpers):
    old_test_path = "test_path"
    new_test_path = os.path.join(new_path, "test_path")
    fs_helpers.touch(old_test_path)
    fid_manager.index(old_path)
    id = fid_manager.index(old_test_path)

    fs_helpers.move(old_path, new_path)
    fs_helpers.move(old_test_path, new_test_path)
  assert fid_manager.get_path(id) == normalize_path(fid_manager, new_test_path)

E AssertionError: assert None == 'new_path/test_path'
E + where None = <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82368250>>('c02d1ffa-d33f-43ad-8b2c-36017ef35b39')
E + where <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82368250>> = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82368250>.get_path
E + and 'new_path/test_path' = normalize_path(<jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82368250>, 'new_path/test_path')

tests/test_manager.py:429: AssertionError
_________________________ test_get_path_oob_move_nested[False] __________________________

fid_manager = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8236af90>
old_path = 'old_path', new_path = 'new_path', stub_stat_crtime = False
fs_helpers = <jupyter_server_fileid.pytest_plugin.fs_helpers..FsHelpers object at 0x3ff8236a0d0>

@pytest.mark.parametrize("stub_stat_crtime", [True, False], indirect=["stub_stat_crtime"])
def test_get_path_oob_move_nested(fid_manager, old_path, new_path, stub_stat_crtime, fs_helpers):
    old_test_path = "test_path"
    new_test_path = os.path.join(new_path, "test_path")
    fs_helpers.touch(old_test_path)
    fid_manager.index(old_path)
    id = fid_manager.index(old_test_path)

    fs_helpers.move(old_path, new_path)
    fs_helpers.move(old_test_path, new_test_path)
  assert fid_manager.get_path(id) == normalize_path(fid_manager, new_test_path)

E AssertionError: assert None == 'new_path/test_path'
E + where None = <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8236af90>>('08b7e8c1-3916-48f3-8770-43bc9ac076fe')
E + where <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8236af90>> = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8236af90>.get_path
E + and 'new_path/test_path' = normalize_path(<jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8236af90>, 'new_path/test_path')

tests/test_manager.py:429: AssertionError
______________________ test_get_path_oob_move_deeply_nested[True] _______________________

fid_manager = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82343ad0>
old_path = 'old_path', new_path = 'new_path', old_path_child = 'old_path/child'
new_path_child = 'new_path/child', stub_stat_crtime = True
fs_helpers = <jupyter_server_fileid.pytest_plugin.fs_helpers..FsHelpers object at 0x3ff82343390>

@pytest.mark.parametrize("stub_stat_crtime", [True, False], indirect=["stub_stat_crtime"])
def test_get_path_oob_move_deeply_nested(
    fid_manager, old_path, new_path, old_path_child, new_path_child, stub_stat_crtime, fs_helpers
):
    old_test_path = "test_path"
    new_test_path = os.path.join(new_path_child, "test_path")
    fs_helpers.touch(old_test_path)
    fid_manager.index(old_path)
    fid_manager.index(old_path_child)
    id = fid_manager.index(old_test_path)

    fs_helpers.move(old_path, new_path)
    fs_helpers.move(old_test_path, new_test_path)
  assert fid_manager.get_path(id) == normalize_path(fid_manager, new_test_path)

E AssertionError: assert None == 'new_path/child/test_path'
E + where None = <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82343ad0>>('2d524c33-c4fc-44d9-ad64-425e7f44d044')
E + where <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82343ad0>> = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82343ad0>.get_path
E + and 'new_path/child/test_path' = normalize_path(<jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff82343ad0>, 'new_path/child/test_path')

tests/test_manager.py:448: AssertionError
______________________ test_get_path_oob_move_deeply_nested[False] ______________________

fid_manager = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8232a610>
old_path = 'old_path', new_path = 'new_path', old_path_child = 'old_path/child'
new_path_child = 'new_path/child', stub_stat_crtime = False
fs_helpers = <jupyter_server_fileid.pytest_plugin.fs_helpers..FsHelpers object at 0x3ff82328890>

@pytest.mark.parametrize("stub_stat_crtime", [True, False], indirect=["stub_stat_crtime"])
def test_get_path_oob_move_deeply_nested(
    fid_manager, old_path, new_path, old_path_child, new_path_child, stub_stat_crtime, fs_helpers
):
    old_test_path = "test_path"
    new_test_path = os.path.join(new_path_child, "test_path")
    fs_helpers.touch(old_test_path)
    fid_manager.index(old_path)
    fid_manager.index(old_path_child)
    id = fid_manager.index(old_test_path)

    fs_helpers.move(old_path, new_path)
    fs_helpers.move(old_test_path, new_test_path)
  assert fid_manager.get_path(id) == normalize_path(fid_manager, new_test_path)

E AssertionError: assert None == 'new_path/child/test_path'
E + where None = <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8232a610>>('e595adb2-0483-4863-89b0-d59885318541')
E + where <bound method LocalFileIdManager.get_path of <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8232a610>> = <jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8232a610>.get_path
E + and 'new_path/child/test_path' = normalize_path(<jupyter_server_fileid.manager.LocalFileIdManager object at 0x3ff8232a610>, 'new_path/child/test_path')

tests/test_manager.py:448: AssertionError
=================================== warnings summary ====================================
venv/lib64/python3.11/site-packages/jupyter_server/services/contents/filemanager.py:14
/root/jupyter_server_fileid/venv/lib64/python3.11/site-packages/jupyter_server/services/contents/filemanager.py:14: DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
given by the platformdirs library. To remove this warning and
see the appropriate new directories, set the environment variable
JUPYTER_PLATFORM_DIRS=1 and then run jupyter --paths.
The use of platformdirs will be the default in jupyter_core v6
from jupyter_core.paths import exists, is_file_hidden, is_hidden

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================= slowest 10 durations ==================================

(10 durations < 0.005s hidden. Use -vv to show these durations.)
================================ short test summary info ================================
SKIPPED [1] tests/test_manager.py:215: Requires crtime support.
SKIPPED [1] tests/test_manager.py:228: Requires crtime support.
FAILED tests/test_manager.py::test_get_path_oob_move_nested[True] - AssertionError: assert None == 'new_path/test_path'
FAILED tests/test_manager.py::test_get_path_oob_move_nested[False] - AssertionError: assert None == 'new_path/test_path'
FAILED tests/test_manager.py::test_get_path_oob_move_deeply_nested[True] - AssertionError: assert None == 'new_path/child/test_path'
FAILED tests/test_manager.py::test_get_path_oob_move_deeply_nested[False] - AssertionError: assert None == 'new_path/child/test_path'
================== 4 failed, 47 passed, 2 skipped, 1 warning in 0.41s ===================

jupyter-fileid script needs click

Description

I've installed jupyter-server-fileid from PyPI and it created an executable file bin/jupyter-fileid. Unfortunately, the script does not work because click is not installed by default.

Reproduce

  1. run pip install jupyter-server-fileid
  2. run bin/jupyter-fileid

Expected behavior

I can imagine two possible solutions:

  1. Add click to the standard dependencies. I've calculated the size of a virtualenv where I installed only jupyter-server-fileid and all its dependencies and it has 66 MB. Adding click there adds only one more MB which is not that big difference.
  2. Make it more user-friendly for users. Add try/except to the script and an error message saying that you should install jupyter-server-fileid[cli] or click directly to make it work.

Cannot switch file ID managers with same database

Description

Switching file ID managers from e.g. LocalFileIdManager to ArbitraryFileIdManager doesn't play well with the database:

    HTTPServerRequest(protocol='http', host='127.0.0.1:8888', method='GET', uri='/api/yjs/roomid/Untitled.ipynb?1666790876574', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/home/david/mambaforge/envs/jupyterlab/lib/python3.10/site-packages/tornado/web.py", line 1713, in _execute
        result = await result
      File "/home/david/github/davidbrochart/jupyter_server_ydoc/jupyter_server_ydoc/ydoc.py", line 310, in get
        idx = file_id_manager.index(path)
      File "/home/david/github/davidbrochart/jupyter_server_fileid/jupyter_server_fileid/manager.py", line 138, in index
        cursor = self.con.execute("INSERT INTO Files (path) VALUES (?)", (path,))
    sqlite3.IntegrityError: NOT NULL constraint failed: Files.ino

Reproduce

  1. Run JupyterLab in collaborative mode and LocalFileIdManager installed, and open a file.
  2. Run again JupyterLab with an ArbitraryFileIdManager, and open the file.
  3. See the error above.

Expected behavior

Should we use a different database when switching file ID managers? If yes, this issue can be closed.
Otherwise, I'm not sure we should have filesystem-specific attributes in the database. I understand it is needed for local filesystems, but maybe we should give it a more abstract name? In the end, it is some kind of identifier. And other file ID managers might not have this notion, so maybe we should allow it to be NULL?

optimistic get_path()

Problem

get_path() calls sync_all() even when the file wasn't moved.

Proposed Solution

Implement an optimistic strategy that looks up the existing path record in the DB. Then check if the file currently is still present at the same path, with the same inode number and timestamps. If so, then return the path early without calling sync_all().

create SQLite mixin

Helps ensure self.con is of type SQLite connection whenever it is used, and allows us to remove redundant type assertions, e.g.

if not hasattr(self, "con") or self.con is None:
    return

See discussion in #49.

Allow db_path to be set to ":memory:"

Problem

Some network filesystems (e.g.: NFS, Lustre) do not handle well SQLite database transactions and it can cause serious hanging issue when creating and making transaction on the database when it resides on these filesystems. This results in a poor Jupyter user experience: timeouts, error 504, etc.

Similar issues have been reported for IPython in the past:

Proposed Solution

When validating BaseFileIdManager.db_path, allow the string to be either an absolute path or ":memory:".

This solution is implemented in different Jupyter subprojects:

I have implemented the solution in the PR #70 with the same title as this issue.

Additional context

Furthermore, to be more consistent with other Jupyter projects, I would recommend:

  1. the parameter be renamed db_file instead of db_path.
  2. the default parameter value be handled by the "@default" decorated function, like what is done for NotebookNotary.db_file instead of a global variable at the module level.

prefix contents manager root to paths stored in ArbitraryFileIdManager

  • "API paths" are paths returned from the Contents Manager events
  • They are always relative to the Contents Manager root
  • They are always delimited by forward slashes
  • These are different from "filesystem paths" which should be absolute and may be delimited by a different separator (e.g. \ on Windows).

Based on this, we can do os.path.join(os.path, self.root_dir) whenever receiving a path in the ArbitraryFileIdManager, which means it will work regardless of server root.

UNIQUE constraint failed when moving file

Description

Creating a file in Jupyter, then moving it using mv, and finally renaming the moved file from Jupyter does not work.

It seems like there needs to be a sync of some sort before doing the move...

Reproduce

  1. Use LocalFileIdManager.
  2. Create notebook foo1.ipynb
  3. Open sqlite db: sqlite3 /Users/dleen/Library/Jupyter/file_id_manager.db
  4. Verify file is indexed:
SELECT * FROM Files;
4462c993-1a32-4a29-93b6-e14325a57cd1|/Users/dleen/code/catalog/foo1.ipynb|91561383|1680642664149532160|1680739962758658767|0
  1. Outside of Jupyter move the file: mv foo1.ipynb foo2.ipynb
  2. In Jupyter close foo1.ipynb if it was still open, and open foo2.ipynb.
  3. Rename foo2.ipynb to foo3.ipynb.
  4. See error in log:
[E 2023-04-05 17:13:02.255 ServerApp] UNIQUE constraint failed: Files.ino
  1. In sqlite no new entry has been created.

Expected behavior

The file is correctly indexed and preserves the id of foo1.ipynb:

SELECT * FROM Files;
4462c993-1a32-4a29-93b6-e14325a57cd1|/Users/dleen/code/catalog/foo3.ipynb|91561383|1680642664149532160|1680739962758658767|0

fix CI for Windows platforms

Testing CI step was added in #4, but support for Windows and Ubuntu pypy platforms had to disabled due to failure. These tests should be fixed and re-enabled to ensure platform compatibility.

Support any contents manager

Problem

The current implementation of jupyter-server-fileid is hard-coded to work on the local file-system, by making use of scandir and OS-specific features like inode (on Unix) or file index (on Windows). Thus it is only compatible with contents managers that interact with the local file-system.

Proposed Solution

We should support any contents manager, maybe through a degraded mode (since the equivalent of inode won't be available).

Additional context

See jupyterlab/jupyterlab#13246 (comment).

conda-forge package

Problem

This package is published on PyPI but not on conda-forge.

Proposed Solution

Create a conda-forge package.

How to deal with file deletion?

Problem

Imagine that we are building a system that allows users to add comments on a notebook. The comments are stored in a backend system hosted in the cloud.

The user creates a notebook. The user uses the commenting UI to add a comment and clicks submit. The following happens:

  1. The newly created notebook is indexed via the save method of FIM (caveat: see issue #66)
  2. The commenting extension UI sends the notebook path to the Jupyter server.
  3. The commenting extension backend code calls FIM.get_id(path) (assuming the file was successfully indexed) to get the associated id of the notebook.
  4. The commenting extension submits a request to the backend system to store the comment using the id retrieved from the FIM.

Deletion scenario

  1. The user deletes a notebook which has associated comments.
  2. The commenting extension code subscribes to the contents manager delete event.
  3. The FileID manager extension also subscribes to the contents manager delete event.

I don't know for sure but but it seems fair that we cannot assume which subscriber is triggered first. There are potentially two cases:

  1. The FileID manager is triggered first. The record of the notebook is deleted from the sqlite DB.
  2. The commenting extension calls FIM.get_id(path) which returns None. The extension does not know which record to delete from the commenting backend system.

OR

  1. The commenting extension calls FIM.get_id(path) which returns the id. The extension deletes the corresponding record from the commenting backend system.
  2. The FileID manager is triggered and the record of the notebook is deleted from the sqlite DB.

Solutions?

Alternatively, we could implement a custom contents manager which calls FIM.get_id(path) potentially before the event is emitted. However this does not seem like a long term maintainable solution due to the dependency on when the event is emitted.

Another solution could be for the FileID manager to emit an event for moves/save/deletes with a corresponding id so that extensions using the ids can react to changes.

One other solution could be to use "soft" deletes where the row in the DB is "flagged" as deleted but kept in the DB.

Discussion

Or is it possible that there's a simpler solution here?

Newly created notebook is not indexed

Description

When I create a new notebook the save method is triggered but the notebook is not indexed.

Reproduce

  1. Using --FileIdExtension.file_id_manager_class=jupyter_server_fileid.manager.LocalFileIdManager
  2. Create a new notebook Untitled.ipynb
  3. Open the sqlite DB: sqlite3 /Users/dleen/Library/Jupyter/file_id_manager.db
  4. SELECT * FROM Files where path like '%.ipynb%';
  5. No results are returned
  6. Rename the file: foo.ipynb
  7. Run the same query, the file is now indexed: a74e32af-dad7-462d-b299-a0a031f834e1|/Users/dleen/code/catalog/foo.ipynb|93704796|1681152929593818880|1681153012072852876|0.

Expected behavior

The initial file Untitled.ipynb is indexed.

I added log statements in the save method and verified that it is indeed being called. But from reading the code if the file doesn't already exist in the index then the method returns early.

Discussion

I can't tell if this behavior is expected. That the indexing only happens when a rename happens?

What is the primary way that a file is supposed to get indexed?

support crtime on ext4

Right now, out-of-band edits change the modified time of the file, and thus cause a new file ID to be generated when index() is called at that path on ext* filesystems.

A huge improvement would be to leverage the statx() system call available on newer versions of Linux, and store the file creation time, which is preserved on edits. This would allow out-of-band edits to preserve the file ID on ext4 filesystems.

Setup Github publishing actions and PyPI with Jupyter Server Bot.

We should take advantage of the great work happening in Jupyter Releaser.

We can the Jupyter Server Bot to publish directly from this repo.

  • Add the Github Jupyter Server Bot as a maintainer on this repo
  • Add the PyPI Jupyter Server Bot as a maintainer on the PyPI repo.
  • Set up the ADMIN_GITHUB_TOKEN to use the Bot's token.

Add simple REST API to fetch a file ID given a file path

I have a use-case where a JupyterLab plugin would like to know file's ID created by this plugin using the file path.

A nice enhancement would be to include a simple REST API to fetch this info. I don't think this is available here yet, correct?

remove mtime fallback

cc @ellisonbg @davidbrochart @kevin-bates @afshin

Problem

Right now, if the filesystem does not support creation timestamps, we default to looking at the modified timestamps to verify the file is really the same file. We do this because relying on an inode number alone doesn't provide a strong guarantee of identity. For example, let's say we index the file foo with an ID of 1 (not using UUIDs here for brevity).

rm foo
touch foo

This sequence of out-of-band creates and deletes generally preserves the inode number, but in reality the new file is completely different from the old file. Thus, the current implementation verifies that both the inode number and the modified timestamps are identical to the one recorded previously before assigning the same file ID to a given path. So here, file_id_manager.get_path(1) correctly returns None. We term this behavior mtime fallback.

However, one issue with mtime fallback is that out-of-band edits result in the file no longer being associated with its file ID. For example, editing foo via vim results in file_id_manager.get_path(1) returns None, even though foo is still the same file. Furthermore, the file ID 1 gets completely dropped from the database and never used again.

Originally, the initial design was to make get_path() as accurate as possible, because I viewed that incorrectly associating a file ID with a different file would be a bad outcome. However, as we are using file ID more, we are noticing that incorrectly dropping a file ID is an even worse outcome, as it can lead to catastrophic loss of data associated with a file ID. As @kevin-bates puts it:

You're choosing this "weird edge case" (that also requires the new file to have the same inode as the old file!) over simple (everyday) modifications. Could you forgo the timestamp checks against modification time when creation time is not available and you know the inodes are equal? All references should not be orphaned because a file was modified and the user happened to be on some filesystem that doesn't track crtime.

While #40 would eliminate this problem for ext4, ext3 and other filesystems without creation timestamps would still have this issue.

Proposed Solution

Add a configurable boolean trait called mtime_fallback on LocalFileIdManager, defaulting to False. Only perform mtime_fallback if this trait is True.

make `BaseFileIdManager` a true ABC

From discussion in #24:

This should derive from ABC along with a metaclass definition in order for @AbstractMethod decorators to be effective. A good example of this can be found here. By not deriving BaseFileIdManager from ABC, @AbstractMethod decorators do not work (and I see those too have been removed). As a result, a subclass's violation for not implementing various methods will not be discovered until that method is called, rather than when the class instance is instantiated. Proper decoration also prevents BaseFileIdManager from being instantiated (i.e., a true abstract base class).

ArbitraryFileIdManager should have a configurable option to determine its "content root"

In looking at various ContentsManager implementations (e.g., S3, PostgreSQL) the root_dir is not used. Instead, a different (custom) attribute would be tantamount to determining the content root. For example, the referenced S3 ContentsManager has a bucket attribute indicating the bucket in which the content resides. Similarly, the referenced PostgreSQL ContentsManager uses db_url to identify the database, which would be sufficient as the content root. (One might argue that the user_id attribute may be a better choice, but that's an unrelated discussion.)

It seems like ArbitraryFileIdManager should introduce a configurable trait that either indicates the name of the attribute from which to pull the content root identifier from or a trait the user must set on both its ContentsManager and ArbitraryFileIdManager.

I'd lean toward the former.

Proposal

  • Introduce a trait content_root_attribute: str that refers to the attribute on the configured ContentsManager from which the content_root value (known as root_dir today, and that name can continue) is pulled. The default value for content_root_attribute would be root_dir.

The other option would be to simply expose root_dir on ArbitraryFileIdManager which is configured to be the same value as the appropriate attribute on ContentsManager (e.g., S3ContentsManager.bucket).

The advantage of the first option is that operators/admins would only need to configure the ArbitraryFileIdManager once for the duration that that particular ContentsManager is in use. While the second option would require a different configuration for each user. As a result, the first option is helpful for IT staffs that need to push configurations (like in Berkeley's Data 8 class or in PaaS envs).

Async API

Problem

This package deals with file systems and databases, which can be very slow to operate compared to the CPU.
Also, it will be used in a web server where performance is critical.

Proposed Solution

I think we should make all file-system and database related functions async, for instance using aiofiles and aiosqlite.

pypy compatibility

pypy tests are currently disabled due to a difference in implementation of sqlite3. See #6 for more context.

use a decorator to automatically commit after public methods

Problem

Forgetting to call self.con.commit() imperatively can be the source of a lot of silly bugs, as seen in #63.

Setting autocommit=True is bad for performance, as a single batched commit is significantly faster than multiple commits. The way this is currently being managed is that private methods never commit, and public methods should always commit right before returning the value. However, this currently needs to be done imperatively at every possible return point in each public method, which is fragile.

Proposed Solution

There should be a @commit decorator that wraps every public method.

Using sqlite in WAL mode causes file saving failure when used on JupyterHub on NFS

Description

We upgraded from lab 3.5 to 3.6, and that seems to have brought this dependency in. At startup, it seems to try to open a sqlite db in WAL mode under ~/.local/share/jupyter, which is on NFS. This fails intermittently, as sqlite's WAL mode is not supported on NFS. https://www.sqlite.org/wal.html states:

All processes using a database must be on the same host computer; WAL does not work over a network filesystem.

So on server startup, this fails with these log messages:

[I 2023-02-22 14:05:52.993 FileIdExtension] Configured File ID manager: ArbitraryFileIdManager
[I 2023-02-22 14:05:52.993 FileIdExtension] ArbitraryFileIdManager : Configured root dir: /home/jovyan
[I 2023-02-22 14:05:52.993 FileIdExtension] ArbitraryFileIdManager : Configured database path: /home/jovyan/.local/share/jupyter/file_id_manager.db
[I 2023-02-22 14:05:53.009 FileIdExtension] ArbitraryFileIdManager : Successfully connected to database file.
[I 2023-02-22 14:05:53.009 FileIdExtension] ArbitraryFileIdManager : Creating File ID tables and indices.
[W 2023-02-22 14:05:53.096 SingleUserLabApp manager:359] jupyter_server_fileid | extension failed loading with message: unable to open database file
[E 2023-02-22 14:05:53.096 SingleUserLabApp manager:360] jupyter_server_fileid | stack trace
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/manager.py", line 355, in load_extension
        extension.load_all_points(self.serverapp)
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/manager.py", line 229, in load_all_points
        return [self.load_point(point_name, serverapp) for point_name in self.extension_points]
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/manager.py", line 229, in <listcomp>
        return [self.load_point(point_name, serverapp) for point_name in self.extension_points]
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/manager.py", line 222, in load_point
        return point.load(serverapp)
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/manager.py", line 148, in load
        return loader(serverapp)
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/application.py", line 453, in _load_jupyter_server_extension
        extension.initialize()
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/application.py", line 420, in initialize
        self._prepare_settings()
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/extension/application.py", line 302, in _prepare_settings
        self.initialize_settings()
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server_fileid/extension.py", line 23, in initialize_settings
        file_id_manager = self.file_id_manager_class(
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server_fileid/manager.py", line 244, in __init__
        self.con.execute("PRAGMA journal_mode = WAL")
    sqlite3.OperationalError: unable to open database file

Unfortunately, this means all file save operations via contentsmanager fail! So this is in the critical path for all file saves, new notebook creations, etc - even when not using RTC.

[E 2023-02-22 14:09:35.383 SingleUserLabApp web:1798] Uncaught exception GET /user/<name>/api/contents/KMeans_Tutorial.ipynb?type=notebook&_=1677092974636 (10.1.0.10)
    HTTPServerRequest(protocol='https', host='jupyter.utoronto.ca', method='GET', uri='/user/<name>/api/contents/KMeans_Tutorial.ipynb?type=notebook&_=1677092974636', version='HTTP/1.1', remote_ip='10.1.0.10')
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.10/site-packages/tornado/web.py", line 1713, in _execute
        result = await result
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/services/contents/handlers.py", line 116, in get
        self.contents_manager.get(
      File "/opt/conda/lib/python3.10/site-packages/jupytext/contentsmanager.py", line 207, in get
        model = self.super.get(path, content, type="notebook", format=format)
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/services/contents/filemanager.py", line 424, in get
        model = self._notebook_model(path, content=content)
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/services/contents/filemanager.py", line 372, in _notebook_model
        nb = self._read_notebook(
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/services/contents/fileio.py", line 267, in _read_notebook
        with self.open(os_path, "r", encoding="utf-8") as f:
      File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
        return next(self.gen)
      File "/opt/conda/lib/python3.10/site-packages/jupyter_server/services/contents/fileio.py", line 198, in open
        with open(os_path, *args, **kwargs) as f:
    OSError: [Errno 121] Remote I/O error: '/home/jovyan/KMeans_Tutorial.ipynb'

We have reverted to JupyterLab 3.5 in the meantime.

Questions

  1. Is fileid required even when not using RTC?
  2. Does the db need to persist across JupyterHub sessions? if not, we can put it in /tmp (like in 2i2c-org/infrastructure#2246).
  3. Does it need the WAL? Based on
    # do not allow reads to block writes. required when using multiple processes
    that looks like an intentional decision. If so, this would cause lots of problems on NFS

use UUIDs for primary keys

Preferring UUIDs over integers for primary keys in the Files table should be a configurable trait on the FileIdManager class.

Use UUIDs for file IDs instead of integers.

See below discussion for further context.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.