ark-builders / arklib Goto Github PK

Core of the programs in ARK family

License: MIT License

Rust 100.00%

cross-platform files library rust rust-library content-addressing file-modification-time file-monitoring resource-management

arklib's Introduction

ArkLib

This is the home of core ARK library.

Being implemented in Rust, it provides us capability to port all our apps to all common platforms. Right now, Android is supported by using the arklib-android project. And for Linux/macOS/Windows, the library can be used as-is and easily embedded into an app, e.g. built with Tauri. Development docs will come sometime.

The Concept of the library

The purpose of the library is to manage resource index of folders with various user data, as well as to manage user-defined metadata: tags, scores, arbitrary properties like movie title or description. Such a metadata is persisted to filesystem for easier sync and backup. The resource index provides us with content addressing and allows easier storage and versions tracking. We also believe it'll allow easier cross-device sync implementation.

Prerequisites

PDFium prebuilt (bblanchon)

Build

Like most of Rust projects:

cargo build --release

Run unit tests:

cargo test

Development

For easier testing and debugging, we have the ARK-CLI tool working with ARK-enabled folders.

Benchmarks

arklib relies on the criterion crate for benchmarking to ensure optimal performance. Benchmarks are crucial for evaluating the efficiency of various functionalities within the library.

Running Benchmarks

To execute the benchmarks, run this command:

cargo bench

This command runs all benchmarks and generates a report in HTML format located at target/criterion/report. If you wish to run a specific benchmark, you can specify its name as an argument as in:

cargo bench index_build

Benchmarking Local Files

Our benchmark suite includes tests on local files and directories. These benchmarks are located in the benches/ directory. Each benchmark sets a time limit using group.measurement_time(), which you can adjust manually based on your requirements.

You have the flexibility to benchmark specific files or folders by modifying the variables within the benchmark files. By default, the benchmarks operate on the tests/ directory and its contents. You can change the directory/files by setting the DIR_PATH and FILE_PATHS variables to the desired values.

For pre-benchmark assessment of required time to index a huge local folder, you can modify test_build_resource_index test case in src/index.rs.

arklib's People

Contributors

Stargazers

Watchers

Forkers

sisco0 zannis j4w3ny hhio618 oluiscabral cycle-five blockchainlover2019 moooover funnyvolt tareknaser

arklib's Issues

Link-to-Web (bookmark) resource kind

We can move Link-to-Web resource kind from ARK Shelf (ARK-Builders/ARK-Shelf#1)
and ARK Navigator (ARK-Builders/ARK-Shelf#37) into this library.

We would benefit from this in 2 ways:

Creation and viewing of links would be kept in the same place
(right now we create it in one app, and view it in another one).
We could write apps on another platforms which would be
compatible with the data generated on Android.

Define custom error type

Currently the error on the Rust side are not really managed.The arlkib crate should return a custom error and not anyhow error. It's possible to wrap anyhow to a custom error. With that different errors would be possible

Benchmark faster hash functions

We are interested in overall performance for index construction, i.e. both pure hashing speed and collisions amount are important. Note that there are at least 2 kinds of t1ha function: t1ha0 (the fastest) and t1ha1 (portable).

We should also measure t1ha2 and t1ha3 since they should have less collisions.

https://github.com/PositiveTechnologies/t1ha

"First discovered" attribute of resources

Usually, for sorting by "date" timestamp "last modified" is used. This timestamp comes from common filesystems and means last time file was modified, but also it can be updated without actual content modification. And in case of such modification, our app considers the resource to be the same due to usage of content-addressing. Our app is also supposed to be used in distributed setup (at this moment, by using external syncing app like Syncthing). When a resource is replicated to other devices, all replicas have different "last modified" attribute.

It makes more sense for a user to think about resource creation time. Some filesystems has "created" timestamp, but such a timestamp would be reset every time the user moves the resource. We can provide semantically similar attribute "first discovered" which would mean the time when the resource was first time indexed by any of our apps on any of the user's devices.

This timestamp should be stored in the (persisted and replicated) index, so it would be propagated to other devices.

Chunked mapping for storages

A storage is a subfolder of .ark, e.g. .ark/index or .ark/tags. It represents a mapping from ResourceId to some T.

For .ark/index, the T is Path. And for .ark/tags, the T is Set<String>. Each entry can be represented by a file .ark/<storage>/<resource_id> with a single line content. This kind of storage should give us the least amount of read/write conflicts, but not very efficient for syncing and reading. Old chunks could be batched into bigger multi-line files.

So, chunked storage would be a set of files like this:

.ark/<storage_name>/<batch_id1>
|-- <resource_id1> -> <value1>
|-- <resource_id2> -> <value2>

.ark/<storage_name>/<resource_id3>
|-- <value3>

.ark/<storage_name>/<batch_id2>
|-- <resource_id4> -> <value4>
|-- <resource_id5> -> <value5>
|-- <resource_id6> -> <value6>

.ark/<storage_name>/<resource_id7>
|-- <value7>

Link-to-Web: separate URL and metadata

At the moment, we write all fields in single JSON: URL, title and description; all these fields are treated as parts of resource. This causes ResourceId be recalculated every time title or description changes. ResourceId should depend only on the URL. The rest must be written into upcoming metadata storage.

This changes the very core of ARK Shelf app (mobile and desktop versions).

Missing file causes index loading failure and leads to complete rebuild

Build index of some folder using ark-cli monitor.
Verify .ark/index exists.
Remove a file from the folder. Verify the index has entry with its path.
Load index.
Last step fails, the index is completely rebuilt in this case.

Pay attention to WARN lines:

[kirill@lenovo TEST]$ RUST_LOG=info ark-cli monitor
Building index of folder /tmp/TEST
[2023-03-06T16:21:33Z INFO  arklib] Index has not been registered before
[2023-03-06T16:21:33Z INFO  arklib::index] Loading the index from file
[2023-03-06T16:21:33Z WARN  arklib::index] No persisted index was found by path /tmp/TEST/.ark/index
[2023-03-06T16:21:33Z INFO  arklib::index] Building the index from scratch
[2023-03-06T16:21:36Z INFO  arklib::index] Index built
[2023-03-06T16:21:36Z INFO  arklib::index] Storing the index to file
[2023-03-06T16:21:36Z INFO  arklib] Index was registered
Build succeeded in 3.667931346s

Updating succeeded in 158.747289ms

^C
[kirill@lenovo TEST]$ RUST_LOG=info ark-cli monitor     #just load index from the file and check for updates
Building index of folder /tmp/TEST
[2023-03-06T16:21:57Z INFO  arklib] Index has not been registered before
[2023-03-06T16:21:57Z INFO  arklib::index] Loading the index from file
[2023-03-06T16:21:58Z INFO  arklib::index] Storing the index to file
[2023-03-06T16:21:58Z INFO  arklib] Index was registered
Build succeeded in 297.370813ms

^C
[kirill@lenovo TEST]$ grep gagarin .ark/index
1678115269692 200433-880886451 gagarin.jpg
[kirill@lenovo TEST]$ mv gagarin.jpg /tmp/

[kirill@lenovo TEST]$ RUST_LOG=info ark-cli monitor     #must load the index and remove disappeared resource
Building index of folder /tmp/TEST
[2023-03-06T16:22:43Z INFO  arklib] Index has not been registered before
[2023-03-06T16:22:43Z INFO  arklib::index] Loading the index from file
[2023-03-06T16:22:43Z WARN  arklib::index] No such file or directory (os error 2)
[2023-03-06T16:22:43Z INFO  arklib::index] Building the index from scratch
[2023-03-06T16:22:47Z INFO  arklib::index] Index built
[2023-03-06T16:22:47Z INFO  arklib::index] Storing the index to file
[2023-03-06T16:22:47Z INFO  arklib] Index was registered
Build succeeded in 3.651734571s

Thanks @mdrlzy for discovering this bug.

Write files atomically

We can't allow half-written files in storages, even in cache.

Write data to a temporary file, then move it to the target path.

Configure build process

Depends on ARK-Builders/ARK-Navigator#163
The library must be built in 2 modes: Debug and Release, both must be published and downloadable from GitHub Actions.
ARK Navigator and other dependents should be configured to use Release build by default.

Store relative paths in the index mappings

Only relative paths should be used in the internal index mappings
Canonicalization must be performed before registering the paths
The common prefix ("root") should be trimmed from the paths
The lib consumers must be able to reconstruct absolute paths

`Invalid cross-device link` crash when working with temporary files

Simplified procedure for atomic writing is "write to a temporary file, then hard link to the destination". This becomes a problem when the "destination" and "temporary file" reside on different filesystems, e.g. /tmp and /home.

Possible solutions:

Create temporary files in same folder as hidden files, then delete them
Use plain copy instead of hard link (slower?)

Better privacy with random unique device ids

We introduce machine-uid crate in this PR:

The crate comes with a disclaimer:

In Linux, machine id is a single newline-terminated, hexadecimal, 32-character, lowercase ID. When decoded from hexadecimal, this corresponds to a 16-byte/128-bit value. This ID may not be all zeros. This ID uniquely identifies the host. It should be considered “confidential”, and must not be exposed in untrusted environments. And do note that the machine id can be re-generated by root.

Alternative would be generating a random device id, storing it in app data folder, sharing it privately with other devices when necessary, etc.

We've decided in favor of machine-uid because it's much easier to implement and outside entity can't figure out the ids if secure transport is utilized. But with unique random ids, unencrypted transport could be used. If privacy becomes a concern, we should implement this approach.

Better handling of _id_ collisions

At this moment, our index is structured like this:

pub struct ResourceIndex {
    pub id2path: HashMap<ResourceId, CanonicalPathBuf>,
    pub path2id: HashMap<CanonicalPathBuf, IndexEntry>,

    pub collisions: HashMap<ResourceId, usize>,

    root: PathBuf,
}

Ideally, we need id2path to have values of type HashSet<CanonicalPathBuf> because an id can have multiple path attached due to id collisions. We track a number of these collisions, but removing an id in a generic way still requires iteration through all paths to find matching entries (see the PR #57). Otherwise, if just take the path from id2path and remove it, we'll have the id left in path2id which is unreachable from id2path. And collisions[id] would be positive, too.

An idea I had some time before is composite ids:

ARK-Builders/ARK-Navigator#147

The way I see it functioning is:

When we index a resource, we compute it's id in a fast way (CRC-32 + file size).
If there is no entry in the collisions mapping for the id, we just add it into id2path.
If there is an entry in the collisions mapping for the id, we add value of another quick hashing function to it.
If later another resource collides with this composite id again, we use SHA-256 for it.

We can use SHA-256 instead of using several hash functions, but it's very slow so we want to avoid it because we work with local user files. We might have background indexing process which upgrades fast ids to secure ids though.

Either way, if we do tricks with ids we have a problem:

Library consumer's storages need to be updated with new ids, e.g. we emit events like Upgraded(old_id, new _id).
If we lose the index file between app runs, we cannot catch up and upgrade consumer's storages. We can only present users with list of colliding paths involved into any of user's mappings and ask to manually select correct values...

Empty index is generated if name of the root folder starts with `.`

Hidden subfolders must be ignored during the filesystem traversal. But if the root is hidden we still need to perform the traversal.

Cover index construction and updates with unit tests

Index module must be well-tested to ensure there are no significant bugs:
https://github.com/ARK-Builders/arklib/blob/main/src/index.rs

Test cases must be run during CI.

Experiment: Android/iOS bindings using uniffi-rs

Here is an example:
https://github.com/mozilla/uniffi-rs/blob/main/examples/todolist/src/todolist.udl

We have bindings for Android in https://github.com/ARK-Builders/arklib-android but we maintain them manually. Uniffi-rs might help reducing boilerplate and necessity to update bindings manually.

If it also provides equivalent bindings for iOS from the same file, it's especially beneficial.

Git pre-commit hook calling `cargo fmt`

This would prevent pushing code breaking code style.

Ensure that `update_all` and `update_one` are consistent

Unit tests covering update_one
Unit tests covering update_all

Then, we want to verify that:

update_all can be replaced by multiple update_one

Unit tests for simple cases with both update_all and update_one
Randomized test scenario checking different combinations

Refactor the API for `ResourceIndex`

This issue is an open discussion to changes proposed to the API in src/index.rs.

Current situation

pub struct ResourceIndex {
    /// A mapping of resource IDs to their corresponding file paths
    id2path: HashMap<ResourceId, PathBuf>,
    /// A mapping of file paths to their corresponding index entries
    path2id: HashMap<PathBuf, IndexEntry>,
    /// A mapping of resource IDs to the number of collisions they have
    pub collisions: HashMap<ResourceId, usize>,
    /// The root path of the index
    root: PathBuf,
}


impl ResourceIndex {
/// Returns the number of entries in the index
///
/// Note that the amount of resource can be lower in presence of collisions
pub fn count_files(&self) -> usize;

/// Returns the number of resources in the index
pub fn count_resources(&self) -> usize;

/// Builds a new resource index from scratch using the root path
///
/// This function recursively scans the directory structure starting from
/// the root path, constructs index entries for each resource found, and
/// populates the resource index
pub fn build<P: AsRef<Path>>(root_path: P) -> Self;

/// Loads a previously stored resource index from the root path
///
/// This function reads the index from the file system and returns a new
/// [`ResourceIndex`] instance. It looks for the index file in
/// `$root_path/.ark/index`.
///
/// Note that the loaded index can be outdated and `update_all` needs to
/// be called explicitly by the end-user. For automated updating and
/// persisting the new index version, use [`ResourceIndex::provide()`] method.
pub fn load<P: AsRef<Path>>(root_path: P) -> Result<Self>;    

/// Stores the resource index to the file system
///
/// This function writes the index to the file system. It writes the index
/// to `$root_path/.ark/index` and creates the directory if it's absent.
pub fn store(&self) -> Result<()>;

/// Provides the resource index, loading it if available or building it from
/// scratch if not
///
/// If the index exists at the provided `root_path`, it will be loaded,
/// updated, and stored. If it doesn't exist, a new index will be built
/// from scratch
pub fn provide<P: AsRef<Path>>(root_path: P) -> Result<Self>;

/// Updates the index based on the current state of the file system
///
/// Returns an [`IndexUpdate`] object containing the paths of deleted and
/// added resources
pub fn update_all(&mut self) -> Result<IndexUpdate>;

/// Indexes a new entry identified by the provided path, updating the index
/// accordingly.
///
/// The caller must ensure that:
/// - The index is up-to-date except for this single path
/// - The path hasn't been indexed before
///
/// Returns an error if:
/// - The path does not exist
/// - Metadata retrieval fails
pub fn index_new(&mut self, path: &dyn AsRef<Path>) -> Result<IndexUpdate>;


/// Updates a single entry in the index with a new resource located at the
/// specified path, replacing the old resource associated with the given
/// ID.
///
/// # Restrictions
///
/// The caller must ensure that:
/// * the index is up-to-date except for this single path
/// * the path has been indexed before
/// * the path maps into `old_id`
/// * the content by the path has been modified
///
/// # Errors
///
/// Returns an error if the path does not exist, if the path is a directory
/// or an empty file, if the index cannot find the specified path, or if
/// the content of the path has not been modified.
pub fn update_one(
	&mut self,
	path: &dyn AsRef<Path>,
	old_id: ResourceId,
) -> Result<IndexUpdate>;

/// Inserts an entry into the index, updating associated data structures
///
/// If the entry ID already exists in the index, it handles collisions
/// appropriately
fn insert_entry(&mut self, path: PathBuf, entry: IndexEntry);

/// Removes the given resource ID from the index and returns an update
/// containing the deleted entries
pub fn forget_id(&mut self, old_id: ResourceId) -> Result<IndexUpdate>;

/// Removes an entry with the specified path and updates the collision
/// information accordingly
///
/// Returns an update containing the deleted entries
fn forget_path(
	&mut self,
	path: &Path,
	old_id: ResourceId,
) -> Result<IndexUpdate>;

Some of the issues with the current structure are:

We have 2 mappings in ResoureIndex which prevents us from simply serializing the index to a file
Methods such as provide may be confusing when compared to other methods

Proposal

We could consider this structure for ResourceIndex

#[derive(PartialEq, Clone, Debug, Serialize, Deserialize)]
pub struct IndexedResource {
    pub id: ResourceId,
    pub path: PathBuf,
    pub last_modified: SystemTime,
}

#[derive(PartialEq, Clone, Debug, Serialize, Deserialize)]
pub struct ResourceIndex {
    pub resources: Vec<IndexedResource>,
    pub root: PathBuf,
}

To port the conversation from #87 (comment) : We should consider refactoring the methods so that the API has 2 halves:
- "Snapshot API" allowing to query particular paths or ids, separate functions for relative and absolute paths. We could also implement something like Iter interface.
- "Reactive API" allowing updates handling without explicit queries. It can be done in pull-based manner like now (update_all), push-based using Tokio streams and/or registering handlers which would be called automatically (variant of push-based)

Filesystem monitoring using `notify-rs`

Can we benefit from this crate?
https://github.com/notify-rs/notify

provide_index should return IndexUpdate

Because of this, in arklib-android we have to expect that all resources were just added at startup
And we can't find out what resources were deleted

Previews generation: PDF

By using a Rust crate we can solve both issues with PDFs in ARK Navigator:
ARK-Builders/ARK-Navigator#153
ARK-Builders/ARK-Navigator#157

Maybe something from here?

Previews should be possible to generate in both low and high quality: the function should accept quality parameter with acceptable values low, medium, high where high is looking nice on laptop and zoomable. For easier verification, this function should have dedicated command in this tool: https://github.com/ARK-Builders/ark-cli which would accept path to a PDF file and save JPG/PNG to another file.

Declare storage path constants

This repo must be the "single source of truth" for the whole ARK project.

We have also ArkLib Android repo providing bindings to Rust code for Android and also some other functionality. That extra functionality will go to this repo eventually.

An important file from the Android repo defines storage path constants:

object ArkFiles {
    const val ARK_FOLDER = ".ark"
    const val STATS_FOLDER = "stats"
    const val FAVORITES_FILE = "favorites"

    // User-defined data
    const val TAG_STORAGE_FILE = "user/tags"
    const val SCORE_STORAGE_FILE = "user/scores"
    const val PROPERTIES_STORAGE_FOLDER = "user/properties"

    // Generated data
    const val METADATA_STORAGE_FOLDER = "cache/metadata"
    const val PREVIEWS_STORAGE_FOLDER = "cache/previews"
    const val THUMBNAILS_STORAGE_FOLDER = "cache/thumbnails"
}

These constants must be moved to this repo and be imported to the Android repo via bindings.

Aggregated Resource Indexes

It's necessary to initialize compound indexes, store them somewhere in library process, allow to re-use them, allow to use them as normal indexes. Aggregated indexes must provide interface as close to "plain" index interface as possible. Aggregated indexes should use plain indexes as shards delegating operations to them (execute some operation on all shards, if any succeeded return its result).
https://github.com/ARK-Builders/ARK-Navigator/blob/1d6cfa9a15d95a2ca1d7628042142f972931393f/app/src/main/java/space/taran/arknavigator/mvp/model/repo/index/AggregatedResourcesIndex.kt

Am I right that after we loaded Rust library in an Android app, we can:

call methods from it multiple times without re-loading the library
expect state be persisted between calls?

Because the same index should be re-usable in different aggregated indexes and also the app can request the same aggregated index again several times, would be unnecessary to re-construct them all the time.

Index generic over hash function

We can abstract ResourceIndex over the id type and, consequently, over hashing function.

#[derive(Eq, Ord, PartialEq, PartialOrd, Hash, Clone, Debug)]
pub struct IndexEntry<Id> {
    pub modified: SystemTime,
    pub id: Id,
}

#[derive(PartialEq, Clone, Debug)]
pub struct ResourceIndex<Id> {
    pub id2path: HashMap<Id, CanonicalPathBuf>,
    pub path2id: HashMap<CanonicalPathBuf, IndexEntry<Id>>,

    pub collisions: HashMap<Id, usize>,
    root: PathBuf,
}

This way, we can use cryptographic hash functions to test index in simpler cases, when no collisions are present. This can simplify development and debugging. We could also use fake hash functions for testing only collisions.

Cryptographic hash function can also be used for apps where safety is more important than performance. The fast hash functions will be used in experimental "fast mode", it's the most useful for file browser apps.

ARK Shelf crashes when loading some test data

ARK Shelf app crashes when loading some test data provided here:
ARK-Builders/ARK-Navigator#412 (comment)

Below is the crash stack trace:

Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 13813 (DefaultDispatch), pid 13787 (ilders.arkshelf)
Cmdline: dev.arkbuilders.arkshelf
pid: 13787, tid: 13813, name: DefaultDispatch  >>> dev.arkbuilders.arkshelf <<<
      #01 pc 000000000060f204  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #02 pc 000000000060c9d0  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #03 pc 000000000060c7f4  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #04 pc 000000000060c540  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #05 pc 000000000060b234  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #06 pc 000000000060c290  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #07 pc 000000000062b63c  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #08 pc 000000000062b95c  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so
      #09 pc 00000000002e9f84  /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!libarklib.so (Java_dev_arkbuilders_arklib_LibKt_loadLinkFileNative+1272)
      #12 pc 0000000000422c3a  [anon:dalvik-classes.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk]
      #14 pc 000000000000558e  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes3.dex]
      #16 pc 0000000000005234  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes3.dex]
      #18 pc 000000000000514a  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes3.dex]
      #20 pc 00000000004a7baa  [anon:dalvik-classes.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk]
      #22 pc 00000000001586aa  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #24 pc 000000000014af02  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #26 pc 00000000004a7bfe  [anon:dalvik-classes.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk]
      #28 pc 0000000000150eac  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #30 pc 000000000018654e  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #32 pc 000000000018dffe  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #34 pc 000000000018d102  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #36 pc 000000000018be12  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #38 pc 000000000018bf40  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]
      #40 pc 000000000018bef0  [anon:dalvik-classes9.dex extracted in memory from /data/app/~~IUZgOgwoWHgu_I0i8Furdg==/dev.arkbuilders.arkshelf--T4EM_LqAKrWhBDSX9c5YA==/base.apk!classes9.dex]

Absent index file and parse errors handling

Using ark-cli:

Create or copy some test folder with data.
Make sure there are no .ark/index file.
Run ark-cli monitor, wait till the index is computed, Ctrl+C.
Check .ark.index.

Expected: the index file exists.
Actual: it is absent.

Example:

[kirill@lenovo test]$ rm .ark/index 
[kirill@lenovo test]$ ../ark-cli monitor
Building index of folder /tmp/test
Build succeeded in 12.901021531s

Updating succeeded in 287.573914ms

Updating succeeded in 283.567582ms

Updating succeeded in 292.411364ms

^C
[kirill@lenovo test]$ ls -lah .ark/index
ls: cannot access '.ark/index': No such file or directory

If you enable debug log, you can see there is an IO error in arklib (absent file).

[kirill@lenovo test]$ RUST_LOG=debug ../ark-cli monitor
Building index of folder /tmp/test
[2024-01-03T15:29:07Z INFO  arklib] Index has not been registered before
[2024-01-03T15:29:07Z INFO  arklib::index] Loading the index from file /tmp/test/.ark/index
[2024-01-03T15:29:07Z WARN  arklib::index] IO error
[2024-01-03T15:29:07Z INFO  arklib::index] Building the index from scratch
[2024-01-03T15:29:07Z DEBUG arklib::index] Discovering all files under path /tmp/test
^C

Apparently, IO errors are not handled in a nice way.

Previews generation: PDF

By using a Rust crate we can solve both issues with PDFs in ARK Navigator:
ARK-Builders/ARK-Navigator#153
ARK-Builders/ARK-Navigator#157

Maybe something from here?

Function `update_one` cannot accept `CanonicalPathBuf` for deleted files

This is an oversight of implementation in #42

See #38 for more context. TL;DR: We use update_one for cases when a resource by some path changed or was deleted at all. We can't canonicalize a non-existent path, so we fail to call update_one in this case. This API should receive plain Path-like type: Path, PathBuf or be generic with AsRef<Path>.

"Created" date

It seems that we have 3 different ways to provide created_at timestamp:

extracted automatically for specific resource kinds, e.g. from EXIF
must be stored in .ark/cache/metadata
- ARK-Builders/ark-android#56
specified manually by user or automatically by an app
must be stored in .ark/user/properties
- ARK-Builders/arklib-android#89
generated automatically for any resource during its first indexing
must extend .ark/index
- #66

Tags storage

Core feature of ARK Navigator is tags-based resources filtering. A resource identified by ResourceId can have multiple tags attached to it. A tag is just a string. This way, tags storage is just a mapping between identifiers and sets of strings.

The storage is persisted using hidden .ark-tags file. It is assumed, that .ark-tags file is replicated to other user devices (e.g. phone and tablet can have tags in sync) by using external software like Syncthing. This way, all user devices can have tags in sync.

In this task, loading and persisting of .ark-tags file must be ported from Kotlin to Rust. Functions of Storage interface must be implemented in the lib. The library must be stateful, keeping data in memory between calls. The library must support having several tag storages in memory simultaneously, so StorageId should be returned upon loading and be used in function calls by client app.

Loading of the storage should be done by passing root folder path from the client app, not by path of .ark-tags itself since structure of internal files can change in future and the library must automatically locate necessary internal files.

At this stage, it seems to be redundant to port Sharded storage. The most important is to stick storage format to the lib, so apps depending on the lib would always use the same format of tags storage.

Darwin (macOS) and `machine-uid` crate

This is a blocker for cross-platform functioning.

Update method should return complete resources with their details

ResourceIndex should pass complete resources with their details during update.
Right now, only ResourceIds are passed. This causes the library clients to reconstruct details again.

See fun compute in Resource.kt (https://github.com/ARK-Builders/arklib-android):
https://github.com/ARK-Builders/arklib-android/blob/f95ffc3b97c18e00f0cbed40b6b5c854254cab1c/lib/src/main/java/space/taran/arklib/domain/index/Resource.kt#L22

Persisted/Replicated index

See ARK-Builders/ARK-Navigator#142 for the context.
TL;DR: At the moment, index is built for any "root" folder and stored in memory.

Would be cool to persist it, so we wouldn't recalculate resource ids on different devices. If we store the index into our .ark folder, then it cache gets synced to other devices (that's why replicated). Smart write to avoid conflicts is necessary, but would be too difficult for this moment, let's assume now that only 1 device writes index into some root at the same moment.

Let's fix .ark/index path for it.

Lightweight, single-resource update method

Technically, it is more sound to perform a full update, even when only a single resource is of interest. Because other resources could have been changed, too. However, it is not always easy for the client to process additional updates.

Single-resource update function could be convenient. This function should receive path of the resource to update.

Here are some scenarios where this function would be applicable:

We could use this function when we delete a resource, although it would always return Deleted event.
When an external application is used to modify a resource. In this case, we know that its id has changed, and we want it right in the place where we returned from the external app. For instance, a user browses resources in Gallery, edits an image, and once saving completed, returns to the Gallery. At this moment, either all resources must be updated, or the edited image must be temporarily removed. With the single-resource update, the edited image could simply be updated and presented to the user.

Benchmark `crc32` and `blake3` hash functions

We use crc32fast crate to generate ResourceId. It was one of the fastest hash functions 3 years ago, when blake3 was invented.

Official metrics:

Blake3, AWS c5.metal, 16 KiB input, 1 thread:
6866 MiB/s
CRC32, unknown env and run parameters:
baseline: 1499 MiB/s,
pclmulqdq: 7314 MiB/s

Let's create a small benchmark to compare them in same environment, with same parameters.

Even if blake3 is same performant as crc32, it would be worth updating arklib, because blake3 is cryptographic hash function and it means no collisions in the index.

Test Clean-Up necessary

The Test
atomic/files.rs

    #[test]
    fn multiple_version_files() {

leaves the artifact {}_cellphoneID in the tmp folder.

Correct Behavior: It should be cleaned up after the test.

Always provide updated index

See RootIndex.kt in arklib-android:

BindingIndex.update(path)
BindingIndex.store(path)

Right now, we do updating twice if the index is being provided for the first time.
This updating should be done in arklib during provision, if we already have index instance.

Simple performance benchmark and GitHub Actions

Benchmark should be ran in CI upon all commits in main.

Better privacy with hashing `machine-uid` ids

https://www.freedesktop.org/software/systemd/man/latest/machine-id.html