Giter Site home page Giter Site logo

dogsheep / dogsheep-beta Goto Github PK

View Code? Open in Web Editor NEW
180.0 9.0 6.0 61 KB

Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette

Home Page: https://dogsheep.github.io/

Python 86.09% HTML 13.91%
dogsheep search datasette-plugin datasette-io datasette datasette-tool

dogsheep-beta's Introduction

dogsheep-beta

PyPI Changelog Tests License

Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette

Example

A live example of this plugin is running at https://datasette.io/-/beta - configured using this YAML file.

Read more about how this example works in Building a search engine for datasette.io.

Installation

Install this tool like so:

$ pip install dogsheep-beta

Usage

Run the indexer using the dogsheep-beta command-line tool:

$ dogsheep-beta index dogsheep.db config.yml

The config.yml file contains details of the databases and document types that should be indexed:

twitter.db:
    tweets:
        sql: |-
            select
                tweets.id as key,
                'Tweet by @' || users.screen_name as title,
                tweets.created_at as timestamp,
                tweets.full_text as search_1
            from tweets join users on tweets.user = users.id
    users:
        sql: |-
            select
                id as key,
                name || ' @' || screen_name as title,
                created_at as timestamp,
                description as search_1
            from users

This will create a search_index table in the dogsheep.db database populated by data from those SQL queries.

By default the search index that this tool creates will be configured for Porter stemming. This means that searches for words like run will match documents containing runs or running.

If you don't want to use Porter stemming, use the --tokenize none option:

$ dogsheep-beta index dogsheep.db config.yml --tokenize none

You can pass other SQLite tokenize argumenst here, see the SQLite FTS tokenizers documentation.

Columns

The columns that can be returned by our query are:

  • key - a unique (within that type) primary key
  • title - the title for the item
  • timestamp - an ISO8601 timestamp, e.g. 2020-09-02T21:00:21
  • search_1 - a larger chunk of text to be included in the search index
  • category - an integer category ID, see below
  • is_public - an integer (0 or 1, defaults to 0 if not set) specifying if this is public or not

Public records are things like your public tweets, blog posts and GitHub commits.

Categories

Indexed items can be assigned a category. Categories are integers that correspond to records in the categories table, which defaults to containing the following:

id name
1 created
2 saved
3 received

created is for items that have been created by the Dogsheep instance owner.

saved is for items that they have saved, liked or favourited.

received is for items that have been specifically sent to them by other people - incoming emails or direct messages for example.

Datasette plugin

Run datasette install dogsheep-beta (or use pip install dogsheep-beta in the same environment as Datasette) to install the Dogsheep Beta Datasette plugin.

Once installed, a custom search interface will be made available at /-/beta. You can use this interface to execute searches.

The Datasette plugin has some configuration options. You can set these by adding the following to your metadata.json configuration file:

{
    "plugins": {
        "dogsheep-beta": {
            "database": "beta",
            "config_file": "dogsheep-beta.yml",
            "template_debug": true
        }
    }
}

The configuration settings for the plugin are:

  • database - the database file that contains your search index. If the file is beta.db you should set database to beta.
  • config_file - the YAML file containing your Dogsheep Beta configuration.
  • template_debug - set this to true to enable debugging output if errors occur in your custom templates, see below.

Custom results display

Each indexed item type can define custom display HTML as part of the config.yml file. It can do this using a display key containing a fragment of Jinja template, and optionally a display_sql key with extra SQL to execute to fetch the data to display.

Here's how to define a custom display template for a tweet:

twitter.db:
    tweets:
        sql: |-
            select
                tweets.id as key,
                'Tweet by @' || users.screen_name as title,
                tweets.created_at as timestamp,
                tweets.full_text as search_1
            from tweets join users on tweets.user = users.id
        display: |-
            <p>{{ title }} - tweeted at {{ timestamp }}</p>
            <blockquote>{{ search_1 }}</blockquote>

This example reuses the value that were stored in the search_index table when the indexing query was run.

To load in extra values to display in the template, use a display_sql query like this:

twitter.db:
    tweets:
        sql: |-
            select
                tweets.id as key,
                'Tweet by @' || users.screen_name as title,
                tweets.created_at as timestamp,
                tweets.full_text as search_1
            from tweets join users on tweets.user = users.id
        display_sql: |-
            select
                users.screen_name,
                tweets.full_text,
                tweets.created_at
            from
                tweets join users on tweets.user = users.id
            where
                tweets.id = :key
        display: |-
            <p>{{ display.screen_name }} - tweeted at {{ display.created_at }}</p>
            <blockquote>{{ display.full_text }}</blockquote>

The display_sql query will be executed for every search result, passing the key value from the search_index table as the :key parameter and the user's search term as the :q parameter.

This performs well because many small queries are efficient in SQLite.

If an error occurs while rendering one of your templates the search results page will return a 500 error. You can use the template_debug configuration setting described above to instead output debugging information for the search results item that experienced the error.

Displaying maps

This plugin will eventually include a number of useful shortcuts for rendering interesting content.

The first available shortcut is for displaying maps. Make your custom content output something like this:

<div
    data-map-latitude="{{ display.latitude }}"
    data-map-longitude="{{ display.longitude }}"
    style="display: none; float: right; width: 250px; height: 200px; background-color: #ccc;"
></div>

JavaScript on the page will look for any elements with data-map-latitude and data-map-longitude and, if it finds any, will load Leaflet and convert those elements into maps centered on that location. The default zoom level will be 12, or you can set a data-map-zoom attribute to customize this.

Development

To set up this plugin locally, first checkout the code. Then create a new virtual environment:

cd dogsheep-beta
python3 -mvenv venv
source venv/bin/activate

Or if you are using pipenv:

pipenv shell

Now install the dependencies and tests:

pip install -e '.[test]'

To run the tests:

pytest

dogsheep-beta's People

Contributors

simonw avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

dogsheep-beta's Issues

Mechanism for differentiating between "by me" and "liked by me"

Some of the content I'm indexing is by me - photos I've taken, tweets I wrote, commits, comments I posted.

Some of it is stuff that I've "liked" or "bookmarked" in some way - favourited tweets, Pocket articles, starred GitHub repos.

It woud be useful to be able to differentiate between the two.

Public / Private mechanism

Some of the data in Dogsheep is stuff that was written publicly - tweets, blog posts, GitHub commits to public repos.

Some of it is private data - emails, photos, direct messages, Swarm checkins, commits to private repos.

Being able to filter for just one or the other (or both) would be useful. Especially when giving demos!

Switch to using datasette.client

datasette.client is designed for this kind of thing, to replace this code:

inner_request = Request.fake(path_with_query_string)
view = TableView(datasette)
data, _, _ = await view.data(
inner_request,
database=database_name,
hash=None,
table="search_index",
_next=None,
)
count, facets = data["filtered_table_rows_count"], data["facet_results"]

Add a context column that's not searchable

I sometimes like to configure titles that are things like "Comment on issue X" or "Photo in Golden Gate Park" - these shouldn't be included in the search index but should be stored so they can be displayed to provide context.

Add a column for this - probably called context - and make it so it can be populated.

Getting sqlite_master may not be modified when creating dogsheep index

When creating a dogsheep index from config.yml file on pocket.db (created using pocket-to-sqlite), I am getting this error

Traceback (most recent call last):
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/bin/dogsheep-beta", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/dogsheep_beta/cli.py", line 36, in index
    run_indexer(
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/dogsheep_beta/utils.py", line 32, in run_indexer
    ensure_table_and_indexes(db, tokenize)
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/dogsheep_beta/utils.py", line 91, in ensure_table_and_indexes
    table.add_foreign_key(*fk)
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/sqlite_utils/db.py", line 2155, in add_foreign_key
    self.db.add_foreign_keys([(self.name, column, other_table, other_column)])
  File "/Users/khushmeeet/.pyenv/versions/3.11.2/lib/python3.11/site-packages/sqlite_utils/db.py", line 1116, in add_foreign_keys
    cursor.execute(
sqlite3.OperationalError: table sqlite_master may not be modified

Command I ran to get this error

dogsheep-beta index pocket.db config.yml

Dogsheep version

dogsheep-beta, version 0.10.2

Python version

Python 3.11.2

Create a view for running faceted searches

select
  search_index_fts.rank,
  search_index.rowid,
  search_index.[table],
  search_index.key,
  search_index.title,
  search_index.timestamp,
  search_index.search_1
from
  search_index join search_index_fts on search_index.rowid = search_index_fts.rowid
order by
  search_index_fts.rank, search_index.timestamp desc

Upgrade to sqlite-utils 3.0 (tests are failing)

        results = beta_db["search_index"].search("run")
        if use_porter:
            assert results == [
                (
                    "dogs.db/dogs",
                    "1",
                    "Cleo",
                    "2020-08-22 04:41:33",
                    1,
                    0,
                    "running",
                    None,
                    None,
                )
            ]
        else:
>           assert results == []
E           assert <generator ob...x7f3c080a3e08> == []
E             +<generator object Table.search at 0x7f3c080a3e08>
E             -[]
E             Full diff:
E             - []
E             + <generator object Table.search at 0x7f3c080a3e08>

This was caused by a backwards incompatible change in sqlite-utils 3.0: https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v3-0

Pagination

Useful for #16 (timeline view) since you can now filter to just the items on a specific day - but if there are more than 50 items you can't see them all.

cannot use jinja filters in display?

Hi, I'm trying to have a display function in Dogsheep's config.yml that includes something like this:

<h3>
  <a href="{{ urls.row('my_database', 'my_table', key) }}">{{ display.title }}</a>
  <a href="{{ display.url }}๐Ÿ”—" target="_blank">(source)</a>
</h3>
<p>{{ display.snippet|safe }}</p>

Unfortunately, rendering fails with a message 'urls is undefined'.

The same happens if I'm trying to build a row URL manually, using filters like quote_plus (as my keys are URLs).

Any hints?

Thanks!

Data folder as index command parameter

Hi,

First of all, thank you for this wonderful project ๐Ÿ˜„

I started to use dogsheep to make my personal data searchable, and by using the project I noticed an issue with the index command.

It always expects you are running it from the root folder from where the data is located, so I got some errors while trying to make it work on my setup.

I separate all databases inside a data folder (I published my setup to be easier to follow: https://github.com/humrochagf/my-dogsheep)

Before, I configured dogsheep.yml to add the data folder to its path like this:

data/twitter.db:
    tweets:
        sql: |-
...

And running the index command like this:

dogsheep-beta index data/dogsheep.db dogsheep.yml

It worked to the normal search feature with no problem this way, but when I started adding display_sql rules the app started to crash, because at datasette get_database it was looking for data/twitter and it only had a db called twitter there.

So my workaround to that was to cd into the data folder and run the indexer. You can check the way I'm doing it at this line of the makefile: https://github.com/humrochagf/my-dogsheep/blob/main/makefile#L3

It works but it would be nice to have an option to pass the path where the data is located to the index function.

Rename "table" to "type"

I think "table" is the wrong name for the concept I'm using it for here.

Two reasons: firstly, table is a reserved word in SQLite. More importantly, it turns out there's not a direct mapping from tables to types of search result. In particular, for GitHub I ended up having two different "tables" of repositories - one for repos created by me, another for repos that I have starred.

Deleted records stay in the search index

Here's why:

sql_rest = sql.split("select", 1)[1]
sql = "select '{}/{}' as [table],{}".format(db_name, table, sql_rest)
columns = derive_columns(other_db, sql)
with other_db.conn:
other_db.conn.execute(
"REPLACE INTO index1.search_index ({}) {}".format(
", ".join("[{}]".format(column) for column in columns), sql
)
)
other_db.conn.close()

That should probably do DELETE FROM index1.search_index WHERE [table] = ? first.

template_debug mechanism

I'd prefer it if errors in these template fragments were displayed as errors inline where the fragment should have been inserted, rather than 500ing the whole page - especially since the template fragments are user-provided and could have all kinds of odd errors in them which should be as easy to debug as possible.
Originally posted by @simonw in #24 (comment)

Figure out incremental re-indexing

As tables get bigger reindexing everything on a schedule (essentially recreating the entire index from scratch) will start to become a performance bottleneck.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.