Omochi 😊

A full-text search engine written from scratch in Go ʕ◔ϖ◔ʔ (just a toy)

✨ Features

Omochi is an inverted-index-based search engine written in Go.
Any document can be searched once it has been indexed.
Documents are searched through a RESTful API.
Supported languages: English and Japanese.
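
As a rough illustration of what "inverted-index-based" means, the sketch below builds a map from each token to the IDs of the documents that contain it. It is not Omochi's actual implementation: the names InvertedIndex, tokenize, and Build are hypothetical, the tokenizer only splits English text on non-alphanumeric characters, and Japanese text would need a morphological analyzer instead.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// InvertedIndex maps each token to the sorted IDs of the documents containing it.
// (Hypothetical type for illustration; not taken from the Omochi source.)
type InvertedIndex map[string][]int

// tokenize lowercases the text and splits it on every rune that is
// neither a letter nor a digit. This is a simplification for English only.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Build creates an inverted index from a map of document ID to content.
func Build(docs map[int]string) InvertedIndex {
	idx := InvertedIndex{}
	for id, content := range docs {
		seen := map[string]bool{} // avoid adding the same ID twice per token
		for _, tok := range tokenize(content) {
			if seen[tok] {
				continue
			}
			seen[tok] = true
			idx[tok] = append(idx[tok], id)
		}
	}
	// Map iteration order is random, so sort each posting list.
	for tok := range idx {
		sort.Ints(idx[tok])
	}
	return idx
}

func main() {
	idx := Build(map[int]string{
		1:   "Toy Story",
		2:   "Jumanji",
		213: "Toy Story 2",
	})
	fmt.Println(idx["toy"]) // prints [1 213]
}
```

Looking up a keyword is then a single map access, which is why search stays fast even when the number of documents grows.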

📍 Requirements

Go 1.18+
Docker 20.10+

📦 Setup

Create network

Create a Docker network (omochi_network) by running:

$ docker network create omochi_network

Database migration

Omochi uses MariaDB to store inverted indexes and documents, and Ent as its ORM.

To migrate the database, open a shell in the Docker container by running:

$ docker-compose run api bash

Then run the database migration:

$ go run ./cmd/migrate/migrate.go

Seed data

To try the search engine, this project provides two sample datasets in TSV format.

The English dataset is a list of movie titles, and the Japanese dataset is a list of Doraemon comic titles.

First, open a shell in the Docker container by running:

$ docker-compose run api bash

Then seed the data by running:

$ go run {path to seed.go}

To initialize with the Japanese dataset, {path to seed.go} should be ./cmd/seeds/ja/seed.go. For the English dataset, use ./cmd/seeds/eng/seed.go.

🏇 Start Application

After completing the setup, start the application by running:

$ docker-compose up

This starts the RESTful API, which listens for connections on port 8081.

🌎 How to use & Demo

After seeding the data, you can search documents by sending a GET request to /v1/document/search.

The query parameters are as follows:

"keywords": The keywords to search for. To search for multiple terms, separate them with commas, like "hoge,fuga,piyo".
"mode": The search mode. The available modes are "And" and "Or".
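
The request URL can also be assembled programmatically. The sketch below uses Go's standard net/url package; searchURL is a hypothetical helper, and the host and port are taken from the examples in this README. Passing an empty mode simply omits the parameter, since the default behaviour of the API in that case is not documented here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// searchURL builds the URL for GET /v1/document/search.
// mode may be "And", "Or", or empty to leave the parameter out.
func searchURL(base string, keywords []string, mode string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/document/search"

	q := url.Values{}
	// Multiple keywords are sent as one comma-separated value.
	q.Set("keywords", strings.Join(keywords, ","))
	if mode != "" {
		q.Set("mode", mode)
	}
	// Encode percent-escapes the values, so the comma becomes %2C.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, err := searchURL("http://localhost:8081", []string{"toy", "story"}, "And")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	// prints http://localhost:8081/v1/document/search?keywords=toy%2Cstory&mode=And
}
```

Percent-encoding matters most for Japanese keywords such as "ドラえもん", which some HTTP clients will not escape automatically.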

Demo

Doraemon comic title dataset

After seeding the Doraemon comic title dataset, you can search for documents that include "ドラえもん" by running:

$ curl "http://localhost:8081/v1/document/search?keywords=ドラえもん" | jq . 
{
  "documents": [
    {
      "id": 12054,
      "content": "ドラえもんの歌",
      "tokenized_content": [
        "ドラえもん",
        "歌"
      ],
      "created_at": "2022-07-08T12:59:49+09:00",
      "updated_at": "2022-07-08T12:59:49+09:00"
    },
    {
      "id": 11992,
      "content": "恋するドラえもん",
      "tokenized_content": [
        "恋する",
        "ドラえもん"
      ],
      "created_at": "2022-07-08T12:59:48+09:00",
      "updated_at": "2022-07-08T12:59:48+09:00"
    },
    {
      "id": 11230,
      "content": "ドラえもん登場！",
      "tokenized_content": [
        "ドラえもん",
        "登場"
      ],
      "created_at": "2022-07-08T12:59:44+09:00",
      "updated_at": "2022-07-08T12:59:44+09:00"
    },
    ...

Movie title dataset

After seeding the Movie title dataset, you can search for documents that include both "toy" and "story" by running:

$ curl "http://localhost:8081/v1/document/search?keywords=toy,story&mode=And" | jq .
{
  "documents": [
    {
      "id": 1,
      "content": "Toy Story",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:49:24+09:00",
      "updated_at": "2022-07-08T13:49:24+09:00"
    },
    {
      "id": 39,
      "content": "Toy Story of Terror!",
      "tokenized_content": [
        "toy",
        "story",
        "terror"
      ],
      "created_at": "2022-07-08T13:49:34+09:00",
      "updated_at": "2022-07-08T13:49:34+09:00"
    },
    {
      "id": 83,
      "content": "Toy Story That Time Forgot",
      "tokenized_content": [
        "toy",
        "story",
        "time",
        "forgot"
      ],
      "created_at": "2022-07-08T13:49:53+09:00",
      "updated_at": "2022-07-08T13:49:53+09:00"
    },
    {
      "id": 213,
      "content": "Toy Story 2",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:50:35+09:00",
      "updated_at": "2022-07-08T13:50:35+09:00"
    },
    {
      "id": 352,
      "content": "Toy Story 3",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:51:23+09:00",
      "updated_at": "2022-07-08T13:51:23+09:00"
    }
  ]
}
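
A Go client could decode this response into structs like the ones below. The field names mirror the JSON keys shown above, but the type names Document and SearchResponse and the decode helper are assumptions made for this sketch, not types from the Omochi source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Document mirrors one entry of the "documents" array in the response.
type Document struct {
	ID               int       `json:"id"`
	Content          string    `json:"content"`
	TokenizedContent []string  `json:"tokenized_content"`
	CreatedAt        time.Time `json:"created_at"` // RFC 3339 timestamp
	UpdatedAt        time.Time `json:"updated_at"`
}

// SearchResponse mirrors the top-level JSON object.
type SearchResponse struct {
	Documents []Document `json:"documents"`
}

// decode parses a response body into a SearchResponse.
func decode(body []byte) (SearchResponse, error) {
	var res SearchResponse
	err := json.Unmarshal(body, &res)
	return res, err
}

func main() {
	body := []byte(`{
	  "documents": [
	    {
	      "id": 1,
	      "content": "Toy Story",
	      "tokenized_content": ["toy", "story"],
	      "created_at": "2022-07-08T13:49:24+09:00",
	      "updated_at": "2022-07-08T13:49:24+09:00"
	    }
	  ]
	}`)
	res, err := decode(body)
	if err != nil {
		panic(err)
	}
	for _, d := range res.Documents {
		fmt.Println(d.ID, d.Content, d.TokenizedContent)
	}
}
```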

📚 Reference

Dataset

Fujiko F. Fujio, Doraemon (Tentomushi Comics), vols. 1-45, Shogakukan, 1974-1996.
Rounak Banik, "The Movies Dataset", Kaggle, https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset. Accessed on 07/08.

Book

🧑‍💻 License

MIT

ikawaha / omochi
