marty1885 / tlgs

"Totally Legit" Gemini Search - Open source search engine for the Gemini protocol

Home Page: https://tlgs.one

License: MIT License

CMake 2.52% C++ 97.31% C 0.17%
gemini gemini-protocol search-engine drogon indexer

tlgs's Introduction

TLGS - Totally Legit Gemini Search

Overview

TLGS is a search engine for Gemini. It's slightly overengineered for what it currently is and uses weird tech. And I'm proud of that. The current code base is kinda messy - I promise to clean it up. The main features/characteristics are as follows:

  • Using the state of the art C++20
  • Parses and indexes textual content on Geminispace
  • Highly concurrent and asynchronous
  • Stores index on PostgreSQL
  • Developed for Linux, but should work on Windows, OpenBSD, HaikuOS, macOS, etc.
  • Only fetches headers for files it can't index, to save bandwidth and time
  • Handles all kinds of source encoding
  • Link analysis using the SALSA algorithm
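
The "only fetch headers" point relies on the Gemini response format, where the first line is a two-digit status, a space, and a META field (the MIME type for 2x responses). The following is a minimal sketch of how a crawler could decide whether to read the body after seeing only that line. The names parse_header and should_fetch_body are hypothetical, and treating only text/* as indexable is an assumption, not TLGS's actual rule.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Parsed form of a Gemini response header line: "<STATUS><SPACE><META>\r\n".
struct GeminiHeader {
    int status;        // two-digit status code, e.g. 20
    std::string meta;  // for 2x responses this is the MIME type
};

// Hypothetical helper: parse one header line without touching the body.
std::optional<GeminiHeader> parse_header(std::string_view line) {
    // Drop the trailing CRLF if it is present.
    if (line.size() >= 2 && line.substr(line.size() - 2) == "\r\n")
        line.remove_suffix(2);
    if (line.size() < 2) return std::nullopt;
    if (line[0] < '0' || line[0] > '9' || line[1] < '0' || line[1] > '9')
        return std::nullopt;
    const int status = (line[0] - '0') * 10 + (line[1] - '0');
    std::string meta;
    if (line.size() > 2) {
        if (line[2] != ' ') return std::nullopt;
        meta = std::string(line.substr(3));
    }
    return GeminiHeader{status, meta};
}

// Assumption: only successful text/* responses are worth downloading.
// (MIME types are case-insensitive; this sketch expects lowercase.)
bool should_fetch_body(const GeminiHeader& header) {
    const bool success = header.status >= 20 && header.status <= 29;
    return success && header.meta.rfind("text/", 0) == 0;
}
```

With this sketch, a response starting with "20 image/png" would be recorded from its header alone and the connection closed before the body is transferred.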

As of now, indexing of news sites, RFCs, and documentation is mostly disabled, but it will likely be enabled once I have the means and resources to scale the setup.

Using this project

Requirements

Building and running the project

To build the project, you'll need a fully C++20 capable compiler. The following compilers should work as of writing this README:

  • GCC >= 11.2
  • MSVC >= 16.25

Install all dependencies, then run the commands:

mkdir build
cd build
cmake ..
make -j

Creating and maintaining the index

To create the initial index:

  1. Initialize the database: ./tlgs/tlgs_ctl/tlgs_ctl ../tlgs/config.json populate_schema
  2. Place the seed URLs into seeds.text
  3. In the build folder, run ./tlgs/crawler/tlgs_crawler -s seeds.text -c 4 ../tlgs/config.json

Now the crawler will start crawling Geminispace while also updating outdated indices (if any). To update an existing index, run:

./tlgs/crawler/tlgs_crawler -c 2 ../tlgs/config.json
# -c is the maximum concurrent connections the crawler will make

NOTE: TLGS's crawler is distributable. You can run multiple instances in parallel, but some instances may drop out early towards the end of crawling. This does not affect the result of the crawl.

Running the capsule

openssl req -new -subj "/CN=my.host.name.space" -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 -days 36500 -nodes -out cert.pem -keyout key.pem
cd tlgs/server
./tlgs_server ../../../tlgs/server_config.json

Via systemd

sudo systemctl start tlgs_server
sudo systemctl start tlgs_crawler

Server config

The custom_config.tlgs section in server_config.json (installed at /etc/tlgs/server_config.json) contains configurations for the TLGS server. Besides the usual Drogon config options, custom_config changes the properties of TLGS itself. Currently supported options are:

ranking_algo

The ranking algorithm TLGS uses to rank pages in search results. The ranking is then combined with the text match score to produce the final search rank. Currently supported values are hits and salsa, referring to the HITS and SALSA ranking algorithms. It defaults to salsa if no value is provided.

SALSA runs slightly faster than HITS for large search results. Both the literature and empirical experience suggest SALSA provides better ranking. Thus we switched from HITS to SALSA.

"ranking_algo": "salsa"

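The documented behaviour of ranking_algo (two accepted values, salsa as the default) can be sketched as a small lookup. This is only an illustration: RankingAlgo and resolve_ranking_algo are hypothetical names, and throwing on an unknown value is an assumption, since the README does not say how TLGS treats unsupported values.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class RankingAlgo { Hits, Salsa };

// Hypothetical helper: map the optional config value to an algorithm.
// A missing value falls back to SALSA, as the README describes.
RankingAlgo resolve_ranking_algo(const std::optional<std::string>& value) {
    if (!value.has_value()) return RankingAlgo::Salsa;
    if (*value == "salsa") return RankingAlgo::Salsa;
    if (*value == "hits") return RankingAlgo::Hits;
    // Assumption: reject anything else rather than silently defaulting.
    throw std::invalid_argument("unsupported ranking_algo: " + *value);
}
```
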
TODOs

  • Code cleanup
    • I really need to centralize the crawling logic
  • Randomize the order of crawling. Avoid bashing a single capsule
    • Sort of.. by sampling the pages table with low percentage and increase later
  • Support parsing markdown
  • Try indexing news sites
  • Optimize the crawler even more
    • Checks hash before updating index
    • Proper UTF-8 handling in ASCII art detection
    • Use a trie for blacklist URL match
  • Link analysis using SALSA
  • BM25 for text scoring
  • Deduplicate search results
  • Implement filters
  • Proper(?) way to migrate schema
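
For the "trie for blacklist URL match" item, one possible shape is a character trie in which each blacklisted entry is a URL prefix. This is a sketch only: PrefixTrie is a hypothetical class, and prefix-based matching is an assumption about how the blacklist is meant to work.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <string_view>

// Character trie answering "does any stored prefix match the start of this URL?"
class PrefixTrie {
public:
    // Store one blacklisted URL prefix.
    void insert(std::string_view prefix) {
        Node* node = &root_;
        for (char c : prefix) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->terminal = true;
    }

    // True if some stored prefix is a prefix of url.
    bool matches(std::string_view url) const {
        const Node* node = &root_;
        if (node->terminal) return true;  // the empty prefix matches everything
        for (char c : url) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
            if (node->terminal) return true;
        }
        return false;
    }

private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool terminal = false;
    };
    Node root_;
};
```

A lookup walks at most as many nodes as the URL has characters, regardless of how many prefixes are stored, which is the appeal over checking each blacklist entry in turn.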

tlgs's Issues

Crawler sends fragment to server

While it's not specifically stated in the Gemini spec, a growing consensus is that sending a URL with the fragment, like gemini://gemini.conman.org/boston/2009/11/16/2009/11/16.1#nf2009-11-16-1-2 is incorrect behavior and shouldn't be done (as the fragment is a client-side issue, not a server side issue).
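
One way to address this is to drop the fragment before the URL is written to the socket. A minimal sketch, assuming the request URL is available as a string; strip_fragment is a hypothetical helper and not a function from the TLGS code base.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Return the URL without its fragment (everything from the first '#').
// URLs without a fragment are returned unchanged.
std::string strip_fragment(std::string_view url) {
    const auto pos = url.find('#');
    return std::string(url.substr(0, pos));
}
```

Applied to the URL above, this would send gemini://gemini.conman.org/boston/2009/11/16/2009/11/16.1 to the server.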

Crawler misparses URIs

I'm noticing your crawler is not parsing URIs properly, which is resulting in requests like gemini://gemini.conman.org/boston/2015/04/05-05/mailto:[email protected], gemini://gemini.conman.org/boston/2002/03/javascript:addSidebarPanel() or gemini://gemini.conman.org/boston/2007/05/news:alt.society.generation-x.
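
The requests above look like links that carry their own scheme (mailto:, javascript:, news:) being treated as relative paths. A sketch of a check that could run before resolving a link, following the scheme grammar of RFC 3986 (a letter, then letters, digits, '+', '-' or '.', terminated by ':'). link_scheme and is_crawlable_link are hypothetical names, and the address in the usage note is a made-up example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Return the lowercased scheme of a link if it starts with one, else nullopt.
std::optional<std::string> link_scheme(std::string_view ref) {
    if (ref.empty() || !std::isalpha(static_cast<unsigned char>(ref[0])))
        return std::nullopt;
    for (std::size_t i = 1; i < ref.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(ref[i]);
        if (c == ':') {
            std::string scheme(ref.substr(0, i));
            for (char& ch : scheme)
                ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
            return scheme;
        }
        // Any other character ends the candidate scheme: this is a relative link.
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.')
            return std::nullopt;
    }
    return std::nullopt;
}

// A link is worth crawling if it is relative or explicitly uses gemini.
bool is_crawlable_link(std::string_view ref) {
    const auto scheme = link_scheme(ref);
    return !scheme.has_value() || *scheme == "gemini";
}
```

With this check, a link such as mailto:someone@example.org would be skipped instead of being appended to the current page's path.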

Crawler resolves relative URLs incorrectly on pages reached from a redirect

When the crawler processes a URL that results in a redirect, and the target page contains relative links, those relative links should be resolved using the target URL as a base, not the redirecting URL.

Example:

  • gemini://raek.se/ links to gemini://raek.se/orbits/omloppsbanan/next?gemini%3A%2F%2Fraek.se%2F
  • gemini://raek.se/orbits/omloppsbanan/next?gemini%3A%2F%2Fraek.se%2F redirects to gemini://hanicef.me/
  • gemini://hanicef.me/ links to /about

Then the crawler should resolve the last link into gemini://hanicef.me/about, but currently it incorrectly resolves it into gemini://raek.se/about.
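
The expected behaviour can be sketched by keeping both the requested URL and the final URL of a fetch, and always resolving links against the latter. FetchResult, origin_of and resolve_link are hypothetical names. The sketch only handles absolute-path references such as /about; full relative resolution as described in RFC 3986 needs more cases.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// What a fetch produced: the URL that was asked for and the URL that
// actually served the page after following redirects.
struct FetchResult {
    std::string requested_url;
    std::string final_url;
};

// Return "scheme://authority" of an absolute URL, or "" if there is no "://".
std::string origin_of(std::string_view url) {
    const auto scheme_end = url.find("://");
    if (scheme_end == std::string_view::npos) return "";
    const auto path_start = url.find_first_of("/?#", scheme_end + 3);
    return std::string(url.substr(0, path_start));
}

// Resolve an absolute-path reference against the page that contained it.
// The base is the final URL, not the URL that redirected to it.
std::string resolve_link(const FetchResult& page, std::string_view ref) {
    return origin_of(page.final_url) + std::string(ref);
}
```

For the example above, the requested URL is the raek.se orbit link and the final URL is gemini://hanicef.me/, so /about resolves to gemini://hanicef.me/about.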
