
Internet search engine for text-oriented websites. Indexing the small, old and weird web.

Home Page: https://search.marginalia.nu/

License: Other

Java 41.78% Shell 0.13% HTML 57.22% CSS 0.20% JavaScript 0.14% Roff 0.31% Python 0.02% SCSS 0.20%
search-engine no-cloud small-web internet-search indexer language-processing web-crawler alt-search no-ai-used self-hostable

marginaliasearch's Introduction

Marginalia Search

This is the source code for Marginalia Search.

The aim of the project is to develop new and alternative discovery methods for the Internet. It's an experimental workshop as much as it is a public service; the overarching goal is to elevate the more human, non-commercial sides of the Internet.

A side goal is to do this without requiring datacenters or enterprise hardware budgets: the operation should run on affordable hardware with minimal operational overhead.

The long-term plan is to refine the search engine so that it provides enough public value that the project can be funded through grants, donations and commercial API licenses (non-commercial share-alike is always free).

The system can be run either as a copy of Marginalia Search or as a white-label search engine for your own data (either crawled or side-loaded). At present the logic isn't very configurable, and many of the judgements made reflect the Marginalia project's goals, but additional configurability is being worked on!

Here's a demo of the set-up and operation of the self-hostable barebones mode of the search engine: 🌎 https://www.youtube.com/watch?v=PNwMkenQQ24

Set up

To set up a local test environment, follow the instructions in 📄 run/readme.md!

Further documentation is available at 🌎 https://docs.marginalia.nu/.

Before compiling, it's necessary to run ⚙️ run/setup.sh. This downloads supplementary model data that is necessary to run the code; the data is also needed to run the tests.

If you wish to hack on the code, check out 📄 doc/ide-configuration.md.

Hardware Requirements

A production-like environment requires a lot of RAM and ideally enterprise SSDs for the index, as well as some additional terabytes of slower hard drives for storing crawl data. It can be made to run on smaller hardware by limiting the size of the index.

The system will definitely run on a 32 GB machine, possibly smaller, but at that size it may not perform very well, as it relies on disk caching to be fast.

A local developer's deployment is possible with much smaller hardware (and index size).

Project Structure

📁 code/ - The Source Code. See 📄 code/readme.md for a further breakdown of the structure and architecture.

📁 run/ - Scripts and files used to run the search engine locally

📁 third-party/ - Third party code

📁 doc/ - Supplementary documentation

📄 CONTRIBUTING.md - How to contribute

📄 LICENSE.md - License terms

Contact

You can email [email protected] with any questions or feedback.

License

The bulk of the project is available under AGPL 3.0, with exceptions. Some parts are co-licensed under MIT, and third-party code may have different licenses. See the appropriate readme.md / license.md.

Versioning

The project uses modified Calendar Versioning, where the first two pairs of numbers are a year and month coinciding with the latest crawling operation, and the third number is a patch number.

     yy.mm.VV
     \___/ \/
     crawl patch

For example, 23.03.02 is a release with crawl data from March 2023 (released in May 2023). It is the second patch for the 23.03 release.

Versions with the same year and month are compatible with each other, or at least offer an upgrade path where the same data set can be used. Across different crawl sets, data format changes may be introduced, and you're generally expected to re-crawl the data from scratch: crawl data has a shelf life approximately as long as this project's major release cycle, and after about 2-3 months it gets noticeably stale, with many dead links.

For development purposes, crawling is discouraged and sample data is available. See 📄 run/readme.md for more information.

Funding

Donations

Consider donating to the project.

Grants

This project was funded through the NGI0 Entrust Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101069594.


marginaliasearch's People

Contributors

adrthegamedev, conor-f, jmholla, vlofgren


marginaliasearch's Issues

RSS feed for publicity column

Hi! Instead of sending another email I thought I'd post to GitHub this time.
I'd like an RSS feed for the "Publicity, Discussion and Events" column on Marginalia. I'm already following the maintenance logs, and I'd like to be able to follow what the press is saying about Marginalia. There hasn't been any news lately, but when there is an update, why not make it an RSS feed too? Unless I missed it somehow, I'm not seeing a feed for it.

Make random exploration mode git-manageable

There's currently a dump of the data in the random exploration mode available in the MarginaliaSearch/PublicData repo. It would be nice to be able to automatically synchronize changes made within this repo back to the live data. Probably just a button in the control service would suffice.

Suggestions dropdown is kinda janky

Problem 1: The suggestions box doesn't always align with the input field at some window sizes.


Problem 2: The behavior is kind of ... off? I don't know what it is. Maybe clicking the suggestion should execute the query? Maybe compare with other suggestions drop downs?

Problem 3: The quality of suggestions could probably be improved. (FIXED)

(public api) Expose query ranking parameters on the public API

These are presently available internally in ResultRankingParameters; most are probably safe to expose.

/** Tuning for BM25 when applied to matches in the full index */
public final Bm25Parameters fullParams;
/** Tuning for BM25 when applied to priority matches, terms with relevance signal indicators */
public final Bm25Parameters prioParams;

/** Documents below this length are penalized */
public int shortDocumentThreshold;

public double shortDocumentPenalty;


/** Scaling factor associated with domain rank (unscaled rank value is 0-255; high is good) */
public double domainRankBonus;

/** Scaling factor associated with document quality (unscaled rank value is 0-15; high is bad) */
public double qualityPenalty;

/** Average sentence length values below this threshold are penalized, range [0-4), 2 or 3 is probably what you want */
public int shortSentenceThreshold;

/** Magnitude of penalty for documents with low average sentence length */
public double shortSentencePenalty;

public double bm25FullWeight;
public double bm25PrioWeight;
public double tcfWeight;

Some of the parameters in QueryLimits are probably also useful if given upper bounds

int resultsByDomain; 
int resultsTotal;  
int timeoutMs;  
int fetchSize;
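A minimal sketch of what exposing these could look like on the API side, with hypothetical query parameter names and illustrative clamping bounds (the real defaults and safe ranges would come from ResultRankingParameters):

import java.util.Map;

// Hypothetical sketch: read an optional ranking override from the request,
// clamped to a safe range so API consumers can't submit pathological values.
static double clampedParam(Map<String, String> queryParams, String name,
                           double defaultValue, double min, double max) {
    String raw = queryParams.get(name);
    if (raw == null)
        return defaultValue;
    try {
        return Math.min(max, Math.max(min, Double.parseDouble(raw)));
    }
    catch (NumberFormatException e) {
        return defaultValue;
    }
}

// e.g. domainRankBonus = clampedParam(qp, "domainRankBonus", 0.25, 0.0, 1.0);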

(summary-extractor) theregister.com picks up popover text

All summaries are:

'Oh no, you're thinking, yet another cookie pop-up. Well, sorry, it's the law. We measure how many people read us, and ensure you see relevant ads, by storing cookies on your device. If you're cool with that, hit “Accept all Cookies”. For more info and to'

(converter) Tilde search?

It would be cheap and trivially simple to add a magic keyword for only searching tilde URLs. No idea if it would be useful, but it's worth experimenting with.
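For reference, a tilde URL here presumably means a user-page path like example.com/~user, so the check itself is a one-liner (sketch with a hypothetical helper):

// Hypothetical sketch: true for classic user-page URLs such as
// https://example.com/~alice/blog.html
static boolean isTildeUrl(java.net.URI url) {
    String path = url.getPath();
    return path != null && path.startsWith("/~");
}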

(crawler) Rare NPE in sniffRootDocument

ERROR CrawlerRetreiver- Error configuring link filter
java.lang.NullPointerException: Cannot invoke "nu.marginalia.bigstring.BigString.decode()" because "sample.documentBody" is null
        at nu.marginalia.crawl.retreival.CrawlerRetreiver.sniffRootDocument(CrawlerRetreiver.java:245) ~[crawling-process.jar:?]
        at nu.marginalia.crawl.retreival.CrawlerRetreiver.crawlDomain(CrawlerRetreiver.java:144) ~[crawling-process.jar:?]
        at nu.marginalia.crawl.retreival.CrawlerRetreiver.fetch(CrawlerRetreiver.java:99) ~[crawling-process.jar:?]
        at nu.marginalia.crawl.CrawlerMain.fetchDomain(CrawlerMain.java:121) ~[crawling-process.jar:?]
        at nu.marginalia.crawl.CrawlerMain.lambda$startCrawlTask$1(CrawlerMain.java:103) ~[crawling-process.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
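A minimal sketch of the obvious guard; sample and documentBody follow the stack trace, the rest is hypothetical:

// Hypothetical sketch: skip the root-document sniffing when the cached
// sample has no body, rather than calling decode() on null.
if (sample == null || sample.documentBody == null) {
    return; // nothing to sniff; keep the default link filter
}
String body = sample.documentBody.decode();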

(dating-service) UI Improvements

Overview

The UI of the dating service has a few issues that can be a bit frustrating, especially on mobile, but are also pretty easily fixed.

  1. The buttons are extremely large due to invisible padding. This is mostly OK on desktop, since the hover effect shows you when you're hovering, but on mobile I often find myself accidentally tapping in that invisible area, since I'm not exactly sure where the bounds of the button are and hover isn't there to help. It'd be nice if the clickable area of the buttons always matched their visible area.
  2. Overlapping interactive elements can be a bit annoying if your intended click misses and triggers something else. I think fixing the problem above would introduce this new problem. Instead, it would be nice if the interactive elements (buttons/links) had their own distinct space on the screen.
  3. The buttons get stuck behind the screenshot link after clicking the link. The only way to get them back in front is to take focus off of the screenshot link.
  4. There is some unnecessary horizontal overflow on narrow screens.

Example Markup and Styling Solution

I'm not too familiar with Java, the template language used in the dating-service, or the Marginalia code base in general, but I thought it might be nice to provide an example that addresses the issues mentioned above and could maybe inspire a solution and save some time. I just copied the markup template into an HTML page, tweaked the HTML and styles, and hard-coded in some things. Here's a StackBlitz link to that code.

Notes about the example:

  • I tried to stay true to the original look of the page while providing the mobile friendliness that I'm after.
  • I added a [data-back] attribute to the body tag to control whether or not to display the back button via CSS instead of removing it with a template conditional. This can be set to "true" or "false" to show/hide the back button.
  • I rounded some things a bit for consistency with the emoji icons and to make the box shadows look a bit nicer on the corners.
  • I recognize my use of background colors on the buttons is a bit of a naive choice, since the emojis have different colors and appearances on different devices. The blue I chose is identical to the background of the emoji on my desktop but is very different on my Android device. The only thing I can think to do here is to use SVG icons, which would be consistent across devices.

Wrong content-type for opensearchdescription

Hi! I tried to add Marginalia to my Firefox, but right-clicking the URL bar doesn't give me the expected "Add Marginalia Search" option.

I looked at it, and according to https://developer.mozilla.org/en-US/docs/Web/OpenSearch#troubleshooting_tips the XML document should be provided with an application/opensearchdescription+xml content type, but it is not:

$ curl -I https://search.marginalia.nu/opensearch.xml
HTTP/1.1 200 OK
Server: nginx/1.22.1
Date: Fri, 10 Nov 2023 07:59:16 GMT
Content-Type: text/html;charset=utf-8

I don't know if fixing the content type is enough, but it'll probably be a good first step :)
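A minimal sketch of the server-side fix, assuming a Spark-style route (hypothetical; the actual handler may be wired differently):

// Hypothetical sketch: serve opensearch.xml with the content type Firefox
// expects, per the MDN troubleshooting tips linked above.
Spark.get("/opensearch.xml", (request, response) -> {
    response.type("application/opensearchdescription+xml");
    return opensearchXmlBody; // the existing XML document, unchanged
});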

(process) Processes should be possible to run from the command line

As far as practical, processes should be possible to run from the command line and not always pull instructions from the control message queue.

This could be done by having e.g.

$ crawler-process auto
$ crawler-process manual /path/to/specs /path/to/data

ProcessService in control-service would need a bit of modification as well to feed the auto command into the execve-array.

...

This is probably relatively simple, but also a good opportunity to look over the highly duplicated code in fetching instructions.
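A minimal sketch of the entry point, with hypothetical method names following the command examples above:

import java.nio.file.Path;

// Hypothetical sketch of a dual-mode main(): "auto" keeps pulling
// instructions from the control message queue; "manual" runs one job
// from the command line arguments.
public static void main(String[] args) {
    if (args.length > 0 && "manual".equals(args[0])) {
        if (args.length != 3) {
            System.err.println("usage: crawler-process manual <specs> <data>");
            System.exit(1);
        }
        runSingleJob(Path.of(args[1]), Path.of(args[2]));
    }
    else {
        runFromMessageQueue(); // existing "auto" behavior
    }
}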

Dark Theme on search.marginalia.nu

Overview

https://search.marginalia.nu no longer has dark mode styles.

I tested on a couple browsers on a couple devices and made sure dark mode was enabled at the OS level but the search page stays on a light theme. I'd love for dark mode to return; perhaps it was removed by accident.

Additional Info

It's been quite a bit of time since I first noticed dark theme was missing, maybe a month or more. I'm only just getting around to making an issue.

Dark mode appears to work on other pages (at least https://www.marginalia.nu) when I toggle my color preferences at the OS level.

Could maybe be its own issue, but the first time I noticed dark theme was missing I also noticed that the search page is quite a bit narrower. It seems like it used to span the full page width or quite close to it. I was a fan of the wider search page, although I suppose this comes down to personal preference.

Loader throws (rare) exceptions, IP column size too small

Data for the IP column in EC_URL is too long. This is probably due to IPv6 addresses. We need to either enlarge the column or truncate/blank the field if it doesn't fit. Right now the domain is dropped from loading, which is not good.

The field is VARCHAR(32). An uncompressed IPv6 address is 39 characters, and IPv4-mapped forms can be up to 45.
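A minimal sketch of the truncate-or-blank option (enlarging the column to VARCHAR(45) would be the alternative); the helper name is hypothetical:

// Hypothetical sketch: make the value fit VARCHAR(32) rather than letting
// the insert fail and the whole domain be dropped from loading.
static String fitIpColumn(String ip) {
    final int MAX_IP_COLUMN_WIDTH = 32; // current EC_URL.IP width
    if (ip == null || ip.length() <= MAX_IP_COLUMN_WIDTH)
        return ip;
    return ""; // blank rather than storing a truncated, misleading address
}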

(index) Rare IOOB Exception when excluding a term that isn't known to the lexicon

 java.lang.IndexOutOfBoundsException: Index (0) is greater than or equal to list size (0)
  at it.unimi.dsi.fastutil.ints.IntArrayList.getInt(IntArrayList.java:341) ~[fastutil-8.5.8.jar:?]
  at nu.marginalia.index.svc.IndexQueryService.logSearchTerms(IndexQueryService.java:239) ~[index-service.jar:?]
  at nu.marginalia.index.svc.IndexQueryService.evaluateSubqueries(IndexQueryService.java:183) ~[index-service.jar:?]
  at nu.marginalia.index.svc.IndexQueryService.executeSearch(IndexQueryService.java:131) ~[index-service.jar:?]
  at nu.marginalia.index.svc.IndexQueryService.lambda$search$0(IndexQueryService.java:87) ~[index-service.jar:?]

Seems to happen with - or ? terms that refer to keywords that aren't known to the system.
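A minimal sketch of a guard in the logging path; the lexicon lookup and its sentinel value are hypothetical, and the names otherwise follow the stack trace:

// Hypothetical sketch: in IndexQueryService.logSearchTerms(), skip terms
// the lexicon doesn't know instead of indexing into an empty list.
for (String term : searchTermsExclude) {
    int termId = lexicon.getTermId(term);
    if (termId < 0)
        continue; // unknown term: nothing to exclude, nothing to log
    logger.info("{} -> {} E", term, termId);
}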

(blacklist) Blacklist Transparency

Create a view for listing/exporting the blacklisted domains to enhance the transparency of the search engine in some way that makes sense. Needs to be able to deal with a large amount of data. Maybe sort by initial, use a patricia trie for basic searching like in encyclopedia.marginalia.nu?

SELECT LEFT(URL_DOMAIN, 1) AS INITIAL, COUNT(*) FROM EC_DOMAIN_BLACKLIST GROUP BY INITIAL;
+---------+----------+
| INITIAL | COUNT(*) |
+---------+----------+
| 0       |      564 |
| 1       |      738 |
| 2       |      595 |
| 3       |      596 |
| 4       |      713 |
| 5       |      600 |
| 6       |      556 |
| 7       |      609 |
| 8       |      659 |
| 9       |      559 |
| a       |     1629 |
| b       |     1676 |
| c       |     1879 |
| d       |     1060 |
| e       |     1347 |
| f       |     1074 |
| g       |     1302 |
| h       |     1268 |
| i       |     1290 |
| j       |      994 |
| k       |      502 |
| l       |      901 |
| m       |     1773 |
| n       |      684 |
| o       |      601 |
| p       |     1359 |
| q       |      153 |
| r       |      867 |
| s       |     3021 |
| t       |     2116 |
| u       |      331 |
| v       |      475 |
| w       |      847 |
| x       |      252 |
| y       |      219 |
| z       |      233 |
+---------+----------+

(crawler) thefossilforum.com doesn't get crawled properly

The crawler only fetches the index page. I don't see anything in robots.txt, the headers or the meta tags that would indicate why it shouldn't crawl further.

Contents of 3f/12/3f123f37cf205631a7a953398aa30294-www.thefossilforum.com.zstd with body and IP redacted:

{
  "url": "http://www.thefossilforum.com/",
  "contentType": "text/html;charset=UTF-8",
  "timestamp": "2023-07-27T22:16:20.498185969",
  "httpStatus": 200,
  "crawlerStatus": "OK",
  "headers": "Date: Thu, 27 Jul 2023 20:16:20 GMT\nServer: Apache\nPragma: no-cache\nX-IPS-LoggedIn: 0\nContent-Encoding: gzip\nVary: cookie,Accept-Encoding\nX-XSS-Protection: 0\nX-Frame-Options: sameorigin\nX-IPS-Cached-Response: Thu, 27 Jul 2023 20:16:18 GMT\nExpires: Thu, 27 Jul 2023 20:16:50 GMT\nCache-Control: max-age=30, public\nConnection: close\nContent-Length: 24931\nLast-Modified: Thu, 27 Jul 2023 20:16:18 GMT\nExpires: max-age=29030400, public\nContent-Type: text/html;charset=UTF-8\n",
  "documentBody": "...",
  "documentBodyHash": "20d820427e20598470ff5d551852544e",
  "canonicalUrl": "http://www.thefossilforum.com/index.php",
  "recrawlState": "SAME-BY-COMPARISON"
}
{
  "url": "http://www.thefossilforum.com/index.php",
  "contentType": "text/html;charset=UTF-8",
  "timestamp": "2023-07-27T22:16:21.837639302",
  "httpStatus": 200,
  "crawlerStatus": "OK",
  "headers": "Date: Thu, 27 Jul 2023 20:16:21 GMT\nServer: Apache\nPragma: no-cache\nX-IPS-LoggedIn: 0\nContent-Encoding: gzip\nVary: cookie,Accept-Encoding\nX-XSS-Protection: 0\nX-Frame-Options: sameorigin\nExpires: Thu, 27 Jul 2023 20:16:51 GMT\nCache-Control: max-age=30, public\nConnection: close\nContent-Length: 24952\nLast-Modified: Thu, 27 Jul 2023 20:16:21 GMT\nExpires: max-age=29030400, public\nContent-Type: text/html;charset=UTF-8\n",
  "documentBody": "...",
  "documentBodyHash": "42d78ecbd431ae7a7ba14f0f1394c3dc",
  "canonicalUrl": "http://www.thefossilforum.com/index.php",
  "recrawlState": "SAME-BY-COMPARISON"
}
{
  "id": "3f123f37cf205631a7a953398aa30294",
  "domain": "www.thefossilforum.com",
  "crawlerStatus": "OK",
  "ip": "...",
  "doc": [],
  "cookies": [
    "ips4_IPSSessionFront=e75f89a3dc49ebff4ff8b73c5b15aa70; path=/; httponly",
    "ips4_guestTime=1690488978; path=/; httponly",
    "ips4_forum_view=table; expires=Sat, 27 Jul 2024 20:16:18 GMT; path=/; httponly"
  ]
}

robots.txt

User-Agent: *
Disallow: /startTopic/
Disallow: /discover/unread/
Disallow: /markallread/
Disallow: /staff/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /*?advancedSearchForm=
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=
Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
Disallow: /profile/
Sitemap: http://thefossilforum.com/sitemap.php

Sitemap is nested:

<sitemapindex>
<sitemap>
<loc>
http://www.thefossilforum.com/sitemap.php?file=sitemap_content_forums_Forum
</loc>
<lastmod>2023-09-03T09:43:25+02:00</lastmod>
</sitemap>
...
</sitemapindex>
<urlset>
<url>
<loc>http://www.thefossilforum.com/forum/2-fossil-news/</loc>
<lastmod>2023-09-02T20:19:41+01:00</lastmod>
</url>
<url>
<loc>
http://www.thefossilforum.com/forum/186-paleo-re-creations/
</loc>
<lastmod>2023-08-30T13:03:45+01:00</lastmod>
</url>
...
</urlset>

(common-service) Hot reload for static resources

It would be a very nice productivity enhancement if StaticResources could do a hot reload, or at least be made to do a hot reload in test when developing locally. Having to rebuild and restart a Docker container every time you change a CSS file is a huge timewaster.
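A minimal sketch of a dev-mode toggle, with hypothetical names: in local development, static files are re-read from the source tree on every request, so edits show up on refresh.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: dev mode reads straight from disk; production keeps
// the existing cached classpath behavior.
byte[] loadStaticResource(String path) throws IOException {
    if (Boolean.getBoolean("marginalia.devMode")) {
        return Files.readAllBytes(Path.of("run/static", path));
    }
    return cachedClasspathResource(path); // existing behavior (hypothetical name)
}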

(task) Update Stale Language Models

The language ngram and term frequency models are very old and questionable in how they were constructed.

  • NGramBloomFilter -- how was this even created? It's used in query construction. May not be necessary.
  • TermFrequencyDict -- the construction logic needs patching to run; it can then probably be regenerated on prod data.

(search-service) API index parameter doesn't map to all settings

In SearchApiQueryService change

profile = switch (index) {
  case "0" -> SearchProfile.YOLO;
  case "1" -> SearchProfile.MODERN;
  case "2" -> SearchProfile.DEFAULT;
  case "3" -> SearchProfile.CORPO_CLEAN;
  default -> SearchProfile.CORPO_CLEAN;
};

to something that makes more sense. [0,3] probably need to be kept unchanged for backward compatibility. Maybe also allow string values and map those to the profile name?
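A minimal sketch of one option: keep the numeric values for backward compatibility and accept profile names as well (the string values here are hypothetical and would follow SearchProfile's actual names):

// Hypothetical sketch: [0,3] unchanged; names accepted case-insensitively.
profile = switch (index.toLowerCase()) {
    case "0", "yolo"        -> SearchProfile.YOLO;
    case "1", "modern"      -> SearchProfile.MODERN;
    case "2", "default"     -> SearchProfile.DEFAULT;
    case "3", "corpo_clean" -> SearchProfile.CORPO_CLEAN;
    default                 -> SearchProfile.CORPO_CLEAN;
};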

(crawl-job-extractor) Bunch of CJE papercuts and issues

Known problems with master:

  • The EC_URL table has been retired, so extraction from DB doesn't work in master anymore. We probably don't even really need known URLs anymore since we're mostly doing recrawls. Maybe the specs format should simplify to just a CSV with domain and crawl depth.

Notes from triggering the last crawl:

  • The process lacks a ProcessServiceHeartbeat so isn't visible in the control gui. Could do with chatting a bit in the EventLog as well.

  • Workflow for recrawl is a mess. Right now re-crawls are only possible against the same spec, and only as long as it's still related, which is not a useful combination. There should either be a way to manually relate or de-relate specifications and crawls, or a way to explicitly specify a specification when doing a recrawl.

  • It would be nice if there was a way to merge specifications without using command line tools. Possibly less necessary if we go the CSV route mentioned in the first point, but still, it would be nice to reduce the amount of routine work done over SSH ⌨️

(crawler-process) Duplicate IDs in spec handled improperly

If a crawlspec contains duplicate IDs, the website is crawled multiple times. If the duplicate IDs are close in sequence in the file, the result is corrupted crawl data.

Fix: Keep a set of seen IDs and deduplicate before launching new crawl tasks.
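A minimal sketch, assuming a list of spec records with an id accessor (names are hypothetical):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: drop duplicate crawl spec IDs up front so no site
// is crawled twice, and never by two tasks at once.
Set<String> seenIds = new HashSet<>();
List<CrawlSpecRecord> deduplicated = new ArrayList<>();
for (CrawlSpecRecord spec : specs) {
    if (seenIds.add(spec.id())) { // add() returns false for duplicates
        deduplicated.add(spec);
    }
}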

Overflow text in search results cards title

Hi!

I was playing a bit with the search engine when I encountered this bug in how a search result is displayed. I guess a picture says more than a thousand words:
[Screenshot: search result for "BMC ipmi error" with the card's title text overflowing]

Hope that helps.
Congratulations on the search engine, this project seems really cool!

(public api) Improve Rate Limiting For Anonymous API

Right now a global rate limit is used for the anonymous public key. This rate limit is the same regardless of how much search pressure there is on the API gateway, which means we sometimes rate-limit more aggressively than is needed.

An alternative logic may be to have a global rate limit across all API consumers that only the public API actually checks against, but all API consumers try to take a token. This would permit far more anonymous traffic when the traffic is low, and restrict the public API when API consumers with a key are also accessing the gateway.

There also appear to be some poorly configured SearXNG clients querying the API gateway with the same query multiple times. To help these clients, a small response cache could be introduced. If the query exists within the cache, the rate limit isn't checked. Guava's Cache is probably sufficient.
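A minimal sketch of the cache part using Guava; the result type, key, and rate limiter are hypothetical stand-ins:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.time.Duration;

// Hypothetical sketch: a small, short-lived response cache consulted before
// the rate limiter, so repeated identical queries don't consume tokens.
Cache<String, ApiSearchResults> responseCache = CacheBuilder.newBuilder()
        .maximumSize(1_000)
        .expireAfterWrite(Duration.ofMinutes(1))
        .build();

ApiSearchResults cached = responseCache.getIfPresent(queryKey);
if (cached != null)
    return cached; // cache hit: skip the rate limit check entirely

rateLimiter.takeToken(); // hypothetical existing rate limit check
ApiSearchResults results = doSearch(queryKey);
responseCache.put(queryKey, results);
return results;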

(search) Improper handling of tabs

From email

When I pasted in something with a tab, it doesn't find anything. When I replace the tab with a space, it does the normal search.
Tabs should be parsed just like spaces, as whitespace.
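A minimal sketch of the fix at the tokenization step, assuming the query is currently split on literal spaces (hypothetical):

// Hypothetical sketch: split on any whitespace run (spaces, tabs, ...)
// instead of only on literal spaces.
String[] terms = query.trim().split("\\s+");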

Bang Commands

I've been enjoying using Marginalia, but since it has a higher rate of serendipity, in my experience it can be difficult to use when you're not exploring a topic but looking for some very precise content.

Would adding support for "bang commands" be something that would fit within the project's goals/objectives? Allowing you to optionally search on a different engine for specific queries?

Not sure if this is already supported/this is the correct venue to discuss this, but let me know and I am happy to move the conversation elsewhere!

Improve list of indexed sites

Your search engine is very unique, but I have some ideas for improving the list of indexed sites.

  1. Below is a short useful list of sites dedicated to the analysis of search engines in the English-speaking space.
    seirdy.one + Webmentions
    thenewleafjournal.com
    dkb.io
    dkb.blog (new version of dkb.io)
    The website dkb.io is not indexed, although there are many mentions of it on news.ycombinator.com. Webmentions (there's a good example on dkb.io) would help expand the search to sites that have not been indexed.
  2. What about aggregators of useful links in the Spanish-speaking space?

(array,btree) merge function

It would be useful to be able to merge BTrees. Having this capability might make their construction faster and easier, and it would also make it possible to join disparate sources of keywords/metadata relatively painlessly.

The first step is to build relevant operations in Array. This is likely relatively easy in theory but tricky to get performant, since the usual paging punch-through tricks may not work.

(array) sorting algorithms have room for improvement

Issues:

  • quicksort uses an inclusive upper bound. There are also improvements to be made to the algorithm itself: it's the most basic-ass quicksort right now, and better variants exist that might speed up the conversion step a bit.
  • merge sort could be modified to use a fixed-size merge buffer
  • actually, do we really need merge sort?

Here's more than anyone would ever want to know about quicksort:
https://kluedo.ub.rptu.de/frontdoor/deliver/index/docId/4468/file/wild-dissertation.pdf

It's finicky and time-consuming to get right and probably takes more time to implement than it will save in processing time, but it'd be nice to improve these.

https://github.com/MarginaliaSearch/MarginaliaSearch/blob/master/code/libraries/array/src/main/java/nu/marginalia/array/algo/SortAlgoQuickSort.java
https://github.com/MarginaliaSearch/MarginaliaSearch/blob/master/code/libraries/array/src/main/java/nu/marginalia/array/algo/SortAlgoMergeSort.java

Move domain links outside of the MariaDB database

Updating and querying this table is very slow and bogs down both the loader and the ranking calculations by a substantial amount. The actual data itself is pretty small, 75,000,000 x 16 bytes = 1.2 GB in prod, but all the indices and uniqueness constraints blow this up to ~10 GB in MariaDB. If we split the data across nodes, this would be a few hundred megabytes each, and we'd shave hours off of processing.

Sketch of a solution:

  • Keep it in memory in the index service and have some API for querying it (maybe via QS); see the sketch below
  • The loader writes to a file which gets switched over atomically as a whole batch by the index svc
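A minimal sketch of the in-memory representation (hypothetical): each link packed into a single long, sorted so queries are a binary search.

import java.util.Arrays;

// Hypothetical sketch: source domain id in the high 32 bits, destination id
// in the low 32. 75M links packed this way is ~600 MB total with no index
// overhead, i.e. a few hundred megabytes per node once split.
static long packLink(int sourceDomainId, int destDomainId) {
    return ((long) sourceDomainId << 32) | (destDomainId & 0xFFFF_FFFFL);
}

long[] links = loadLinksFromFile(linksFile); // written atomically by the loader
Arrays.sort(links); // membership/range checks via Arrays.binarySearch()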

(search-service) The query string "The Shining" (in quotation marks) isn't processed properly

The search ["the shining"] generates the query

include: the
include: shining
advice: the_shining

This returns nothing because 'the' is a stop word. It needs to also generate another query like

include: the_shining

or

include: shining
advice: the_shining

Debug logs from testing:

search-service | 16:46:21,674 QUERY INFO req:#9fd30347:b29672af SearchOperator -- Human terms: the,shining
index-service | 16:46:21,682 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- IndexQueryParams[qualityLimit=NONE, year=NONE, size=NONE, rank=NONE, searchSet=SearchSetAny, queryStrategy=AUTO]
index-service | 16:46:21,683 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- include=[the,shining] advice=[the_shining] coherences=[the,shining]
index-service | 16:46:21,683 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- the -> 60976 I
index-service | 16:46:21,684 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- shining -> 21826 I
index-service | 16:46:21,684 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- the_shining -> 495969 A
index-service | 16:46:21,685 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- 0 from [Priority:495969] -> [Retain:priority/60976, Retain:full/21826, [Predicate], ParamMatchingQueryFilter]
index-service | 16:46:21,686 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- 0 from [Priority:495969] -> [Retain:priority/21826, Retain:full/60976, [Predicate], ParamMatchingQueryFilter]
index-service | 16:46:21,687 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- 0 from [Priority:60976] -> [Retain:priority/21826, Retain:full/495969, [Predicate], ParamMatchingQueryFilter]
index-service | 16:46:21,687 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- 0 from [Priority:495969] -> [Retain:full/60976, Retain:full/21826, [Predicate], ParamMatchingQueryFilter]
index-service | 16:46:21,688 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- 0 from [Priority:60976] -> [Retain:full/495969, Retain:full/21826, [Predicate], ParamMatchingQueryFilter]
index-service | 16:46:21,689 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- 0 from [Priority:21826] -> [Retain:full/495969, Retain:full/60976, [Predicate], ParamMatchingQueryFilter]
index-service | 16:46:21,690 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- 0 from [Full:495969] -> [Retain:full/60976, Retain:full/21826, [Predicate], ParamMatchingQueryFilter]
index-service | 16:46:21,690 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- After filtering: 0 -> 0
index-service | 16:46:21,691 QUERY INFO req:#9fd30347:b29672af IndexQueryService -- Index Result Count: 0
search-service | 16:46:21,697 QUERY INFO req:#9fd30347:b29672af SearchOperator -- Search Result Count: 0
search-service | 16:46:21,712 HTTP INFO rsp:#9fd30347:b29672af SearchService -- RSP 200

vs

search-service | 16:45:46,353 QUERY INFO req:#9fd30347:75b0e911 SearchOperator -- Human terms: the_shining
index-service | 16:45:46,355 QUERY INFO req:#9fd30347:75b0e911 IndexQueryService -- IndexQueryParams[qualityLimit=NONE, year=NONE, size=NONE, rank=NONE, searchSet=SearchSetAny, queryStrategy=AUTO]
index-service | 16:45:46,355 QUERY INFO req:#9fd30347:75b0e911 IndexQueryService -- include=[the_shining]
index-service | 16:45:46,355 QUERY INFO req:#9fd30347:75b0e911 IndexQueryService -- the_shining -> 495969 I
index-service | 16:45:46,356 QUERY INFO req:#9fd30347:75b0e911 IndexQueryService -- 33 from [Priority:495969] -> [[Predicate], ParamMatchingQueryFilter]
index-service | 16:45:46,356 QUERY INFO req:#9fd30347:75b0e911 IndexQueryService -- Omitting [Full:495969] -> [[Predicate], ParamMatchingQueryFilter]
index-service | 16:45:46,362 QUERY INFO req:#9fd30347:75b0e911 IndexQueryService -- After filtering: 33 -> 33
index-service | 16:45:46,362 QUERY INFO req:#9fd30347:75b0e911 IndexQueryService -- Index Result Count: 21
search-service | 16:45:46,399 QUERY INFO req:#9fd30347:75b0e911 SearchQueryIndexService -- Deduplicator ate 4 results
search-service | 16:45:46,399 QUERY INFO req:#9fd30347:75b0e911 SearchOperator -- Search Result Count: 17
search-service | 16:45:46,484 HTTP INFO rsp:#9fd30347:75b0e911 SearchService -- RSP 200

Use public suffix list

Currently the domain name parsing is a bit of an idiot, trying to guess based on some heuristic where the "TLD" ends and the rest of the domain begins. The civilized way of doing this is to use the public suffix list, as the TLD in a DNS sense isn't particularly informative.

Need to build a parser and stick it into a data structure that makes the processing fast, since we're parsing quite a number of domains... Maybe move stuff outside of EdgeDomain as well.
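For what it's worth, Guava (which comes up elsewhere in these issues) already ships a parser backed by the public suffix list; a minimal sketch:

import com.google.common.net.InternetDomainName;

// Sketch using Guava's PSL-backed parser; note the PSL snapshot is baked in
// at Guava's build time, which may or may not be fresh enough.
InternetDomainName dn = InternetDomainName.from("forums.example.co.uk");
dn.publicSuffix();      // co.uk
dn.topPrivateDomain();  // example.co.uk -- the registrable domain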

Page returned after report shows raw HTML instead of rendered HTML

I reported a website that's some kind of spam website, a clone of another website. After reporting it, I was forwarded to a page that showed me the raw HTML that was intended to be rendered, rather than the rendered page. Screenshot attached below:

[screenshot]

This was on Firefox 121.0 on a Mac. I did not attempt to reproduce this because I don't know how I'd be able to reproduce this without filing reports which would either be duplicates or false.

(crawler) Implement sitemap support

Sitemaps are currently not supported. Implementing sitemap support might help the crawler with URL discovery on some sites.

There are some risks though. Some sitemaps are huge. Look at neocities' sitemaps for example. It's a sitemap of all of neocities. This needs to be dealt with gracefully. There probably needs to be some sort of fast-failing upper limit to avoid exposing the crawler to OOM problems.

Some sitemaps also contain URLs for other domains. Since Marginalia's crawler is designed to operate in a one-domain-at-a-time fashion, these may need to be ignored initially.

Maybe look at Google's specs? https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap
Also: https://www.sitemaps.org/protocol.html
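A minimal sketch of the fast-failing limits, with illustrative constants; EdgeUrl mirrors the codebase's EdgeDomain naming but is hypothetical here:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: hard caps so a huge sitemap (e.g. all of neocities)
// can't balloon memory, and cross-domain URLs are dropped per the
// one-domain-at-a-time design.
static final int MAX_SITEMAP_URLS = 10_000; // illustrative cap
static final int MAX_SITEMAP_DEPTH = 2;     // nested sitemapindex levels

List<EdgeUrl> urls = new ArrayList<>();
for (String loc : sitemapLocations) {
    if (urls.size() >= MAX_SITEMAP_URLS)
        break; // fail fast rather than risk OOM
    EdgeUrl url = EdgeUrl.parse(loc);
    if (url != null && url.domain.equals(crawlDomain)) {
        urls.add(url);
    }
}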

Switch stemming algorithm?

Marginalia uses Porter stemming quite frequently in various applications.

It's worth examining the options. Porter's algorithm is pretty janky, and considers e.g. 'universe' and 'university' to share the same stem. OpenNLP's Snowball stemmer may also be an option, or a Krovetz stemmer.

Note: If the stemming algorithm is switched completely, the term frequency dictionary needs to be re-generated, as it stems words.
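A minimal comparison harness using OpenNLP's stemmers (both ship in opennlp-tools); per the note above, Porter conflates 'universe' and 'university':

import opennlp.tools.stemmer.PorterStemmer;
import opennlp.tools.stemmer.snowball.SnowballStemmer;

// Compare stems side by side before committing to a switch.
PorterStemmer porter = new PorterStemmer();
SnowballStemmer snowball = new SnowballStemmer(SnowballStemmer.ALGORITHM.ENGLISH);

for (String word : new String[] { "universe", "university", "shining" }) {
    System.out.printf("%-12s porter: %-10s snowball: %s%n",
            word, porter.stem(word), snowball.stem(word));
}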

(converter) Improve generator fingerprinting

The search engine fingerprints the webserver to try to figure out what sort of a website it is. This is done by looking at the meta generator tag and various other tags in DocumentGeneratorExtractor

Most of the website generators that can be fingerprinted with the generator tag should be picked up automatically, but it may be necessary to categorize them in the switch statement that looks like final GeneratorType type = switch (parts[0]) {

Generators that don't set meta generator may be fingerprinted through comments, JS or other features in fingerprintByComments(). The task is to look at the HTML code for identifying features and then check for them in the function. There are several examples of this in the code already.

Detection is especially poor for several static site generators. Hugo is picked up fine, but there are many others and some are not detected at this point.
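A minimal sketch of both paths, assuming jsoup for parsing; the patterns and GeneratorType values are illustrative, not the actual detection rules:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(html);

// Path 1: the meta generator tag, categorized as in DocumentGeneratorExtractor.
Element genTag = doc.selectFirst("meta[name=generator]");
if (genTag != null) {
    String[] parts = genTag.attr("content").split(" ");
    // ... final GeneratorType type = switch (parts[0]) { ... };
}

// Path 2 (hypothetical patterns): fingerprint by telltale markup when no
// generator tag is present, e.g. static site generators that sign output.
String head = doc.head().html();
if (head.contains("/wp-content/"))
    return GeneratorType.WORDPRESS;
if (head.contains("Generated by Jekyll"))
    return GeneratorType.JEKYLL;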

Improve Summary Extraction

features-convert/summary-extraction, that is, the logic that extracts a summary of each document, is very inconsistent.

This is a pretty difficult problem with two steps:

  1. Find "the text" of the document, if there is such a thing, stripping away navigation and titles. This must be able to deal with both very old HTML that may use tags in a strange manner, as well as modern semantic HTML.
  2. Find the most informative portion of the text.

A curve ball is that this needs to be very fast and can't use very much memory. Calling API endpoints or using AI is out of the question entirely.

In practice, this is probably an Academically Hard problem. PhD theses have been written on part #2 alone. But maybe there is a good-enough solution that is better than some of the heuristics currently employed.

(search-service) Typeahead is slow

This is because the API being used is on the search.marginalia.nu/suggest domain, which is behind Cloudflare. Use api.marginalia.nu/suggest instead.

The change goes in tts.js. This script could probably also be cleaned up, as it's written in a weird mix of modern and boomer-style JS.
