alexander-bauer / distru
Distributed search for Hyperboria, powered by Go
Though Index.MergeRemote() exists in current versions, it may not work. It needs to be able to request full indexes from other servers, and merge them into the Index object properly. There should also be an RPC server command to do this.
I encountered this bug when I set up Apache to proxy requests for host/distru/ to host:9048. This was accomplished with Apache's mod_proxy, rather than a hard redirect. That worked alright for the first page, but broke when Distru tried to search something and directed to host/search/<term> rather than host/distru/search/<term>.
Brought to attention by @danry25.
webui/jquery.js is sort of big (93K) for what we're aiming for. Is it completely necessary?
$ ls -gGlh webui/
total 140K
-rw-r--r-- 1 494 Oct 24 23:23 common.js
-rw-r--r-- 1 263 Oct 25 00:30 index.html
-rw-r--r-- 1 93K Oct 24 23:23 jquery.js <---
-rw-r--r-- 1 1.4K Oct 24 23:23 style.css
When using fmt, numResults printed on the page easily, but now the int does not print. With what we're using now, we have to use []byte for everything, and ints don't work. Strings can easily be converted like this:
w.Write([]byte("Derp"))
What I tried to do, and what is there right now, is this:
w.Write([]byte(string(numResults)))
This does not work, though. This should be investigated, but we also need to take into account that we are not counting the number of results yet; right now we just have a placeholder number set like this:
numResults := 2
Tested with projectmeshnet.org. It causes a crypto-related runtime error.
panic: crypto: requested hash function is unavailable
goroutine 3 [running]:
crypto.Hash.New(0x5, 0x8, 0x0)
/usr/lib/go/src/pkg/crypto/crypto.go:62 +0x93
crypto/x509.(*Certificate).CheckSignature(0x188bea00, 0x4, 0x1898d00e, 0x70d, 0xe5c, ...)
/usr/lib/go/src/pkg/crypto/x509/x509.go:391 +0x5b
crypto/x509.(*Certificate).CheckSignatureFrom(0x188be800, 0x188bea00, 0x0, 0x0)
/usr/lib/go/src/pkg/crypto/x509/x509.go:370 +0x141
crypto/x509.(*CertPool).findVerifiedParents(0x18983f80, 0x188be800, 0x0, 0x0)
/usr/lib/go/src/pkg/crypto/x509/cert_pool.go:44 +0x158
crypto/x509.(*Certificate).buildChains(0x188be800, 0x1898f000, 0x65d65c, 0x1, 0x1, ...)
/usr/lib/go/src/pkg/crypto/x509/verify.go:198 +0x16d
crypto/x509.(*Certificate).Verify(0x188be800, 0x0, 0x0, 0x18983f80, 0x1898f020, ...)
/usr/lib/go/src/pkg/crypto/x509/verify.go:177 +0x17a
crypto/tls.(*Conn).clientHandshake(0x1891e180, 0x0, 0x0)
/usr/lib/go/src/pkg/crypto/tls/handshake_client.go:117 +0x1209
----- stack segment boundary -----
crypto/tls.(*Conn).Handshake(0x1891e180, 0x0, 0x0)
/usr/lib/go/src/pkg/crypto/tls/conn.go:808 +0xc3
net/http.(*Transport).getConn(0x18840c60, 0x189836c0, 0x189836c0, 0x0)
/usr/lib/go/src/pkg/net/http/transport.go:369 +0x398
net/http.(*Transport).RoundTrip(0x18840c60, 0x189137e0, 0x189801e0, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/http/transport.go:155 +0x23b
net/http.send(0x189137e0, 0x1883ed80, 0x18840c60, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/http/client.go:133 +0x325
net/http.(*Client).doFollowingRedirects(0x832d1b8, 0x189135b0, 0x18956410, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/http/client.go:227 +0x568
net/http.(*Client).Get(0x832d1b8, 0x18901f90, 0x24, 0x80612c5, 0x0, ...)
/usr/lib/go/src/pkg/net/http/client.go:176 +0x86
net/http.Get(0x18901f90, 0x24, 0x19, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/http/client.go:158 +0x40
main.getRobotsPermission(0x18878040, 0x19, 0x0, 0x0)
/home/sasha/dev/go/src/distru/index.go:241 +0x6b
main.newSite(0x18878040, 0x19, 0x18983000, 0x12)
/home/sasha/dev/go/src/distru/index.go:159 +0x82
main.Indexer(0x18800478, 0x18801180, 0x0)
/home/sasha/dev/go/src/distru/index.go:140 +0x6d
created by main.MaintainIndex
/home/sasha/dev/go/src/distru/index.go:134 +0x74
goroutine 1 [chan receive]:
net.(*pollServer).WaitRead(0x188018a0, 0x188523f0, 0x1888f3a0, 0xb)
/usr/lib/go/src/pkg/net/fd.go:268 +0x75
net.(*netFD).accept(0x188523f0, 0x8089474, 0x0, 0x1883e4a0, 0x18800180, ...)
/usr/lib/go/src/pkg/net/fd.go:622 +0x199
net.(*TCPListener).AcceptTCP(0x18891260, 0x81c638c, 0x0, 0x0)
/usr/lib/go/src/pkg/net/tcpsock_posix.go:322 +0x56
net.(*TCPListener).Accept(0x18891260, 0x0, 0x0, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/tcpsock_posix.go:332 +0x39
main.Serve(0x1883eea0, 0x10)
/home/sasha/dev/go/src/distru/serve.go:43 +0x354
main.main()
/home/sasha/dev/go/src/distru/distru.go:12 +0x3d
goroutine 2 [syscall]:
created by runtime.main
/build/buildd/golang-1/src/pkg/runtime/proc.c:221
goroutine 5 [syscall]:
syscall.Syscall6()
/build/buildd/golang-1/src/pkg/syscall/asm_linux_386.s:46 +0x27
syscall.EpollWait(0x7, 0x18850008, 0xa, 0xa, 0xffffffff, ...)
/usr/lib/go/src/pkg/syscall/zerrors_linux_386.go:1780 +0x7d
net.(*pollster).WaitFD(0x18850000, 0x188018a0, 0x0, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/fd_linux.go:146 +0x12b
net.(*pollServer).Run(0x188018a0, 0x5)
/usr/lib/go/src/pkg/net/fd.go:236 +0xdf
created by net.newPollServer
/usr/lib/go/src/pkg/net/newpollserver.go:35 +0x308
goroutine 6 [chan receive]:
net.(*pollServer).WaitRead(0x188018a0, 0x188812a0, 0x1888f3a0, 0xb)
/usr/lib/go/src/pkg/net/fd.go:268 +0x75
net.(*netFD).accept(0x188812a0, 0x8089474, 0x0, 0x1883e4a0, 0x18800180, ...)
/usr/lib/go/src/pkg/net/fd.go:622 +0x199
net.(*TCPListener).AcceptTCP(0x18800580, 0xc, 0x0, 0x0)
/usr/lib/go/src/pkg/net/tcpsock_posix.go:322 +0x56
net.(*TCPListener).Accept(0x18800580, 0x0, 0x0, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/tcpsock_posix.go:332 +0x39
net/http.(*Server).Serve(0x18880120, 0x1888f540, 0x18800580, 0x0, 0x0, ...)
/usr/lib/go/src/pkg/net/http/server.go:1012 +0x77
net/http.(*Server).ListenAndServe(0x18880120, 0x18880120, 0x0)
/usr/lib/go/src/pkg/net/http/server.go:1002 +0x9f
net/http.ListenAndServe(0x81d8af0, 0x5, 0x0, 0x0, 0x18891300, ...)
/usr/lib/go/src/pkg/net/http/server.go:1074 +0x55
main.ServeWeb()
/home/sasha/dev/go/src/distru/web.go:15 +0xbf
created by main.Serve
/home/sasha/dev/go/src/distru/serve.go:40 +0x341
goroutine 37 [chan receive]:
net.(*pollServer).WaitRead(0x188018a0, 0x18913620, 0x1888f3a0, 0xb)
/usr/lib/go/src/pkg/net/fd.go:268 +0x75
net.(*netFD).Read(0x18913620, 0x18986000, 0x1000, 0x1000, 0xffffffff, ...)
/usr/lib/go/src/pkg/net/fd.go:428 +0x19a
net.(*TCPConn).Read(0x18971e48, 0x18986000, 0x1000, 0x1000, 0x0, ...)
/usr/lib/go/src/pkg/net/tcpsock_posix.go:87 +0xb1
bufio.(*Reader).fill(0x189800c0, 0x0)
/usr/lib/go/src/pkg/bufio/bufio.go:77 +0x115
bufio.(*Reader).Peek(0x189800c0, 0x1, 0x18956401, 0x0)
/usr/lib/go/src/pkg/bufio/bufio.go:102 +0x8b
net/http.(*persistConn).readLoop(0x1891ce80, 0x18878360)
/usr/lib/go/src/pkg/net/http/transport.go:521 +0x8f
created by net/http.(*Transport).getConn
/usr/lib/go/src/pkg/net/http/transport.go:382 +0x591
goroutine 38 [runnable]:
syscall.Syscall()
/build/buildd/golang-1/src/pkg/syscall/asm_linux_386.s:33 +0x57
syscall.Close(0x4, 0x0, 0x0)
/usr/lib/go/src/pkg/syscall/zerrors_linux_386.go:1700 +0x49
os.(*file).close(0x189740c0, 0x0, 0x0)
/usr/lib/go/src/pkg/os/file_unix.go:96 +0x48
----- stack segment boundary -----
created by runtime.gc
/build/buildd/golang-1/src/pkg/runtime/mgc0.c:882
bb33ac6ca0da169774f00283a56e4c1cb6ae4df3 updated web.go to be able to isolate search terms from the target URL. The front page of Distru should direct to /search/<searchterm>
via javascript. I can't figure out how to do this, because apparently my javascript is terrible.
Searching currently only works with exact search terms. Searches should support partial matches, or a similarly non-exclusive matching scheme.
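A minimal sketch of what non-exact matching could look like, using a hypothetical matchPartial helper that is not part of Distru:

```go
package main

import (
	"fmt"
	"strings"
)

// matchPartial is a hypothetical sketch of non-exclusive matching:
// a page matches if any of its indexed words contains the search term,
// compared case-insensitively.
func matchPartial(words []string, term string) bool {
	term = strings.ToLower(term)
	for _, w := range words {
		if strings.Contains(strings.ToLower(w), term) {
			return true
		}
	}
	return false
}

func main() {
	words := []string{"Hyperboria", "distributed", "search"}
	fmt.Println(matchPartial(words, "distri")) // true: contained in "distributed"
	fmt.Println(matchPartial(words, "mesh"))   // false: no word contains "mesh"
}
```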
This will add space at the end of the search page, instead of having everything crowded against the bottom of the page.
Whenever a user makes a search request, a duplicate search for "img/icon_16.png" appears in the server log, as well.
Prewritten regexes, which are not subject to change or any sort of variability, should be compiled with regexp.MustCompile(). This causes a runtime panic if the pattern fails to compile, which will never happen for patterns that never change.
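For illustration, with a hypothetical link-extracting pattern (not one of Distru's actual regexes):

```go
package main

import (
	"fmt"
	"regexp"
)

// Package-level patterns compile once at startup; MustCompile panics
// immediately if the pattern is invalid, so a bad regex is caught at
// launch instead of surfacing as an error at every call site.
var linkPattern = regexp.MustCompile(`href="([^"]+)"`) // hypothetical pattern

func main() {
	m := linkPattern.FindStringSubmatch(`<a href="/search/term">go</a>`)
	fmt.Println(m[1]) // /search/term
}
```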
This sort of speaks for itself, but there should be a function to explicitly get a list of relevant pages (and their URLs) from a given index.
When Distru is given the command to index a site, it pays no attention to whether or not the site is running Distru. If the target site (except for the self-indexing) responds to a (possibly only-the-target-site Index.MergeRemote()
) distru request, then it should not be actively crawled.
This will drastically reduce the amount of http network traffic, if enough sites run Distru instances.
This is caused primarily by the fact that every page Distru indexes requires an individual HTTP request. These currently run linearly: each page must wait for the previous one to finish, and each site must wait for the previous site to finish. Most of the lag is network lag; Distru can handle the indexing very fast, but the target sites are slow to respond, which slows down our indexing.
This can be fixed by allowing Distru to send many HTTP requests simultaneously, mainly to separate sites, so that they can be indexed in parallel. Going further, requests for different pages within a site could also be made simultaneously. That would require a bit more overhead to keep track of pages already being requested, but it is definitely doable.
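A sketch of the per-site approach using goroutines, with indexSite standing in for Distru's real indexing work:

```go
package main

import (
	"fmt"
	"sync"
)

// indexSite is a stand-in for Distru's per-site indexing; in practice
// this is where the slow HTTP requests happen.
func indexSite(site string) string {
	return "indexed " + site
}

func main() {
	sites := []string{"example.org", "projectmeshnet.org", "uppit.us"}

	var wg sync.WaitGroup
	results := make([]string, len(sites))

	// One goroutine per site: each fetch waits on its own network latency
	// instead of serializing behind the previous site.
	for i, site := range sites {
		wg.Add(1)
		go func(i int, site string) {
			defer wg.Done()
			results[i] = indexSite(site)
		}(i, site)
	}
	wg.Wait()

	for _, r := range results {
		fmt.Println(r)
	}
}
```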
fmt.Fprint(w, "</style>")
//close the <head> element
fmt.Fprint(w, "</head>")
//add the <body> element
fmt.Fprint(w, "<body>")
//display the search term at the top
fmt.Fprintf(w, "<div class=\"searchterm\">%d results for <strong>%s</strong></div>", numResults, searchTerm)
//TODO: SEARCH HERE.
//this is a temporary example of what searches will look like
fmt.Fprint(w, "<div class=\"results\">test</div>")
fmt.Fprint(w, "<div class=\"results\">test2</div>")
//close the <body> element
fmt.Fprint(w, "</body>")
//close the <html> element
fmt.Fprint(w, "</html>")
These should be reduced to as few elements as possible, and the use of fmt should probably be discouraged, just for the sake of producing a smaller output binary. This could be resolved using http.ResponseWriter's native Write method:
//Write a []byte to the connection.
_, err := w.Write([]byte("All of the style HTML on one line or read from a file, including the Version number"))
When malformed requests are sent to the Indexers, the result is usually an unprintable Index. Indexers and their slave functions should implement better error handling and communication.
Results should be ranked in order of relevance. Exactly what relevance is will take some figuring.
UDP is better suited to Distru's purposes. The problem is that some of the server/client code needs to be slightly restructured to use UDPConn objects instead of Conns established over TCP.
For example, a link to http://example.org/help from http://example.org should be treated as an internal link, as should a link to http://example.org/ from http://example.org/a/page/down/the/tree. The first should be treated as /help and the second as /.
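The rule above can be sketched with net/url, using a hypothetical internalPath helper rather than Distru's actual code:

```go
package main

import (
	"fmt"
	"net/url"
)

// internalPath reports whether link is internal to base, and if so the
// root-relative path it should be indexed under.
func internalPath(base, link string) (string, bool) {
	b, err := url.Parse(base)
	if err != nil {
		return "", false
	}
	l, err := b.Parse(link) // also resolves relative links against base
	if err != nil {
		return "", false
	}
	if l.Host != b.Host {
		return "", false // external link
	}
	if l.Path == "" {
		return "/", true
	}
	return l.Path, true
}

func main() {
	p, ok := internalPath("http://example.org", "http://example.org/help")
	fmt.Println(p, ok) // /help true
	p, ok = internalPath("http://example.org/a/page/down/the/tree", "http://example.org/")
	fmt.Println(p, ok) // / true
}
```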
Sites should be timestamped by when they were indexed.
In the search results page, one can click to the left or right of a search result box, and still be taken to the page. This is misleading, because the box does not highlight in this case.
There's not really much explaining what exactly Distru does and how to use it.
As of right now you have to create your own config file.
sudo touch /etc/distru.conf
And then after that, Distru needs permission to add things to it and to change it (various modes could be used with chmod; I'm just using 660 as an example).
sudo chmod 660 /etc/distru.conf
Distru should do all of this itself. The user should not have to do these things.
The configuration file should contain start-up values, such as the number of indexers to start. Each of these should be set with defaults, and they should be added to the configuration if they are not set.
This has been a long standing issue. Distru should be able to index sites automatically on startup, and which sites those are must be user-configurable.
Searches should use the Index.MergeRemote() function to expand the index as much as possible before performing the search. The resulting, larger Index should probably be discarded afterward, except for sites with relevant results.
The search page version number is significantly larger than the one on the front page.
This has been a long standing issue, but I never created an issue for it. Distru needs to index whole websites, rather than just front pages.
The Index.Search() function, though rudimentary, returns usable and formatted results. These should be used by the WebUI to produce the content on search/term pages.
Pages should be scraped for more identifying information. This should include the content of the <title> tag at the very least, but hopefully more. This will be used for the display of individual pages as search results.
Sites should have a field to keep track of the number of referrers. This way, they and their pages can be ranked better in terms of how popular they are on the mesh.
It is possible to send multiple HTTP GET requests, so as to reduce network lag drastically. Currently, Distru sends a GET request for a certain page, waits for the server to respond, processes the page, pulls out the relevant links, and sends another request. This could be sped up greatly if Distru requested all known, unindexed pages from the server simultaneously. We will need to look into the http package to do this.
This may have the consequence of fixing #12.
This is the strangest bug that I have ever encountered. If the server's index does not include uppit.us, it will fail to send, at least when receiving a connection from localhost.
Apparently, Distru identifies itself as Go http package, whereas it should be identifying itself by the BotName constant, which is Distru.