
distru's People

Contributors

alexander-bauer, inhies, lukevers


Forkers

danry25, lukevers

distru's Issues

Index sharing

Though Index.MergeRemote() exists in current versions, it may not work. It needs to be able to request full indexes from other servers, and merge them into the Index object properly. There should also be an RPC server command to do this.

Search directs to /search/..., rather than appending search/... to the url

I encountered this bug when I set up Apache to proxy requests from host/distru/ to host:9048, using Apache's mod_proxy rather than a hard redirect. That worked fine for the front page, but broke when Distru performed a search: it directed the browser to host/search/<term> rather than host/distru/search/<term>, because the search path is absolute instead of relative to the current URL.

[webui] jquery is kind of big

webui/jquery.js is sort of big (93K) for what we're aiming for. Is it completely necessary?

$ ls -gGlh webui/
total 140K
-rw-r--r-- 1  494 Oct 24 23:23 common.js
-rw-r--r-- 1  263 Oct 25 00:30 index.html
-rw-r--r-- 1  93K Oct 24 23:23 jquery.js    <---
-rw-r--r-- 1 1.4K Oct 24 23:23 style.css

numResults does not write on search page.

When we used fmt, numResults printed on the page easily, but now the int does not print. With the current approach we have to pass []byte to everything, and ints don't convert cleanly. Strings can easily be written like this:

w.Write([]byte("Derp"))

What I tried to do, and what is there right now, is this:

w.Write([]byte(string(numResults)))

This does not work, though. It should be investigated, but we also need to take into account that we are not counting the number of results yet; right now it is just a hard-coded placeholder:

numResults := 2

Attempt to index https:// site causes a runtime error

Tested with projectmeshnet.org. It causes a crypto-related runtime error.


panic: crypto: requested hash function is unavailable

goroutine 3 [running]:
crypto.Hash.New(0x5, 0x8, 0x0)
    /usr/lib/go/src/pkg/crypto/crypto.go:62 +0x93
crypto/x509.(*Certificate).CheckSignature(0x188bea00, 0x4, 0x1898d00e, 0x70d, 0xe5c, ...)
    /usr/lib/go/src/pkg/crypto/x509/x509.go:391 +0x5b
crypto/x509.(*Certificate).CheckSignatureFrom(0x188be800, 0x188bea00, 0x0, 0x0)
    /usr/lib/go/src/pkg/crypto/x509/x509.go:370 +0x141
crypto/x509.(*CertPool).findVerifiedParents(0x18983f80, 0x188be800, 0x0, 0x0)
    /usr/lib/go/src/pkg/crypto/x509/cert_pool.go:44 +0x158
crypto/x509.(*Certificate).buildChains(0x188be800, 0x1898f000, 0x65d65c, 0x1, 0x1, ...)
    /usr/lib/go/src/pkg/crypto/x509/verify.go:198 +0x16d
crypto/x509.(*Certificate).Verify(0x188be800, 0x0, 0x0, 0x18983f80, 0x1898f020, ...)
    /usr/lib/go/src/pkg/crypto/x509/verify.go:177 +0x17a
crypto/tls.(*Conn).clientHandshake(0x1891e180, 0x0, 0x0)
    /usr/lib/go/src/pkg/crypto/tls/handshake_client.go:117 +0x1209
----- stack segment boundary -----
crypto/tls.(*Conn).Handshake(0x1891e180, 0x0, 0x0)
    /usr/lib/go/src/pkg/crypto/tls/conn.go:808 +0xc3
net/http.(*Transport).getConn(0x18840c60, 0x189836c0, 0x189836c0, 0x0)
    /usr/lib/go/src/pkg/net/http/transport.go:369 +0x398
net/http.(*Transport).RoundTrip(0x18840c60, 0x189137e0, 0x189801e0, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/http/transport.go:155 +0x23b
net/http.send(0x189137e0, 0x1883ed80, 0x18840c60, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/http/client.go:133 +0x325
net/http.(*Client).doFollowingRedirects(0x832d1b8, 0x189135b0, 0x18956410, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/http/client.go:227 +0x568
net/http.(*Client).Get(0x832d1b8, 0x18901f90, 0x24, 0x80612c5, 0x0, ...)
    /usr/lib/go/src/pkg/net/http/client.go:176 +0x86
net/http.Get(0x18901f90, 0x24, 0x19, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/http/client.go:158 +0x40
main.getRobotsPermission(0x18878040, 0x19, 0x0, 0x0)
    /home/sasha/dev/go/src/distru/index.go:241 +0x6b
main.newSite(0x18878040, 0x19, 0x18983000, 0x12)
    /home/sasha/dev/go/src/distru/index.go:159 +0x82
main.Indexer(0x18800478, 0x18801180, 0x0)
    /home/sasha/dev/go/src/distru/index.go:140 +0x6d
created by main.MaintainIndex
    /home/sasha/dev/go/src/distru/index.go:134 +0x74

goroutine 1 [chan receive]:
net.(*pollServer).WaitRead(0x188018a0, 0x188523f0, 0x1888f3a0, 0xb)
    /usr/lib/go/src/pkg/net/fd.go:268 +0x75
net.(*netFD).accept(0x188523f0, 0x8089474, 0x0, 0x1883e4a0, 0x18800180, ...)
    /usr/lib/go/src/pkg/net/fd.go:622 +0x199
net.(*TCPListener).AcceptTCP(0x18891260, 0x81c638c, 0x0, 0x0)
    /usr/lib/go/src/pkg/net/tcpsock_posix.go:322 +0x56
net.(*TCPListener).Accept(0x18891260, 0x0, 0x0, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/tcpsock_posix.go:332 +0x39
main.Serve(0x1883eea0, 0x10)
    /home/sasha/dev/go/src/distru/serve.go:43 +0x354
main.main()
    /home/sasha/dev/go/src/distru/distru.go:12 +0x3d

goroutine 2 [syscall]:
created by runtime.main
    /build/buildd/golang-1/src/pkg/runtime/proc.c:221

goroutine 5 [syscall]:
syscall.Syscall6()
    /build/buildd/golang-1/src/pkg/syscall/asm_linux_386.s:46 +0x27
syscall.EpollWait(0x7, 0x18850008, 0xa, 0xa, 0xffffffff, ...)
    /usr/lib/go/src/pkg/syscall/zerrors_linux_386.go:1780 +0x7d
net.(*pollster).WaitFD(0x18850000, 0x188018a0, 0x0, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/fd_linux.go:146 +0x12b
net.(*pollServer).Run(0x188018a0, 0x5)
    /usr/lib/go/src/pkg/net/fd.go:236 +0xdf
created by net.newPollServer
    /usr/lib/go/src/pkg/net/newpollserver.go:35 +0x308

goroutine 6 [chan receive]:
net.(*pollServer).WaitRead(0x188018a0, 0x188812a0, 0x1888f3a0, 0xb)
    /usr/lib/go/src/pkg/net/fd.go:268 +0x75
net.(*netFD).accept(0x188812a0, 0x8089474, 0x0, 0x1883e4a0, 0x18800180, ...)
    /usr/lib/go/src/pkg/net/fd.go:622 +0x199
net.(*TCPListener).AcceptTCP(0x18800580, 0xc, 0x0, 0x0)
    /usr/lib/go/src/pkg/net/tcpsock_posix.go:322 +0x56
net.(*TCPListener).Accept(0x18800580, 0x0, 0x0, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/tcpsock_posix.go:332 +0x39
net/http.(*Server).Serve(0x18880120, 0x1888f540, 0x18800580, 0x0, 0x0, ...)
    /usr/lib/go/src/pkg/net/http/server.go:1012 +0x77
net/http.(*Server).ListenAndServe(0x18880120, 0x18880120, 0x0)
    /usr/lib/go/src/pkg/net/http/server.go:1002 +0x9f
net/http.ListenAndServe(0x81d8af0, 0x5, 0x0, 0x0, 0x18891300, ...)
    /usr/lib/go/src/pkg/net/http/server.go:1074 +0x55
main.ServeWeb()
    /home/sasha/dev/go/src/distru/web.go:15 +0xbf
created by main.Serve
    /home/sasha/dev/go/src/distru/serve.go:40 +0x341

goroutine 37 [chan receive]:
net.(*pollServer).WaitRead(0x188018a0, 0x18913620, 0x1888f3a0, 0xb)
    /usr/lib/go/src/pkg/net/fd.go:268 +0x75
net.(*netFD).Read(0x18913620, 0x18986000, 0x1000, 0x1000, 0xffffffff, ...)
    /usr/lib/go/src/pkg/net/fd.go:428 +0x19a
net.(*TCPConn).Read(0x18971e48, 0x18986000, 0x1000, 0x1000, 0x0, ...)
    /usr/lib/go/src/pkg/net/tcpsock_posix.go:87 +0xb1
bufio.(*Reader).fill(0x189800c0, 0x0)
    /usr/lib/go/src/pkg/bufio/bufio.go:77 +0x115
bufio.(*Reader).Peek(0x189800c0, 0x1, 0x18956401, 0x0)
    /usr/lib/go/src/pkg/bufio/bufio.go:102 +0x8b
net/http.(*persistConn).readLoop(0x1891ce80, 0x18878360)
    /usr/lib/go/src/pkg/net/http/transport.go:521 +0x8f
created by net/http.(*Transport).getConn
    /usr/lib/go/src/pkg/net/http/transport.go:382 +0x591

goroutine 38 [runnable]:
syscall.Syscall()
    /build/buildd/golang-1/src/pkg/syscall/asm_linux_386.s:33 +0x57
syscall.Close(0x4, 0x0, 0x0)
    /usr/lib/go/src/pkg/syscall/zerrors_linux_386.go:1700 +0x49
os.(*file).close(0x189740c0, 0x0, 0x0)
    /usr/lib/go/src/pkg/os/file_unix.go:96 +0x48
----- stack segment boundary -----
created by runtime.gc
    /build/buildd/golang-1/src/pkg/runtime/mgc0.c:882

[webui] front page should direct to /search/<searchterm>

bb33ac6ca0da169774f00283a56e4c1cb6ae4df3 updated web.go to be able to isolate search terms from the target URL. The front page of Distru should direct to /search/<searchterm> via javascript. I can't figure out how to do this, because apparently my javascript is terrible.

Non-exact searching

Searching currently works only with exact terms. Searches should also return partial matches, or use a similarly inclusive scheme.

Prewritten regexes should use regexp.MustCompile()

Prewritten regexes, which are not subject to change or any sort of variability, should be built with regexp.MustCompile(). It panics at runtime if the pattern fails to compile, which can only happen if the pattern itself is edited.

Searching

This sort of speaks for itself, but there should be a function to explicitly get a list of relevant pages (and their URLs) from a given index.

Sites should request an index before attempting to do the http crawl

When Distru is given the command to index a site, it pays no attention to whether the target site is itself running Distru. If the target site responds to a Distru request (possibly an Index.MergeRemote() restricted to just that site), it should not be actively crawled; self-indexing is the exception.

This will drastically reduce the amount of http network traffic, if enough sites run Distru instances.

Indexing routines take a while to finish

This is caused primarily by the fact that every page Distru indexes has to have an individual http request. These are currently run linearly; each page must wait for the previous one to finish. Each site must wait for the previous site to finish. Most of the lag, in this case, is network lag; Distru can handle the indexing very fast, but the target sites are slow to respond, and therefore they slow down our indexing.

This can be fixed by allowing Distru to send many HTTP requests simultaneously, mainly to separate sites, so that they can be indexed in parallel. More ambitiously, requests for different pages of the same site could also run simultaneously. That would require a bit more overhead to track pages already being requested, but it is definitely doable.

web.go has too many fmt.Fprintf() calls

    fmt.Fprint(w, "</style>")
    //close the <head> element
    fmt.Fprint(w, "</head>")
    //add the <body> element
    fmt.Fprint(w, "<body>")
    //display the search term at the top
    fmt.Fprintf(w, "<div class=\"searchterm\">%d results for <strong>%s</strong></div>", numResults, searchTerm)

    //TODO: SEARCH HERE.
    //this is a temporary example of what searches will look like
    fmt.Fprint(w, "<div class=\"results\">test</div>")
    fmt.Fprint(w, "<div class=\"results\">test2</div>")

    //close the <body> element
    fmt.Fprint(w, "</body>")
    //close the <html> element
    fmt.Fprint(w, "</html>")

These should be reduced to as few calls as possible, and the use of fmt should probably be discouraged, for the sake of a smaller output binary. This could be done with http.ResponseWriter's own Write method:

//Write a []byte to the connection.
_, err := w.Write([]byte("All of the style HTML on one line, or read from a file, including the version number"))

Bad error detection and handling

When malformed requests are sent to the Indexers, the result is usually an unprintable Index. Indexers and their slave functions should implement better error handling and communication.

result ranking

Results should be ranked in order of relevance. Exactly what relevance is will take some figuring.

Should probably use UDP

UDP is better suited to Distru's purposes. The problem is that some of the server/client code needs to be slightly restructured to use UDPConn objects instead of TCP-based Conns.

Self-links should be treated as internal links

For example, a link to http://example.org/help from http://example.org should be treated as an internal link, as should a link to http://example.org/ from http://example.org/a/page/down/the/tree. The first one should be treated as /help and the second should be treated as /.

Documentation

There's not really much explaining what exactly distru does and exactly how to use it.

Distru does not create the config file.

As of right now you have to create your own config file.

sudo touch /etc/distru.conf

And then Distru needs permission to read and change it (a number of different modes could be used with chmod; 660 is just an example).

sudo chmod 660 /etc/distru.conf

Distru should do all of this itself. The user should not have to do these things.

Configuration Controls

The configuration file should contain start-up values, such as the number of indexers to start. Each of these should be set with defaults, and they should be added to the configuration if they are not set.

Index configurable sites automatically

This has been a long-standing issue. Distru should be able to index sites automatically on startup, and which sites those are must be user-configurable.

Active index sharing

Searches should implement the Index.MergeRemote() function in order to expand the index as much as possible before making the search. The resultant, larger Index should probably be discarded afterward, except for sites with relevant results.

Recursive indexing

This has been a long-standing issue, but I never created an issue for it. Distru needs to index whole websites, rather than just their front pages.

WebUI needs to display proper search results

The Index.Search() function, though rudimentary, returns usable and formatted results. The WebUI should use these to produce the content of search/<term> pages.

Page info

Pages should be scraped for more identifying information, including at the very least the content of the <title> tag, but hopefully more. This will be used when displaying individual pages as search results.

Referred Sites

Sites should have a field to keep track of the number of referrers. This way, they and their pages can be ranked better in terms of how popular they are on the mesh.

Multiple HTTP GET requests

It is possible to send multiple HTTP GET requests, so as to reduce network lag drastically. Currently, Distru sends a GET request for a certain page, waits for the server to respond, processes the page, pulls out the relevant links, and sends another request. This could be sped up greatly if Distru requested all known, unindexed pages from the server simultaneously. We will need to look into the http package to do this.

This may have the consequence of fixing #12.

UserAgent not set

Apparently, Distru identifies itself as the Go http package's default client, whereas it should identify itself by the BotName constant, which is Distru.
