nlnwa / gowarcserver
License: Apache License 2.0
Describe the bug
Because unsortedParallelSearch is used for http + https with two iterators, the response has the wrong order between http and https results.
To Reproduce
Ask for a url without scheme (which causes a double search on http + https)
Expected behavior
Response is sorted by date
Additional context
There should be an alternative search, similar to closestUniSearch, that searches N keys with N iterators: select the oldest/newest item across the iterators, advance only the iterator it came from, and repeat until all iterators are exhausted.
https://github.com/nlnwa/gowarcserver/blob/master/internal/server/warcserver/db.go
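The merge described above can be sketched as a k-way merge over sorted lists. This is only an illustration under assumptions: the record type, the cursor type and mergeSorted are hypothetical names, plain slices stand in for Badger iterators, and each per-key result list is assumed to be sorted by timestamp already.

```go
package main

import (
	"container/heap"
	"fmt"
)

// record is a hypothetical stand-in for an index entry; only the
// timestamp matters for ordering.
type record struct {
	Timestamp int64
	URL       string
}

// cursor walks one already-sorted result list (one list per search key).
type cursor struct {
	items []record
	pos   int
}

// cursorHeap orders cursors by the timestamp of their current item.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return h[i].items[h[i].pos].Timestamp < h[j].items[h[j].pos].Timestamp
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted merges N individually sorted lists into one list sorted by
// timestamp. Only the cursor holding the current minimum is advanced.
func mergeSorted(lists ...[]record) []record {
	h := &cursorHeap{}
	for _, l := range lists {
		if len(l) > 0 {
			*h = append(*h, &cursor{items: l})
		}
	}
	heap.Init(h)

	var out []record
	for h.Len() > 0 {
		c := (*h)[0] // cursor with the oldest current item
		out = append(out, c.items[c.pos])
		c.pos++
		if c.pos < len(c.items) {
			heap.Fix(h, 0) // re-order after advancing this cursor
		} else {
			heap.Pop(h) // cursor exhausted, drop it
		}
	}
	return out
}

func main() {
	httpHits := []record{{1, "http://example.com/"}, {4, "http://example.com/"}}
	httpsHits := []record{{2, "https://example.com/"}, {3, "https://example.com/"}}
	for _, r := range mergeSorted(httpHits, httpsHits) {
		fmt.Println(r.Timestamp, r.URL)
	}
}
```

Reversing the comparison in Less would give newest-first order instead.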
Describe the solution you'd like
Instead of using Docker Hub, we can simply use GitHub's own container registry. Update the .github workflow to reflect this.
Is your feature request related to a problem? Please describe.
It is valid to have records that only have WARC-Refers-To-Target-URI and WARC-Refers-To-Date (see 6.7.2). This is currently not supported in gowarcserver
Describe the solution you'd like
Add handling of records without WARC-Refers-To in the loader/loader.go Load function
Is your feature request related to a problem? Please describe.
A draft of PR #3 introduced a severe regression. To avoid merging regressions into master, there should be cmd tests to verify that all commands and arguments work as expected.
Describe the solution you'd like
Unit tests for each command and argument combination
Describe alternatives you've considered
None
Additional context
https://medium.com/swlh/unit-testing-cli-programs-in-go-6275c85af2e7
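A table-driven check of command and argument combinations could look roughly like the sketch below. It is not the project's code: run is a hypothetical entry point, the standard library flag package stands in for cobra, and only the index command with -f cdxj (the combination mentioned elsewhere in these issues) is modelled.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// run is a hypothetical stand-in for a command entry point. It parses
// args and returns an error instead of exiting, which makes it testable.
func run(args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("missing command")
	}
	switch args[0] {
	case "index":
		fs := flag.NewFlagSet("index", flag.ContinueOnError)
		fs.SetOutput(io.Discard) // keep usage text out of test output
		format := fs.String("f", "cdxj", "index format")
		if err := fs.Parse(args[1:]); err != nil {
			return err
		}
		if *format != "cdxj" { // assumption: only cdxj is modelled here
			return fmt.Errorf("index: unsupported format %q", *format)
		}
		if fs.NArg() == 0 {
			return fmt.Errorf("index: no files given")
		}
		return nil
	default:
		return fmt.Errorf("unknown command %q", args[0])
	}
}

func main() {
	// Each case pairs an argument list with whether an error is expected.
	cases := []struct {
		args    []string
		wantErr bool
	}{
		{[]string{"index", "-f", "cdxj", "a.warc"}, false},
		{[]string{"index"}, true},
		{[]string{"index", "--bogus"}, true},
		{[]string{"nope"}, true},
	}
	for _, c := range cases {
		err := run(c.args)
		fmt.Printf("%v: ok=%v\n", c.args, (err != nil) == c.wantErr)
	}
}
```

In a real test file the same table would sit inside a Test function and call t.Errorf on a mismatch.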
Is your feature request related to a problem? Please describe.
Other than go vet, we don't really lint Go code in CI, which might lead to common problems being introduced.
Describe the solution you'd like
Introduce https://github.com/golangci/golangci-lint with some common linters to the CI
Describe alternatives you've considered
Additional context
Describe the bug
gowarcserver panics when I try to index testdata/IAH-20080430204825-00000-blackbook.warc
To Reproduce
Steps to reproduce the behavior:
Index the file with -f cdxj, i.e. ./warcserver index -f cdxj ./testdata/IAH-20080430204825-00000-blackbook.warc
Expected behavior
gowarcserver should index the files or report a user error to the terminal.
Additional context
[akselhjerpbakk@localhost gowarcserver]$ ./warcserver index -f cdxj ./testdata/IAH-20080430204825-00000-blackbook.warc
Using config file: /home/akselhjerpbakk/Projects/warcproject/gowarcserver/config.yaml
Format: cdxj
Count: 2
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x91749c]
goroutine 1 [running]:
github.com/golang/protobuf/jsonpb.(*jsonWriter).marshalMessage(0xc0001c9aa0, 0xcd9ba0, 0xc0001d4a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/akselhjerpbakk/go/pkg/mod/github.com/golang/[email protected]/jsonpb/encode.go:219 +0x73c
github.com/golang/protobuf/jsonpb.(*Marshaler).marshal(0x0, 0xccfca0, 0xc0001d4a00, 0x32, 0x2e2, 0xc0001d4a00, 0xc000330040, 0x0)
/home/akselhjerpbakk/go/pkg/mod/github.com/golang/[email protected]/jsonpb/encode.go:116 +0x207
github.com/golang/protobuf/jsonpb.(*Marshaler).MarshalToString(...)
/home/akselhjerpbakk/go/pkg/mod/github.com/golang/[email protected]/jsonpb/encode.go:78
github.com/nlnwa/gowarcserver/pkg/index.(*CdxJ).Write(0xc000010268, 0xcd4920, 0xc00007cd40, 0x7ffee18cb20a, 0x32, 0x2e2, 0x0, 0x0)
/home/akselhjerpbakk/Projects/warcproject/gowarcserver/pkg/index/indexwriter.go:74 +0xe6
github.com/nlnwa/gowarcserver/cmd/warcserver/cmd/index.readFile(0xc0001dc5a0, 0x0, 0x0)
/home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/cmd/index/index.go:120 +0x15f
github.com/nlnwa/gowarcserver/cmd/warcserver/cmd/index.runE(0xc0001dc5a0, 0x0, 0x0)
/home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/cmd/index/index.go:92 +0xf0
github.com/nlnwa/gowarcserver/cmd/warcserver/cmd/index.NewCommand.func2(0xc0001d3400, 0xc0001dc780, 0x1, 0x3, 0x0, 0x0)
/home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/cmd/index/index.go:79 +0xba
github.com/spf13/cobra.(*Command).execute(0xc0001d3400, 0xc0001dc6f0, 0x3, 0x3, 0xc0001d3400, 0xc0001dc6f0)
/home/akselhjerpbakk/go/pkg/mod/github.com/spf13/[email protected]/command.go:826 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001d2f00, 0xc000000180, 0xc0001c9f78, 0x4118e5)
/home/akselhjerpbakk/go/pkg/mod/github.com/spf13/[email protected]/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
/home/akselhjerpbakk/go/pkg/mod/github.com/spf13/[email protected]/command.go:864
main.main()
/home/akselhjerpbakk/Projects/warcproject/gowarcserver/cmd/warcserver/main.go:27 +0x2b
Is your feature request related to a problem? Please describe.
Currently, gowarcserver instances can consume more memory than they should in their host container.
Describe the solution you'd like
Badger has an extensive API that should enable a solution where the end user can configure memory usage at startup using arguments and/or the config file. @maeb has a previous attempt at this, which can be found in the gowarc repo (the link is not permanent, so it might die at some point).
Additional context
A good place to start might be Badger's documentation entry on memory usage: https://dgraph.io/docs/badger/get-started/#memory-usage
All option fields in badger v2.2007.2:
https://github.com/dgraph-io/badger/blob/d5a25b83fbf4f3f61ff03a9202e36f5b75544426/options.go#L35
// Required options.
Dir string
ValueDir string
// Usually modified options.
SyncWrites bool
TableLoadingMode options.FileLoadingMode
ValueLogLoadingMode options.FileLoadingMode
NumVersionsToKeep int
ReadOnly bool
Truncate bool
Logger Logger
Compression options.CompressionType
InMemory bool
// Fine tuning options.
MaxTableSize int64
LevelSizeMultiplier int
MaxLevels int
ValueThreshold int
NumMemtables int
// Changing BlockSize across DB runs will not break badger. The block size is
// read from the block index stored at the end of the table.
BlockSize int
BloomFalsePositive float64
KeepL0InMemory bool
BlockCacheSize int64
IndexCacheSize int64
LoadBloomsOnOpen bool
NumLevelZeroTables int
NumLevelZeroTablesStall int
LevelOneSize int64
ValueLogFileSize int64
ValueLogMaxEntries uint32
NumCompactors int
CompactL0OnClose bool
LogRotatesToFlush int32
ZSTDCompressionLevel int
// When set, checksum will be validated for each entry read from the value log file.
VerifyValueChecksum bool
// Encryption related options.
EncryptionKey []byte // encryption key
EncryptionKeyRotationDuration time.Duration // key rotation duration
// BypassLockGaurd will bypass the lock guard on badger. Bypassing lock
// guard can cause data corruption if multiple badger instances are using
// the same directory. Use this options with caution.
BypassLockGuard bool
// ChecksumVerificationMode decides when db should verify checksums for SSTable blocks.
ChecksumVerificationMode options.ChecksumVerificationMode
// DetectConflicts determines whether the transactions would be checked for
// conflicts. The transactions can be processed at a higher rate when
// conflict detection is disabled.
DetectConflicts bool
// Transaction start and commit timestamps are managed by end-user.
// This is only useful for databases built on top of Badger (like Dgraph).
// Not recommended for most users.
managedTxns bool
// 4. Flags for testing purposes
// ------------------------------
maxBatchCount int64 // max entries in batch
maxBatchSize int64 // max batch size in bytes
Is your feature request related to a problem? Please describe.
Regressions are hard to avoid when expanding the features of an application.
Describe the solution you'd like
We should watch for performance regressions through CI (it does not need to be GitHub hosted).
A CI step that profiles the master branch should look for improvements and regressions in performance. This would also serve as integration testing, which has been a minor issue in the repository (e.g. duplicate flags are not noticed at compile time and only surface as an error at runtime).
Describe alternatives you've considered
None
Additional context
https://go.dev/blog/pprof
https://hackernoon.com/go-the-complete-guide-to-profiling-your-code-h51r3waz
https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go
Is your feature request related to a problem? Please describe.
Badger 2.x is becoming old
Describe the solution you'd like
Update badger to the latest 3.xx release
Additional context
Badger repository
Is your feature request related to a problem? Please describe.
A collision of the badger db folder can cause a panic. There is also overhead in running an empty DB.
Describe the solution you'd like
Allow the user to pass an argument, e.g. --no-badger, to disable the badger db. This will make it easier to have parent gowarcserver instances (see issue #4) that only ask children with replicas.
Describe alternatives you've considered
None
Additional context
None
Describe the bug
When a new commit arrives in master, the CI builds an artifact even if the testing and linting CI jobs fail.
To Reproduce
Push to master
Expected behavior
Docker artifact is omitted in the event of failures in tests and linter
Screenshots
None
Additional context
Introduced in PR 34
The error handling in ServeHTTP should be refactored.
Currently there is a function that handles errors
func (h *searchHandler) handleError(err error, w http.ResponseWriter) {
	if err != nil {
		w.Header().Set("Content-Type", "text/plain")
		w.WriteHeader(404)
		fmt.Fprintf(w, "Error: %v\n", err)
	}
}
This gets called only once, in ServeHTTP:
key, err := surt.SsurtString(uri, true)
if err != nil {
	h.handleError(err, w)
	return
}
The 404 status seems too eager here, since the error can come from a failure to parse the uri, which is a client error rather than a missing resource. There are also other cases where errors are not handled properly.
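One possible direction for the refactor is to classify errors and derive the status code from the class. The sketch below is an assumption, not the project's design: errBadRequest, errNotFound, statusFor and writeError are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Hypothetical sentinel errors used to classify failures.
var (
	errBadRequest = errors.New("bad request")
	errNotFound   = errors.New("not found")
)

// statusFor maps an error to an HTTP status code. Unclassified errors
// are treated as server errors rather than 404.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errBadRequest):
		return http.StatusBadRequest
	case errors.Is(err, errNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

// writeError reports err to the client with a matching status code.
func writeError(w http.ResponseWriter, err error) {
	w.Header().Set("Content-Type", "text/plain")
	w.WriteHeader(statusFor(err))
	fmt.Fprintf(w, "Error: %v\n", err)
}

func main() {
	// A uri parsing failure wrapped as a client error.
	err := fmt.Errorf("parsing uri: %w", errBadRequest)

	rec := httptest.NewRecorder()
	writeError(rec, err)
	fmt.Println(rec.Code, rec.Body.String())
}
```

With this shape, the call site in ServeHTTP would wrap the SsurtString error with the bad-request sentinel before handing it to the error writer.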
Describe the bug
When a user sends a WARC record ID as the url parameter of a search request, the endpoint responds with the records matching that ID and also reports errors from parsing the url.
To Reproduce
Steps to reproduce the behavior:
http://localhost:9999/search?url=%3Curn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007%3E
Expected behavior
Warcserver should respond with an error as soon as the url fails to parse, and should never begin a search in the db.
Screenshots
None
Additional context
None
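A pre-check before any db search might look like the sketch below. validateSearchURL is a hypothetical helper, and the rules it applies (no angle brackets or spaces, http/https only, scheme-less input allowed because it triggers the http + https search) are assumptions drawn from the issues above rather than the project's actual validation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateSearchURL rejects input that cannot be a web URL, such as a
// record ID like <urn:uuid:...>, before any search is started.
func validateSearchURL(raw string) error {
	if raw == "" || strings.ContainsAny(raw, "<> ") {
		return fmt.Errorf("invalid url %q", raw)
	}
	// Scheme-less input is allowed; add a scheme only for parsing.
	candidate := raw
	if !strings.Contains(candidate, "://") {
		candidate = "http://" + candidate
	}
	u, err := url.Parse(candidate)
	if err != nil {
		return fmt.Errorf("invalid url %q: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("invalid url %q: unsupported scheme %q", raw, u.Scheme)
	}
	if u.Hostname() == "" {
		return fmt.Errorf("invalid url %q: missing host", raw)
	}
	return nil
}

func main() {
	inputs := []string{
		"<urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>",
		"example.com/path",
		"https://example.com/",
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %v\n", in, validateSearchURL(in))
	}
}
```

A handler would call this first and return a 400 response on error, so the search is never started for such input.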
Is your feature request related to a problem? Please describe.
gowarcserver should support the Memento API. This enables users to query by time frame and to paginate easily using time.
Describe the solution you'd like
Implement the Memento specification as an optional API. It would also be nice to let build flags dictate which APIs are included in the executable.
Additional context
http://timetravel.mementoweb.org/guide/api/
Is your feature request related to a problem? Please describe.
There is a lot of bloat when it comes to binding viper and cobra
Describe the solution you'd like
Bind viper and cobra variables
if err := viper.BindPFlags(cmd.Flags()); err != nil {
	log.Fatalf("Failed to bind serve flags: %v", err)
}
Describe alternatives you've considered
Additional context
nlnwa/gowarc@1e8ab14#diff-ba80d9a080c8fa689d8dfa784ab21bea256579ab97f2aef8c28c9e3adb0289edR53
Add this project to the badger use list
Is your feature request related to a problem? Please describe.
Currently gowarcserver depends on two protobuf versions. To make matters worse, one of them is an old RC and not a stable version.
Describe the solution you'd like
Depend on one version, preferably the newest one (1.5.1 at the time of writing this)
Additional context
https://pkg.go.dev/github.com/golang/[email protected]
Is your feature request related to a problem? Please describe.
We should avoid introducing common problems into our codebase
Describe the solution you'd like
Go through https://golangci-lint.run/usage/linters/#disabled-by-default-linters--e--enable and see what we want to enable
Is your feature request related to a problem? Please describe.
As gowarcserver is used with bigger datasets, the need to split large collections might arise.
Describe the solution you'd like
Files can inherently be split, since they consist of WARC records. Add support for config files to express indexing a subsection of a file.
Example:
config for instance 1, index and "own" records in warc including record 0 to 10
indexFiles:
- example.warc:0-10
config for instance 2, index and "own" records in warc including record 11 and the remaining in the file
indexFiles:
- example.warc:11-
The configuration should also be set up so that instances 1 and 2 are siblings (common parent), in order for queries about example.warc to work.
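Parsing the proposed file:first-last notation could be sketched as follows. The fileRange type and parseFileRange are hypothetical, and the choice of -1 to mean "until the end of the file" is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fileRange is a file path plus an inclusive record range.
// Last == -1 means "until the end of the file".
type fileRange struct {
	Path  string
	First int
	Last  int
}

// parseFileRange parses specs such as "example.warc:0-10" and
// "example.warc:11-". A spec without a colon selects the whole file.
func parseFileRange(spec string) (fileRange, error) {
	i := strings.LastIndex(spec, ":")
	if i < 0 {
		return fileRange{Path: spec, First: 0, Last: -1}, nil
	}
	path, rng := spec[:i], spec[i+1:]
	lo, hi, ok := strings.Cut(rng, "-")
	if !ok || path == "" {
		return fileRange{}, fmt.Errorf("invalid range spec %q", spec)
	}
	first, err := strconv.Atoi(lo)
	if err != nil || first < 0 {
		return fileRange{}, fmt.Errorf("invalid first record in %q", spec)
	}
	last := -1
	if hi != "" {
		last, err = strconv.Atoi(hi)
		if err != nil || last < first {
			return fileRange{}, fmt.Errorf("invalid last record in %q", spec)
		}
	}
	return fileRange{Path: path, First: first, Last: last}, nil
}

func main() {
	for _, spec := range []string{"example.warc:0-10", "example.warc:11-"} {
		r, err := parseFileRange(spec)
		fmt.Println(r, err)
	}
}
```

Splitting on the last colon keeps paths that themselves contain a colon working, as long as a range is given.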
Describe alternatives you've considered
None
Additional context
None
Is your feature request related to a problem? Please describe.
Currently gowarcserver has 3 index databases in different formats. If one or two of these formats are not needed, they bloat the memory footprint of the application without any benefit.
Describe the solution you'd like
Allow the user to toggle off individual index databases through config and arguments.
Describe alternatives you've considered
Removing id index might also be a good change if we are not using the id index in production.
Is your feature request related to a problem? Please describe.
There is a relatively big WIP change to gowarc which will update the gowarc API
Describe the solution you'd like
Update the dependency to use the new changes in gowarc
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
Badger has problems when the index becomes too big. Currently, we work around this by running multiple gowarcserver instances.
Describe the solution you'd like
We want a distributed DB for gowarcserver, to reduce gowarcserver's own responsibility.
Describe alternatives you've considered
None
Additional context
https://tikv.org/
Based on meeting with @maeb. He had an idea of a potential direction to improve gowarcserver.
Is your feature request related to a problem? Please describe.
This will solve two problems.
Describe the solution you'd like
We can structure gowarcservers like a tree. Each node in the tree can hold records and N child nodes. Arguments or the config file should let you point at the child nodes of the gowarcserver that is being started. When the server receives a query, it should process the query itself while also asking all children to do the same. How it should combine results is left undefined for now, e.g. cancelling the requests to children and sending the first item found, or waiting for all children to answer before aggregating the result. It's important to note that, based on the diagram, the only difference between a parent node and a leaf node is that the leaf node has no registered children. Programmatically they should be identical.
Problem 1 will be solved by introducing the concept of a parent-child relation. It will allow us to set up a network of servers where a root instance can aggregate queries throughout the gowarcserver network. Loke will only have to know about the root. As a result, the end user does not have to care about which collection contains the target record.
Problem 2 will be solved by the fact that queries can be sent to children and self concurrently using goroutines, which should make queries scale with increased data. Indexing of records will also be distributed without tying it to a topic or area (e.g. all indexing of newspapers having to be central).
It's worth noting that this will introduce greater complexity to the codebase, and abusing the tree structure might lead to slower results, as requests will be chained based on tree depth.
This will also open up future optimizations. Examples could be caching common queries where no changes have been made in the db, or skipping nodes when we already know the target node for a query.
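The fan-out to self and children can be sketched with goroutines as below. This illustrates only the "wait for all children before aggregating" option. The searcher type, searchAll and errChildDown are hypothetical, plain functions stand in for the local index and for HTTP calls to child servers, and skipping failed children is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// searcher is a hypothetical abstraction over "something that can answer
// a query": the local index or a remote child server.
type searcher func(query string) ([]string, error)

// errChildDown simulates an unreachable child in the example.
var errChildDown = errors.New("child unreachable")

// searchAll queries every node concurrently, waits for all of them and
// merges the results. Nodes that fail are skipped.
func searchAll(query string, nodes ...searcher) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(s searcher) {
			defer wg.Done()
			res, err := s(query)
			if err != nil {
				return // assumption: a failing child does not fail the query
			}
			mu.Lock()
			out = append(out, res...)
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	sort.Strings(out) // deterministic order for the caller
	return out
}

func main() {
	local := func(q string) ([]string, error) { return []string{q + " @self"}, nil }
	child := func(q string) ([]string, error) { return []string{q + " @child"}, nil }
	down := func(q string) ([]string, error) { return nil, errChildDown }

	fmt.Println(searchAll("example.com", local, child, down))
}
```

Because a leaf simply has no children to pass in, parent and leaf nodes run the same code path, which matches the requirement that they be programmatically identical.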
Additional context
Google's talk about Go servers (mainly from slide 33 onwards)
Potential API http://timetravel.mementoweb.org/guide/api/
Describe the bug
It seems like badger refuses connections to the new pod when using the rollout strategy. The scope of fixing this bug might be too big to be worth it.
@maeb can expand this report if he wants (i.e steps to reproduce)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Screenshots
Additional context
Is your feature request related to a problem? Please describe.
It's easy to abuse the functionality in daterange. It should be refactored to use a timestamp or another mechanism that makes values harder to abuse.
Describe the solution you'd like
Convert daterange string to an int or date oriented type
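A date-oriented type could be sketched as below. Everything here is an assumption: the dateRange type and its functions are hypothetical, and the 14-digit YYYYMMDDhhmmss layout is a guess at the string format, not something the issue states.

```go
package main

import (
	"fmt"
	"time"
)

// timestampLayout is an assumed 14-digit format (YYYYMMDDhhmmss).
const timestampLayout = "20060102150405"

// dateRange holds validated bounds instead of raw strings.
type dateRange struct {
	From, To time.Time
}

// parseTimestamp converts a 14-digit string into a time.Time.
func parseTimestamp(s string) (time.Time, error) {
	return time.Parse(timestampLayout, s)
}

// newDateRange validates both bounds and their order up front, so invalid
// values are rejected at the edge rather than deep inside a search.
func newDateRange(from, to string) (dateRange, error) {
	f, err := parseTimestamp(from)
	if err != nil {
		return dateRange{}, fmt.Errorf("invalid from %q: %w", from, err)
	}
	t, err := parseTimestamp(to)
	if err != nil {
		return dateRange{}, fmt.Errorf("invalid to %q: %w", to, err)
	}
	if t.Before(f) {
		return dateRange{}, fmt.Errorf("to %q is before from %q", to, from)
	}
	return dateRange{From: f, To: t}, nil
}

// Contains reports whether ts lies within the range, bounds included.
func (r dateRange) Contains(ts time.Time) bool {
	return !ts.Before(r.From) && !ts.After(r.To)
}

func main() {
	r, err := newDateRange("20200101000000", "20201231235959")
	if err != nil {
		fmt.Println(err)
		return
	}
	ts, _ := parseTimestamp("20200615120000")
	fmt.Println(r.Contains(ts))
}
```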
Describe alternatives you've considered
None
Additional context
None
Describe the bug
Using the same flag name in different cobra subcommands does not work. See spf13/viper#233 for a description of the issue.
We currently use the port flag in both the serve and proxy subcommands.
Solution
The solution outlined in spf13/viper#233 (comment) will fix the problem.
Is your feature request related to a problem? Please describe.
The indexer tries to index any and all files in the traversal path.
Describe the solution you'd like
A flag or environment variable to specify a pattern for filenames to include/exclude, e.g. INCLUDE="*.warc.gz" or EXCLUDE="*.md5".
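Filtering file names by such patterns can be sketched with the standard library's glob matching. shouldIndex is a hypothetical helper, and the precedence (exclude wins, an empty include pattern means "everything") is an assumption made for the example.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shouldIndex decides whether a file should be indexed, matching the
// patterns against the base name of the path. An empty pattern is
// ignored; an exclude match takes precedence over an include match.
func shouldIndex(path, include, exclude string) (bool, error) {
	name := filepath.Base(path)
	if exclude != "" {
		excluded, err := filepath.Match(exclude, name)
		if err != nil {
			return false, err // malformed pattern
		}
		if excluded {
			return false, nil
		}
	}
	if include == "" {
		return true, nil
	}
	return filepath.Match(include, name)
}

func main() {
	files := []string{"/data/a.warc.gz", "/data/a.warc.gz.md5", "/data/readme.txt"}
	for _, f := range files {
		ok, err := shouldIndex(f, "*.warc.gz", "*.md5")
		fmt.Println(f, ok, err)
	}
}
```

The patterns would come from the proposed flag or environment variable; the indexer would call the helper for each file it encounters while walking the path.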