Comments (8)

huahaiy commented on August 26, 2024

Thanks. Datalevin is single writer, so it uses only one core when writing/indexing. However, you can saturate all cores when reading/searching.

from datalevin.

huahaiy commented on August 26, 2024

Thanks for trying.

It should be obvious that the benchmark code depended on the parent project's source code, as we often test our latest changes with these benchmarks. For the benchmark to work, one needed to compile Datalevin at least once to build all the classes first, e.g.

cd ..
lein test

Apparently, I should not have made the assumption that people know this, so I changed the dependency to use the released Datalevin library instead. Please pull and try again.

hierophantos commented on August 26, 2024

Thank you for your thoughtful and prompt response!

I'm walking through this document as my first foray into the project, so I'm following the documentation with fresh eyes. I'm enjoying how the writing leaves me with a greater sense of clarity.

I'm also noticing a slight typo in https://github.com/juji-io/datalevin/tree/master/search-bench#test-data, where the output path should read data/wiki.json, as follows:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

wikiextractor -o - --json --no-templates enwiki-latest-pages-articles.xml.bz2 |
jq -s '.[] | select((.text | length) > 500) | {url, text}' > data/wiki.json

Glad to bring greater clarity and precision.

hierophantos commented on August 26, 2024

Also, a question whose answer may be obvious to those more experienced: is there a way to saturate all, or at least more, of the cores on my machine for the search? I'm currently seeing only one core utilized.

hierophantos commented on August 26, 2024

The output path also needs to read data/queries40k.txt in two places here: https://github.com/juji-io/datalevin/tree/master/search-bench#test-queries.

wget https://trec.nist.gov/data/million.query/09/09.mq.topics.20001-60000.gz
gzip -d 09.mq.topics.20001-60000.gz
mv 09.mq.topics.20001-60000 data/queries40k.txt
sed -i -e 's/\([0-9]\+\)\:[0-9]\://g' data/queries40k.txt

Got another error 57m24s into the process 🤦‍♂️.

huahaiy commented on August 26, 2024

If you have already built the index, you don't have to build it again; just comment out the line that builds the index:

(index-wiki-json "data/wiki-datalevin-all" "data/wiki.json")

hierophantos commented on August 26, 2024

I was also getting errors running sed on 09.mq.topics.20001-60000 because of an encoding that it didn't know how to read. I ended up using iconv to convert the file to UTF-8 and then piping the result to awk instead. I found the syntax of awk less cumbersome than all the escape characters needed for sed, and awk produced an intermediate result that I could use to troubleshoot the error. sed was also complaining about the file not existing during this process, whereas awk let me write the results out. 🤷

With this change, my version of https://github.com/juji-io/datalevin/tree/master/search-bench#test-queries reads:

wget https://trec.nist.gov/data/million.query/09/09.mq.topics.20001-60000.gz
gzip -d 09.mq.topics.20001-60000.gz
iconv -f ISO-8859-1 -t UTF-8 09.mq.topics.20001-60000 |
awk '{gsub(/[0-9]+:[0-9]:/,"")}1' > data/queries40k.txt
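
To check what the awk step does before running it on the full file, here is a minimal sketch on a made-up line. The function name strip_query_ids and the sample line are hypothetical; I'm only assuming the input has the "digits:digit:" prefix that the pattern above targets.

```shell
# Hypothetical helper wrapping the same awk program used above.
strip_query_ids() {
  # gsub removes every "<digits>:<digit>:" match on the line;
  # the trailing 1 is an always-true pattern that prints the line.
  awk '{gsub(/[0-9]+:[0-9]:/,"")}1'
}

# Made-up sample input, not taken from the real query file.
printf '20001:1:first sample query\n' | strip_query_ids
```

The iconv step is left out of the sketch; in the real pipeline it runs first so that awk only ever sees UTF-8 input.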

Also, I noticed https://github.com/juji-io/datalevin/tree/master/search-bench#test-data needs a mkdir data to be complete:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
mkdir data
wikiextractor -o - --json --no-templates enwiki-latest-pages-articles.xml.bz2 | 
jq -s '.[] | select((.text | length) > 500) | {url, text}' > data/wiki.json

Anyways, I have the query results working now. Lookin' good! Thanks for all the feedback. 🙏

Would you want a PR that wraps datalevin.bench/index-wiki-json so that it checks whether the index already exists and skips the build if so, or would you rather handle that yourself?
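
The PR itself would live in the Clojure benchmark code, but the check I have in mind can be sketched at the shell level. This is only an illustration: index_missing is a hypothetical helper, and I'm assuming the index at data/wiki-datalevin-all is a directory that stays empty or absent until indexing has run.

```shell
# Hypothetical helper: succeeds (exit 0) when the index directory
# does not exist or contains no entries, i.e. indexing is still needed.
index_missing() {
  dir="$1"
  [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]
}

# Path taken from the index-wiki-json call quoted earlier in the thread.
if index_missing "data/wiki-datalevin-all"; then
  echo "index missing: run the indexing step"
else
  echo "index found: skipping the indexing step"
fi
```

A check like this only tells you that something is in the directory, not that a previous indexing run finished, so a partially built index would still be skipped.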

huahaiy commented on August 26, 2024

Sure thing. PR is welcome. Thanks.
