keremzaman / semantic-sh

semantic-sh is a SimHash implementation that detects and groups similar texts by leveraging word vectors and transformer-based language models (BERT).

License: MIT License

Languages: Python 98.05%, Dockerfile 1.37%, Shell 0.58%
Topics: simhash, word-vectors, fasttext, bert, locality-sensitive-hashing, transformer, text-similarity, text-clustering, text-search

semantic-sh's Issues

save/load custom models

Hi,

Hope you are all well !

Is it possible to save/dump models and load them again afterwards? I would like to avoid re-indexing all documents, because I have 230k of them.

Cheers,
X
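
For illustration only, here is a minimal sketch of what saving and reloading could look like. It assumes the index boils down to a random projection matrix plus a mapping from hash to document IDs; `save_index`, `load_index`, `proj` and `buckets` are hypothetical names and not part of semantic-sh's API. The point it tries to show is that the projection matrix has to be persisted together with the buckets, since a freshly drawn random matrix would produce different hashes for the same text.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def save_index(path, proj, buckets):
    """Persist the projection matrix and the hash buckets in one file (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump({"proj": proj, "buckets": buckets}, f)


def load_index(path):
    """Restore the projection matrix and the hash buckets (hypothetical helper)."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["proj"], state["buckets"]


# Toy state: 16 hash bits over 8-dimensional embeddings, three indexed documents.
rng = np.random.default_rng(0)
proj = rng.normal(0, 1, (16, 8))
buckets = {0x1A2B: ["doc-1", "doc-7"], 0x0F0F: ["doc-3"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.pkl"
    save_index(path, proj, buckets)
    loaded_proj, loaded_buckets = load_index(path)
```

Note that pickle files should only be loaded from trusted sources.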

add/return custom attributes

Hi,

Hope you are all well !

It would be useful to add custom attributes like the doc_id when indexing or retrieving similar documents.

Thanks in advance :-) for any insights or inputs on that.

Cheers,
X

get hash endpoint error

Hi,

Hope you are all well !

I tried to get the hash of an abstract and it triggers the following error:

semantic-sh_1                 |  * Tip: There are .env or .flaskenv files present. Do "pip install python-dotenv" to use them.
semantic-sh_1                 |  * Serving Flask app "server" (lazy loading)
semantic-sh_1                 |  * Environment: production
semantic-sh_1                 |    WARNING: This is a development server. Do not use it in a production deployment.
semantic-sh_1                 |    Use a production WSGI server instead.
semantic-sh_1                 |  * Debug mode: off
semantic-sh_1                 | /opt/service/semantic_sh/semantic_sh.py:51: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
semantic-sh_1                 |   return np.vstack((np.random.normal(0, 1, dim) for i in range(0, key_size)))
semantic-sh_1                 |  * Running on http://0.0.0.0:5001/ (Press CTRL+C to quit)
semantic-sh_1                 | Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
semantic-sh_1                 | [2020-08-09 06:48:59,687] ERROR in app: Exception on /api/hash [GET]
semantic-sh_1                 | Traceback (most recent call last):
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
semantic-sh_1                 |     response = self.full_dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1952, in full_dispatch_request
semantic-sh_1                 |     rv = self.handle_user_exception(e)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1821, in handle_user_exception
semantic-sh_1                 |     reraise(exc_type, exc_value, tb)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
semantic-sh_1                 |     raise value
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1950, in full_dispatch_request
semantic-sh_1                 |     rv = self.dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1936, in dispatch_request
semantic-sh_1                 |     return self.view_functions[rule.endpoint](**req.view_args)
semantic-sh_1                 |   File "./server.py", line 20, in generate_hash
semantic-sh_1                 |     return hex(sh.get_hash(txt))
semantic-sh_1                 |   File "/opt/service/semantic_sh/semantic_sh.py", line 88, in get_hash
semantic-sh_1                 |     y = np.matmul(self._proj, enc)
semantic-sh_1                 | ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 768 is different from 300)
semantic-sh_1                 | 51.210.37.251 - - [09/Aug/2020 06:48:59] "GET /api/hash?text=Recent+work+has+demonstrated+substantial+gains+on+many+NLP+tasks+and+benchmarks+by+pre-training+on+a+large+corpus+of+text+followed+by+fine-tuning+on+a+specific+task.+While+typically+task-agnostic+in+architecture%2C+this+method+still+requires+task-specific+fine-tuning+datasets+of+thousands+or+tens+of+thousands+of+examples.+By+contrast%2C+humans+can+generally+perform+a+new+language+task+from+only+a+few+examples+or+from+simple+instructions+-+something+which+current+NLP+systems+still+largely+struggle+to+do.+Here+we+show+that+scaling+up+language+models+greatly+improves+task-agnostic%2C+few-shot+performance%2C+sometimes+even+reaching+competitiveness+with+prior+state-of-the-art+fine-tuning+approaches.+Specifically%2C+we+train+GPT-3%2C+an+autoregressive+language+model+with+175+billion+parameters%2C+10x+more+than+any+previous+non-sparse+language+model%2C+and+test+its+performance+in+the+few-shot+setting.+For+all+tasks%2C+GPT-3+is+applied+without+any+gradient+updates+or+fine-tuning%2C+with+tasks+and+few-shot+demonstrations+specified+purely+via+text+interaction+with+the+model.+GPT-3+achieves+strong+performance+on+many+NLP+datasets%2C+including+translation%2C+question-answering%2C+and+cloze+tasks%2C+as+well+as+several+tasks+that+require+on-the-fly+reasoning+or+domain+adaptation%2C+such+as+unscrambling+words%2C+using+a+novel+word+in+a+sentence%2C+or+performing+3-digit+arithmetic.+At+the+same+time%2C+we+also+identify+some+datasets+where+GPT-3%27s+few-shot+learning+still+struggles%2C+as+well+as+some+datasets+where+GPT-3+faces+methodological+issues+related+to+training+on+large+web+corpora.+Finally%2C+we+find+that+GPT-3+can+generate+samples+of+news+articles+which+human+evaluators+have+difficulty+distinguishing+from+articles+written+by+humans.+We+discuss+broader+societal+impacts+of+this+finding+and+of+GPT-3+in+general. HTTP/1.1" 500 -

Any idea how to fix it? Is it related to the server configuration?

Cheers,
X
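
One possible reading of the traceback above: the projection matrix was created for 300-dimensional vectors (a common fastText size), while the encoder actually in use returns 768-dimensional vectors (the BERT-base hidden size), so `np.matmul` fails. The sketch below reproduces that situation with plain NumPy. It is not the library's code; `make_projection` and `simhash` are hypothetical names, and the sign-based hashing is a generic SimHash-style random projection. It also passes a list to `np.vstack`, which avoids the FutureWarning shown in the log.

```python
import numpy as np


def make_projection(key_size: int, dim: int, seed: int = 0) -> np.ndarray:
    """Draw one random hyperplane per hash bit; a list (not a generator) goes to np.vstack."""
    rng = np.random.default_rng(seed)
    return np.vstack([rng.normal(0, 1, dim) for _ in range(key_size)])


def simhash(proj: np.ndarray, enc: np.ndarray) -> int:
    """Hash an embedding by taking the sign of its projection onto each hyperplane."""
    if proj.shape[1] != enc.shape[0]:
        raise ValueError(
            f"projection expects dim {proj.shape[1]}, got embedding of dim {enc.shape[0]}"
        )
    bits = (proj @ enc) >= 0
    return int("".join("1" if b else "0" for b in bits), 2)


# Stand-in for a 768-dimensional BERT sentence embedding.
bert_like = np.ones(768)

# Projection sized for 300-dimensional vectors: same mismatch as in the log.
proj_300 = make_projection(key_size=64, dim=300)
try:
    simhash(proj_300, bert_like)
except ValueError as err:
    mismatch_message = str(err)

# Projection sized to match the encoder output: hashing works.
proj_768 = make_projection(key_size=64, dim=768)
h = simhash(proj_768, bert_like)
print(hex(h))
```

If this reading is right, the fix would be on the configuration side: the dimension used to build the projection has to match the output size of the selected model.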

semantic-sh crash after some document addition

Hi,

Hope you are all well !

I tried to push 300k abstracts into semantic-sh, but at some point the server crashes without any debug information.

Is there a way to enable some debug output?
Do you want me to prepare a dump of my dataset for you?

Thanks in advance for any replies or insights about that.

Cheers,
X

make a rest service with flask

Hi,

Hope you are all well !

Do you think it would be possible to expose semantic-sh as a RESTful web service, or to dockerize it?

Cheers,
X
