Comments (5)
For now, I remove some numbers by trying to cast each token to a float and dropping it if the cast works. Tokens like +1.23/-7.23
are not removed though, and there are also quite a few other tokens containing just punctuation that we might have to look at later anyway.
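A minimal sketch of such a float-cast check, for illustration only; `is_number` and the token list are hypothetical and not taken from the crawler. In plain Python, `float()` accepts signed strings such as `+1.23`, and also words like `nan` and `inf`, while punctuation-only tokens are rejected, so what actually survives depends on where in the pipeline the check runs:

```python
def is_number(token: str) -> bool:
    """Return True if Python's float() can parse the token."""
    try:
        float(token)
    except ValueError:
        return False
    return True


# Hypothetical tokens, chosen to show the edge cases of the check.
tokens = ["model", "42", "3.14", "+1.23", "-7.23", "1e5", "nan", "...", "bert-base"]

# Keep only the tokens that float() cannot parse.
kept = [t for t in tokens if not is_number(t)]
print(kept)
```

Punctuation-only tokens like `...` pass through this filter untouched, so they would need a separate check.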
from cs-insights-crawler.
I checked what was wrong with the tf-idf function: I had simply overlooked that it generates a matrix, because by using .idf_
I already got the weighting for each feature/token out of it. I changed the function so that you can access the matrix more easily.
We can get the highest tf-idf scores using this, which gives us the following: Highest tf-idf scores in selection: [('+0.4', 1, 6.938854596835685), ('+0.6', 1, 6.938854596835685), ('+0.7', 1, 6.938854596835685), ('+1', 1, 6.938854596835685), ('-25.3', 1, 6.938854596835685), ('-50.5', 1, 6.938854596835685),
followed by some links. As you can see, removing numbers isn't trivial: I can only give sklearn a list of words to remove, and listing every number in every possible notation is not feasible. Numbers no longer appear in the visualizations done with scattertext or pyLDAvis, though.
The issue with the counting during the demo was caused by a missing default value, so the CLI overwrote the other default value.
Thanks for the update Lennart.
I wonder if .idf_ is just the inverse document frequency part of the equation. In any case, if we can access the entire matrix, it should be fine.
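For reference, with scikit-learn's default settings (`smooth_idf=True`), the idf of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. The sketch below recomputes that value in plain Python on a made-up toy corpus; `smoothed_idf` and the documents are hypothetical names and data used only for illustration:

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed idf as used by scikit-learn when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Toy corpus, already tokenized (hypothetical data).
docs = [
    ["neural", "network", "training"],
    ["neural", "machine", "translation"],
    ["neural", "parsing"],
]

# Document frequency: in how many documents does each token occur?
doc_freq = {}
for doc in docs:
    for token in set(doc):
        doc_freq[token] = doc_freq.get(token, 0) + 1

idf = {token: smoothed_idf(len(docs), df) for token, df in doc_freq.items()}
print(idf["neural"], idf["parsing"])
```

A token that occurs in every document gets the minimum idf of 1.0, and rarer tokens get larger values. The idf alone contains no term-frequency information, so per-document scores still need the full matrix.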
About the number issue: are you removing the numerical characters before running the tf-idf? I believe this would be easier, since we would clean the input before using it in any processing, right before/after the stopword removal. I'm still wondering if we should use TfidfVectorizer instead of TfidfTransformer. The former is usually used when the input is the raw documents, and the latter if you already have a count matrix. Also, with the former, several tasks can be automated with parameters (e.g. stopword removal, n-grams, max features, min, regex, etc.).
The CountVectorizer I use has the parameters you mentioned and creates the matrix the Tf-idf Transformer needs. I can check if the results would be the same.
The parameters are also the issue for the stopwords, as I can only pass a list of stopwords, which sklearn will remove. I think I can smuggle it into the tokenization, so that numbers are removed as well. Then we would also do the stopword removal ourselves, because we have to check numbers with a function and can't pass a list of all numbers to remove. Maybe I missed something and you can also pass a function.
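On passing a function: `CountVectorizer` does accept a callable through its `tokenizer` parameter, so number and stopword filtering can live inside the tokenization step. The following is only a sketch under assumptions; the `tokenize` helper, the stopword set, the number regex, and the example sentences are all hypothetical and not the crawler's actual code:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stopword set and number pattern, for illustration only.
STOPWORDS = {"the", "a", "of", "is"}
NUMBER_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def tokenize(text):
    """Split on whitespace, then drop stopwords and (signed) numbers."""
    return [
        token
        for token in text.split()
        if token not in STOPWORDS and not NUMBER_RE.fullmatch(token)
    ]


# token_pattern=None signals that the default regex tokenizer is not used.
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
counts = vectorizer.fit_transform(
    ["The accuracy is +1.23 higher", "A drop of -7.23 points"]
)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

The resulting count matrix can then be passed to `TfidfTransformer` as before. Lowercasing still happens in the vectorizer's preprocessing step, before the custom tokenizer is called.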
Sorry for the delay Lennart.
Yes, don't overthink this. Just a regex to get rid of punctuation/numbers is enough. Essentially, the stopword removal is nothing more than a simple comprehension that checks whether a given word is in the list or not.
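As a rough sketch of that suggestion, one regex plus one comprehension could look like the following. The stopword set, the `clean` helper, and the "keep only tokens containing a letter" rule are assumptions made for illustration, not the project's actual implementation:

```python
import re

# Hypothetical stopword set.
STOPWORDS = {"the", "is", "a", "of", "and"}

# Matches any single letter: a word character that is neither a digit nor "_".
HAS_LETTER = re.compile(r"[^\W\d_]")


def clean(tokens):
    """Drop stopwords and tokens without any letter (numbers, punctuation)."""
    return [
        token
        for token in tokens
        if HAS_LETTER.search(token) and token.lower() not in STOPWORDS
    ]


tokens = ["The", "score", "is", "+0.4", "-25.3", "...", "(", "tf-idf", "42"]
cleaned = clean(tokens)
print(cleaned)
```

Tokens that mix letters and digits are kept by this rule, which may or may not be desired.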