
docs-scraper's Introduction


⚡ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

(Screenshots: a light-themed and a dark-themed demo application for finding movies screening near the user.)

🔥 Try it! 🔥

✨ Features

  • Search-as-you-type: find search results in less than 50 milliseconds
  • Typo tolerance: get relevant matches even when queries contain typos and misspellings
  • Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
  • Sorting: sort results based on price, date, or pretty much anything else your users need
  • Synonym support: configure synonyms to include more relevant content in your search results
  • Geosearch: filter and sort documents based on geographic data
  • Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
  • Security management: control which users can access what data with API keys that allow fine-grained permissions handling
  • Multi-Tenancy: personalize search results for any number of application tenants
  • Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
  • RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
  • Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.
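As a quick taste, here is a minimal sketch using the official Python SDK; the host, key, index name, and documents below are placeholders, not values from this README.

import meilisearch

# Connect to a running instance (host and key here are placeholders).
client = meilisearch.Client("http://localhost:7700", "masterKey")
index = client.index("movies")

# Add a couple of documents; indexing is asynchronous, so this returns a task.
index.add_documents([
    {"id": 1, "title": "Carol"},
    {"id": 2, "title": "Wonder Woman"},
])

# Typo-tolerant, search-as-you-type style query.
print(index.search("wondr"))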

⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

Logos belonging to different languages and frameworks supported by Meilisearch, including React, Ruby on Rails, Go, Rust, and PHP

โš™๏ธ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at [email protected]. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

Thank you for your support!

👩‍💻 Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, the crates in this repository are not currently available on crates.io and do not follow SemVer conventions.

docs-scraper's People

Contributors

3t8, abhishak3, alallema, bidoubiwa, bors[bot], brunoocasali, buehlmann, curquiza, dependabot-preview[bot], dependabot[bot], dichotommy, eskombro, guimachiavelli, haroenv, mdraevich, meili-bors[bot], meili-bot, renehernandez, sanders41, smartmind12, spahrson, tpayet


docs-scraper's Issues

Browser Tests

The test suite needs to be run skipping any tests that use chromedriver, with pipenv run pytest ./scraper/src -k "not _browser". Is this because the path to the chromedriver is needed? If that is the reason, do you want them to be runnable? I could add an optional command line flag that lets you specify the path to the driver and a fixture to set the environment variable, which would make them runnable. Something like pipenv run pytest ./scraper/src --chromedriver=/usr/local/bin/chromedriver. The flag would be optional and wouldn't have to be set, so browser tests could still be skipped, as they are currently, if someone doesn't have chromedriver installed.

I'm also wondering whether the current browser tests are viable: I tried running them and most fail, so if adding the flag for the driver path is something you want, the tests would also need to be updated.
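For illustration, a rough sketch of what that optional flag and fixture could look like in a hypothetical conftest.py (the option name is an assumption; CHROMEDRIVER_PATH is the environment variable the scraper already reads):

# Hypothetical conftest.py sketch for the proposal above: an optional
# --chromedriver flag whose value is exported as CHROMEDRIVER_PATH so that
# browser tests can locate the driver.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--chromedriver",
        action="store",
        default=None,
        help="Path to the chromedriver binary, e.g. /usr/local/bin/chromedriver",
    )


@pytest.fixture(autouse=True)
def chromedriver_path(request, monkeypatch):
    # Only set the environment variable when the flag was actually passed,
    # so browser tests can still be skipped as they are today.
    path = request.config.getoption("--chromedriver")
    if path:
        monkeypatch.setenv("CHROMEDRIVER_PATH", path)
    return path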

Example configuration complains about invalid JSON

I've installed the latest Docker image locally. To start, I copied and pasted the example JSON you provide in your docs and changed a few properties. The JSON is 100% valid, yet the container exits with the following error:

(error screenshot not reproduced here)

Here is the JSON that was used:

{
  "index_uid": "rust-api",
  "start_urls": ["https://docs.rs/tauri/latest/tauri"],
  "sitemap_urls": [],
  "selectors": {
    "lvl0": {
      "selector": "h1",
      "global": true,
      "default_value": "Title"
    },
    "lvl1": {
      "selector": "h2",
      "global": true,
      "default_value": "Section"
    },
    "lvl2": "[title=tauri::*]",
    "lvl3": ".docblock-short",
    "lvl4": ".theme-default-content h4",
    "lvl5": ".theme-default-content h5",
    "lvl6": "null",
    "text": "#main"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

Update `path to the config file` example in the Readme

What's wrong?

Inside the Readme, we have the following example in the Run the scraper section, indicating <path-to-your-config-file>:

pipenv run ./docs_scraper <path-to-your-config-file>

But then, we can find these examples:

<path-to-your-config-file> is now config.json in the with Docker section

docker run -t --rm \
    -e MEILISEARCH_HOST_URL=<your-meilisearch-host-url> \
    -e MEILISEARCH_API_KEY=<your-meilisearch-api-key> \
    -v <absolute-path-to-your-config-file>:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

Same in the In a GitHub Action section with the following example:

       docker run -t --rm \
          -e MEILISEARCH_HOST_URL=$HOST_URL \
          -e MEILISEARCH_API_KEY=$API_KEY \
          -v $CONFIG_FILE_PATH:/docs-scraper/config.json \
          getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

Solution

We should replace the config.json occurrences with <path-to-your-config-file>.

the command docs_scraper could not be found within PATH or Pipfile's [scripts].

Hi,
I'm having trouble using the Docker scraper. I'm trying it on a Mac, and when I run the container, it tells me the command docs_scraper could not be found within PATH or Pipfile's [scripts]. I don't know what I did wrong.
Can you help me solve this problem?

My steps

First

docker run -d  -it --rm -e MEILI_MASTER_KEY=trantor   -p 9494:9494    -v $(pwd)/data.ms:/data.ms  meilisearch_chrome

  • meilisearch_chrome is my local image, which I built in order to solve the Env CHROMEDRIVER_PATH='/usr/bin/chromedriver' is not a path to a file problem

Second

docker run -t --rm     -e MEILISEARCH_HOST_URL=http://localhost:9494/    -e MEILISEARCH_API_KEY=trantor    -v /Users/silviayuan/test/docs-scraper/config.json:/docs-scraper/config.json meilisearch_chrome  pipenv run ./docs_scraper config.json

After my second step, it throws an error.

(error screenshot not reproduced here)

CONFIG is not a valid JSON

This is my config.json file

{
  "index_uid": "gsxd",
  "start_urls": ["http://my_domain.vn/"],
  "sitemap_urls": ["http://my_domain.vn/sitemap.xml"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".theme-default-content h1",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": {
      "selector": ".theme-default-content h2",
      "global": true,
      "default_value": "Chapter"
    },
    "text": ".theme-default-content p, .theme-default-content li"
  }
}

Then I run Docker:

docker run -t --rm \
    -e MEILISEARCH_HOST_URL=http://my_domain.vn:7700 \
    -e MEILISEARCH_API_KEY=key \
    -v config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

This is what I got:

Traceback (most recent call last):
  File "/docs-scraper/scraper/src/config/config_loader.py", line 99, in _load_config
    data = json.loads(config, object_pairs_hook=OrderedDict)
  File "/usr/local/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/docs-scraper/scraper/src/index.py", line 34, in run_config
    config = ConfigLoader(config)
  File "/docs-scraper/scraper/src/config/config_loader.py", line 67, in __init__
    data = self._load_config(config)
  File "/docs-scraper/scraper/src/config/config_loader.py", line 104, in _load_config
    raise ValueError('CONFIG is not a valid JSON') from value_error
ValueError: CONFIG is not a valid JSON

Please help!

Add an explanation about the config file name inside the Readme

What's wrong

Reading the Readme from top to bottom, we are given an example of a config file, but it is not clear which file we have to put it in.

The information can be found further down, in another section (the config file can be named anything, as it is passed as an argument later), but it arrives later than expected.

Solution

Explain inside the config file paragraph that the config file name can be anything and doesn't matter.

Implement "selector_rank"

Description

Currently, there is a way to add a rank to individual pages:

{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "page_rank": 5
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "page_rank": 1
    }
  ]
}

Documented here:

https://github.com/meilisearch/docs-scraper#using-page-rank-

I have a use case where my pages do not have different page ranks; rather, some content on a page has more significance than the rest.

I would like to be able to rank the content with selectors.

Basic example

<h1>My website</h1>

<h2>Reasons</h2>
<p class="introduction">FoobarFoobarFoobarFoobarFoobarFoobarFoobar</p>

<h2>Other things</h2>
 <p>Details</p>

In this case, I know that the content that is inside the "introduction" paragraph is more important than the other content at the end, and I would like my search results to reflect that.

I'm thinking of using something like this for the config:

{
  "selectors": {
    "lvl1": "h1",
    "lvl2": "h2",
    "text": [ 
            { 
                "rank": 5,
                "selector": "p.introduction"
            },
            { 
                "rank": 1,
                "selector": "p:not(.introduction)"
            }
    ]
  }
}

Other

This is a real problem I have: on my website, the CHANGELOG and README of my software are on the same page, and results from the CHANGELOG sometimes come before results from the README. The results from the README should be prioritized.

Maybe the API could be different: instead of adding an array inside the selectors, one could use:

"selector_ranks":   [ 
            { 
                "rank": 5,
                "selector": "p.introduction"
            }
    ]

And simply use "p" for the "text" selector.

Having problems running docs-scraper 0.10.1 with Python 3.6/3.8.0

I have machines running Fedora 32 and CentOS 7/8.

The development machine (Fedora 32) has Python 3.8.5, and I could run docs-scraper 0.10.1 without problems.

On the production machines, CentOS 7 (Python 3.6) and CentOS 8.2 (Python 3.6 and Python 3.8.0), the docs_scraper package compiles, but it fails when run; below is the error message.

$ pipenv run ./docs_scraper docs_scraper-config.json
Courtesy Notice: Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and create its own instead. You can set PIPENV_VERBOSITY=-1 to suppress this warning.
2020-08-07 10:02:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.kappawingman.com> (referer: None)
Traceback (most recent call last):
File "/home/username/venv/docs-scraper-0.10.1/lib64/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/username/venv/docs-scraper-0.10.1/lib64/python3.6/site-packages/scrapy/spiders/init.py", line 93, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.class.name))
NotImplementedError: DocumentationSpider.parse callback is not defined

All Python 3.6/3.8.0 environments show a warning about the 'twisted' package.
Please take a look, thanks.

Wrong command line usage in Contributing.md

In the development workflow section of the Contributing.md file, it says the following:

$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>

The line:

$ pipenv run ./docs_scraper run <path-to-your-config-file>

has an extra run argument after ./docs_scraper and before the config file path.

It should be instead:

$ pipenv run ./docs_scraper <path-to-your-config-file>

docs-scraper is not compatible with python 3.11

Because of selenium's dependency exceptiongroup, we are not compatible with Python 3.11.

   "exceptiongroup": {
            "hashes": [
                "sha256:542adf9dea4055530d6e1279602fa5cb11dab2395fa650b8674eaec35fc4a828",
                "sha256:bd14967b79cd9bdb54d97323216f8fdf533e278df937aa2a90089e7d6e06e5ec"
            ],
            "markers": "python_version < '3.11'",
            "version": "==1.0.4"
        },

The latest version of selenium seems to remove this dependency; see #283. Nonetheless, we should still ensure that we are compatible with Python 3.11 once the PR is merged.

Publish a new Docker image containing Chrome binary

In order to solve the issue #139 for the Docker users, we will need to publish a new version of the base Dockerfile containing the Chrome binary.

To accomplish this issue we need to:

  • Create a new Dockerfile with the Chrome binary additions
  • Configure GitHub Actions to release a new image version, getmeili/docs-scraper-with-chrome
  • Update the README sections regarding the usage of this new image, which will be required only by users who need the Chrome binary.

After this addition, we will be able to instruct users to use this new image when they need it, without impacting the current users of the getmeili/docs-scraper image with an unrequested increase in image size.

Hint: we could base this new image on Algolia's image https://hub.docker.com/layers/algolia/docsearch-scraper/latest/ or on this comment: #139 (comment)

docs-scraper integration with Apache Tika

Hello,
Recently I completed the task of building a local search system able to index Word / Markdown / PDF files.

I built it using the nginx autoindex module (customized a bit) and Meilisearch (scraper + engine + search bar).
Because docs-scraper does not index Word / Markdown / PDF files by default, I made some changes:

  1. for Markdown files, I used markdown2 to convert .md to .html
  2. for Word / PDF files, I used a remote server running Apache Tika to convert them to .html.

So I would like to know whether you, as developers, are interested in those changes. If yes, I will open a PR. See my code here; to be precise, see the files custom_downloader_middleware.py & documentation_spider.py.

P.S. I do not believe my code is particularly well optimized, so I'm open to criticism :)
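For reference, a simplified sketch of the two conversion paths described above (the Tika server URL is a placeholder, and this is not the actual code from the linked repository):

import markdown2
import requests

TIKA_URL = "http://tika.example.com:9998/tika"  # placeholder Tika server address


def markdown_to_html(md_text):
    # markdown2 turns Markdown source into an HTML fragment the scraper's
    # CSS selectors can work on.
    return markdown2.markdown(md_text)


def word_or_pdf_to_html(file_path):
    # Tika server converts Word/PDF documents; requesting text/html returns
    # an HTML rendering of the document.
    with open(file_path, "rb") as f:
        response = requests.put(TIKA_URL, data=f, headers={"Accept": "text/html"})
    response.raise_for_status()
    return response.text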

Having issues indexing my Jekyll website from a local container

Hello,

I am very new to meilisearch, and I am very excited to be able to integrate it to my website.

I am using docs-searchbar.js and the scraper.

The starting point of my site is generated with Jekyll, but not all pages are generated the same way; some are generated dynamically with a JS script.

I am able to index all the links redirecting to my pages (which act like a table of contents); the pages themselves appear after clicking on each link.

I thought the scraper would be able to follow the links and scrape their content, but maybe I will have to customize/override the scraper to fit my needs. I'm not very sure about that!

For example one of my pages is => http://localhost:9000/docs/en/master/getting_started/new_diff/quickstart/4_use.html,
and another one looks like this => http://localhost:9000/docs/en/master/getting_started/docs/quickstart/1_introduction.html

I tried specifying a link with its complete URL in the scraper config, and the content is indexed correctly, but that means I would have to declare every link :( if I want to make it work.

My scraper config is this:

{
  "index_uid": "docs",
  "start_urls": [
    {
      "url": "http://localhost:9000/docs/en/master/getting_started/",
      "selectors_key": "getting-started"
    },
    {
      "url": "http://localhost:9000/docs/en/master/developer-guides/",
      "selectors_key": "developer-guides"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": {
        "selector": "",
        "default_value": "Documentation"
      },
      "lvl1": "ul.site-toc li span.toc-section-heading",
      "lvl2": "h1",
      "lvl3": "h2",
      "text": "p"
    },
    "getting-started": {
      "lvl0": {
        "selector": "",
        "default_value": "Getting started"
      },
      "lvl1": "ul.site-toc li span.toc-section-heading",
      "lvl2": "h1",
      "lvl3": "h2",
      "text": ".docs .page-content-container .page-content p"
    },
    "developer-guides": {
      "lvl0": {
        "selector": "",
        "default_value": "Developer Guides"
      },
      "lvl1": ".docs h1",
      "lvl2": ".docs h2",
      "lvl3": "h3",
      "text": ".docs .page-content-container .page-content p"
    }
  },
  "custom_settings": {
    "stopWords": [
      "a", "and", "as", "at", "be", "but", "by",
      "do", "does", "doesn't", "for", "from",
      "in", "is", "it", "no", "nor", "not",
      "of", "off", "on", "or",
      "so", "should", "than", "that", "that's", "the",
      "then", "there", "there's", "these",
      "this", "those", "to", "too",
      "up", "was", "wasn't", "what", "what's", "when", "when's",
      "where", "where's", "which", "while", "who", "who's",
      "with", "won't", "would", "wouldn't"
    ],
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

Using getmeili/docs-scraper:v0.10.4 and Meilisearch version 0.19.0.

Doc Scraper removing old index on 2nd run

Initially created by @munim

Dear team,

I am trying out Meilisearch and indexing our site using the docs-scraper project from Meilisearch. It worked for me at some level, but when I ran the scraper again with the same command, it cleaned all the items and started from scratch. Here's what I did:

  1. Created a Docker network and started Meilisearch with Docker:
$ docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY='123'\
    -v $(pwd)/meili_data:/meili_data \
    --network="meilisearch-test-01" \
    getmeili/meilisearch:v0.28 \
    meilisearch --env="development"
  2. Created a scraper config file as described in the project README.

  3. Started the scraper with the following command:

$ docker run -t --rm \
    -e MEILISEARCH_HOST_URL=http://exciting_banach:7700 \
    -e MEILISEARCH_API_KEY=123 \
    --network="meilisearch-test-01" \
    -v `pwd`/test-scraper.config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json
  4. It took around 30 minutes to scrape 50K pages.
  5. I reran the scraper after making some changes to the config.
  6. Now I see that all my previous entries in Meilisearch are removed and new entries are being added.

My question is: how can I update the entries rather than removing the old entries and recreating them?

Add compatibility so that both URL path types are supported (absolute and relative)

Hello,

I am currently working on a documentation website generated with Jekyll, and Meilisearch has been pretty easy to add, with the docs-scraper and the docs-searchbar.

In fact, I have many instances of my documentation website hosted in different places.

That means I have to run docs-scraper for each site (on each update of the repository).

I wish I could run only one scraper for all my sites, and be independent of where each documentation site is hosted. So my question is:

  • Is it possible to replace absolute URLs with relative URLs in docs-scraper?

I guess I can do that by overriding some of the logic in the scraper's source code, but is there another way? (Maybe someone else has already thought about/discussed that.)

Thanks in advance!!

Ability to update documents only from the same domain

I recently set up Meilisearch for a multi-site search frontend at work, and I am using the docs-scraper image to push the scraped data from each of the sites to the Meilisearch server. Out of the box, it works great!!

As we are planning to index more documents from different internal websites, I am running into the problem of needing different configuration files to scrape sites with different layouts. With the current scraping logic, this is not supported, because every time the scraper image runs it deletes everything from the docs index.

Having the ability to update only the documents that match the domain being scraped (probably by deleting and re-adding them) would allow using multiple config files with the scraper image.

I would love to take a stab at this, but I would probably need some pointers on how to do it first.
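One possible direction, sketched with the raw Meilisearch REST routes (the index name, key, primary key field, and the presence of a url attribute on every record are assumptions; a real implementation would paginate and wait for tasks):

import requests

HOST = "http://localhost:7700"                   # placeholder host
HEADERS = {"Authorization": "Bearer masterKey"}  # placeholder key
INDEX = "docs"
DOMAIN = "https://internal-site.example.com"

# Fetch existing documents with just the fields we need (this response shape
# assumes a recent Meilisearch that wraps documents in a "results" key).
resp = requests.get(
    f"{HOST}/indexes/{INDEX}/documents",
    params={"limit": 1000, "fields": "objectID,url"},
    headers=HEADERS,
)
resp.raise_for_status()
documents = resp.json()["results"]

# Keep only the ids of documents belonging to the domain being re-scraped.
ids_to_delete = [
    doc["objectID"] for doc in documents if doc.get("url", "").startswith(DOMAIN)
]

# Delete just those documents, then let the scraper push the fresh ones.
requests.post(
    f"{HOST}/indexes/{INDEX}/documents/delete-batch",
    json=ids_to_delete,
    headers=HEADERS,
)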

Determine mandatory fields in the config file

From @sanders41 in this comment

When using a minimal docs-scraper config:

{
  "index_uid": "docs",
  "start_urls": ["https://docs.meilisearch.com"]
}

Errors are thrown for missing fields:

Then I get the error TypeError: argument of type 'NoneType' is not iterable, which happens here. I am thinking the JSON I am using is not what you have in mind? What makes me question that, and wonder if there could be an issue, is that the parser is called from ConfigLoader, and in that class selectors gets initialized to None. This makes me think calling the parser with selectors set to None shouldn't throw an error? Should I instead be using the basic config JSON file?

We should determine which fields are mandatory and which should not be.
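To make the expectation explicit, here is a small validation sketch (the choice of which fields count as mandatory is illustrative only; deciding that is the point of this issue):

MANDATORY_FIELDS = ["index_uid", "start_urls", "selectors"]  # illustrative choice


def validate_config(config):
    # Fail early with a clear message instead of a TypeError deep inside the
    # selectors parser.
    missing = [field for field in MANDATORY_FIELDS if not config.get(field)]
    if missing:
        raise ValueError("Missing mandatory config field(s): " + ", ".join(missing))


# With the minimal config above, this would raise:
# ValueError: Missing mandatory config field(s): selectors
validate_config({"index_uid": "docs", "start_urls": ["https://docs.meilisearch.com"]})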

Use facets

Make the scraper use facets (once the facets feature is available in a stable version of MeiliSearch).
It will be useful for docs versioning and for handling the different languages.

Edit

  • Add an example in the README (including the custom_settings part for attributesForFaceting), as sketched below
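Something along these lines could serve as the README example (attribute names are placeholders, and on recent Meilisearch versions the setting is called filterableAttributes rather than attributesForFaceting):

import meilisearch

# Illustrative custom_settings fragment enabling faceting on two attributes.
custom_settings = {
    "attributesForFaceting": ["version", "language"]
}

client = meilisearch.Client("http://localhost:7700", "masterKey")  # placeholders
client.index("docs").update_settings(custom_settings)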

Keep the settings of the previous index

Currently, for performance reasons, each time the scraper runs, it deletes the index, creates a new one, adds the default settings, and adds new documents (the content of the website).

Problem: if the user sets up their own settings manually (for example, adding synonyms), those settings will be removed at each scraping.
Currently, the config file does not provide any way to pass settings to MeiliSearch.

A quick fix: keep the settings of the previous index, and apply them right after the default ones. See the edit below.

In the future: provide a field in the config file to customize the settings of MeiliSearch.

Why do that right now?

We need to add synonyms in our own documentation: relevancy -> relevant

Edit

I see that the config file is already set up to receive custom_settings, but the scraper does not use it. It might be better to use this field instead 😇
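A minimal sketch of the quick fix with the official Python client (the index name and key are placeholders; a real implementation would wait for each task to finish):

import meilisearch

client = meilisearch.Client("http://localhost:7700", "masterKey")
index = client.index("docs")

DEFAULT_SETTINGS = {}  # whatever defaults the scraper currently applies

# 1. Save the settings the user may have customised on the previous index.
previous_settings = index.get_settings()

# 2. Recreate the index, as the scraper already does for performance reasons.
index.delete()
client.create_index("docs")
index = client.index("docs")

# 3. Re-apply the defaults, then the saved settings on top of them.
index.update_settings({**DEFAULT_SETTINGS, **previous_settings})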

Change master branch to main

Let's be allies and make this change that means a lot.

Here is a blog post that explains a little more why it's important and how to easily do it. It will be a bit more complicated with the automation, but we should still do it!

Missing 'requests' module?

Description
I am testing the docs-scraper on my Synology NAS, and the container stops immediately with the error listed below.

Expected behavior
Scrape some sites

Current behavior
container immediately stops on error.

Screenshots or Logs
File "./docs_scraper", line 5, in
from scraper.src.index import run_config
File "/docs-scraper/scraper/src/index.py", line 7, in
import requests

Environment (please complete the following information):
Synology DSM 7.1 running Docker
docs-scraper v0.12.3 Docker image (getmeili/docs-scraper:v0.12.3)

Fix chromedriver tests in CI that had to be removed

Description
The chromedriver tests have been removed from the CI again because of failing tests with selenium.

See failing tests

run: pipenv run pytest -m "not chromedriver"

Expected behavior
Not fail

Current behavior

=========================== short test summary info ============================
FAILED tests/config_loader/open_selenium_browser_test.py::TestOpenSeleniumBrowser::test_browser_needed_when_config_contains_automatic_tag


TypeError: argument of type 'NoneType' is not iterable

I am on the Meilisearch droplet from Digitalocean.

Fresh installation of docs-scraper via install directions, python 3.8.2.

When attempting to run the first scrape, I receive the following error:

pipenv run ./docs_scraper config.json
Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/root/docs-scraper/scraper/src/index.py", line 34, in run_config
    config = ConfigLoader(config)
  File "/root/docs-scraper/scraper/src/config/config_loader.py", line 81, in __init__
    self._parse()
  File "/root/docs-scraper/scraper/src/config/config_loader.py", line 114, in _parse
    self.selectors = SelectorsParser().parse(self.selectors)
  File "/root/docs-scraper/scraper/src/config/selectors_parser.py", line 64, in parse
    if 'lvl0' in config_selectors:
TypeError: argument of type 'NoneType' is not iterable

I have reduced my config json to the smallest possible to rule out any issues:

{ "index_uid": "docs", "start_urls": ["https://socialtools.io"], "strip_chars": " .,;:#" }

Automate push

Currently, I manually push changes to Heroku.
It would be better to add a GitHub Action to automatically deploy to the Heroku repository when pushing to the master branch, or when pushing a tag.

Remove nb_hits update in config file

Since we don't use this metric (we don't actually update the config file and push it to GitHub after a run), we should remove the file update from the code.

But I think it's still interesting to keep the final prompt that says:

Nb hits: XXX

docs-scraper for everyone

Currently, docs-scraper scrapes only MeiliSearch's documentation.

This repository could work for any documentation site. But, so far, the repo is not perfect and the README does not provide enough information.

The steps are:

  • change the code to remove all the "useless" parts. Solved with #8.
  • detail the README to explain how to use it and how I deploy it on Heroku with a cron job. Instead, we could provide a Docker image to run the scraper after each docs deployment.

Run the integration tests on the built Docker images as well

Recently we had an issue with a mismatch between the chromedriver version our Python/Docker image was using and the one required by selenium (I suppose); see #284.

This is bound to happen again, and no tests are run to ensure we have no mismatch.
Thus, we should add a test ensuring that our Docker images are working.

Test the meilisearch implementation

These tests only cover communication with MeiliSearch, not the relevance of the scraper:

Tests should be written to check that the MeiliSearch implementation works correctly.

In meilisearch_helper.py the following is done:

  • Delete the scrape index if it already exists
  • Create a new index with the same name
  • Add default and custom settings to index

Each of these should be tested to confirm it was done successfully.
You can confirm it worked correctly using the GET /indexes method.

A test directory should be created: meilisearch_***.

In that directory, the different tests should be added:

  • A simple meilisearch configuration with the right credentials and no settings (#154)
    • Check if the index was correctly added to Meilisearch
    • Check if the default settings were added correctly
  • A simple meilisearch configuration with the right credentials and settings
    • Check if the index was correctly added to Meilisearch
    • Check if the default settings were added correctly
  • A simple meilisearch configuration with the right credentials and bad settings
    • Check if the index was correctly added to Meilisearch
    • Check if an error is raised

To run these tests, there should be a running instance of Meilisearch (a test sketch follows the workflow snippet below).

.github/workflows/test-lint.yml

      - name: Docker setup
        run: docker run -d -p 7700:7700 getmeili/meilisearch:latest ./meilisearch --no-analytics=true --master-key='masterKey'
      - name: Run tests
        run: pipenv run pytest ./scraper/src -k "not _browser"
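As a starting point, here is a sketch of one such test against a local instance started as in the workflow above (the index uid and the expectation that the default settings include rankingRules are assumptions):

import requests

HOST = "http://localhost:7700"
HEADERS = {"Authorization": "Bearer masterKey"}


def test_index_created_with_default_settings():
    index_uid = "docs"  # the index the scraper is expected to have created

    # The index should be listed by GET /indexes (newer Meilisearch versions
    # wrap the list in a "results" key, older ones return a bare array).
    payload = requests.get(f"{HOST}/indexes", headers=HEADERS).json()
    index_list = payload["results"] if isinstance(payload, dict) else payload
    assert index_uid in [index["uid"] for index in index_list]

    # The settings route should answer and contain the ranking rules that
    # meilisearch_helper.py applies by default.
    settings = requests.get(f"{HOST}/indexes/{index_uid}/settings", headers=HEADERS).json()
    assert "rankingRules" in settings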

Upgrade scrapy to v2.3.0 by removing the NotImplementedError

Since this scrapy upgrade, we get an error when running:

$ pipenv run ./docs_scraper config.json
> Docs-Scraper: https://docs.meilisearch.com 27 records)
2020-09-10 14:42:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://docs.meilisearch.com> (referer: None)
Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.8/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/curquiza/Documents/docs-scraper/scraper/src/documentation_spider.py", line 184, in parse_from_start_url
    return self.parse(response)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.8/site-packages/scrapy/spiders/__init__.py", line 93, in parse
    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: DocumentationSpider.parse callback is not defined

Nb hits: 27

Only the master branch is concerned; the latest release (v0.10.1) does not contain this error.

Edit

I reverted the PR in question (manually, because GitHub wasn't able to revert it automatically) to make the master branch work again. See #65.
The new goal of this issue is to upgrade Scrapy from v2.2.1 to v2.3.0 while fixing the NotImplementedError at the same time.
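For context, a minimal illustration of the behaviour change (this is not the actual DocumentationSpider code): with Scrapy 2.3, any request that falls back to the default callback lands in Spider.parse, which raises NotImplementedError unless the subclass defines it.

import scrapy


class SketchSpider(scrapy.Spider):
    # Illustrative spider only, not the real DocumentationSpider.
    name = "docs-sketch"
    start_urls = ["https://docs.meilisearch.com"]

    def parse(self, response):
        # Defining parse (or always passing an explicit callback when
        # scheduling requests) avoids the NotImplementedError above.
        yield {"url": response.url, "title": response.css("title::text").get()}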

Cannot scrape documents into Meilisearch

Greetings,

I am using Pelican as my blog generator.
I am not using Ubuntu, so I need to use Docker to run docs-scraper.

I can run the small tutorial and import data into Meilisearch.

But I cannot run docs-scraper to get data into Meilisearch.
Below is the error:

Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/docs-scraper/scraper/src/index.py", line 43, in run_config
    config.custom_settings
  File "/docs-scraper/scraper/src/meilisearch_helper.py", line 108, in __init__
    settings = {**MeiliSearchHelper.SETTINGS, **custom_settings}
TypeError: 'NoneType' object is not a mapping

Below is my config.json:

{
  "index_uid": "docs",
  "sitemap_urls": ["https://www.kappawingman.com/sitemap.xml"],
  "start_urls": ["https://www.kappawingman.com"],
  "selectors": {
    "lvl0": {
      "selector": ".entry-content",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "#main_content h1",
    "lvl2": ".toc-backref h2",
    "lvl3": ".toc-backref h3",
    "text": ".entry-content p, .entry-content li"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true
}

On the meilisearch console, I saw these messages:
[2020-06-09T16:15:34Z INFO tide::middleware::logger] DELETE /indexes/docs 204 17ms
[2020-06-09T16:15:34Z INFO tide::middleware::logger] POST /indexes 201 15ms

Any help would be appreciated, thanks.

docs scraper failed to build indexes with v0.28.0 405 Client Error: Method Not Allowed for url

Description
The docs-scraper fails to build indexes with Meilisearch v0.28.0.

Current behavior

Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 101, in __validate
    request.raise_for_status()
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url: https://meilisearch.owenyoung.com/indexes/owen-blog/settings
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/docs-scraper/scraper/src/index.py", line 40, in run_config
    meilisearch_helper = MeiliSearchHelper(
  File "/docs-scraper/scraper/src/meilisearch_helper.py", line 105, in __init__
    self.add_settings(MeiliSearchHelper.SETTINGS, custom_settings)
  File "/docs-scraper/scraper/src/meilisearch_helper.py", line 109, in add_settings
    self.meilisearch_index.update_settings(settings)
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/index.py", line 641, in update_settings
    return self.http.post(
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 63, in post
    return self.send_request(requests.post, path, body, content_type)
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 45, in send_request
    return self.__validate(request)
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 104, in __validate
    raise MeiliSearchApiError(str(err), request) from err
meilisearch.errors.MeiliSearchApiError: MeiliSearchApiError. 405 Client Error: Method Not Allowed for url: https://meilisearch.owenyoung.com/indexes/owen-blog/settings
Error: Process completed with exit code 1.

Environment (please complete the following information):

  • OS: [e.g. Debian GNU/Linux] Debian
  • Meilisearch version: [e.g. v.0.28.0]
  • docs-scraper version: [e.g v0.12.2]

Upgrade scrapy

This package currently uses scrapy v1.7.4.

We cannot upgrade it to v2.X.X, or even to v1.8.0, because of this error:

2020-06-19 15:49:41 [scrapy.core.downloader.handlers] ERROR: Loading "scrapy.core.downloader.handlers.http.HTTPDownloadHandler" for scheme "http"
Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 46, in __init__
    method=self._sslMethod,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/__init__.py", line 51, in _load_handler
    dh = dhcls(self._crawler.settings)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 53, in __init__
    crawler=None,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'
2020-06-19 15:49:41 [scrapy.core.downloader.handlers] ERROR: Loading "scrapy.core.downloader.handlers.http.HTTPDownloadHandler" for scheme "https"
Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 46, in __init__
    method=self._sslMethod,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/__init__.py", line 51, in _load_handler
    dh = dhcls(self._crawler.settings)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 53, in __init__
    crawler=None,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'
2020-06-19 15:49:41 [docs-test] ERROR: Failure without response Unsupported URL scheme 'https': __init__() got an unexpected keyword argument 'method'
2020-06-19 15:49:41 [docs-test] ERROR: Failure without response Unsupported URL scheme 'https': __init__() got an unexpected keyword argument 'method'

Goal: fix this error and upgrade scrapy to v2.X.X.
