
docs-scraper's Introduction


⚡ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

(Screenshots: a light-themed and a dark-themed demo application for finding movies screening near the user.)

🔥 Try it! 🔥

✨ Features

  • Search-as-you-type: find search results in less than 50 milliseconds
  • Typo tolerance: get relevant matches even when queries contain typos and misspellings
  • Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
  • Sorting: sort results based on price, date, or pretty much anything else your users need
  • Synonym support: configure synonyms to include more relevant content in your search results
  • Geosearch: filter and sort documents based on geographic data
  • Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
  • Security management: control which users can access what data with API keys that allow fine-grained permissions handling
  • Multi-Tenancy: personalize search results for any number of application tenants
  • Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
  • RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
  • Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.
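As a quick taste, here is a minimal sketch using the official Python SDK; the host, key, index name, and documents below are placeholders, not values from this README.

import meilisearch

# Connect to a running instance (host and key here are placeholders).
client = meilisearch.Client("http://localhost:7700", "masterKey")
index = client.index("movies")

# Add a couple of documents; indexing is asynchronous, so this returns a task.
index.add_documents([
    {"id": 1, "title": "Carol"},
    {"id": 2, "title": "Wonder Woman"},
])

# Typo-tolerant, search-as-you-type style query.
print(index.search("wondr"))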

⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

Logos belonging to different languages and frameworks supported by Meilisearch, including React, Ruby on Rails, Go, Rust, and PHP

โš™๏ธ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at [email protected]. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

Thank you for your support!

👩‍💻 Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, the crates in this repository are not currently available on crates.io and do not follow SemVer conventions.

docs-scraper's People

Contributors

3t8, abhishak3, alallema, bidoubiwa, bors[bot], brunoocasali, buehlmann, curquiza, dependabot-preview[bot], dependabot[bot], dichotommy, eskombro, guimachiavelli, haroenv, mdraevich, meili-bors[bot], meili-bot, renehernandez, sanders41, smartmind12, spahrson, tpayet


docs-scraper's Issues

Browser Tests

The test suite needs to be run skipping any tests that use chromedriver, with pipenv run pytest ./scraper/src -k "not _browser". Is this because the path to the chromedriver is needed? If that is the reason, do you want them to be runnable? I could add an optional command line flag that lets you specify the path to the driver and a fixture to set the environment variable, which would make them runnable. Something like pipenv run pytest ./scraper/src --chromedriver=/usr/local/bin/chromedriver. The flag would be optional and wouldn't have to be set, so browser tests could still be skipped, as they are currently, if someone doesn't have chromedriver installed.

I'm also wondering whether the current browser tests are viable: I tried running them and most fail, so if adding the flag for the driver path is something you want, the tests would also need to be updated.
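For illustration, a rough sketch of what that optional flag and fixture could look like in a hypothetical conftest.py (the option name is an assumption; CHROMEDRIVER_PATH is the environment variable the scraper already reads):

# Hypothetical conftest.py sketch for the proposal above: an optional
# --chromedriver flag whose value is exported as CHROMEDRIVER_PATH so that
# browser tests can locate the driver.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--chromedriver",
        action="store",
        default=None,
        help="Path to the chromedriver binary, e.g. /usr/local/bin/chromedriver",
    )


@pytest.fixture(autouse=True)
def chromedriver_path(request, monkeypatch):
    # Only set the environment variable when the flag was actually passed,
    # so browser tests can still be skipped as they are today.
    path = request.config.getoption("--chromedriver")
    if path:
        monkeypatch.setenv("CHROMEDRIVER_PATH", path)
    return path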

Example configuration complains about invalid JSON

I've installed the latest Docker image locally. To start, I copied and pasted the example JSON you provide in your docs and changed a few properties. The JSON is 100% valid, yet the container exits with the following error:

(error screenshot not reproduced here)

Here is the JSON that was used:

{
  "index_uid": "rust-api",
  "start_urls": ["https://docs.rs/tauri/latest/tauri"],
  "sitemap_urls": [],
  "selectors": {
    "lvl0": {
      "selector": "h1",
      "global": true,
      "default_value": "Title"
    },
    "lvl1": {
      "selector": "h2",
      "global": true,
      "default_value": "Section"
    },
    "lvl2": "[title=tauri::*]",
    "lvl3": ".docblock-short",
    "lvl4": ".theme-default-content h4",
    "lvl5": ".theme-default-content h5",
    "lvl6": "null",
    "text": "#main"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

Update `path to the config file` example in the Readme

What's wrong?

Inside the Readme, we have the following example in the Run the scraper section, indicating <path-to-your-config-file>:

pipenv run ./docs_scraper <path-to-your-config-file>

But then, we can find these examples:

<path-to-your-config-file> is now config.json in the with Docker section

docker run -t --rm \
    -e MEILISEARCH_HOST_URL=<your-meilisearch-host-url> \
    -e MEILISEARCH_API_KEY=<your-meilisearch-api-key> \
    -v <absolute-path-to-your-config-file>:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

Same in the In a GitHub Action section with the following example:

       docker run -t --rm \
          -e MEILISEARCH_HOST_URL=$HOST_URL \
          -e MEILISEARCH_API_KEY=$API_KEY \
          -v $CONFIG_FILE_PATH:/docs-scraper/config.json \
          getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

Solution

We should replace the config.json occurrences with <path-to-your-config-file>.

the command docs_scraper could not be found within PATH or Pipfile's [scripts].

Hi,
I'm having trouble using the Docker scraper. I'm trying it on a Mac, and when I run the container, it tells me the command docs_scraper could not be found within PATH or Pipfile's [scripts]. I don't know what I did wrong.
Can you help me solve this problem?

My steps

First

docker run -d  -it --rm -e MEILI_MASTER_KEY=trantor   -p 9494:9494    -v $(pwd)/data.ms:/data.ms  meilisearch_chrome

  • meilisearch_chrome is my local image, which I built in order to solve the Env CHROMEDRIVER_PATH='/usr/bin/chromedriver' is not a path to a file problem

Second

docker run -t --rm     -e MEILISEARCH_HOST_URL=http://localhost:9494/    -e MEILISEARCH_API_KEY=trantor    -v /Users/silviayuan/test/docs-scraper/config.json:/docs-scraper/config.json meilisearch_chrome  pipenv run ./docs_scraper config.json

After my second step, it throws an error.

(error screenshot not reproduced here)

CONFIG is not a valid JSON

This is my config.json file

{
  "index_uid": "gsxd",
  "start_urls": ["http://my_domain.vn/"],
  "sitemap_urls": ["http://my_domain.vn/sitemap.xml"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".theme-default-content h1",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": {
      "selector": ".theme-default-content h2",
      "global": true,
      "default_value": "Chapter"
    },
    "text": ".theme-default-content p, .theme-default-content li"
  }
}

Then I run Docker:

docker run -t --rm \
    -e MEILISEARCH_HOST_URL=http://my_domain.vn:7700 \
    -e MEILISEARCH_API_KEY=key \
    -v config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

This is what I got:

Traceback (most recent call last):
  File "/docs-scraper/scraper/src/config/config_loader.py", line 99, in _load_config
    data = json.loads(config, object_pairs_hook=OrderedDict)
  File "/usr/local/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/docs-scraper/scraper/src/index.py", line 34, in run_config
    config = ConfigLoader(config)
  File "/docs-scraper/scraper/src/config/config_loader.py", line 67, in __init__
    data = self._load_config(config)
  File "/docs-scraper/scraper/src/config/config_loader.py", line 104, in _load_config
    raise ValueError('CONFIG is not a valid JSON') from value_error
ValueError: CONFIG is not a valid JSON

Please help!

Add an explanation about the config file name inside the Readme

What's wrong

Reading the Readme from top to bottom, we are given an example of a config file, but it is not clear which file we have to put it in.

The information can be found further down, in another section (the config file can be named anything, as it is passed as an argument later), but it arrives later than expected.

Solution

Explain inside the config file paragraph that the config file name can be anything and doesn't matter.

Implement "selector_rank"

Description

Currently, there is a way to add a rank to individual pages:

{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "page_rank": 5
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "page_rank": 1
    }
  ]
}

Documented here:

https://github.com/meilisearch/docs-scraper#using-page-rank-

I have a use case where my pages do not have different page ranks; rather, some content on a page has more significance than the rest.

I would like to be able to rank the content with selectors.

Basic example

<h1>My website</h1>

<h2>Reasons</h2>
<p class="introduction">FoobarFoobarFoobarFoobarFoobarFoobarFoobar</p>

<h2>Other things</h2>
 <p>Details</p>

In this case, I know that the content that is inside the "introduction" paragraph is more important than the other content at the end, and I would like my search results to reflect that.

I'm thinking of using something like this for the config:

{
  "selectors": {
    "lvl1": "h1",
    "lvl2": "h2",
    "text": [ 
            { 
                "rank": 5,
                "selector": "p.introduction"
            },
            { 
                "rank": 1,
                "selector": "p:not(.introduction)"
            }
    ]
  }
}

Other

This is a real problem I have: on my website, the CHANGELOG and README of my software are on the same page, and results from the CHANGELOG sometimes come before results from the README. The results from the README should be prioritized.

Maybe the API could be different: instead of adding an array inside the selectors, one could use:

"selector_ranks":   [ 
            { 
                "rank": 5,
                "selector": "p.introduction"
            }
    ]

And simply use "p" for the "text" selector.

Having problems running docs-scraper 0.10.1 with Python 3.6/3.8.0

I have machines running Fedora 32 and CentOS 7/8.

The development machine (Fedora 32) has Python 3.8.5, and I could run docs-scraper 0.10.1 without problems.

On the production machines, CentOS 7 (Python 3.6) and CentOS 8.2 (Python 3.6 and Python 3.8.0), the docs_scraper package compiles, but it fails when run; below is the error message.

$ pipenv run ./docs_scraper docs_scraper-config.json
Courtesy Notice: Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and create its own instead. You can set PIPENV_VERBOSITY=-1 to suppress this warning.
2020-08-07 10:02:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.kappawingman.com> (referer: None)
Traceback (most recent call last):
File "/home/username/venv/docs-scraper-0.10.1/lib64/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/username/venv/docs-scraper-0.10.1/lib64/python3.6/site-packages/scrapy/spiders/init.py", line 93, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.class.name))
NotImplementedError: DocumentationSpider.parse callback is not defined

All Python 3.6/3.8.0 environments show a warning about the 'twisted' package.
Please take a look, thanks.

Wrong command line usage in Contributing.md

In the development workflow section of the Contributing.md file, it says the following:

$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>

The line:

$ pipenv run ./docs_scraper run <path-to-your-config-file>

has an extra run argument after ./docs_scraper and before the config file path.

It should be instead:

$ pipenv run ./docs_scraper <path-to-your-config-file>

docs-scraper is not compatible with python 3.11

Because of selenium's dependency exceptiongroup, we are not compatible with Python 3.11.

   "exceptiongroup": {
            "hashes": [
                "sha256:542adf9dea4055530d6e1279602fa5cb11dab2395fa650b8674eaec35fc4a828",
                "sha256:bd14967b79cd9bdb54d97323216f8fdf533e278df937aa2a90089e7d6e06e5ec"
            ],
            "markers": "python_version < '3.11'",
            "version": "==1.0.4"
        },

The latest version of selenium seems to remove this dependency; see #283. Nonetheless, we should still ensure that we are compatible with Python 3.11 once the PR is merged.

Publish a new Docker image containing Chrome binary

In order to solve the issue #139 for the Docker users, we will need to publish a new version of the base Dockerfile containing the Chrome binary.

To accomplish this issue we need to:

  • Create a new Dockerfile with the Chrome binary additions
  • Configure GitHub Actions to release a new image version, getmeili/docs-scraper-with-chrome
  • Update the README sections regarding the usage of this new image, which will be required only by users who need the Chrome binary.

After this addition, we will be able to instruct users to use this new image when they need it, without impacting the current users of the getmeili/docs-scraper image with an unrequested increase in image size.

Hint: we could base this new image on Algolia's image https://hub.docker.com/layers/algolia/docsearch-scraper/latest/ or on this comment: #139 (comment)

docs-scraper integration with Apache Tika

Hello,
Recently I completed the task of building a local search system able to index Word / Markdown / PDF files.

I built it using the nginx autoindex module (customized a bit) and Meilisearch (scraper + engine + search bar).
Because docs-scraper does not index Word / Markdown / PDF files by default, I made some changes:

  1. for Markdown files, I used markdown2 to convert .md to .html
  2. for Word / PDF files, I used a remote server running Apache Tika to convert them to .html.

So I would like to know whether you, as developers, are interested in those changes. If yes, I will open a PR. See my code here; to be precise, see the files custom_downloader_middleware.py & documentation_spider.py.

P.S. I do not believe my code is particularly well optimized, so I'm open to criticism :)
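For reference, a simplified sketch of the two conversion paths described above (the Tika server URL is a placeholder, and this is not the actual code from the linked repository):

import markdown2
import requests

TIKA_URL = "http://tika.example.com:9998/tika"  # placeholder Tika server address


def markdown_to_html(md_text):
    # markdown2 turns Markdown source into an HTML fragment the scraper's
    # CSS selectors can work on.
    return markdown2.markdown(md_text)


def word_or_pdf_to_html(file_path):
    # Tika server converts Word/PDF documents; requesting text/html returns
    # an HTML rendering of the document.
    with open(file_path, "rb") as f:
        response = requests.put(TIKA_URL, data=f, headers={"Accept": "text/html"})
    response.raise_for_status()
    return response.text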

Having issues indexing my Jekyll website from a local container

Hello,

I am very new to meilisearch, and I am very excited to be able to integrate it to my website.

I am using docs-searchbar.js and the scraper.

The starting point of my site is generated with Jekyll, but not all pages are generated the same way; some are generated dynamically with a JS script.

I am able to index all the links redirecting to my pages (which act like a table of contents); the pages themselves appear after clicking on each link.

I thought the scraper would be able to follow the links and scrape their content, but maybe I will have to customize/override the scraper to fit my needs. I'm not very sure about that!

For example one of my pages is => http://localhost:9000/docs/en/master/getting_started/new_diff/quickstart/4_use.html,
and another one looks like this => http://localhost:9000/docs/en/master/getting_started/docs/quickstart/1_introduction.html

I tried specifying a link with its complete URL in the scraper config, and the content is indexed correctly, but that means I would have to declare every link :( if I want to make it work.

My scraper config is this:

{
  "index_uid": "docs",
  "start_urls": [
    {
      "url": "http://localhost:9000/docs/en/master/getting_started/",
      "selectors_key": "getting-started"
    },
    {
      "url": "http://localhost:9000/docs/en/master/developer-guides/",
      "selectors_key": "developer-guides"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": {
        "selector": "",
        "default_value": "Documentation"
      },
      "lvl1": "ul.site-toc li span.toc-section-heading",
      "lvl2": "h1",
      "lvl3": "h2",
      "text": "p"
    },
    "getting-started": {
      "lvl0": {
        "selector": "",
        "default_value": "Getting started"
      },
      "lvl1": "ul.site-toc li span.toc-section-heading",
      "lvl2": "h1",
      "lvl3": "h2",
      "text": ".docs .page-content-container .page-content p"
    },
    "developer-guides": {
      "lvl0": {
        "selector": "",
        "default_value": "Developer Guides"
      },
      "lvl1": ".docs h1",
      "lvl2": ".docs h2",
      "lvl3": "h3",
      "text": ".docs .page-content-container .page-content p"
    }
  },
  "custom_settings": {
    "stopWords": [
      "a", "and", "as", "at", "be", "but", "by",
      "do", "does", "doesn't", "for", "from",
      "in", "is", "it", "no", "nor", "not",
      "of", "off", "on", "or",
      "so", "should", "than", "that", "that's", "the",
      "then", "there", "there's", "these",
      "this", "those", "to", "too",
      "up", "was", "wasn't", "what", "what's", "when", "when's",
      "where", "where's", "which", "while", "who", "who's",
      "with", "won't", "would", "wouldn't"
    ],
    "synonyms": {
      "relevancy": ["relevant", "relevance"],
      "relevant": ["relevancy", "relevance"],
      "relevance": ["relevancy", "relevant"]
    }
  }
}

Using getmeili/docs-scraper:v0.10.4 and Meilisearch version 0.19.0.

Doc Scraper removing old index on 2nd run

Initially created by @munim

Dear team,

I am trying out Meilisearch and indexing our site using the docs-scraper project from Meilisearch. It worked for me at some level, but when I ran the scraper again with the same command, it cleaned all the items and started from scratch. Here's what I did:

  1. Created a Docker network and started Meilisearch with Docker:
$ docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY='123'\
    -v $(pwd)/meili_data:/meili_data \
    --network="meilisearch-test-01" \
    getmeili/meilisearch:v0.28 \
    meilisearch --env="development"
  2. Created a scraper config file as described in the project README.

  3. Started the scraper with the following command:

$ docker run -t --rm \
    -e MEILISEARCH_HOST_URL=http://exciting_banach:7700 \
    -e MEILISEARCH_API_KEY=123 \
    --network="meilisearch-test-01" \
    -v `pwd`/test-scraper.config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json
  4. It took around 30 minutes to scrape 50K pages.
  5. I reran the scraper after making some changes to the config.
  6. Now I see that all my previous entries in Meilisearch are removed and new entries are being added.

My question is: how can I update the entries rather than removing the old entries and recreating them?

Add compatibility so that both URL path types are supported (absolute and relative)

Hello,

I am currently working on a documentation website generated with Jekyll, and Meilisearch has been pretty easy to add, with the docs-scraper and the docs-searchbar.

In fact, I have many instances of my documentation website hosted in different places.

That means I have to run docs-scraper for each site (on each update of the repository).

I wish I could run only one scraper for all my sites, and be independent of where each documentation site is hosted. So my question is:

  • Is it possible to replace absolute URLs with relative URLs in docs-scraper?

I guess I can do that by overriding some of the logic in the scraper's source code, but is there another way? (Maybe someone else has already thought about/discussed that.)

Thanks in advance!!

Ability to update documents only from the same domain

I recently set up Meilisearch for a multi-site search frontend at work, and I am using the docs-scraper image to push the scraped data from each of the sites to the Meilisearch server. Out of the box, it works great!!

As we are planning to index more documents from different internal websites, I am running into the problem of needing different configuration files to scrape sites with different layouts. With the current scraping logic, this is not supported, because every time the scraper image runs it deletes everything from the docs index.

Having the ability to update only the documents that match the domain being scraped (probably by deleting and re-adding them) would allow using multiple config files with the scraper image.

I would love to take a stab at this, but I would probably need some pointers on how to do it first.
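One possible direction, sketched with the raw Meilisearch REST routes (the index name, key, primary key field, and the presence of a url attribute on every record are assumptions; a real implementation would paginate and wait for tasks):

import requests

HOST = "http://localhost:7700"                   # placeholder host
HEADERS = {"Authorization": "Bearer masterKey"}  # placeholder key
INDEX = "docs"
DOMAIN = "https://internal-site.example.com"

# Fetch existing documents with just the fields we need (this response shape
# assumes a recent Meilisearch that wraps documents in a "results" key).
resp = requests.get(
    f"{HOST}/indexes/{INDEX}/documents",
    params={"limit": 1000, "fields": "objectID,url"},
    headers=HEADERS,
)
resp.raise_for_status()
documents = resp.json()["results"]

# Keep only the ids of documents belonging to the domain being re-scraped.
ids_to_delete = [
    doc["objectID"] for doc in documents if doc.get("url", "").startswith(DOMAIN)
]

# Delete just those documents, then let the scraper push the fresh ones.
requests.post(
    f"{HOST}/indexes/{INDEX}/documents/delete-batch",
    json=ids_to_delete,
    headers=HEADERS,
)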

Determine mandatory fields in the config file

From @sanders41 in this comment

When using a minimal docs-scraper config:

{
  "index_uid": "docs",
  "start_urls": ["https://docs.meilisearch.com"]
}

Errors are thrown for missing fields:

Then I get the error TypeError: argument of type 'NoneType' is not iterable, which happens here. I am thinking the JSON I am using is not what you have in mind? What makes me question that, and wonder if there could be an issue, is that the parser is called from ConfigLoader, and in that class selectors gets initialized to None. This makes me think calling the parser with selectors set to None shouldn't throw an error? Should I instead be using the basic config JSON file?

We should determine which fields are mandatory and which should not be.
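To make the expectation explicit, here is a small validation sketch (the choice of which fields count as mandatory is illustrative only; deciding that is the point of this issue):

MANDATORY_FIELDS = ["index_uid", "start_urls", "selectors"]  # illustrative choice


def validate_config(config):
    # Fail early with a clear message instead of a TypeError deep inside the
    # selectors parser.
    missing = [field for field in MANDATORY_FIELDS if not config.get(field)]
    if missing:
        raise ValueError("Missing mandatory config field(s): " + ", ".join(missing))


# With the minimal config above, this would raise:
# ValueError: Missing mandatory config field(s): selectors
validate_config({"index_uid": "docs", "start_urls": ["https://docs.meilisearch.com"]})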

Use facets

Make the scraper use facets (once the facets feature is available in a stable version of MeiliSearch).
It will be useful for docs versioning and for handling the different languages.

Edit

  • Add an example in the README (including the custom_settings part for attributesForFaceting), as sketched below
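Something along these lines could serve as the README example (attribute names are placeholders, and on recent Meilisearch versions the setting is called filterableAttributes rather than attributesForFaceting):

import meilisearch

# Illustrative custom_settings fragment enabling faceting on two attributes.
custom_settings = {
    "attributesForFaceting": ["version", "language"]
}

client = meilisearch.Client("http://localhost:7700", "masterKey")  # placeholders
client.index("docs").update_settings(custom_settings)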

Keep the settings of the previous index

Currently, for performance reasons, each time the scraper runs, it deletes the index, creates a new one, adds the default settings, and adds new documents (the content of the website).

Problem: if the user sets up their own settings manually (for example, adding synonyms), those settings will be removed at each scraping.
Currently, the config file does not provide any way to pass settings to MeiliSearch.

A quick fix: keep the settings of the previous index, and apply them right after the default ones. See the edit below.

In the future: provide a field in the config file to customize the settings of MeiliSearch.

Why do that right now?

We need to add synonyms in our own documentation: relevancy -> relevant

Edit

I see that the config file is already set up to receive custom_settings, but the scraper does not use it. It might be better to use this field instead 😇
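A minimal sketch of the quick fix with the official Python client (the index name and key are placeholders; a real implementation would wait for each task to finish):

import meilisearch

client = meilisearch.Client("http://localhost:7700", "masterKey")
index = client.index("docs")

DEFAULT_SETTINGS = {}  # whatever defaults the scraper currently applies

# 1. Save the settings the user may have customised on the previous index.
previous_settings = index.get_settings()

# 2. Recreate the index, as the scraper already does for performance reasons.
index.delete()
client.create_index("docs")
index = client.index("docs")

# 3. Re-apply the defaults, then the saved settings on top of them.
index.update_settings({**DEFAULT_SETTINGS, **previous_settings})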

Change master branch to main

Let's be allies and make this change that means a lot.

Here is a blog post that explains a little more why it's important and how to easily do it. It will be a bit more complicated with the automation, but we should still do it!

Missing 'requests' module?

Description
I am testing the docs-scraper on my Synology NAS, and the container stops immediately with the error listed below.

Expected behavior
Scrape some sites

Current behavior
container immediately stops on error.

Screenshots or Logs
File "./docs_scraper", line 5, in
from scraper.src.index import run_config
File "/docs-scraper/scraper/src/index.py", line 7, in
import requests

Environment (please complete the following information):
Synology DSM 7.1 running Docker
docs-scraper v0.12.3 Docker image (getmeili/docs-scraper:v0.12.3)

Fix chromedriver tests in CI that had to be removed

Description
The chromedriver tests have been removed from the CI again because of failing tests with selenium.

See failing tests

run: pipenv run pytest -m "not chromedriver"

Expected behavior
Not fail

Current behavior

=========================== short test summary info ============================
FAILED tests/config_loader/open_selenium_browser_test.py::TestOpenSeleniumBrowser::test_browser_needed_when_config_contains_automatic_tag


TypeError: argument of type 'NoneType' is not iterable

I am on the Meilisearch droplet from Digitalocean.

Fresh installation of docs-scraper via install directions, python 3.8.2.

When attempting to run the first scrape, I receive the following error:

pipenv run ./docs_scraper config.json
Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/root/docs-scraper/scraper/src/index.py", line 34, in run_config
    config = ConfigLoader(config)
  File "/root/docs-scraper/scraper/src/config/config_loader.py", line 81, in __init__
    self._parse()
  File "/root/docs-scraper/scraper/src/config/config_loader.py", line 114, in _parse
    self.selectors = SelectorsParser().parse(self.selectors)
  File "/root/docs-scraper/scraper/src/config/selectors_parser.py", line 64, in parse
    if 'lvl0' in config_selectors:
TypeError: argument of type 'NoneType' is not iterable

I have reduced my config json to the smallest possible to rule out any issues:

{ "index_uid": "docs", "start_urls": ["https://socialtools.io"], "strip_chars": " .,;:#" }

Automate push

Currently, I manually push changes to Heroku.
It would be better to add a GitHub Action to automatically deploy to the Heroku repository when pushing to the master branch, or when pushing a tag.

Remove nb_hits update in config file

Since we don't use this metric (we don't actually update the config file and push it to GitHub after a run), we should remove the file update from the code.

But I think it's still interesting to keep the final prompt that says:

Nb hits: XXX

docs-scraper for everyone

Currently, docs-scraper scrapes only MeiliSearch's documentation.

This repository could work for any documentation site. But, so far, the repo is not perfect and the README does not provide enough information.

The steps are:

  • change the code to remove all the "useless" parts. Solved with #8.
  • detail the README to explain how to use it and how I deploy it on Heroku with a cron job. Instead, we could provide a Docker image to run the scraper after each docs deployment.

Run the integration tests on the built Docker images as well

Recently we had an issue with a mismatch between the chromedriver version our Python/Docker image was using and the one required by selenium (I suppose); see #284.

This is bound to happen again, and no tests are run to ensure we have no mismatch.
Thus, we should add a test ensuring that our Docker images are working.

Test the meilisearch implementation

These tests only cover communication with MeiliSearch, not the relevance of the scraper:

Tests should be written to check that the MeiliSearch implementation works correctly.

In meilisearch_helper.py the following is done:

  • Delete the scrape index if it already exists
  • Create a new index with the same name
  • Add default and custom settings to index

Each of these should be tested to confirm it was done successfully.
You can confirm it worked correctly using the GET /indexes method.

A test directory should be created: meilisearch_***.

In that directory, the different tests should be added:

  • A simple meilisearch configuration with the right credentials and no settings (#154)
    • Check if the index was correctly added to Meilisearch
    • Check if the default settings were added correctly
  • A simple meilisearch configuration with the right credentials and settings
    • Check if the index was correctly added to Meilisearch
    • Check if the default settings were added correctly
  • A simple meilisearch configuration with the right credentials and bad settings
    • Check if the index was correctly added to Meilisearch
    • Check if an error is raised

To run these tests, there should be a running instance of Meilisearch (a test sketch follows the workflow snippet below).

.github/workflows/test-lint.yml

      - name: Docker setup
        run: docker run -d -p 7700:7700 getmeili/meilisearch:latest ./meilisearch --no-analytics=true --master-key='masterKey'
      - name: Run tests
        run: pipenv run pytest ./scraper/src -k "not _browser"
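As a starting point, here is a sketch of one such test against a local instance started as in the workflow above (the index uid and the expectation that the default settings include rankingRules are assumptions):

import requests

HOST = "http://localhost:7700"
HEADERS = {"Authorization": "Bearer masterKey"}


def test_index_created_with_default_settings():
    index_uid = "docs"  # the index the scraper is expected to have created

    # The index should be listed by GET /indexes (newer Meilisearch versions
    # wrap the list in a "results" key, older ones return a bare array).
    payload = requests.get(f"{HOST}/indexes", headers=HEADERS).json()
    index_list = payload["results"] if isinstance(payload, dict) else payload
    assert index_uid in [index["uid"] for index in index_list]

    # The settings route should answer and contain the ranking rules that
    # meilisearch_helper.py applies by default.
    settings = requests.get(f"{HOST}/indexes/{index_uid}/settings", headers=HEADERS).json()
    assert "rankingRules" in settings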

Upgrade scrapy to v2.3.0 by removing the NotImplementedError

Since this scrapy upgrade, we get an error when running:

$ pipenv run ./docs_scraper config.json
> Docs-Scraper: https://docs.meilisearch.com 27 records)
2020-09-10 14:42:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://docs.meilisearch.com> (referer: None)
Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.8/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/curquiza/Documents/docs-scraper/scraper/src/documentation_spider.py", line 184, in parse_from_start_url
    return self.parse(response)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.8/site-packages/scrapy/spiders/__init__.py", line 93, in parse
    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: DocumentationSpider.parse callback is not defined

Nb hits: 27

Only the master branch is concerned; the latest release (v0.10.1) does not contain this error.

Edit

I reverted the PR in question (manually, because GitHub wasn't able to revert it automatically) to make the master branch work again. See #65.
The new goal of this issue is to upgrade Scrapy from v2.2.1 to v2.3.0 while fixing the NotImplementedError at the same time.
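For context, a minimal illustration of the behaviour change (this is not the actual DocumentationSpider code): with Scrapy 2.3, any request that falls back to the default callback lands in Spider.parse, which raises NotImplementedError unless the subclass defines it.

import scrapy


class SketchSpider(scrapy.Spider):
    # Illustrative spider only, not the real DocumentationSpider.
    name = "docs-sketch"
    start_urls = ["https://docs.meilisearch.com"]

    def parse(self, response):
        # Defining parse (or always passing an explicit callback when
        # scheduling requests) avoids the NotImplementedError above.
        yield {"url": response.url, "title": response.css("title::text").get()}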

Cannot scrape documents into Meilisearch

Greetings,

I am using Pelican as my blog generator.
I am not using Ubuntu, so I need to use Docker to run docs-scraper.

I can run the small tutorial and import data into Meilisearch.

But I cannot run docs-scraper to get data into Meilisearch.
Below is the error:

Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/docs-scraper/scraper/src/index.py", line 43, in run_config
    config.custom_settings
  File "/docs-scraper/scraper/src/meilisearch_helper.py", line 108, in __init__
    settings = {**MeiliSearchHelper.SETTINGS, **custom_settings}
TypeError: 'NoneType' object is not a mapping

Below is my config.json:

{
  "index_uid": "docs",
  "sitemap_urls": ["https://www.kappawingman.com/sitemap.xml"],
  "start_urls": ["https://www.kappawingman.com"],
  "selectors": {
    "lvl0": {
      "selector": ".entry-content",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "#main_content h1",
    "lvl2": ".toc-backref h2",
    "lvl3": ".toc-backref h3",
    "text": ".entry-content p, .entry-content li"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true
}

On the meilisearch console, I saw these messages:
[2020-06-09T16:15:34Z INFO tide::middleware::logger] DELETE /indexes/docs 204 17ms
[2020-06-09T16:15:34Z INFO tide::middleware::logger] POST /indexes 201 15ms

Any help would be appreciated, thanks.

docs scraper failed to build indexes with v0.28.0 405 Client Error: Method Not Allowed for url

Description
The docs-scraper fails to build indexes with Meilisearch v0.28.0.

Current behavior

Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 101, in __validate
    request.raise_for_status()
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url: https://meilisearch.owenyoung.com/indexes/owen-blog/settings
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "./docs_scraper", line 22, in <module>
    run_config(sys.argv[1])
  File "/docs-scraper/scraper/src/index.py", line 40, in run_config
    meilisearch_helper = MeiliSearchHelper(
  File "/docs-scraper/scraper/src/meilisearch_helper.py", line 105, in __init__
    self.add_settings(MeiliSearchHelper.SETTINGS, custom_settings)
  File "/docs-scraper/scraper/src/meilisearch_helper.py", line 109, in add_settings
    self.meilisearch_index.update_settings(settings)
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/index.py", line 641, in update_settings
    return self.http.post(
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 63, in post
    return self.send_request(requests.post, path, body, content_type)
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 45, in send_request
    return self.__validate(request)
  File "/root/.local/share/virtualenvs/docs-scraper-PNgpf51m/lib/python3.8/site-packages/meilisearch/_httprequests.py", line 104, in __validate
    raise MeiliSearchApiError(str(err), request) from err
meilisearch.errors.MeiliSearchApiError: MeiliSearchApiError. 405 Client Error: Method Not Allowed for url: https://meilisearch.owenyoung.com/indexes/owen-blog/settings
Error: Process completed with exit code 1.

Environment (please complete the following information):

  • OS: [e.g. Debian GNU/Linux] Debian
  • Meilisearch version: [e.g. v.0.28.0]
  • docs-scraper version: [e.g v0.12.2]

Upgrade scrapy

This package currently uses scrapy v1.7.4.

We cannot upgrade it to v2.X.X, or even to v1.8.0, because of this error:

2020-06-19 15:49:41 [scrapy.core.downloader.handlers] ERROR: Loading "scrapy.core.downloader.handlers.http.HTTPDownloadHandler" for scheme "http"
Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 46, in __init__
    method=self._sslMethod,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/__init__.py", line 51, in _load_handler
    dh = dhcls(self._crawler.settings)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 53, in __init__
    crawler=None,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'
2020-06-19 15:49:41 [scrapy.core.downloader.handlers] ERROR: Loading "scrapy.core.downloader.handlers.http.HTTPDownloadHandler" for scheme "https"
Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 46, in __init__
    method=self._sslMethod,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/__init__.py", line 51, in _load_handler
    dh = dhcls(self._crawler.settings)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 53, in __init__
    crawler=None,
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/utils/misc.py", line 144, in create_instance
    return objcls.from_settings(settings, *args, **kwargs)
  File "/Users/curquiza/.local/share/virtualenvs/docs-scraper-ao5z5akx/lib/python3.6/site-packages/scrapy/core/downloader/contextfactory.py", line 35, in from_settings
    return cls(method=method, tls_verbose_logging=tls_verbose_logging, tls_ciphers=tls_ciphers, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'method'
2020-06-19 15:49:41 [docs-test] ERROR: Failure without response Unsupported URL scheme 'https': __init__() got an unexpected keyword argument 'method'
2020-06-19 15:49:41 [docs-test] ERROR: Failure without response Unsupported URL scheme 'https': __init__() got an unexpected keyword argument 'method'

Goal: fix this error and upgrade scrapy to v2.X.X.
