catalyst / moodle-search_elastic
An Elasticsearch engine plugin for Moodle's Global Search
Home Page: https://moodle.org/plugins/search_elastic
License: GNU General Public License v3.0
Extend plugin to support indexing and searching of images.
AWS has a service that will return content information of a provided image. Integrate this service with the plugin so images can be indexed and searched.
Relevant documentation links:
http://docs.aws.amazon.com/aws-sdk-php/v3/api/api-rekognition-2016-06-27.html#detectlabels
https://aws.amazon.com/rekognition/
Initial high level tasks
This is currently done as a complex formslib class, which hasn't been implemented in #60
It appears that Tika does not have a configuration-based limit for the size of file that the Tika service can process. Instead, the limit seems to be set by the Java memory available to the Tika application. This is not ideal.
To give some control over the size of files submitted to Tika we need to add a user configuration option to this plugin.
This configuration option will limit the size of the file sent to Tika.
If a file is larger than this setting, a file record will still be created in Elasticsearch, but the file content will not be included in the index.
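A minimal sketch of the behaviour described above, written in Python for brevity rather than in the plugin's PHP. The names `build_file_record`, `max_tika_bytes` and the record keys are hypothetical and not taken from the plugin; treating a limit of 0 as "no limit" is also an assumption.

```python
from typing import Any, Callable, Dict


def build_file_record(filename: str, size_bytes: int, max_tika_bytes: int,
                      extract_text: Callable[[], str]) -> Dict[str, Any]:
    """Always create a file record; only attach extracted text for small files."""
    record = {"title": filename, "filesize": size_bytes, "content": ""}
    # Assumption of this sketch: a limit of 0 means "no limit configured".
    if max_tika_bytes == 0 or size_bytes <= max_tika_bytes:
        # Only files within the limit are sent to the text extraction service.
        record["content"] = extract_text()
    return record
```

The extraction call is passed in as a callable so that oversized files never trigger a request to Tika at all.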
When a file is embedded inside an HTML content block and Tika file indexing is enabled, a search that matches content inside the file causes Moodle to treat the containing object of that text as a block rather than a file. Moodle then throws an exception when it tries to locate a block that may not exist.
This issue was fixed in PR #55, which forces the ID to be correct before display, to avoid exceptions. It is worth investigating what causes the ID to be incorrect in the first place, to avoid other issues of this type in future.
E.g. after set_data_from_engine, just before display, the ID would be set to '123' (the ID of the document of interest) rather than 'html_block-content-9' (the ID of the containing block), which is what it should be set to. This points to an underlying problem in the way the data is indexed when files are embedded inside a block.
Following the discussion with @mattporritt creating this issue.
Recently we got a couple of requests for reports like "Get all resources that use an embedded link". It's really hard to scan all DB tables to get this info and build the required report.
It seems like Elasticsearch could be used for that. One missing bit is the ability to search by regex.
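As a rough illustration of what a regex search could look like, the sketch below builds a request body for Elasticsearch's `regexp` query. The helper name `build_regexp_query` and the choice of the `content` field are assumptions for illustration only. Note that Lucene regular expressions are matched against individual indexed terms and are implicitly anchored, so on an analyzed text field the pattern applies to each token, not to the whole document text.

```python
import json


def build_regexp_query(field: str, pattern: str, size: int = 10) -> dict:
    """Build a search body that matches terms in `field` against `pattern`."""
    return {
        "size": size,
        "query": {"regexp": {field: {"value": pattern}}},
    }


if __name__ == "__main__":
    # Hypothetical example: find documents with a token starting with "embed".
    print(json.dumps(build_regexp_query("content", "embed.*"), indent=2))
```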
When the CLI index task is run and documents are added to the index, the CLI output is wrong.
Regardless of how many documents are added to the index, messages of the following form are received:
No new documents to index for area.
Not sure if this is a bug in this plugin or in core Global Search.
Google Vision seems to have advantages over AWS Rekognition. Add support for indexing images using Vision.
Hi Matt,
I am currently looking into search_elastic and have added the latest version of this plugin to a Moodle 3.2.3+ (Build: 20170622) instance and have hooked this up to a fresh elasticsearch 5.5 instance.
I also started a standalone tika instance running on a separate machine and configured this tika instance in the plugin's settings.
While doing the first indexing with
sudo -u apache /opt/rh/rh-php70/root/usr/bin/php /var/www/html/moodle_dev3/search/cli/indexer.php --force
with fileindexing enabled, the indexer script encountered a fatal error and stopped with this message:
PHP Notice: Undefined variable: client in /var/www/html/moodle_dev3/search/engine/elastic/classes/document.php on line 157
Notice: Undefined variable: client in /var/www/html/moodle_dev3/search/engine/elastic/classes/document.php on line 157
Default exception handler: Error: Call to a member function post() on null Debug:
Error code: generalexceptionmessage
* line 157 of /search/engine/elastic/classes/document.php: Error thrown
* line 286 of /search/engine/elastic/classes/document.php: call to search_elastic\document->extract_text()
* line 352 of /search/engine/elastic/classes/engine.php: call to search_elastic\document->export_file_for_engine()
* line 510 of /search/engine/elastic/classes/engine.php: call to search_elastic\engine->process_document_files()
* line 588 of /search/classes/manager.php: call to search_elastic\engine->add_document()
* line 75 of /search/cli/indexer.php: call to core_search\manager->index()
!!! Error: Call to a member function post() on null !!!
!!
Error code: generalexceptionmessage !!
!! Stack trace: * line 157 of /search/engine/elastic/classes/document.php: Error thrown
* line 286 of /search/engine/elastic/classes/document.php: call to search_elastic\document->extract_text()
* line 352 of /search/engine/elastic/classes/engine.php: call to search_elastic\document->export_file_for_engine()
* line 510 of /search/engine/elastic/classes/engine.php: call to search_elastic\engine->process_document_files()
* line 588 of /search/classes/manager.php: call to search_elastic\engine->add_document()
* line 75 of /search/cli/indexer.php: call to core_search\manager->index()
!!
I traced the problem back to commit 4c32c71 which breaks the connection to tika. Based on the latest code, this patch should solve the problem and clean up the function at the same time:
diff --git a/classes/document.php b/classes/document.php
index 6a574df..50ab93b 100644
--- a/classes/document.php
+++ b/classes/document.php
@@ -148,19 +148,18 @@ class document extends \core_search\document {
*/
private function extract_text($file) {
// TODO: add timeout and retries for tika.
- $config = get_config('search_elastic');
$extractedtext = '';
$port = $this->tikaport;
- $hostname = rtrim($this->tikahostname, "/");
+ $hostname = $this->tikahostname;
$url = $hostname . ':'. $port . '/tika/form';
+ $client = new \curl();
$response = $client->post($url, array('file' => $file));
if ($client->info['http_code'] === 200) {
$extractedtext = $response;
}
return $extractedtext;
-
}
/**
However, I am wondering how this problem could remain undetected as you are running this plugin in production...
Thanks,
Alex
Hi Matt,
I am currently looking into search_elastic and have added the latest version of this plugin to a Moodle 3.2.3+ (Build: 20170622) instance and have hooked this up to a fresh elasticsearch 5.5 instance.
I also started a standalone tika instance running on a separate machine and configured this tika instance in the plugin's settings.
While doing the first indexing with
sudo -u apache /opt/rh/rh-php70/root/usr/bin/php /var/www/html/moodle_dev3/search/cli/indexer.php --force
with fileindexing enabled, the indexer script encountered a fatal error and stopped with this message:
Default exception handler: Die Datei kann nicht gelesen werden. Eventuell existiert sie nicht oder es gibt ein Rechteproblem. Debug: [dataroot]/filedir/c2/40/c24091a092e5afc7310088ce2e416d1d7efcda11
Error code: storedfilecannotread
* line 579 of /lib/filestorage/stored_file.php: file_exception thrown
* line 274 of /search/engine/elastic/classes/document.php: call to stored_file->get_imageinfo()
* line 353 of /search/engine/elastic/classes/engine.php: call to search_elastic\document->export_file_for_engine()
* line 511 of /search/engine/elastic/classes/engine.php: call to search_elastic\engine->process_document_files()
* line 588 of /search/classes/manager.php: call to search_elastic\engine->add_document()
* line 75 of /search/cli/indexer.php: call to core_search\manager->index()
!!! Die Datei kann nicht gelesen werden. Eventuell existiert sie nicht oder es gibt ein Rechteproblem. !!!
!! [dataroot]/filedir/c2/40/c24091a092e5afc7310088ce2e416d1d7efcda11
Error code: storedfilecannotread !!
!! Stack trace: * line 579 of /lib/filestorage/stored_file.php: file_exception thrown
* line 274 of /search/engine/elastic/classes/document.php: call to stored_file->get_imageinfo()
* line 353 of /search/engine/elastic/classes/engine.php: call to search_elastic\document->export_file_for_engine()
* line 511 of /search/engine/elastic/classes/engine.php: call to search_elastic\engine->process_document_files()
* line 588 of /search/classes/manager.php: call to search_elastic\engine->add_document()
* line 75 of /search/cli/indexer.php: call to core_search\manager->index()
!!
The German sentences in the debug message mean that the file couldn't be read from disk. This conclusion is correct, as I ran the indexing on a Moodle test instance which was rsynced from our production system while excluding big files in moodledata (almost always videos) to save storage. Because of this, I am unable to test indexing properly on this test instance.
Would it be possible to check if a file really exists on disk before it is sent to the file indexing backend and ignore it otherwise?
Thanks,
Alex
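A minimal sketch of the requested check, in Python rather than the plugin's PHP: skip any file that is missing or unreadable instead of letting the exception abort the whole indexing run. The helper name `readable_files` is hypothetical; in Moodle the equivalent check would presumably go through the File API and not through raw paths.

```python
import os
from typing import Iterable, Iterator


def readable_files(paths: Iterable[str]) -> Iterator[str]:
    """Yield only the paths that exist as regular files and can be read."""
    for path in paths:
        if os.path.isfile(path) and os.access(path, os.R_OK):
            yield path
        # Missing or unreadable files are silently skipped here; a real
        # implementation would probably log a debug message as well.
```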
Hi,
I am currently looking into search_elastic to run it as an alternative to search_solr, mainly because Elasticsearch seems to be easier to install and run on our RHEL 7 systems.
I have seen that search_elastic requires local_aws. May I ask if local_aws is really necessary for local ElasticSearch instances (i.e. not in AWS)?
If it is really necessary, could you please also publish local_aws to the Moodle plugins repository so that we can install it from there and get update notifications?
If it is not necessary, could you please remove this requirement from version.php and, for example, replace it with a custom check that local_aws is installed before an AWS instance can be configured?
Thanks,
Alex
Once the change in https://tracker.moodle.org/browse/MDL-62869 has landed in core, we'd like to implement the described feature in the Elasticsearch plugin.
Steps to replicate:
Error: Error executing query in search engine: Failed to parse query [*title:news]
Running CLI reindex command
sudo -u www-data php search/cli/indexer.php --force --reindex
I'm getting
Processing Messages - sent area
++ Error retrieving core_message-message_sent 80 document, not all required data is available: Invalid user ++
It may be related to some user having been removed. Not sure if it needs any attention, but I am logging it as part of my testing.
It has been noted that if there is no pre-existing Moodle configuration for Elasticsearch and the default config values fail to resolve to a response, the request instead returns a new \search_elastic\guzzle_exception(), which does not contain a getBody() method, thereby breaking the \search_elastic\engine method validate_index().
Here is a stack trace from an upgrade on a Moodle instance:
Default exception handler: Exception - Call to undefined method search_elastic\guzzle_exception::getBody() Debug:
Error code: generalexceptionmessage
Add webservice and AJAX API support.
This will allow webservices and embedded Ajax functions in Moodle to run search queries and get results. This would allow for custom search interfaces such as chat bots.
There is a very good argument that this functionality should exist in core Global Search; however, in the interest of getting this into the wild quickly, I'm adding it to this plugin first.
Add functionality to get terms that are most popular in the search index.
Useful links:
https://www.elastic.co/guide/en/elasticsearch/reference/2.0/search-aggregations-bucket-terms-aggregation.html
https://stackoverflow.com/questions/27741717/elasticsearch-how-to-get-popular-words-list-of-documents
This will be useful in helping to train AI interfaces.
Also make available via Ajax enabled webservice
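A rough sketch of the terms aggregation approach from the links above. The helper names, the aggregation name `popular_terms` and the `content` field are assumptions for illustration. On Elasticsearch 5 and later, a terms aggregation on a `text` field also needs fielddata enabled (or a keyword sub-field), which this sketch does not set up.

```python
from typing import List, Tuple


def build_popular_terms_query(field: str = "content", top_n: int = 20) -> dict:
    """Build a search body that returns only the most frequent terms of `field`."""
    return {
        "size": 0,  # No search hits are needed, only the aggregation result.
        "aggs": {"popular_terms": {"terms": {"field": field, "size": top_n}}},
    }


def top_terms(response: dict) -> List[Tuple[str, int]]:
    """Extract (term, document count) pairs from a decoded search response."""
    buckets = response.get("aggregations", {}).get("popular_terms", {}).get("buckets", [])
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]
```

The same two helpers could back the Ajax-enabled webservice mentioned above, since the response is already reduced to a simple list of pairs.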
One issue in classes/query.php, line 321:
For example, this code is never executed because it seems that $usercontexts is an object in 3.5 and not an array:
    // Add contexts.
    if (gettype($usercontexts) == 'array') {
        $contexts = $this->construct_contexts($usercontexts);
        array_push($query['query']['bool']['filter']['bool']['must'], $contexts);
    }
Option "Search within enrolled courses only" does not have any effect then.
Another change I had to make to make it work is this (added groupid, line 118):
    $excludedfields = array('itemid', 'areaid', 'courseid', 'contextid', 'userid', 'owneruserid',
        'modified', 'type',
        'groupid' // Added groupid, otherwise it always fails when searching for a string.
    );
To see this error:
title:something
as specified in the help icon documentation:
See the error:
Sep 30 10:13:36 moodle_prod: 2019/09/30 10:13:36 [error]: *1 FastCGI sent in stderr: "PHP message: Default exception handler: Error executing query in Elasticsearch backend. Debug: {"error":{"root_cause":[{"type":"parse_exception","reason":"parse_exception: Encountered \" \":\" \": \"\" at line 1, column 10.\nWas expecting one of:\n <EOF> \n <AND> ...\n <OR> ...\n <NOT> ...\n \"+\" ...\n \"-\" ...\n <BAREOPER> ...\n \"(\" ...\n \"*\" ...\n \"^\" ...\n <QUOTED> ...\n <TERM> ...\n <FUZZY_SLOP> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n \"[\" ...\n \"{\" ...\n <NUMBER> ...\n "}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","g" while reading upstream, server: localhost, request: "GET /search/index.php?q=something%3Aelse&context=11 HTTP/1.1", upstream: "fastcgi://unix:/var/run/php/php7.2-fpm.sock:", host: "localhost", referrer: "localhost/my/"
Refactor plugin to use a subplugin architecture.
The Tika text extraction and AWS Rekognition image recognition features should be subplugins. This would make it easier to manage and create integrations with other services.
For example, to use Google's image recognition service instead of AWS.
Moodle 3.3 introduced a document converter API: https://docs.moodle.org/dev/File_Converters
Need to modify this plugin to use the converter API for converting images and text files ready for indexing.
NOTE: this will probably mean making pre and post Moodle 3.3 branches and maintaining 2 versions of the plugin until Moodle 3.2 is out of support
The original idea was to implement sub plugin support as outlined in issue #19 however the document converter API is a better approach.
This is a critical one.
When the index is initially created everything is fine, i.e. by running: sudo -u www-data php search/cli/indexer.php --force --reindex
However, when the index is updated either by scheduled task or CLI (sudo -u www-data php search/cli/indexer.php) all search results are deleted from the search engine backend.
The README file contains an extensive description of the plugin and related platforms. It feels like it could be simplified, with most of the info moved to the GitHub wiki where it could be better organised.
Track and display top search keywords to users.
A rough start: https://stackoverflow.com/questions/16094112/elasticsearch-popular-keywords
This could also help with an autocomplete feature
We are seeing an edge case where a search term entered in Moodle returns no results to the user. However there are plenty of results being returned by the Elasticsearch backend.
This looks to be an issue with the compile_results method in the engine class. It is still to be 100% confirmed but the root cause seems to be:
The next steps:
ie from /admin/searchareas.php
I first tried to delete the broken area, the HTML block. It said it was deleted and it no longer showed in the table. But the search results were still broken, so I do not believe it did the correct thing on the Elasticsearch side of the fence and it reported a false positive.
When I deleted all indexed content and reset the lot, then it did work as expected.
Just after installing the plugin, when configuring, I am getting a CURL error because it cannot connect to localhost.
Hi,
I am currently looking into search_elastic to run it as an alternative to search_solr, mainly because Elasticsearch seems to be easier to install and run on our RHEL 7 systems.
I have seen that you recommend running Tika for file indexing as a standalone application (see https://github.com/catalyst/moodle-search_elastic#tika-setup). However, there are no rpm packages for Tika out there as far as I see and fiddling with a manually configured service for Tika can be daunting.
On the other hand, you also write that there are Elasticsearch plugins for Tika. I have found https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html and would like to ask:
Thanks,
Alex
http://docs.guzzlephp.org/en/latest/request-options.html#proxy
ie it should honor the stuff found inside:
/admin/settings.php?section=http
Currently in engine::compile_results() items that return with \core_search\manager::ACCESS_DELETED are deleted from the elasticsearch index synchronously. This is a performance hit and doesn't need to be this way.
Instead, let's refactor it to be an adhoc task.
New SDK allows you to have a dynamic list of regions. See catalyst/moodle-local_aws#13
Hi Matt,
I am currently looking into search_elastic and have added the latest version of this plugin to a Moodle 3.2.3+ (Build: 20170622) instance and have hooked this up to a fresh elasticsearch 5.5 instance.
While doing the first indexing with
sudo -u apache /opt/rh/rh-php70/root/usr/bin/php /var/www/html/moodle_dev3/search/cli/indexer.php --force
I saw that there are tons of CLI debug messages for the "Messages - received" and "Messages - sent" area telling me:
++ Error retrieving core_message-message_sent 2111873 document, not all required data is available: Ungültige Nutzer/in ++
* line 59 of /message/classes/search/base_message.php: call to debugging()
* line 61 of /message/classes/search/message_sent.php: call to core_message\search\base_message->get_document()
* line ? of unknownfile: call to core_message\search\message_sent->get_document()
* line 103 of /lib/classes/dml/recordset_walk.php: call to call_user_func()
* line 573 of /search/classes/manager.php: call to core\dml\recordset_walk->current()
* line 75 of /search/cli/indexer.php: call to core_search\manager->index()
("Ungültige Nutzer/in" is the German term for "Invalid user", as I have set $CFG->lang = 'de' in config.php.)
However, after some time and a very long CLI output, the indexing job comes to an end.
I had also quickly set up a Solr instance some weeks ago, and I can't remember indexing the same Moodle instance with Solr throwing these kinds of errors.
The only reason for these problems I can think of is that we are using auth_ldap sync on a regular basis to delete Moodle accounts which have disappeared in LDAP, so there might be messages in the Moodle database which don't have a connected sender or receiver Moodle account anymore.
In the end, I am wondering if these debug messages come from your plugin or from Moodle core and if I should worry about them or not.
Thanks in advance,
Alex
Elasticsearch has a 2GB limit on record size.
We need to add a check in code to not submit records greater than this limit.
This is an edge case that I don't expect to occur in real world use. However, to be safe we should add a check in the code.
If an individual record is over 2GB this check will fail and a debug message will be raised. The document at fault will not be added to the index
$item->index->status >= 300 would be future proof
Add protection from network failure around json_decode and $client->post($docurl, $payload)->getBody.
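The two points above could be combined roughly as follows. This is a Python sketch, not the plugin's PHP; the helper name `failed_items` is hypothetical, and the response shape assumed is the `items` list of an Elasticsearch bulk response, where each entry carries a `status` under the `index` key.

```python
import json
from typing import List, Optional


def failed_items(raw_body: Optional[str]) -> Optional[List[dict]]:
    """Return the bulk items that failed, or None if the body is unusable."""
    try:
        body = json.loads(raw_body)
    except (TypeError, ValueError):
        # Empty, missing or truncated body, e.g. after a network failure.
        return None
    if not isinstance(body, dict):
        return None
    failures = []
    for item in body.get("items", []):
        result = item.get("index", {})
        status = result.get("status")
        # Treat any status of 300 or above (or a missing status) as a failure,
        # instead of testing for individual error codes.
        if status is None or status >= 300:
            failures.append(result)
    return failures
```

Returning `None` for an undecodable body keeps "the request itself failed" separate from "the request worked but some documents were rejected".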
Moodle has introduced time-based partial indexing for search areas. The plugin needs to be updated to support this.
E.g. searching for "dentist lithograph" shows a result, but searching for "lithograph dentist" doesn't find the same result. Furthermore, searching for "lithograph OR dentist" doesn't return anything either.
https://s-cqu-mba.catalyst-au.net/search/?page=0&q=dentist%20lithograph&title&timestart=0&timeend=0
I have run into a situation where a config restored from another environment results in the Moodle upgrade hanging at the search_elastic step as it tries and fails to reach the Elasticsearch server. A configurable timeout would resolve this issue for cases where the Elasticsearch server falls over and/or cannot be reached.
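To illustrate the idea of a bounded wait, here is a small Python sketch of a connection check with a configurable timeout. The function name `can_connect` and the default of 5 seconds are assumptions. In the plugin itself the equivalent would more likely be Guzzle's `connect_timeout` and `timeout` request options, fed from a new plugin setting.

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts alike.
        return False
```

With a check like this the upgrade step could give up after a few seconds and report the problem instead of hanging.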
When an existing activity is edited and the search index updated, the number of returned search results for that activity increases by one. This occurs every time the activity is modified and re-indexed, resulting in an ever-increasing number of results for the same activity.
Review how course name and description are indexed. This might be a core bug or an elastic search plugin bug.
I have a course called "Mathematics 101", and when I search for "math" I don't get any results. I should get the course.
Core bugs that could be related: MDL-59373 and MDL-55303
Currently there is an issue when the document mapping is created for the index. This is causing all document field types to be set to "text" instead of integer, date etc. This is causing date sorting to not work for search results. It is also causing strange result behaviour.
Also to make it worse, this condition happens on a real site, but does not happen for unit tests. The unit tests get the document created correctly.
Real site document mapping
curl -XGET 'http://localhost:9200/moodle2/_mapping?pretty=true'
{
  "moodle2" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "areaid" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "contextid" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "courseid" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "description1" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "id" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "itemid" : {
            "type" : "long"
          },
          "modified" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "owneruserid" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "parentid" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "title" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "type" : {
            "type" : "long"
          },
          "userid" : {
            "type" : "long"
          }
        }
      }
    }
  }
}
Unit test document mapping
curl -XGET 'http://localhost:9200/moodle_test/_mapping?pretty=true'
{
  "moodle_test" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "areaid" : {
            "type" : "keyword"
          },
          "content" : {
            "type" : "text"
          },
          "contextid" : {
            "type" : "integer"
          },
          "courseid" : {
            "type" : "integer"
          },
          "id" : {
            "type" : "keyword"
          },
          "itemid" : {
            "type" : "integer"
          },
          "modified" : {
            "type" : "date",
            "format" : "epoch_second"
          },
          "owneruserid" : {
            "type" : "integer"
          },
          "parentid" : {
            "type" : "keyword"
          },
          "title" : {
            "type" : "text"
          },
          "type" : {
            "type" : "integer"
          }
        }
      }
    }
  }
}
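The real-site output (every string field mapped as text with a keyword sub-field limited to 256 characters) is consistent with Elasticsearch's dynamic mapping guessing the types, which happens when documents reach an index before an explicit mapping exists. The Python sketch below is one way to compare a live mapping against the expected one. The field types are copied from the unit test mapping above; the helper names are hypothetical.

```python
from typing import Dict, List

# Expected field definitions, copied from the unit test mapping above.
FIELD_TYPES: Dict[str, dict] = {
    "areaid": {"type": "keyword"},
    "content": {"type": "text"},
    "contextid": {"type": "integer"},
    "courseid": {"type": "integer"},
    "id": {"type": "keyword"},
    "itemid": {"type": "integer"},
    "modified": {"type": "date", "format": "epoch_second"},
    "owneruserid": {"type": "integer"},
    "parentid": {"type": "keyword"},
    "title": {"type": "text"},
    "type": {"type": "integer"},
}


def build_index_body(doc_type: str = "doc") -> dict:
    """Body to send when creating the index, before any document is added."""
    return {"mappings": {doc_type: {"properties": dict(FIELD_TYPES)}}}


def mismatched_fields(live_properties: Dict[str, dict]) -> List[str]:
    """Return the expected fields whose live type is missing or different."""
    return sorted(
        name for name, spec in FIELD_TYPES.items()
        if live_properties.get(name, {}).get("type") != spec["type"]
    )
```

Running `mismatched_fields` on the `properties` object returned by the `_mapping` call would flag fields such as `modified`, which explains the broken date sorting.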
Add proper support for 3.10
Steps to reproduce:
Expected result:
Actual result:
I assume that this additional index can't be made automatically under the hood and that it basically needs a full re-index of the existing content to also index the existing files. That's why I would just propose to add this fact to the description of the "Enable file indexing" setting in search_elastic.
Using rekognition to extract data from images costs money, see: https://aws.amazon.com/rekognition/pricing/
It would be good to have a report that analyzes the searchable images in a Moodle instance and estimates how much they will cost to index.
This will be useful in determining if we should turn on this feature.
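A minimal sketch of the estimate such a report could produce. The helper names and the shape of the file records are assumptions, and the price is deliberately a parameter: it has to be taken from the pricing page linked above and is not hard-coded here. The default MIME types reflect the image formats Rekognition accepts (JPEG and PNG).

```python
from typing import Iterable, Mapping, Sequence


def count_indexable_images(files: Iterable[Mapping[str, str]],
                           image_mimetypes: Sequence[str] = ("image/jpeg", "image/png")) -> int:
    """Count file records whose MIME type is one the image service can process."""
    return sum(1 for record in files if record.get("mimetype") in image_mimetypes)


def estimate_indexing_cost(image_count: int, price_per_1000_images: float) -> float:
    """Estimate the one-off cost of sending every image for analysis once."""
    return image_count / 1000 * price_per_1000_images
```

For example, 2500 images at a hypothetical price of 1.00 per 1000 images would give an estimate of 2.5 in the same currency.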