ibm-watson-data-lab / simple-search-service Goto Github PK

A faceted search engine and content API.

JavaScript 65.08% CSS 13.23% HTML 21.69%

cloudant search csv ibm-cloud

simple-search-service's Introduction

Overview: Simple Search Service

Simple Search Service is an IBM Cloud app that lets you quickly create a faceted search engine, exposing an API you can use to bring search into your own apps. The service also creates a website that lets you preview the API and test it against your own data as well as manage your data via a simple CMS.

Once deployed, use the browser to upload CSV or TSV data. Specify the fields to facet, and the service handles the rest.

How it works

The application uses these Bluemix services:

a Node.js runtime
a Cloudant database

Once the data is uploaded, you can use the UI to browse and manage your data via the integrated CMS. Additionally, a CORS-enabled API endpoint is available at <your domain name>/search. The endpoint takes advantage of Cloudant's built-in integration for Lucene full-text indexing. Here's what you get:

fielded search - ?q=colour:black+AND+brand:fender
free-text search - ?q=black+fender+strat
pagination - ?q=black+fender+strat&bookmark=<xxx>
faceting
sorting - ?sort=color or ?sort=-color for descending

You can use this along with the rest of the API to integrate the Simple Search Service into your apps. For a full API reference, click here.

While this app is a demo to showcase how easily you can build an app on Bluemix using Node.js and Cloudant, it also provides a mature search API that scales with the addition of multiple Simple Search Service nodes. In fact, a similar architecture powers the search experience in the Bluemix services catalog.

A more detailed walkthrough of using Simple Search Service is available here.

Architecture Diagram

Running the app on IBM Cloud

The fastest way to deploy this application to Bluemix is to click the Deploy to IBM Cloud button below.

Don't have a IBM Cloud account? If you haven't already, you'll be prompted to sign up for an IBM Cloud account when you click the button. Sign up, verify your email address, then return here and click the the Deploy to IBM Cloud button again. Your new credentials let you deploy to the platform and also to code online with Bluemix and Git. If you have questions about working in Bluemix, find answers in the IBM Cloud Docs.

Manual deployment to IBM Cloud

Manual deployment to IBM Cloud requires git and the Cloud Foundry CLI

$ git clone https://github.com/ibm-watson-data-lab/simple-search-service.git
$ cf create-service cloudantNoSQLDB Lite simple-search-service-cloudant-service  
$ cd simple-search-service
$ cf push

Running the app locally

Clone this repository then run npm install to add the Node.js libraries required to run the app.

Then create some environment variables that contain your Cloudant URL.

# Cloudant URL
export SSS_CLOUDANT_URL='https://<USERNAME>:<PASSWORD>@<HOSTNAME>'

replacing the USERNAME, PASSWORD and HOSTNAME placeholders for your own Cloudant account's details.

Then run:

node app.js

Service Registry

The Simple Search Service utilises Etcd to discover and utilise some of our other Simple Services to extend and improve the service.

Other services that are available to the Simple Search Service are:

The Simple Autocomplete Service - Add auto completion to the CMS
The Simple Caching Service - Enable caching of popular searches
Metrics Collector Microservice - Enable logging of searches

Enabling the Service Registry

Enabling the Service Registry requires setting an environment variable, ETCD_URL. This should be the URL of your Etcd instance including any basic HTTP authentication information

export ETCD_URL='http://username:[email protected]'

If the Service Registry is enabled, any discovered services will be displayed on the Services page, with a toggle to enable or disable these services.

Once enabled these services will automatically be integrated into the Simple Search Service.

Lockdown mode

If you have uploaded your content into the Simple Search Service but now want only the /search endpoint to be available publicly, you can enable "Lockdown mode".

Simply set an environment variable called LOCKDOWN to true before running the Simple Search Service:

export LOCKDOWN=true
node app.js

or set a custom environment variable in Bluemix.

When lockdown mode is detected, all web requests will be get a 401 Unauthorised response, except for the /search endpoint which will continue to work. This prevents your data being modified until lockdown mode is switched off again, by removing the environment variable.

If you wish to get access to the Simple Search Service whilst in lockdown mode, you can enable basic HTTP authentication by setting two more environment variables:

SSS_LOCKDOWN_USERNAME
SSS_LOCKDOWN_PASSWORD

When these are set, you are able to bypass lockdown mode by providing a matching username and password. If you access the UI, your browser will prompt you for these details. If you want to access the API you can provide the username and password as part of your request:

curl -X GET 'http://<yourdomain>/row/4dac2df712704b397f1b64a1c8e25033' --user <username>:<password>

API Reference

The Simple Search Service has an API that allows you to manage your data outside of the provided UI. Use this to integrate the SImple Search Service with your applications.

Search

Search is provided by the GET /search endpoint.

Fielded Search

Search on any of the indexed fields in your dataset using fielded search.

# Return any docs where colour=black
GET /search?q=colour:black

Fielded search uses Cloudant Search.

Free-text Search

Search across all fields in your dataset using free-text search.

# Return any docs 'black' is mentioned
GET /search?q=black

Pagination

Get the next page of results using the bookmark parameter. This is provided in all results from the /search endpoint (see example responses below). Pass this in to the next search (with the same query parameters) to return the next set of results.

# Return the next set of docs where 'black' is mentioned
GET /search?q=black&bookmark=<...>

It is possible to alter the amount of results returned using the limit parameter.

# Return the next set of docs where 'black' is mentioned, 10 at a time
GET /search?q=black&bookmark=<...>&limit=10

Example Response

All searches will respond in the same way.

{
  "total_rows": 19, // The total number of rows in the dataset
  "bookmark": "g1AAAA...JjFkA0kLVvg", // bookmark, for pagination
  "rows": [  // the rows returned in this response
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... },
    { ... }
  ],
  "counts": { // counts of the fields which were selected as facets during import
    "type": {
      "Black": 19
    }
  },
  "_ts": 1467108849821
}

Get a specific row

A specific row can be returned using it's unique ID, found in the _id field of each row. This is done by using the GET /row/:id endpoint.

GET /row/44d2a49201625252a51d252824932580

This will return the JSON representation of this specific row.

Add a new row

New data can be added a row at a time using the POST /row endpoint.

Call this endpoint passing in key/value pairs that match the fields in the existing data. There are NO required fields, and all field types will be enforced. The request will fail if any fields are passed in that do not already exist in the dataset.

POST /row -d'field_1=value_1&field_n=value_n'

The _id of the new row will be auto generated and returned in the id field of the response.

{
	"ok":true,
	"id":"22a747412adab2882be7e38a1393f4f2",
	"rev":"1-8a23bfa9ee2c88f2ae8dd071d2cafd56"
}

Update an existing row

Exiting data can be updated using the PUT /row/:id endpoint.

Call this endpoint passing in key/value pairs that match the fields in the existing data - you must also include the _id parameter in the key/value pairs. There are NO required fields, and all field types will be enforced. The request will fail if any fields are passed in that do not already exist in the dataset.

Note: Any fields which are not provided at the time of an update will be removed. Even if a field is not changing, it must always be provided to preserve its value.

The response is similar to that of adding a row, although note that the revision number of the document has increased.

{
	"ok":true,
	"id":"22a747412adab2882be7e38a1393f4f2",
	"rev":"2-6281e0a21ed461659dba6a96d3931ccf"
}

Deleting a row

A specific row can be deleting using it's unique ID, found in the _id field of each row. This is done by using the DELETE /row/:id endpoint.

DELETE /row/44d2a49201625252a51d252824932580

The response is similar to that of editing a row, although again note that the revision number of the document increased once more.

{
	"ok":true,
	"id":"22a747412adab2882be7e38a1393f4f2",
	"rev":"3-37b4f5c715916bf8f90ed997d57dc437"
}

Initializing the index

To programatically delete all data and initialize the index

POST /initialize

including the schema property in the payload defining the following structure

{ "fields": [
    {
      "name": "id",
      "type": "string",
      "example": "example_id",
      "facet": true
    },
    {
      "name": "score",
      "type": "number",
      "example": 8,
      "facet": false
    },
    {
      "name": "tags",
      "type": "arrayofstrings",
      "example": "example_tag_1,example_tag_2",
      "facet": true
    }
  ]
}

> This example defines a schema containing three fields of which two will be enabled for faceted search.

Valid values:

Property name: any string
Property type: number, boolean, string, arrayofstrings (e.g. val1,val2,val3)
Property example: any valid value for this type
Property facet: true or false

Privacy Notice

Refer to https://github.com/IBM/metrics-collector-client-node#privacy-notice

Disabling Deployment Tracking

For manual deploys, deployment tracking can be disabled by removing require("metrics-tracker-client").track(); from the end of the app.js main server file.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

simple-search-service's People

Contributors

Stargazers

Watchers

simple-search-service's Issues

Add information to the "upload data" page indicating how a new data set can be uploaded

After data has been uploaded there's a continue button on the Get Data, Upload Data and Create Index pages. There's no help text on any of these pages outlining what one would have to do to load another data set. One basically has to open every SSS page to find the answer (Settings > Delete Data).

CSV files with field names that contain spaces

So imagine we have a data like this

firstname,lastname,home address
glynn,bird,10 front street

the firstname and lastname fields are ok, but the home address field causes us problems.

We try and fudge it in the front end (https://github.com/ibm-cds-labs/simple-search-service/blob/master/public/js/seams.js#L153) to replace spaces with underscores. But the back end needs to do the same because Cloudant doesn't like names of indexes to have spaces in either.

So I tried this change at this line https://github.com/ibm-cds-labs/simple-search-service/blob/master/lib/schema.js#L65:

  for (var i in schema.fields) {
    var f = schema.fields[i];
    var nicename = f.name.replace(/ /g,"_");
    if (f.name != "_id") {
      func += '  indy("' + nicename + '", doc["'+ f.name + '"], ' + f.facet + ');\n'; 
    }     
  }

to do the same fudge at the back end. But the this only fudges the index name.

So then I tried removing the front-end fudge so that it leaves the field name with a space in it, but then I get JS errors at the front-end:

jquery.js:1496 Uncaught Error: Syntax error, unrecognized expression: select[name=Business Unit]

because selectors with spaces in freak out too.

In summary

on the server side, in schema.js, we are going to need to fudge the index name so that it doesn't have spaces in, otherwise Cloudant freaks out
but we probably want the actual JSON data to retain the field names that were in the file, spaces and all
and we want to make sure the front end can handle that

Add support for Travis CI

Use jshint and and jscs tasks to verify that code changes are clean.

search returns no results (or no error/warning) when performing faceted search on a stop word

the search index that gets created is using the standard analyzer. this standard analyzer has the following stop words: "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"

stop words do not get indexed so when trying to perform a faceted search with one of these words no results are returned and there are no messages (error or warning). as a result the user may think the search failed or there is an issue with the search service.

manifest.yml: change Cloudant plan from Shared to Lite

The Shared plan is no longer available. D2BM will fail without this change.

Force at least one field to be faceted

We shouldn't let users pass by the "Create index" screen without selecting at least one facet. They can decide later to not use the facets, but this demo is about the power of faceting -- everything that comes after that step depends on it.

Array element[0] returns no results with SEaMS + Compose.io Redis

Super weird, but if you go to my demo app using the optional Redis service at http://seams-broberg-1040.mybluemix.net/admin/search you'll see all the nice facets. Try searching for genre:'C' and it works beautifully. So do all the other elements except for genre:'A'

Pretty sure this is related to the Redis cache option, because this was working perfectly yesterday with the default IBM cache. Thanks.

Dataset at https://www.dropbox.com/s/4fp7v7ikndtbc1p/movieSEaMS_001.tsv?dl=0

Preview Content API tab: Example with Dynamic content instructions appear to be "incomplete"

The following text is displayed:

Example with Dynamic Content
Here, in this example, let's pretend the content in this content area is all about when the is/are . Good news, you indexed your data with as a facet, so we can use the Simple Search Service to retrieve exactly the results you want, and then use Javascript to write the JSON returned by the API into the page.

Cache isn't cleared when data is deleted

Ensure cache is cleared when the "Delete" button is pressed in the admin.

the `limit` setting needs to find its way into the /search app

... it shouldn't have to be set on API calls.

So, the app needs to change, and requests out to /search should no longer be appended with the limit query param.

Readme instructions should contain information on how to manually create the cloudant service in BM

full API reference isn't online

this is linked to from the README but 404s https://github.com/ibm-cds-labs/simple-search-service/blob/master/API%20Reference.md

Preview Content API link in main nav should be disabled before data is loaded

Should follow the same pattern as Create Index and Preview Search.

TSV quote-crash

Rasied by @mikebroberg (transferred from another repo)

I had values in a TSV file that I wanted to upload to SEaMS as type "arrayofstrings". I edited my TSV in Excel to do some stupid sorting and didn't realize that Excel had wrapped some of the values in double-quotes.

When attempting to upload the TSV with double-quotes to SEaMS, the first step, "1 of 3 - Upload data," would read 100 percent, but the UI would not go on to the second step, "2 of 3 - Schema." Trying to return to the main page for my SEaMS app (http://seams-broberg-152.mybluemix.net/admin/home) would fail. That's because the app crashed in Bluemix.

I didn't check this functionality in a CSV because the TSV better handled my commas, but if there is some scenario where this would happen with a CSV, Brad suggested checking that out too. Worth fixing for the TSV though, if people have a field they want to upload as an array so they can facet on its elements. Thanks!

Error prompt when users try and upload an invalid file

If a user tries to upload a CSV or TSV that doesn't include a header row, we should alert them. (Without that header row, we can't present a coherent "Create Index" screen, among other issues.)

Add "Redis by Compose" service name requirement to the README

I'm not sure how you want to phrase it, but without finding and parsing the code, I didn't realize I had to exactly match the name "Redis by Compose" in my Bluemix deployment since Bluemix usually appends something like "-xy" to the name when binding a service.

create a user object that includes the VCAP object

With this, we can expose to users access to their Bluemix, Cloudant (and eventually, Redis) dashboards.

Provide API endpoint that can be used to initialize the SSS index without a prior data upload

In certain scenarios it would be desirable to have the ability to programmatically remove all data from the index and define a custom schema. The existing /import API endpoint requires a prior data upload.

Allow for use of a shared Cloudant instance among multiple search services

We have several instances of the SSS deployed. Each application operates on its own Cloudant instance. It would be nice if all instances could use a dedicated repository database in a single instance to reduce the number of Bluemix resources.

service fails silently when CSV doesn't have a header value

https://www.dropbox.com/s/06dz37k1ep1zbfa/Screenshot%202016-07-01%2014.57.16.png?dl=0

Allow overriding the Cloudant URL when deployed to Bluemix

Sometimes it's useful to specify a Cloudant URL without binding to a Cloudant service directly (e.g. if you want to use a proxy).

This should be possible by specifying a URL using SSS_CLOUDANT_URL but SSS always looks for a Cloudant service in VCAP_SERVICES if it's present, preventing SSS_CLOUDANT_URL from being used in Bluemix deployments.

trouble uploading data

When a user is uploading data, we show them the preview of the data that's about to get indexed, but we don't allow them to "go back" and upload again. Poses a problem when, for example, a user is trying to upload/index a CSV with a poorly formatted header row.

Requires a hard refresh of the app to resolve.

it is borked

Using local config for Cloudant
ERR: Invalid or unexpected value passed to Simple Service Registration module
body-parser deprecated bodyParser: use individual json/urlencoded middlewares app.js:67:40
server starting on http://localhost:6001
(node:6521) DeprecationWarning: Using Buffer without new will soon stop working. Use new Buffer(), or preferably Buffer.from(), Buffer.allocUnsafe() or Buffer.alloc() instead.
_http_server.js:193
throw new RangeError(Invalid status code: ${statusCode});

schema.js/load masks schema load errors

Trying to insert/update rows using the /rows endpoints, SSS returned

"statusCode":404,"body":"{\"error\":\"COL1 is not a valid parameter,COL2 is not a valid parameter,COL3 is not a valid parameter,COL4 is not a valid parameter\",\"reason\":\"Validation failure\"

Debugging the issue I noticed that an error occurred loading the schema. However, that error is not propagated and an empty default schema is returned ...

seamsdb.get("schema", function(err, data) {
    if (err) {
      return callback(null, defaultSchema);
    }
    callback(err, data);
  });

... causing validation failures because COL1, COL2 etc are unknown.

Testing integration with Slack

This is the comment.

Typo in Search Tips

missing ' character in search tips (see screen shot). But query works anyway.

Lockdown mode

If an environment variable LOCKDOWN is present and set to "true", then UI should not allow data to be deleted or added. SSS would just behave as a search API.

Set content-type header to application/json when returning search results

Currently text/plain is returned.

Preview Search displays an empty string if column data is null

It appears that if a data value is null, an empty string is displayed in the rows preview. I loaded the movies sample data set and previewed the search. The first 20 rows did not display any data under the rating and earnings_rank headings, giving the appearance that there's a bug. Only after I inspected the raw data I realized that the sample rows shown contained null values. It might be good if a place holder string could be displayed instead of the empty string to avoid confusion.

Disallow the “automagical content discovery” on array_of_strings facets

We still need to be able facet on array_of_strings facets. It's just that the Content API (api.html) view doesn't present content correctly when the “automagical content discovery” happens on array_of_strings facets.

Can't facet on number datatype

Raised by @mikebroberg and transferred from another repo:

Whenever I try to facet on a number value, I can't render any search results. This happened both with CSVs and TSVs.

Make it easier to upload larger datasets (by url or compressed files)

For larger datasets, uploading can be a pain. I uploaded a 800MB file for 2-3 hours over a poor connection outside of Berlin. I realize my use case is narrow, but I think these proposed enhancements could help others:

allow users to provide a URL that SSS could fetch from
allow SSS to detech .gz files and uncompress them.

In my case, I wound up setting up a remote ubuntu desktop box that I uploaded a gzipped file (91MB), then uncompressed to upload in the SSS GUI.

Good luck and thanks for everyone's hard work so far!

P.S. I realized that @bradnoble and I used to work together back at Mullen in 2004 or 2005. I was a 23yo account guy back then and would be surprised if you remembered me. Glad to see you're doing well. :)

Deployment Tracker Service is discontinued

Please refer to the following document for details and migration instructions https://github.com/IBM-Bluemix/cf-deployment-tracker-client-node/wiki/migrating-to-the-new-metrics-tracker-client.

Support for NULL shorthand (\N)

Raised by @mikebroberg from another repo:

When you write data to file from MySQL, it uses a shorthand for NULL values in the resulting CSV or TSV: \N See https://dev.mysql.com/doc/refman/5.0/en/null-values.html for more.

It would be cool if SEaMS could support this. I guess you wouldn't want to have the option to import as JavaScript's null datatype, but maybe have it in there as a string? Dunno.

instructions unclear re: sample data set download

I just completed the Simple Search Service tutorial and it was actually pretty easy - I did it front to end in under 10 minutes. The only wall I ran into was when downloading the sample movie data set from github - the link took me straight to that movie TSV file code rather than the master open data folder where I could actually have the option to download it. I understand why you did that - so that you can see exactly which data you're supposed to use and upload to the service - but it's not made very clear in the directions that you then need to go to the master directory and download the whole zip. Unless there is a way around this, which I didn't see.

ibm-watson-data-lab / simple-search-service Goto Github PK

simple-search-service's Introduction

Overview: Simple Search Service

How it works

Architecture Diagram

Running the app on IBM Cloud

Manual deployment to IBM Cloud

Running the app locally

Service Registry

Enabling the Service Registry

Lockdown mode

API Reference

Search

Fielded Search

Free-text Search

Pagination

Example Response

Get a specific row

Add a new row

Update an existing row

Deleting a row

Initializing the index

Privacy Notice

Disabling Deployment Tracking

License

simple-search-service's People

Contributors

Stargazers

Watchers

Forkers

simple-search-service's Issues

Recommend Projects

Recommend Topics

Recommend Org