Elasticsearch - introduction


For the past couple of months, I've been developing and integrating full-text search in one of our projects (alongside Node.js and MongoDB). In this series, I would like to share how that development went and showcase some interesting points I struggled with.

This part is an introduction to Elasticsearch.

So, what is Elasticsearch?

Quoting Wikipedia:

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

...

Elasticsearch is developed alongside a data collection and log-parsing engine called Logstash, an analytics and visualisation platform called Kibana, and Beats, a collection of lightweight data shippers. The four products are designed for use as an integrated solution, referred to as the "Elastic Stack" (formerly the "ELK stack")

-- Elasticsearch. (2021, May 16). From Wikipedia. https://en.wikipedia.org/wiki/Elasticsearch

What does this mean for us as developers?

First of all, we don't have to build a magic black box or write ugly regex queries against our existing database. Elasticsearch handles the complex search logic; our job is only to implement mappers and queries. And since we are using Node.js, it is convenient that results come back as JSON.

Secondly, Elasticsearch is great for real-time data processing and is widely used for log storage. Many projects already use Elasticsearch in some form, and many of you are probably at least a little familiar with it.

Local installation

Elasticsearch

There are many ways to install Elasticsearch for local testing and development. In this series, we will use Docker. If you aren't familiar with Docker, the official Docker tutorial is a good place to start.

First of all, we need to obtain the official Elasticsearch Docker image by running the command below. The image will be stored in your local Docker image cache.

> docker pull docker.elastic.co/elasticsearch/elasticsearch:7.12.0

Now, let's start the container.

> docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name local-elasticsearch docker.elastic.co/elasticsearch/elasticsearch:7.12.0

As you can see, we named the container local-elasticsearch. The name makes the container easy to identify and lets us reuse the same container (with the same stored data) after we stop it.

To stop the container:

> docker stop local-elasticsearch

To start it again:

> docker start local-elasticsearch

Now we have Elasticsearch up and running.
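
A quick way to confirm that the node is reachable is to request its root endpoint, which returns basic cluster information as JSON. Here is a minimal Python sketch using only the standard library. It assumes the port mapping from the command above (http://localhost:9200) and no security enabled; the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Elasticsearch is published on localhost:9200,
# as in the "docker run" command above.
ES_URL = "http://localhost:9200"


def fetch_cluster_info(base_url=ES_URL, timeout=5):
    """GET the root endpoint and return the parsed JSON body."""
    with urllib.request.urlopen(base_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def version_number(info):
    """Extract the version string (e.g. "7.12.0") from the root response."""
    return info["version"]["number"]


def check_version(info, expected="7.12.0"):
    """Return True when the node reports the version we pulled."""
    return version_number(info) == expected
```

Once the container has finished starting, check_version(fetch_cluster_info()) should return True. A connection error usually just means the node is still booting.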

Kibana (optional, but recommended)

Elasticsearch by itself doesn't come with any user-friendly utility to insert or query data. We would have to use raw GET / POST requests. But there is a better way to query Elasticsearch for local development.

Let's pull the Kibana Docker image. Warning: always use the same version of Kibana as of Elasticsearch.

> docker pull docker.elastic.co/kibana/kibana:7.12.0

And run it, linked to our Elasticsearch container.

> docker run -d --link local-elasticsearch:elasticsearch -p 5601:5601 --name local-kibana docker.elastic.co/kibana/kibana:7.12.0

Now we have both Elasticsearch and Kibana running. A console with syntax highlighting and auto-indentation for querying Elasticsearch is available at http://localhost:5601/ under Management -> Dev Tools.

Inserting documents

For the purpose of this series, let's think of a grocery store. The store holds information about the products it sells. Each product is defined by name, price, weightInGrams, description and origin.

Documents live in an index, so let's create one first. (By default Elasticsearch would also create the index automatically on the first insert, but creating it explicitly keeps things clear.)

An Elasticsearch index can be compared to a database collection. It holds similar documents together with custom settings and mappings. Indexes and their mappings will be explained in more detail in future parts of the series.

Let's create an empty index.

PUT /products

And insert documents.

POST /products/_doc
{
  "name": "Banana",
  "price": 1.50,
  "weightInGrams": 1000.0,
  "description": "Yellow banana fruit. Great healthy snack. Lot's of fiber.",
  "origin": "Colombia, South America"
}

POST /products/_doc
{
  "name": "Fried chips",
  "price": 0.99,
  "weightInGrams": 100.0,
  "description": "Fried salted potato chips. Great Netflix snack.",
  "origin": "Czech Republic, Europe"
}

POST /products/_doc
{
  "name": "Spaghetti",
  "price": 1.40,
  "weightInGrams": 250.0,
  "description": "Wheat Italian Spaghetti. Some recipes you can cook: Bolognese, Carbonara...",
  "origin": "Italy, Europe"
}

POST /products/_doc
{
  "name": "Ground beef",
  "price": 6.29,
  "weightInGrams": 600.0,
  "description": "Nicely ground beef. Great for Burgers or Bolognese.",
  "origin": "Argentina, South America"
}

POST /products/_doc
{
  "name": "Eggs",
  "price": 2.20,
  "weightInGrams": 200.0,
  "description": "White eggs. Good for maintaining healthy diet. You can cook Carbonara.",
  "origin": "Czech Republic, Europe"
}

POST /products/_doc
{
  "name": "Pilsner beer",
  "price": 1.10,
  "weightInGrams": 500.0,
  "description": "Alcoholic beverage. Should be pronounced golden miracle. Will boost your Netflix experience.",
  "origin": "Czech Republic, Europe"
}

To make sure everything went ok, we can query all of them.

GET /products/_search
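
The separate insert requests above can also be sent as one call to the _bulk API, which takes newline-delimited JSON: an action line, then the document source, with a newline after the last line. Here is a hedged Python sketch that only builds the request body. The function name is made up, and only two of the products are listed for brevity.

```python
import json

# Two of the sample products from above; the rest follow the same shape.
PRODUCTS = [
    {
        "name": "Banana",
        "price": 1.50,
        "weightInGrams": 1000.0,
        "description": "Yellow banana fruit. Great healthy snack. Lot's of fiber.",
        "origin": "Colombia, South America",
    },
    {
        "name": "Fried chips",
        "price": 0.99,
        "weightInGrams": 100.0,
        "description": "Fried salted potato chips. Great Netflix snack.",
        "origin": "Czech Republic, Europe",
    },
]


def build_bulk_body(docs):
    """Return an NDJSON body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        # Empty "index" action: the target index comes from the URL
        # and Elasticsearch generates the document id.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(PRODUCTS)
print(body)
```

The printed body can be pasted under POST /products/_bulk in Dev Tools, or sent over HTTP with the header Content-Type: application/x-ndjson.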

Querying data

First of all, let me introduce how Elasticsearch matches documents. Under the hood, Elasticsearch only processes exact matches. If we want anything beyond an exact match between the input and a document field, we have to preprocess (run an analyzer on) the input and possibly the document field as well. How this is done will be explained in a future part of the series.

term query: looks for an exact match between the input value and the document field (no preprocessing is done).

Avoid using it on text fields: Elasticsearch preprocesses text fields by default, so an exact match against the original string will often fail. In our example we did not define any mapping for the index, so name, description and origin were mapped automatically as text fields when they were first inserted as strings (by default, dynamic mapping also adds a keyword sub-field such as name.keyword that holds the unprocessed value).

GET /products/_search
{
  "query": {
    "term": {
      "price": {
        "value": 1.10
      }
    }
  }
}

returns

{
  ...
  "hits" : {
   ...
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "W7-IdHkBzZBWrmXj_BtZ",
        "_score" : 1.0,
        "_source" : {
          "name" : "Pilsner beer",
          "price" : 1.1,
          "weightInGrams" : 500.0,
          "description" : "Alcoholic beverage. Should be pronounced golden miracle. Will boost your Netflix experience.",
          "origin" : "Czech Republic, Europe"
        }
      }
    ]
  }
}

To understand preprocessing and scoring, I first have to introduce the bool query.

bool query: matches documents using a combination of boolean clauses and computes a score for each matching document. This is the backbone of full-text search: a query can have many results, and we want the most relevant (best scored) ones returned first.

There are four clause types:

  • must: the clause must match; it contributes to the score
  • filter: the clause must match; it does not contribute to the score
  • should: the clause should match; it contributes to the score when it does
  • must_not: the clause must not match; it does not contribute to the score
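
As an illustration, all four clause types can be combined in a single request body. The Python sketch below builds one for the products index. The conditions chosen here are arbitrary examples, and the function name is made up.

```python
import json


def build_bool_query(text, max_price, preferred_origin, excluded_text):
    """Build a search body that uses every bool clause type once."""
    return {
        "query": {
            "bool": {
                # Must match, and influences the score.
                "must": [{"match": {"description": text}}],
                # Must match, but is not scored.
                "filter": [{"range": {"price": {"lte": max_price}}}],
                # Optional here; raises the score when it matches.
                "should": [{"match": {"origin": preferred_origin}}],
                # Excludes any document that matches.
                "must_not": [{"match": {"description": excluded_text}}],
            }
        }
    }


body = build_bool_query("snack", 1.50, "Europe", "healthy")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of GET /products/_search. With the six sample products, only the fried chips should come back: the banana also mentions a snack, but its description contains "healthy" and is excluded by must_not.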

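match query: matches documents after text preprocessing (analysis). Under the hood, it overcomes Elasticsearch's exact-match nature by analyzing the input into tokens and running an exact match for each token, combined in a bool query. By default the tokens are combined with OR, so any one of them may match (like should clauses); setting operator to and requires all of them. If no analyzer is specified, the standard analyzer is applied to both the input value and the document field.

For example, the query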
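
The difference between term and match can be illustrated without Elasticsearch at all. The Python sketch below builds a toy inverted index over three product names. Here analyze is only a rough stand-in for the standard analyzer (lowercase, then split on anything that is not a letter, digit or apostrophe), and all function names are illustrative.

```python
import re

NAMES = {1: "Banana", 2: "Fried chips", 3: "Pilsner beer"}


def analyze(text):
    """Rough approximation of the standard analyzer (an assumption, not the real one)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


INDEX = build_index(NAMES)


def term_lookup(value):
    """Like a term query: the input is looked up as-is, with no analysis."""
    return INDEX.get(value, set())


def match_lookup(value, operator="or"):
    """Like a match query: analyze the input, then combine exact token lookups."""
    hits = [INDEX.get(token, set()) for token in analyze(value)]
    if not hits:
        return set()
    if operator == "and":
        return set.intersection(*hits)
    return set.union(*hits)
```

Here term_lookup("Banana") finds nothing, because only the lowercased token banana was indexed, while match_lookup("Banana") finds document 1. With two input words, the default "or" returns documents containing either token, and "and" only those containing both.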

GET /products/_search
{
  "query": {
    "match": {
      "description": {
        "query": "netflix"
      }
    }
  }
}

gets us

{
  ...
  "hits" : {
    ...
    "max_score" : 1.1538364,
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "V7-IdHkBzZBWrmXj4Rsf",
        "_score" : 1.1538364,
        "_source" : {
          "name" : "Fried chips",
          "price" : 0.99,
          "weightInGrams" : 100.0,
          "description" : "Fried salted potato chips. Great Netflix snack.",
          "origin" : "Czech Republic, Europe"
        }
      },
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "W7-IdHkBzZBWrmXj_BtZ",
        "_score" : 0.9295486,
        "_source" : {
          "name" : "Pilsner beer",
          "price" : 1.1,
          "weightInGrams" : 500.0,
          "description" : "Alcoholic beverage. Should be pronounced golden miracle. Will boost your Netflix experience.",
          "origin" : "Czech Republic, Europe"
        }
      }
    ]
  }
}

Notice how the score was computed. The chips are considered more relevant than the beer, because the chips description has fewer words than the beer description.

Let's inspect that. We can see what the standard analyzer produces for each description using the _analyze requests below.

// chips
POST _analyze
{
  "analyzer": "standard",
  "text": "Fried salted potato chips. Great Netflix snack."
}

// beer
POST _analyze
{
  "analyzer": "standard",
  "text": "Alcoholic beverage. Should be pronounced golden miracle. Will boost your Netflix experience."
}

Tokens returned:

// chips
{
  "tokens" : [
    { "token" : "fried", ... }, { "token" : "salted", ... }, { "token" : "potato", ... }, 
    { "token" : "chips", ... }, { "token" : "great", ... }, { "token" : "netflix", ... }, 
    { "token" : "snack", ... }
  ]
}

// beer
{
  "tokens" : [
    { "token" : "alcoholic", ... }, { "token" : "beverage", ... }, { "token" : "should", ... }, 
    { "token" : "be", ... }, { "token" : "pronounced", ... }, { "token" : "golden", ... }, 
    { "token" : "miracle", ... }, { "token" : "will", ... }, { "token" : "boost", ... },
    { "token" : "your", ... }, { "token" : "netflix", ... }, { "token" : "experience", ... }
  ]
}

As we can see, the standard analyzer lowercases the text and splits it into tokens on word boundaries, dropping the punctuation.

These tokens are then matched against the "netflix" token produced from our input. Both descriptions contain "netflix" exactly once, but the chips description is shorter (7 tokens against 12 for the beer). The default scoring algorithm (BM25) favours a match in a shorter field, so the chips are returned as more relevant.
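
The two scores in the response can be reproduced by hand. Elasticsearch 7.x scores text matches with BM25 by default. The Python sketch below assumes the default parameters k1 = 1.2 and b = 0.75, that all six products sit in a single shard with no deleted documents, and that analyze (a rough stand-in for the standard analyzer) yields the same token counts as Elasticsearch. The formula is written the way I understand Elasticsearch 7.x to apply it, so treat it as an approximation rather than a reference implementation.

```python
import math
import re

# Descriptions of the six sample products, copied from the inserts above.
DESCRIPTIONS = {
    "Banana": "Yellow banana fruit. Great healthy snack. Lot's of fiber.",
    "Fried chips": "Fried salted potato chips. Great Netflix snack.",
    "Spaghetti": "Wheat Italian Spaghetti. Some recipes you can cook: Bolognese, Carbonara...",
    "Ground beef": "Nicely ground beef. Great for Burgers or Bolognese.",
    "Eggs": "White eggs. Good for maintaining healthy diet. You can cook Carbonara.",
    "Pilsner beer": "Alcoholic beverage. Should be pronounced golden miracle. "
                    "Will boost your Netflix experience.",
}


def analyze(text):
    """Rough approximation of the standard analyzer (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bm25_scores(term, docs, k1=1.2, b=0.75):
    """Score every document containing `term` with a BM25 formula.

    Assumption: idf = ln(1 + (N - n + 0.5) / (n + 0.5)) and a (k1 + 1)
    factor in the numerator, as I believe Elasticsearch 7.x computes it.
    """
    tokens = {name: analyze(text) for name, text in docs.items()}
    doc_count = len(tokens)
    avg_length = sum(len(t) for t in tokens.values()) / doc_count
    docs_with_term = sum(1 for t in tokens.values() if term in t)
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))

    scores = {}
    for name, doc_tokens in tokens.items():
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        # Length normalisation: longer fields get a larger penalty.
        norm = k1 * (1 - b + b * len(doc_tokens) / avg_length)
        scores[name] = idf * (k1 + 1) * freq / (freq + norm)
    return scores


scores = bm25_scores("netflix", DESCRIPTIONS)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 4))
```

Under these assumptions the result is roughly 1.1538 for the chips and 0.9295 for the beer, in line with the _score values returned above. Adding "explain": true to the search request makes Elasticsearch print its own breakdown of the calculation.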

Conclusion

In this part, we learned how to set up Elasticsearch with Kibana, insert documents and search them, and we covered some basic principles of full-text search in Elasticsearch.

For more detailed information about Elasticsearch, I strongly encourage you to read the official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

In future parts of the series, I will focus more on analyzers, mappings, complex queries and also on data synchronization between Elasticsearch and MongoDB.