If you have ever used a search bar on a website, you've probably used Elasticsearch. Elasticsearch is an open-source search and analytics engine used for full-text search as well as for analyzing logs and metrics. It powers features like autocomplete in text fields, search suggestions, and location or geospatial search. Tons of companies use Elasticsearch, including Nike, SportsEngine, Autodesk, and Expedia.
Our friends at AWS have provided a managed service to easily deploy and run Elasticsearch called Amazon Elasticsearch Service (sometimes referred to as Amazon ES).
In this post, we'll discuss the eight things that can go wrong in AWS Elasticsearch listed below and their impact, how Blue Matador can help prevent the problem, and how to fix the problem if it happens.
Amazon Elasticsearch sends performance metrics to Amazon CloudWatch every minute, indicating the cluster's overall health with three colors: green, yellow, and red. A green status means all shards in your cluster are healthy and there's nothing to worry about, but what about yellow and red?
Yellow cluster health means you don't have enough nodes in the cluster to allocate all of the primary or replica shards. For example, if you had a 6-node cluster and created an index with 2 primary shards and 6 replicas, your cluster would be in a yellow state: each primary and its six replica copies must sit on separate nodes, which requires 7 nodes per shard. The primary shards can be allocated, but not all of the replica shards can be. When the cluster's overall health is yellow, you have less redundancy than you configured, so you are in danger of losing data if a node fails.
Red cluster health status indicates that at least one primary shard and its replicas are not allocated in the cluster. If the unassigned primary shard is on a new index, nothing can be written to that index. If it happens on an existing index, then not only can new data not be written, but the existing data can't be searched either. Having an unassigned primary shard is one of the worst things that can happen to your cluster. When your cluster is in the red status, AWS Elasticsearch stops taking automatic snapshots of your indices. AWS only stores these snapshots for 14 days, so if your cluster health remains red longer than that, you could permanently lose your cluster's data.
If your cluster health is yellow, the solution is pretty straightforward: add another node or configure fewer shards in the affected index.
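If you'd rather shrink the index than grow the cluster, the replica count can be lowered on the fly with the index settings API. The commands below are only a sketch; the domain endpoint, index name, and replica count are placeholders for your own, and depending on your domain's access policy you may need to sign the requests.

# Check cluster health and see how many shards are unassigned
curl -XGET "https://my-domain.us-east-1.es.amazonaws.com/_cluster/health?pretty"

# Lower the replica count on the over-replicated index so it fits on the nodes you have
curl -XPUT "https://my-domain.us-east-1.es.amazonaws.com/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 2}}'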
On the other hand, red cluster health is a more complicated fix. Two of the most common causes are failed cluster nodes or an Elasticsearch process crashing from an ongoing heavy processing load. Lucky for us, Elasticsearch offers two APIs that are helpful when troubleshooting a red cluster:
GET /_cluster/allocation/explain finds unassigned shards and explains why they can't be allocated to a node:
{
  "index": "test4",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes"
}
GET /_cat/indices?v lists the health status, number of documents, and disk usage for each index:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open test1 30h1EiMvS5uAFr2t5CEVoQ 5 0 820 0 14mb 14mb
green open test2 sdIxs_WDT56afFGu5KPbFQ 1 0 0 0 233b 233b
green open test3 GGRZp_TBRZuSaZpAGk2pmw 1 1 2 0 14.7kb 7.3kb
red open test4 BJxfAErbTtu5HBjIXJV_7A 1 0
green open test5 _8C6MIXOSxCqVYicH3jsEA 1 0 7 0 24.3kb 24.3kb
These APIs should identify problematic indices. According to Amazon, the fastest way to get your cluster out of the red status is to delete the red indices; if resource pressure caused the failure, scale up your instance types before recreating them.
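If you're comfortable with curl, here's roughly what that troubleshooting flow looks like; the domain endpoint below is a placeholder, test4 is the red index from the output above, and deleting it destroys whatever data it held.

# Ask Elasticsearch why shards can't be allocated
curl -XGET "https://my-domain.us-east-1.es.amazonaws.com/_cluster/allocation/explain?pretty"

# List every index with its health, doc count, and size to spot the red ones
curl -XGET "https://my-domain.us-east-1.es.amazonaws.com/_cat/indices?v"

# Last resort: delete the red index, then recreate it after scaling up
curl -XDELETE "https://my-domain.us-east-1.es.amazonaws.com/test4"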
Without you having to configure or maintain anything, Blue Matador is constantly monitoring your cluster health status and will immediately let you know when it falls to yellow or red, helping you avoid potential data loss.
AWS Elasticsearch utilizes two types of nodes: master nodes and data nodes. Master nodes manage the cluster and assign tasks to data nodes, and data nodes perform those tasks, including indexing and searching. Both need CPU to do their work. The CPU utilization metric refers to how much CPU the nodes are using relative to how much is available.
Consistently high CPU utilization either on master or data nodes may impact the ability of your nodes to index and query documents. It could also cause your Amazon Elasticsearch domain to get stuck in the processing state after a configuration change.
There are two ways to resolve sustained high CPU utilization on data nodes. You can either reduce the amount of CPU usage taking place on your nodes, or you can increase the amount of CPU available.
For master nodes, the most effective fix for high master CPU utilization is to scale up to a larger instance type for your dedicated master nodes.
Some of the most common causes for high CPU usage on both data and master nodes are large or frequent queries or writes, JVM Memory Pressure, and having too many open indices.
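If you want to eyeball the numbers yourself, they live in the AWS/ES CloudWatch namespace as CPUUtilization for data nodes and MasterCPUUtilization for master nodes. Here's a rough sketch with the AWS CLI; the domain name, account ID, and time window are placeholders.

# Maximum data-node CPU over the last hour, in 5-minute buckets
# (swap the metric name for MasterCPUUtilization to check the master nodes)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ES --metric-name CPUUtilization \
  --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
  --start-time 2020-01-01T00:00:00Z --end-time 2020-01-01T01:00:00Z \
  --period 300 --statistics Maximum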
Blue Matador automatically monitors all of your AWS Elasticsearch domains for sustained high CPU usage and potential overages.
Because CPU is a finite resource, we refer to this type of event as a depletion event.
If we detect an upcoming depletion event, we will notify you of the problem so you can fix it before it impacts your end user. With Blue Matador, there is no need to set a custom alarm; once it's integrated with your AWS infrastructure, it will start monitoring CPU utilization immediately and automatically.
Blue Matador monitors your Elasticsearch CPU utilization, plus all the other metrics mentioned in this blog, without any manual setup.
Because Elasticsearch is written in Java, it runs on the JVM (Java Virtual Machine). JVM Memory Pressure is the percentage of the Java heap that is used by the nodes in your Elasticsearch cluster. It is determined by two factors: the amount of data on the cluster in proportion to the cluster's resources, and the query load on the cluster.
If your Elasticsearch nodes begin to use too much of the Java heap, JVM Memory Pressure increases. When utilization goes above 75%, the garbage collector struggles to reclaim enough memory, and AWS Elasticsearch begins to slow or stop processes to free up memory in an attempt to prevent a JVM OutOfMemoryError, which would crash Elasticsearch. If memory pressure climbs to around 95%, Amazon Elasticsearch kills any process trying to allocate memory. If it kills a critical process, some cluster nodes could fail.
The best way to fix this problem is to prevent it from happening in the first place. You can avoid high JVM Memory Pressure by avoiding queries on wide ranges, limiting the number of requests made at the same time, and ensuring that you select the right number of shards and that you distribute them evenly between your nodes.
If your JVM memory pressure is consistently above 75%, then your Elasticsearch cluster is under-resourced. You'll need to explore scaling it up to better match the amount of data in your indices or the query load.
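To check where you stand, the cluster itself reports per-node heap usage, and CloudWatch exposes the same signal as the JVMMemoryPressure metric (MasterJVMMemoryPressure for master nodes). The endpoint below is a placeholder for your own domain.

# Per-node heap usage straight from the cluster (look at heap_used_percent)
curl -XGET "https://my-domain.us-east-1.es.amazonaws.com/_nodes/stats/jvm?pretty"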
Blue Matador monitors JVM Memory Pressure on both data nodes and master nodes with the appropriate thresholds in place without you having to create or manage any alarms. It will send you an event when JVM Pressure hits 80%.
AWS Key Management Service (KMS) allows you to encrypt your Elasticsearch Service data, including indices, log files, etc.
It's possible to lose access to your KMS encryption keys, meaning your cluster can no longer read any data you've encrypted with KMS. There are two primary ways this can happen: your KMS master key is disabled, or your master key is deleted or has its grants revoked. In either case, the data in your Elasticsearch cluster is unavailable, and if the key is deleted, it's unavailable permanently.
If your master key is just disabled, CloudWatch will show the KMSKeyError metric with a value of 1, and re-enabling the key will bring your cluster back to normal operation.
If your master key is deleted, CloudWatch will show the metric KMSKeyInaccessible with a value of 1. A deleted key is a much bigger problem—you cannot recover domains in this state, so you'll need to access your most recent snapshot and migrate the data from there.
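If the key was only disabled, turning it back on is a one-liner with the AWS CLI; the key ID below is a placeholder for your own CMK.

# Re-enable the disabled customer master key
aws kms enable-key --key-id 1a2b3c4d-5678-90ab-cdef-111122223333

# Confirm it's active again
aws kms describe-key --key-id 1a2b3c4d-5678-90ab-cdef-111122223333 \
  --query 'KeyMetadata.KeyState'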
Blue Matador watches for KMS key errors on your Elasticsearch clusters automatically, without you having to do anything. Once an error is detected, it will show up on your dashboard and Blue Matador will notify you via your preferred method.
Earlier, we talked about master and data nodes and how they're different. We'll need to go into a little more detail about master nodes to explain how the master reachability metric works.
Amazon recommends that you have three dedicated master nodes for each of your production AWS Elasticsearch domains. The dedicated master nodes form a quorum, which elects a single active master node.
The master reachability metric will indicate when the elected master node is unhealthy. In CloudWatch, this will appear as MasterReachableFromNode with a value of 0 and means the master node is unreachable or has stopped. In the event of a failure, write and read requests to the cluster will fail.
The main culprits behind unreachable master nodes are network connectivity issues or a dependency problem. The first thing you should do is check to make sure you've followed Amazon's suggestion of having three master nodes—then, if one has failed, you'll have a backup. Having three master nodes also helps you avoid a partitioned network, also known as a split-brain problem.
If you find that the number of master nodes isn't the problem, try scaling up your AWS Elasticsearch domain, either by moving to instance types with more memory, updating the Elasticsearch version you're running, or both.
It's also possible, though rare, that a hardware failure is causing the master reachability error.
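A quick way to rule out the configuration issue is to check how many dedicated master nodes your domain actually has; the domain name below is a placeholder.

# Show whether dedicated masters are enabled, how many there are, and their instance type
aws es describe-elasticsearch-domain --domain-name my-domain \
  --query 'DomainStatus.ElasticsearchClusterConfig'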
Like it does with many other metrics, Blue Matador keeps an eye out for this error and other problems that could lead up to it. When Blue Matador detects master nodes are failing the reachability check, it will send you the appropriate notification.
When you deploy Amazon Elasticsearch, you will select the number of nodes your domain will have. Each of those nodes will run on a separate EC2 instance.
Although rare, an EC2 instance in your Elasticsearch cluster might get terminated. Usually, AWS Elasticsearch will restart these for you on its own; however, the restart may fail or take a very long time, and your nodes will stay broken. This can cause all kinds of problems.
To prevent this error, make sure you have at least one replica for each index in your AWS Elasticsearch domain. You should also enable error logs in your domain to help you troubleshoot in the event you have issues with node count.
Node count can also fluctuate during cluster configuration changes and routine AWS Elasticsearch maintenance. In that case, there may not be a problem at all.
You could create an AWS CloudWatch alarm to notify you if your system throws a node count error, but it might alert you in the middle of Amazon performing maintenance on AWS Elasticsearch, making you panic when in reality there's nothing wrong. CloudWatch also won't automatically adjust your alarm settings if you scale your cluster up or down.
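For reference, a manual alarm on the Nodes metric would look something like the sketch below, using a day-long period to ride out maintenance windows; the domain name, account ID, threshold, and SNS topic are placeholders you'd swap for your own.

# Alarm if the minimum node count over a day drops below what you configured (6 here)
aws cloudwatch put-metric-alarm \
  --alarm-name my-domain-node-count-low \
  --namespace AWS/ES --metric-name Nodes \
  --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
  --statistic Minimum --period 86400 --evaluation-periods 1 \
  --threshold 6 --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alert-topic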
With our machine learning and alert automation, Blue Matador starts watching for this problem as soon as the software is installed in your environment. We automatically monitor the number of nodes in your cluster and detect when there are fewer than you have configured. As you scale your cluster, we automatically update the thresholds for you.
When it comes to storage on Amazon Elasticsearch Service, you can choose between local on-instance storage and EBS volumes. You select which storage option you prefer when you create your AWS Elasticsearch domain.
One of the most common problems we see with AWS Elasticsearch is outgrowing the disk space allocated to your domain. If nodes don't have enough storage space to perform shard relocation, basic write operations like adding documents and creating indices can begin to fail.
The solution to this problem is pretty straightforward—you need to get some more disk space, either by reducing the size of your source data, cleaning up unnecessary replicas, or purchasing additional storage.
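To see how close each node is to the edge, the _cat/allocation API breaks down disk usage per node; the endpoint below is a placeholder for your own domain.

# Disk used, disk available, and shard count per node
curl -XGET "https://my-domain.us-east-1.es.amazonaws.com/_cat/allocation?v"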
Blue Matador constantly monitors your disk space and will automatically let you know well in advance when you're getting close to the limit. As you get closer to full, we'll upgrade the severity of the event.
Blocked index writes are often the result of high JVM Memory Pressure and/or dwindling storage space. When there isn't enough disk space or heap space, the Elasticsearch cluster blocks the creation of new indices or documents for all or part of the cluster. This problem often takes a long time to resolve, and in the meantime you cannot add indices or documents to the affected nodes, so it's best to avoid it in the first place.
If writes are blocked due to a lack of disk space, according to Amazon, you can resolve the issue by scaling your Amazon Elasticsearch Service domain to use larger instance types, more instances, or more Elastic Block Store (EBS)-based storage. To prevent blocking due to JVM Memory Pressure, avoid making large batches of queries at once, or ensure the right number of shards are equitably distributed among your nodes. If you're consistently having memory trouble, you may need to allocate additional resources.
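In CloudWatch, this shows up as the ClusterIndexWritesBlocked metric going to 1. Once you've freed up space, writes may stay blocked if Elasticsearch applied its read-only-allow-delete block to your indices when disks filled up; depending on your Elasticsearch version the block clears on its own, but it can also be removed by hand. A sketch, with a placeholder endpoint:

# Clear the disk-pressure write block across all indices
curl -XPUT "https://my-domain.us-east-1.es.amazonaws.com/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'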
Not only will Blue Matador alert you when your clusters are actively blocking writes, but we also monitor JVM Memory Pressure, CPU utilization, and disk space and notify you when they fall out of normal range, all without you having to create any alarms.
Amazon Elasticsearch Service is a powerful tool for any site, but without performance insight, you may not know anything is wrong until a customer says "Hey, the search bar isn't working." Not ideal.
It's possible to monitor all of these items in CloudWatch, but you'll have to manually set alarms to receive notifications and stay on top of updating your alarms whenever you scale your Elasticsearch clusters up or down or create new clusters.
On the other hand, Blue Matador instantly monitors all of the leading indicators of these errors and will automatically notify you if it detects problems. It will also dynamically adjust thresholds for alarms when you scale or when you create new clusters. When it finds an error, Blue Matador will triage the event based on severity, reducing alert fatigue and making sure you see the most important events at the right time, helping you keep your search functions working correctly.