
Elasticsearch Interview Questions and Answers


These top 50 Elasticsearch frequently asked interview questions were collected based on my interview experience with ELK (Elasticsearch, Logstash and Kibana) at different organizations. I have divided these questions into three categories as below.

  • Elasticsearch Overview Questions and Answers.
  • Basic Concepts and Terminology Questions and Answers.
  • Advanced and Practical Questions and Answers.

Elasticsearch Overview Questions and Answers

1. What is Elasticsearch?

“Elasticsearch is an open source, cross-platform, scalable, full-text search and analytics engine based on Apache Lucene. It helps in NRT (Near Real Time) analysis and full-text search on large volumes of data in a distributed, clustered environment.”

  • Elasticsearch is developed in Java by Elastic and is built on the Apache Lucene library.
  • Elasticsearch stores records as JSON documents with keys and values.
  • It is schema-free by default; if required, a schema can be added through mappings from the client application.
  • It is accessible by HTTP over the browser, and by applications through the Elasticsearch REST client API or the Elasticsearch Transport client.
  • Elastic provides applications and plug-ins that make Elasticsearch more useful, like Kibana for search and analysis with different charts and dashboards.

2. What are the advantages of Elasticsearch?

  • Elasticsearch is implemented in Java, which makes it compatible with almost every platform.
  • Elasticsearch is Near Real Time (NRT); in other words, a newly added document becomes searchable in the engine after about one second.
  • The Elasticsearch cluster is distributed, which makes it easy to scale and integrate into any big organization.
  • Creating full backups of data is easy by using the concept of gateway, which is present in Elasticsearch.
  • Elasticsearch REST uses JSON objects as responses, which makes it possible to invoke the Elasticsearch server with a large number of different programming languages.
  • Elasticsearch supports almost every document type except those that do not support text rendering.
  • Handling multi-tenancy is very easy in Elasticsearch when compared to Apache Solr.

3. What are the Disadvantages of Elasticsearch?

  • Elasticsearch handles request and response data only in JSON, while Apache Solr also supports CSV, XML and JSON formats.
  • Elasticsearch can run into split-brain situations, but only in rare cases.

4. What are the differences and similarities between NoSQL MongoDB and Elasticsearch?

Elasticsearch is an Apache Lucene based RESTful NRT (Near Real Time) search and analytics engine, while MongoDB is an open source document-oriented database management system.

Similarities

Certain features are common to both products, like a document-oriented store, schema-free design, distributed data storage, high availability, sharding and replication.

Difference

There are many differences between the two products, as below:

Indexing
  • Elasticsearch: uses Apache Lucene for indexing; its real-time indexing and searching power comes from Lucene, which allows an index to be created on every field of a document by default.
  • MongoDB: based on traditional B+ tree indexes; you define the indexes, which improves query performance but affects write operations.

Language: Elasticsearch is implemented in Java; MongoDB is implemented in C++.

Documents: Elasticsearch stores JSON documents; MongoDB stores them in BSON (Binary JSON) format (though to the end user it looks the same as a JSON document).

REST Interface: Elasticsearch is RESTful; MongoDB is not.

Map Reduce: Elasticsearch does not support MapReduce; MongoDB allows MapReduce operations.

Huge Data: Elasticsearch stores and searches huge data; MongoDB stores and retrieves huge data.

5. What are the common areas of use for Elasticsearch?

  • It’s useful in applications where you need to do analysis and statistics, and to find anomalies in data based on patterns.
  • It’s useful where you need to send alerts when a particular condition matches, like stock market movements or exceptions in logs.
  • It’s useful in applications for log analysis and issue resolution, because of full-text search across billions of records in milliseconds.
  • It’s compatible with applications like Filebeat, Logstash and Kibana for storing high volumes of data for analysis, and for visualizing it in the form of charts and dashboards.

6. What operations can be performed on Elasticsearch documents?

Elasticsearch performs some basic operations on documents, as shown in the sketch after this list:

  • Indexing
  • Searching
  • Fetching
  • Updating
  • Deleting Documents.
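A minimal console-style sketch of these operations through the REST API (the index, type, field names and values here are illustrative, not from a real deployment):

# Index (create) a document
PUT /customer/external/1
{ "name": "John Doe" }

# Fetch the document by ID
GET /customer/external/1

# Update a field (partial update)
POST /customer/external/1/_update
{ "doc": { "name": "Jane Doe" } }

# Search documents
POST /customer/_search
{ "query": { "match": { "name": "jane" } } }

# Delete the document
DELETE /customer/external/1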

Basic Concepts and Terminology Questions and Answers

7. What is an Elasticsearch Cluster?

A cluster is a collection of one or more nodes that together provide the capability to search text across data scattered over the nodes. A cluster is identified by a unique name within the network, so that all associated nodes join together under that cluster name.

Operation Persistence: the cluster also keeps records of all transaction-level changes, such as schema changes for an index, and tracks the availability of nodes in the cluster, so that data stays easily available in case any node fails over.

Elasticsearch Cluster

In the above screen, the Elasticsearch cluster “FACING_ISSUE_IN_IT” has three master nodes and four data nodes.

8. What is an Elasticsearch Node?

A node is an Elasticsearch server that is associated with a cluster. It stores data and helps the cluster with indexing data and serving search queries. A node is identified by a unique name in the cluster; if a name is not provided, Elasticsearch generates a random Universally Unique Identifier (UUID) at server start time.

A cluster can have one or more nodes. The first node that starts forms a cluster with a single node, and as other nodes start they join that cluster.

Data Node Documents Storage

The above screen represents the data of two indexes, I1 and I2, where index I1 has two types of documents, T1 and T2, while index I2 has only type T2, and the shards are distributed over all nodes in the cluster. This data node holds documents of shard S1 for index I1 and shard S3 for index I2. It also keeps replicas of documents of shards S2 of indexes I1 and I2, which are stored on other nodes in the cluster.

9. What are the types of Nodes in Elasticsearch?

Within an Elasticsearch cluster each node knows the other nodes, and configuration decides the role/responsibility of each individual node. Below are the Elasticsearch node types.

  • Master-Eligible Node.
  • Data Node.
  • Ingest Node.
  • Tribe Node/Coordinating Node.

10. What are the Master Node and Master-Eligible Nodes in Elasticsearch?

The master node controls cluster-wide operations like creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node. The master is elected from nodes with the configuration property node.master: true (the default).

The number of master-eligible nodes required for an election is decided by the configuration below:

discovery.zen.minimum_master_nodes: number (default 1)

and the number above should be set to (master_eligible_nodes / 2) + 1.
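For example, in a cluster with three master-eligible nodes, (3 / 2) + 1 = 2 with integer division, so elasticsearch.yml on each master-eligible node would contain something like the sketch below (values are illustrative):

node.master: true
discovery.zen.minimum_master_nodes: 2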

11. What is a Data Node in Elasticsearch?

Data nodes hold the shards/replicas that contain the indexed documents. Data nodes perform data-related operations such as CRUD, search, aggregation etc. Set node.data: true (the default) to make a node a data node.

Data node operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded. The main benefit of having dedicated data nodes is the separation of the master and data roles.

12. What is an Ingest Node in Elasticsearch?

Ingest nodes can execute pre-processing by applying an ingest pipeline to a document, in order to transform and enrich the document before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes, marking node.master and node.data as false and setting node.ingest: true.
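A minimal sketch of an ingest pipeline (the pipeline name, index and field are illustrative) that adds a field to every document before it is indexed:

PUT /_ingest/pipeline/add-env
{
  "description": "Add an environment field before indexing",
  "processors": [
    { "set": { "field": "env", "value": "production" } }
  ]
}

PUT /app1-logs/log/1?pipeline=add-env
{ "message": "application started" }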

13. What are the Tribe Node and Coordinating Node in Elasticsearch?

A tribe node is a special type of node that connects to multiple clusters and performs search and other operations across all connected clusters. Tribe nodes are configured through the tribe.* settings.

A coordinating node behaves like a smart load balancer. If you take away the ability to handle master duties, to hold data, and to pre-process documents, you are left with a coordinating node that can only route requests, handle the search reduce phase, and distribute bulk indexing.

Every node is implicitly a coordinating node. This means that a node that has all three of node.master, node.data and node.ingest set to false will act only as a coordinating node, and this cannot be disabled. As a result, such a node needs to have enough memory and CPU to deal with the gather phase.

14. What is an Index in Elasticsearch?

An index is a collection of documents with the same characteristics, stored on nodes in a distributed fashion and identified by a unique name, against which we perform different operations like inserting, search queries, updating and deleting documents. A cluster can have as many indexes as needed, each with a unique name.

A document is stored in an index and assigned a type, and an index can have multiple types of documents.

15. What are Shards in Elasticsearch?

Shards are partitions of an index scattered over nodes in order to make it scalable. They provide the capability to store a large number (billions) of documents for the same index in the cluster, even when a single node's disk is not capable of storing them all. Shards also maintain an inverted index of document tokens to make full-text search fast.

16. What is a Replica in Elasticsearch?

A replica is a copy of a shard which is stored on a different node. A shard can have zero or more replicas. If a shard is on one node, its replica is stored on another node.

17. What are the benefits of Shards and Replicas in Elasticsearch?

  • Shards split an index into horizontal partitions to handle high volumes of data.
  • Operations run in parallel on each shard or replica across multiple nodes for an index, which increases system performance and throughput.
  • Data is recovered easily in case a node fails over, because a replica of the data exists on another node; a replica is always stored on a different node from its shard.

Some Important Points:

When we create an index, by default Elasticsearch configures it with 5 shards and 1 replica, but we can configure this from the config/elasticsearch.yml file or by passing shard and replica values in the settings when the index is created.

Once an index is created we can't change its shard configuration, but we can modify the replica count. If the shard count needs to change, the only option is re-indexing.

Each shard is itself a Lucene index, and it can hold a maximum of 2,147,483,519 (= Integer.MAX_VALUE – 128) documents. Merging of search results and failover are taken care of by the Elasticsearch cluster.
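A minimal sketch of creating an index with explicit shard and replica settings (the index name and values are illustrative):

PUT /app1-logs
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}

The replica count can later be changed through PUT /app1-logs/_settings, while number_of_shards is fixed once the index exists.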

18. What is a Document in Elasticsearch?

Each record stored in an index is called a document, and it is stored as a JSON object. A document is similar to a row in RDBMS terms; the only difference is that each document can have a different number of fields and a different structure, but common fields should have the same data type.

19. What is a Type in Elasticsearch?

A type is a logical category/grouping/partition of an index whose semantics are completely up to the user, and all documents of a type have the same set of fields.

ElasticSearch => Indices => Types => Documents with Fields/Properties

20. What is a Document Type in Elasticsearch?

A document type can be seen as the document schema / mapping definition, which has the mapping of all the fields in the document along with its data types.

21. What is indexing in Elasticsearch?

The process of storing data in an index is called indexing in Elasticsearch. Data in Elasticsearch is divided into write-once, read-many segments. Whenever an update/modification is attempted, a new version of the document is written to the index.

22. What is an inverted index in Elasticsearch?

The inverted index is the backbone of Elasticsearch and is what makes full-text search fast. An inverted index consists of a list of all unique words that occur in the documents, and, for each word, a list of the document numbers and positions in which it appears.

For example, take two documents with the following content:

1: FacingIssuesOnIT is for ELK.

2: If ELK check FacingIssuesOnIT.

To build the inverted index, each document is split into words (also called terms or tokens), creating the sorted index below.

Term               Doc_1   Doc_2
--------------------------------
FacingIssuesOnIT     X       X
is                   X
for                  X
ELK                  X       X
If                           X
check                        X

Now, when we run a full-text search for a string, documents are ranked based on the existence and occurrence counts of the matching terms.

Books usually have an inverted index in their last pages: based on a word, we can find the page on which the word exists.
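As a sketch, a match query like the one below (the index and field names are illustrative) uses the inverted index to rank both documents above by how well they match the search terms:

POST /blog/_search
{
  "query": { "match": { "content": "ELK FacingIssuesOnIT" } }
}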

23. What is an Analyzer in Elasticsearch?

While indexing data in Elasticsearch, the data is transformed internally by the analyzer defined for the index, and then indexed. An analyzer is built from character filters, a tokenizer and token filters. The following built-in analyzers are available in Elasticsearch 5.6.

Standard Analyzer: divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lower-cases terms, and supports removing stop words.

Simple Analyzer: divides text into terms whenever it encounters a character which is not a letter. It lower-cases all terms.

Whitespace Analyzer: divides text into terms whenever it encounters any whitespace character. It does not lower-case terms.

Stop Analyzer: like the simple analyzer, but also supports removal of stop words.

Keyword Analyzer: a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.

Pattern Analyzer: uses a regular expression to split the text into terms. It supports lower-casing and stop words.

Language Analyzers: Elasticsearch provides many language-specific analyzers, like English or French.

Fingerprint Analyzer: a specialist analyzer which creates a fingerprint that can be used for duplicate detection.
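You can see what an analyzer produces with the _analyze API; a minimal sketch (the text is illustrative):

POST /_analyze
{
  "analyzer": "standard",
  "text": "FacingIssuesOnIT is for ELK."
}

The standard analyzer here returns the lower-cased terms facingissuesonit, is, for and elk, with the punctuation removed.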

24. What is a Tokenizer in Elasticsearch?

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. Inverted indexes are created and updated using these token values, by recording the order or position of each term and the start and end character offsets of the original word which the term represents.

An analyzer must have exactly one Tokenizer.

25. What is a Character Filter in an Elasticsearch Analyzer?

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.

An analyzer may have zero or more character filters, which are applied in order.

26. What are Token Filters in an Elasticsearch Analyzer?

A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like “the” from the token stream, and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.
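Putting questions 23–26 together, a minimal sketch of a custom analyzer that combines a character filter, a tokenizer and token filters (the index and analyzer names are illustrative):

PUT /blog
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}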

27. What are the types of Token Filters in an Elasticsearch Analyzer?

Elasticsearch has a number of built-in token filters, such as lowercase, stop, synonym and stemmer, which can be used in custom analyzers.

28. What is the use of the attributes enabled, index and store?

The enabled attribute applies to various Elasticsearch-specific/created fields such as _index and _size. User-supplied fields do not have an enabled attribute.

store means the data is stored by Lucene, and Lucene will return this data if asked. Stored fields are not necessarily searchable. By default, fields are not stored, but the full source is. Since you usually want the defaults (which makes sense), simply do not set the store attribute.

The index attribute is used for searching. Only indexed fields can be searched. The reason for the differentiation is that indexed fields are transformed during analysis, so you cannot retrieve the original data from them if it is required.
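A minimal mapping sketch for Elasticsearch 5.x showing both attributes (the index, type and field names are illustrative):

PUT /blog
{
  "mappings": {
    "post": {
      "properties": {
        "title":  { "type": "text",    "store": true  },
        "status": { "type": "keyword", "index": false }
      }
    }
  }
}

Here title is both indexed and individually stored, while status is kept in _source but cannot be searched.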

29. What is the query language of Elasticsearch?

Elasticsearch uses a JSON-based query language built on top of Apache Lucene, which is called the Query DSL.

30. Does Elasticsearch have a schema?

Yes. Elasticsearch can have mappings, which can be used to enforce a schema on documents. We define an Elasticsearch index schema by defining mappings.

Advanced and Practical Interview Questions and Answers

31. What scripting languages are supported by Elasticsearch?

Elasticsearch supports custom scripting in Lucene Expressions, Groovy, Python, JavaScript and Painless.

32. What is Painless and what are its benefits in Elasticsearch?

Painless is a simple, secure scripting language designed specifically for use with Elasticsearch 5.x. It is the default scripting language for Elasticsearch and can safely be used for inline and stored scripts. Painless can be used anywhere scripts are used in Elasticsearch.

Benefits of Painless:

  • Fast performance: Painless scripts run several times faster than the alternatives.
  • Safety: Fine-grained whitelist with method call/field granularity.
  • Optional typing: Variables and parameters can use explicit types or the dynamic def type.
  • Syntax: Extends Java’s syntax to provide Groovy-style scripting language features that make scripts easier to write.
  • Optimizations: Designed specifically for Elasticsearch scripting.
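A minimal sketch of an inline Painless script in an update request (the index, type, ID and field are illustrative; Elasticsearch 5.x accepts the script body under "inline"):

POST /customer/external/1/_update
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.visit_count += params.increment",
    "params": { "increment": 1 }
  }
}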

33. How to store Elasticsearch node data in an external directory?

By default the Elasticsearch data path location is $ES_HOME/data. Keeping data in a path external to the Elasticsearch directory is beneficial while doing upgrades or any modification of Elasticsearch, so that no data is lost.

There are two ways to point to an external path:

First: set a static path in the elasticsearch.yml file as below.

path.data: /opt/app/FacingIssuesOnIT/data

Second: by passing an argument from the command line while starting Elasticsearch.

./bin/elasticsearch -Epath.data=/opt/app/FacingIssuesOnIT/data

34. What are Snapshot and Restore in Elasticsearch?

Snapshot: a snapshot is a copy or backup of individual indices or an entire cluster, stored in a remote repository like a shared file system, S3, or HDFS. Snapshots are not archival, because they can only be restored to versions of Elasticsearch that can read the index.

Steps to create Snapshot:

  • Set up a backup repository
PUT /_snapshot/facingIssueOnIT_bkp
{
  "type": "fs",
  "settings": {
    "compress": true,
    "location": "/mount/backups/facingIssueOnIT_bkp"
  }
}
  • Check its status
GET /_snapshot/facingIssueOnIT_bkp
or
GET /_snapshot/_all
{
  "facingIssueOnIT_bkp": {
    "type": "fs",
    "settings": {
      "compress": true,
      "location": "/mount/backups/facingIssueOnIT_bkp"
    }
  }
}
  • After registering the repository, create a snapshot of the cluster or of specific indexes as below.
For the cluster:
PUT /_snapshot/facingIssueOnIT_bkp/snapshot_1?wait_for_completion=true

For specific indexes:
PUT /_snapshot/facingIssueOnIT_bkp/snapshot_1
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}

wait_for_completion=true makes the request wait until the snapshot completes before returning; to run it in the background, set it to false.

Restore: restore is used to bring snapshot/backup indexes back into the cluster. Restore can be done at the cluster level and at the index level.

Cluster Level
POST /_snapshot/facingIssueOnIT_bkp/snapshot_1/_restore
Index Level
POST /_snapshot/facingIssueOnIT_bkp/snapshot_1/_restore
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": true,
  "rename_pattern": "index_(.+)",
  "rename_replacement": "restored_index_$1"
}

35. What is the Elasticsearch REST API and what is it used for?

Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with your cluster. Among the things that can be done with the API are the following:

  • Check your cluster, node, and index health, status, and statistics
  • Administer your cluster, node, and index data and metadata
  • Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
  • Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others

To learn more on Elasticsearch REST API follow link Elasticsearch Tutorial

36. How to check Elasticsearch cluster health?

To check cluster health, use the below URL over curl or in your browser.

GET /_cat/health?v

37. What are the types of Cluster Health Status?

  • Green means everything is good (the cluster is fully functional).
  • Yellow means all data is available but some replicas are not yet allocated (the cluster is still fully functional).
  • Red means some data is not available for whatever reason.
  • Note: even if a cluster is red, it is still partially functional (i.e. it will continue to serve search requests from the available shards), but you will likely need to fix it ASAP since you have missing data.

38. How to know the number of nodes?

GET /_cat/nodes?v

Response:

ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           10           5   5    4.46                  mdi       *      PB2SGZY

Here we can see our one node named “PB2SGZY”, which is the single node currently in our cluster.

39. How to get the list of available indices in an Elasticsearch cluster?

GET /_cat/indices?v

40. How to create an index?

PUT /customer?pretty
GET /_cat/indices?v

41. How to delete an index and its records?

DELETE /customer?pretty
GET /_cat/indices?v

and 

PUT /customer
PUT /customer/external/1
{
  "name": "John Doe"
}
GET /customer/external/1
DELETE /customer

If we study the above commands carefully, we can actually see a pattern of how we access data in Elasticsearch. That pattern can be summarized as follows:

<REST Verb> /<Index>/<Type>/<ID>

This REST access pattern is so pervasive throughout all the API commands that if you can simply remember it, you will have a good head start at mastering Elasticsearch.

42. How to update a record and document field values in an index?

We’ve previously seen how we can index a single document. Let’s recall that command again:

PUT /customer/external/1?pretty
{
  "name": "John Doe"
}

Again, the above indexes the specified document into the customer index, external type, with the ID of 1. If we then execute the same command again with a different (or the same) document, Elasticsearch will replace (i.e. reindex) a new document on top of the existing one with the ID of 1:

PUT /customer/external/1?pretty
{
  "name": "Jane Doe"
}

The above changes the name of the document with the ID of 1 from “John Doe” to “Jane Doe”. If, on the other hand, we use a different ID, a new document will be indexed and the existing document(s) already in the index remains untouched.

PUT /customer/external/2?pretty
{
  "name": "Jane Doe"
}

The above indexes a new document with an ID of 2.

When indexing, the ID part is optional. If not specified, Elasticsearch will generate a random ID and then use it to index the document. The actual ID Elasticsearch generates (or whatever we specified explicitly in the previous examples) is returned as part of the index API call.

This example shows how to index a document without an explicit ID:

POST /customer/external?pretty
{
  "name": "Jane Doe"
}

Note that in the above case, we are using the POST verb instead of PUT since we didn’t specify an ID.
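The examples above replace the whole document. To update individual fields without re-sending the full document, Elasticsearch also provides the _update API; a minimal sketch against the same illustrative customer index:

POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe" }
}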


Read More

To read more on Elasticsearch Configuration, Sample Elasticsearch REST Clients, Search Queries Types with example follow link Elasticsearch Tutorial and Elasticsearch Issues.

Hope this blog was helpful for you.

Leave your feedback to enhance this topic further and make it more helpful for others.

Elasticsearch Overview


“Elasticsearch is an open source, cross-platform engine developed completely in Java. It's built on top of Lucene, which provides full-text search on high volumes of data quickly, and it makes analysis easy based on indexing. It is schema-free and provides NRT (Near Real Time) search results.”

Advantages of Elasticsearch

Full Text Search 

Elasticsearch is built on top of Lucene, which provides a full-featured open source library for full-text search.

Schema Free 

Elasticsearch stores documents in JSON format, and based on the content it detects words and types to make them searchable.

Restful API 

Elasticsearch is easily accessible over the browser by using a URL, and it also supports a RESTful API to perform operations. To read more on Elasticsearch REST, follow the link Elasticsearch REST JAVA API Overview.

Operation Persistence

The Elasticsearch cluster keeps records of all transaction-level changes, such as schema changes for an index, and tracks the availability of nodes in the cluster, so that data stays easily available in case any node fails over.

Areas of use for Elasticsearch

  • It’s useful in applications where you need to do analysis and statistics, and to find anomalies in data based on patterns.
  • It’s useful where you need to send alerts when a particular condition matches, like stock market movements or exceptions in logs.
  • It’s useful in applications for log analysis and issue resolution, because of full-text search across billions of records in milliseconds.
  • It’s compatible with applications like Filebeat, Logstash and Kibana for storing high volumes of data for analysis, and for visualizing it in the form of charts and dashboards.

Basic Concepts and Terminology

Cluster

A cluster is a collection of one or more nodes that together provide the capability to search text across data scattered over the nodes. A cluster is identified by a unique name within the network, so that all associated nodes join together under that cluster name.

For more info on Cluster configuration and query follow link Elasticsearch Cluster.

Elasticsearch Cluster

In the above screen, the Elasticsearch cluster “FACING_ISSUE_IN_IT” has three master nodes and four data nodes.

Node

A node is an Elasticsearch server that is associated with a cluster. It stores data and helps the cluster with indexing data and serving search queries. A node is identified by a unique name in the cluster; if a name is not provided, Elasticsearch generates a random Universally Unique Identifier (UUID) at server start time.

A cluster can have one or more nodes. The first node that starts forms a cluster with a single node, and as other nodes start they join that cluster.

For more info on Node Configuration, Master Node, Data Node, Ingest node follow link Elasticsearch Node.

Data Node Documents Storage

The above screen represents the data of two indexes, I1 and I2, where index I1 has two types of documents, T1 and T2, while index I2 has only type T2, and the shards are distributed over all nodes in the cluster. This data node holds documents of shard S1 for index I1 and shard S3 for index I2. It also keeps replicas of documents of shards S2 of indexes I1 and I2, which are stored on other nodes in the cluster.

Index

An index is a collection of documents with the same characteristics, stored on nodes in a distributed fashion and identified by a unique name, against which we perform different operations like search queries, updates and deletes for documents. A cluster can have as many indexes as needed, each with a unique name.

A document is stored in an index and assigned a type, and an index can have multiple types of documents.

For more info on Index Creation, Mapping Template , CRUD follow link Elasticsearch Index.

Shards

Shards are partitions of an index scattered over nodes. They provide the capability to store a large number (billions) of documents for the same index in the cluster, even when a single node's disk is not capable of storing them all.

Replica

A replica is a copy of a shard which is stored on a different node. A shard can have zero or more replicas. If a shard is on one node, its replica is stored on another node.

Benefits of Shards and Replicas

  • Shards split an index into horizontal partitions to handle high volumes of data.
  • Operations run in parallel on each shard or replica across multiple nodes for an index, which increases system performance and throughput.
  • Data is recovered easily in case a node fails over, because a replica of the data exists on another node; a replica is always stored on a different node from its shard.

Note

When we create an index, by default Elasticsearch configures it with 5 shards and 1 replica, but we can configure this from the config/elasticsearch.yml file or by passing shard and replica values in the settings when the index is created.

Once an index is created we can't change its shard configuration, but we can modify the replica count. If the shard count needs to change, the only option is re-indexing.

Each shard is itself a Lucene index, and it can hold a maximum of 2,147,483,519 (= Integer.MAX_VALUE – 128) documents. Merging of search results and failover are taken care of by the Elasticsearch cluster.

For more info on elasticsearch Shards and Replica follow Elasticsearch Shards and Replica configuration.

Document

Each record stored in an index is called a document, and it is stored as a JSON object. JSON data exchange is fast over the internet and easy to handle for browser-side display.

Read More

To read more on Elasticsearch Configuration, Sample Elasticsearch REST Clients, Search Queries Types with example follow link Elasticsearch Tutorial and Elasticsearch Issues.

Hope this blog was helpful for you.

Leave your feedback to enhance this topic further and make it more helpful for others.

Elasticsearch REST Index Manager Auto Client for CRUD


The Elasticsearch 5 REST Java Index Manager auto client can help manage the index lifecycle from the client end by setting configuration for keeping indexes open, closing them and deleting them; no third-party tool is required for this.

The below steps for automatic index management will save you the time of managing indexes manually, and will take care of the index lifecycle based on the configured times.

Pre-requisite

  • Java 8 or later is required.
  • Add the dependencies for the Elasticsearch REST client and JSON mapping to your pom.xml, or add them to your class path.
  • The index name format should be like IndexName-2017.06.10, for example app1-logs-2017.06.08; if you have a different date format, change the code below accordingly.

We will follow the below steps to create this client and run it automatically:

  • Create the Java Maven project ElasticsearchAutoIndexManager.
  • Add the ElasticSearchIndexManagerClient class to the project.
  • Test it.
  • Create an auto-run jar for the project.
  • Create a script file to run the jar.
  • Create a cron tab configuration to schedule it and receive email alerts.

Create Java Maven Project ElasticsearchAutoIndexManager

Create a console-based Java Maven project as in the below screenshot, with the name ElasticsearchAutoIndexManager. To know more about console-based Java Maven projects, follow the link How to create Maven Java Console Project?

Elasticsearch REST Auto Client

Add the below dependencies in the pom.xml file:

<!--Elasticsearch REST jar-->
<dependency>
			<groupId>org.elasticsearch.client</groupId>
			<artifactId>rest</artifactId>
			<version>5.1.2</version>
</dependency>
<!--Jackson jar for mapping json to Java -->
<dependency>
			<groupId>com.fasterxml.jackson.core</groupId>
			<artifactId>jackson-databind</artifactId>
			<version>2.8.5</version>
</dependency>

Add the below ElasticSearchIndexManagerClient class in the com.facingissuesonit.es package, and change the constant fields below as per your server info and requirements.

Set INDEX_NO_ACTION_TIME so that no action is taken until an index is this many days old. For example, a value of 2 means an index stays searchable in the system for two days.

private static final int INDEX_NO_ACTION_TIME = 2; 

Set INDEX_CLOSE_TIME to control when indexes are put in closed status, meaning they exist on the Elasticsearch server but are not searchable. For example, a value of 5 means indexes older than five days are closed and kept that way until the index delete time is reached.

private static final int INDEX_CLOSE_TIME = 5; 

Set INDEX_DELETE_TIME to decide when to delete indexes matching this criterion. For example, a value of 15 means all indexes with an index life of more than 15 days are deleted.

private static final int INDEX_DELETE_TIME = 15;

private static final String ELASTICSEARCH_SERVER = "ServerHost";

private static final int ELASTICSEARCH_SERVER_PORT = 9200;

Note: set the proxy server and login credential information if required; otherwise comment those lines out.

package com.facingissuesonit.es;

import java.io.IOException;
import java.io.InputStream;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

import com.fasterxml.jackson.databind.ObjectMapper;

public class ElasticSearchIndexManagerClient {
	private static final int INDEX_NO_ACTION_TIME = 2;
	private static final int INDEX_CLOSE_TIME = 5;
	private static final int INDEX_DELETE_TIME = 15;
	private static final String ELASTICSEARCH_SERVER = "ServerHost";
	private static final int ELASTICSEARCH_SERVER_PORT = 9200;
	public static void main(String[] args) {
		RestClient client;
		String indexName = "", indexDateStr = "";
		LocalDate indexDate = null;
		long days = 0;
		final DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
		final LocalDate todayLocalDate = LocalDate.now();

		try {
			ElasticSearchIndexManagerClient esManager=new ElasticSearchIndexManagerClient();
			//Get Connection from Elasticsearch
			client=esManager.getElasticsearchConnectionClient();
			if(client!=null)
			{
				IndexDetail[] indexList = esManager.getIndexDetailList(client);

				if (indexList != null && indexList.length > 0) {
					for (IndexDetail indexDetail : indexList) {
						indexName = indexDetail.getIndexName();
						System.out.println(indexName);
						indexDateStr = indexName.substring(indexName.lastIndexOf("-") + 1);
						//Below code is for getting the number of days difference between index creation and the current date
						try {
							indexDate = LocalDate.parse(indexDateStr.replace('.', '-'), formatter);
							days = ChronoUnit.DAYS.between(indexDate, todayLocalDate);
							esManager.performAction(indexDetail, days,client);
						} catch (Exception ex) {
							System.out.println("Index is not having formatted date as required : yyyy.MM.dd :"+indexName);
						}
					}
				}
			}
		} catch (Exception ex) {
			System.out.println("Exception found while index management");
			ex.printStackTrace();
			System.exit(1);
		} finally {
			System.out.println("Index Management successfully completed");
			System.exit(0);
		}
	}
	//Get Elasticsearch Connection
	private RestClient getElasticsearchConnectionClient() {
		final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
		credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("userid", "password"));

		RestClient client = RestClient
				.builder(new HttpHost(ELASTICSEARCH_SERVER,ELASTICSEARCH_SERVER_PORT))
				.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {

					public HttpAsyncClientBuilder customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
						return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
								.setProxy(new HttpHost("ProxyHost", 8080)); // proxy port must be an int; remove this line if no proxy is needed

					}
				}).setMaxRetryTimeoutMillis(60000).build();
		return client;
	}
	//Get the list of indexes from the Elasticsearch server
	public IndexDetail[] getIndexDetailList(RestClient client)
	{
		IndexDetail[] indexDetails=null;
		HttpEntity in=null;
		try
		{
		ObjectMapper jacksonObjectMapper = new ObjectMapper();
		Response response = client.performRequest("GET", "/_cat/indices?format=json&pretty", Collections.singletonMap("pretty", "true"));
		in =response.getEntity();
		indexDetails=jacksonObjectMapper.readValue(in.getContent(), IndexDetail[].class);
		System.out.println("Index found :"+indexDetails.length);
		}
		catch(IOException ex)
		{
			ex.printStackTrace();
		}

		return indexDetails;
	}
	//This method decides what action to take based on the index creation date and the configured days for no-action, close and delete
	private  void performAction(IndexDetail indexDetail, long days,RestClient client) {
		String indexName = indexDetail.getIndexName();
		if (days >= INDEX_NO_ACTION_TIME) {
			if (!(indexDetail.getStatus() != null && indexDetail.getStatus().equalsIgnoreCase("close"))) {
				// Close index condition
				if (days >= INDEX_CLOSE_TIME) {
					System.out.println("Close Index :" + indexName);
					closeIndex(indexName,client);
				}
			}
			// Delete index condition
			if (days >= INDEX_DELETE_TIME) {
				if (!(indexDetail.getStatus() != null && indexDetail.getStatus().equalsIgnoreCase("close"))) {
					System.out.println("Delete Index :" + indexName);
					deleteIndex(indexName,client);
				} else {
					System.out.println("Delete Close Index :" + indexName);
					deleteCloseIndex(indexName,client);
				}
			}
		}
	}

	// Operation on Indexes
		private  void closeIndex(String indexName,RestClient client) {

			flushIndex(indexName,client);
			postDocuments(indexName + "/_close", client);
			System.out.println("Close Index :" + indexName);
		}

		private  void deleteIndex(String indexName,RestClient client) {
			flushIndex(indexName,client);
			deleteDocument(indexName,client);
			System.out.println("Delete Index :" + indexName);
		}

		private  void deleteCloseIndex(String indexName,RestClient client) {
			openIndex(indexName,client);
			flushIndex(indexName,client);
			deleteDocument(indexName, client);
			System.out.println("Delete Close Index :" + indexName);
		}

		private  void openIndex(String indexName,RestClient client) {
			postDocuments(indexName + "/_open", client);
			System.out.println("Open Index :" + indexName);
		}

		private  void flushIndex(String indexName,RestClient client) {
			postDocuments(indexName + "/_flush", client);
			System.out.println("Flush Index :" + indexName);
			try {
				Thread.sleep(3000);
			} catch (InterruptedException ex) {
				ex.printStackTrace();
			}
		}
		//POST performs the actions used for creating and updating indexes
		public InputStream postDocuments(String endPoint,RestClient client)
		{
			InputStream in=null;
			Response response=null;
			try
			{
				response = client.performRequest("POST", endPoint, Collections.<String, String>emptyMap());
				in =response.getEntity().getContent();
			}
			catch(IOException ex)
			{
				System.out.println("Exception in post Documents :");
				ex.printStackTrace();
			}
			return in;
		}
		//DELETE performs the action used for deleting an index
		public InputStream deleteDocument(String endPoint,RestClient client)
		{
			InputStream in=null;
			try
			{

		    Response response = client.performRequest("DELETE", endPoint, Collections.singletonMap("pretty", "true"));
			in =response.getEntity().getContent();
			}
			catch(IOException ex)
			{
				System.out.println("Exception in delete Documents :");
				ex.printStackTrace();
			}
			return in;
		}

}

The above code is pretty straightforward and step by step. Let me explain the operations below.

Open: an index in open status is available for searching, and we can perform operations like close and delete on it while it is in open status. To open an index we can use the below command with curl.

curl -XPOST http://elasticsearchHost:9200/indexName-2017.06.10/_open

Flush: this operation is required before executing close and delete operations on indexes, so that all running transactions on the indexes complete.

curl -XPOST http://elasticsearchHost:9200/indexName-2017.06.10/_flush

Close: closed indexes persist on the Elasticsearch server but are not available for searching. To close an index we can use the below command with curl.

curl -XPOST http://elasticsearchHost:9200/indexName-2017.06.10/_close

Delete: the delete operation removes the index completely from the server.

curl -XDELETE http://elasticsearchHost:9200/indexName-2017.06.10

Now our code is ready to take care of indexes based on the configured times; test it by running the above code.

The below steps make your index manager code auto-runnable in a Linux environment.

Create Auto Runnable Jar

Export the ElasticsearchAutoIndexManager project as an auto-runnable jar, selecting ElasticSearchIndexManagerClient as the launch class. To learn about auto-runnable jar creation, follow the link How to make and auto run /executable jar with dependencies?

Create a Script File to Execute the Auto-run Jar

Create a script file as below with the name IndexManager.sh and save it.

#!/bin/bash
~/JAVA/jdk1.8.0_66/bin/java  -jar /opt/app/facingissuesonit/automate/IndexManagerClient.jar

Create a Cron Tab Configuration to Schedule the Job and Receive Email Alerts

Linux provides cron tab for executing scheduled jobs/scripts. Using cron tab, we will execute the runnable jar through the above script file.

  • Use the command crontab -e to create and edit existing entries in the cron tab.
  • Make the below cron entry in this editor to execute the IndexManager.sh script every night at 1 AM.
  • If you want execution alerts with console logs sent to you and your team, also add your email id as below.
  • Save the cron tab with ESC then :wq
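For example, an entry like the one below (the email address and script path are placeholders for your environment):

MAILTO="your.team@example.com"
0 1 * * * /opt/app/facingissuesonit/automate/IndexManager.sh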

Below are some more examples of cron tab expressions.

0 * * * *      : Run every hour of the day
* * * * *      : Run every minute of the day
30 4 * * *     : Run at 4:30 AM every day
5 10,22 * * *  : Run twice a day, at 10:05 and 22:05
5 0 * * *      : Run at 00:05, just after midnight

Read More

To read more on Elasticsearch REST , sample clients, configurations with example follow link Elasticsearch REST Tutorial and Elasticsearch Issues.

Hope this blog was helpful for you.

Leave your feedback to enhance this topic further and make it more helpful for others.

Elasticsearch REST JAVA Client to get Index Details List


Below is an example of getting index details in a Java array by using the Elasticsearch REST Java client. Here the client calls the endpoint “/_cat/indices?format=json” to retrieve the details of all indexes. It is the same as doing a GET with curl:

GET http://elasticsearchHost:9200/_cat/indices?format=json

Pre-requisite

  • Java 7 or later is required.
  • Add the below dependencies for the Elasticsearch REST client and JSON mapping to your pom.xml, or add them to your class path.

Dependency

<!--Elasticsearch REST jar-->
<dependency>
			<groupId>org.elasticsearch.client</groupId>
			<artifactId>rest</artifactId>
			<version>5.1.2</version>
</dependency>
<!--Jackson jar for mapping json to Java -->
<dependency>
			<groupId>com.fasterxml.jackson.core</groupId>
			<artifactId>jackson-databind</artifactId>
			<version>2.8.5</version>
</dependency>

Sample Code

import java.io.IOException;
import java.util.Collections;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

import com.fasterxml.jackson.databind.ObjectMapper;

public class ElasticsearchRESTIndexClient {

	public static void main(String[] args) {
		IndexInfo []indexArr = null;
		RestClient client = null;
		try {
			client = openConnection();
			if (client != null) {
				// performRequest GET method will retrieve all index detail list
				// information from elastic server
				Response response = client.performRequest("GET", "/_cat/indices?format=json",
						Collections.singletonMap("pretty", "true"));
				// GetEntity api will return content of response in form of json
				// in Http Entity
				HttpEntity entity = response.getEntity();
				ObjectMapper jacksonObjectMapper = new ObjectMapper();
				// Map json response to Java object in IndexInfo Array
				// Cluster Info
				indexArr = jacksonObjectMapper.readValue(entity.getContent(), IndexInfo[].class);
				for(IndexInfo indexInfo:indexArr)
				{
				System.out.println(indexInfo);
			    }
			}

		} catch (Exception ex) {
			System.out.println("Exception found while getting cluster detail");
			ex.printStackTrace();
		} finally {
			closeConnnection(client);
		}

	}

	// Get Rest client connection
	private static RestClient openConnection() {
		RestClient client = null;
		try {
			final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
			credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("userid", "password"));
			client = RestClient.builder(new HttpHost("elasticHost", Integer.parseInt("9200")))
					.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
						// Customize connection as per requirement
						public HttpAsyncClientBuilder customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
							return httpClientBuilder
									// Credentials
									.setDefaultCredentialsProvider(credentialsProvider)
									// Proxy
									.setProxy(new HttpHost("proxyServer", 8080));

						}
					}).setMaxRetryTimeoutMillis(60000).build();

		} catch (Exception ex) {
			ex.printStackTrace();
		}
		return client;
	}

	// Close Open connection
	private static void closeConnnection(RestClient client) {
		if (client != null) {
			try {
				client.close();
			} catch (IOException ex) {
				ex.printStackTrace();
			}
		}
	}

}

The IndexInfo object that the JSON index details are mapped to:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;

@JsonIgnoreProperties(ignoreUnknown = true)
public class IndexInfo {
@JsonProperty(value = "health")
private String health;
@JsonProperty(value = "index")
private String indexName;
@JsonProperty(value = "status")
private String status;
@JsonProperty(value = "pri")
private int shards;
@JsonProperty(value = "rep")
private int replica;
@JsonProperty(value = "pri.store.size")
private String dataSize;
@JsonProperty(value = "store.size")
private String totalDataSize;
@JsonProperty(value = "docs.count")
private String documentCount;

@Override
public String toString()
{
	StringBuffer str=new StringBuffer(60);
	str.append("{\n");
	str.append("    \"").append("indexName").append("\":\"").append(indexName).append("\",\n");
	str.append("    \"").append("health").append("\":\"").append(health).append("\",\n");
	str.append("    \"").append("status").append("\":\"").append(status).append("\",\n");
	str.append("    \"").append("shards").append("\":\"").append(shards).append("\",\n");
	str.append("    \"").append("replica").append("\":\"").append(replica).append("\",\n");
	str.append("    \"").append("dataSize").append("\":\"").append(dataSize).append("\",\n");
	str.append("    \"").append("totalDataSize").append("\":\"").append(totalDataSize).append("\",\n");
	str.append("    \"").append("documentCount").append("\":\"").append(documentCount).append("\"\n");
	str.append("    \"");
	return str.toString();
}
public String getIndexName() {
	return indexName;
}
public void setIndexName(String indexName) {
	this.indexName = indexName;
}
public int getShards() {
	return shards;
}
public void setShards(int shards) {
	this.shards = shards;
}
public int getReplica() {
	return replica;
}
public void setReplica(int replica) {
	this.replica = replica;
}
public String getDataSize() {
	return dataSize;
}
public void setDataSize(String dataSize) {
	this.dataSize = dataSize;
}
public String getTotalDataSize() {
	return totalDataSize;
}
public void setTotalDataSize(String totalDataSize) {
	this.totalDataSize = totalDataSize;
}
public String getDocumentCount() {
	return documentCount;
}
public void setDocumentCount(String documentCount) {
	this.documentCount = documentCount;
}
public String getStatus() {
	return status;
}
public void setStatus(String status) {
	this.status = status;
}
public String getHealth() {
	return health;
}
public void setHealth(String health) {
	this.health = health;
}
}

Read More on Elasticsearch REST

Integration

Integrate Filebeat, Kafka, Logstash, Elasticsearch and Kibana

Filebeat, Elasticsearch Output Configuration


If we need to ship server log lines directly to Elasticsearch over HTTP with Filebeat, we have to set the below fields for the elasticsearch output according to our Elasticsearch server configuration, following these steps.

  1.  Uncomment output.elasticsearch in the filebeat.yml file.
  2.  Set host and port in the hosts line.
  3.  Set the index name as you want. If it's not set, Filebeat will create a default index as “filebeat-%{+yyyy.MM.dd}”.

output.elasticsearch:
   enabled: true
   hosts: ["localhost:9200"]
   index: "app1-logs-%{+yyyy.MM.dd}"

Elasticsearch server credentials configuration, if any:

  1.  Set the user name and password.
  2.  Set protocol to https if needed, because the default protocol is http.

    protocol: "https"
    username: "userid"
    password: "pwd"

Elasticsearch Index Template Configuration: we can update the Elasticsearch index template from Filebeat, which defines the settings and mappings that determine field analysis.

Auto Index Template Loading: the Filebeat package will load the default template filebeat.template.json into Elasticsearch if no template configuration is present, and it will not overwrite an existing template.

Customized Index Template Loading: we can upload our user-defined template, and also update its version, by using the below configuration.

#(if set as false the template needs to be uploaded manually)
template.enabled: true
#default value is filebeat
template.name: "app1"
#default value is filebeat.template.json
template.path: "app1.template.json"
#default value is false
template.overwrite: false

By default the template.overwrite value is false, and an index template that already exists on Elasticsearch will not be overwritten. To overwrite the index template, set this flag to true in the filebeat.yml configuration file.

Latest Template Version Loading from Filebeat: set template.overwrite to true, and if you need the 2.x template file version, set the path of the latest template file with the below configuration.


template.overwrite: true
template.versions.2x.enabled: true
template.versions.2x.path: "${path.config}/app1.template-es2x.json"

Manual Index Template Loading: for loading an index template manually, please refer to Elasticsearch Index Template Management.

Compress Elasticsearch Output: Filebeat provides a gzip compression level which varies from 1 to 9. As the compression level increases, processing speed is reduced but less network bandwidth is used. By default compression is disabled and the value is 0.


compression_level: 0

Other configuration options:

bulk_max_size: the default value is 50. If Filebeat generates more events than the configured maximum batch size, it splits them into batches of the configured size and sends them to Elasticsearch. Larger batch sizes can improve performance but require more buffering, and can cause other issues like connection errors and request timeouts.

Never set the bulk size to 0, because then there is no buffering of events and Filebeat sends each event directly to Elasticsearch.

timeout: the default value is 90 seconds. If there is no response in that time, the HTTP request times out.

flush_interval: the time to wait for new events for a bulk request. If the bulk request reaches its maximum size before this time, it is sent and a new bulk index request is created.

max_retries: the default value is 3. When the retry count reaches the specified limit and the events are still not published, all events are dropped. Filebeat also provides the option to retry until all events are published, by setting the value to less than 0.

worker: we can configure the number of workers per host publishing events to Elasticsearch, which provides load balancing.

 Sample Filebeat Configuration file:

Sample filebeat.yml file to Integrate Filebeat with Elasticsearch

Integration

Complete Integration Example Filebeat, Kafka, Logstash, Elasticsearch and Kibana

Read More

To read more on Filebeat topics, sample configuration files and integration with other systems with example follow link Filebeat Tutorial  and  Filebeat Issues. To know more about YAML follow link YAML tutorials.

Leave your feedback to enhance this topic further and make it more helpful for others.