Testing Anticipated Errors in Elasticsearch Operations

민둥곰 2021. 5. 16. 18:25

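Purpose

Test problems that can occur while operating Elasticsearch.

Test environment

One cluster with three combined master/data nodes and one data-only node
SSL and xpack.security.enabled: true (SSL and the X-Pack security plugin applied)
Elastic's Basic license for Elasticsearch can be used as long as the deployment is not resold to customers, so the tests were switched from OpenDistro to the Elastic distribution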

Node	Elastic-Version	vCPU	RAM	Master Node	Data Node
node-0.example.com.01	~~opendistro(AWS)-7.6.1~~ elastic-7.5.2	4	8GB	true	true
node-0.example.com.02	~~opendistro(AWS)-7.6.1~~ elastic-7.5.2	4	8GB	true	true
node-0.example.com.03	~~opendistro(AWS)-7.6.1~~ elastic-7.5.2	4	8GB	true	true
data-node-01	~~opendistro(AWS)-7.6.1~~ elastic-7.5.2	4	8GB	false	true
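
With three master-eligible nodes, a majority is two, so the cluster is expected to keep an elected master when one of them is stopped, as in the shutdown test in this post. A minimal sketch of that arithmetic, assuming the usual majority rule; the function names are illustrative and not part of any Elasticsearch API.

```python
def master_quorum(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can stop while a majority remains."""
    return master_eligible - master_quorum(master_eligible)


if __name__ == "__main__":
    # The test environment has 3 master-eligible nodes.
    print(master_quorum(3), tolerated_master_failures(3))
```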

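Tests

Test purpose

Test problems that can occur during operation
Default RAID level for shard storage: RAID 1+0

Server shutdown

Force-stop ES on node-0.example.com.02

When disk usage on an ES server is at or above 85% (the default), shards are not reallocated
All indices turn yellow
This can be resolved in three ways: changing the watermark values, adding a data node, or deleting indices
- Change the watermark values and the read-only setting (defaults: watermark 85%, read-only false)
- Add a data node
- Delete indices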
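
The first option in the list above can be sketched as request bodies. This is a minimal sketch, assuming the watermarks are the `cluster.routing.allocation.disk.watermark.*` cluster settings and that the read-only setting the list refers to is `index.blocks.read_only_allow_delete`. The percentages are illustrative values, not recommendations, and the helper names are hypothetical.

```python
import json


def watermark_settings(low: str, high: str, flood_stage: str) -> dict:
    """Body for PUT _cluster/settings that changes the disk watermarks."""
    prefix = "cluster.routing.allocation.disk.watermark."
    return {
        "transient": {
            prefix + "low": low,
            prefix + "high": high,
            prefix + "flood_stage": flood_stage,
        }
    }


def clear_read_only_block() -> dict:
    """Body for PUT /<index>/_settings; null resets the block to its default."""
    return {"index.blocks.read_only_allow_delete": None}


if __name__ == "__main__":
    # Illustrative values only; keep low <= high <= flood_stage.
    print(json.dumps(watermark_settings("90%", "93%", "97%"), indent=2))
    print(json.dumps(clear_read_only_block()))
```

Raising the watermarks only buys time on a nearly full disk, so adding a data node or deleting indices remains the durable fix.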

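Results

All indices changed to yellow

The shards that lived on the stopped node became UNASSIGNED

A problem occurred during shard reroute

Reroute was not possible because disk usage exceeded 85% → adding a data node or deleting indices is required

(Shard allocation is decided from disk usage relative to total disk capacity.) Disk usage on the current nodes exceeded the default 85%, so the shards could not be reallocated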
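
The check described above can be sketched as follows. This is a simplified model, assuming only the used-disk percentage and the 85% default matter; the real allocation decider also considers the size of the shard being moved. All names and the sample figures are illustrative.

```python
def disk_used_percent(total_bytes: int, free_bytes: int) -> float:
    """Used disk space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def can_receive_shards(used_percent: float, low_watermark: float = 85.0) -> bool:
    """Simplified rule: a node at or above the low watermark gets no new shards."""
    return used_percent < low_watermark


def eligible_nodes(usage_by_node: dict, low_watermark: float = 85.0) -> list:
    """Names of nodes that could still accept the unassigned shards."""
    return sorted(
        name for name, used in usage_by_node.items()
        if can_receive_shards(used, low_watermark)
    )


if __name__ == "__main__":
    # Hypothetical usage figures, not measurements from the test.
    usage = {"node-01": 88.0, "node-03": 91.5, "data-node-01": 86.2}
    print(eligible_nodes(usage))  # empty list: nothing can take the shards
```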

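Server replacement test

Test of a server replacement or version update that can occur during operation

When the ES data node is stopped during server replacement, indices are reallocated automatically
Disk usage on the other servers spikes
Temporarily disable reallocation, then stop the ES data node, then replace the server

If you manage shard placement yourself (primary, recovery), the steps above are required

Results

1. Result without changing the reallocation setting (allocation enabled)

The primary and recovery shards changed

In other words, if you manage shard placement, set reallocation to none before proceeding

2. Result after changing the reallocation setting (allocation disabled)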

PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}

Running the API call above returned the following error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unknown setting [archived.xpack.monitoring.collection.enabled] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unknown setting [archived.xpack.monitoring.collection.enabled] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
  },
  "status" : 400
}
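
The `archived.` prefix marks a setting that Elasticsearch no longer recognizes but still keeps in the cluster state, and such a leftover can cause settings updates to be rejected as shown. A common remedy is to null out the archived settings and then retry the original request. A minimal sketch of that body for `PUT _cluster/settings`, assuming the archived entry may sit under either `persistent` or `transient` (the error does not say which), so both are cleared; the helper name is hypothetical.

```python
import json


def clear_archived_settings() -> dict:
    """Body for PUT _cluster/settings that removes every archived.* setting."""
    return {
        "persistent": {"archived.*": None},
        "transient": {"archived.*": None},
    }


if __name__ == "__main__":
    print(json.dumps(clear_archived_settings(), indent=2))
```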

1) Disable reallocation

curl -XPUT 'http://{localhost or IP}:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}'

2) Start the elasticsearch daemon on the new machine

systemctl start elasticsearch

systemctl enable elasticsearch

3) Stop elasticsearch on the machine being replaced

Options 1 and 2 use the node shutdown API, which was removed in Elasticsearch 2.0; on the 7.5.2 nodes used here, use option 3.

1. curl -XPOST 'http://{localhost or IP}:9200/_cluster/nodes/_local/_shutdown'

2. curl -XPOST 'http://{localhost or IP}:9200/_shutdown'

3. systemctl stop elasticsearch && systemctl disable elasticsearch

4) Re-enable reallocation

curl -XPUT 'http://{localhost or IP}:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'
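
Steps 1 and 4 send the same request with a different value, so they can share one helper. A minimal sketch using only the standard library, assuming an unauthenticated HTTP endpoint as in the curl commands; with security enabled the request would also need credentials and HTTPS. It builds the request without sending it, and `allocation_request` is a hypothetical name.

```python
import json
import urllib.request

# Values accepted by cluster.routing.allocation.enable.
VALID_MODES = {"all", "primaries", "new_primaries", "none"}


def allocation_request(base_url: str, mode: str) -> urllib.request.Request:
    """Build the PUT _cluster/settings request that sets shard allocation."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported allocation mode: {mode!r}")
    body = {"transient": {"cluster.routing.allocation.enable": mode}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = allocation_request("http://localhost:9200", "none")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it against a live cluster: urllib.request.urlopen(req)
```

Call it with "none" before stopping the old node and with "all" once the new node has joined.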

Adding disk capacity

Test of adding disk capacity during operation

Planned to be done through LVM (changing the default data path)

-- not yet done

Backup & Recovery

##### Backup #####

# path.repo must be set in elasticsearch.yml

# Register the repository, e.g. PUT /_snapshot/test_backup_01
# "location" example: /backup/elasticsearch (same as path.repo; whether a subdirectory below it also works still needs to be checked)
PUT /_snapshot/{repoName}
{
  "type": "fs",
  "settings": {
    "location": "{path}",
    "compress": true
  }
}

# Create a snapshot, e.g. PUT /_snapshot/test_backup_01/test-20.06.25?wait_for_completion=true
# "indices" takes a pattern such as {indexName*} or a comma-separated list such as {indexName,indexName,indexName}
PUT /_snapshot/{repoName}/{snapshotName}?wait_for_completion=true
{
  "indices": "{indexName}",
  "ignore_unavailable": true,
  "include_global_state": true
}

# Each index is stored under path.repo/indices/


##### Recovery #####

# Check the snapshot, e.g. GET /_snapshot/test_backup_01/test-20.06.25
GET /_snapshot/{repoName}/{snapshotName}

# Option 1: close the index, then restore
# Close the index before restoring, e.g. POST /test-restore/_close
POST /{indexName}/_close

# Run the restore, e.g. POST /_snapshot/test_backup_01/test-20.06.25/_restore
POST /_snapshot/{repoName}/{snapshotName}/_restore

# Option 2: delete the index, then restore
# Delete the index before restoring, e.g. DELETE /test-restore
DELETE /{indexName}

# Run the restore, e.g. POST /_snapshot/test_backup_01/test-20.06.25/_restore
POST /_snapshot/{repoName}/{snapshotName}/_restore
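
The snapshot and restore calls above can be assembled programmatically. A minimal sketch that only builds paths and bodies, assuming the restore is limited to named indices and waits for completion; the helper names are hypothetical, and the repository and snapshot names are the examples used in this post.

```python
import json
from typing import List
from urllib.parse import quote


def snapshot_path(repo: str, snapshot: str) -> str:
    """Path of a snapshot inside a repository, with URL-escaped names."""
    return f"/_snapshot/{quote(repo, safe='')}/{quote(snapshot, safe='')}"


def create_snapshot(repo: str, snapshot: str, indices: List[str]) -> tuple:
    """(path, body) for PUT: snapshot the given indices and wait for completion."""
    body = {
        "indices": ",".join(indices),
        "ignore_unavailable": True,
        "include_global_state": True,
    }
    return snapshot_path(repo, snapshot) + "?wait_for_completion=true", body


def restore_snapshot(repo: str, snapshot: str, indices: List[str]) -> tuple:
    """(path, body) for POST: restore only the given indices and wait."""
    body = {"indices": ",".join(indices), "include_global_state": False}
    path = snapshot_path(repo, snapshot) + "/_restore?wait_for_completion=true"
    return path, body


if __name__ == "__main__":
    for path, body in (
        create_snapshot("test_backup_01", "test-20.06.25", ["test-restore"]),
        restore_snapshot("test_backup_01", "test-20.06.25", ["test-restore"]),
    ):
        print(path)
        print(json.dumps(body))
```

Waiting for completion makes the call block until the shards are restored, which is what a restore-time measurement needs.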

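Environment with 3 master nodes and 4 data nodes

path.repo on the 4 data nodes must point to a file system that all 4 data nodes can access, such as NFS (NFS was used temporarily for this test)

Restoring on a cluster other than the one that took the backup causes a problem (the snapshot UUID differs)

Using the Snapshot API / feature

Create a snapshot

Restore from the snapshot

The target index must be closed before restoring

Run the restore

Restore time (the request returns as soon as the restore is accepted, so this is the API response time rather than the time for the shards to finish recovering):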

[centos@test-elasticsearch-master-01 ~]$ time curl -XPOST http://192.168.147.37:9200/_snapshot/test_backup_01/test-20.06.25/_restore
{"accepted":true}
real 0m0.118s
user 0m0.010s
sys 0m0.025s

Size	docs.count	Backup time	Restore time
approx. 900 MB	19000	approx. 21 s	0.118 s

The same test was run with the following open-source tool:

https://github.com/elasticsearch-dump/elasticsearch-dump

elasticdump --input=http://192.168.147.37:9200/aojmcqclnqda --output=./testbackup.json

Thu, 25 Jun 2020 05:36:52 GMT | Total Writes: 19000
Thu, 25 Jun 2020 05:36:52 GMT | dump complete

elasticdump --input=./testbackup.json --output=http://192.168.147.37:9200 --output-index=test-restore

Thu, 25 Jun 2020 06:07:57 GMT | starting dump
Thu, 25 Jun 2020 06:07:57 GMT | got 100 objects from source file (offset: 0)

Thu, 25 Jun 2020 06:11:04 GMT | Total Writes: 19000
Thu, 25 Jun 2020 06:11:04 GMT | dump complete

elasticdump --input=http://192.168.147.37:9200/test-restore --output=./testbackup_02.json

Thu, 25 Jun 2020 06:13:31 GMT | starting dump
Thu, 25 Jun 2020 06:13:32 GMT | got 100 objects from source elasticsearch (offset: 0)

Thu, 25 Jun 2020 06:16:37 GMT | Total Writes: 19000
Thu, 25 Jun 2020 06:16:37 GMT | dump complete

[centos@test-elasticsearch-backup-01 ~]$ elasticdump --input=./mysql-slowquery-20.06.29.json --output=http://192.168.147.37:9200/test-mysqlslowquery
Sun, 28 Jun 2020 23:26:28 GMT | starting dump

Sun, 28 Jun 2020 23:31:33 GMT | dump complete
Sun, 28 Jun 2020 23:31:33 GMT | got 0 objects from source file (offset: 30577)

[centos@t-es-dump-backup ~]$ elasticdump --input=http://{IP or hostname}:9200/mysql-slowquery --output=./mysql-slowquery-20.06.29.json
Sun, 28 Jun 2020 22:44:21 GMT | starting dump

Sun, 28 Jun 2020 22:49:27 GMT | Total Writes: 30577
Sun, 28 Jun 2020 22:49:27 GMT | dump complete

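Size	docs.count	Backup time	Restore time
approx. 900 MB	19000	approx. 3 min	approx. 3 min
58.1 MB	30577	approx. 5 min	approx. 5 min

Backup and restore time grows with the number of documents rather than with the data size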
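
The document-count conclusion can be checked against the elasticdump log timestamps shown above. A minimal sketch that converts each run's start and end lines into documents per second; the run labels are illustrative, and only runs whose logs show both timestamps are included.

```python
from email.utils import parsedate_to_datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two elasticdump log timestamps."""
    delta = parsedate_to_datetime(end) - parsedate_to_datetime(start)
    return delta.total_seconds()


def docs_per_second(docs: int, start: str, end: str) -> float:
    """Average throughput of one dump or load run."""
    return docs / elapsed_seconds(start, end)


# (label, documents, first log timestamp, last log timestamp)
RUNS = [
    ("load into test-restore", 19000,
     "Thu, 25 Jun 2020 06:07:57 GMT", "Thu, 25 Jun 2020 06:11:04 GMT"),
    ("dump of test-restore", 19000,
     "Thu, 25 Jun 2020 06:13:31 GMT", "Thu, 25 Jun 2020 06:16:37 GMT"),
    ("dump of mysql-slowquery", 30577,
     "Sun, 28 Jun 2020 22:44:21 GMT", "Sun, 28 Jun 2020 22:49:27 GMT"),
]

if __name__ == "__main__":
    for label, docs, start, end in RUNS:
        rate = docs_per_second(docs, start, end)
        print(f"{label}: {elapsed_seconds(start, end):.0f} s, {rate:.1f} docs/s")
```

All three runs come out at roughly 100 documents per second despite the large difference in data size, which is consistent with the conclusion above.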

Applying RBAC (Role-Based Access Control)

Configure user and role permissions per Kibana index

-- to be updated --

Applying SSL

Apply SSL to Kibana and Elasticsearch

-- to be updated --

Applying cluster monitoring

Apply cluster monitoring for Java heap, search speed (application), and similar metrics

-- to be updated --
