Fixing shards that fail to start after an ES node server loses power and reboots

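This post records an Elasticsearch translog corruption caused by an unexpected power loss on the node servers. It covers the symptoms, the steps taken to fix it, and why the translog matters. The TranslogCorruptedException was resolved by shutting down the cluster, deleting the damaged translog recovery file, and restarting the cluster.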

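Today two Elasticsearch node servers lost power unexpectedly and rebooted, after which a shard failed with a translog corruption exception. The repair process is recorded here.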

1. Problem

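A single machine holds more than 200 million records in one index with 20+ fields. Data was being written continuously with bulk requests (bulk.size=5000) when the machine lost power and went down.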

After the machine was repaired and ES was restarted, a TranslogCorruptedException appeared:

[2018-04-18 16:29:25,950][WARN ][indices.cluster          ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [ocslog-2018.04.18][0] failed to recover shard
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
        ... 4 more
Caused by: java.io.EOFException
        at org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
        at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
        at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
        at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
        at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
        ... 5 more
[2018-04-18 16:29:25,959][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] sending failed shard for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[ocslog-2018.04.18][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2018-04-18 16:29:25,959][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[ocslog-2018.04.18][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2018-04-18 16:29:26,304][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [master [elasticsearch_43_ssd][hROI0lMDQvqFNa2pcdlrGg][ocsbak][inet[/192.168.0.43:9300]]{tag=hot, max_local_storage_nodes=1, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]
[2018-04-18 16:29:26,624][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [master [elasticsearch_43_ssd][hROI0lMDQvqFNa2pcdlrGg][ocsbak][inet[/192.168.0.43:9300]]{tag=hot, max_local_storage_nodes=1, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]
[2018-04-18 16:29:27,174][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [master [elasticsearch_43_ssd][hROI0lMDQvqFNa2pcdlrGg][ocsbak][inet[/192.168.0.43:9300]]{tag=hot, max_local_storage_nodes=1, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]

Note: the primary shard of ocslog-2018.04.18 failed with the following error:
TranslogCorruptedException[translog corruption while reading from stream]
It looks like the unexpected power loss corrupted the translog.
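
When several indices are affected, reading the log by eye gets tedious. The following is a minimal sketch, not part of the original fix, that scans log lines shaped like the ones above and collects the (index, shard) pairs whose "failed to start shard" warning is followed by a TranslogCorruptedException cause. The function name and the regular expression are assumptions based only on the log format shown in this post.

```python
import re

# Matches the shard warning as it appears in the log above, e.g.
# "... [ocslog-2018.04.18][0] failed to start shard"
FAILED_SHARD = re.compile(
    r"\[(?P<index>[^\[\]]+)\]\[(?P<shard>\d+)\] failed to start shard"
)


def corrupted_translog_shards(lines):
    """Return the set of (index, shard) pairs whose start failure is
    caused by a TranslogCorruptedException (hypothetical helper)."""
    affected = set()
    current = None  # shard named by the most recent "failed to start shard" entry
    for line in lines:
        match = FAILED_SHARD.search(line)
        if match:
            current = (match.group("index"), int(match.group("shard")))
        elif line.startswith("["):
            # A new, unrelated log entry begins: the stack trace has ended.
            current = None
        elif (
            current is not None
            and line.startswith("Caused by:")
            and "TranslogCorruptedException" in line
        ):
            affected.add(current)
    return affected
```

It could be called as `corrupted_translog_shards(open(path))`, where `path` is the node's log file.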

2. Solution

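  • Shut down the cluster first.
  • Clear the translog recovery file of the ocslog-2018.04.18 index. The translog files are in this directory:
    /home/local/elasticsearch/data/elasticsearch_ocs/nodes/0/indices/ocslog-2018.04.18/0/translog
    It contains a file named translog*.recovering. Back it up, then delete it. Any operations that exist only in that file are lost.
  • Restart the cluster. After the restart the shard started normally.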
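
The backup-and-delete step can be sketched as a small script. This is only an illustration under stated assumptions: the function name `quarantine_recovering_translogs` and the backup directory are hypothetical, the `translog*.recovering` pattern is taken from the file name described above, and the node must already be stopped before it runs. Moving the file into a backup directory both keeps a copy and removes it from the shard directory.

```python
import shutil
from pathlib import Path


def quarantine_recovering_translogs(translog_dir, backup_dir):
    """Move every translog*.recovering file out of translog_dir into
    backup_dir and return the new paths (hypothetical helper).

    Run this only while the Elasticsearch node is stopped.
    """
    translog_dir = Path(translog_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for src in sorted(translog_dir.glob("translog*.recovering")):
        dst = backup_dir / src.name
        # shutil.move copies across filesystems if a plain rename is not possible.
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

For the shard in this post, `translog_dir` would be the translog directory listed above, and `backup_dir` any location outside the ES data path.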

3. Summary

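The ES translog records every change made to ES and is a key component for data backup and recovery.
If the machine goes down while the translog is being written, the write does not finish normally and the translog file is left without a proper ending: the last record is cut short. Reading it back during shard recovery then fails with an EOFException.
For details, see: https://blog.csdn.net/jiao_fuyou/article/details/79997292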
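
To make the failure mode concrete, here is a toy sketch of a checksummed, length-prefixed log. It is not Elasticsearch's on-disk translog format; the record layout (4-byte length, payload, 4-byte CRC32) and all function names are assumptions chosen for illustration. It shows how a record that was only partly written before a power loss makes the reader run out of bytes in the middle of a record.

```python
import io
import struct
import zlib


def write_record(stream, payload):
    """Append one record: 4-byte length, payload, 4-byte CRC32 (toy format)."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)
    stream.write(struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF))


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError if the stream ends early."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def read_records(stream):
    """Read all records; a truncated final record raises EOFError."""
    records = []
    while True:
        header = stream.read(4)
        if not header:
            return records  # clean end of file, on a record boundary
        if len(header) < 4:
            raise EOFError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = read_exact(stream, length)
        (crc,) = struct.unpack(">I", read_exact(stream, 4))
        if crc != zlib.crc32(payload) & 0xFFFFFFFF:
            raise ValueError("checksum mismatch")
        records.append(payload)


# Simulate a crash: the last record is only partly on disk.
buf = io.BytesIO()
write_record(buf, b"index doc 1")
write_record(buf, b"index doc 2")
truncated = io.BytesIO(buf.getvalue()[:-3])
```

Calling `read_records(truncated)` raises EOFError while reading the second record, which mirrors the EOFException nested inside the TranslogCorruptedException in the log above.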
