Disk failure causes the container-sync service to exit (Swift bug)

Published: 2019-Mar-22
Source: http://blog.csdn.net/cywosp/article/details/23848083

A while ago I wrote a module for our project that monitors the state of every Swift service. Swift has 15 services in total: container-updater, account-auditor, object-replicator, proxy-server, container-replicator, object-auditor, object-expirer, container-auditor, container-server, account-server, account-reaper, container-sync, account-replicator, object-updater and object-server. Of these, proxy-server, account-server, container-server and object-server matter most: if any of them stops working, the Swift cluster can no longer serve requests, so watching their state is essential when handling cluster failures.

Recently the monitoring module ran into some problems that led me to a few small bugs in Swift. One of them is that the container-sync service stops when a disk that has been added to Swift fails. The bug shows up in the log as follows (syslog writes each newline as #012; the traceback is expanded here for readability):

    Apr 15 10:07:24 0d7d51e8-024e-3a94-a310-46cf5426b3f9 container-sync UNCAUGHT EXCEPTION
    Traceback (most recent call last):
      File "/usr/bin/swift-container-sync", line 23, in <module>
        run_daemon(ContainerSync, conf_file, **options)
      File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 110, in run_daemon
        klass(conf).run(once=once, **kwargs)
      File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 57, in run
        self.run_forever(**kwargs)
      File "/usr/lib/python2.6/site-packages/swift/container/sync.py", line 162, in run_forever
        for path, device, partition in all_locs:
      File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1521, in audit_location_generator
        partitions = listdir(datadir_path)
      File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1814, in listdir
        return os.listdir(path)
    OSError: [Errno 5] Input/output error: '/srv/node/sdb1/containers'

The log shows that the sdb1 disk hit an input/output error, so the call to listdir raised an exception. listdir is implemented as follows:

    # swift/common/utils.py
    def listdir(path):
        try:
            return os.listdir(path)
        except OSError as err:
            # ENOENT: No such file or directory
            if err.errno != errno.ENOENT:
                raise  # any error other than a missing directory is re-raised
        return []  # a missing directory is treated as empty

So only ENOENT is swallowed; an I/O error (errno 5, EIO) is re-raised to the caller.

listdir is called by audit_location_generator, which is implemented as follows:

    # swift/common/utils.py
    def audit_location_generator(devices, datadir, suffix='',
                                 mount_check=True, logger=None):
        device_dir = listdir(devices)
        # randomize devices in case of process restart before sweep completed
        shuffle(device_dir)
        for device in device_dir:
            ……

This function does not catch exceptions either, so anything raised inside it keeps propagating upward.

audit_location_generator is called by run_forever, which is implemented as follows:

    # swift/container/sync.py
    def run_forever(self):
        sleep(random() * self.interval)
        while True:
            begin = time()
            all_locs = audit_location_generator(self.devices,
                                                container_server.DATADIR, '.db',
                                                mount_check=self.mount_check,
                                                logger=self.logger)
            for path, device, partition in all_locs:
                self.container_sync(path)
                if time() - self.reported >= 3600:  # once an hour
                    self.report()
            elapsed = time() - begin
            if elapsed < self.interval:
                sleep(self.interval - elapsed)

From these three functions and the way they call each other, it is clear that run_forever does not catch exceptions. Any unexpected exception ends run_forever abnormally, and the corresponding process crashes.

This is what /var/log/messages recorded when the disk hit the I/O error:

    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: scanning ...
    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: end_request: I/O error, dev sdb, sector 976403386
    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): metadata I/O error: block 0x3a32bb76 ("xlog_iodone") error 5 numblks 64
    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1115 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa072c8b1
    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): Please umount the filesystem and rectify the problem(s)
    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
    Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: sd 0:2:1:0: [sdb] Synchronizing SCSI cache
    Apr 15 10:06:54 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
    Apr 15 10:07:24 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
    Apr 15 10:07:54 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.

The kernel reports the I/O error on sdb at 10:06:41, and container-sync exits with the uncaught exception at 10:07:24.

This problem does not do serious harm to the cluster as a whole, and disks fail fairly rarely these days, but it does have a small impact on the cluster's health and on container consistency. I therefore reported it on Swift's official bug tracker; I don't know whether the maintainers will accept and fix it. Details: https://bugs.launchpad.net/swift/+bug/1307798
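
One way to keep a single bad disk from ending the whole sweep is to catch the OSError per device and move on. The block below is only a minimal sketch of that idea, not the fix that Swift adopted: iter_device_partitions is a hypothetical helper written for this illustration, and logging-and-skipping is an assumed policy. listdir is repeated from the excerpt so that the example is self-contained.

```python
import errno
import logging
import os


def listdir(path):
    """Same behaviour as the excerpt: a missing directory yields []."""
    try:
        return os.listdir(path)
    except OSError as err:
        if err.errno != errno.ENOENT:
            raise
    return []


def iter_device_partitions(devices, datadir, logger=None):
    """Hypothetical helper: yield (device, partition) pairs.

    A device whose data directory cannot be listed (for example because
    the disk returns EIO) is logged and skipped, so the error never
    reaches the caller's loop.
    """
    logger = logger or logging.getLogger(__name__)
    for device in listdir(devices):
        datadir_path = os.path.join(devices, device, datadir)
        try:
            partitions = listdir(datadir_path)
        except OSError as err:
            # Assumed policy: report the broken device and keep going.
            logger.warning('Skipping %s: %s', datadir_path, err)
            continue
        for partition in partitions:
            yield device, partition
```

With a layout like the one in the traceback, iter_device_partitions('/srv/node', 'containers') would keep yielding partitions from the healthy devices while logging a warning for sdb1. Whether skipping silently is acceptable depends on the deployment; the monitoring module would still need to surface the warning.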

