Bug #2001
Add timeout to ssh polling
| Status: | Closed | Start date: | 05/07/2013 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Core & System | | |
| Target version: | Release 4.0 | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | OpenNebula 3.8 | | |
Description
Hi,
We had a problem with a Xen hypervisor. It became unreachable after a crash, and ssh polls accumulated on the frontend.
There were many processes like:
ssh -n xen1.mydomain if [ -x "/var/tmp/one/vmm/xen/poll" ]; then /var/tmp/one/vmm/
sh -c ssh -n xen1.mydomain 'if [ -x "/var/tmp/one/vmm/xen/poll" ]; then /var/tmp/o
As a consequence, monitoring stopped working in OpenNebula. All hosts were in the "init" state, and the crashed VMs from the failed hypervisor were still shown as running. The scheduler stopped working as well. I tried to reinstantiate the VMs, but they were stuck in the "BOOT" state and nothing happened on the target hosts. I had to shut down oned and reboot the frontend for a quick recovery.
There should be an ssh timeout to handle this case: something like "-o ConnectTimeout=15", or a shorter timeout such as 5 seconds combined with retries, if retries are not already implemented. Ideally, the timeout would be configurable in oned.conf.
Best regards,
Laurent
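For illustration only, the timeout-plus-retries idea from the description could be sketched in shell as below. This is not how OpenNebula's drivers are implemented. The helper name `run_with_retries` and its arguments are hypothetical, and the only ssh options used are the standard OpenSSH `ConnectTimeout` and `BatchMode`. Note that `ConnectTimeout` only bounds connection setup; a session that hangs after connecting needs a separate safeguard.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenNebula): run a command up to
# RETRIES + 1 times, sleeping DELAY seconds between failed attempts.
# Usage: run_with_retries RETRIES DELAY COMMAND [ARGS...]
run_with_retries() {
    retries=$1
    delay=$2
    shift 2
    attempt=0
    while :; do
        # Success on any attempt ends the loop immediately.
        "$@" && return 0
        # Give up once the allowed number of retries is used.
        [ "$attempt" -ge "$retries" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example invocation (commented out so the sketch runs without a network):
# poll a host with a 15-second connect timeout and up to 2 retries.
# BatchMode=yes keeps ssh from blocking on a password prompt.
# run_with_retries 2 1 ssh -n -o BatchMode=yes -o ConnectTimeout=15 xen1.mydomain true
```

With a timeout of 15 seconds and 2 retries, an unreachable host would block a poll for roughly 3 × 15 seconds plus the delays, instead of indefinitely.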
History
#1 Updated by Ruben S. Montero about 8 years ago
- Status changed from New to Closed
- Resolution set to fixed
The monitor process has been improved in version 4.0, especially to tackle problems like the one you have just described.
However, it is a good idea to tune the ssh config parameters. It would be better to use the oneadmin ssh config file for this. I've updated the documentation to include your suggestion.
http://opennebula.org/documentation:rel4.0:ignc#secure_shell_access_front-end
THANKS
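As a rough sketch of what such a setting might look like in the oneadmin user's ssh config file on the front-end: the 5-second `ConnectTimeout` is the value mentioned in this thread, while the other options are assumptions added for illustration (they are standard OpenSSH client options, not something the linked documentation is known to prescribe).

```
# ~/.ssh/config of the oneadmin user (illustrative values)
Host *
    # Give up on connection setup after 5 seconds (value from this thread).
    ConnectTimeout 5
    # Assumption: drop an established session that stops responding
    # after about 3 x 10 seconds without a reply from the server.
    ServerAliveInterval 10
    ServerAliveCountMax 3
    # Assumption: never wait for interactive password input.
    BatchMode yes
```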
#2 Updated by Laurent Grawet about 8 years ago
Hi,
Thanks for the information. I had the same idea about .ssh/config, and I've updated the config. Is it better to use a short timeout (5 sec) as in the doc, and are there retries? Or is it safer to use a longer timeout, such as 15 sec?
#3 Updated by Laurent Grawet about 8 years ago
OK, I've just seen that I can configure retries in oned.conf with "-r". The default is 0.
VM_MAD = [
    name       = "vmm_xen",
    executable = "one_vmm_exec",
    arguments  = "-t 15 -r 0 xen",
    default    = "vmm_exec/vmm_exec_xen.conf",
    type       = "xen" ]
Thanks