Bug #1896
VMs stuck in UNKNOWN mode
| Status: | Closed | Start date: | 04/12/2013 |
| --- | --- | --- | --- |
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | - | | |
| Target version: | - | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | OpenNebula 3.8 | | |
Description
Hello,
I've been facing really strange behaviour for a few weeks now.
I shut down my whole ONE infrastructure and brought it back online.
All the VMs were restored via Sunstone.
90% of them went back to RUNNING.
But on 2 hosts only, the VMs are still stuck in UNKNOWN status.
"virsh list" on host is saying that they are running.
e.g :
virsh list
Id Name State
----------------------------------
1 one-294 running
"ruby /var/tmp/one/vmm/kvm/poll one-xxx" is giving an answer with monitoring information.
e.g :
ruby -wd /var/tmp/one/vmm/kvm/poll one-294
STATE=a NETTX=231069507 USEDCPU=0.1 USEDMEMORY=2118112 NETRX=507801157
But in oned.log the answer to the poll is still STATE=d.
e.g.:
Fri Apr 12 08:13:56 2013 [VMM][I]: Monitoring VM 294.
Fri Apr 12 08:13:56 2013 [VMM][D]: Message received: LOG I 294 ExitCode: 0
Fri Apr 12 08:13:56 2013 [VMM][D]: Message received: POLL SUCCESS 294 STATE=d
I verified that oneadmin still has passwordless access to the hosts and that it is correctly configured.
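One way to reproduce what the driver does is to run the poll over SSH from the frontend, e.g. (node01 is just a placeholder hostname):
ssh oneadmin@node01 "ruby /var/tmp/one/vmm/kvm/poll one-294"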
I'm at a bit of a loss here, since it seems that the value is not cached anywhere, but I didn't check the DB.
Any help on this could be of great use.
Kind regards
Cyrille
History
#1 Updated by Ruben S. Montero about 8 years ago
The answers are different indeed:
POLL SUCCESS 294 STATE=d
vs
STATE=a NETTX=231069507 USEDCPU=0.1 USEDMEMORY=2118112 NETRX=507801157
Maybe not all the scripts have been properly copied. You could try removing and copying the whole /var/tmp/one/ again (either manually or with onehost sync).
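A rough sketch of the manual route, with host01 standing in for the affected host:
# on the frontend, as oneadmin: wipe the remote scripts and push them again
ssh oneadmin@host01 "rm -rf /var/tmp/one"
onehost sync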
#2 Updated by Cyrille Duverne about 8 years ago
Hello,
I ran a onehost sync and now, oh magic, ALL VMs are in UNKNOWN status, except the ones that are present on the Sunstone machine...
That's a change, although not in the right direction, but still...
Thanks in advance for your feedback.
Cyrille
#3 Updated by Ruben S. Montero about 8 years ago
Is /var/tmp/one being recreated on the host? Do you have any problem (e.g. space) in /var/tmp on that host?
#4 Updated by Cyrille Duverne about 8 years ago
Well,
No space issue on the hosts.
But from the master, ls -lArth /var/tmp gives: drwxr-xr-x 10 oneadmin oneadmin 4.0K Sep 30 2012 one
From the remote hosts, ls -lArth /var/tmp gives: drwxr-xr-x 9 oneadmin oneadmin 4.0K Nov 23 10:43 one
Is it normal that, after the onehost sync, the folder is still dated 23/11?
This seems really weird to me... I really don't understand the issue here.
All accesses are granted to oneadmin, and the poll run directly on the host gives a good answer, but when it comes from the master it does not seem to work...
#5 Updated by Ruben S. Montero about 8 years ago
I do not see how this would affect the execution of run_probes, such that it would behave differently from the command line than through the driver (ssh). But it may be worth trying to sync the clocks of the master and the host, recreating /var/tmp/one, and seeing if that fixes the problem...
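A rough sketch of those steps (pool.ntp.org is just an example NTP source; <host> is the affected node):
# on the frontend and on the affected host: sync the clocks
ntpdate pool.ntp.org
# on the affected host: remove the driver directory so it can be recreated
rm -rf /var/tmp/one
# back on the frontend: push the probes again and check the new timestamp
onehost sync
ssh oneadmin@<host> "ls -ld /var/tmp/one"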
#6 Updated by Cyrille Duverne about 8 years ago
Well well well, I finally managed to solve this.
By... drum roll... removing the ganglia options in oned.conf!
I'm using ganglia, and for some unknown reason it seems that ganglia was not responding on 2 hosts...
So I chose to remove it.
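For reference, the change amounts to disabling the ganglia-based information driver in oned.conf and relying on the standard SSH-based KVM probes again; a rough sketch from memory of the 3.x defaults (exact names and arguments may differ on your installation):
# ganglia-based information driver, now removed/commented out
#IM_MAD = [
#    name       = "im_ganglia",
#    executable = "one_im_sh",
#    arguments  = "ganglia" ]
# standard SSH-based KVM information driver
IM_MAD = [
    name       = "im_kvm",
    executable = "one_im_ssh",
    arguments  = "-r 0 -t 15 kvm" ]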
Do you know of any other monitoring software that doesn't need an agent running on the VM to achieve basic monitoring tasks?
Let me say that having this kind of feature integrated in ONE would be GREAT!!!
Thanks a lot for your investigation and time.
Have a great week end.
Cyrille
#7 Updated by Ruben S. Montero about 8 years ago
- Status changed from New to Closed
- Resolution set to fixed
OK. Great!!!
Yes, we are thinking of developing a very thin and light agent using the current probe mechanism and the XML-RPC API of OpenNebula. Basically, adding an option to change the monitoring strategy from pull (polling based) to push...