Bug #3419
leases not correctly freed
Status: | Closed | Start date: | 12/09/2014 | |
---|---|---|---|---|
Priority: | Normal | Due date: | ||
Assignee: | Carlos Martín | % Done: | 0% | |
Category: | Core & System | |||
Target version: | Release 4.10.2 | |||
Resolution: | worksforme | Pull request: | ||
Affected Versions: | OpenNebula 4.10 |
Description
I still don't know yet how to reproduce, but after deploying and then deleting some VMs, a few IP are still marked as used in AR of their VNET.
onevnet show 126
VIRTUAL NETWORK 126 INFORMATION ID : 126 NAME : x0010.data USER : oneadmin GROUP : oneadmin CLUSTER : - BRIDGE : br0 VLAN : Yes VLAN ID : 2010 USED LEASES : 1 PERMISSIONS OWNER : um- GROUP : --- OTHER : --- VIRTUAL NETWORK TEMPLATE BRIDGE="br0" CLUSTER="private_networks" DESCRIPTION="reseau DATA pour x0010" NETWORK_ADDRESS="192.168.2.0" NETWORK_MASK="255.255.255.0" PHYDEV="" VLAN="YES" VLAN_ID="2010" ADDRESS RANGE POOL AR TYPE SIZE LEASES MAC IP GLOBAL_PREFIX 0 IP4 190 1 02:00:c0:a8:02:0a 192.168.2.10 - LEASES AR OWNER MAC IP IP6_GLOBAL 0 VM : 1147 02:00:c0:a8:02:12 192.168.2.18
onevm show 1147
VIRTUAL MACHINE 1147 INFORMATION ID : 1147 NAME : appli0.x0010 USER : oneadmin GROUP : oneadmin STATE : DONE LCM_STATE : LCM_INIT RESCHED : No START TIME : 12/03 11:42:50 END TIME : 12/03 11:50:39 DEPLOY ID : - VIRTUAL MACHINE MONITORING USED MEMORY : 0K USED CPU : 0 NET_TX : 0K NET_RX : 0K PERMISSIONS OWNER : um- GROUP : --- OTHER : --- VM DISKS ID TARGET IMAGE TYPE SAVE SAVE_AS 0 sda soclelinux-2012-2.31-jboss430-x86_6 file NO - VM NICS ID NETWORK VLAN BRIDGE IP MAC 0 admin yes br0 192.168.199.3 02:00:c0:a8:c7:03 1 x0010.appli yes br0 192.168.1.19 02:00:c0:a8:01:13 2 x0010.data yes br0 192.168.2.18 02:00:c0:a8:02:12 USER TEMPLATE COMMENT="Serveur applicatif pour la plateforme x0010" HYPERVISOR="kvm" LOGO="images/logos/centos.png" SCHED_REQUIREMENTS="TYPE=\"HYPERVISOR\"" VIRTUAL MACHINE TEMPLATE AUTOMATIC_REQUIREMENTS="!(PUBLIC_CLOUD = YES)" CONTEXT=[ DISK_ID="1", ETH0_DNS="10.156.0.2", ETH0_GATEWAY="192.168.199.254", ETH0_IP="192.168.199.3", ETH0_MAC="02:00:c0:a8:c7:03", ETH1_DNS="192.168.1.254", ETH1_GATEWAY="192.168.1.254", ETH1_IP="192.168.1.19", ETH1_MAC="02:00:c0:a8:01:13", ETH1_MASK="255.255.255.0", ETH1_NETWORK="192.168.1.0", ETH2_IP="192.168.2.18", ETH2_MAC="02:00:c0:a8:02:12", ETH2_MASK="255.255.255.0", ETH2_NETWORK="192.168.2.0", FILES_DS="/var/lib/one//datastores/2/1c9b55dd6cccaf334177609daf9ec5d9:'init.sh' ", NETWORK="YES", PROJET="x0010", PUPPET_ENV="x0010", PUPPET_PORT="8140", PUPPET_SERVER="192.168.199.200", SET_HOSTNAME="appli0.x0010", TARGET="hda" ] CPU="0.1" GRAPHICS=[ KEYMAP="fr", LISTEN="0.0.0.0", PORT="7047", TYPE="VNC" ] MEMORY="768" OS=[ ARCH="x86_64" ] TEMPLATE_ID="5" VMID="1147"
It's impossible to delete vnet 126 because openNebula says there is on IP in use in the leases. Ip is belonging to a VM that has been deleted and doesn't not exist anymore.
Any idea how to fix that ?
onevnet delete 126 [VirtualNetworkDelete] Cannot delete virtual network. Can not remove a virtual network with leases in use
Best regards,
Edouard.
History
#1 Updated by Ruben S. Montero over 6 years ago
- Category set to Core & System
#2 Updated by Edouard Bourguignon over 6 years ago
It seems I have a similar problem with the accounting. It shows an old deleted VM that is still consuming CPU hours (this VM is now using any IP lease). Should I open a new bug report?
#3 Updated by Ruben S. Montero over 6 years ago
- Assignee set to Carlos Martín
- Target version set to Release 4.10.2
Thanks Edouard, we'll take a look art both issues within this ticket. thanks again for your feedbak!
#4 Updated by Edouard Bourguignon over 6 years ago
- File Capture d_écran de 2014-12-10 15_24_17.png added
Here is a screenshot of the accounting problem. VM 121 is using a lot of CPU hours.
Here is the running VMs:
ID USER GROUP NAME STAT UCPU UMEM HOST TIME 171 oneadmin oneadmin puppet-0.s22 runn 21 1G pvidgsh104 13d 03h09 172 oneadmin oneadmin puppetenc-0.s22 runn 0 768M pvidgsh103 13d 03h09 1290 edouard users socle-2012-2.31 poff 0 0K pvidgsh101 0d 03h25 1291 edouard users modif-13-1291 runn 0 768M pvidgsh104 0d 01h37 1292 oneadmin oneadmin koan-1292 runn 0 512M pvidgsh102 0d 00h03
Here is details on VM 121:
VIRTUAL MACHINE 121 INFORMATION ID : 121 NAME : CentOS6-121 USER : benoit GROUP : users STATE : DONE LCM_STATE : LCM_INIT RESCHED : No START TIME : 11/18 14:41:16 END TIME : 11/26 11:00:25 DEPLOY ID : one-121 VIRTUAL MACHINE MONITORING USED MEMORY : 0K USED CPU : 0 NET_TX : 0K NET_RX : 1K PERMISSIONS OWNER : um- GROUP : --- OTHER : --- VM DISKS ID TARGET IMAGE TYPE SAVE SAVE_AS 0 vda CentOS 6.5 file NO - VM NICS ID NETWORK VLAN BRIDGE IP MAC 0 admin yes br0 192.168.199.108 02:00:c0:a8:c7:6c 1 appli yes br0 192.168.200.75 02:00:c0:a8:c8:4b VIRTUAL MACHINE HISTORY SEQ HOST ACTION DS START TIME PROLOG 0 pvidgsh103 live-migrate 0 11/18 14:41:35 2d 01h22m 0h00m14s 1 pvidgsh102 live-migrate 0 11/20 16:04:15 0d 00h21m 0h00m00s 2 pvidgsh103 stop 0 11/20 16:26:00 0d 00h15m 0h00m00s 3 pvidgsh104 none 0 11/20 16:52:24 0d 00h43m 0h00m00s 4 pvidgsh104 none 0 11/20 17:53:11 19d 21h36m 0h00m00s 5 pvidgsh104 none 0 11/20 17:54:10 19d 21h35m 0h00m00s 6 pvidgsh104 none 0 11/20 17:56:13 19d 21h33m 0h00m00s 7 pvidgsh104 none 0 11/20 17:56:41 19d 21h33m 0h00m00s 8 pvidgsh104 none 0 11/20 17:57:44 19d 21h32m 0h00m00s 9 pvidgsh104 none 0 11/20 17:58:48 19d 21h31m 0h00m00s 10 pvidgsh104 none 0 11/20 17:59:46 19d 21h30m 0h00m00s 11 pvidgsh104 none 0 11/20 18:01:52 19d 21h28m 0h00m00s 12 pvidgsh104 stop 0 11/20 18:03:04 4d 21h58m 0h00m00s 13 pvidgsh101 stop 0 11/25 16:07:54 0d 01h29m 0h00m00s 14 pvidgsh103 stop 0 11/25 17:42:24 0d 00h00m 0h00m00s USER TEMPLATE ERROR="Tue Nov 25 17:42:24 2014 : Error restoring VM: Could not restore from /var/lib/one//datastores/0/121/checkpoint" HYPERVISOR="kvm" LOGO="images/logos/centos.png" VIRTUAL MACHINE TEMPLATE AUTOMATIC_REQUIREMENTS="!(PUBLIC_CLOUD = YES)" CONTEXT=[ DISK_ID="1", ETH0_DNS="10.156.0.2", ETH0_GATEWAY="192.168.199.254", ETH0_IP="192.168.199.108", ETH0_MAC="02:00:c0:a8:c7:6c", ETH1_IP="192.168.200.75", ETH1_MAC="02:00:c0:a8:c8:4b", NETWORK="YES", TARGET="hda" ] CPU="1" GRAPHICS=[ KEYMAP="fr", LISTEN="0.0.0.0", PORT="6021", TYPE="VNC" ] MEMORY="768" OS=[ ARCH="x86_64" ] TEMPLATE_ID="0" VMID="121"
I have already done a onedb fsck with no luck. Glad to help :)
#5 Updated by Carlos Martín over 6 years ago
Hi,
Could you please send us the individual VM logs for both VMs? We would also need the 'onevm show -x' output.
Thanks.
#6 Updated by Carlos Martín over 6 years ago
About the accounting problem, definitely looks like the history records 4 to 11 do not have an end time, and are all still considered open (running). Let's see if the logs provide any clue to find out why.
#7 Updated by Edouard Bourguignon over 6 years ago
- File one-121.xml added
- File one-1147.xml added
Here is the onevm show 121 -x.
For the individual VM logs do you mean the libvirt log? Or some file in /var/log/one? Because in /var/log/one/121.log I only have this:
Tue Nov 25 17:42:24 2014 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/0/121/checkpoint Tue Nov 25 17:42:24 2014 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/0/121/checkpoint Wed Nov 26 11:00:25 2014 [Z0][TM][W]: Ignored: TRANSFER SUCCESS 121 -
I remember I try to use snaphot on the disk of this vm without success, but the VM was running VM. At least this one in particular.
I also add the onevm show 1147 -x. This VM has only this in its log:
Wed Dec 3 11:50:39 2014 [Z0][DiM][I]: New VM state is DONE.
For the accounting problem, here is log for VM 11:
Wed Nov 12 18:26:59 2014 [Z0][DiM][I]: New VM state is ACTIVE. Wed Nov 12 18:27:00 2014 [Z0][LCM][I]: New VM state is PROLOG. Wed Nov 12 18:27:32 2014 [Z0][LCM][I]: New VM state is BOOT Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/11/deployment.0 Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: ExitCode: 0 Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: Successfully execute network driver operation: pre. Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: ExitCode: 0 Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy. Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-vsctl set Port vnet2 tag=199". Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,arp,dl_src=02:00:c0:a8:c7:09,priority=45000,actions=drop". Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,arp,dl_src=02:00:c0:a8:c7:09,nw_src=192.168.199.9,priority=46000,actions=normal". Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,dl_src=02:00:c0:a8:c7:09,priority=40000,actions=normal". Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,priority=39000,actions=drop". Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: ExitCode: 0 Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: Successfully execute network driver operation: post. Wed Nov 12 18:27:33 2014 [Z0][LCM][I]: New VM state is RUNNING Wed Nov 12 19:16:20 2014 [Z0][LCM][I]: New VM state is CLEANUP. Wed Nov 12 19:16:21 2014 [Z0][DiM][I]: New VM state is DONE Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 Driver command for 11 cancelled Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 error: failed to get domain 'one-11' Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 error: Domaine non trouvé : no domain with matching name 'one-11' Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 ExitCode: 0 Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute virtualization driver operation: cancel. Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 ,in_port=5,dl_src=02:00:c0:a8:c7:09". Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 ,arp,in_port=5,dl_src=02:00:c0:a8:c7:09". Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 ,arp,in_port=5,dl_src=02:00:c0:a8:c7:09,arp_spa=192.168.199.9". Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 in_port=5". Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 ExitCode: 0 Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute network driver operation: clean. Wed Nov 12 19:16:22 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute transfer manager driver operation: tm_delete. Wed Nov 12 19:16:22 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute transfer manager driver operation: tm_delete. Wed Nov 12 19:16:22 2014 [Z0][VMM][W]: Ignored: CLEANUP SUCCESS 11
It seem interesting that it did find domain one-11 on the hypervisor, but that the cleanup was however successful. Same for other VM 4 to 10. Is it possible that this "failed to get domain 'one-11'" error could result from an unavailable ssh connection for oneadmin user?
#8 Updated by Edouard Bourguignon over 6 years ago
Sorry I meant "It seem interesting that it did not find domain one-11 on the hypervisor, but that the cleanup was however successful."
#9 Updated by Carlos Martín over 6 years ago
Hi there,
I've been trying to reproduce your problem with the leases, and I found that they can get stuck as used when the AR is resized to be smaller, see #3443.
Can it be the reason in your case? Did you change your AR size?
#10 Updated by Edouard Bourguignon over 6 years ago
No I'm pretty sure I didn't change the AR size. All the vnet + AR were created using some scripts (stress tests to create many vnets/VM). So today I try to rerun this scripts and the problem didn't occur.
The only thing I can think of is that during the first deployement and deletion of all this VMs, when the problem did occur, the network were really busy, and then the NFS get a bit stuck, perhaps preventing some SSH connections to the hypervisors. I remember having some VM still running on the hypervisors but shown as DONE/deleted from OpenNebula.
It's not a big deal if there is a procedure to clean this problem. Is it the case? How to I free this leases?
#11 Updated by Ruben S. Montero over 6 years ago
- Status changed from Pending to Closed
- Resolution set to worksforme
Marking this as worksforme as we cannot reproduce it.
The only way to clean the problem would be by implementing #3056. Meanwhile you can add a new AR with the addresses not freed, this is not very elegant but functionally oned could continue using those leases.
#12 Updated by Edouard Bourguignon over 6 years ago
Great #3056 is what we need to go in production