Failing to acquire vTAP interfaces in a virtualized environment - virtualization

Suppose I've got the following setup:
A host that runs two KVM VMs (VM1 and VM2), a virtual bridge virbr, and two bridge taps, vTAP1 and vTAP2. The VMs are attached to vTAP1 and vTAP2 respectively.
I've got an application running on the host that measures different load metrics on the bridge. To do this it needs to acquire the vTAPs so that it can stream packets between them through the bridge for measurement.
The problem is that I can't acquire the vTAPs, because the TUNSETIFF ioctl fails with errno EBUSY.
I guess this happens because the application (running on the host) is not the owner of the taps (which are owned by the VMs). Adding new temporary bridge vTAPs for measurement is not always a solution, because sometimes I want to measure the flow directly between the VM vTAPs.
Attempted solution: There is a Multiqueue tuntap interface:
Linux supports multiqueue tuntap which can use multiple
file descriptors (queues) to parallelize packets sending or receiving. The
device allocation is the same as before, and if user wants to create multiple
queues, TUNSETIFF with the same device name must be called many times with
IFF_MULTI_QUEUE flag.
Using IFF_MULTI_QUEUE stopped the ioctl from failing with EBUSY, but then the write syscall to the vTAP started failing with EINVAL. So it didn't really solve anything.
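For reference, here is a minimal sketch of the call described above, written in Python rather than C. The constants are the values from linux/if_tun.h as I know them on x86 and ARM; the helper names (build_ifreq, attach_queue) and the 40-byte ifreq size (64-bit Linux) are my own assumptions. Whether attaching succeeds on a tap that QEMU already owns depends on how that tap was created, so this only shows the shape of the request, not a fix.

```python
import struct

# Values from <linux/if_tun.h> (assumed: x86/ARM ioctl encoding).
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFF_MULTI_QUEUE = 0x0100
IFNAMSIZ = 16


def build_ifreq(name, flags):
    """Pack a struct ifreq: 16-byte name, short flags, padded to 40 bytes."""
    raw = name.encode("ascii")
    if len(raw) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    return struct.pack("16sH22x", raw, flags)


def attach_queue(name, extra_flags=0):
    """Open /dev/net/tun and attach one more queue to an existing tap.

    Hypothetical helper; needs CAP_NET_ADMIN or tap ownership.
    """
    # Imported here so build_ifreq stays usable on non-Linux machines.
    import fcntl
    import os

    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE | extra_flags
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, flags))
    except OSError:
        os.close(fd)
        raise
    return fd
```

Calling attach_queue("vTAP1") would issue the same TUNSETIFF request as in the question; the flags passed here have to be compatible with the ones used when the tap was first created.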
I would appreciate any help, thanks.

Related

Mysql resource temporarily unavailable

I'm seeing a few of these errors during high load times:
mysql_connect() [function.mysql-connect]: [2002] Resource temporarily unavailable (trying to connect via unix:///var/lib/mysql/mysql.sock)
From what I can tell the mysql server isn't hitting its max connections limit, but there's something else stopping it from serving the query. What other limits would MySQL be hitting?
I'm running RHEL 6.2 64bit with MySQL 5.5.21
Let's assume your system is currently Unix-based (as given in your problem statement). If this is correct, here's the set of issues you may be running into:
You've run out of memory available to MySQL.
This is the most likely problem you're facing. Each connection in MySQL's connection pool requires memory to function, and if this resource is exhausted, no further connections can be made. Of course, the memory footprints and maximum packet sizes of various operations can be tuned in your equivalent to my.cnf if you discover this to be an issue.
Here's an additional thread that can help there, but you may also consider using simpler profiling tools like top to get a good ballpark estimate of what's going on.
You've run out of file descriptors available to your MySQL user account.
Another common issue: if you're trying to service requests that require file IO above the 1,024 boundary (by default), you will run into cases where the operation simply fails. This is because most systems specify a soft and hard limit on the number of open file descriptors each user can have available at one time, and walking over this threshold can cause problems.
This will usually have a series of glaringly obvious signs expressed in your log files. Check /var/log/messages and your comparable directories (for example, /var/log/mysql) to see if you can find anything interesting.
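A quick way to check the file descriptor theory is to compare the limit of the running mysqld process with the number of descriptors it currently holds. The sketch below assumes a Linux /proc filesystem (as on RHEL); the function names are mine, and you would need to look up the mysqld PID yourself and run this as root or as the mysql user.

```python
import os


def _to_int(value):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits."""
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return _to_int(soft), _to_int(hard)
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid):
    """Count descriptors currently open in the given process."""
    return len(os.listdir("/proc/%s/fd" % pid))


def fd_headroom(pid):
    """Return (open_now, soft_limit) for a process, e.g. mysqld."""
    with open("/proc/%s/limits" % pid) as handle:
        soft, _hard = parse_max_open_files(handle.read())
    return count_open_fds(pid), soft
```

If the first number returned by fd_headroom is close to the second during high load, descriptor exhaustion becomes a much more plausible explanation.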
You've run into a livelock or deadlock scenario where your thread is unsatisfiable.
Corollary to memory and file descriptor exhaustion, threads can time out if you've overstepped the computational load your system is capable of handling. It won't throw this error message, but this is something to watch out for in the future.
Your system is running out of PIDs available to fork.
Another common scenario: fork only has so many PIDs available for its use at any given time. If your system is simply overforked, it will cease to be able to service requests.
The easiest check for this is to see if any other services can connect through to the machine. For example, trying to SSH into the box and discovering that you cannot is a big clue.
An upstream proxy or connection manager has run out of resources and ceased servicing requests.
If you have any service layer between your client and MySQL, it bears inspecting to see if it has crashed, hung, or otherwise become unstable. The advice above applies.
Your port mapper has exhausted itself after 65,536 connections.
Unlikely, but again, a possible exhaustion case. Checking the trivial service connection as above is, ehm, also the best port of call here.
In short: this is a resource exhaustion scenario, inclusive of the server simply being "down". You're going to have to profile your system further to see what you're blocking on. All the error message gives us in this case is the fact the resource is unavailable to the client -- we'd need to see more information about the server to determine a more adequate remedy.
I still haven't found which limits it was hitting, but I did manage to work around the problem. There was a problem with our session table (in vbulletin) which uses the MEMORY engine. The indexes for this table were HASH and thus when vbulletin purged this table once an hour it would lock the table just long enough to hold up other queries and push mysql to the limit of its resources.
Changing the indexes to BTREE allowed MySQL to delete the rows from the session table a lot quicker and avoid whatever limits were being reached previously. The errors only started when we upgraded our master db server to MySQL 5.5, so I'm guessing MEMORY tables are handled differently in the latest release.
See http://www.mysqlperformanceblog.com/2008/02/01/performance-gotcha-of-mysql-memory-tables/ for information on the speed increases from using BTREE indexes over HASH for MEMORY tables.
Geez, this could be so many things. It could be that the socket buffer space is exhausted. It could be that mysql is not accepting connections as fast as they are coming in and the backlog limit is reached (though I'd expect that to give you a "Connection Refused" error, I don't know for sure that's what you'll get for a Unix domain socket). It could be any of the things @MrGomez pointed out.
Since you are running Apache and MySQL on the same server and this is a problem under high load, it could well be that Apache is starving the system of some resource and you're just not seeing (noticing?) the dropped/failed incoming connections/requests in your logs.
Are you using connection pooling? If not, I'd start there.
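To make the pooling idea concrete, here is a minimal, language-neutral sketch in Python (your stack is PHP, so treat this as an illustration of the technique, not something to drop in). ConnectionPool and the connect callable are hypothetical names; connect stands for whatever opens one MySQL connection in your environment.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, connect, size):
        # connect is a zero-argument callable returning a new connection.
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller failed.
            self._idle.put(conn)
```

The point of the design is that the number of sockets to mysqld is capped at the pool size, so a burst of requests waits in the application instead of piling up on the Unix socket.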
I'd also look for errors in the Apache logs and syslog around the same time as the mysql_connect error and see what else turns up. I'd especially recommend getting MySQL moved over to its own separate dedicated server.

The Cluster refresh solution

Update: We are using AIX environment.
We have been facing some random issues with our queues (cluster queues), like:
2189 Cluster resolution error (Most frequent one)
2270 MQRC_NO_DESTINATIONS_AVAILABLE
2053 Queue full error (the weirdest): post one message and it is posted successfully; post 3-4 messages and it throws this error for the rest of the messages.
All these issues are resolved once we do a cluster refresh. But I want to know the root cause: why do we get these errors? What goes wrong?
How does a cluster refresh resolve these errors?
This could be a socket issue. You can monitor sockets with the tools your OS provides; on Windows, for example, you can run:
netstat -a -b -o >/newfile.txt
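Once you have that file, a tally of TCP connections per state makes socket exhaustion easier to spot. This is a small sketch that assumes the Windows netstat column layout (Proto, Local Address, Foreign Address, State, PID); the function name is mine, and other platforms print different columns.

```python
from collections import Counter


def count_tcp_states(netstat_text):
    """Count TCP connections per state in Windows-style netstat output."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows look like: TCP <local> <foreign> <STATE> [<PID>]
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states
```

A very large TIME_WAIT or ESTABLISHED count taken while the errors occur would support the socket theory; small numbers would argue against it.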
You could also use TCPView on Windows (a single exe from Microsoft/Sysinternals): http://technet.microsoft.com/en-us/sysinternals/bb897437.aspx. In fact, all the Sysinternals tools should be on your production box if it runs Windows.
For sockets on Linux/Unix there are other tools; some are just ls commands against in-memory filesystems, depending on the version. A web search may help.
Also, if you are using Windows, consider moving some things to Linux; it will be painful in the beginning but will get better.
If this did not help, you should post your environment in your question and give any other details. If you can, get JProfiler into production and use it when the issue happens.
At the very least you can do a jstack and jmap
What are the versions/names of the OS, Java, and WebSphere?
If it is a socket issue, you can try increasing the socket limit (in the registry) and then profiling your code to see what is creating too many sockets and what needs to be throttled or rewritten.
Remember every page, every db connection, external cache hit (if you use) or any other URL work/ remote connection is usually a socket.

Diagnosing Win32 RegisterClass leak

We are trying to troubleshoot a nasty problem on a production server where the server will start misbehaving after running for a while.
Diagnostics have led us to believe there may be a bug in a DLL that is used by one of the processes running on this server that is resulting in a global atom leak. The assumed vector is a process that is calling RegisterClass without a corresponding UnregisterClass (and the class name is using a random number as part of the name, so it's a different class name each time the process starts).
This article provided some information: https://blogs.msdn.microsoft.com/ntdebugging/2012/01/31/identifying-global-atom-table-leaks/
But we are reluctant to attempt kernel mode debugging on a production server, so we have tried installing windbg and using the !gatom command to list atoms for a given session.
I use windbg to attach to a process in one of the sessions (these processes are running as Windows Services if that matters), then invoke the !gatom command. The returned atom list doesn't have any window classes in it.
Then I read this: https://blogs.msdn.microsoft.com/oldnewthing/20150429-00/?p=44984
and it sounds like there is a separate atom table for windows classes. And no way to query it. I was hoping that we'd be able to actually see how many windows class atoms have been registered, and see if that list gets bigger over time, indicating a leak.
The documentation on !gatom is sparse, and I'm hoping I can get some expert confirmation or recommendations on how to proceed.
Does anyone have any ideas on how we can get at the list of registered Windows classes on a production server?
More detail about what happens when the server starts to misbehave:
We run many instances (>50) of the same application as separately registered services running from isolated executables and DLLs - so each of those 50 instances has their own private executables and DLLs.
During their normal run, the processes unload and reload a DLL (about every hour). There is a windows class used that's part of a "session handle" used by the DLL (the session handle is part of the registered windows class name), and that session handle is unique each time the DLL is loaded. So every hour, there is an additional Window class registration, made by a DLL (our service stays running).
After some period of time, the system will get into a state where further attempts to load the DLL in question fail. This may happen for one of the services, then gradually over time, other services will start to have the same problem.
When this happens, restarting the service does not fix the problem. The only way that we've found to get things running properly again is to reboot the server.
We are monitoring memory commit load, and we are well within the virtual memory of the server. We are even within the physical memory size.
I just did a code review with the vendor of the DLL, and it looks like they are not actually calling RegisterClass from the DLL itself (they only make one RegisterClass call from the DLL, and it's a static string, not a different class name for each session). The DLL launches an EXE, and that EXE is the one that registers the session-specific class name. Their EXE does call UnregisterClass (and even if it didn't, the EXE is terminated when we unload their DLL, so it seems that this may not be what is going on).
I am now out of bullets on this one. The behavior seems like some sort of resource leak or pool exhaustion. The next time this happens, I will try connecting to the failing process with windbg and see what the application atom pool looks like - but I'm not hopeful that is going to shed any light.
Update: The excellent AtomTableMonitor tool has narrowed the problem to rogue RegisterWindowMessage. I'm going to ask a more specific question focused on this exact issue: Diagnosing RegisterWindowsMessage leak
You may try using this standalone global atom monitor.
The application appears to be able to monitor atoms in services that run in a different session.
By the way, if you have narrowed it down to RegisterWindowMessage, then Spy++ can log the registered messages system-wide, along with thread and process.
In Spy++ (I am using the one from VS2015 Community):
press Ctrl+M and select all windows in the system,
in the Messages tab, clear all and select Registered,
and start logging.
You can also save the log (it is plain text in spite of the strange extension):
powershell -c "gc spy++.sxl -Tail 3"
<000152> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.584 point:(408, 221)
<000153> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.600 point:(408, 221)
<000154> 001F01A4 P message:0xC1B2 [Registered:"nsAppShell:EventID"] wParam:00000000 lParam:06EDFCE0 time:4:27:49.600 point:(408, 221)
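Since the saved log is plain text, you can tally the registered message names in it with a few lines of script. This sketch assumes the line format shown above (a [Registered:"name"] field per entry); the function name is mine. A leak caused by RegisterWindowMessage with ever-changing names should show up as a steadily growing number of distinct names.

```python
import re
from collections import Counter

# Matches the [Registered:"..."] field of a Spy++ message log entry.
REGISTERED = re.compile(r'\[Registered:"([^"]*)"\]')


def count_registered_messages(log_text):
    """Return a Counter of registered message names found in the log."""
    return Counter(REGISTERED.findall(log_text))
```

Comparing len(count_registered_messages(text)) between two logs taken some hours apart gives a rough idea of whether new names keep appearing.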

VMware Workstation and Device/Credential Guard are not compatible

I have been running VMware for the last year with no problems. Today I opened it up to start one of my VMs and got an error message (see screenshot).
I followed the link and went through the steps; in step 4 I need to mount a volume using "mountvol".
When I try to mount a volume using mountvol X: \\?\Volume{5593b5bd-0000-0000-0000-c0f373000000}\ it keeps saying "The directory is not empty." I even created a new 2 GB partition and still get the same message.
My Questions:
How can I mount the volume when mountvol says the directory is not empty even though it is?
Why did Device/Credential Guard enable itself, and how can I get rid of it or disable it?
Device/Credential Guard is a Hyper-V based Virtual Machine/Virtual Secure Mode that hosts a secure kernel to make Windows 10 much more secure.
...the VSM instance is segregated from the normal operating
system functions and is protected by attempts to read information in
that mode. The protections are hardware assisted, since the hypervisor
is requesting the hardware treat those memory pages differently. This
is the same way to two virtual machines on the same host cannot
interact with each other; their memory is independent and hardware
regulated to ensure each VM can only access it’s own data.
From here, we now have a protected mode where we can run security
sensitive operations. At the time of writing, we support three
capabilities that can reside here: the Local Security Authority (LSA),
and Code Integrity control functions in the form of Kernel Mode Code
Integrity (KMCI) and the hypervisor code integrity control itself,
which is called Hypervisor Code Integrity (HVCI).
When these capabilities are handled by Trustlets in VSM, the Host OS
simply communicates with them through standard channels and
capabilities inside of the OS. While this Trustlet-specific
communication is allowed, having malicious code or users in the Host
OS attempt to read or manipulate the data in VSM will be significantly
harder than on a system without this configured, providing the
security benefit.
Running LSA in VSM, causes the LSA process itself (LSASS) to remain in
the Host OS, and a special, additional instance of LSA (called LSAIso
– which stands for LSA Isolated) is created. This is to allow all of
the standard calls to LSA to still succeed, offering excellent legacy
and backwards compatibility, even for services or capabilities that
require direct communication with LSA. In this respect, you can think
of the remaining LSA instance in the Host OS as a ‘proxy’ or ‘stub’
instance that simply communicates with the isolated version in
prescribed ways.
Hyper-V and VMware can't work at the same time. You have to migrate your VMs to Hyper-V or disable the feature. It should be enough to deselect the Hyper-V and Isolated User Mode features in Control Panel -> Programs and Features -> Turn Windows features on or off.
There is a much better way to handle this issue. Rather than removing Hyper-V altogether, you create an alternate boot entry that temporarily disables it when you need to use VMware, as shown here:
http://www.hanselman.com/blog/SwitchEasilyBetweenVirtualBoxAndHyperVWithABCDEditBootEntryInWindows81.aspx
C:\>bcdedit /copy {current} /d "No Hyper-V"
The entry was successfully copied to {ff-23-113-824e-5c5144ea}.
C:\>bcdedit /set {ff-23-113-824e-5c5144ea} hypervisorlaunchtype off
The operation completed successfully.
note: The ID generated from the first command is what you use in the second one. Don't just run it verbatim.
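If you want to script the two steps so the ID never has to be copied by hand, the only real work is pulling the braced ID out of the first command's output. A rough sketch in Python follows; the function names are mine, and it only builds the second command rather than running it, since bcdedit needs an elevated prompt.

```python
import re


def copied_entry_id(copy_output):
    """Extract the {...} identifier printed by 'bcdedit /copy'."""
    match = re.search(r"\{[^{}]+\}", copy_output)
    if match is None:
        raise ValueError("no boot entry ID found in bcdedit output")
    return match.group(0)


def disable_hypervisor_command(entry_id):
    """Build the second command from the answer for the given entry."""
    return ["bcdedit", "/set", entry_id, "hypervisorlaunchtype", "off"]
```

The resulting list can be handed to subprocess.run from an elevated process; passing it as a list avoids any shell quoting issues with the braces.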
When you restart, you'll then just see a menu with two options...
Windows 10
No Hyper-V
So using VMware is then just a matter of rebooting and choosing the No Hyper-V option.
If you want to remove a boot entry again, you can use the /delete option for bcdedit.
First, get a list of the current boot entries...
C:\>bcdedit /v
This lists all of the entries with their ID's. Copy the relevant ID, and then remove it like so...
C:\>bcdedit /delete {ff-23-113-824e-5c5144ea}
As mentioned in the comments, you need to do this from an elevated command prompt, not powershell. In powershell the command will error.
I'm still not convinced that Hyper-V is The Thing for me, even after last year's Docker trials and tribulations, and I guess you won't want to switch very frequently. So rather than creating a new boot entry and confirming the boot default or waiting out the timeout on every boot, I switch on demand from an admin console with
bcdedit /set hypervisorlaunchtype off
Another reason for this post -- to save you some headache: you thought you'd switch Hyper-V back on with the "on" argument? Nope. Too simple for MiRKoS..t. It's auto!
Have fun!
G.

SMS war continues, ideas welcome

I am trying to make U9 telit modem send SMS messages. I think I handle protocol correctly, at least, I manage to send them, but only under these circumstances: the native application was executed beforehand, and killed by task manager (without giving it a chance to initialize things).
It looks like the supplied application is good at doing certain initialization/deinitialization which is critical. I also see the difference between the two states in output of AT+CIND command. When I am trying to do things on my own, it returns zeroes (including signal quality), but when I run the same command after killing the native application, the output looks reasonable.
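To compare the two states programmatically, the reply to AT+CIND? can be parsed into a list of indicator values. The sketch below assumes the usual "+CIND: v1,v2,..." read-response shape from the standard AT command set; the function names are mine, and which position means signal quality is modem-specific, so check it with AT+CIND=? on the U9.

```python
def parse_cind(response):
    """Return the indicator values from an 'AT+CIND?' reply as ints."""
    prefix = "+CIND:"
    for line in response.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return [int(value) for value in line[len(prefix):].split(",")]
    raise ValueError("no +CIND line in response")


def looks_uninitialized(response):
    """True when every indicator is zero, as in the failing state."""
    return all(value == 0 for value in parse_cind(response))
```

Running looks_uninitialized on the raw reply before sending an SMS gives a cheap check for whether the modem is in the state that the native application normally fixes.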
I am nearly out of ideas. I have tried many things, including attempts to spy on the modem's COM ports (didn't work). I haven't tried setting Windows hooks to see what the application is trying to get through.
Perhaps you have encountered a similar situation?
Agg's "Advanced Serial Port Monitor" actually helped a lot. Sometimes it caused a blue screen, but it helped uncover secret commands that seem to help. AT+PCFULL is not described anywhere on the net, for example. The real trigger of the non-operation was AT+CFUN, the power disable/standby feature.
Also, it turned out that we have more issues. At first, the modem appears on the bus only as a disk drive. It doesn't want to appear as any other device before the drivers are installed. So, the U9 Telit software sends an IOCTL to the disk driver to tell the modem to reappear as more devices (modem, 3 serial ports, another disk drive).

Resources