Rob Davies added a comment - 10/May/10 08:36 AM
Core file when broker stopped dispatching. Log appender blocked, Message Store full - probably because dispatching stopped
It appears that a concurrent JMX copy operation via the console may be the problem. The copy makes a temporary change to maxPageSize, which can break dispatch if it happens concurrently with dispatch.
Would there be some admin task that uses the console or JMX to copy some messages, by any chance?
Hi Gary. I noticed you were the last to comment on this issue. Is there an update?
Jack, I would like some feedback on my question about the use of JMX at the time the problem occurred, or the use of JMX programmatically. From analysis of the logs, the use of a JMX (JConsole or web console) copy seems to be the culprit. Can this be confirmed from the customer use case or the usage pattern of the application in question?
All of CVS Caremark's AMQ and ESB instances are monitored by HP SiteScope using a JMX interface similar to JConsole. HP SiteScope polls every couple of minutes to check statistics such as queue depths, heap sizes, component states, and number of open sockets.
Specifically for the CVS ESP (Enterprise Service Platform), the AMQ broker stopped accepting new messages even when hundreds of publishers were attached trying to enqueue. Question: Can you help CVS understand why concurrent JMX operations lead to a hung state in a lightly loaded broker?
So the problem is only with the use of the maxPageSize attribute on a destination view. A change to this attribute while dispatch is in operation can lead to dispatch stopping, which seems to be the case from the analysis of the logs.
A JMX copy/move operation changes this attribute as part of normal operation, so this is the most likely culprit, but as the attribute is exposed in read/write mode on the DestinationView, any tool that can change those attributes could be involved. In any event, we need to fix this concurrency bug around the maxPageSize attribute, but it would be great to know whether it is probable that, via JMX, either that attribute was set or a copy or move operation was executed on that broker.
The change in maxPageSize was done via JMX after the hung condition manifested itself (Rob suggested it during troubleshooting). The only other thing we do with JMX is monitor, so I don't think the SiteScope tools would be doing a copy or move?
That explains why the broker got into the non-recoverable state it was in, but it does not help us understand the original root cause. What information do we have from before the JMX maxPageSize operation was invoked?
Just what was in the submitted logs. This really is an issue here, as everything is monitored with JMX, and if there is a chance that the monitoring is somehow stopping messaging then we need to figure that out.
We can address the issue with concurrent access to maxPageSize via JMX and dispatch in the 5.4 release but from your comments it looks like this is not the root cause. I have opened http://fusesource.com/issues/browse/MB-706
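The concurrent-access hazard around maxPageSize discussed above can be illustrated with a minimal sketch. This is not ActiveMQ code: the class `PagedQueueSketch` and all of its methods are hypothetical, and the "risky" copy path is only an assumption about what a temporary change to a shared limit might look like. The sketch shows only that a dispatch running between the two writes could observe the temporary value; it does not reproduce the hang itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a broker destination; not ActiveMQ code. */
final class PagedQueueSketch {
    private final List<String> store = new ArrayList<>();
    // Shared limit, read by dispatch and writable by management tools.
    private volatile int maxPageSize = 2;

    void enqueue(String msg) {
        synchronized (store) { store.add(msg); }
    }

    int getMaxPageSize() { return maxPageSize; }

    // Dispatch path: reads the shared limit on every call.
    List<String> pageInForDispatch() { return pageIn(maxPageSize); }

    // Risky pattern (assumed): overwrite the shared limit, then restore it.
    // A dispatch that runs between the two writes sees the temporary value.
    List<String> copyAllRisky() {
        int saved = maxPageSize;
        maxPageSize = Integer.MAX_VALUE;
        try {
            return pageIn(maxPageSize);
        } finally {
            maxPageSize = saved;
        }
    }

    // Safer pattern: pass the limit as an argument; the shared field is never touched.
    List<String> copyAllSafe() { return pageIn(Integer.MAX_VALUE); }

    private List<String> pageIn(int limit) {
        synchronized (store) {
            int n = Math.min(limit, store.size());
            return new ArrayList<>(store.subList(0, n));
        }
    }

    public static void main(String[] args) {
        PagedQueueSketch q = new PagedQueueSketch();
        q.enqueue("a"); q.enqueue("b"); q.enqueue("c");
        System.out.println("copied:     " + q.copyAllSafe());
        System.out.println("dispatched: " + q.pageInForDispatch());
    }
}
```

The design point is that an administrative copy needs its own limit for one call only, so passing it as a parameter avoids any window in which dispatch reads a value that was never meant for it.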
To do more analysis we need some more thread dumps and logs from a broker that gets into this state, or a test case that can reproduce the hung scenario.
Jeff, yes, that looks like the sort of information we need. Essentially, a collection of periodic thread dumps from the broker when the hang occurs.
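One way to gather periodic thread dumps is sketched below using the standard `java.lang.management` API. It dumps the threads of the JVM it runs in, so it assumes the collector can be run inside the broker's JVM; the class name `PeriodicThreadDumps`, the dump count, and the interval are illustrative choices, not values from this issue. An external tool such as `jstack <pid>` run at intervals would serve the same purpose.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; collects thread dumps of the current JVM at a fixed interval. */
final class PeriodicThreadDumps {

    // Render one dump of all live threads: name, state, and stack frames.
    static String dumpOnce() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details when the JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    // Take `count` dumps, sleeping `intervalMillis` between them.
    static List<String> collect(int count, long intervalMillis) {
        List<String> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            dumps.add(dumpOnce());
            if (i < count - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                    break;
                }
            }
        }
        return dumps;
    }

    public static void main(String[] args) {
        List<String> dumps = collect(3, 1000);
        for (int i = 0; i < dumps.size(); i++) {
            System.out.println("=== thread dump " + (i + 1) + " ===");
            System.out.println(dumps.get(i));
        }
    }
}
```

Comparing consecutive dumps shows which threads stay parked on the same frame, which is usually the quickest way to spot where dispatch is stuck.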
Jack, can you try to pull together the information, logs and thread dumps relevant to a given event? There seems to be information scattered across a bunch of JIRA issues. We need to make sure we are analyzing the correct, latest information.