After the 5.3 release it was noticed that the demo instance tends to consume RAM quickly and never release it, which led to a slow-down of the instance and to its eventual crash. Research on local instances showed that Magnolia 5 apps get stuck in memory, keeping their ComponentProviders alive as well and making the internal Guice maps bloat.
The tools used
In order to reveal why the app objects leak, we used a combination of the following tools:
- VisualVM - standard JDK tool for monitoring Java app state and behaviour (overall memory consumption, CPU load, heap snapshots with filtering capabilities, etc.).
- JMeter - load-testing suite for simulating concurrent users and for starting a decent number of different apps.
- JMeter and the Magnolia web app have to be configured by following the instructions in https://vaadin.com/wiki/-/wiki/Main/JMeter+Testing
- XSRF protection has to be disabled.
- Plumbr - memory-leak detection tool which helps determine why leaks appear, i.e. which instances hold references to the leaking objects.
- Requires load testing to be run against the inspected instance in order to calibrate the search patterns (done via JMeter).
- We bought one Plumbr license for internal use.
After several simple simulations (logging in, starting and closing various apps, etc.), several potential problems were detected. It turned out that the culprits for the app leaks were hiding within the shell apps.
AppLauncher. We use the so-called EventbusProtector mechanism to prevent event handlers from leaking: when an app instance is closed, the event buses created in its scope are reset and all their handlers are removed. The same holds for the sub-app event buses and, generally, for the AdminCentral-scoped (~ Vaadin UI-scoped) ones. However, there is a so-called system event bus, which is a web-app singleton and is obviously not covered by any (semi-)automatic protection mechanism. We have to be really careful about all the event handlers we register against this event bus! For instance, in AppLauncherShellApp we had the following:
However, we never de-registered this event handler, causing e.g. an app controller to leak, which in turn pulls in quite a few other app-related objects. The issue has existed since version 5.0, from the very point when the app launcher layout update was introduced.
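The failure mode can be illustrated with a minimal, self-contained sketch (the class and method names below are simplified stand-ins, not the actual Magnolia API): a handler registered on the singleton system bus keeps the whole app object graph strongly reachable until it is explicitly removed.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the web-app-singleton system event bus.
class SystemEventBus {
    private final List<Object> handlers = new ArrayList<>();
    void addHandler(Object h) { handlers.add(h); }
    void removeHandler(Object h) { handlers.remove(h); }
    int handlerCount() { return handlers.size(); }
}

// Stand-in for a shell app whose handler transitively references
// heavy app-scoped objects (app controller, ComponentProvider, ...).
class ShellAppSketch {
    private final Object layoutChangedHandler = new Object();

    void start(SystemEventBus systemEventBus) {
        // The singleton bus now holds a strong reference to this handler,
        // and through it to the whole app graph.
        systemEventBus.addHandler(layoutChangedHandler);
    }

    // The missing piece before the fix: without this call, the handler
    // (and everything reachable from it) outlives the app's lifecycle.
    void stop(SystemEventBus systemEventBus) {
        systemEventBus.removeHandler(layoutChangedHandler);
    }
}

public class EventBusLeakSketch {
    public static void main(String[] args) {
        SystemEventBus bus = new SystemEventBus();
        ShellAppSketch app = new ShellAppSketch();
        app.start(bus);
        System.out.println(bus.handlerCount()); // 1 - handler retained by the singleton
        app.stop(bus);
        System.out.println(bus.handlerCount()); // 0 - app graph is now collectible
    }
}
```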
Pulse. In order to use a factory method pattern for PulseListFooter construction, many of its internal fields were made static, which is generally a bad practice for web apps, and for Vaadin apps in particular. The biggest problem was hiding in the PulseListFooter#menuItems field, because various menu items hold references to a wide range of AdminCentral-scoped components (AdmincentralEventBus, Shell, SimpleTranslator, ComponentProvider), which in turn can pull in a huge number of smaller objects:
The issue has existed since version 5.3 (introduced in the following commit (magnolia_ui): https://git.magnolia-cms.com/gitweb/?p=magnolia_ui.git;a=commit;h=6eb3a0b6089b47b22e34353a0e4ddf4c1b06328b).
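Why static fields are so dangerous here can be shown with a short self-contained sketch (simplified names, not the real PulseListFooter code): a static collection has class-level lifetime, so everything it references is pinned for the whole web app, across all users and sessions, while an instance field dies together with its owner.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a heavy AdminCentral-scoped component (event bus, Shell, ...).
class UiScopedComponent { }

// Before the fix: the static field keeps every menu item (and its
// UI-scoped references) alive for the lifetime of the web app.
class LeakyFooter {
    static final List<UiScopedComponent> menuItems = new ArrayList<>();

    static LeakyFooter createFooter(UiScopedComponent component) {
        menuItems.add(component); // pinned forever: class-level lifetime
        return new LeakyFooter();
    }
}

// After the fix: the factory method still works, but the field is an
// instance field, so the components become collectible with the footer.
class FixedFooter {
    private final List<UiScopedComponent> menuItems = new ArrayList<>();

    static FixedFooter createFooter(UiScopedComponent component) {
        FixedFooter footer = new FixedFooter();
        footer.menuItems.add(component);
        return footer;
    }

    int itemCount() { return menuItems.size(); }
}

public class PulseStaticLeakSketch {
    public static void main(String[] args) {
        LeakyFooter.createFooter(new UiScopedComponent());
        LeakyFooter.createFooter(new UiScopedComponent());
        System.out.println(LeakyFooter.menuItems.size()); // 2 - grows across all UIs

        FixedFooter footer = FixedFooter.createFooter(new UiScopedComponent());
        System.out.println(footer.itemCount()); // 1 - scoped to this instance
    }
}
```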
- Heartbeat interval. There is no 100% reliable way to detect whether a browser tab was closed or is still alive. In order to clean up dead UI instances (~ AdminCentral instances), Vaadin uses a heartbeat mechanism: the client side periodically sends small requests, and if the server side detects three consecutive missed heartbeats from a particular UI instance, that instance is considered dead and eventually cleaned up. Since this means a dead AdminCentral instance would stay in memory for at least 15 minutes with the default interval, it was decided to reduce the heartbeat interval to 90 seconds.
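One way to configure this is via the Vaadin servlet's heartbeatInterval init parameter (value in seconds) in web.xml; the servlet name below is a placeholder, and the exact location of the servlet definition depends on the deployment:

```xml
<servlet>
    <servlet-name>VaadinServlet</servlet-name>
    <init-param>
        <param-name>heartbeatInterval</param-name>
        <param-value>90</param-value>
    </init-param>
</servlet>
```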
After the aforementioned problems were solved, Plumbr was no longer able to detect any memory leaks (at least with the simple test scenarios we were running), and the Garbage Collector became able to clean up all the app-related resources, which made the web-app instance much more stable.
The first attached picture shows that after a load test the memory consumption returns to its initial state, whereas the second one shows that not all resources are cleaned up and a subsequent load test raises memory consumption to much higher values.
Role of Plumbr
Plumbr normally shows a chart similar to (and simpler than) VisualVM's memory consumption view. The magic of Plumbr is that when it monitors the web app under a decent load (provided by load tests) and for a relatively long time, it manages to detect a memory leak and even suggests the culprit holding the references to the leaking objects. In the current case it helped a lot, because even from the VisualVM output one can see a bloating number of certain collections (iirc ArrayListMultimaps) and guess that it is related to Guice and the ComponentProvider, but due to the high number of such objects and the different use cases, it was really hard to find out which object was guilty. Plumbr strips out the legitimate cases and points to the real problems.
Other observations. Thread locks.
It was observed that when the tested instance performs at its limits, a deadlock between threads might occur.
As we can see from the diagram, four threads (T1-T4) are locked over three resources: two Vaadin sessions (S1 and S2) and the
MessagesManager's listeners (MM#listeners). Solid lines indicate that a thread owns a resource; dashed lines stand for acquisition attempts.
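The cycle in the diagram is the classic lock-ordering deadlock: one thread owns a session lock and waits for the MessagesManager's listener lock while another thread holds them in the opposite order. A minimal sketch (S1 and MM below are simplified stand-ins for the session and listener locks, not the actual Vaadin/Magnolia code) of the standard remedy, acquiring locks in one global order on every code path:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingSketch {
    // Stand-ins for a Vaadin session lock and the MessagesManager listener lock.
    static final ReentrantLock S1 = new ReentrantLock();
    static final ReentrantLock MM = new ReentrantLock();

    // Deadlock-prone pattern (from the diagram): thread A takes S1 then MM,
    // thread B takes MM then S1; with unlucky timing each owns one lock and
    // waits forever for the other.

    // Remedy: every code path acquires the locks in the same fixed order,
    // so no cycle of "owns one, waits for the other" can form.
    static void safePath(Runnable work) {
        S1.lock();
        try {
            MM.lock();
            try {
                work.run();
            } finally {
                MM.unlock();
            }
        } finally {
            S1.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> safePath(() -> { }));
        Thread t2 = new Thread(() -> safePath(() -> { }));
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("no deadlock: " + (!S1.isLocked() && !MM.isLocked()));
    }
}
```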