The problem
After the 5.3 release we noticed that the demo instance tended to consume RAM quickly and never release it. This slowed the instance down and eventually crashed it. Research on local instances showed that the Magnolia 5 apps get stuck in memory, which keeps their ComponentProviders there as well and bloats the internal Guice maps.
The tools used
To reveal why the app objects were leaking, we used a combination of the following tools:
- VisualVM - the standard JDK tool for monitoring the state and behaviour of a Java app (overall memory consumption, CPU load, heap snapshots with filtering capabilities, etc.).
- JMeter - a load-testing suite, used to simulate concurrent users and to start a decent number of different apps.
  - http://jmeter.apache.org/
  - JMeter and the Magnolia web app have to be configured by following the instructions at https://vaadin.com/wiki/-/wiki/Main/JMeter+Testing
  - XSRF protection has to be disabled.
- Plumbr - a memory leak detection tool which helps to determine why leaks appear (i.e. which instances hold references to the leaking objects).
  - https://plumbr.eu/
  - Requires load testing against the inspected instance in order to calibrate the search patterns (done via JMeter).
  - We have bought one Plumbr license for internal use.
Findings
After several simple simulations (logging in, starting and closing various apps, etc.) several potential problems were detected. It turned out that the culprits for the app leaks were hiding within the shell apps.
AppLauncher. We use the so-called EventbusProtector mechanism to prevent event handlers from leaking: when an app instance is closed, the event buses created in its scope are reset and all their handlers are removed. The same holds for the sub-app event buses and, in general, for the AdminCentral-scoped (~ Vaadin UI-scoped) ones. However, there is a so-called system event bus, which is a web-app singleton and obviously not covered by any (semi-)automatic protection mechanism. We have to be really careful about all the event handlers we register against this event bus! For instance, in AppLauncherShellApp we had the following:

Leaking system eventbus handler
```java
systemEventBus.addHandler(AppLauncherLayoutChangedEvent.class, new AppLauncherLayoutChangedEventHandler() {

    @Override
    public void onAppLayoutChanged(AppLauncherLayoutChangedEvent event) {
        AppLauncherLayout layout = AppLauncherShellApp.this.appLauncherLayoutManager.getLayoutForCurrentUser();
        // References to this particular object (the enclosing shell app)
        AppLauncherShellApp.this.view.clearView();
        initView(layout);
        for (AppLauncherGroup group : layout.getGroups()) {
            for (AppLauncherGroupEntry entry : group.getApps()) {
                // Reference to app controller which is even worse
                if (AppLauncherShellApp.this.appController.isAppStarted(entry.getName())) {
                    view.activateButton(true, entry.getName());
                }
            }
        }
    }
});
```
However, we never deregistered this event handler, which caused e.g. the app controller to leak, and the app controller in turn pulls in quite a few other app-related objects. The issue has existed since version 5.0, from the very point when the app launcher layout update was introduced.
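A minimal sketch of the fix idea for the handler above: keep the registration that is returned when the handler is added, and remove it when the shell app goes away. SimpleEventBus, Handler, HandlerRegistration and ShellApp are hypothetical stand-ins written for this illustration, not the Magnolia classes, and the sketch assumes the real bus hands back a comparable registration object.

```java
import java.util.ArrayList;
import java.util.List;

class HandlerLeakDemo {
    public static void main(String[] args) {
        // Stands in for the web-app singleton system event bus.
        SimpleEventBus systemEventBus = new SimpleEventBus();

        ShellApp leaky = new ShellApp(systemEventBus);
        leaky.start();
        // "Closed" without stop(): the bus still references the handler
        // and, through it, the whole shell app instance.
        System.out.println("handlers after leaky close: " + systemEventBus.handlerCount());

        ShellApp clean = new ShellApp(systemEventBus);
        clean.start();
        clean.stop(); // removes only this app's handler
        System.out.println("handlers after clean close: " + systemEventBus.handlerCount());
    }
}

// Hypothetical handler type (illustration only).
interface Handler {
    void onEvent(String event);
}

// Hypothetical registration handle returned by addHandler.
interface HandlerRegistration {
    void removeHandler();
}

class SimpleEventBus {
    private final List<Handler> handlers = new ArrayList<>();

    HandlerRegistration addHandler(Handler handler) {
        handlers.add(handler);
        return () -> handlers.remove(handler);
    }

    int handlerCount() {
        return handlers.size();
    }

    void fire(String event) {
        // Iterate over a copy so handlers may deregister while being notified.
        for (Handler handler : new ArrayList<>(handlers)) {
            handler.onEvent(event);
        }
    }
}

class ShellApp {
    private final SimpleEventBus systemEventBus;
    private HandlerRegistration registration;
    int refreshCount;

    ShellApp(SimpleEventBus systemEventBus) {
        this.systemEventBus = systemEventBus;
    }

    void start() {
        // The lambda touches an instance field, so it captures this shell app.
        registration = systemEventBus.addHandler(event -> refreshCount++);
    }

    void stop() {
        if (registration != null) {
            registration.removeHandler();
            registration = null;
        }
    }
}
```

The point of the design is that whoever registers against a longer-lived bus owns the clean-up, since no scope reset will ever do it for them.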
Pulse. To be able to use the factory method pattern for PulseListFooter construction, many of its internal fields were made static, which is generally bad practice for web apps and for Vaadin apps in particular. The biggest problem was hiding in the PulseListFooter#menuItems field, because the various menu items hold references to a wide range of AdminCentral-scoped components (AdmincentralEventBus, Shell, SimpleTranslator, ComponentProvider), which in turn can pull in a huge number of smaller objects:

Static members of PulseListFooter
```java
public final class PulseListFooter extends CustomComponent {

    private static SimpleTranslator i18n;
    private static PulseListView.Listener messagesListener;
    private static TasksListView.Listener tasksListener;

    // While the fields above at least get reset, this one only grows.
    private static List<ContextMenuItem> menuItems = new ArrayList<ContextMenuItem>();

    ...
}
```
The issue has existed since version 5.3; it was introduced in magnolia_ui, see https://git.magnolia-cms.com/gitweb/?p=magnolia_ui.git;a=commit;h=6eb3a0b6089b47b22e34353a0e4ddf4c1b06328b.
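To illustrate why the static list only grows, here is a small sketch that contrasts a static collection with an instance-scoped one. LeakyFooter, ScopedFooter and MenuItem are hypothetical names made up for this example; they only mimic the shape of the problem and are not the actual PulseListFooter code or its fix.

```java
import java.util.ArrayList;
import java.util.List;

class StaticFieldLeakDemo {
    public static void main(String[] args) {
        ScopedFooter lastScoped = null;
        for (int i = 0; i < 3; i++) {
            // Each iteration simulates one AdminCentral (UI) instance.
            new LeakyFooter(new Object());
            lastScoped = new ScopedFooter(new Object());
        }
        System.out.println("items held by static list: " + LeakyFooter.retainedItems());
        System.out.println("items held by last scoped footer: " + lastScoped.retainedItems());
    }
}

// Hypothetical menu item that holds on to a UI-scoped component.
class MenuItem {
    final Object uiScopedComponent;

    MenuItem(Object uiScopedComponent) {
        this.uiScopedComponent = uiScopedComponent;
    }
}

class LeakyFooter {
    // Shared by every footer in the JVM: entries outlive the UI that created them.
    private static final List<MenuItem> menuItems = new ArrayList<>();

    LeakyFooter(Object uiScopedComponent) {
        menuItems.add(new MenuItem(uiScopedComponent));
    }

    static int retainedItems() {
        return menuItems.size();
    }
}

class ScopedFooter {
    // Owned by this footer only: becomes garbage together with the footer.
    private final List<MenuItem> menuItems = new ArrayList<>();

    ScopedFooter(Object uiScopedComponent) {
        menuItems.add(new MenuItem(uiScopedComponent));
    }

    int retainedItems() {
        return menuItems.size();
    }
}
```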
Heartbeat interval. There is no 100% reliable way to detect whether a browser tab has been closed or is still alive. To clean up dead UI instances (~ AdminCentral instances), Vaadin uses a heartbeat mechanism: the client side periodically sends small requests, and the server side checks whether three consecutive heartbeats from a particular UI instance have been missed; if so, that instance is considered dead and is eventually cleaned up. Since this meant that a dead AdminCentral instance would stay in memory for at least 15 minutes, it was decided to reduce the heartbeat interval to 90 seconds.
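The arithmetic behind the heartbeat change can be sketched as follows. UiLiveness is a hypothetical helper written for this page, not Vaadin's implementation, and the 300-second value is an assumption derived from the 15 minutes mentioned above (15 minutes divided by three missed heartbeats).

```java
class HeartbeatDemo {
    public static void main(String[] args) {
        // Assumed previous interval: 300 s, i.e. 15 minutes until a UI counts as dead.
        System.out.println(UiLiveness.secondsUntilDead(300)); // prints 900
        // New interval: 90 s, i.e. 4.5 minutes until a UI counts as dead.
        System.out.println(UiLiveness.secondsUntilDead(90));  // prints 270
    }
}

// Hypothetical helper (illustration only).
class UiLiveness {
    // A UI is considered dead once this many consecutive heartbeats are missed.
    static final int MISSED_HEARTBEATS_ALLOWED = 3;

    static long secondsUntilDead(long heartbeatIntervalSeconds) {
        return MISSED_HEARTBEATS_ALLOWED * heartbeatIntervalSeconds;
    }

    static boolean isDead(long lastHeartbeatMillis, long nowMillis, long heartbeatIntervalSeconds) {
        long silenceMillis = nowMillis - lastHeartbeatMillis;
        return silenceMillis > secondsUntilDead(heartbeatIntervalSeconds) * 1000L;
    }
}
```

With the shorter interval the server can drop an abandoned AdminCentral instance after roughly four and a half minutes instead of fifteen.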
After the aforementioned problems were solved, Plumbr was no longer able to detect memory leaks (at least with the simple test scenarios we were running), and the garbage collector became able to clean up all the app-related resources, which made the web-app instance much more stable.
The first attached picture shows that after a load test the memory consumption returns to its initial level, whereas the second one shows that not all the resources are cleaned up and that a subsequent load test raises memory consumption to much higher values.
Role of Plumbr
Plumbr normally shows a chart similar to (if simpler than) the VisualVM memory consumption view. The magic of Plumbr is that when it monitors the web app under a decent load (provided by load tests) for a relatively long time, it manages to detect a memory leak and even suggests the culprit that is holding the references to the leaking objects. In the current case it helped a lot: even from the VisualVM output one can see that certain collections (iirc ArrayListMultimaps) keep bloating, and one can guess that this is related to Guice and the ComponentProvider, but because of the large number of such objects and their different use cases it was really hard to find out which object was guilty. Plumbr strips out the legitimate cases and points to the real problems.
Other observations. Thread locks.
It was observed that when the tested instance performs at its limits, a thread lock might occur.
As we can see from the diagram, 4 threads (T1-T4) are locked over 3 resources: 2 Vaadin sessions (S1 and S2) and the MessagesManager's listeners (MM#listeners). Solid lines indicate that a thread owns a resource; a dashed line stands for an acquisition attempt.
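The own-one-resource, wait-for-another pattern from the diagram can be reproduced in miniature and detected with the JDK's ThreadMXBean. This is only a sketch with two threads and two plain lock objects; DeadlockProbe and the lock names are hypothetical and do not model the actual Vaadin session or MessagesManager locking.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockDetectionDemo {
    public static void main(String[] args) {
        long[] ids = DeadlockProbe.provokeAndDetect();
        System.out.println("deadlocked threads found: " + (ids == null ? 0 : ids.length));
    }
}

// Hypothetical helper (illustration only).
class DeadlockProbe {

    static long[] provokeAndDetect() {
        final Object s1 = new Object(); // stands in for one locked resource
        final Object s2 = new Object(); // stands in for another one
        final CountDownLatch bothOwnFirstLock = new CountDownLatch(2);

        // Each thread owns one lock and then attempts to acquire the other.
        startDaemon(() -> lockInOrder(s1, s2, bothOwnFirstLock));
        startDaemon(() -> lockInOrder(s2, s1, bothOwnFirstLock));

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        try {
            for (int attempt = 0; attempt < 100; attempt++) {
                long[] ids = threads.findDeadlockedThreads(); // null when none found
                if (ids != null) {
                    return ids;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return null;
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread owns its first lock too
            } catch (InterruptedException e) {
                return;
            }
            synchronized (second) {
                // Never reached: the other thread owns this lock and waits for ours.
            }
        }
    }

    private static void startDaemon(Runnable task) {
        Thread thread = new Thread(task);
        thread.setDaemon(true); // lets the JVM exit despite the stuck threads
        thread.start();
    }
}
```

Acquiring shared locks in one agreed order everywhere is the usual way to avoid this situation.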
5 Comments
Magnolia International
Good stuff !
Two questions:
Aleksandr Pchelintcev
1) Answered within additional page paragraph.
2) I don't think there could be a significant traffic overhead, because the heartbeat request does not even involve a server response - it just records a timestamp of the latest client-side presence.
Magnolia International
oh and one more thing, could you specify where this leak was introduced and fixed ? (versions and/or JIRA tickets ?)
Mauro J Giamberardino
Hello everyone. I know that this post is old, but we are using a 5.3.3 version of the Magnolia Community Edition and we are having some memory issues. I would like to know the answer to the previous comment here:
"""Grégory Joseph
oh and one more thing, could you specify where this leak was introduced and fixed ? (versions and/or JIRA tickets ?)"""
Thanks
Aleksandr Pchelintcev
Hi Mauro J Giamberardino! If memory (and git log) serves, it was this one: MGNLUI-3122. If you have any suspicions of what could cause your troubles, feel free to share those here; maybe there have been some improvements since then.