
The problem

After the 5.3 release it was noticed that the demo instance tends to consume RAM quickly and never release it, which slowed the instance down and eventually crashed it. Research on local instances showed that Magnolia 5 apps get stuck in memory, keeping their ComponentProviders alive as well and causing the internal Guice maps to bloat.

The tools used

In order to reveal why the app objects were leaking, we used a combination of the following tools:

  • VisualVM - standard JDK tool for monitoring the state and behaviour of a Java app (overall memory consumption, CPU load, heap snapshots with filtering capabilities, etc.).
  • JMeter - load testing suite used to simulate concurrent users and to start a decent number of different apps.
  • Plumbr - memory leak detection tool which effectively helps to determine why leaks appear, i.e. which instances hold references to the leaking objects.
    • https://plumbr.eu/
    • Requires load testing against the inspected instance in order to calibrate its search patterns (done via JMeter).
    • We bought one Plumbr license for internal use.

Findings

After several simple simulations (logging in, starting and closing various apps, etc.), several potential problems were detected. It turned out that the culprits for the app leaks were hiding within the shell apps.

  1. AppLauncher. We use the so-called EventbusProtector mechanism to prevent event handlers from leaking: when an app instance is closed, the event buses created in its scope are reset and all their handlers are removed. The same holds for the sub-app event buses and, in general, for the AdminCentral-scoped (~ Vaadin UI-scoped) ones. However, there is also the so-called system event bus, which is a web-app singleton and therefore not covered by any (semi-)automatic protection mechanism. We have to be really careful about every event handler we register against this event bus! For instance, in AppLauncherShellApp we had the following:

    Leaking system eventbus handler
         systemEventBus.addHandler(AppLauncherLayoutChangedEvent.class, new AppLauncherLayoutChangedEventHandler() {
             @Override
             public void onAppLayoutChanged(AppLauncherLayoutChangedEvent event) {
                 AppLauncherLayout layout = appLauncherLayoutManager.getLayoutForCurrentUser();
                 // Implicit references to the enclosing AppLauncherShellApp instance
                 view.clearView();
                 initView(layout);
                 for (AppLauncherGroup group : layout.getGroups()) {
                     for (AppLauncherGroupEntry entry : group.getApps()) {
                         // Reference to the app controller, which is even worse
                         if (appController.isAppStarted(entry.getName())) {
                             view.activateButton(true, entry.getName());
                         }
                     }
                 }
             }
         });

    However, we never de-registered this event handler, causing, e.g., the app controller to leak, which in turn pulls in quite a few other app-related objects. The issue has existed since version 5.0, from the very point when the app launcher layout update was introduced.
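    A minimal sketch of the fix, assuming EventBus#addHandler returns a HandlerRegistration that can later be removed; the field name, the refreshView() helper and the tear-down hook below are illustrative, not the actual code:

    De-registering the system eventbus handler (sketch)
         // Keep the registration so the handler can be removed again later.
         private HandlerRegistration layoutChangedRegistration;

         private void registerLayoutChangedHandler() {
             layoutChangedRegistration = systemEventBus.addHandler(AppLauncherLayoutChangedEvent.class,
                     new AppLauncherLayoutChangedEventHandler() {
                         @Override
                         public void onAppLayoutChanged(AppLauncherLayoutChangedEvent event) {
                             // Hypothetical helper re-running the original view update logic.
                             refreshView(appLauncherLayoutManager.getLayoutForCurrentUser());
                         }
                     });
         }

         // To be called when the shell app / AdminCentral UI is torn down.
         private void unregisterLayoutChangedHandler() {
             if (layoutChangedRegistration != null) {
                 layoutChangedRegistration.removeHandler();
                 layoutChangedRegistration = null;
             }
         }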

  2. Pulse. In order to be able to use a factory-method pattern for PulseListFooter construction, a lot of its internal fields were made static, which is generally a bad practice for web apps and Vaadin apps in particular. The biggest problem was hiding in the PulseListFooter#menuItems field, because various menu items hold references to a wide range of AdminCentral-scoped components (AdmincentralEventBus, Shell, SimpleTranslator, ComponentProvider), which in turn can pull in a huge amount of smaller objects:

    Static members of PulseListFooter
    public final class PulseListFooter extends CustomComponent {  
      private static SimpleTranslator i18n;
      private static PulseListView.Listener messagesListener;
      private static TasksListView.Listener tasksListener;
      // While the fields above at least get reset, this one only grows.
      private static List<ContextMenuItem> menuItems = new ArrayList<ContextMenuItem>();
      ...
    }

    The issue has existed since version 5.3 (introduced in the following commit (magnolia_ui): https://git.magnolia-cms.com/gitweb/?p=magnolia_ui.git;a=commit;h=6eb3a0b6089b47b22e34353a0e4ddf4c1b06328b).
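    A minimal sketch of the direction of the fix, turning the static fields into plain instance state so they are released together with the footer and its AdminCentral UI; the constructor and addMenuItem() method below are illustrative:

    Instance-scoped PulseListFooter (sketch)
         public final class PulseListFooter extends CustomComponent {

             // Former static fields become instance fields and die with the footer.
             private final SimpleTranslator i18n;
             private final List<ContextMenuItem> menuItems = new ArrayList<ContextMenuItem>();

             public PulseListFooter(SimpleTranslator i18n) {
                 this.i18n = i18n;
             }

             public void addMenuItem(ContextMenuItem item) {
                 menuItems.add(item);
             }
         }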

  3. Heartbeat interval. There is no 100% reliable way to detect whether a browser tab has been closed or is still alive. In order to clean up dead UI instances (~ AdminCentral instances), Vaadin uses a heartbeat mechanism: the client side periodically sends a small request, and if the server side detects that three consecutive heartbeats from a particular UI instance have been missed, that instance is considered dead and eventually cleaned up. With the default 5-minute interval this means a dead AdminCentral instance would stay in memory for at least 15 minutes, so it was decided to reduce the heartbeat interval to 90 seconds.
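    For reference, a hedged sketch of how a 90-second heartbeat interval can be configured on a plain Vaadin 7 servlet; Magnolia may wire its AdminCentral servlet differently (e.g. via a web.xml init parameter), and the servlet and UI class names below are illustrative:

    Heartbeat interval of 90 seconds (sketch)
         @WebServlet(urlPatterns = "/*")
         @VaadinServletConfiguration(
                 productionMode = true,
                 ui = AdmincentralUI.class,  // illustrative UI class
                 heartbeatInterval = 90)     // seconds; Vaadin's default is 300 (5 minutes)
         public class DemoAdmincentralServlet extends VaadinServlet {
         }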

    After the aforementioned problems were solved, Plumbr could no longer detect any memory leaks (at least with the simple test scenarios we were running), and the Garbage Collector became able to clean up all the app-related resources, which made the web-app instance much more stable.

    The first attached picture shows that after a load test the memory consumption returns to its initial level, whereas the second one shows that not all resources were cleaned up and a subsequent load test raised memory consumption to much higher values.

Role of Plumbr

Plumbr normally shows a chart similar to (if simpler than) the VisualVM memory consumption view. The magic of Plumbr is that when it monitors the web app under a decent load (provided by the load tests) and for a relatively long time, it manages to detect a memory leak and even suggests the culprit holding the references to the leaking objects. In the current case it helped a lot: even from the VisualVM output one can see that a certain kind of collection keeps bloating (IIRC ArrayListMultimaps) and can guess that it is related to Guice and the ComponentProvider, but due to the high number of such objects and the different use cases, it was really hard to find out which object was guilty. Plumbr strips away the legitimate cases and points to the real problems.



Other observations: thread deadlocks

It was observed that when the tested instance performs at its limits, a thread deadlock might occur.



As we can see from the diagram, 4 threads (T1-T4) are deadlocked over 3 resources: 2 Vaadin sessions (S1 and S2) and the MessagesManager's listeners (MM#listeners). Solid lines indicate that a thread owns a resource; a dashed line stands for an acquisition attempt.
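A minimal, self-contained illustration of the lock-ordering pattern behind such a deadlock, reduced to two threads and two monitors; the lock objects merely stand in for a Vaadin session and MessagesManager#listeners and are not the actual Magnolia code:

    // Simplified deadlock sketch: two threads acquire the same two locks in opposite order.
    final Object sessionLock = new Object();         // stands for a VaadinSession (S1)
    final Object messagesManagerLock = new Object(); // stands for MessagesManager#listeners

    Thread t1 = new Thread(() -> {
        synchronized (sessionLock) {                 // T1 owns the session lock...
            synchronized (messagesManagerLock) {     // ...and waits for MM#listeners
                // e.g. deliver a message to the UI
            }
        }
    });

    Thread t2 = new Thread(() -> {
        synchronized (messagesManagerLock) {         // T2 owns MM#listeners...
            synchronized (sessionLock) {             // ...and waits for the session -> deadlock
                // e.g. notify listeners while accessing the session
            }
        }
    });

    t1.start();
    t2.start();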





5 Comments

  1. Good stuff ! 

    Two questions:

    • How did you end up using Plumbr ? The screenshots here are VisualVM, my understanding is that Plumbr is an agent; what does it do ?
    • What was the original heartbeat rate, 5 minutes I assume ? And are there potentially negative consequences to reducing it ? (increasing traffic ?)
    1. 1) Answered within additional page paragraph.

      2) I don't think there could be a significant traffic overhead, because the heartbeat request does not even involve a server response - it just records a timestamp of the latest client-side presence.

  2. oh and one more thing, could you specify where this leak was introduced and fixed ? (versions and/or JIRA tickets ?)

  3. Hello everyone. I know that this post is old, but we are using a 5.3.3 version of the Magnolia Community Edition and we are having some memory issues. I would like to know the answer to the previous comment here:

    """Grégory Joseph

    oh and one more thing, could you specify where this leak was introduced and fixed ? (versions and/or JIRA tickets ?)"""

    Thanks

  4. Hi Mauro J Giamberardino! If memory (and git log) serves, it was this one: MGNLUI-3122. If you have any suspicions of what could cause your troubles, feel free to share those here; maybe there have been some improvements since then.