
Huh, what's this about?

This page gives a quick example of how to set up magnolia with Jackrabbit JCR writing to the database only (as far as possible).

The default setup for magnolia uses the built-in DerbyDB, as well as the filesystem for storing the JCR Content. The sample configurations provided for mysql also only put part of the repository in the Database. The remaining parts land on the file system, by default within the webapp folder of magnolia.

There are a number of disadvantages to this "mixed" setup:

  1. Mixing file-system and DB provides for 2 points of failure.
  2. Mixing file-system and DB makes consistent backups more difficult. Basically, to guarantee a consistent backup, magnolia has to be shut down.
  3. Filesystem-backed storage means the repository is not clusterable in Jackrabbit.

Switching to a database-only setup gets rid of these disadvantages.

These instructions have been tested with Magnolia 4.3 and 4.4.

Structure of JCR Repositories

Basically, a JCR repository can have one or more workspaces. Each workspace is what is called a "repository" in magnolia: website, dms, data, imaging, config, etc...

Each workspace requires a number of different "Storage-Backends" in JCR in order to store all the different data-elements. These are:

  • A "FileSystem" for content
  • A "PersistenceManager" for content
  • A "DataStore" for large content (blobs)
  • A "FileSystem" for versions
  • A "PersistenceManager" for versions

Also required is:

  • A general "FileSystem" for the repository (all workspaces)

Confusingly, even though the element is called "FileSystem", both "FileSystem" and "PersistenceManager" can be configured to use a number of different backends, either file-system based, database based, or others.

Our objective is to configure everything to use database-backed storage.

So will everything be in the DB?

Unfortunately, the answer is "no". Even after configuring all storage to use database-backends, Jackrabbit will still write the following into the repositories folder:

  • a config file per workspace. If this file is missing, the workspace will be reinitialized, so make sure you don't delete these!!
  • the search-indexes per workspace. These can be deleted any time, and will be recreated as needed.

For more information, see the Jackrabbit Documentation and Wiki.

How to set it up?

Note: Your old repository will be gone, so make a backup of your content first!!

  1. Create a database and appropriate users on the database server of your choice. You will need a separate database for each magnolia instance, eg. one for author and one for public.
  2. Install appropriate JDBC drivers for your database in either WEB-INF/lib or TOMCAT_HOME/lib
  3. Create JNDI Datasource definitions in the web.xml, context.xml or server.xml files, see JNDI Datasources in the Apache Tomcat documentation for more details.
  4. Configure Jackrabbit like in the example file below. Your Jackrabbit Config file goes in the folder TOMCAT_HOME/webapps/magnoliaAuthor/WEB-INF/config/repo-conf
  5. Configure magnolia to use the new jackrabbit configuration. Edit TOMCAT_HOME/webapps/magnoliaAuthor/WEB-INF/config/default/magnolia.properties. Set the property "magnolia.repositories.jackrabbit.config",
    eg: magnolia.repositories.jackrabbit.config=WEB-INF/config/repo-conf/jackrabbit-mysql.xml
  6. Configure the repository home dir (where Jackrabbit will still write a few config files and the indices) to lie outside Tomcat's webapps folder.
    eg: magnolia.repositories.home=c:/dev/magnolia/repo-author/repositories
  7. Repeat steps 4-6 for the public instance.

See the following example JackRabbit config file:
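
A minimal sketch of such a file for MySQL, written under several assumptions rather than copied from a tested installation: the datasource is assumed to be bound at java:comp/env/jdbc/magnoliaAuthorDS (the jdbc/ prefix is a guess), the table prefixes and minRecordLength are arbitrary, and the Security element is only a placeholder that should be replaced by the one from your existing configuration. The class names are those of the org.apache.jackrabbit.core.persistence.bundle persistence managers; check them against the Jackrabbit version bundled with your magnolia release.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: JNDI name, table prefixes and Security section are assumptions. -->
<Repository>
  <!-- General "FileSystem" for the repository (all workspaces) -->
  <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
    <!-- Using InitialContext as "driver" makes Jackrabbit treat "url" as a JNDI name -->
    <param name="driver" value="javax.naming.InitialContext"/>
    <param name="url" value="java:comp/env/jdbc/magnoliaAuthorDS"/>
    <param name="schema" value="mysql"/>
    <param name="schemaObjectPrefix" value="fs_repo_"/>
  </FileSystem>

  <!-- Placeholder: copy this element from your existing configuration -->
  <Security appName="Jackrabbit">
    <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager"/>
    <LoginModule class="org.apache.jackrabbit.core.security.SimpleLoginModule">
      <param name="anonymousId" value="anonymous"/>
    </LoginModule>
  </Security>

  <!-- "DataStore" for large content (blobs), kept in the database -->
  <DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
    <param name="driver" value="javax.naming.InitialContext"/>
    <param name="url" value="java:comp/env/jdbc/magnoliaAuthorDS"/>
    <param name="databaseType" value="mysql"/>
    <param name="minRecordLength" value="1024"/>
  </DataStore>

  <Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>

  <!-- Template applied to every workspace (website, dms, data, ...) -->
  <Workspace name="default">
    <!-- "FileSystem" for content -->
    <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
      <param name="driver" value="javax.naming.InitialContext"/>
      <param name="url" value="java:comp/env/jdbc/magnoliaAuthorDS"/>
      <param name="schema" value="mysql"/>
      <param name="schemaObjectPrefix" value="fs_${wsp.name}_"/>
    </FileSystem>
    <!-- "PersistenceManager" for content -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
      <param name="driver" value="javax.naming.InitialContext"/>
      <param name="url" value="java:comp/env/jdbc/magnoliaAuthorDS"/>
      <param name="schemaObjectPrefix" value="pm_${wsp.name}_"/>
      <param name="externalBLOBs" value="false"/>
    </PersistenceManager>
    <!-- The search index still lives on disk under the repository home -->
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
    </SearchIndex>
  </Workspace>

  <Versioning rootPath="${rep.home}/version">
    <!-- "FileSystem" for versions -->
    <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
      <param name="driver" value="javax.naming.InitialContext"/>
      <param name="url" value="java:comp/env/jdbc/magnoliaAuthorDS"/>
      <param name="schema" value="mysql"/>
      <param name="schemaObjectPrefix" value="fs_version_"/>
    </FileSystem>
    <!-- "PersistenceManager" for versions -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
      <param name="driver" value="javax.naming.InitialContext"/>
      <param name="url" value="java:comp/env/jdbc/magnoliaAuthorDS"/>
      <param name="schemaObjectPrefix" value="version_"/>
      <param name="externalBLOBs" value="false"/>
    </PersistenceManager>
  </Versioning>
</Repository>
```

Every storage element points at the same JNDI name, so switching the file to the public instance is a matter of replacing that one name throughout.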

Things to note:

My JNDI datasource in this example is called "magnoliaAuthorDS". This is for the author instance. For the public instance, replace all occurrences of "magnoliaAuthorDS" with the JNDI name of your public instance datasource.

Notes on MySQL

When using mysql for JackRabbit, you will probably need to configure the number of connections allowed by the database server to about 200. JackRabbit opens a lot of connections.

If you are using MySQL as the Database, and want to move the Datastore (Blob storage) into the DB as well (as in the setup above) then you will need to configure mysql to handle larger binary objects via JDBC. There are instructions for doing this on the Jackrabbit wiki as well as in the mysql documentation, but for my version of mysql (5.1) it was enough to raise max_allowed_packet in the [mysqld] section of my.cnf (for example, max_allowed_packet=32M).

The limit specified here (in my example 32M for 32 megabytes) will be a hard limit on the maximum file size you will be able to upload to the repository.

MySQL datasource definition

You can define the datasource in your servlet container in the normal way. A standard javax.sql.DataSource definition, as described in the Tomcat documentation, will work. However, JackRabbit does not need the connection pooling offered by the standard datasource, and there is no easy way to disable the connection pool.

Some configurations could even be harmful, as JackRabbit keeps connections open for a very long time without using them, so if you have configured recovery of abandoned connections (removeAbandoned=true), the connection pool may "steal back" connections JackRabbit is still using, mistakenly believing them to be abandoned.
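
For comparison, a standard pooled definition in Tomcat's context.xml might look like the sketch below. The resource name, database URL, credentials and pool size are all assumptions made for illustration; the attribute names are those of the commons DBCP pool that Tomcat uses by default.

```xml
<!-- Sketch only: name, URL, credentials and maxActive are assumed values. -->
<Resource name="jdbc/magnoliaAuthorDS"
          auth="Container"
          type="javax.sql.DataSource"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/magnolia_author"
          username="magnolia"
          password="changeme"
          maxActive="200"
          removeAbandoned="false"/>
<!-- maxActive must be large enough for the many connections JackRabbit holds open. -->
<!-- removeAbandoned="false" keeps the pool from reclaiming idle JackRabbit connections. -->
```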

To avoid this kind of thing from the outset you might consider configuring an unpooled datasource, for example as follows:
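
A sketch of such an unpooled definition for Tomcat's context.xml, using the simple datasource implementation shipped with MySQL Connector/J 5.x. The resource name, database URL and credentials are assumed values, and the class names should be checked against your driver version.

```xml
<!-- Sketch only: name, URL and credentials are assumed values. -->
<Resource name="jdbc/magnoliaAuthorDS"
          auth="Container"
          type="com.mysql.jdbc.jdbc2.optional.MysqlDataSource"
          factory="com.mysql.jdbc.jdbc2.optional.MysqlDataSourceFactory"
          user="magnolia"
          password="changeme"
          explicitUrl="true"
          url="jdbc:mysql://localhost:3306/magnolia_author"/>
```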

You need to specify the type so that Tomcat does not try to instantiate its own pooled DS. Two other important differences:

  • the user property - in Tomcat's regular DS, this is called username.
  • explicitUrl needs to be set to true unless you configure all parameters explicitly outside the url (including database name, which we don't do in this example).

Connection idle timeouts

In addition to the abandoned connection recovery at the tomcat end of things, there are also timeouts for idle connections configured at the mysql side.

This should not be a problem on a production server, where requests can be expected to come in at a somewhat constant rate. On development setups, where there may be no use of the system at all for a whole weekend, the connection-idle-timeout needs to be increased.

For mysql, raise the idle timeout (the wait_timeout server variable, 8 hours by default) in your my.ini, and restart the server.

  1. Jul 14, 2011

    Richard, it would be useful if you could also share your DS configuration on the appserver side. I have one particular instance which somehow always seems to lock. Others (like documentation and forum) work perfectly fine with the same configuration, so this one's a bit bizarre, but I'd be curious to see others' configurations.

    (in particular because DS tend to do connection pooling, which JackRabbit doesn't need - nor want, ideally)

    1. Jul 28, 2011

      Hi Gregory!

      I have included a sample datasource definition in the wiki-article above. Essentially, by using one of the "simple" datasource types (which are supplied with most JDBC driver packages) you can easily configure a datasource without pooling.

      However, I don't think the pool is a problem in itself (it's just that JackRabbit does not use it, and never 'gives back' its connections, so the pool is a bit useless), unless you configure abandoned connection recovery. Since JackRabbit leaves its connection idle for what is sometimes a VERY long time, the pool would consider them abandoned and reclaim them, leading to problems. But with removeAbandoned=false (the default), there should be no problem with the pool.

      However, we also see some "lock-ups" like you mention, but we ONLY ever see them when we shut down tomcat. In this case, the shutdown takes a very long time (>6 mins), and we see many error messages of the form:

      2011-07-28 15:19:32,800 WARN  rg.apache.jackrabbit.core.fs.db.DatabaseFileSystem: execute failed, about to reconnect...
      2011-07-28 15:19:42,907 WARN  rg.apache.jackrabbit.core.fs.db.DatabaseFileSystem: execute failed, about to reconnect...
      2011-07-28 15:19:53,026 WARN  rg.apache.jackrabbit.core.fs.db.DatabaseFileSystem: execute failed, about to reconnect...
      2011-07-28 15:20:03,254 WARN  rg.apache.jackrabbit.core.fs.db.DatabaseFileSystem: execute failed, about to reconnect...

      Since we use this setup only in development, and the shutdown error messages do not seem to have any impact on the repositories, we have just been ignoring this.

      It's just a hunch, but I think it has to do with the connections being idle for too long. By default, MySQL closes idle connections after 8 hours. I will try raising this value, and let you know what the effect is.

      1. Jul 28, 2011

        Thanks !
        Yeah, i've setup longer idle time in mysql - i think. I don't think I configured a DbFS (your warnings above), so I'm getting different issues. I'll investigate further, but it might indeed be related to the pool "abandoning" connection (although that should also "work", given JackRabbit's connection recovery "manager" ...)

        Ha, interesting detail though, I don't specify the mysql-specific type (I have type="javax.sql.DataSource") nor a factory in my DSs!

        1. Aug 04, 2011

          Hi Gregory!

          That's the difference between pooled and unpooled. Without the factory and with type = javax.sql.DataSource (which is actually an interface) tomcat has "default" logic to use apache commons DBCP to create a pooled connection.

          By specifying the factory and type explicitly I create a "simple" datasource for unpooled connections, using an implementation supplied with the JDBC driver. Most JDBC drivers have such "simple" datasource implementations, AFAIK.

          Regards from Vienna,

          Richard

          1. Aug 04, 2011

            Brilliant, thanks !

          2. Sep 02, 2011

            Wow, took me a while to figure out that the username property was in fact user for the Mysql datasource. Added a note about it.
            Thanks again !

      2. Aug 04, 2011

        Confirmed: the shutdown problems were due to connections that timed out at the mysql end. Increasing the connection-idle time in mysql got rid of these errors.

  2. Sep 12, 2011

    Regarding timeout settings, Jackrabbit has a connection recovery mechanism that should circumvent timed out connections. I've had weird behavior in the past, in part due to using a pooled connection. It's always hard to say if it really works, since it logs warnings even when still attempting to recover the connection - do you have any insight ? If it really works, it means 1) we should not use ?autoreconnect at the driver level 2) we should not set wait_timeout at server level.

  3. Feb 12, 2013

    Hi Richard (/Magnolia),

    I only came across this very helpful article just now. I have some questions:

    1. You seem to imply that the default Magnolia/Jackrabbit filesystem storage needs to be backed up in order to have a complete backup of a Magnolia installation. I.e. that it contains data which cannot be recreated/regenerated. Is this true? I have done some tests in the past with a default Magnolia MySQL setup where I simply deleted the entire default filesystem storage (repositories folder) and managed to recreate the filesystem storage simply by restarting Magnolia twice (after the first startup the system was in a mess but after the second it was fine). Having said that, we also have experience of people not being able to pull this off.
    2. How do you migrate from a situation where you have a filesystem Jackrabbit backend to a database Jackrabbit backend? If you say: simply delete the filesystem backend and have Magnolia recreate the backend in the database doesn't that invalidate my first question?
    3. Have you tested this since on Magnolia 4.5?

    We currently use more or less default Magnolia MySQL setups but do store the filesystem storage (repositories) outside of the web app folder. Even for local development using the default Derby database. Otherwise the repositories folder would be deleted (and recreated) for every new deployment of our Magnolia installation which is not what we want (the deployable artifact being the complete WAR; which is the standard JEE deployment model). It seems Magnolia has a different deployment model in mind when they decided to place the filesystem storage by default inside the web app folder. They seem to suggest a deployment model where you replace / add / remove artifacts (like JARs) inside the web app folder. I never really understood this. This is asking for problems (artifacts out of sync for one thing) in my opinion.

    cheers,

    Edgar

    1. Feb 13, 2013

      Ah sorry, I understand it better now. The Jackrabbit data store where by default all (>1000 bytes) JCR binaries are stored is not a type of cache: it is the only place where these binaries are stored. If you lose the data store, you lose the binaries.

      So your suggestion to keep the data store in the database instead of in the filesystem makes a lot of sense. The answer to my question #2 (how do you migrate) is I think: first make a complete content export (e.g. using the Magnolia backup scripts), change the configuration, start with empty databases and perform an import?

      cheers, Edgar

      1. Feb 14, 2013

        #2 yes, export, clean install, import is the way to go.

        For the other question, as you found out, it's not really about Magnolia 4.4 vs. 4.5 but about whether or not you use the datastore, which is possible in both (and configured by default in 4.5).

        1. Feb 14, 2013

          Thanks Jan. And I understood from other posts that the performance gain of using the datastore is very large (as opposed to not using it). I assume that this is true also when you store the datastore in a database and not on the filesystem?

          1. Feb 14, 2013

            In general yes, but it really depends on how well your DB handles binary data and how fast the network connection between app and DB server is (in case they are not on the same host).

      2. Apr 18, 2014

        Hi,

        To be quite clear: I'm not at all recommending moving the DataStore to a database for production setups, at least not if blobs of serious size are being stored. The DB based Datastore deals with DB Blobs, which are always problematic. Also, the time required to read/write large blobs to/from the DB will block up connections for a long time, completely changing Jackrabbit's "connection behavior" and causing lock-ups and other problems under load. For a cluster, use a shared FS for the Datastore in production, locking doesn't matter as it is append-only.

        And really really don't do this in production with mysql.

        That warning given, if you aren't storing large blobs, or for development or testing it can still be a useful setup.

    2. Apr 18, 2014

      Hi Edgar,

      Sorry, I left this unanswered a long time :-/ That post happened while I was on parental leave...

      To answer your questions:

      1) Yes, the filesystem should be backed up in a consistent state with the database. In an ideal world you will stop the jackrabbit instance and perform the backup. In practice that can be problematic. If you use DB-Persistence, then most of the repository fs contents can be "regenerated" as you describe, but that isn't what I would call a "production ready" procedure that I would recommend as part of a backup strategy. And in any case, a fs based DataStore needs to be backed up.

      If online backup is required you can also back up the DB (perhaps using snapshots, in a transaction or using locks or some other way that gets you consistency) and then backup the fs based DataStore at any point after the DB. Since the Datastore is append only in normal operation that might get you a few "extra" blobs compared to the DB state, but they don't hurt.

      But Backup is not the only reason to want DB-only repositories. Two others that come to mind are: JCR clusters --> they need transactional persistence managers and filesystems, and it can be convenient to have the datastore in the DB for shared access. The other reason might be for development purposes - it's easier to "swap out" repositories if all you need to do is change the DB.

      2) Import / Export, as Jan wrote. There's no other way, to my knowledge.

      3) Not directly in magnolia 4.5, but we'll be migrating our config for this to magnolia 5 soon. But this is all at the Jackrabbit level, and should be transparent to magnolia, really.

      Hope you're doing well! Regards from Vienna,

      Richard