This module was little more than a PoC done in my spare time and is no longer actively developed. If you are looking for a professional and flexible integration of Magnolia and Solr, please take a look at the new Solr module: https://documentation.magnolia-cms.com/display/DOCS/Solr+module
Apache Solr is a "high performance enterprise search server, with XML/HTTP and Java/JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, web admin and search interfaces" based on Lucene. The magnolia-solr-module aims at bringing Solr's outstanding search features into Magnolia CMS. This page is a step-by-step tutorial to get you started quickly with Solr and Magnolia. The only assumption we make here is that you are somewhat familiar with Solr and have at least gone through its beginners' tutorial. The tutorial we present here is based on the search sample which comes with the magnolia-solr-module itself. The Solr version we refer to is the latest one (1.4.1 at the time of writing).
Get the module
You can grab the latest module jar at Magnolia's Nexus. The Maven artifact dependency is the following:
Install Solr, magnolia-solr-module and dms module (optional)
- To start with, we suggest you make a copy of Solr's example directory as a template for our project.
- Then go to the solr/conf folder and replace the contents of the default Solr schema.xml with our simplified version of it (more on that later), which was devised especially for the samples.
Alternatively, the new schema.xml can be found in the module sources at
- Add the latest magnolia-module-dms module to your Magnolia webapp's pom.xml. The DMS module is optional, although strongly recommended if you want to see how indexing and searching of DMS-managed documents works. Here we will proceed as if the DMS has been installed.
Finally, enable the installation of the module's sample by setting
If you don't want to use a Maven-based project to follow along with this tutorial, be aware that magnolia-solr-module 1.0 depends on
Solr's default base URL is
magnolia-solr-module's configuration automatically points to it. This can be changed later on.
Enough with setting up things. Let's start the Solr server and Magnolia!
The sample will install a mini website with some content to index and search, along with a couple of paragraphs: one for entering text, the other for performing searches. This screenshot shows what the website looks like after logging into Magnolia AdminCentral. The content is to be found under /shakespeare/tragedies. It contains the full text of "Romeo And Juliet" and "Hamlet" (courtesy of MIT).
Let's now go to Tools->Solr and start adding content to Solr. As you can see in the picture below, the Solr tool page has a very simple UI where you can perform the two most basic operations: adding/updating Magnolia-managed content in Solr and removing that content from it. You select a workspace and the path to the content to be indexed or removed. Optionally, you choose whether to include sub-nodes. In our case we want Solr to index all of Shakespeare's tragedies (well, actually the two of them we have), therefore we select website as the workspace, /shakespeare as the path and finally tick the include sub-nodes checkbox. Now click on Add or update to start indexing. Solr is very fast at indexing and in our case, with only a small amount of content, it took just 36 ms (at least on our MacBook).
Let's do the same for DMS documents. Select dms as the workspace, /demo-docs as the path and tick the include sub-nodes checkbox. Solr took (still on our machine) 261 ms to index 11 documents (mainly PDFs). Indexing performance will, of course, depend on several factors, and an analysis of those is clearly beyond the scope of this tutorial. Please refer to Solr's documentation for further information.
According to http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia it took 2 hours 40 minutes to index a dump of English Wikipedia containing 7278241 articles with peak memory usage at around 4GB.
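To make the add/update step a bit more concrete, here is a minimal sketch of the XML update message that Solr accepts over HTTP for adding a document. This is not the module's code (the module goes through Solrj); the class name SolrAddMessage and the field names id and title are our own assumptions for illustration, and real field names have to match those declared in your schema.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: builds the XML body of a Solr "add" update message for one document. */
final class SolrAddMessage {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders the given fields as <add><doc><field name="...">...</field></doc></add>. */
    static String toXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(f.getKey())).append("\">")
               .append(escape(f.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical document: the id value and the field names are assumptions.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "hypothetical-id-1");
        doc.put("title", "Romeo And Juliet");
        System.out.println(toXml(doc));
    }
}
```

A message like this would be posted to Solr's update handler and followed by a commit before the document becomes searchable.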
Now that we have indexed our content we can start searching it. Let's click on the search page and enter a query term, e.g. "Verona". This is the city where "Romeo And Juliet" is set, therefore we expect to find some results, as in fact we do.
Let's try another search where we expect to match more results from both the website and the DMS. Let's search for quite a common word: "fine". Below are the results. We actually found 4 documents, two from the website and two (Peter Pan and Wizard of Oz) from the DMS. As you can see, each result item is made up of a title link (leading either to a website page or to a DMS document), a text excerpt with our query term highlighted and finally the date of last modification of the page or document.
The module's sample comes with a search paragraph whose configuration is shown below.
You can take a look at the source code of the model class info.magnolia.solr.samples.SearchParagraphModel to understand how the Solrj API is used to query Solr. The code should be self-explanatory. Also take a look at the paragraph searchResult.jsp to see how the query results are shown in a page.
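If you want to see what such a search boils down to at the HTTP level, the following sketch builds a request URL for Solr's select handler with highlighting switched on. It is only an approximation of what the sample does through Solrj: the class name SolrSelectUrl, the highlighted field name text and the base URL used in main (the one of a stock Solr example installation) are assumptions, while q, rows, hl and hl.fl are standard Solr request parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Illustrative only: builds a Solr select URL with hit highlighting enabled. */
final class SolrSelectUrl {

    /** URL-encodes a single parameter value as UTF-8. */
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /**
     * @param baseUrl        base URL of the Solr instance, without trailing slash
     * @param query          the user's query term(s)
     * @param highlightField the field for which highlighted excerpts are requested
     * @param rows           maximum number of results to return
     */
    static String build(String baseUrl, String query, String highlightField, int rows) {
        return baseUrl + "/select"
                + "?q=" + enc(query)
                + "&rows=" + rows
                + "&hl=true"
                + "&hl.fl=" + enc(highlightField);
    }

    public static void main(String[] args) {
        // Assumed base URL and field name; adjust both to your own setup.
        System.out.println(build("http://localhost:8983/solr", "Verona", "text", 10));
    }
}
```

Pasting the resulting URL into a browser is a quick way to check what Solr returns for a query, independently of Magnolia.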
Now that we have used the module more or less as an administrator would, developers might be asking a couple of questions: how does this thing work? Should the defaults not suit my needs, how do I customize the module? We will try to answer those questions in the next section.
A look at module configuration and Solr's schema.xml
The magnolia-solr-module configuration is found under solr/config and is made up of two main parts: the Solr server configuration and the Fields configuration.
Solr server config
This is where everything related to the Solr server we are working with is configured. The screenshot below shows the expanded configuration. Most of the parameters concern a Solr installation as a remote server (the default). Nonetheless, Solr can also be used as an embedded server. The most noteworthy parameter is embedded, which lets you choose an embedded server instead of a remote one. In that case you won't use an HTTP connection to communicate with Solr, and the baseURL parameter will look something like /path/to/my/solr/instance. For an explanation of all other parameters, please refer to http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer
Another essential configuration option is the
magnolia-solr-module always requires a unique key to be specified in Solr's schema.xml. This makes updating the Solr index much easier by avoiding duplicate entries. The default module configuration expects schema.xml to contain <uniqueKey>id</uniqueKey>. In case you want to change this value, say to <uniqueKey>myCoolID</uniqueKey>, you have to update it both in your schema.xml and in the module configuration.
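The reason a unique key avoids duplicate entries can be illustrated with a toy in-memory model: when a document is added whose key value already exists, it replaces the earlier one instead of being stored a second time. The sketch below is not Solr or module code; the class UniqueKeyIndex and the sample field values are made up for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of an index keyed by a unique key field; not Solr code. */
final class UniqueKeyIndex {

    private final String uniqueKey;
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    UniqueKeyIndex(String uniqueKey) {
        this.uniqueKey = uniqueKey;
    }

    /** Adds a document, replacing any earlier document with the same key value. */
    void addOrUpdate(Map<String, String> doc) {
        String key = doc.get(uniqueKey);
        if (key == null) {
            throw new IllegalArgumentException("missing unique key field: " + uniqueKey);
        }
        docs.put(key, doc);
    }

    /** Returns the document stored under the given key value, or null. */
    Map<String, String> get(String key) {
        return docs.get(key);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UniqueKeyIndex index = new UniqueKeyIndex("id");
        index.addOrUpdate(Map.of("id", "a", "title", "Hamlet (draft)"));
        index.addOrUpdate(Map.of("id", "a", "title", "Hamlet")); // same id: update, not duplicate
        System.out.println(index.size() + " document(s), title = " + index.get("a").get("title"));
    }
}
```

Without a key field the second call would have no way to know it refers to the same page, which is why the key name has to agree between schema.xml and the module configuration.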
The field configuration lets you customize the module's default behavior and specify which content will be indexed into Solr. The screenshot below shows the expanded configuration. As you can see, it is organized by workspace (with website and dms provided by default) so that you can add other workspaces if you need to.
The following table summarizes the various options
node data name (String)
(only for website) here you add the node data (property) names containing the text you want to index. Optionally you can enter a
(only for website) if
(only for website) if
(only for website) if
the fully qualified name of a class implementing
here you can plug in your custom class to create a
the fully qualified name of a class implementing
input documents are sent to Solr through a
"The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields" (from Solr's documentation). The part of schema.xml we're interested in here is the <fields> part. Here is an excerpt from ours.
As you can see, here we declare all the fields (node data in Magnolia parlance) we want to index. The node data names in our Field configuration (see above) must match the field names declared here. Failing to do so will result in a runtime exception raised by Solr during indexing (unknown field...). The schema.xml is at the core of Solr's configuration and can be a bit intimidating at first. But it's definitely worth learning about its numerous options, which are very well documented both in the file itself and in the official documentation.
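Since a mismatch between configured node data names and schema fields only shows up as an exception at indexing time, it can be handy to check the names up front. The following is a minimal sketch of such a check using the JDK's XML parser; the class SchemaFieldCheck is hypothetical and not part of the module. Note the simplification: it only looks at <field> declarations and ignores <dynamicField> patterns, which Solr would also accept.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: checks configured field names against the fields declared in a schema.xml. */
final class SchemaFieldCheck {

    /** Returns the name attribute of every <field> element in the given schema XML. */
    static Set<String> declaredFields(String schemaXml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(schemaXml)))
                    .getElementsByTagName("field");
            Set<String> names = new HashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse schema", e);
        }
    }

    /** Returns the configured names that have no matching <field> declaration. */
    static List<String> missing(List<String> configured, String schemaXml) {
        Set<String> declared = declaredFields(schemaXml);
        List<String> result = new ArrayList<>();
        for (String name : configured) {
            if (!declared.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical schema fragment and configured names, for demonstration only.
        String schema = "<schema><fields>"
                + "<field name=\"id\" type=\"string\"/>"
                + "<field name=\"title\" type=\"text\"/>"
                + "</fields></schema>";
        System.out.println("Undeclared: " + missing(List.of("id", "title", "abstract"), schema));
    }
}
```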
At the moment, the module provides no built-in synchronization mechanism to keep Magnolia and Solr aligned. All add/update/remove operations are done manually, and this could lead to the index and the content getting out of sync. For example, one could mistakenly index content that has not been published yet, or forget to run an index update on content that has changed. Or content could be unpublished or deleted from Magnolia without being removed from Solr (which, by the way, should be done before deleting the content from Magnolia itself, otherwise the module has no way to retrieve the document ids to be removed from Solr). However, Magnolia commands are provided for the add/update and removal operations, and those could easily be scheduled or added to a workflow.
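The ordering constraint on removals (look up the id while the content still exists, then delete) can be sketched as follows. This is a toy model, not the module's implementation: the class RemovalOrder and its path-to-id map stand in for the content repository, and the queued strings follow Solr's XML delete-by-id message format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "remove from Solr before deleting the content"; not module code. */
final class RemovalOrder {

    /** Stand-in for the content repository: content path -> document id. */
    private final Map<String, String> idsByPath = new HashMap<>();

    /** Update messages that would be posted to Solr, in order. */
    final List<String> solrMessages = new ArrayList<>();

    void store(String path, String id) {
        idsByPath.put(path, id);
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Resolves the id first, queues the Solr delete, and only then drops the content.
     * Returns false if the content is already gone, in which case the id cannot be recovered.
     */
    boolean remove(String path) {
        String id = idsByPath.get(path);
        if (id == null) {
            return false;
        }
        solrMessages.add("<delete><id>" + escape(id) + "</id></delete>");
        idsByPath.remove(path);
        return true;
    }

    public static void main(String[] args) {
        RemovalOrder repo = new RemovalOrder();
        repo.store("/shakespeare/tragedies/hamlet", "hypothetical-id-1"); // made-up id
        System.out.println(repo.remove("/shakespeare/tragedies/hamlet") + " " + repo.solrMessages);
        System.out.println(repo.remove("/shakespeare/tragedies/hamlet")); // content gone, nothing to send
    }
}
```

A scheduled command or workflow step would need to respect the same order: run the Solr removal while the node is still available, then delete it.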