
This module was little more than a PoC done in my spare time and is no longer actively developed. If you are looking for a professional and flexible integration of Magnolia and Solr, please take a look at the new https://documentation.magnolia-cms.com/display/DOCS/Solr+module


Introduction

Apache Solr is a "high performance enterprise search server, with XML/HTTP and Java/JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, web admin and search interfaces" based on Lucene. The magnolia-solr-module aims to bring Solr's outstanding search features to Magnolia CMS. This page is a step-by-step tutorial to get you started quickly with Solr and Magnolia. The only assumption we make is that you are somewhat familiar with Solr and have at least gone through its beginners' tutorial. The tutorial is based on the search sample that ships with the magnolia-solr-module itself. The Solr version we refer to is the latest one (1.4.1 at the time of writing).

Get the module

You can grab the latest module jar at Magnolia's Nexus. The Maven artifact dependency is the following:

<dependency>
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-module-solr</artifactId>
  <version>1.0</version>
</dependency>

Install Solr, magnolia-solr-module and dms module (optional)

  • To start with, we suggest making a copy of Solr's example directory to use as a template for the project.
  • Then go to the solr/conf folder and replace the contents of the default Solr schema.xml with our simplified version (more on that later), which was devised especially for the samples.
    Alternatively, the new schema.xml can be found in the module sources at http://svn.magnolia-cms.com/view/forge/magnolia-module-solr/trunk/src/etc/.
  • Add the latest magnolia-solr-module and magnolia-module-dms to your Magnolia webapp's pom.xml. The DMS module is optional, but strongly recommended if you want to see how indexing and searching DMS-managed documents works. Here we will proceed as if the DMS has been installed.
  • Finally, enable installation of the module's sample by setting magnolia.bootstrap.samples=true in your magnolia.properties configuration file.

    Dependencies

    If you don't want to use a Maven-based project to follow this tutorial, be aware that magnolia-solr-module 1.0 depends on org.apache.solr:solr-solrj:1.4.1 and org.apache.solr:solr-core:1.4.1.

Solr's default base URL is http://localhost:8983/solr/ and magnolia-solr-module's configuration automatically points to it. This can be changed later on.
That's enough setup. Let's start the Solr server and Magnolia!
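
Before moving on, it can be handy to check that Solr actually answers at the configured base URL. The following is a minimal sketch that uses only the JDK. It assumes the ping handler of the stock Solr example configuration is available at admin/ping under the base URL; the class SolrPingCheck and its methods are hypothetical helpers and are not part of magnolia-solr-module.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for this tutorial, not part of magnolia-solr-module.
class SolrPingCheck {

    // Default base URL mentioned in this tutorial.
    static final String DEFAULT_BASE_URL = "http://localhost:8983/solr/";

    // Appends the ping handler path (assumed: admin/ping), tolerating a missing trailing slash.
    static String pingUrl(String baseUrl) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return base + "admin/ping";
    }

    // Returns true if the ping handler answers with HTTP 200 within the timeout.
    static boolean isReachable(String baseUrl, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(pingUrl(baseUrl)).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException | IllegalArgumentException e) {
            // Connection refused, timeout or malformed URL: treat as "not reachable".
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Solr reachable: " + isReachable(DEFAULT_BASE_URL, 2000));
    }
}
```

If the check fails, verify that the Solr server is running and that the base URL in the module configuration matches your installation.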

Index contents

The sample installs a mini website with some content to index and search, along with a couple of paragraphs: one for entering text, the other for performing searches. This screenshot shows what the website looks like after logging into Magnolia AdminCentral. The content is found under /shakespeare/tragedies and contains the full text of "Romeo And Juliet" and "Hamlet" (courtesy of MIT).

Let's now go to Tools->Solr and start adding content to Solr. As you can see in the picture below, the Solr tool page has a very simple UI for the two most basic operations: adding or updating Magnolia-managed content in Solr, and removing that content from it. You select a workspace and the path to the content to be indexed or removed and, optionally, choose whether to include sub-nodes.

In our case we want Solr to index all of Shakespeare's tragedies (well, the two we have), so we select website as the workspace, enter /shakespeare as the path and tick the include sub-nodes checkbox. Now click Add or update to start indexing. Solr is very fast at indexing: with so little content, it took only 36 ms (at least on our MacBook).

Let's do the same for the DMS documents. Select dms as the workspace, enter /demo-docs as the path and tick the include sub-nodes checkbox. On the same machine, Solr took 261 ms to index 11 documents (mainly PDFs). Indexing performance depends, of course, on several factors, and analyzing them is beyond the scope of this tutorial. Please refer to Solr's documentation for further information.
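
To get a feeling for what an add or update amounts to on the Solr side, here is a minimal sketch that builds a message in Solr's XML update format for a single page. It is only an illustration: the module itself talks to Solr through Solrj, and the class SolrAddMessage, the id value and the page path below are hypothetical. The field names are taken from the sample schema.xml discussed later in this tutorial.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not the module's actual implementation.
class SolrAddMessage {

    // Escapes the characters that are special in XML text and attribute values.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds an <add> message containing one document with the given fields.
    static String build(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escapeXml(field.getKey())).append("\">")
               .append(escapeXml(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("id", "example-id-1");                      // hypothetical id value
        page.put("path", "/shakespeare/tragedies/hamlet");   // hypothetical page path
        page.put("title", "Hamlet");
        page.put("text", "Something is rotten in the state of Denmark.");
        System.out.println(build(page));
    }
}
```

Such a message would be posted to Solr's update handler and followed by a commit before the document becomes searchable.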

Indexing performance

According to http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia it took 2 hours 40 minutes to index a dump of English Wikipedia containing 7278241 articles with peak memory usage at around 4GB.

Search contents

Now that we have indexed our content, we can start searching it. Let's open the search page and enter a query term, e.g. "Verona". This is the city where "Romeo And Juliet" is set, so we expect to find some results, as in fact we do.

Let's try another search that we expect to match more results, from both the website and the DMS. Let's search for quite a common word: "fine". Below are the results. We found 4 documents, two from the website and two (Peter Pan and Wizard of Oz) from the DMS. As you can see, each result item is made up of a title link (leading either to a website page or to a DMS document), a text excerpt with our query term highlighted and, finally, the last modification date of the page or document.

The module's sample comes with a search paragraph whose configuration is shown below.

You can take a look at the source code of the model class info.magnolia.solr.samples.SearchParagraphModel to understand how the Solrj API is used to query Solr.
The code should be self-explanatory. Also take a look at the paragraph searchResult.jsp to see how the query results are rendered in a page.
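
If you do not have the sources at hand, the following sketch shows roughly what such a search looks like as a plain HTTP request to Solr's select handler, with highlighting switched on. It uses standard Solr request parameters (q, rows, hl, hl.fl), but it is not the code of SearchParagraphModel: the class SolrSelectUrl is hypothetical, and highlighting on the text field is an assumption based on the sample schema.xml.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration of the HTTP request behind a highlighted search.
class SolrSelectUrl {

    // URL-encodes a parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    // Builds a select URL that asks for highlighted snippets from one field.
    static String build(String baseUrl, String query, String highlightField, int rows) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return base + "select?q=" + encode(query)
                + "&rows=" + rows
                + "&hl=true"
                + "&hl.fl=" + encode(highlightField);
    }

    public static void main(String[] args) {
        // Search for "Verona" and highlight matches in the (assumed) text field.
        System.out.println(build("http://localhost:8983/solr/", "Verona", "text", 10));
    }
}
```

Pasting the printed URL into a browser while Solr is running is a quick way to inspect the raw response, including the highlighting section used for the text excerpts.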

Now that we have used the module more or less as an administrator would, developers might be asking a couple of questions: how does this thing work? Should the defaults not suit my needs, how do I customize the module? We will try to answer those questions in the next section.

A look at module configuration and Solr's schema.xml

The magnolia-solr-module configuration is found under solr/config and it's made up of two main parts: the Solr server configuration and the Fields configuration.

  • Solr server config

Everything related to the Solr server we are working with is configured here. The screenshot below shows the expanded configuration. Most of the parameters concern a Solr installation running as a remote server (the default). Solr can, however, also be used as an embedded server: the most noteworthy parameter is embedded, which lets you choose an embedded server instead of a remote one. In that case you won't use an HTTP connection to communicate with Solr, and the baseURL parameter will look something like /path/to/my/solr/instance. For an explanation of all other parameters, please refer to http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer

Another essential configuration option is /solrConfig/schema/uniqueKey. The magnolia-solr-module always requires a unique key to be specified in Solr's schema.xml. This makes updating the Solr index much easier by avoiding duplicate entries. The default module configuration expects schema.xml to contain <uniqueKey>id</uniqueKey>. If you want to change this value, say to <uniqueKey>myCoolID</uniqueKey>, you have to update it both in your schema.xml and in the module configuration.
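
Why a unique key avoids duplicates can be illustrated with a toy model of an index, sketched below: documents are stored by the value of their unique key field, so adding a document whose key already exists replaces the old entry instead of creating a second one. This is a simplified in-memory model written for this tutorial, not Solr's or the module's implementation, and the class UniqueKeyIndexModel is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy in-memory model of an index with a unique key (hypothetical, for illustration only).
class UniqueKeyIndexModel {

    private final String uniqueKeyField;
    private final Map<String, Map<String, String>> docsByKey = new LinkedHashMap<>();

    UniqueKeyIndexModel(String uniqueKeyField) {
        this.uniqueKeyField = uniqueKeyField;
    }

    // Adds the document, replacing any existing document with the same unique key value.
    void addOrUpdate(Map<String, String> doc) {
        String key = doc.get(uniqueKeyField);
        if (key == null) {
            throw new IllegalArgumentException("missing unique key field: " + uniqueKeyField);
        }
        docsByKey.put(key, doc);
    }

    Map<String, String> get(String key) {
        return docsByKey.get(key);
    }

    int size() {
        return docsByKey.size();
    }

    public static void main(String[] args) {
        UniqueKeyIndexModel index = new UniqueKeyIndexModel("id");

        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "42");
        first.put("title", "Hamlet (draft)");
        index.addOrUpdate(first);

        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "42");
        second.put("title", "Hamlet");
        index.addOrUpdate(second);

        // Same id indexed twice: still one document, carrying the latest title.
        System.out.println(index.size() + " document(s), title = " + index.get("42").get("title"));
    }
}
```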

  • Field config

The field configuration lets you customize the module's default behavior and specify which content will be indexed into Solr. The screenshot below shows the expanded configuration. As you can see, it is organized by workspace (with website and dms provided by default), so you can add other workspaces if you need to.

The following table summarizes the various options:

| name | value | description |
|---|---|---|
| fields | node data name (String) | (only for website) here you add the node data (property) names containing the text you want to index. Optionally you can enter a boostValue (default is 1.0). If no fields are specified, the default implementation will index all text, date, long, double, boolean node data it encounters when going through all paragraphs of a given web page. See info.magnolia.solr.documents.impl.WebsiteDocumentImpl. The node data names must match the fields declared in schema.xml (more on this later). Any unmatched property will cause a Solr runtime error during indexing (e.g. unknown field ...) |
| metadata | boolean | (only for website) if true will index page author and last modification date |
| path | boolean | (only for website) if true will index the page path |
| title | boolean | (only for website) if true will index the page title |
| documentImpl/class | the fully qualified name of a class implementing info.magnolia.solr.documents.MagnoliaSolrDocument (String) | here you can plug in your custom class to create a SolrInputDocument different from the default one or add input documents for a different workspace. |
| updateRequestImpl/class | the fully qualified name of a class implementing info.magnolia.solr.request.MagnoliaSolrUpdateRequest (String) | input documents are sent to Solr through a SolrUpdateRequest. Here you can plug in your custom class to replace default implementations or to create a new request for a different workspace. |

  • schema.xml

"The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields" (from Solr's documentation). The part of schema.xml we're interested in here is the <fields> part. Here is an excerpt from ours.

   <field name="id" type="string" indexed="true" stored="true" required="true" />
   <field name="name" type="text" indexed="true" stored="true"/>
   <field name="path" type="text" indexed="true" stored="true"/>
   <field name="text" type="text" indexed="true" stored="true" multiValued="true"/>

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->
   <field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
   <field name="authorid" type="text" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="lastmodified" type="date" indexed="true" stored="true"/>
   <field name="staticlink" type="string" indexed="true" stored="true" multiValued="false"/>

As you can see, here we declare all the fields (node data in Magnolia parlance) we want to index. The node data names in our Field configuration (see above) must match the field names declared here. Failing to do so will result in a runtime exception raised by Solr during indexing (unknown field...). The schema.xml is at the core of Solr's configuration and can be a bit intimidating at first, but it's definitely worth learning about its numerous options, which are very well documented in the file itself and in the official documentation.
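
Since a mismatch only shows up at indexing time, it may be worth checking the configured node data names against schema.xml up front. The following is a minimal sketch of such a check using the JDK's XML parser. The class SchemaFieldCheck and the configured name abstract are hypothetical, and the check is deliberately simple: it only looks at <field> declarations and ignores dynamic fields.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical pre-flight check, not part of magnolia-solr-module.
class SchemaFieldCheck {

    // Collects the name attribute of every <field> element in the given schema XML.
    static Set<String> declaredFields(String schemaXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("field");
            Set<String> names = new LinkedHashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse schema XML", e);
        }
    }

    // Returns the configured node data names that have no matching field declaration.
    static List<String> unmatched(List<String> configured, Set<String> declared) {
        List<String> missing = new ArrayList<>();
        for (String name : configured) {
            if (!declared.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String schemaExcerpt = "<fields>"
                + "<field name=\"id\" type=\"string\" indexed=\"true\" stored=\"true\" required=\"true\"/>"
                + "<field name=\"title\" type=\"text\" indexed=\"true\" stored=\"true\" multiValued=\"true\"/>"
                + "<field name=\"text\" type=\"text\" indexed=\"true\" stored=\"true\" multiValued=\"true\"/>"
                + "</fields>";
        // "abstract" is a hypothetical node data name that the schema does not declare.
        List<String> configured = Arrays.asList("title", "abstract");
        System.out.println(unmatched(configured, declaredFields(schemaExcerpt))); // prints [abstract]
    }
}
```

Any name reported by such a check would have to be either declared in schema.xml or removed from the Field configuration before indexing.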

Caveat

At the moment, the module provides no built-in synchronization mechanism to keep Magnolia and Solr aligned. All add/update/remove operations are done manually, and this can lead to the index and the content getting out of sync. For example, one could mistakenly index content that has not been published yet, or forget to run an index update on content that has changed. Or content could be unpublished or deleted from Magnolia without being removed from Solr (which, by the way, should be done before deleting the content from Magnolia itself, otherwise the module has no way to retrieve the ids of the documents to be removed from Solr). However, Magnolia commands are provided for the add/update and remove operations, and these could easily be scheduled or added to a workflow.
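
The ordering constraint on removals can be made concrete with a small sketch. Assuming the document ids have been collected while the content still exists in Magnolia, it builds the corresponding messages in Solr's XML update format: one delete per id, followed by a commit. The class SolrDeleteMessages and the id values are hypothetical, and the module's own commands may well work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the module's actual remove command.
class SolrDeleteMessages {

    // Escapes the characters that are special in XML text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds one <delete> message per id, followed by a final <commit/>.
    static List<String> build(List<String> ids) {
        List<String> messages = new ArrayList<>();
        for (String id : ids) {
            messages.add("<delete><id>" + escapeXml(id) + "</id></delete>");
        }
        messages.add("<commit/>");
        return messages;
    }

    public static void main(String[] args) {
        // Step 1: collect the ids while the content is still in Magnolia (hypothetical values).
        List<String> ids = Arrays.asList("example-id-1", "example-id-2");
        // Step 2: remove the documents from Solr; only afterwards delete the content in Magnolia.
        for (String message : build(ids)) {
            System.out.println(message);
        }
    }
}
```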

Resources