
Summary

The workspace configuration is specified in the Jackrabbit repository configuration file. The configuration specified there acts as a template for the configuration of all new workspaces. Once the workspace is created you can then adjust the workspace configuration on a workspace-by-workspace basis. For example, you can adjust the search index configuration on the workspace to optimize how the data is indexed based on what you know about the data and how it's organized.

The search index in Jackrabbit is pluggable and has a default implementation based on Apache Lucene. It is configured in the file workspace.xml when the workspace is created. You have a lot of options for configuration. It's the goal of this page to outline those options in one place. This page is based on Jackrabbit 2.16 API, Lucene 3.6, and Magnolia 5.7.

Feel free to ask questions at the bottom of the page for further clarification.

workspace.xml

Each workspace has its own workspace.xml file, which is created inside the workspace home directory when the workspace is created.

<?xml version="1.0" encoding="UTF-8"?>
<Workspace> 
  <FileSystem .../>
  <PersistenceManager .../>
  <SearchIndex .../>
  <WorkspaceSecurity .../>
</Workspace>

FileSystem

The virtual file system passed to the persistence manager and the search index.

<FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
  <param name="path" value="${rep.home}/repository" />
</FileSystem>

Jackrabbit provides a lot of choices for how you can configure the FileSystem. Choose the class (local, db, or in-mem) that best fits your use case.

See: http://jackrabbit.apache.org/jcr/jackrabbit-configuration.html#file-system-configuration and Jackrabbit Repository Configuration File#FileSystem.1

PersistenceManager

The persistence manager configuration for the workspace.

<PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
  <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
  <param name="schemaObjectPrefix" value="${wsp.name}_"/>
</PersistenceManager>

Jackrabbit provides a lot of choices for how you can configure the PersistenceManager. Choose the class (pool or in-mem) that best fits your use case.

See: https://wiki.apache.org/jackrabbit/PersistenceManagerFAQ and Jackrabbit Repository Configuration File#PersistenceManager

SearchIndex

Node names and property values are indexed as soon as the data is saved or as soon as the transaction is committed.   

Text extraction is done asynchronously in a background thread. That means changed or added text is not available immediately, but only after a short delay. The exact behavior can be configured using the extractor* settings.

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">    
  <param name="path" value="${wsp.home}/index"/>
  <!-- SearchIndex will get the indexing configuration from the classpath, if not found in the workspace home -->
  <param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
  <param name="useCompoundFile" value="true"/>
  <param name="minMergeDocs" value="100"/>
  <param name="volatileIdleTime" value="3"/>
  <param name="maxMergeDocs" value="100000"/>
  <param name="mergeFactor" value="10"/>
  <param name="maxFieldLength" value="10000"/>
  <param name="bufferSize" value="10"/>
  <param name="cacheSize" value="1000"/>
  <param name="forceConsistencyCheck" value="false"/>
  <param name="autoRepair" value="true"/>
  <param name="queryClass" value="org.apache.jackrabbit.core.query.QueryImpl"/>
  <param name="respectDocumentOrder" value="true"/>
  <param name="resultFetchSize" value="100"/>
  <param name="extractorPoolSize" value="3"/>
  <param name="extractorTimeout" value="100"/>
  <param name="extractorBackLogSize" value="100"/>
  <!-- needed to highlight the searched term -->
  <param name="supportHighlighting" value="true"/>
  <!-- custom provider for getting an HTML excerpt in a query result with rep:excerpt() -->
  <param name="excerptProviderClass" value="info.magnolia.jackrabbit.lucene.SearchHTMLExcerpt"/>
</SearchIndex> 

Jackrabbit provides the following options in the class SearchIndex.

Index Location

  • path: The location of the index directory. This parameter is mandatory. A reasonable value is: ${wsp.home}/index.

Indexing Configuration

  • indexingConfiguration: The path to the indexing configuration file. The default indexing configuration file is located in the core module. You have the option to create a workspace-specific file with this setting. See Search Index Configuration File.

    The configuration parameter indexingConfiguration is not set by default. This means all properties of a node are indexed.

  • indexingConfigurationClass: The name of the class that implements IndexingConfiguration. IndexingConfigurationImpl implements a concrete indexing configuration.

  • analyzer: Sets the default analyzer in use for indexing. The default value is the StandardAnalyzer. The StandardAnalyzer uses an English language stop word set. Lucene provides language specific analyzers that can be configured on a property-by-property basis in the indexing configuration file.
  • directoryManagerClass: The name of the class that implements DirectoryManager. FSDirectoryManager implements a directory manager for FSDirectory instances. RAMDirectoryManager implements a directory manager for RAMDirectory instances.
  • useSimpleFSDirectory: Indicates whether the DirectoryManager should use the SimpleFSDirectory instead of letting Lucene automatically pick an implementation based on the platform we are running on. Default is false.
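
As a rough illustration of the analyzer options above, the following sketch sets the default analyzer explicitly in workspace.xml. The values shown simply restate the documented defaults; they are not a recommendation.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- default analyzer used for all properties unless overridden per property -->
  <param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <!-- false lets Lucene pick the directory implementation for the platform -->
  <param name="useSimpleFSDirectory" value="false"/>
</SearchIndex>
```

A per-property analyzer is declared in the indexing configuration file rather than in workspace.xml. The fragment below is a minimal sketch of that file; the property name mytext is a hypothetical placeholder, and KeywordAnalyzer is chosen only as an example of an analyzer class.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers>
    <!-- index the hypothetical property "mytext" as a single, untokenized term -->
    <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
      <property>mytext</property>
    </analyzer>
  </analyzers>
</configuration>
```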

See http://wiki.apache.org/jackrabbit/IndexingConfiguration

Indexing Performance

  • useCompoundFile: All files belonging to a segment have the same name with varying extensions. When using the Compound File format these files are collapsed into a single .cfs file. Useful for systems that frequently run out of file handles.

  • minMergeDocs: This setting no longer exists in Lucene 3.x.

  • volatileIdleTime: The Lucene indexer does not write changes to the permanent index immediately. The indexer first writes the changes to a volatile index. Once the volatile index reaches a certain size, it is persisted to the permanent index. You can also set a timer, in seconds, to control how often changes are written.

  • maxMergeDocs: While merging segments, Lucene ensures that no segment with more than maxMergeDocs documents is created.

  • mergeFactor: This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene will store 10 documents in memory before writing them to a single segment on the disk.

  • maxFieldLength: Deprecated in Lucene 3.x.

  • bufferSize: Maximum number of documents that are held in a pending queue until added to the index.
  • cacheSize: Size of the document number cache. This cache maps UUIDs to Lucene document numbers. If the hit rate of the document number cache is poor, increasing this number could help.
  • maxVolatileIndexSize: The maximum volatile index size in bytes until it is written to disk. The default value is 1MB.
  • maxHistoryAge: The maximum age (in seconds) of the index history. The default value is 0, which means index commits are deleted as soon as they are no longer used.
  • initializeHierarchyCache: With the default value of true the hierarchy cache is initialized on startup and control is only given back when the initialization has completed. When set to false the cache is populated during regular use.
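
To show how the volatile index settings above fit together, here is a minimal sketch of a SearchIndex element. The values for maxVolatileIndexSize, maxHistoryAge, and initializeHierarchyCache restate the defaults described in the list, and volatileIdleTime is taken from the sample configuration on this page; treat them as starting points, not tuned values.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- persist the volatile index once it reaches 1 MB (1048576 bytes) -->
  <param name="maxVolatileIndexSize" value="1048576"/>
  <!-- timer in seconds for writing pending changes to the permanent index -->
  <param name="volatileIdleTime" value="3"/>
  <!-- 0 = delete index commits as soon as they are no longer used -->
  <param name="maxHistoryAge" value="0"/>
  <!-- true = build the hierarchy cache on startup before handing back control -->
  <param name="initializeHierarchyCache" value="true"/>
</SearchIndex>
```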

See https://wiki.apache.org/lucene-java/ImproveIndexingSpeed

Index Consistency

  • forceConsistencyCheck: Runs a consistency check on every startup. If false, a consistency check is only performed when the search index detects a prior forced shutdown. Because a consistency check can delay the start of the system, enable it only when a search index inconsistency is suspected. Typical symptoms are a node not found error, where a UUID exists in the search index but the corresponding node is not found, or the reverse case, where a node exists but is not recorded in the index. In both cases the index is inconsistent with the data.

  • autoRepair: Errors detected by a consistency check are automatically repaired. If false, errors are only written to the log.
  • enableConsistencyCheck: If set to true a consistency check is performed depending on the parameter forceConsistencyCheck. If set to false no consistency check is performed on startup, even if a redo log had been applied.
  • redoLogFactoryClass: The name of the class that implements RedoLogFactory. A redo log keeps track of changes that have not been committed to disk. While nodes are added to and removed from the volatile index (held in memory) a redo log is maintained to keep track of the changes. In case the Jackrabbit process terminates unexpectedly the redo log is applied when Jackrabbit is restarted the next time. DefaultRedoLogFactory is the default value.
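
When an inconsistency is suspected, one possible approach is a one-off check that only reports problems. The sketch below assumes you want the errors written to the log first and repaired in a later step, which is why autoRepair is set to false here.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- allow consistency checks on startup at all -->
  <param name="enableConsistencyCheck" value="true"/>
  <!-- run the check on the next startup even without a prior forced shutdown -->
  <param name="forceConsistencyCheck" value="true"/>
  <!-- only log the detected errors; set to true to repair them automatically -->
  <param name="autoRepair" value="false"/>
</SearchIndex>
```

Because the check delays startup, forceConsistencyCheck would normally be set back to false once the index has been verified.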

See https://documentation.magnolia-cms.com/display/DOCS/Repository+inconsistency

Index Search

  • queryClass: Class used to perform JCR Queries. QueryImpl provides the default implementation for a JCR query. Raising the log level on QueryImpl to DEBUG will print query execution times to the log.
  • respectDocumentOrder: If true and the query does not contain an 'order by' clause, result nodes will be in document order (the order in which they were indexed by the system).

  • resultFetchSize: The number of results the query handler should initially fetch when a query is executed. Keep in mind that ACL checks must be performed on the result set. The larger the set the more time to load and check.

  • termInfosIndexDivisor: An index divisor for TermInfosReader that lets you further sub-sample the termIndexInterval when opening a reader, so that less RAM is used. The default value is 1, meaning all terms are loaded into RAM. A value of 2 loads every other term into RAM, but the trade-off is that a lookup might have to scan twice as many terms. See LUCENE-1052.
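
The search options above could be combined as in the following sketch. The numbers are illustrative assumptions, chosen to show the direction of each trade-off (smaller initial result sets, less RAM for the term index), and should be validated against your own queries.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- skip document ordering for queries without an 'order by' clause -->
  <param name="respectDocumentOrder" value="false"/>
  <!-- fetch fewer results initially, so fewer ACL checks run up front -->
  <param name="resultFetchSize" value="50"/>
  <!-- load every other term into RAM instead of all terms -->
  <param name="termInfosIndexDivisor" value="2"/>
</SearchIndex>
```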

See https://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Text Extraction

  • extractorPoolSize: Defines the maximum number of background threads that are used to extract text from binary properties. If set to 0 then no background threads are allocated and text extractors run in the current thread. 1.5 to 2 times the number of processors is a good rule of thumb.

  • extractorTimeout: A text extractor is executed using a background thread if it doesn't finish within this timeout defined in milliseconds. This parameter has no effect if extractorPoolSize is 0.

  • extractorBackLogSize: The size of the extractor pool back log. If all threads in the pool are busy, incoming work is put into a wait queue. If the wait queue reaches the back log size, incoming extractor work will not be queued anymore but will be executed with the current thread.

  • maxExtractLength: The maximum number of characters to extract from binaries. Positive values are used as-is, negative values are interpreted as factors of the maxFieldLength parameter.
  • forkJavaCommand: Java command used to fork external parser processes or null (the default) for in-process text extraction. Use this to better control system stability and reliability by forcing indexing of binary documents into separate JVM processes. Any problems caused by parsing large or malformed documents won't affect the main process.

    Example values:

    Linux: nice java -Xmx512m
    Windows: cmd /c start /low /wait /b java -Xmx512m
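
Put together, the text extraction settings might look like the sketch below. It assumes a Linux host with 4 processors, so the pool size of 6 follows the 1.5 times rule of thumb mentioned above; the other values are copied from the sample configuration and the example values on this page.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- assumption: 4 processors x 1.5 = 6 background extractor threads -->
  <param name="extractorPoolSize" value="6"/>
  <!-- timeout in milliseconds before extraction moves to a background thread -->
  <param name="extractorTimeout" value="100"/>
  <!-- queue up to 100 extraction jobs when all pool threads are busy -->
  <param name="extractorBackLogSize" value="100"/>
  <!-- parse binaries in a separate, low-priority JVM (Linux example value) -->
  <param name="forkJavaCommand" value="nice java -Xmx512m"/>
</SearchIndex>
```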

Search Term Identification

  • supportHighlighting: If set to true additional information is stored in the index to support highlighting using the rep:excerpt() function.

  • excerptProviderClass: The name of the class that implements ExcerptProvider and should be used for the rep:excerpt() function in a query. By default this is set to SearchHTMLExcerpt.

Document Parsing

  • textFilterClasses: Deprecated in Jackrabbit 2.x. With Jackrabbit 2.x Apache Tika was introduced as the default binaries parser. By default Jackrabbit comes with a default tika-config.xml file that contains the configuration for which mime-types to parse and extract.
  • tikaConfigPath: Set the location of the tika-config.xml. For example, ${wsp.home}/tika-config.xml.
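
For example, pointing a workspace at its own Tika configuration could look like this minimal sketch. It assumes that a tika-config.xml file has been placed in the workspace home directory.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- assumption: a custom tika-config.xml exists in the workspace home -->
  <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
</SearchIndex>
```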

Synonym Provider

This allows users to use generalized language-dependent synonyms but more importantly very domain-specific synonyms like abbreviations or product names.

  • synonymProviderClass: The name of a class that implements SynonymProvider. The default value is null which means no class set. Jackrabbit provides the PropertiesSynonymProvider which implements a synonym provider based on a properties file. The location of the properties file is specified by the synonymProviderConfigPath.

  • synonymProviderConfigPath: The path to the synonym provider configuration file. This path is interpreted relative to the path parameter. If there is a FileSystem element inside the SearchIndex element, then this path is interpreted relative to the root path of the FileSystem. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is null, which means no path is set.
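
A minimal sketch of a synonym setup using the PropertiesSynonymProvider is shown below. The file name synonyms.properties and its location are assumptions for illustration; adjust the path to wherever the properties file actually lives.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- synonym provider backed by a properties file -->
  <param name="synonymProviderClass" value="org.apache.jackrabbit.core.query.lucene.PropertiesSynonymProvider"/>
  <!-- hypothetical file name; each line maps a term to a synonym, e.g. ASF=Apache Software Foundation -->
  <param name="synonymProviderConfigPath" value="${wsp.home}/synonyms.properties"/>
</SearchIndex>
```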

Spell Checking

  • spellCheckerClass: The name of a class that implements SpellChecker. No known implementation exists.

Scoring

Similarity defines the components of Lucene scoring.

  • similarityClass: The name of a class that extends Similarity.

See http://wiki.apache.org/jackrabbit/Search

WorkspaceSecurity

Workspace security is handled by the MagnoliaAccessProvider. See Jackrabbit Repository Configuration File#WorkspaceSecurity.

<WorkspaceSecurity>
  <AccessControlProvider class="info.magnolia.cms.core.MagnoliaAccessProvider" />
</WorkspaceSecurity>