The in-house search is limited in performance and configurability, so the goal is to use Solr as the search engine.
This requires a way to take Magnolia content and index it into Solr. This is the first problem: it needs a flexible, configurable way of associating fields in Magnolia with fields in Solr.
All Solr features should be implemented as components that can be added individually to pages. Some features (for example faceting) may or may not be associated with an existing search.
This is the second problem: finding a flexible way of providing given Solr features to Magnolia users, optionally keeping the search context on a page or in a session, and limiting the number of requests.
Find a flexible and configurable way of indexing Magnolia content and non-Magnolia content (for instance another website) into the relevant fields in Solr.
Find a flexible way of offering given Solr features to Magnolia authors, optionally keeping the search context on a page or in a session, and minimizing requests to Solr.
Find a flexible way of extending or adding functionality to the module.
Find a flexible way to use Solr features for more than search-specific questions (e.g. suggestions, recommendations).
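The field association mentioned above can be illustrated with a minimal sketch. It assumes a simple one-to-one mapping from Magnolia property names to Solr field names; the class `FieldMapping`, its methods, and the field names `title_t` and `abstract_t` are hypothetical and not part of any existing module. Properties without a mapping are simply not indexed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Small demo: only the mapped properties end up in the Solr-style document. */
final class FieldMappingDemo {
    public static void main(String[] args) {
        FieldMapping mapping = new FieldMapping()
                .map("title", "title_t")
                .map("abstract", "abstract_t");

        Map<String, Object> page = new LinkedHashMap<>();
        page.put("title", "About us");
        page.put("abstract", "Who we are");
        page.put("internalNote", "not indexed");

        System.out.println(mapping.toSolrDocument(page));
    }
}

/** Hypothetical mapping from Magnolia property names to Solr field names. */
final class FieldMapping {
    private final Map<String, String> magnoliaToSolr = new LinkedHashMap<>();

    /** Registers one association and returns this mapping for chaining. */
    FieldMapping map(String magnoliaProperty, String solrField) {
        magnoliaToSolr.put(magnoliaProperty, solrField);
        return this;
    }

    /** Copies mapped, non-null properties into a new document keyed by Solr field name. */
    Map<String, Object> toSolrDocument(Map<String, Object> magnoliaProperties) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : magnoliaToSolr.entrySet()) {
            Object value = magnoliaProperties.get(entry.getKey());
            if (value != null) {
                doc.put(entry.getValue(), value);
            }
        }
        return doc;
    }
}
```

In a real setup the mapping would presumably be read from configuration rather than built in code, so that authors can change it without a deployment.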
As the superuser or administrator I would like to have a Solr instance set up by Magnolia when installing the relevant module(s).
As an author I want to configure easily which fields should be indexed (for instance, an extra icon in edit mode could show whether a field is indexed).
As an author I want to configure easily which pages should be indexed.
As an author I don't want to take care of synchronizing the external index with Magnolia content.
As an author I want to have the choice to index upon activation or in a workflow.
As an author I don't want to touch Solr's schemas and configuration.
As an author I need to index documents referenced in pages as well as in workspaces.
As an author I don't need to take care of multi-language content.
As an author I want to be able to add search input fields everywhere.
As an author I want to be able to add search results everywhere and easily define pagination and the maximum number of results.
As an author I need to be able to boost results (i.e. increase relevance).
As an author I need to restrict the results to a certain subdomain.
As an author I need to be able to define a search by facets and/or terms and/or date and/or price.
As an author I would like to provide similar pages for a certain page, a title or an abstract in a component.
As a user I would like to get search suggestions as I type or when no results are found.
As a user I would like to get the most relevant and recent results for my search.
We need a module to gather the content from Magnolia and/or another website and send it to Solr.
We need a module that will provide the glue between Magnolia components/classes and the SolrJ implementation.
Solr Search Provider:
The Solr implementation consists of a set of model classes, each with a corresponding SolrProvider class.
The FacetedSolrProvider class is based on the following architecture.
SearchService is an interface that defines the methods that should always be accessible. SolrSearch is the base class; it provides connectivity and admin methods.
SolrSearch provides the following methods but could be extended with util methods:
This way you can specify a provider easily
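The layering described above (interface, base class, concrete provider) can be sketched roughly as follows. Only the type names `SearchService`, `SolrSearch` and `FacetedSolrProvider` come from this page; every method, the constructor arguments and the example URL are assumptions made for illustration and are not the module's actual API. The sketch only builds request parameters (using Solr's standard `q`, `facet` and `facet.field` parameters) instead of talking to a Solr server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Small demo: code depends on the interface, the concrete provider is chosen at construction. */
final class ProviderSketchDemo {
    public static void main(String[] args) {
        // The URL is a placeholder, not a value taken from the module.
        SearchService provider = new FacetedSolrProvider(
                "http://localhost:8983/solr/example", Arrays.asList("category"));
        System.out.println(provider.requestParams("coffee"));
    }
}

/** Methods that every provider exposes (hypothetical signature). */
interface SearchService {
    Map<String, List<String>> requestParams(String query);
}

/** Base class holding what all providers share, here only the connection URL. */
abstract class SolrSearch implements SearchService {
    private final String solrUrl;

    protected SolrSearch(String solrUrl) {
        this.solrUrl = solrUrl;
    }

    String getSolrUrl() {
        return solrUrl;
    }
}

/** Provider that adds faceting parameters on top of the plain query. */
final class FacetedSolrProvider extends SolrSearch {
    private final List<String> facetFields;

    FacetedSolrProvider(String solrUrl, List<String> facetFields) {
        super(solrUrl);
        this.facetFields = new ArrayList<>(facetFields);
    }

    @Override
    public Map<String, List<String>> requestParams(String query) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        params.put("q", Collections.singletonList(query));
        params.put("facet", Collections.singletonList("true"));
        params.put("facet.field", new ArrayList<>(facetFields));
        return params;
    }
}
```

Because callers only see `SearchService`, a different provider (for example one for suggestions) could be swapped in without changing the component that uses it.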
The content indexer uses the SolrSearchIndexer class to push documents to Solr. Pulling documents (used for PDF, Word, etc.) is also possible and is done with Tika.
The ContentIndexer monitors the IndexerConfig and creates DataIndexers for each specified config using the DataIndexerFactory.
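The relationship between ContentIndexer, IndexerConfig and DataIndexerFactory can be sketched as below. The class names come from this page, but the fields of `IndexerConfig`, the `describe()` method and the explicit `reload` call are assumptions; in particular, the sketch replaces the actual monitoring of the configuration with a method that is called by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Small demo: two config entries result in two data indexers. */
final class ContentIndexerDemo {
    public static void main(String[] args) {
        ContentIndexer indexer = new ContentIndexer(new DataIndexerFactory());
        indexer.reload(Arrays.asList(
                new IndexerConfig("pages", "website", "/"),
                new IndexerConfig("docs", "dam", "/documents")));
        for (DataIndexer dataIndexer : indexer.getIndexers()) {
            System.out.println(dataIndexer.describe());
        }
    }
}

/** Hypothetical stand-in for one entry of the indexer configuration. */
final class IndexerConfig {
    final String name;
    final String workspace;
    final String rootPath;

    IndexerConfig(String name, String workspace, String rootPath) {
        this.name = name;
        this.workspace = workspace;
        this.rootPath = rootPath;
    }
}

/** One indexer per config entry; a real one would send documents to Solr. */
interface DataIndexer {
    String describe();
}

/** Creates a DataIndexer for a given config entry. */
final class DataIndexerFactory {
    DataIndexer create(IndexerConfig config) {
        return () -> "indexer '" + config.name + "' for "
                + config.workspace + ":" + config.rootPath;
    }
}

/** Rebuilds its DataIndexers whenever the configuration is reloaded. */
final class ContentIndexer {
    private final DataIndexerFactory factory;
    private final List<DataIndexer> indexers = new ArrayList<>();

    ContentIndexer(DataIndexerFactory factory) {
        this.factory = factory;
    }

    /** Stands in for the config monitoring: drops old indexers and creates new ones. */
    void reload(List<IndexerConfig> configs) {
        indexers.clear();
        for (IndexerConfig config : configs) {
            indexers.add(factory.create(config));
        }
    }

    List<DataIndexer> getIndexers() {
        return Collections.unmodifiableList(new ArrayList<>(indexers));
    }
}
```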
For an explanation of the fields in the config, one can check:
Thoughts, improvements and open questions
What can be improved:
The SolrSearchProvider module should maybe be restructured a bit.
Extra actions and views could be created to see the status of the synchronization between Magnolia content and Solr content.
The Solr admin interface could be integrated in an iframe app; this could be a quick win.
The pages query and event listener code in the content indexer module could maybe be decoupled a bit more and/or optimized.
Deletion on unpublishing