
Problem

In its current implementation, search has several issues that make it impractical, if not unusable, in most cases. (For the current search implementation, see the old concept page at Concept - Search and Sort for Content Apps.)

The main problem we want to look at here is well described by the following JIRA ticket

Proposal

We need to enable searching for multiple node types by issuing a single JCR query, which will also be used for "paginating" the results with the same mechanism used by the list view. In JCR SQL-2 this is feasible with the following syntax:

select * from [nt:base] where ([jcr:primaryType] = 'foo' or [jcr:primaryType] = 'bar' or [jcr:primaryType] = 'qux')

Each node type declared in a workbench may have the additional boolean property hideInList. If not hidden, the node type is used by default in the JCR SQL-2 query for both the list and search JCR containers.
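As a minimal sketch of how such a statement could be assembled, assume the workbench node types are available as a map from node type name to its hideInList flag. The class and method names below are hypothetical and not part of Magnolia's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NodeTypeQueryBuilder {

    /**
     * Builds one JCR SQL-2 statement covering every visible node type.
     * Keys are node type names, values are the hideInList flag (assumed input shape).
     */
    static String buildQuery(Map<String, Boolean> nodeTypes) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : nodeTypes.entrySet()) {
            if (!entry.getValue()) { // skip node types flagged hideInList = true
                clauses.add("[jcr:primaryType] = '" + entry.getKey() + "'");
            }
        }
        if (clauses.isEmpty()) {
            // Assumption: a query without any visible node type makes no sense here.
            throw new IllegalStateException("no visible node type configured");
        }
        return "select * from [nt:base] where (" + String.join(" or ", clauses) + ")";
    }

    public static void main(String[] args) {
        Map<String, Boolean> nodeTypes = new LinkedHashMap<>();
        nodeTypes.put("foo", false);
        nodeTypes.put("bar", false);
        nodeTypes.put("qux", true); // hidden, so it is left out of the statement
        System.out.println(buildQuery(nodeTypes));
    }
}
```

Node type names are concatenated as-is because they come from configuration rather than user input; anything user-supplied would need escaping first.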

Mixins 

If a mixin type is declared under /workbench/nodeTypes it will be included in the list and search views with the following syntax (we assume in the following example that a node type named baz is a mixin and is declared under /workbench/nodeTypes)

select * from [nt:base] where ([jcr:primaryType] = 'foo' or [jcr:mixinTypes] ='baz')

By default, mixins beginning with jcr:, nt:, mix: or rep: will be discarded. The same rule for hideInList applies here. See more about mixin types in the JCR 2.0 specification.
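The prefix rule and the mixed primary type/mixin clause could be sketched as follows. This is only an illustration under the assumption that primary types and mixins arrive as two separate lists; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MixinAwareQueryBuilder {

    // Prefixes of mixins that are discarded by default, as listed above.
    private static final List<String> RESERVED_PREFIXES =
            Arrays.asList("jcr:", "nt:", "mix:", "rep:");

    static boolean isReservedMixin(String name) {
        return RESERVED_PREFIXES.stream().anyMatch(name::startsWith);
    }

    /** Combines primary types and non-reserved mixins into one where clause. */
    static String buildQuery(List<String> primaryTypes, List<String> mixinTypes) {
        List<String> clauses = new ArrayList<>();
        for (String type : primaryTypes) {
            clauses.add("[jcr:primaryType] = '" + type + "'");
        }
        for (String mixin : mixinTypes) {
            if (!isReservedMixin(mixin)) {
                clauses.add("[jcr:mixinTypes] = '" + mixin + "'");
            }
        }
        return "select * from [nt:base] where (" + String.join(" or ", clauses) + ")";
    }

    public static void main(String[] args) {
        // mix:versionable is dropped because of its reserved prefix.
        System.out.println(buildQuery(
                Arrays.asList("foo"),
                Arrays.asList("baz", "mix:versionable")));
    }
}
```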

Subtypes

Subtypes of node types declared in a workbench will be added to the list and search views, provided their parents are defined as neither hidden (hideInList = true) nor strict (strict = true). By default, hideInList and strict are both false in node type definitions.

Performance

Since the new query syntax, with a where clause on several node types, may raise doubts about performance compared to a plain select on one node type, some basic manual tests were conducted comparing the two syntaxes (the Groovy script used to help run the tests is attached to this page).

  • Contacts workspace - 50000 nodes, all of type mgnl:contact

    Query | first run | subsequent runs
    select * from [mgnl:contact] | 50000 nodes returned in ~5000ms | 50000 nodes returned in ~1000ms
    select * from [nt:base] where ([jcr:primaryType] = 'mgnl:contact') | 50000 nodes returned in ~5000ms | 50000 nodes returned in ~1000ms
  • Contacts workspace - 50000 nodes: 20000 mgnl:contact, 10000 mgnl:folder, 10000 mgnl:content, 10000 mgnl:contentNode

    Query | first run | subsequent runs
    select * from [nt:base] | 70000 nodes returned in ~6800ms | 70000 nodes returned in ~1000ms
    select * from [nt:base] where ([jcr:primaryType] = 'mgnl:contact' or [jcr:primaryType] = 'mgnl:content' or [jcr:primaryType] = 'mgnl:contentNode' or [jcr:primaryType] = 'mgnl:folder') | 50000 nodes returned in ~6000ms | 50000 nodes returned in ~1000ms

    Query with limit 100 and offset 500 | first run | subsequent runs
    select * from [nt:base] where ([jcr:primaryType] = 'mgnl:contact' or [jcr:primaryType] = 'mgnl:content' or [jcr:primaryType] = 'mgnl:contentNode' or [jcr:primaryType] = 'mgnl:folder') | ~250ms | ~250ms

    The above results seem to show that there is no performance penalty in querying for multiple node types with the where clause syntax.

Limiting search results to relevant matches

One piece of negative feedback about the current search concerns the large number of matches a search sometimes produces, which are often completely unrelated to the term queried for. Consider the following example.

We want to search for all properties in config named admin. We know there is only one in a typical CE setup: the one directly under /server. However, searching for admin returns more than 4900 nodes, that is, all config nodes. That's because the current search implementation performs a full-text search on all properties of all nodes in the workspace, something like select * from [mgnl:contentNode] as t where contains(t.*, 'admin'). The problem in this case is that each node has a jcr:createdBy property, which is added automatically upon node creation and whose value is admin by default.

To solve this issue we need to exclude some well-known properties from the full-text search index. This is doable thanks to the Jackrabbit indexing configuration, see http://wiki.apache.org/jackrabbit/IndexingConfiguration.

"Per default the configured properties are fulltext indexed if they are of type STRING and included in the node scope index. That is, you can do a jcr:contains(., 'foo') and it will return nodes that have a string property that contains the word foo. This behaviour can be disabled: ..." .

Here is the indexing_configuration.xml file we will use

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.2.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:mgnl="http://www.magnolia.info/jcr/mgnl" xmlns:jcr="http://www.jcp.org/jcr/1.0">
  <!--
      A global, generic indexing configuration used for all workspaces in Magnolia.
      It excludes some well known properties from the node scope
      fulltext index.
  -->
  <index-rule nodeType="nt:base">
    <property isRegexp="true" nodeScopeIndex="false">mgnl:.*</property>
    <property isRegexp="true" nodeScopeIndex="false">jcr:.*</property>
    <property isRegexp="true">.*:.*</property>
  </index-rule>
</configuration>

The file will be placed under src/main/resources/info/magnolia/jackrabbit in Magnolia's core module, which means it will be available on the JVM classpath once the artifact is built. To make Jackrabbit aware of the indexing configuration, each provided repository configuration (jackrabbit-bundle-derby-search.xml and friends) will get an additional parameter called indexingConfiguration, as in the following excerpt:

...
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index" />
      <!-- SearchIndex will get the indexing configuration from the classpath, if not found in the workspace home -->
      <param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
...

The query eventually produced will be something like this

select * from [nt:base] as t where (([jcr:primaryType] = 'mgnl:content' or [jcr:primaryType] = 'mgnl:contentNode') and (localname() like 'admin%' or t.admin is not null or contains(t.*, 'admin')))

localname() like 'admin%' will search for nodes whose JCR name begins with "admin"

t.admin is not null will look for the existence of properties called admin

contains(t.*, 'admin') will perform a full-text search for admin within all properties across all nodes except those excluded by our indexing configuration, meaning, in our case, that jcr:createdBy won't be searched. The above query actually returns 7 matches (there are some nodes whose name contains the word admin), one of which is the node /server containing the admin property (as far as I could see, it is not possible to get the property directly rather than its parent node).

The three conditions above are needed as we don't know beforehand whether the user means to search for a node name, a property name or a value. It is worth noting that excluding certain properties from the full-text index still leaves them available to other types of query. For example, select * from [nt:base] as t where t.[jcr:createdBy] = 'admin' will find all the expected matches.
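A rough sketch of how the three conditions might be combined for an arbitrary search term is shown below. The class and method names are hypothetical. Two choices in it are assumptions rather than part of the proposal: single quotes in the term are doubled so that it stays a valid string literal, and the property-name condition is only added when the term is a plain word that can be used as a property name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SearchQueryBuilder {

    /** Builds the search statement for the given node types and user-supplied term. */
    static String buildSearchQuery(List<String> nodeTypes, String term) {
        // Assumption: doubling single quotes is enough to keep the literal well-formed.
        // LIKE wildcards and full-text operators in the term are deliberately not handled.
        String literal = term.replace("'", "''");

        List<String> typeClauses = new ArrayList<>();
        for (String type : nodeTypes) {
            typeClauses.add("[jcr:primaryType] = '" + type + "'");
        }

        List<String> conditions = new ArrayList<>();
        conditions.add("localname() like '" + literal + "%'"); // node name starts with term
        if (term.matches("[A-Za-z0-9_]+")) {
            conditions.add("t." + term + " is not null");      // a property with that name exists
        }
        conditions.add("contains(t.*, '" + literal + "')");    // full-text match on property values

        return "select * from [nt:base] as t where ((" + String.join(" or ", typeClauses)
                + ") and (" + String.join(" or ", conditions) + "))";
    }

    public static void main(String[] args) {
        // Reproduces the example statement shown above for the term "admin".
        System.out.println(buildSearchQuery(
                Arrays.asList("mgnl:content", "mgnl:contentNode"), "admin"));
    }
}
```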

 


6 Comments

  1. We've already discussed this, but I had another short look at it to make sure we don't miss anything.

    From the UX perspective, there are these key problems that prevent the current search from being truly useful and usable:

    • Search doesn't search wide enough, e.g. in the Configuration app. This is covered by searching for multiple node types as listed above.
    • Search sometimes searches too wide, e.g. in the Contacts app or in the nice "admin" example you gave.
    • There's also (less relevant here):
      • It's not obvious why a match was found (we eventually might want to visualize the actual match in a result to fix this).
      • When I have the list of results, it would be nice to have more info on a particular result without having to open it (we plan to show more info in previews).

    Obviously, the first two problems are conflicting. To me, it's not entirely clear how we can return a list of search results that is relevant to a user's query. Your example with searching for "admin" is a very good one; the same goes for searching for "subscribers": if I know that there's a folder named "subscribers" or a property named "admin", I would want that one to show up first.

    Is a solution maybe to run two queries?

    • Run a more restrictive query first, which looks for word and exact matches only.
    • Then, in a second step, run a second query (e.g. for "*admin*" and/or using a full-text search).

    In the UI, we could:

    • Show the results from the restrictive query first, while running the second one.
      • The second result set would be merged with the first one (sorted by "relevance") or even shown in a second list below ("Most relevant results"/"Other results").
    • Run only the first query and offer a link/button to the user to run the search again using the second, broader query.

    The second option could be quite powerful and useful, really.

    1. Currently the search results are simply sorted by node name ascending. I might have a look at JCR/Lucene configuration and see if the latter can be tweaked to boost up the score of certain search hits so that it works sensibly in most search situations but I have some doubts about it.

      I kind of like the idea of having a link/button to run a second, broader query, even though it sounds more like an "advanced search" feature to me. The problem with running two queries is that we don't know in advance what a user is searching for. Let's take the "admin" example again. If a user's intention is to search for the property "admin", then a select ... where t.admin is not null as the first query would certainly return what the user is expecting (i.e. all properties named "admin", or rather the nodes containing them). But what if the user is searching for properties containing the word "admin"? Or node names? I guess this kind of fine-grained search belongs more to an advanced search feature where a user chooses explicitly what they are looking for (node, property name or property value).

      1. Yes, we can't predict what the user really wants. She might want to have a narrow search now, a less narrow one next time.

        But I like the idea to come up with some sensible defaults, which support both cases. E.g. like this:

         


        We would return the exact matches first, then fill-in the broader ones (which take more time).

        This is not a spec, obviously, just a quick sketch. And it might be the end result of several intermediate steps we would take.

        1. Unfortunately, the problem again would be "what is the first, more accurate query to run"? The only way to know is to ask the user first, for instance by means of checkboxes, something like this


          Which, in this particular example, would require two separate queries (one for nodes the other for property names), unless we can accept to mix up results.

          1. Eventually we should offer options like these, but I would want them to be added to an advanced search feature rather than showing them by default. Unfortunately, we're still somewhat away from building an advanced search UI, so we could go with your proposal nevertheless. Maybe we can find a way to make the options less prominent.

            My take on such situations is usually to be brave and decide for good, sensible defaults, which capture at least 80% of all cases. The problem we have here is that we can't really take such an informed decision, since we lack the user research data to support it. We do have input from the Services team, though, and that should allow us to come up with a default solution, which is good enough for now.
            Given that, I'd still stick with an exact match first, on all node and property names, then a second query doing full-text search in properties values.

            Let's call a meeting and discuss possible solutions. I'll arrange that.

        2. Sorry, I had to edit your post in order to restore your initial mockup, as I clumsily tried to copy and edit it to get mine :)